Django Blog Project #9: Migrating Blogger posts with Beautiful Soup



This is my OLD blog. I've copied this post over to my NEW blog at:

http://www.saltycrane.com/blog/2008/07/django-blog-project-9-migrating-blogger/




In my last post, I talked about adding comments to my new sample blog application. This was about the last basic feature I needed to add before I started actually using it for real. Of course there are still a number of features I'd like to add, such as automatic syntax highlighting with Pygments, incorporating django-tagging, and some more interesting views, not to mention comment moderation. But I think those will have to wait. I want to start using my new blog for real sometime.

So for the past few days, I've been working on my Beautiful Soup screen scraper script to copy all my Blogger posts over to my new Django blog. Initial results came quickly (it's pretty cool to see such a huge data dump after only a few lines of Beautiful Soup'ing), but the details (especially with the comments) kind of slowed me down. I've finally got everything copied over to my satisfaction. Below is the script I used to do it. Note: I realize it's not pretty; it's just a one-time-use hack. But hopefully someone else doing the same thing might find it useful.

#!/usr/bin/env python

import datetime
import os
import re
import urllib2
from BeautifulSoup import BeautifulSoup
from myblogapp.models import Post, LegacyComment
from django.contrib.comments.models import FreeComment

URL = ''.join([
        'http://iwiwdsmi.blogspot.com/search?',
        'updated-min=2006-01-01T00%3A00%3A00-08%3A00&'
        'updated-max=2009-01-01T00%3A00%3A00-08%3A00&',
        'max-results=1000'
        ])
html = urllib2.urlopen(URL).read()
soup = BeautifulSoup(html)

# running totals for posts that contain highlighted code sections
hl_count = 0
hl_list = []

for post in soup.html.body.findAll('div', {'class': 'post'}):
    print
    print '--------------------------------------------------------------'

    # save the post title and permalink
    h3 = post.find('h3', {'class': 'post-title'})
    post_href = h3.find('a')['href']
    post_title = h3.find('a').string
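    # drop the extension with splitext; rstrip('.html') strips any trailing
    # '.', 'h', 't', 'm', 'l' characters and can eat the end of the slug
    post_slug = os.path.splitext(os.path.basename(post_href))[0]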
    print post_slug
    print post_href
    print post_title

    # save the post body
    div = post.find('div', {'class': 'post-body'})
    [toremove.extract() for toremove in div.findAll('script')]
    [toremove.extract() for toremove in div.findAll('span', {'id': 'showlink'})]
    [toremove.extract() for toremove in div.findAll('div', {'style': 'clear: both;'})]
    [toremove.parent.extract() for toremove in div.findAll(text='#fullpost{display:none;}')]
    post_body = ''.join([str(item)
                         for item in div.contents
                         ]).rstrip()
    post_body = re.sub(r"iwiwdsmi\.blogspot\.com/(\d{4}/\d{2}/[\w\-]+)\.html", 
                       r"www.saltycrane.com/blog/\1/", 
                       post_body)

    # count number of highlighted code sections 
    highlight = div.findAll('div', {'class': 'highlight'})
    if highlight:
        hl_count += len(highlight)
        hl_list.append(post_title)

    # save the timestamp
    a = post.find('a', {'class': 'timestamp-link'})
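    try:
        post_timestamp = a.string
    except AttributeError:
        # no timestamp link: fall back to the year and month in the permalink
        match = re.search(r"\.com/(\d{4})/(\d{2})/", post_href)
        if not match:
            raise ValueError("no date found for %s" % post_href)
        post_timestamp = "%s/01/%s 11:11:11 AM" % (match.group(2), match.group(1))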
    print post_timestamp

    # save the tags (this is ugly, i know)
    if 'error' in post_title.lower():
        post_tags = ['error']
    else:
        post_tags = []
    span = post.find('span', {'class': 'post-labels'})
    if span:
        a = span.findAll('a', {'rel': 'tag'})
    else:
        a = post.findAll('a', {'rel': 'tag'})
    post_tags = ' '.join([tag.string for tag in a] + post_tags)
    if not post_tags:
        post_tags = 'untagged'
    print post_tags

    # add Post object to new blog
    if True:
        p = Post()
        p.title = post_title
        p.body = post_body
        p.date_created = datetime.datetime.strptime(post_timestamp, "%m/%d/%Y %I:%M:%S %p")
        p.date_modified = p.date_created
        p.tags = post_tags
        p.slug = post_slug
        p.save()

    # check if there are comments
    a = post.find('a', {'class': 'comment-link'})
    if a:
        comm_string = a.string.strip()
    else:
        comm_string = "0"
    if comm_string[0] != "0":
        print
        print "COMMENTS:"

        # get the page with comments
        html_single = urllib2.urlopen(post_href).read()
        soup_single = BeautifulSoup(html_single)

        # get comments
        comments = soup_single.html.body.find('div', {'class': 'comments'})
        cauth_list = comments.findAll('dt')
        cbody_list = comments.findAll('dd', {'class': 'comment-body'})
        cdate_list = comments.findAll('span', {'class': 'comment-timestamp'})

        if not len(cauth_list)==len(cbody_list)==len(cdate_list):
            raise ValueError("didn't get all comment data")

        for auth, body, date in zip(cauth_list, cbody_list, cdate_list):
            
            # create comment in database
            lc = LegacyComment()
            lc.body = str(body.p)

            # find author
            lc.author = "Anonymous"
            auth_a = auth.findAll('a')[-1]
            auth_no_a = auth.contents[2]
            if auth_a.string:
                lc.author = auth_a.string
            elif auth_no_a:
                match = re.search(r"\s*([\w\s]*\w)\s+said", str(auth_no_a))
                if match:
                    lc.author = match.group(1)
            print lc.author

            # find website
            try:
                lc.website = auth_a['href']
            except KeyError:
                lc.website = ''
            print lc.website

            # other info
            lc.date_created = datetime.datetime.strptime(
                date.a.string.strip(), "%m/%d/%Y %I:%M %p")
            print lc.date_created
            lc.date_modified = lc.date_created
            lc.post_id = p.id
            lc.save()
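
For anyone adapting the script, here is a minimal sketch of its three string-handling steps (slug, link rewriting, timestamp fallback) pulled out into standalone functions. The function names and the example permalink are hypothetical and not part of the script above, and the sketch is written for a current Python 3 using only the standard library, so it is a starting point rather than a drop-in replacement.

```python
import datetime
import os
import re


def slug_from_href(href):
    """Return the permalink's file name without its extension."""
    # splitext removes exactly one trailing extension; str.rstrip('.html')
    # strips any trailing run of the characters '.', 'h', 't', 'm', 'l'.
    return os.path.splitext(os.path.basename(href))[0]


def rewrite_links(body):
    """Point old Blogger permalinks at the matching new-blog URLs."""
    return re.sub(r"iwiwdsmi\.blogspot\.com/(\d{4}/\d{2}/[\w\-]+)\.html",
                  r"www.saltycrane.com/blog/\1/",
                  body)


def parse_timestamp(text, href):
    """Parse a Blogger timestamp, falling back to the permalink's year/month."""
    if text:
        return datetime.datetime.strptime(text, "%m/%d/%Y %I:%M:%S %p")
    match = re.search(r"\.com/(\d{4})/(\d{2})/", href)
    if match is None:
        raise ValueError("no date in %r" % href)
    # same placeholder day and time as the script: the 1st at 11:11:11 AM
    return datetime.datetime(int(match.group(1)), int(match.group(2)),
                             1, 11, 11, 11)


# hypothetical permalink whose slug happens to end in "html"
href = "http://iwiwdsmi.blogspot.com/2008/07/notes-on-html.html"
print(slug_from_href(href))
print(rewrite_links('<a href="%s">link</a>' % href))
print(parse_timestamp(None, href))
```

The made-up slug ends in "html" on purpose: that is the case where character stripping and extension removal give different results.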

I also made some changes to my Django blog code as I migrated my Blogger posts. The main addition was a LegacyComment model along with the associated views and templates. My Blogger comments consisted of HTML markup, but I didn't want to allow arbitrary HTML in my new comments for fear of cross-site scripting. So I separated my legacy Blogger comments from my new Django site comments.
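
To make the cross-site scripting concern concrete, here is a small sketch of what escaping does to untrusted comment text. It uses `html.escape` from the Python 3 standard library as a stand-in for Django's own escaping, and the comment string is made up for illustration.

```python
import html

# a made-up comment containing markup that should not run in a reader's browser
comment = '<script>alert("hi")</script> Nice post!'

escaped = html.escape(comment)
print(escaped)
# prints: &lt;script&gt;alert(&quot;hi&quot;)&lt;/script&gt; Nice post!
```

Escaped this way, the markup shows up as literal text instead of being interpreted. The legacy Blogger comments need their HTML rendered as-is, which is why they are kept apart from new comments.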



models.py

Here are my model changes. I added a LegacyComment class which contains the pertinent comment attributes and a ForeignKey to the post that it belongs to. I also added an lc_count (for legacy comment count) field to the Post class, which stores the number of legacy comments for the post. It is incremented by the save() method in the LegacyComment class every time a comment is saved. Hmmm, I just realized the count will be wrong if I ever edit these comments, since re-saving an existing comment increments it again. Well, since these are legacy comments, hopefully I won't have to edit them.

~/src/django/myblogsite/myblogapp/models.py:
import re
from django.db import models

class Post(models.Model):
    title = models.CharField(maxlength=200)
    slug = models.SlugField(maxlength=100)
    date_created = models.DateTimeField() #auto_now_add=True)
    date_modified = models.DateTimeField()
    tags = models.CharField(maxlength=200)
    body = models.TextField()
    body_html = models.TextField(editable=False, blank=True)
    lc_count = models.IntegerField(default=0, editable=False)

    def get_tag_list(self):
        return re.split(" ", self.tags)

    def get_absolute_url(self):
        return "/blog/%d/%02d/%s/" % (self.date_created.year,
                                      self.date_created.month,
                                      self.slug)

    def __str__(self):
        return self.title

    class Meta:
        ordering = ["-date_created"]

    class Admin:
        pass

class LegacyComment(models.Model):
    author = models.CharField(maxlength=60)
    website = models.URLField(core=False)
    date_created = models.DateTimeField()
    date_modified = models.DateTimeField()
    body = models.TextField()
    post = models.ForeignKey(Post)

    def save(self):
        p = Post.objects.get(id=self.post.id)
        p.lc_count += 1
        p.save()
        super(LegacyComment, self).save()

    class Meta:
        ordering = ["date_created"]

    class Admin:
        pass
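
The caveat about the count going wrong on edits can be shown outside Django. The sketch below uses plain-Python stand-ins (`PostStub` and `CommentStub` are hypothetical names, not the models above) and relies on one assumption carried over from Django: an object's `id` is `None` until it has been saved for the first time. Incrementing only in that case keeps later edits from inflating the count.

```python
class PostStub(object):
    """Stand-in for the Post model; only the counter matters here."""
    def __init__(self):
        self.lc_count = 0


class CommentStub(object):
    """Stand-in for LegacyComment; id stays None until the first save."""
    def __init__(self, post):
        self.post = post
        self.id = None

    def save(self):
        if self.id is None:
            # first save: this is a new comment, so count it
            self.post.lc_count += 1
            self.id = 1  # a real database would assign the id
        # later saves are edits and leave the count alone


post = PostStub()
comment = CommentStub(post)
comment.save()  # creation
comment.save()  # edit
print(post.lc_count)
```

In the real model the same check would go at the top of `LegacyComment.save()`, before the call to the parent class's `save()`.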


views.py

Here is an excerpt from my views.py file showing the changes:

~/src/django/myblogsite/myblogapp/views.py:
import re
from datetime import datetime
from django.shortcuts import render_to_response
from myblogsite.myblogapp.models import Post, LegacyComment

MONTH_NAMES = ('', 'January', 'February', 'March', 'April', 'May', 'June', 'July',
               'August', 'September', 'October', 'November', 'December')
MAIN_TITLE = "Sofeng's Blog 0.0.7"

def frontpage(request):
    posts, pagedata = init()
    posts = posts[:5]
    pagedata.update({'post_list': posts,
                     'subtitle': '',})
    return render_to_response('listpage.html', pagedata)

def singlepost(request, year, month, slug2):
    posts, pagedata = init()
    post = posts.get(date_created__year=year,
                            date_created__month=int(month),
                            slug=slug2,)
    legacy_comments = LegacyComment.objects.filter(post=post.id)
    pagedata.update({'post': post,
                     'lc_list': legacy_comments,})
    return render_to_response('singlepost.html', pagedata)


Templates

In the list page template, I used the truncatewords_html template filter to show a 50-word post summary on the list pages instead of the full post. I also added the legacy comment count to the Django free comment count to display the total number of comments.

Excerpt from ~/src/django/myblogsite/templates/listpage.html:
{% block main %}
  <br>
  {% for post in post_list %}
    <h4><a href="/blog/{{ post.date_created|date:"Y/m" }}/{{ post.slug }}/">
        {{ post.title }}</a>
    </h4>
    {{ post.body|truncatewords_html:"50" }}
    <a href="{{ post.get_absolute_url }}">Read more...</a><br>
    <br>
    <hr>
    <div class="post_footer">
      {% ifnotequal post.date_modified.date post.date_created.date %}
        Last modified: {{ post.date_modified.date }}<br>
      {% endifnotequal %}
      Date created: {{ post.date_created.date }}<br>
      Tags: 
      {% for tag in post.get_tag_list %}
        <a href="/blog/tag/{{ tag }}/">{{ tag }}</a>{% if not forloop.last %}, {% endif %}
      {% endfor %}
      <br>

      {% get_free_comment_count for myblogapp.post post.id as comment_count %}
      <a href="{{ post.get_absolute_url }}#comments">
        {{ comment_count|add:post.lc_count }} 
        Comment{{ comment_count|add:post.lc_count|pluralize}}</a>

    </div>
    <br>
  {% endfor %}
{% endblock %}

In the single post template, I added the display of the legacy comments in addition to the Django free comments.

Excerpt from ~/src/django/myblogsite/templates/singlepost.html:
  <a name="comments"></a>
  {% if lc_list %}
    <h4>{{ lc_list|length }} Legacy Comment{{lc_list|length|pluralize}}</h4>
  {% endif %}
  {% for legacy_comment in lc_list %}
    <br>
    <a name="lc{{ legacy_comment.id }}" href="#lc{{ legacy_comment.id }}">
      #{{ forloop.counter }}</a>
    {% if legacy_comment.website %}
      <a href="{{ legacy_comment.website }}">
        <b>{{ legacy_comment.author|escape }}</b></a> 
    {% else %}
      <b>{{ legacy_comment.author|escape }}</b>
    {% endif %}
    commented,
    on {{ legacy_comment.date_created|date:"F j, Y" }} 
    at {{ legacy_comment.date_created|date:"P" }}:
    {{ legacy_comment.body }}
  {% endfor %}
  <br>

That's it. Hopefully, I can start using my new blog soon. Please browse around on the new Django site and let me know if you run across any problems. When everything looks to be OK, I'll start posting only on my new Django site.

Here is a screenshot of version 0.0.8:


The live site can be viewed at: http://saltycrane.com/blog


Related posts:
   Django Blog Project #1: Creating a basic blog
   Django Blog Project #2: Deploying at Webfaction
   Django Blog Project #3: Using CSS and Template Inheritance
   Django Blog Project #4: Adding post metadata
   Django Blog Project #5: YUI CSS and serving static media
   Django Blog Project #6: Creating standard blog views
   Django Blog Project #7: Adding a simple Atom feed
   Django Blog Project #8: Adding basic comment functionality

2 comments:

Bulkan-Savun Evcimen said...

Why not use the Blogger API ?

I migrated from Blogger to my own blog that i wrote in Django and i just wrote a script that used the Blogger API to get all my posts and save them as Post objects.

sofeng said...

bulkan-savun evcimen,
wow, i haven't seen that Blogger API before. it seems i've done things the hard way again. this may be useful for me to update my Blogger posts to point to my new Django blog. do you mind sharing your script? thanks for the tip. btw, you have a great looking Django blog!
