Django Blog Project #9: Migrating Blogger posts with Beautiful Soup
This is my OLD blog. I've copied this post over to my NEW blog at:
http://www.saltycrane.com/blog/2008/07/django-blog-project-9-migrating-blogger/
In my last post, I talked about adding comments to my new sample blog application. This was about the last basic feature I needed to add before I started actually using it for real. Of course, there are still a number of features I'd like to add, such as automatic syntax highlighting with Pygments, django-tagging, and some more interesting views, not to mention comment moderation. But I think those will have to wait. I want to start using my new blog for real sometime.
So for the past few days, I've been working on a Beautiful Soup screen scraper script to copy all my Blogger posts over to my new Django blog. Initial results came quickly (it's pretty cool to see such a huge data dump after only a few lines of Beautiful Soup'ing), but the details, especially the comments, slowed me down. I've finally got everything copied over to my satisfaction. Below is the script I used to do it. I realize it's not pretty; it's just a one-time-use hack. But hopefully someone else doing the same thing will find it useful.
#!/usr/bin/env python

import datetime
import os
import re
import urllib2

from BeautifulSoup import BeautifulSoup
from myblogapp.models import Post, LegacyComment
from django.contrib.comments.models import FreeComment

URL = ''.join([
    'http://iwiwdsmi.blogspot.com/search?',
    'updated-min=2006-01-01T00%3A00%3A00-08%3A00&',
    'updated-max=2009-01-01T00%3A00%3A00-08%3A00&',
    'max-results=1000'
    ])
html = urllib2.urlopen(URL).read()
soup = BeautifulSoup(html)

# running totals for posts that contain highlighted code
hl_count = 0
hl_list = []

for post in soup.html.body.findAll('div', {'class': 'post'}):
    print
    print '--------------------------------------------------------------'

    # save the post title and permalink
    h3 = post.find('h3', {'class': 'post-title'})
    post_href = h3.find('a')['href']
    post_title = h3.find('a').string
    # splitext drops only the ".html" suffix; rstrip('.html') would also
    # eat trailing 'h', 't', 'm', 'l' characters from the slug itself
    post_slug = os.path.splitext(os.path.basename(post_href))[0]
    print post_slug
    print post_href
    print post_title

    # save the post body
    div = post.find('div', {'class': 'post-body'})
    [toremove.extract() for toremove in div.findAll('script')]
    [toremove.extract() for toremove in div.findAll('span', {'id': 'showlink'})]
    [toremove.extract() for toremove in div.findAll('div', {'style': 'clear: both;'})]
    [toremove.parent.extract() for toremove in div.findAll(text='#fullpost{display:none;}')]
    post_body = ''.join([str(item)
                         for item in div.contents
                         ]).rstrip()
    # point links to other posts at the new site
    post_body = re.sub(r"iwiwdsmi\.blogspot\.com/(\d{4}/\d{2}/[\w\-]+)\.html",
                       r"www.saltycrane.com/blog/\1/",
                       post_body)

    # count number of highlighted code sections
    highlight = div.findAll('div', {'class': 'highlight'})
    if highlight:
        hl_count += len(highlight)
        hl_list.append(post_title)

    # save the timestamp
    a = post.find('a', {'class': 'timestamp-link'})
    try:
        post_timestamp = a.string
    except AttributeError:
        # no timestamp link: fall back to the year and month in the permalink
        match = re.search(r"\.com/(\d{4})/(\d{2})/", post_href)
        if match:
            year = match.group(1)
            month = match.group(2)
            post_timestamp = "%s/01/%s 11:11:11 AM" % (month, year)
    print post_timestamp

    # save the tags (this is ugly, i know)
    if 'error' in post_title.lower():
        post_tags = ['error']
    else:
        post_tags = []
    span = post.find('span', {'class': 'post-labels'})
    if span:
        a = span.findAll('a', {'rel': 'tag'})
    else:
        a = post.findAll('a', {'rel': 'tag'})
    post_tags = ' '.join([tag.string for tag in a] + post_tags)
    if not post_tags:
        post_tags = 'untagged'
    print post_tags

    # add Post object to new blog
    if True:
        p = Post()
        p.title = post_title
        p.body = post_body
        p.date_created = datetime.datetime.strptime(post_timestamp, "%m/%d/%Y %I:%M:%S %p")
        p.date_modified = p.date_created
        p.tags = post_tags
        p.slug = post_slug
        p.save()

    # check if there are comments
    a = post.find('a', {'class': 'comment-link'})
    if a:
        comm_string = a.string.strip()
    else:
        comm_string = "0"
    if comm_string[0] != "0":
        print
        print "COMMENTS:"
        # get the page with comments
        html_single = urllib2.urlopen(post_href).read()
        soup_single = BeautifulSoup(html_single)

        # get comments
        comments = soup_single.html.body.find('div', {'class': 'comments'})
        cauth_list = comments.findAll('dt')
        cbody_list = comments.findAll('dd', {'class': 'comment-body'})
        cdate_list = comments.findAll('span', {'class': 'comment-timestamp'})
        if not len(cauth_list)==len(cbody_list)==len(cdate_list):
            raise ValueError("didn't get all comment data")

        for auth, body, date in zip(cauth_list, cbody_list, cdate_list):
            # create comment in database
            lc = LegacyComment()
            lc.body = str(body.p)

            # find author
            lc.author = "Anonymous"
            auth_a = auth.findAll('a')[-1]
            auth_no_a = auth.contents[2]
            if auth_a.string:
                lc.author = auth_a.string
            elif auth_no_a:
                match = re.search(r"\s*([\w\s]*\w)\s+said", str(auth_no_a))
                if match:
                    lc.author = match.group(1)
            print lc.author

            # find website
            try:
                lc.website = auth_a['href']
            except KeyError:
                lc.website = ''
            print lc.website

            # other info
            lc.date_created = datetime.datetime.strptime(
                date.a.string.strip(), "%m/%d/%Y %I:%M %p")
            print lc.date_created
            lc.date_modified = lc.date_created
            lc.post_id = p.id
            lc.save()
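A few of the string-handling steps in the script are easier to see in isolation. The sketch below is not part of the migration script; it is a small standalone illustration (written so it runs on Python 3) of deriving a slug from a permalink, rewriting internal links, and parsing a Blogger timestamp. The helper names and the sample permalink are made up for the example. Note that `str.rstrip('.html')` strips any trailing run of the characters `.`, `h`, `t`, `m`, `l` rather than the literal suffix, which is why the sketch uses `os.path.splitext` for the slug.

```python
import os
import re
from datetime import datetime


def slug_from_permalink(href):
    """Return the last path component of a permalink without its extension."""
    # splitext removes only the final ".html" extension.
    return os.path.splitext(os.path.basename(href))[0]


def rewrite_internal_links(body):
    """Point old Blogger permalinks inside a post body at the new site."""
    return re.sub(
        r"iwiwdsmi\.blogspot\.com/(\d{4}/\d{2}/[\w\-]+)\.html",
        r"www.saltycrane.com/blog/\1/",
        body)


def parse_post_timestamp(text):
    """Parse a timestamp in the month/day/year 12-hour format the script expects."""
    return datetime.strptime(text, "%m/%d/%Y %I:%M:%S %p")


# Hypothetical permalink, invented for illustration only.
href = "http://iwiwdsmi.blogspot.com/2008/07/notes-on-html.html"

print(slug_from_permalink(href))
print(os.path.basename(href).rstrip('.html'))  # shows the rstrip pitfall
print(rewrite_internal_links('<a href="%s">link</a>' % href))
print(parse_post_timestamp("07/21/2008 10:30:00 PM"))
```

For this made-up permalink, `rstrip('.html')` cuts the slug down to `notes-on-`, while `splitext` keeps it intact.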
I also made some changes to my Django blog code as I migrated my Blogger posts. The main addition was a LegacyComment model along with the associated views and templates. My Blogger comments consisted of HTML markup, but I didn't want to allow arbitrary HTML in my new comments for fear of cross-site scripting. So I separated my legacy Blogger comments from my new Django site comments.
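To make the cross-site scripting concern concrete, here is a minimal sketch (Python 3, standard library only, and not part of the blog code) of what escaping does to markup in an untrusted comment. The function name `render_untrusted_comment` is hypothetical. Trusted legacy comments can be displayed with their markup intact, while anything submitted by visitors would be escaped like this before display.

```python
import html


def render_untrusted_comment(text):
    """Escape a comment so its markup is displayed as text, not interpreted."""
    return html.escape(text)


# A comment body that would run a script if inserted into a page unescaped.
malicious = '<script>alert("gotcha")</script>'
print(render_untrusted_comment(malicious))
```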
models.py
Here are my model changes. I added a LegacyComment class which contains pertinent comment attributes and a ForeignKey to the post that it belongs to. I also added a lc_count (for legacy comment count) field to the Post class which stores the number of comments for the post. It is updated by the save() method in the LegacyComment class every time a comment is saved. Hmmm, I just realized the count will be wrong if I ever edit these comments. Well, since these are legacy comments, hopefully I won't have to edit them.
~/src/django/myblogsite/myblogapp/models.py:
import re
from django.db import models

class Post(models.Model):
    title = models.CharField(maxlength=200)
    slug = models.SlugField(maxlength=100)
    date_created = models.DateTimeField() #auto_now_add=True)
    date_modified = models.DateTimeField()
    tags = models.CharField(maxlength=200)
    body = models.TextField()
    body_html = models.TextField(editable=False, blank=True)
    lc_count = models.IntegerField(default=0, editable=False)

    def get_tag_list(self):
        return re.split(" ", self.tags)

    def get_absolute_url(self):
        return "/blog/%d/%02d/%s/" % (self.date_created.year,
                                      self.date_created.month,
                                      self.slug)

    def __str__(self):
        return self.title

    class Meta:
        ordering = ["-date_created"]

    class Admin:
        pass

class LegacyComment(models.Model):
    author = models.CharField(maxlength=60)
    website = models.URLField(core=False)
    date_created = models.DateTimeField()
    date_modified = models.DateTimeField()
    body = models.TextField()
    post = models.ForeignKey(Post)

    def save(self):
        p = Post.objects.get(id=self.post.id)
        p.lc_count += 1
        p.save()
        super(LegacyComment, self).save()

    class Meta:
        ordering = ["date_created"]

    class Admin:
        pass
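The miscount on edits comes from incrementing lc_count on every save. One possible way around it is to increment only the first time a comment is saved. The sketch below uses plain Python stand-ins (the `FakePost` and `FakeLegacyComment` classes are hypothetical and are not Django models) just to show the idea. In a real model, the analogous check would presumably be whether the instance already has an id before the parent save() is called, but that is an assumption and not something I have tried here.

```python
class FakePost(object):
    """Stand-in for the Post model; only tracks the legacy comment count."""

    def __init__(self):
        self.lc_count = 0


class FakeLegacyComment(object):
    """Stand-in for LegacyComment whose save() counts only new comments."""

    _next_id = 1  # simulates the database handing out primary keys

    def __init__(self, post, body):
        self.id = None  # no primary key until the first save
        self.post = post
        self.body = body

    def save(self):
        if self.id is None:
            # First save: assign an id and count the comment exactly once.
            self.id = FakeLegacyComment._next_id
            FakeLegacyComment._next_id += 1
            self.post.lc_count += 1
        # Later saves (edits) leave the count untouched.


post = FakePost()
comment = FakeLegacyComment(post, "first!")
comment.save()
comment.body = "first! (edited)"
comment.save()
print(post.lc_count)
```

Saving the same comment twice leaves the count at one, since only the first save increments it.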
views.py
Here is an excerpt from my views.py file showing the changes:
~/src/django/myblogsite/myblogapp/views.py:
import re
from datetime import datetime
from django.shortcuts import render_to_response
from myblogsite.myblogapp.models import Post, LegacyComment

MONTH_NAMES = ('', 'January', 'February', 'March', 'April', 'May', 'June', 'July',
               'August', 'September', 'October', 'November', 'December')
MAIN_TITLE = "Sofeng's Blog 0.0.7"

def frontpage(request):
    posts, pagedata = init()
    posts = posts[:5]
    pagedata.update({'post_list': posts,
                     'subtitle': '',})
    return render_to_response('listpage.html', pagedata)

def singlepost(request, year, month, slug2):
    posts, pagedata = init()
    post = posts.get(date_created__year=year,
                     date_created__month=int(month),
                     slug=slug2,)
    legacy_comments = LegacyComment.objects.filter(post=post.id)
    pagedata.update({'post': post,
                     'lc_list': legacy_comments,})
    return render_to_response('singlepost.html', pagedata)
Templates
In the list page template I used the truncatewords_html template filter to show a 50 word post summary on the list pages instead of the full post. I also added the legacy comment count to the Django free comment count to display the total number of comments.
~/src/django/myblogsite/templates/listpage.html:
{% block main %}
  <br>
  {% for post in post_list %}
    <h4><a href="/blog/{{ post.date_created|date:"Y/m" }}/{{ post.slug }}/">
        {{ post.title }}</a>
    </h4>
    {{ post.body|truncatewords_html:"50" }}
    <a href="{{ post.get_absolute_url }}">Read more...</a><br>
    <br>
    <hr>
    <div class="post_footer">
      {% ifnotequal post.date_modified.date post.date_created.date %}
        Last modified: {{ post.date_modified.date }}<br>
      {% endifnotequal %}
      Date created: {{ post.date_created.date }}<br>
      Tags:
      {% for tag in post.get_tag_list %}
        <a href="/blog/tag/{{ tag }}/">{{ tag }}</a>{% if not forloop.last %}, {% endif %}
      {% endfor %}
      <br>
      {% get_free_comment_count for myblogapp.post post.id as comment_count %}
      <a href="{{ post.get_absolute_url }}#comments">
        {{ comment_count|add:post.lc_count }}
        Comment{{ comment_count|add:post.lc_count|pluralize}}</a>
    </div>
    <br>
  {% endfor %}
{% endblock %}
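The comment link in the template adds the two counts together and pluralizes the label. As a rough Python equivalent of that arithmetic (a sketch only; `comment_label` is a made-up helper, not something in the blog code):

```python
def comment_label(free_count, legacy_count):
    """Build the 'N Comment(s)' label from the two comment counts."""
    total = free_count + legacy_count
    # Like the pluralize filter, add an "s" unless the total is exactly 1.
    suffix = "" if total == 1 else "s"
    return "%d Comment%s" % (total, suffix)


print(comment_label(0, 1))
print(comment_label(2, 3))
```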
In the single post template, I added the display of the legacy comments in addition to the Django free comments.
Excerpt from ~/src/django/myblogsite/templates/singlepost.html:
<a name="comments"></a>
{% if lc_list %}
<h4>{{ lc_list|length }} Legacy Comment{{lc_list|length|pluralize}}</h4>
{% endif %}
{% for legacy_comment in lc_list %}
<br>
<a name="lc{{ legacy_comment.id }}" href="#lc{{ legacy_comment.id }}">
#{{ forloop.counter }}</a>
{% if legacy_comment.website %}
<a href="{{ legacy_comment.website }}">
<b>{{ legacy_comment.author|escape }}</b></a>
{% else %}
<b>{{ legacy_comment.author|escape }}</b>
{% endif %}
commented,
on {{ legacy_comment.date_created|date:"F j, Y" }}
at {{ legacy_comment.date_created|date:"P" }}:
{{ legacy_comment.body }}
{% endfor %}
<br>
That's it. Hopefully, I can start using my new blog soon. Please browse around on the new Django site and let me know if you run across any problems. When everything looks to be OK, I'll start posting only on my new Django site.
Here is a screenshot of version 0.0.8:
The live site can be viewed at: http://saltycrane.com/blog
Related posts:
Django Blog Project #1: Creating a basic blog
Django Blog Project #2: Deploying at Webfaction
Django Blog Project #3: Using CSS and Template Inheritance
Django Blog Project #4: Adding post metadata
Django Blog Project #5: YUI CSS and serving static media
Django Blog Project #6: Creating standard blog views
Django Blog Project #7: Adding a simple Atom feed
Django Blog Project #8: Adding basic comment functionality
2 comments:
Why not use the Blogger API ?
I migrated from Blogger to my own blog that i wrote in Django and i just wrote a script that used the Blogger API to get all my posts and save them as Post objects.
bulkan-savun evcimen,
wow, i haven't seen that Blogger API before. it seems i've done things the hard way again. this may be useful for me to update my Blogger posts to point to my new Django blog. do you mind sharing your script? thanks for the tip. btw, you have a great looking Django blog!