How I moved 350,000 blogposts from Tumblr to WordPress
I had seven blogs on Tumblr which aggregate news. Using a technique I described earlier, they take RSS feeds from over 1,000 carefully selected websites and blogs, filter them, clean them up, and feed them into the different Tumblr blogs. I used the unique feature built into Tumblr to convert RSS feeds into posts. All automatically. Pretty neat. Until it stopped working…
Two months ago Tumblr’s autoimport feature started to hiccup. Tumblr support said “We are aware and working on it”, but could not give me any estimate of when it would be fixed.
A month later, still nothing. So what to do? I rely on this Tumblr feature for my blogs. In two years, those blogs collected 350,000 news articles: quite a resource library, which I did not want to give up.
So I decided to use the Christmas holidays to migrate these blogs from Tumblr onto WordPress, on my HostGator VPS server. It was an interesting process, involving many different techniques and debugging efforts:
1. How to export a Tumblr blog into WordPress
I pretty much described the process and the technique to export a Tumblr blog in an earlier post: Using Tumblr2WordPress, a neat PHP program by Ben Ward. It uses the Tumblr API to export blogposts, and to create an .XML file, which I could import into WordPress. An API that Tumblr disables every US afternoon and evening, by the way.
While Ben’s program works well to export smaller Tumblr blogs, I had biiiiig blogs, so I had to adapt the PHP code. As the code is public domain and available on Github, I downloaded it and installed it on my server. I changed the PHP parameters to allocate a massive chunk of memory, and allow the export routine to run longer than a standard PHP program. Tip: change the parameters only for that routine, not for your whole server!
Update March 1, 2010:
Ben’s source code is still available, but the executable program is no longer available on the link I provided. You can still run Tumblr-to-WordPress routines based on the same engine from Tumblr2WP or Tumble2WordPress. (With thanks to Parneix and Aaron for the updates.)
I also patched Ben’s original code to work around a smaller problem I discovered on “published dates” and “categories”. Pretty easy, even for a PHP novice like me, as the code is well documented.
As WordPress can only import .XML files smaller than 8 MB, I split up the first exported blog manually into smaller chunks. That took me two hours for the smallest of the seven blogs. So I decided once again to delve into the code, and wrote a small patch that allowed me to export 5,000 blog posts at a time. Each export file was now smaller than 8 MB.
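The batching idea can be sketched in a few lines. This is not Ben's PHP code or my patch to it, just a minimal Python illustration of cutting a blog into export windows of 5,000 posts. The function names, the file naming scheme and the post count in the example are all made up for illustration.

```python
from typing import Iterator, Tuple

BATCH_SIZE = 5000  # posts per export file, as described above


def export_batches(total_posts: int, batch_size: int = BATCH_SIZE) -> Iterator[Tuple[int, int]]:
    """Yield (start_offset, count) pairs that cover total_posts in order."""
    for start in range(0, total_posts, batch_size):
        yield start, min(batch_size, total_posts - start)


def batch_filename(blog: str, index: int) -> str:
    """Hypothetical naming scheme: one numbered .xml file per batch."""
    return f"{blog}-export-{index:03d}.xml"


if __name__ == "__main__":
    # 12,300 is a made-up post count, used only for illustration.
    for i, (start, count) in enumerate(export_batches(12_300), start=1):
        print(batch_filename("aidnews", i), start, count)
```

Each (start, count) pair would drive one export run, so every run produces one file that stays under the import limit.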
Cool. Exported all blogs, and there I sat on 70 files, about half a gigabyte worth of .XML files.
2. Importing 350,000 posts into seven new WordPress blogs
For the seven blogs, I created seven new accounts on my HostGator VPS server. Some of my Tumblr blogs were using custom domains, so I changed the DNS, pointing to my HostGator VPS server, and created seven new WordPress blogs. While I was at it, I registered two new domains. ChangeThru.Info became AidResources.org and Youandusand.me became NewsOnGreen.org. Wanted to do that a long time ago, so now was the right time. I kept the old domains live on Tumblr, so I did not lose any traffic.
Installing a new WordPress blog was easy to do with the “Fantastico” program in the server’s cPanel. I chose a neat and simple magazine template and added the usual plugins I always use for caching, automated blog backup, etc…
Then, one by one, I imported the 500 Mbyte of export .XML files. Worked flawlessly, but took about two days. Not a single error, not a single problem. I should say: WordPress impressed me once more.
Done. Well at least with importing the old posts. How to feed in new posts using my myriad of RSS feeds?
3. Implementing an “RSS to blogpost” routine in WordPress
FeedWordPress made my day. This neat plugin imports RSS feeds and converts them into WordPress blogposts. And it does so very well. The plugin is well designed, easy to use, and has a lot of options to customize the import process.
It also has add-ons that allow you to limit the size of the imported post, or add text to the title or the body of the imports. Realllllly neat!
I configured the different feeds I process via Yahoo Pipes, and ran a CRON job to import the blog posts. About two hours work per blog, and the whole cake went into the oven and started cooking: FeedWordPress neatly imported the posts.
You would think I was done. The real work had not even started.
4. The need for speed
From beginning to end, including the customization of the template etc., the export from Tumblr and import into WordPress took me about a week. Fine-tuning the blogs and the server took another two weeks.
As “Good is Fine, Perfect is Best”, I saw some formatting deficiencies I could not live with, which needed modifications to the feeds and the template. And even worse, much much much worse, my poor server was brought to its knees by the extra load. The new blogs demanded so much CPU time from Apache and the MySQL server that everything slowed down to a snail’s pace.
Time to get geeky!
5. Tuning the XML sitemaps
It only takes one dumb sysadmin to make the fastest server go slow, I realized while monitoring the CPU load with the Linux “top” command. I saw the “load average” peak way above “10”, meaning that on average at least 10 processes were running or queueing up for CPU time.
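To make the “load average” reading a bit more concrete, here is a minimal Python sketch that compares the load with the number of CPU cores. The helper names and the threshold of one runnable process per core are my own assumptions, not a rule from any tool. On Linux the load figure also counts processes waiting on disk I/O, so treat the result as a rough indicator.

```python
import os


def queued_per_cpu(load_average: float, cpu_count: int) -> float:
    """Return the average number of runnable processes per CPU core."""
    return load_average / max(cpu_count, 1)


def is_overloaded(load_average: float, cpu_count: int, threshold: float = 1.0) -> bool:
    """True when the load per core exceeds the (assumed) threshold."""
    return queued_per_cpu(load_average, cpu_count) > threshold


if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5 and 15 minute averages (Unix only).
    one_min, five_min, fifteen_min = os.getloadavg()
    cpus = os.cpu_count() or 1
    print(f"1-min load {one_min:.2f} on {cpus} CPU(s): "
          f"overloaded={is_overloaded(one_min, cpus)}")
```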
So I looked at several WordPress plugins that might cause the problem.
The first problem I saw was the XML sitemap generator. There was one option, which I had overlooked: “Rebuild sitemap if you change the content of your blog”. Might be fine on smaller blogs, but the automatic feedimporters were putting up new blogposts at a rate of 100 per hour. So the server was pretty much doing nothing else but generating sitemaps.
I disabled the feature, and scheduled a CRON job to regenerate a sitemap once a day, at night-time. The server load went down significantly.
Oh, by the way, if you have a huge blog, limit the number of posts to include in the sitemap. I capped mine at 10,000, well below the 50,000 URLs the sitemap protocol allows per file!
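What “limit the number of posts in the sitemap” boils down to is: keep only the newest entries. The sitemap plugin does this for you, so the Python sketch below is only an illustration of the idea, with made-up function and variable names.

```python
from datetime import datetime
from typing import List, Tuple
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 10_000  # the cap mentioned above


def build_sitemap(posts: List[Tuple[str, datetime]], max_urls: int = MAX_URLS) -> str:
    """Return sitemap XML containing only the newest max_urls posts."""
    # Sort newest first, then cut the list off at the cap.
    newest = sorted(posts, key=lambda p: p[1], reverse=True)[:max_urls]
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, modified in newest:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        ET.SubElement(entry, "lastmod").text = modified.strftime("%Y-%m-%d")
    return ET.tostring(urlset, encoding="unicode")
```

Older posts stay reachable through the archives; they just no longer bloat the sitemap file.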
6. Tuning WP Supercache
But it only takes a stupid blog administrator to keep even the best plugin from working properly. Supercache needed tuning:
- Unchecked the option “Clear all cache files when a post or page is published” (I published 100 posts per hour, so the cache was always invalidated)
- Pre-loaded the last 10% of blogposts, but put “Refresh preloaded cache” to “0” (as once a post is imported from its RSS feed, I don’t update it anymore, so it can remain in cache forever). This means I only had to pre-load a massive amount of blogposts once, and it was done.
- For the same reason, I put the “expiry time” to “0”, as once cached after a preload cycle, I want the page to remain in cache. It generates a LOT of cached files, but I have plenty of disk space on my server.
If you put “expiry time” to a value > 30 minutes, garbage collection is done every 10 minutes, which generates a lot of load on your server.
- As now I had caches with an eternal life time, I needed to ensure the homepage, feeds, archives were NOT cached, otherwise visitors never got an updated overview of the latest posts.
As I discovered that AFTER I preloaded the posts, and had put the caching to “eternity”, I had to manually delete the cached files for the home page, searches and the running month’s archives.
- To further reduce the load on the PHP server, I chose the option to use “mod_rewrite to serve cache files”.
- And by the way, if you don’t cache the home page, the “Cache Tester” will give an error – as it tests caching on… the home page. So ignore that error, and just look at the source of any random page, to see if, at the bottom of the source, you have a date/time for the cache generation, which is in the past.
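The manual check from the last point (look for a cache timestamp at the bottom of the page source) can also be scripted. The Python sketch below assumes the cached page ends with an HTML comment of the form “Cached page generated by WP-Super-Cache on YYYY-MM-DD HH:MM:SS”. That exact wording is an assumption on my side, so compare it with what your own page source shows before relying on it.

```python
import re
from datetime import datetime
from typing import Optional

# Assumed format of the comment appended to cached pages; verify on your blog.
CACHE_COMMENT = re.compile(
    r"<!--\s*Cached page generated by WP-Super-Cache on "
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s*-->"
)


def cache_generated_at(page_source: str) -> Optional[datetime]:
    """Return the cache generation time found in the page, or None."""
    match = CACHE_COMMENT.search(page_source)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")


def served_from_cache(page_source: str, now: datetime) -> bool:
    """True when the page carries a cache timestamp that lies in the past."""
    generated = cache_generated_at(page_source)
    return generated is not None and generated < now
```

Fetch any random post, pass its HTML and the current time to served_from_cache, and you know whether the page came out of the cache.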
7. Trashing “Most Popular Posts”
As described in this post, one plugin meant to show “the most read posts” also logged every single access to the SQL database, and effectively slowed down my server. Had to trash it.
8. Tuning FeedWordPress
I spent quite a bit of time to tune FeedWordPress, to balance how often feeds were to be imported with the success rate of each import cycle. I combine 1,000+ feeds into about 20 Yahoo Pipes feeds. These are large and complex feeds, which take a lot of time to fetch. Many times the import of a feed would time out.
At first I worked around that problem, by refreshing all feed imports every 10 minutes. But once again, that put a lot of pressure on the server. As you can deal with any problem either by working around it, or by addressing the cause of it, it was time to look for the source of the problem. In the process of doing so, distinguish well between what is “a cause” and what is “a symptom”. Often we try to solve the latter, while we should address the former. Think about that. That is deeeeep!
So the symptom I saw was the feeds timing out. As I know the Yahoo Pipes’ feeds often take very long to refresh, even interactively, I had to patch FeedWordPress with a timeout of 60 seconds to deal with the Yahoo Pipes’ lack of speed. Cool. But that caused a dreaded SQL error “MySQL server has gone away” to appear more frequently in my CRON log files. Beh.
To make a long story short, the solution was to also change the MySQL parameter “wait_timeout” from the “30” seconds configured on my server to “240”. Changed that in /etc/my.cnf and restarted the MySQL server. Problem solved.
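Why do the two timeouts interact? My reading, and it is an assumption, is that the database connection sits idle while a slow feed is being fetched, and the SQL server closes connections that stay idle for “wait_timeout” seconds. A tiny Python sketch of that arithmetic, with made-up names:

```python
FEED_FETCH_TIMEOUT = 60  # seconds, the patched FeedWordPress timeout


def connection_survives(idle_seconds: float, wait_timeout: float) -> bool:
    """The server closes a connection once it has been idle for wait_timeout seconds."""
    return idle_seconds < wait_timeout


if __name__ == "__main__":
    # Assumption: the connection is idle for the whole duration of a slow fetch.
    for wait_timeout in (30, 240):
        ok = connection_survives(FEED_FETCH_TIMEOUT, wait_timeout)
        print(f"wait_timeout={wait_timeout}s: connection survives a "
              f"{FEED_FETCH_TIMEOUT}s fetch -> {ok}")
```

With 30 seconds, a feed that takes the full 60 seconds outlives the connection. With 240 seconds there is room to spare.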
As this solved the timeout when reading RSS feeds, I could also decrease the frequency of the FeedWordPress CRON jobs. And that once again made my server very happy. I like happy servers…!
9. Server tuning
Depending on what exactly you do on your blogs, for large and heavy traffic blogs like the seven I had just migrated, the SQL server might need tuning. This is not for the faint-of-heart, and requires patience and caution. With one wrong setting, you can cause more damage than good.
The first indication that my SQL parameters might need tuning was simply the fact that SQL took up so much CPU time on my server. phpMyAdmin, a tool available on just about every self-hosted server, has a neat feature called “Status”, which gave me an overview of the parameters which might need changing. But I was not sure. So I installed MySQLTuner: using SSH, I logged into my server as root, and three commands gave me a better overview of the parameters I needed to change:
wget mysqltuner.pl
chmod 775 mysqltuner.pl
./mysqltuner.pl
I changed one parameter at a time, and waited 12-24 hours to see its effect.
It seems the two most important parameters to tune are “key_buffer_size” and “table_cache”. It was advised to tune these first before touching the others. Which I did. “key_buffer_size” was ok on its default value of 48M, but “table_cache” needed 4,096 instead of the default 1,024.
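One common heuristic to judge whether “table_cache” is too small is to compare two MySQL status counters, Open_tables and Opened_tables: if tables are opened far more often than the cache can hold them, the ratio is low. The Python sketch below only shows that arithmetic. The function name and the sample numbers are hypothetical, and it is not a copy of what the tuning script does.

```python
def table_cache_hit_rate(open_tables: int, opened_tables: int) -> float:
    """Return Open_tables as a percentage of Opened_tables."""
    if opened_tables == 0:
        # Nothing has been opened yet, so nothing has missed the cache.
        return 100.0
    return 100.0 * open_tables / opened_tables


if __name__ == "__main__":
    # Hypothetical counter values, for illustration only.
    rate = table_cache_hit_rate(open_tables=1024, opened_tables=51200)
    print(f"table cache hit rate: {rate:.1f}%")
```

A rate of a few percent, as in this made-up example, would suggest raising “table_cache” and checking again after a day.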
The rest of the parameters I changed over time:
tmp_table_size: from 32M to 64M
max_heap_table_size: from 32M to 64M
sort_buffer_size: from 1M to 4M
While I am still observing the server and tuning bits and pieces, it looks by now as if, ladies and gentlemen, we have a happy server, purring like a happy cat. Sure enough I still get peak loads, with a “load average” of 3-4, but most of the time it stays at “1” or below. And that is good. I like happy servers.
10. Template tuning
I customized the CSS and some of the functions. I sinned heavily by patching the original template, rather than using a child theme, but that is just because I made so many changes. On top of that, for my main blogs, “speed” is important, so I did not want every page refresh to read several CSS files. So patching, it was. I will live with the fact that I cannot upgrade the template automatically later on, but hey, I learned that upgrading templates is always a pain, and often causes more problems than it is worth. So I avoid theme upgrades like the plague. My opinion, punto.
One of the main challenges I faced was that the template, by default, puts the first image it finds in the post as a thumbnail on the home and archive pages. And often that image was junk (a “Retweet” button, or a Feedburner “Email this” image). So I had to tweak all my Yahoo Pipes feeds to delete those images. Took me a week. Some feeds were too complex to delete all images, so for some of the blogs, I just disabled the thumbnails in the template. I mean, I got to sleep too, so I can’t keep on tuning all feeds to find all kinds of combinations of dummy images.
Maybe that will be work for next Xmas!
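For the record, I removed the junk images inside Yahoo Pipes, not with code. But the filtering idea itself is simple enough to sketch in Python: drop every image tag that contains a marker which gives it away as a button or tracker. The marker list and the function name are hypothetical, and a regular expression is a crude way to treat HTML, so see this as an illustration only.

```python
import re
from typing import Iterable

# Hypothetical substrings that mark an image as junk.
JUNK_MARKERS = ("feedburner", "retweet")

# Matches a whole <img ...> tag, in any letter case.
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def strip_junk_images(html: str, markers: Iterable[str] = JUNK_MARKERS) -> str:
    """Remove <img> tags that contain any marker anywhere in the tag."""
    lowered = tuple(m.lower() for m in markers)

    def keep_or_drop(match: "re.Match[str]") -> str:
        tag = match.group(0)
        return "" if any(m in tag.lower() for m in lowered) else tag

    return IMG_TAG.sub(keep_or_drop, html)
```

With the junk gone, the first image left in the post is a much better candidate for the thumbnail.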
11. Installing the mobile theme
And to finish it all in beauty, I installed a plugin to enable a mobile theme to be displayed when visitors access my blog using a mobile phone.
12. And here is the end result of about four weeks of work:
Check them out on your mobile. Browse through the posts to see if you like the download speed (remember the homepage and archives are not cached, and thus a bit slower!): AidNews, AidResources, News On Green, AidBlogs, The NonProfit Blogs, The Weird Bit and Blogging Today.
Cartoon courtesy Mark Lowe