HOWTO: Back Up Your WordPress Media Library

IMPORTANT: This will only work for images that are currently used in your blog.

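REQUIRES: Linux with the Perl-based rename, as packaged on Debian and Ubuntu. (Cygwin has a rename, but it does NOT function the same as the Perl one, and neither does the util-linux rename that some other distributions ship.) If you’re using Windows, you can boot any live image or run Linux in a virtual machine.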

First, use wget to mirror your site.

wget -E -H -k -K -p -m -P ~ \
-D user.files.wordpress.com,user.wordpress.com \
http://[user].wordpress.com

By default, wget only spiders the original host. This was fine for WordPress blogs back in 2007 when images were stored in /user.wordpress.com/files. Sometime during that year WordPress switched to using /user.files.wordpress.com instead. Both are still valid, but the former redirects to the latter. Hence, we need -H to let wget leave the original host and -D to limit it to the two domains we care about.
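For reference, here is the same wget command spelled out with long option names, one flag per line. This is only a sketch: mirror_cmd is a made-up helper name, "example" is a placeholder user name, and the function prints the command instead of running it so you can check it first.

```shell
# Hypothetical helper: prints (does not run) the long-option form of the
# wget command used in this HOWTO.
mirror_cmd() {
    user=$1                                  # your WordPress.com user name
    set -- wget
    set -- "$@" --adjust-extension           # -E: add .html/.css suffixes
    set -- "$@" --span-hosts                 # -H: allow leaving the original host
    set -- "$@" --convert-links              # -k: rewrite links for local viewing
    set -- "$@" --backup-converted           # -K: keep a .orig copy of converted files
    set -- "$@" --page-requisites            # -p: fetch images, CSS and so on
    set -- "$@" --mirror                     # -m: recursive mirroring
    set -- "$@" --directory-prefix="$HOME"   # -P ~: save everything under ~
    set -- "$@" --domains="$user.files.wordpress.com,$user.wordpress.com"   # -D
    set -- "$@" "http://$user.wordpress.com"
    echo "$*"
}

mirror_cmd example
```

If the printed command looks right for your blog, paste it into the terminal to start the mirror.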

You may therefore have two places to look for your images, as wget doesn’t take note of redirects.

Once wget is done, you’re going to notice a lot of subdirectories in /user.wordpress.com/files and /user.files.wordpress.com if you use the editor at wordpress.com and use the Upload/Insert bar. This is because WordPress likes to “associate” media files with posts. Yes, you can get away with not associating media files with posts IF you don’t use the Upload/Insert bar and only use the Media Library to upload. Otherwise your file gets moved from /user.files.wordpress.com to /user.files.wordpress.com/year/month/post-title.

Now that we have that jibber-jabber out of the way, it’s time to condense all the files into one directory for easy manipulation.

mkdir ~/wordpress-images
find ~/user.files.wordpress.com -type f -exec mv -n {} ~/wordpress-images/ \;
find ~/user.wordpress.com/files -type f -exec mv -n {} ~/wordpress-images/ \;

These commands move all files in all subdirectories to another directory. You may very well have duplicate file names, which is why mv gets the -n option: without it, mv silently overwrites a file that has the same name. With -n, mv refuses to move the file and leaves it where it is. You can issue the find commands again without the -exec part to see if you have any stragglers.
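If you want to see how duplicate names behave before touching your real mirror, here is a small sketch that runs against a throwaway tree. The directory and file names are invented for the demo, and it assumes your mv supports -n (GNU and BSD mv both do).

```shell
# Throwaway demo tree; nothing here touches your real mirror.
demo=$(mktemp -d)
mkdir -p "$demo/site/2010/01" "$demo/site/2010/02" "$demo/flat"
echo first  > "$demo/site/2010/01/cat.jpg"
echo second > "$demo/site/2010/02/cat.jpg"   # same name, different post
echo third  > "$demo/site/2010/02/dog.jpg"

# Flatten the tree; -n leaves a file in place rather than overwriting.
find "$demo/site" -type f -exec mv -n {} "$demo/flat/" \;

# Anything still listed here collided with a name already in flat/.
stragglers=$(find "$demo/site" -type f)
echo "$stragglers"
```

One of the two cat.jpg files stays behind (which one depends on the order find visits them), so it shows up as a straggler you can rename by hand.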

NOW THE FUN BEGINS!

In case you haven’t noticed yet, some of your images will be in the format filename.ext?w=nnn&h=nnn. This happens for two reasons. One is that the Media Library GUI put them there when you used the thumbnailer options or used captions. The second is that wget doesn’t care about images. The -E option only forces HTML files to end in .html and CSS files to end in .css. Will wget ever care about images? Maybe. The wget manual says this about -E:

“At some point in the future, this option may well be expanded to include suffixes for other types of content, including content types that are not parsed by wget.”

So now you’re stuck renaming these files by hand, right? Wrong. This is where Linux really shines.

rename 's/\?[w,h]=[0-9]+&?[w,h]?=?[0-9]*//' ~/wordpress-images/*.*\?*

This command strips the following trailers from file names:

  • ?w=nnn
  • ?h=nnn
  • ?w=nnn&h=nnn
  • ?h=nnn&w=nnn
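To check the expression against those four trailers without renaming anything, you can feed sample names through sed -E, which accepts this particular expression unchanged. This is just a sketch: strip_trailer is a made-up helper and the file names are examples.

```shell
# Same substitution as the rename command, run through sed so that
# no file on disk is touched.
strip_trailer() {
    printf '%s\n' "$1" | sed -E 's/\?[w,h]=[0-9]+&?[w,h]?=?[0-9]*//'
}

for name in 'cat.jpg?w=300' 'cat.jpg?h=200' \
            'cat.jpg?w=300&h=200' 'cat.jpg?h=200&w=300' 'plain.png'; do
    printf '%s -> %s\n' "$name" "$(strip_trailer "$name")"
done
```

Every cat.jpg variant should come out as plain cat.jpg, and plain.png should pass through untouched.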

Wait, what? Let me go through it step by step.

'
Specifies the beginning of the string. Everything between the two apostrophes is going to be our expression.
s
This starts a substitution expression.
/
Used to delineate parts of the expression. Since we’re using substitution, rename is going to expect s/match-expression/replace-expression/
\
Used to escape the next character, as it is a special character in a Perl regular expression.
?
This marks the spot where we want to start editing file names. This is the beginning of the part we want to lop off.
[w,h]
Brackets indicate a set of characters to match against. We’re looking for either a w or an h as the next character. (The comma inside the brackets is matched literally too, so [wh] would be tighter, but no trailer starts with a comma, so it does no harm here.)
=
The parts we want to lop off start with ?w= or ?h=. The third character will always be an equal sign.
[0-9]+
You already know what [w,h] did. [0-9] says that we’re looking for a digit. The plus sign is new, though. It says to match [0-9] at least once and keep matching until the next character is not a digit.
&?
Some images only have one dimension specified. In that case, the file name stops there. If it has a second dimension specified, the next character will be an ampersand. But since we can’t guarantee the ampersand’s existence, we use the question mark to say “match 0 or 1 times.”
[w,h]?
Again, if we have a second part, the next character is going to be a w or an h. But since we can’t guarantee the existence of a second part, we match it 0 or 1 times.
=?
The second part has an equal sign after its letter too, so it gets the same 0 or 1 treatment.
[0-9]*
Like [0-9]+, only this time the asterisk means “0 or more times.”
/
FINALLY! The end of the match expression!
/
Wait, what? Nothing between the two slashes? That’s right. The replace-expression is empty, so we’re taking the part we just matched and lopping it off.
'
The end of the expression.
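If your system doesn’t have the Perl rename, a plain shell loop can do roughly the same job. This is a sketch and not a drop-in equivalent: strip_query_names is a made-up helper, it drops everything from the first question mark onward instead of matching only the w= and h= trailers, and it uses mv -n so an existing file is never overwritten.

```shell
# Hypothetical fallback for systems without the Perl rename.
strip_query_names() {
    dir=$1
    for f in "$dir"/*\?*; do
        [ -e "$f" ] || continue              # the glob matched nothing
        base=${f##*/}                        # file name without the directory
        mv -n -- "$f" "$dir/${base%%\?*}"    # drop everything from the first "?"
    done
}

# Try it on a scratch directory first.
scratch=$(mktemp -d)
touch "$scratch/cat.jpg?w=300" "$scratch/dog.png?w=300&h=200" "$scratch/plain.gif"
strip_query_names "$scratch"
ls "$scratch"
```

Once the scratch run looks right, point the function at ~/wordpress-images instead.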

And now, for some much-needed sleep. Just wanted to document this before I passed out, really. 🙂