HOWTO: Backup Your WordPress Media Library
IMPORTANT: This will only work for images that are currently used in your blog.
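REQUIRES: Linux. (Cygwin has a rename, but it does NOT function the same as Linux’s rename.) Specifically, you need the Perl-based rename that ships with Debian and Ubuntu; the util-linux rename found on some other distributions takes different arguments. You can download any live image or run Linux in a virtual machine if you’re using Windows.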
First, use wget to mirror your site.
wget -E -H -k -K -p -m -P ~ \
     -D user.files.wordpress.com,user.wordpress.com \
     http://[user].wordpress.com
By default, wget only spiders the original domain. This was fine for WordPress blogs back in 2007 when images were stored in /user.wordpress.com/files. Sometime during that year WordPress switched to using /user.files.wordpress.com instead. Both are still valid, but the latter redirects to the former. Hence, we need -H to let wget span hosts and -D to limit it to the domains we actually care about.
Because wget doesn’t take note of redirects, you may end up with two places to look for your images.
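If the short flags are hard to remember, here is the same invocation spelled out with wget's long option names. Treat it as a sketch: mirror_cmd is a helper name I made up for illustration, "user" is a placeholder for your own WordPress.com user name, and the function only prints the command so you can look it over before running it. It assumes wget 1.12 or later, where -E is called --adjust-extension.

```shell
#!/bin/sh
# mirror_cmd is a hypothetical helper: it prints (does not run) the wget
# command for a given WordPress.com user name.
#   -E = --adjust-extension    -H = --span-hosts
#   -k = --convert-links       -K = --backup-converted
#   -p = --page-requisites     -m = --mirror
#   -P = --directory-prefix    -D = --domains
mirror_cmd() {
    blog_user=$1
    printf '%s ' wget \
        --adjust-extension \
        --span-hosts \
        --convert-links \
        --backup-converted \
        --page-requisites \
        --mirror \
        --directory-prefix="$HOME" \
        --domains="${blog_user}.files.wordpress.com,${blog_user}.wordpress.com" \
        "http://${blog_user}.wordpress.com"
    printf '\n'
}

# Print the command for the placeholder user name.
mirror_cmd "user"
```

Copy the printed line into your terminal once it looks right.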
Once wget is done, you’re going to notice a lot of subdirectories in /user.wordpress.com/files and /user.files.wordpress.com if you use the editor at wordpress.com and use the Upload/Insert bar. This is because WordPress likes to “associate” media files with posts. Yes, you can get away with not associating media files with posts IF you don’t use the Upload/Insert bar and only use the Media Library to upload. Otherwise your file gets moved from /user.files.wordpress.com to /user.files.wordpress.com/year/month/post-title.
Now that we have that jibber-jabber out of the way, it’s time to condense all the files into one directory for easy manipulation.
mkdir ~/wordpress-images
find ~/user.files.wordpress.com -type f -exec mv -n {} ~/wordpress-images \;
find ~/user.wordpress.com/files -type f -exec mv -n {} ~/wordpress-images \;
These commands move all files in all subdirectories into a single directory. I suggest doing it this way as you may very well have duplicate file names. In this event, thanks to -n (no-clobber), the mv command will refuse to overwrite the existing file and leave the duplicate where it’s at; without -n, mv would silently overwrite it. You can issue the find commands again without the -exec part to see if you have any stragglers.
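If you'd like to see how mv treats a name collision before touching your real mirror, here is a small sandbox sketch. It assumes an mv that supports -n (GNU coreutils and the BSDs do), and every path in it is a throwaway created under a temp directory, not something from your site.

```shell
#!/bin/sh
# Sandbox demo: everything happens inside a throwaway temp directory.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/site/2012/01/post-a" "$sandbox/site/2012/02/post-b" "$sandbox/images"

# Two different files that share the same name.
echo "first"  > "$sandbox/site/2012/01/post-a/photo.jpg"
echo "second" > "$sandbox/site/2012/02/post-b/photo.jpg"

# -n tells mv not to overwrite an existing target, so one file stays behind.
find "$sandbox/site" -type f -exec mv -n {} "$sandbox/images" \;

# Count what was moved and what is still sitting in the mirror tree.
moved=$(find "$sandbox/images" -type f | wc -l | tr -d ' ')
stragglers=$(find "$sandbox/site" -type f | wc -l | tr -d ' ')
echo "moved: $moved, left behind: $stragglers"

# Clean up the sandbox.
rm -rf "$sandbox"
```

Any file reported as left behind needs a unique name before you move it by hand.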
NOW THE FUN BEGINS!
In case you haven’t noticed yet, some of your images will be in the format filename.ext?w=nnn&h=nnn. This happens for two reasons. One is that the Media Library GUI put those parameters there when you used the thumbnailer options or captions. The other is that wget doesn’t care about images. The -E option only forces HTML files to end in .html and CSS files to end in .css. Will wget ever care about images? Maybe.
As the wget manual says of this option: “At some point in the future, this option may well be expanded to include suffixes for other types of content, including content types that are not parsed by wget.”
So now you’re stuck renaming these files by hand, right? Wrong. This is where Linux really shines.
rename 's/\?[w,h]=[0-9]+&?[w,h]?=?[0-9]*//' ~/wordpress-images/*.*\?*
This command strips the following trailers from file names:
- ?w=nnn
- ?h=nnn
- ?w=nnn&h=nnn
- ?h=nnn&w=nnn
Wait, what? Let me go through it step by step.
- '
  Specifies the beginning of the string. Everything between the two apostrophes is going to be our expression.
- s
  This starts a substitution expression.
- /
  Used to delimit the parts of the expression. Since we’re using substitution, rename is going to expect s/match-expression/replace-expression/
- \
  Used to escape the next character, as it is a special character in a Perl regular expression.
- ?
  This marks the spot where we want to start editing file names. This is the beginning of the part we want to lop off.
- [w,h]
  Brackets indicate a set of characters to match against. We’re looking for either a w or an h as the next character. (The comma is not a separator inside brackets; it gets matched literally too, so [wh] would be the stricter form.)
- =
  The parts we want to lop off start with ?w= or ?h=. The third character will always be an equal sign.
- [0-9]+
  You already know what [w,h] did. [0-9] is saying that we’re looking for a digit. The plus sign is new, though. It’s saying to look for [0-9] at least 1 time and keep matching until the next character is not a digit.
- &?
  Some images only have one dimension specified. In that case, the file name stops there. If it has a second dimension specified, the next character will be an ampersand. But since we can’t guarantee the ampersand’s existence, we use the question mark to say “match 0 or 1 times.”
- [w,h]?
  Again, if we have a second part, the next character is going to be a w or an h. But since we can’t guarantee the existence of a second part, we match it 0 or 1 times.
- =?
  Same story for the second equal sign: it is only there if there is a second part, so we match it 0 or 1 times.
- [0-9]*
  Like [0-9]+, only this time the asterisk means “0 or more times.”
- /
  FINALLY! The end of the match expression!
- /
  Wait, what? Nothing between the two slashes? That’s right. The replace expression is empty, so we’re taking the part we just matched against and lopping it off.
- '
  The end of the expression.
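If you don't have the Perl-based rename handy, the same substitution can be tried out with sed and mv. This is a sketch, not the method used above: strip_trailer is a helper name made up for illustration, the sample file names are invented, and it assumes a sed that understands -E (extended regular expressions). It only touches files in a temp directory.

```shell
#!/bin/sh
# strip_trailer (hypothetical helper) prints a file name with the
# ?w=nnn, ?h=nnn, ?w=nnn&h=nnn or ?h=nnn&w=nnn trailer removed, using the
# same pattern as the rename command in sed's extended-regex syntax.
strip_trailer() {
    printf '%s\n' "$1" | sed -E 's/\?[w,h]=[0-9]+&?[w,h]?=?[0-9]*//'
}

# Try it on made-up sample files in a throwaway directory.
demo=$(mktemp -d)
touch "$demo/cat.jpg?w=300" "$demo/dog.png?h=120&w=80" "$demo/plain.gif"

for path in "$demo"/*\?*; do
    [ -e "$path" ] || continue      # the glob matched nothing
    name=$(basename "$path")
    # -n keeps mv from overwriting a file that already has the clean name.
    mv -n "$path" "$demo/$(strip_trailer "$name")"
done

# Record and show the resulting names, then clean up.
renamed=$(ls "$demo")
echo "$renamed"
rm -rf "$demo"
```

To use it for real, point the loop at your own image directory instead of the temp one, and consider swapping mv for echo on the first pass so you can preview the new names.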
And now, for some much-needed sleep. Just wanted to document this before I passed out, really. 🙂
Déowyth 9:40 am on October 16, 2012
Reblogged this on Déowyth and commented:
Reblogging for reference.
tergrundo 9:28 pm on July 9, 2014
I wasn’t happy when I found out that WordPress.com doesn’t give any ways to back up your media library. I found people with the same problem since 2009.
To scratch my own itch, I wrote an application to download the media library from WordPress.com blogs. With this tool, you can download the whole library or filter by file extension.
I link to it here just in case someone needs a simpler way to download the files:
[dead url removed]
dreschx 12:02 pm on August 12, 2014
tergrundo, you genius. But does your application work with Windows?
tergrundo 5:05 pm on August 16, 2014
It could work, since it is written in Python. But I don’t support Windows. I don’t support proprietary platforms in general, sorry.