My epic MP3 conversion quest – part 1: All your metadata are belong to us

July 2, 2010

This is the first of three posts in which I explain my quest to convert all of my CDs into MP3s, mostly in the hope that it will dissuade others from following in my footsteps. So far, the MP3 conversion 2.0 project (the second run-through of it, details will be revealed) has been running for about three years, in which time absolutely no MP3s have been output. It has, however, resulted in a large number of XML files being generated, and used up large chunks of my free time. This post describes the background to the project, which is basically MP3 conversion 1.0. The second post will describe the initial phases of conversion 2.0. The final post, hopefully, will be written once MP3s are being generated and will cover the events leading up to this (supposed) happy state of affairs.

Many, many years ago (probably around 2001 or thereabouts), I treated myself to what was, at the time, a top-of-the-range iPod – the 2nd generation, 20GB model. In order that this ridiculously expensive artifact should be more use than a pretty white paperweight, I needed to find a way to convert my CD collection into MP3s.

Most people at this point would ask “well, why didn’t you use iTunes?”. This was not an option because it would have meant running an OS capable of running iTunes on one of my machines. All of the machines I had at the time ran Linux, I had a pathological hatred of Windows, and I didn’t have enough money for a Mac. So it was imperative to find a Linux-based way to do this (once I’d gone through the faff of getting the damn iPod to connect in the first place – this was in the early days of the iPod, when Apple were using FireWire to connect it to the computer). Being a good Linux user, I realised that the solution would clearly involve hacking together some Perl scripts. Discerning readers will instantly realise that this is not going to end well.

Now, there are perfectly good Linux applications which will do the conversion process for you, but I didn’t fancy the process of feeding a disk to the machine, waiting for it to rip and convert, then feeding another one in and so on. My solution was to separate the ripping from the conversion – to rip as many CDs in raw WAV form as disk space would allow, then to kick off the conversion process which would chew through them leaving MP3 files in its wake. Then, the superfluous WAVs could be deleted and more CDs ripped and so on until it was all done.

There was another reason for this separation, of course, which is that I could be sure that the MP3 metadata (the artist and track information stored in each MP3 file) was correct, because I am obsessive about that sort of thing. I would feed CDs into Grip which would look the disc up in CDDB to get the track details, save a WAV for each track in a suitable directory /wavs/[artist]/[album]/[tracknum]_[track].wav, then I would be appalled at the various spelling errors, the artists written in CAPITALS, the double CDs where the two CDs had been submitted by different users with different album names and different capitalisation of the track titles and so on. And the CDs which simply came out with the wrong data entirely, because the CDDB identifier scheme is laughably bad and more-or-less guaranteed to lead to many identifiers which are shared by completely unrelated CDs.
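To see why unrelated CDs can end up sharing an identifier, here is a minimal sketch of the classic CDDB disc ID calculation as I understand it: a checksum of the digits of each track's start time, the total playing time in seconds, and the track count, packed into 32 bits. The function names and the two example discs are made up for illustration.

```python
FRAMES_PER_SECOND = 75  # CD audio positions are counted in 1/75-second frames


def digit_sum(n):
    """Add up the decimal digits of n, e.g. 120 -> 1 + 2 + 0 = 3."""
    total = 0
    while n > 0:
        total += n % 10
        n //= 10
    return total


def cddb_disc_id(track_offsets, leadout_offset):
    """Return the 8-hex-digit disc ID for track start offsets given in frames."""
    starts = [offset // FRAMES_PER_SECOND for offset in track_offsets]
    checksum = sum(digit_sum(s) for s in starts) % 255
    length = leadout_offset // FRAMES_PER_SECOND - starts[0]
    disc_id = (checksum << 24) | (length << 8) | len(track_offsets)
    return f"{disc_id:08x}"


# Two hypothetical three-track discs with different track boundaries.
disc_a = cddb_disc_id([2 * 75, 120 * 75, 250 * 75], 300 * 75)
disc_b = cddb_disc_id([2 * 75, 102 * 75, 205 * 75], 300 * 75)
print(disc_a, disc_b)  # both print as 0c012a03
```

Only the digit sums of the start times feed the checksum, so any two discs with the same track count and running time whose digit sums happen to agree will collide, however different their contents.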

So, by saving the WAVs in this form, I thought I could be cunning – I’d rip them with Grip and observe the metadata chaos, curse the idiots who’d submitted the data, then move them into correctly named directories and rename the tracks to reflect the actual titles and track numbers. So, the process that turned them into MP3s would be able to pick up on this – when it converted the file /wavs/Beatles/Revolver/01_Taxman.wav it would know that it had to set the resulting MP3 to have “Beatles” as the artist, “Revolver” as the album, “01” as the track number and “Taxman” as the title. Sometimes I’d get lucky and get an album which had been submitted by someone who lived up to my high standards and needed very little to be fixed. Sometimes I’d have to enter all the data from scratch. Swings and roundabouts.
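The original did this in Perl, which I no longer have to hand, so what follows is a rough Python sketch of the same idea rather than the actual script. It assumes the /wavs/[artist]/[album]/[tracknum]_[track].wav layout described above; the function name is invented for the example.

```python
from pathlib import PurePosixPath


def metadata_from_path(wav_path):
    """Recover artist, album, track number and title from a WAV path."""
    path = PurePosixPath(wav_path)
    # The last two directories are the artist and the album.
    artist, album = path.parts[-3], path.parts[-2]
    # The file name is "<tracknum>_<title>.wav"; split on the first underscore
    # only, so that titles containing underscores survive intact.
    track_number, _, title = path.stem.partition("_")
    return {
        "artist": artist,
        "album": album,
        "track": track_number,
        "title": title,
    }


info = metadata_from_path("/wavs/Beatles/Revolver/01_Taxman.wav")
print(info)
```

Note that everything hinges on the directory and file names being exactly right, which is both the attraction of the scheme and its weakness.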

In this way, I’d gradually fill the hard disk with WAVs, with the correct names and directories, and then let loose the aforementioned Perl script to direct the conversion to MP3 – basically, by parsing the directory names leading up to the WAV, and the filename of the WAV itself, then invoking LAME to do the actual conversion. The conversion took quite a while, partly because we’re talking about 2001-era processors, partly because there are loads of command-line options one can use to tell LAME to do a better job of the conversion – basically you can trade extra conversion time for better resulting MP3 quality. I felt this was a worthwhile trade. I stuck a finger in the air and guessed that using variable-bit-rate encoding with a maximum rate of 160kb/s would sound reasonable, and give me a chance of fitting the entire collection into the iPod. Once a batch of WAVs had been converted and the MP3s saved in a similar /[artist]/[album]/ directory structure, the WAVs could be deleted to make way for the next lot.
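For the curious, the LAME invocation can be sketched roughly as below. This is an illustration, not the command I actually used: the only setting taken from the description above is the 160kb/s ceiling on variable-bit-rate encoding (`-B 160`). The VBR level (`-V 4`) and the slower, higher-quality algorithm setting (`-q 2`) are guesses, and the function name and /mp3s output directory are invented.

```python
def lame_command(wav_path, mp3_path, artist, album, track, title):
    """Build (but do not run) a LAME command line for one track."""
    return [
        "lame",
        "-V", "4",      # variable bit rate; the level here is an assumption
        "-B", "160",    # maximum bit rate allowed for VBR, in kb/s
        "-q", "2",      # slower, better-quality encoding; also an assumption
        "--ta", artist,  # ID3 artist
        "--tl", album,   # ID3 album
        "--tn", track,   # ID3 track number
        "--tt", title,   # ID3 title
        wav_path,
        mp3_path,
    ]


cmd = lame_command(
    "/wavs/Beatles/Revolver/01_Taxman.wav",
    "/mp3s/Beatles/Revolver/01_Taxman.mp3",
    "Beatles", "Revolver", "01", "Taxman",
)
print(" ".join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would perform the conversion on a machine with LAME installed. Building the command as a list rather than a single string means that spaces and quotes in titles need no shell escaping.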

This worked reasonably well but was not without its problems. One was that I lost interest for a bit and ended up with an iPod featuring all artists in the range A – D for a while. Another was that in order to get the artist ordering to come out the way I wanted, I had to butcher many artist names by dropping the initial “The” – hence “Beatles”, “Charlatans”, “La’s”. I’m not sure how I squared this with my desire for correct metadata.

A more serious problem was that encoding the artist, album and track name using directories and file names turned out to be a very bad idea. Firstly, it meant that I couldn’t use forward slashes in anything, so AC/DC became AC\DC and there were various other compromises because of questionable characters in file names (I forget the precise details). Secondly, not everything is an album in this sense – I have a 5CD compilation of punk stuff called “1-2-3-4”, and this became both the artist and album name so the tracks in the compilation came out as e.g. name=“Blitzkrieg Bop – The Ramones”, artist=“1-2-3-4”. Yuck. Other compilations were similarly affected.
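The slash problem is easy to demonstrate. The sketch below first shows a path library splitting “AC/DC” into two directories, then one possible workaround that I did not use at the time: percent-encoding the awkward characters so that the real name can be recovered later. The helper names and the album name are made up.

```python
from pathlib import PurePosixPath
from urllib.parse import quote, unquote

# The naive approach: the slash in the artist name becomes a path separator.
naive = PurePosixPath("/wavs") / "AC/DC" / "Some Album"
print(naive.parts)  # the artist has been split into 'AC' and 'DC'


def encode_name(name):
    """Make a name safe to use as a single directory or file name."""
    # safe=" " keeps spaces readable; "/" and "%" are percent-encoded.
    return quote(name, safe=" ")


def decode_name(encoded):
    """Recover the original name from an encoded directory or file name."""
    return unquote(encoded)


encoded = encode_name("AC/DC")
print(encoded, "->", decode_name(encoded))
```

Unlike swapping the slash for a backslash by hand, this is reversible, so the MP3 tag can still say “AC/DC” even though the directory does not.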

On the plus side, it meant that I didn’t have the arbitrary “disk 1 / disk 2” distinction for double albums, because I could just bung all of the tracks on, say, “The Wall” in the same /Pink Floyd/The Wall/ directory and have them come out belonging to one album rather than “The Wall Disk 1” and “The Wall Disk 2”. This was a good result. Also, this worked in reverse – given the Big Star combination release of “#1 Record” and “Radio City” which has both albums on the same disk, I could have these come out as two separate albums which I thought was probably preferable.

So, ultimately all of my CDs at the time got MP3d. But despite my efforts, errors crept in. For example, in an unguarded moment, I’d forgotten to remove the initial “The” from “The Beach Boys”, and similarly on “The Smiths” on one of their albums (“Strangeways…” I think). So now I had albums by “The Smiths” and “Smiths”. D’oh. And, by making the process far more complicated than it needed to be, I was quite rubbish about applying it to new CDs so I ended up with a large backlog of CDs that weren’t converted (therefore, because of intrinsic weirdness, they never got listened to – I didn’t feel I could listen to them properly until I could get them on the iPod, and that didn’t happen, so…).
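With hindsight, a small check could have caught the “The Smiths” versus “Smiths” slip. The sketch below is not something I had at the time: it groups artist names by a normalised key (leading “The” dropped, case ignored) and reports any key with more than one spelling. The function names are hypothetical.

```python
from collections import defaultdict


def sort_key(artist):
    """Normalise an artist name: strip a leading 'The ' and ignore case."""
    name = artist.strip()
    if name.lower().startswith("the "):
        name = name[4:]
    return name.casefold()


def find_variants(artists):
    """Return {key: [spellings]} for artists stored under several spellings."""
    groups = defaultdict(set)
    for artist in artists:
        groups[sort_key(artist)].add(artist)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


library = ["Beatles", "The Smiths", "Smiths", "Charlatans", "The Beach Boys", "Beach Boys"]
print(find_variants(library))
```

Using the key only for sorting and comparison, and leaving the stored name alone, would also have avoided butchering the artist names in the first place.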

And here ends part one. I have a large collection of MP3s; those which come from compilations have dubious metadata in which the name of the compilation appears as the artist name. I also have a load of CDs which aren’t converted and, irritatingly, no good way of figuring out which these are. The second post in this series will focus on my determination to do something about this distressing situation and how this ran aground fairly quickly. The third post (hopefully) will be the happy ending.


