My epic MP3 conversion quest – part 2: I made you some metadata but I eated it

July 25, 2010

Those of you not keen or demented enough to have read the first blogpost in this series may wish to do so now, otherwise this may seem even more incomprehensible than it actually is. At the end of that post, I had many of my CDs in convenient MP3 form, but the situation wasn’t entirely satisfactory and my way of going about it quite convoluted, so inevitably it didn’t get used for new CDs. Thus a backlog developed.

The situation was brought to a head when I bought a Mac for doing development work and, after importing all of the MP3s into iTunes, I found I was using iTunes to rip new CDs. This meant trusting iTunes to do a good job (both of the ripping and of the conversion), and in any case wasn’t nearly complicated enough. You just put the CD in the drive and it gets converted? That kind of thing is OK for normal people, but I’m an IT professional, damnit.

At that point, I had something of an epiphany – I realised that the only way to do the job properly would be to rip each CD, store the uncompressed audio permanently on a large hard drive, and find a better way to represent the associated metadata, so that the audio could be converted into a different format any time I liked. Originally, I planned to go for MP3, but I had vague thoughts about trying to install Linux and an Ogg Vorbis player on my 20G iPod, so converting to Ogg Vorbis seemed like it might be a useful thing to do.

Ripping a complete CD is easy using cdparanoia – and you can have it come out as one long WAV file (basically the entire stream of audio from beginning to end with no breaks). This has the virtue that any ‘hidden’ tracks before the first real track starts appear magically at the start of the resulting WAV – they’re really just at the beginning of the disc, but the electronic table of contents is rigged to declare the first few minutes of the disc as ‘silence’ which means that players skip over it (I own four such CDs).

So, I bought a 500G external hard drive, and wrote a script to rip CDs, which went something like this:

  1. Look up disc in CDDB (which will go wrong in some cases, but never mind) to get the artist and album and save this data somewhere.
  2. Replace spaces in the artist / album with underscores (e.g. “The Beatles” -> “The_Beatles”)
  3. Create a directory: <artist>/<album> (e.g. “The_Beatles/Abbey_Road”)
  4. Read the TOC (Table of Contents) from the disc and save it in the target directory as a plain text file.
  5. Rip the CD as a big WAV file, save it in the target directory along with a file containing its MD5 checksum (to allow it to be subsequently checked for corruption).
  6. Move the CDDB data to the target directory.
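
A minimal Python sketch of steps 2, 3 and the checksum half of step 5 might look like this. It is not the original script: the function names, the `root` parameter and the `.md5` file naming are all assumptions made for illustration, and the CDDB lookup and the cdparanoia call are left out.

```python
import hashlib
from pathlib import Path


def target_dir(root: Path, artist: str, album: str) -> Path:
    """Steps 2-3: build <artist>/<album> under root, with spaces as underscores."""
    return root / artist.replace(" ", "_") / album.replace(" ", "_")


def write_md5(wav_path: Path) -> Path:
    """Step 5 (checksum part): store the WAV's MD5 in a file next to it."""
    digest = hashlib.md5()
    with wav_path.open("rb") as f:
        # Read in 1 MiB chunks so a 400-500M rip never has to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    # Hypothetical naming: disc.wav -> disc.wav.md5
    md5_path = wav_path.with_name(wav_path.name + ".md5")
    # "<hex digest>  <filename>" is the layout that `md5sum -c` accepts.
    md5_path.write_text(f"{digest.hexdigest()}  {wav_path.name}\n")
    return md5_path
```

Calling `target_dir(Path("rips"), "The Beatles", "Abbey Road")` gives `rips/The_Beatles/Abbey_Road`. Writing the checksum in `md5sum` layout means the rip can be re-verified later with a stock tool rather than with another script.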

Thus, while working at home on a project (I was a self-employed IT consultant at the time), I positioned myself between 3 computers – a Mac for working, a laptop and an old desktop machine, and periodically fed new CDs to the laptop and desktop system and triggered the ‘rip CD’ script. At the end of a run, I’d transfer that day’s ripped CDs to the external hard drive (fixing up minor filename issues in the process). And repeat the next day until all of my CD collection at the time was ripped. This was back in the early summer of 2008.

To recap, this is what I had at the end of it:

  1. A collection of artist/album directories (mostly correct but with some “Beatles” vs “The_Beatles” style duplication, and with albums containing “disc_1” or “disc_2” in places)
  2. A long (400-500M) WAV in each directory and a TOC file (text, not XML) giving just the track start points and lengths.
  3. The CDDB data (artist / album names, track names), not necessarily correct.

This is what I wanted:

  1. Correctly named directories (the names didn’t really mean anything, unlike in the previous scheme, but I still wanted some semblance of order).
  2. The WAV file, as described.
  3. An XML file for each WAV, containing the metadata for it (track names, timings).
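
To make item 3 a bit more concrete, here is a rough Python sketch of what generating such a file could look like. The element and attribute names (`disc`, `track`, `start`, `length`) are invented for illustration, as is the assumption that timings are stored as CD sector counts; the actual layout of the XML is not described here.

```python
import xml.etree.ElementTree as ET


def build_metadata(artist, album, tracks):
    """Return an XML string describing one ripped disc.

    tracks is a list of (title, start_sector, length_sectors) tuples.
    All element and attribute names are hypothetical.
    """
    disc = ET.Element("disc", artist=artist, album=album)
    for number, (title, start, length) in enumerate(tracks, start=1):
        track = ET.SubElement(
            disc,
            "track",
            number=str(number),
            start=str(start),
            length=str(length),
        )
        track.text = title
    return ET.tostring(disc, encoding="unicode")
```

Keeping the timings next to the names in one file means a converter needs only the WAV and this XML, with no further trip to CDDB or the original disc.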

The idea was that I could then convert the collection by taking each WAV, splitting it temporarily into separate WAVs on the track boundaries, then encoding each track in the format of my choice (probably MP3 but potentially others).
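
The splitting step can be sketched in Python with the standard `wave` module. This assumes the track boundaries are available as (start, length) pairs measured in CD sectors, where one sector is 588 audio frames (44,100 frames per second divided by 75 sectors per second); the function name and the output file naming are made up for the example, and the encoding step is not shown.

```python
import wave
from pathlib import Path

FRAMES_PER_SECTOR = 588  # 44100 frames/s divided by 75 sectors/s on an audio CD


def split_wav(src, tracks, out_dir):
    """Write one WAV per (start_sector, length_sectors) pair; return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    with wave.open(str(src), "rb") as big:
        params = big.getparams()
        for number, (start, length) in enumerate(tracks, start=1):
            # Seek to the track start, then read the whole track in one go.
            big.setpos(start * FRAMES_PER_SECTOR)
            frames = big.readframes(length * FRAMES_PER_SECTOR)
            path = out_dir / f"track{number:02d}.wav"  # hypothetical naming
            with wave.open(str(path), "wb") as out:
                out.setparams(params)
                # writeframes() fixes up the frame count in the header.
                out.writeframes(frames)
            paths.append(path)
    return paths
```

Each track is held in memory while it is written, which is fine for ordinary tracks of a few tens of megabytes. The temporary per-track WAVs can be deleted once the encoder has run.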

Now, what I should have done at this point was to generate the XML file based on the (incorrect but fixable) CDDB data and the TOC. Unfortunately, I had the foolish idea that, for those discs which I’d previously converted, I’d be able to extract the metadata from those MP3s (which I assumed was basically correct, since I’d personally entered or checked all of it). So, I wrote a script to do exactly that, which generated a large bunch of XML files with the existing track / artist data, each of which then needed to be moved (manually) to the new target directory. The script was able to disentangle my ‘artist – track’ naming on compilation albums into a separate artist and track, which was a bonus (even though it would do this erroneously for any track names which happened to contain hyphens).
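
The ‘artist – track’ disentangling, and the reason it misfires, can be shown in a few lines of Python. This is a guess at the logic rather than the original script: it assumes the separator is a hyphen with a space on either side, and the function name is hypothetical.

```python
def split_artist_track(title, separator=" - "):
    """Split 'artist - track' into (artist, track).

    Returns (None, title) when the separator is absent. Splits on the
    first separator only, so a plain track name that happens to contain
    ' - ' is wrongly treated as 'artist - track'.
    """
    artist, sep, track = title.partition(separator)
    if not sep:
        return None, title
    return artist.strip(), track.strip()
```

With this, `"Some Artist - Some Track"` splits as intended, but a track on an ordinary album called `"Intro - Reprise"` comes back with an artist of `"Intro"`. Nothing in the string says which case applies, so the false positives can only be caught by knowing whether the album is a compilation.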

In order to avoid accidentally wiping out the WAV files as I developed scripts to regularise and correct the associated files, I zipped up the entire directory tree of data, without the WAVs. I could then unzip this elsewhere, test out my scripts against the unzipped directory, figure out what still needed to be done, modify the scripts, unzip again and retest. What I was aiming for was a definitive script or set of scripts that I could run on my real directory containing WAVs and XML files, which would fix up all of the data problems and leave me with the desired single XML file for each WAV.
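
Taking a WAV-free snapshot of the tree is easy to sketch with Python's `zipfile` module. The function name is made up, and the sketch assumes the archive is written somewhere outside the tree being zipped (otherwise the half-written archive would be picked up as well).

```python
import zipfile
from pathlib import Path


def zip_without_wavs(root, archive):
    """Zip every file under root except *.wav; return the names stored."""
    root = Path(root)
    names = []
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(root.rglob("*")):
            # Skip directories and the big audio files, keep everything else.
            if path.is_file() and path.suffix.lower() != ".wav":
                name = path.relative_to(root).as_posix()
                zf.write(path, name)
                names.append(name)
    return names
```

Because paths are stored relative to `root`, unzipping into a scratch directory recreates the same artist/album layout, so the clean-up scripts can be tested there without going anywhere near the real rips.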

This was a mammoth task, so naturally enough it didn’t get done and I basically abandoned the project for a while because I couldn’t see any way of doing it sensibly (and going through the data and fixing up each of the files by hand still seemed like it would be more trouble than it was worth).

Thus ends part 2. I suppose this quest is a little like the original Star Wars trilogy – after part 1 there was a victory but work remained to be done; after part 2 we have an unsatisfactory situation in which the challenge has reappeared and final victory seems some way off; and in part 3 I’ll explain how I was able to start making progress again, and finally achieve the goal of having MP3 files produced from stored CD rips, using metadata stored in an accompanying XML file.


