My epic MP3 conversion quest – part 3: Mr T ate my metadata

October 4, 2011

If you’re new to this craziness, start by reading part 1. If that doesn’t leave you feeling faintly nauseous, try part 2. If you’re still a glutton for punishment, continue and at least you’ll have some idea what’s going on.

After a brief flurry of activity in which I actually ripped all of the CDs I had at the time into WAVs, the project languished while I tried to figure out a way of getting all of the XML files into shape. The thing which restarted my interest in the project, and gave me a way out of the murky darkness, was Subversion.

If you’re a techie type, you’ll probably know about Subversion already, but for the benefit of those who aren’t (in which case, well done for getting this far), Subversion is a version control system, usually used for program code. You set up a master ‘repository’ which stores all of your work on a machine somewhere. You can then ‘check out’ a copy of the files in this repository to whatever machine (or machines) you want to work on. You make changes to files, add files, delete them, and periodically commit your changes back to the repository. If, in the meantime, other people have been changing files and committing their changes to the repository as well, you can update your local ‘working copy’ to incorporate their changes (and they can update theirs to incorporate your work). You can also look at the revision history of stuff in the repository, so you can recover earlier versions of files if necessary.

I realised earlier in the year that I could sort out a lot of the files I have languishing on various computers, by setting up a bunch of personal Subversion repositories and putting all my stuff in them (this solves my usual problem of having some stuff on one machine, copying it to a laptop or something, then getting confused some months later about which is the more up-to-date copy – with Subversion, the repository is the master copy).

The cunning idea I had was this: I would check in the big tree of WAV files and XML into Subversion, except that I would tell it to ignore the WAVs (checking in 400GB of WAV files wouldn’t have been particularly sensible). Then, I could check out the resulting tree to another machine, work on it, and commit a new version every time some progress was made. This fixed the ‘having to write one script to do it all in one stage’ problem – it could be broken into manageable chunks, and when it was eventually all done, I could simply update the directory containing the WAVs and all of the finalised XML would just fall into place around them.

So, armed with this wonderful new ability, I went through something like the following sequence of logical steps.

  1. Using the CDDB ID of each CD (which I’d had the good sense to store, even though CDDB is a bit rubbish), I was able to get track and artist information from CDDB and save it as an XML file alongside each of the XML files derived from my pre-existing metadata (see parts 1 and 2).
  2. I combined my metadata and CDDB’s metadata into one file for each CD. I assumed that if my names agreed with CDDB’s names, they were probably correct, so I just needed to look at the differences.
  3. I wrote a script to look through the merged files and write out an XML file containing just the differences (e.g. if there was only a difference in one track name for a particular album, the ‘error.xml’ file would just contain that one track). Brilliantly, if the script found that ‘error.xml’ already existed for that album and had been edited by me, it would use my edits to resolve the differences. Thus my data correction problem became ‘find all error.xml files and make a decision about what name to use’.
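
Steps 2 and 3 could be sketched roughly as follows. This is a minimal Python illustration, not the script I actually used: the `<disc>`, `<track>` and `<title>` element names, the `number` attribute and the `merge_titles` function are all assumptions made up for the example, and the hand-edited ‘error.xml’ is represented simply as a dictionary of track number to chosen title.

```python
import xml.etree.ElementTree as ET


def track_titles(xml_text):
    """Return {track number: title} for a (hypothetical) <disc> document."""
    root = ET.fromstring(xml_text)
    return {t.get("number"): (t.findtext("title") or "")
            for t in root.iter("track")}


def merge_titles(mine_xml, cddb_xml, resolved=None):
    """Compare two metadata sources for one CD.

    Titles on which both sources agree are accepted as they are. Where they
    disagree, a decision from `resolved` (standing in for an edited
    error.xml) is used if there is one; otherwise the track is written to
    an <errors> element for a human to look at.
    """
    mine = track_titles(mine_xml)
    cddb = track_titles(cddb_xml)
    resolved = resolved or {}

    final = {}
    errors = ET.Element("errors")
    for number in sorted(set(mine) | set(cddb), key=int):
        a, b = mine.get(number), cddb.get(number)
        if a == b:
            final[number] = a                  # both sources agree
        elif number in resolved:
            final[number] = resolved[number]   # my earlier decision wins
        else:
            track = ET.SubElement(errors, "track", number=number)
            ET.SubElement(track, "mine").text = a
            ET.SubElement(track, "cddb").text = b
    return final, errors


if __name__ == "__main__":
    MINE = ("<disc><track number='1'><title>Intro</title></track>"
            "<track number='2'><title>Frances the mute</title></track></disc>")
    CDDB = ("<disc><track number='1'><title>Intro</title></track>"
            "<track number='2'><title>Frances The Mute</title></track></disc>")
    final, errors = merge_titles(MINE, CDDB)
    print(final)
    print(ET.tostring(errors, encoding="unicode"))
```

Running it a second time with a `resolved` dictionary for the disputed track leaves the `<errors>` element empty, which is the ‘nothing left to decide for this album’ state.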

For many CDs, the differences were quite trivial to resolve (a mis-spelling or different capitalization of a couple of track titles, for example). Therefore, I was able to do quite a lot of conflict resolution on a laptop on the train to and from work without having access to the CDs. For more complicated problems, the work had to be done next to my stack of CDs, but it was still pretty straightforward.

Here is where the story should have ended – I ended up with a set of XML files, one for each CD, with track titles, artist names and track boundary information. I was able to fix up the (relatively small) number of CDs with hidden first tracks, or with two albums on one disc / one album over several discs, and deal with the annoying track cataloguing of the Mars Volta’s ‘Frances The Mute’ (if I’m feeling enthusiastic, one day I might restore They Might Be Giants’ ‘Fingertips’ to the originally planned collection of many short tracks instead of one big one).

Unfortunately, when I discussed this ridiculous project with a friend of mine, he suggested that the MusicBrainz database might be worth investigating, so I investigated it. MusicBrainz is an open, user-contributed database of CDs, track titles and artist names. It is much better than CDDB because a) it isn’t owned by some large corporation, b) it has a much more robust way of identifying CDs than the terrible CDDB ID scheme I’ve already mentioned in part 2 and c) it assigns globally unique identifiers (GUIDs) to each artist, album and track in the database.

Sadly, the prospect of having GUIDs for everything was too enticing, so I lurched back into ad-hoc scripting mode, writing scripts to match up all of my discs with GUIDs in the MusicBrainz database. Fortunately, they have an on-line service which will attempt to map CDDB IDs to MusicBrainz entries (with the proviso that this is not guaranteed to work because of the rubbishness of CDDB). So I was able to get most of them, but there were about 100 that I had to look up by hand. Then, having got the GUIDs for each disc, I needed to download the MusicBrainz data for that disc and merge it with mine (so my XML files acquired MusicBrainz GUIDs in addition to the titles, artist names and timings) – I could have used XSLT for this, but I decided it was simpler to do it in code using DOM4j (infinitely better than plain DOM). I realised that, good though MusicBrainz was, I owned some CDs which weren’t present in the database and so naturally I felt compelled to add them.
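
To give a flavour of the merge step, here is a small Python sketch (my real code was Java with DOM4j, so treat this as an illustration only). It assumes the MusicBrainz data for a disc has already been downloaded and boiled down to a release GUID plus a track number to GUID mapping; the `mbid` attribute, the element names and the `add_guids` function are all invented for the example, and the GUIDs shown are placeholders, not real MusicBrainz identifiers.

```python
import xml.etree.ElementTree as ET


def add_guids(disc_xml, release_id, track_ids):
    """Attach GUIDs to a (hypothetical) <disc> document.

    `track_ids` maps track number (as a string) to a GUID. Returns the
    updated XML as text, plus the numbers of any tracks for which no GUID
    was supplied, so that they can be looked up by hand.
    """
    root = ET.fromstring(disc_xml)
    root.set("mbid", release_id)

    missing = []
    for track in root.iter("track"):
        number = track.get("number")
        guid = track_ids.get(number)
        if guid is None:
            missing.append(number)    # leave the track untouched
        else:
            track.set("mbid", guid)   # existing title and timings are kept
    return ET.tostring(root, encoding="unicode"), missing


if __name__ == "__main__":
    DISC = ("<disc><track number='1'><title>Intro</title></track>"
            "<track number='2'><title>Outro</title></track></disc>")
    # Placeholder GUIDs, purely for illustration.
    merged, missing = add_guids(
        DISC,
        "00000000-0000-0000-0000-000000000000",
        {"1": "00000000-0000-0000-0000-000000000001"},
    )
    print(merged)
    print("still to look up:", missing)
```

The point of returning the `missing` list is the same as with ‘error.xml’ earlier: anything the script can’t settle on its own ends up in a short list for a human to deal with.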

On the whole, the MusicBrainz excursion was worth it, because it simplified the next phase of the project, which was building a robust ripping solution that I could use in the future (and finally converting the WAV files I’d already ripped). In the style of the great four-book Hitch-Hiker’s Guide To The Galaxy trilogy, I will outline that in part 4 of this 3-part series and, for the geeks, I might even include some source code for the procedure. That will, at least, force me to comment it and actually clean up the error-handling (for those who want to know – it’s written in Java on Linux using Maven for dependency management – it might be possible to make it work on a Mac but I doubt you’ll get it to work under Windows because it requires the cdparanoia library and various tools for slicing up WAVs.)


One comment

  1. Why didn’t you rip to Flac? It’s lossless compression & supports tagging. There must be Linux utils for doing that kind of thing. You could then convert from flac –> whatever & use the same tags. You can still convert to Flac & save some space. I did an experiment a while ago, ripping to wav, converting to flac & then from flac –> wav. The original & final wav files were identical.

    If you weren’t such a cock you’d use Windows, at least for tagging. Tag&Rename is awesome & can get album details from a multitude of sources & also the album covers. It can also automatically rename files. But why make life easy for yourself? 😉
