Those Nasty Duplicate Songs

If you are anything like me, you have probably spent the last decade or so building and maintaining a digital music library. I have spent many years searching for, finding, buying, and ripping actual music CDs that were really hard to find. I scoured the internet (in my teen years) looking for rare albums that very few people have put their hands on. I have spent tons of money on iTunes buying music. And of course, I have also spent time buying and ripping many thousands of non-rare CDs.

I have purchased digital versions of albums, only to later get them on physical CD and rip them again. Naturally, my iTunes Library has grown gigantically, not only in file size (which is very big, because most of my collection is in ALAC), but also in the number of songs I actually have. All of my songs are neatly organised, with their shiny metadata and album artwork.

At the time of writing, my iTunes Library has 72,197 songs! And with so many ripped albums, digital purchases, and even sessions of music sharing with my friends in my teen years, I have accumulated tons of duplicate songs.

This is a big problem, and the time to clean my library has finally come.

Existing solutions and why they don’t work

If you also have the problem of having thousands of duplicate songs, chances are you have tried to use some existing tools to do this job.

I am an Apple user, and I use iTunes (I don’t have as many complaints about this software as many people do in this day and age, but I digress). iTunes has a feature that lets you view all your “duplicate” songs in the standard song view. To do this, go to File > Library > Show Duplicate Items. You will get something similar to this:

*Duplicate Song List View*

If you take a look at this list, you will see that everything is simply wrong. In fact, this “duplicate” songs list is so bad, I find it absurd that Apple actually shipped such a poorly implemented feature in iTunes. What’s worse is that it has been around for years, and it’s one of the least helpful features I have ever seen in a piece of software.

I want to use the song “Nemo” by Nightwish as an example of how badly implemented this feature is. As you can see, I have three copies of that song in my library, but only two of the ones you see there are true duplicates: the one in the album “Once” and the one in the single of the same name, “Nemo”. “Nemo (Live)” is, surprise surprise, a Live version of this song. They are technically the same song, but performed differently and in different contexts. In other words, there’s no reason for iTunes to classify it as a duplicate. Most people I know enjoy listening to both Live and Studio versions of songs, so they will keep both.

The problem with iTunes and much of the software that attempts to solve this problem is that they try to deduce the equality of songs based on metadata only. If two songs look “similar” by iTunes standards, then it will mark them as duplicates. Song titles are the biggest culprits, but it will also disregard other metadata that tells different versions of a song apart. For example, I currently have versions of Nemo with the Genre metadata “Symphonic Metal”, but I could add the orchestral version of the song and change the genre to “Orchestral” or anything other than Symphonic Metal, and iTunes would still consider it the same song. This is, presumably, a reason why Apple Music’s matching algorithm is a disgrace, whereas iTunes Match’s works better.

This is the general problem with most software that attempts to find duplicates: it can’t listen to the songs to really determine if they are the same thing or not.

Audio Fingerprinting to the rescue!

If you have built your library over a long period of time and tried to keep it organised, chances are you have used some sort of music database to help you keep your library clean and organised. MusicBrainz is popular, but there are others.

When you rip a CD with software that integrates with MusicBrainz, it tries to use all kinds of available information to populate the metadata of the songs in the album. The thing about Music CDs is that they don’t contain the metadata inside them, just raw audio data, so the software can’t populate that info from the disc itself.

Maybe you have downloaded an album online and noticed that the songs have virtually no metadata at all. Those criminal “Track 1”, “Track 2”, “Track X” are all you can see. But you can throw all the songs at MusicBrainz Picard and make it get all the metadata for the album, without having anything else to help it.

How does it do this?

This is achieved through a technology called Audio Fingerprinting. As its name implies, Audio Fingerprinting is a technology that can identify a song from the audio itself. Like real life fingerprints in humans, audio fingerprints are meant to be unique: ideally, no other song or interpretation of it will have the same fingerprint. For the mathematically inclined, Audio Fingerprinting is an application of signal analysis. You can tell when two songs are the same based on raw data, with little to no metadata available, and with a very small chance of error.

MusicBrainz itself is just a music database. The actual song recognition magic is done through a service called AcoustID. This Audio Fingerprinting service is nothing more than a simple audio fingerprints database. I’m being redundant because AcoustID doesn’t really do anything other than keep track of existing audio fingerprints. What software like MusicBrainz Picard does is calculate the audio fingerprint of a song, send it off to AcoustID, and wait for it to return the song information it matched. MusicBrainz and AcoustID work really closely together, as the information for a fingerprint in the database is usually tied to a MusicBrainz release.

Cool! So how do I use this awesome technology?

AcoustID created a tool called Chromaprint. This is a simple C library that can generate the audio fingerprints for you. It also includes fpcalc, a command line tool for calculating fingerprints.
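To give a rough idea of what working with fpcalc looks like, here is a minimal sketch. It assumes that fpcalc prints its results as KEY=VALUE lines such as DURATION=... and FINGERPRINT=..., which is what the versions I have used do. The helper name fpcalc_field is my own invention and not part of Chromaprint.

```shell
#!/bin/bash
# Hypothetical helper: extract one KEY=VALUE field from fpcalc-style
# output read on stdin. The key is assumed to be a plain word such as
# DURATION or FINGERPRINT.
fpcalc_field() {
	local key="$1"
	# Print only the part after "KEY=" for the first matching line.
	sed -n "s/^${key}=//p" | head -n 1
}

# Example usage (requires fpcalc and a real audio file):
#   fpcalc "song.m4a" | fpcalc_field FINGERPRINT
```

The function only reads standard input, so you can try it on some saved fpcalc output before pointing it at your real library.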

Bro, I’m not a developer.

I’m still working on a good script to help me solve this issue, but I have written this quick mockup you can use on your Mac (and probably on Linux too – Sorry Windows users). I cannot help you learn Bash, but this may be useful for you if you are willing to put in the effort. Fellow developers, excuse this ugly code:

#!/bin/bash

# Directory and file where the "hash:path" results are collected.
mkdir -p ~/song_dupes
touch ~/song_dupes/nightwish

# Walk every regular file under the artist folder.
find /Volumes/iTunes/Music/Nightwish -type f | while IFS= read -r FILE; do
	# fpcalc prints KEY=VALUE lines; keep only the value of the HASH line.
	calculated_hash=$(fpcalc "$FILE" -hash | grep "^HASH=" | cut -d'=' -f2)
	if [[ -z "$calculated_hash" ]]; then # No hash: fpcalc failed or this is not an audio file.
		echo "Unable to calculate fingerprint of $FILE"
	else
		echo "$calculated_hash:$FILE" >> ~/song_dupes/nightwish
		echo "Calculated HASH of $FILE: $calculated_hash"
	fi
done

You may not be a developer, but you will need to use the Terminal if you want to use this script. The good news is that I’m willing to help you. You will need to do the following (and I will guide you through this, if you are using a Mac):

  1. Install Chromaprint.
  2. Create a script file with the code above.
  3. Give it the right permissions to be able to execute.

The easiest way to install Chromaprint is through Homebrew, so install that first. Click that link and follow the instructions there. They are easy to follow. Come back here when you are done.

To install Chromaprint, simply open your Terminal app and type:

brew install chromaprint

If your internet is slow, it can take a while.

Now save the script above. Copy it, open your Terminal, and type “cd ~/Desktop”. Then type “vim” and press ENTER, press the letter “i” (you will see the bottom left part of the Terminal now says “INSERT”), and press Cmd + V to paste the script there (or right click and paste, if you prefer). You may need to modify some things, like changing /Volumes/iTunes/Music/Nightwish to the actual path of the folder you want to analyse. You can also change the names of the files it generates, but those are just cosmetic details and they won’t affect the flow of the script in any way. The script will create a directory called song_dupes inside your Home directory, and the results will be in a file called nightwish. Nightwish is the band I’m analysing for duplicates, so that name will stay there. When you are done, press the ESC key, type “:wq NAME_OF_THE_SCRIPT” (vim needs a file name to save to), and press ENTER. The script will be created on your Desktop.

Now, in your Terminal, type “chmod 755 ~/Desktop/NAME_OF_THE_SCRIPT” and press ENTER.

You can now drag the script to your Terminal window and watch it execute. You should see output similar to this as it executes:

*Watch it execute*

After the script is done executing, type “sort ~/song_dupes/nightwish”. This will print the list sorted by audio fingerprint hash value:

*Dem Hashes!*

This is ugly, but it can help you get the job done. Songs that have the same hash are likely to be duplicates, but you should check to make sure. This very barebones script has plenty of flaws, and the right thing to do would be to send the fingerprints to AcoustID to see if they are really the same song. For instance, if you search for the song Eva near the end of the list, you will see that I have many different versions of it, so the script has more or less the same problem as iTunes (although it is still more accurate than iTunes). I checked those songs with MusicBrainz Picard to see what would happen, and it correctly identified four different versions of the song, whereas this script is a little misleading and shows those four versions with the same hash.
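If scanning the sorted list by eye gets tiring, something like the following sketch can print only the hashes that appear more than once. It assumes the results file uses the hash:path format that the script above writes. The function name list_duplicate_groups is made up for this example, and the order in which the groups are printed is not guaranteed.

```shell
#!/bin/bash
# Hypothetical helper: print every hash that appears more than once in
# a "hash:path" file, followed by the indented paths that share it.
list_duplicate_groups() {
	sort "$1" | awk '
	{
		i = index($0, ":")            # split on the first colon only
		if (i == 0) next              # skip lines without a hash
		hash = substr($0, 1, i - 1)
		path = substr($0, i + 1)
		count[hash]++
		paths[hash] = paths[hash] "\n  " path
	}
	END {
		for (h in count)
			if (count[h] > 1)
				print h paths[h]
	}'
}

# Example usage:
#   list_duplicate_groups ~/song_dupes/nightwish
```

Keep in mind that this only groups files by hash. It inherits every limitation described above, so treat each group as a list of candidates to listen to, not as files that are safe to delete.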

Once I finish this script so that it uses the AcoustID database, it will do a much better job of identifying real duplicates.
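As a starting point, this is roughly how a lookup against the AcoustID web service could be put together. Treat it as a sketch under assumptions and not as the finished script: it assumes you have registered an application with AcoustID to get an API key (YOUR_API_KEY is a placeholder), that fpcalc prints DURATION=... and FINGERPRINT=... lines, and that the fingerprint is short enough to fit in a URL. The function name acoustid_lookup_url is my own.

```shell
#!/bin/bash
# Hypothetical helper: build an AcoustID lookup URL from an API key,
# a duration in seconds, and a Chromaprint fingerprint string.
acoustid_lookup_url() {
	local client="$1" duration="$2" fingerprint="$3"
	printf 'https://api.acoustid.org/v2/lookup?client=%s&meta=recordings&duration=%s&fingerprint=%s\n' \
		"$client" "$duration" "$fingerprint"
}

# Example usage (needs network access, fpcalc, and a real API key):
#   out=$(fpcalc "song.m4a")
#   duration=$(printf '%s\n' "$out" | sed -n 's/^DURATION=//p')
#   fingerprint=$(printf '%s\n' "$out" | sed -n 's/^FINGERPRINT=//p')
#   curl -s "$(acoustid_lookup_url "YOUR_API_KEY" "$duration" "$fingerprint")"
```

The service answers with JSON describing the recordings it matched. Two files that resolve to the same recording are much stronger duplicate candidates than two files that merely share a hash.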
