The Tragedy
Every data hoarder has a nightmare. No, not hard drive corruption (though that’s up there), but something equally chilling: scrambled file names. Imagine your meticulously curated music collection becoming a cryptic mess of unidentifiable tracks!
This is the horror I found myself in when revisiting my old music collection. Somehow, the titles had been replaced with meaningless index numbers. With hundreds of files like this, playing and identifying them all by hand would take an unreasonable amount of time. The only real choice is to reach for the lazy programmer’s playbook.
Let’s try to automate this.
The Lost Title Format
To begin, it’s important to identify a target file format that would satisfy my data collection needs. An ideal file name format for this use case would include the title of the track as well as the artist name.
Track title - Artist.extension
More song info could be collected, but with the title and artist name the files can at least be identified.
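As a minimal sketch of this target format in Python, the helper below builds a name from a title, an artist, and the original file's extension. The name `build_file_name` and the set of characters it replaces are assumptions made for illustration, not part of the original scripts.

```python
from pathlib import Path

# Characters that are awkward or invalid in file names on common filesystems.
# This set is an assumption; adjust it for your platform.
_UNSAFE = '/\\:*?"<>|'


def build_file_name(title: str, artist: str, original: str) -> str:
    """Return 'Title - Artist.ext', keeping the extension of `original`."""

    def clean(text: str) -> str:
        # Replace unsafe characters with underscores and trim whitespace.
        return "".join("_" if ch in _UNSAFE else ch for ch in text).strip()

    return f"{clean(title)} - {clean(artist)}{Path(original).suffix}"
```

For example, `build_file_name("Song", "Band", "-1131991569406517012.mp3")` gives `Song - Band.mp3`.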
There are a few good ways to collect track info:
- It’s possible some files have their track info embedded in metadata. In this case, we can inspect the file for this metadata and determine if the track name and artist name are preserved.
- Identify the songs automatically from their audio data. This could be done with a song detection API.
Extracting Metadata with ExifTool
By inspecting file metadata (with the tool exiftool), we can determine if track details are still present on any music files by parsing for Title and Artist details embedded within them.
An example of inspecting a file:
Here we can see an example of a file with Title and Artist details preserved. The file names for these are relatively easy to resolve, and we can set up a simple bash script to correct them.
Parse metadata and request a rename
First let’s automate retrieving our desired tags. We’ll create a bash script leveraging exiftool to automate this.
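The same tag retrieval can be sketched in Python rather than bash. This is an illustration under stated assumptions, not the article's script: it relies on exiftool's `-j` option for JSON output and on the `-Title` and `-Artist` tag selectors, and the function names are hypothetical. Only `read_tags` needs exiftool installed; `parse_tags` works on any JSON text of the same shape.

```python
import json
import subprocess
from typing import List, Optional, Tuple


def exiftool_command(path: str) -> List[str]:
    # -j requests JSON output; -Title and -Artist limit output to those tags.
    return ["exiftool", "-j", "-Title", "-Artist", path]


def parse_tags(json_text: str) -> Optional[Tuple[str, str]]:
    """Return (title, artist) from exiftool JSON, or None if either is missing."""
    records = json.loads(json_text)
    if not records:
        return None
    record = records[0]
    title, artist = record.get("Title"), record.get("Artist")
    if not title or not artist:
        return None
    # Values may come back as numbers (e.g. a title like 1999), so normalize.
    return str(title), str(artist)


def read_tags(path: str) -> Optional[Tuple[str, str]]:
    """Run exiftool on one file and parse its output (requires exiftool)."""
    result = subprocess.run(
        exiftool_command(path), capture_output=True, text=True, check=True
    )
    return parse_tags(result.stdout)
```

Returning `None` when either tag is missing makes it easy to skip such files and leave them for audio-based identification.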
To ensure every name change is verified by the user directly, we can ask whether they would like to rename the file based on the given data. The file extension is preserved using the bash substring removal pattern.
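A rough Python equivalent of the prompt-then-rename step might look like the sketch below. It uses `os.path.splitext` where the bash script uses substring removal. The function name, the `[y/N]` prompt wording, and the injectable `ask` and `rename` parameters are assumptions added so the logic can be exercised without touching real files.

```python
import os
from typing import Callable


def confirm_and_rename(
    path: str,
    title: str,
    artist: str,
    ask: Callable[[str], str] = input,
    rename: Callable[[str, str], None] = os.rename,
) -> bool:
    """Ask before renaming `path` to 'Title - Artist.ext'; return True if renamed."""
    directory, old_name = os.path.split(path)
    extension = os.path.splitext(old_name)[1]  # keeps ".mp3", ".flac", ...
    new_path = os.path.join(directory, f"{title} - {artist}{extension}")

    answer = ask(f"Rename '{old_name}' to '{os.path.basename(new_path)}'? [y/N] ")
    if answer.strip().lower() != "y":
        return False  # anything other than "y" leaves the file untouched

    rename(path, new_path)
    return True
```

Defaulting to "no" means an accidental Enter key never renames a file.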
Edge case: Handle -filenames
Our script works, almost. One edge case currently breaks it, however: files that begin with a -, such as -1131991569406517012.mp3, cause an interesting error:
The error appears to imply exiftool is parsing the filename as an additional command line option, because a leading - signifies an option. To eliminate this issue, exiftool must be told where command line options end and where file arguments begin. This can be accomplished by including -- within the command call.
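A complementary technique, different from the -- separator used here, is to rewrite the path itself so it no longer starts with a dash. Prefixing a relative path with ./ works for any command line tool, regardless of how it parses options. The helper name below is hypothetical.

```python
import os


def protect_leading_dash(path: str) -> str:
    """Prefix './' to paths starting with '-' so tools don't read them as options."""
    if path.startswith("-"):
        # "./-file.mp3" names the same file but cannot be mistaken for an option.
        return os.path.join(os.curdir, path)
    return path
```

Paths that do not start with a dash are returned unchanged, so the helper can be applied to every argument.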
Final Script
Before we finalize this script, let’s add the functionality to handle all input CLI arguments as filenames. This will allow us to do simple calls such as ./rename_using_tags.sh *.mp3 within a target directory to completely repair corrupt file names!
Dry run capabilities are also included, allowing test runs to substitute mv with echo instead. Dry run is activated by passing -d before the filenames, for example: ./rename_using_tags.sh -d *.mp3. This allows us to verify that the script will function as expected before running any irreversible commands.
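For comparison, the argument handling and dry run switch could be sketched in Python with argparse. This is not the bash script itself: the long option name `--dry-run` and both function names are assumptions, and only the `-d` flag comes from the description above.

```python
import argparse
import os
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Rename music files from their tags.")
    parser.add_argument(
        "-d", "--dry-run", action="store_true",
        help="print the renames instead of performing them",
    )
    parser.add_argument("files", nargs="+", help="files to rename")
    return parser.parse_args(argv)


def apply_rename(src: str, dst: str, dry_run: bool) -> str:
    """Rename src to dst, or only describe the action when dry_run is set."""
    action = f"mv {src} {dst}"
    if not dry_run:
        os.rename(src, dst)
    return action
```

With argparse, file names that begin with - generally need to come after a -- separator on the command line, which mirrors the edge case handled above.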
Identifying tracks automatically with the ShazamIO API
What about the songs without any useful metadata to be found? Only inspecting the audio itself could help identify the track. For this case, an identification API such as shazamio can be leveraged.
Test basic usage of ShazamIO
First let’s implement a basic usage of ShazamIO (referencing their example script). With this, we can experiment and determine the functionality required to identify and retrieve our track data. ShazamIO is an asynchronous API, and thus we’ll need to leverage coroutines to utilize it.
From this quick example script, we can parse the returned data and determine the useful fields for our utility.
Output:
Great, title is an available field, and subtitle holds the artist of the song. We can write parsing logic to retrieve this info and rename the song files.
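A minimal sketch of that parsing step follows. It assumes the two fields sit under a top-level "track" key in the returned dictionary, and that a response without a match simply lacks that key. Both are assumptions about the response layout, and the function name is hypothetical.

```python
from typing import Optional, Tuple


def extract_title_artist(response: dict) -> Optional[Tuple[str, str]]:
    """Return (title, artist) from a recognition response, or None if absent."""
    # Assumption: the match details live under the "track" key.
    track = response.get("track") or {}
    title = track.get("title")
    artist = track.get("subtitle")  # "subtitle" carries the artist name
    if not title or not artist:
        return None
    return title, artist
```

Returning `None` instead of raising keeps this helper easy to test on hand-written dictionaries.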
Encapsulate within an identify function
Let’s create an identify function to retrieve the Shazam data, parse it, and return the desired file name. Making this function asynchronous will help us later.
For error scenarios, we’ll use Python exceptions. These can be caught by the calling function, which can determine how to proceed with the song file.
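One possible shape for such a function is sketched below. To keep it runnable without the shazamio package or network access, the recognition call is passed in as a coroutine function named `recognize`. In real use this would wrap the ShazamIO client, which is an assumption here. `IdentifyError` is a hypothetical exception name.

```python
import asyncio
import os
from typing import Awaitable, Callable


class IdentifyError(Exception):
    """Raised when a file cannot be matched to a title and artist."""


async def identify(path: str, recognize: Callable[[str], Awaitable[dict]]) -> str:
    """Return the new file name 'Title - Artist.ext' for the song at `path`."""
    response = await recognize(path)

    # Assumption: title and artist (as "subtitle") sit under the "track" key.
    track = response.get("track") or {}
    title, artist = track.get("title"), track.get("subtitle")
    if not title or not artist:
        raise IdentifyError(f"no match found for {path}")

    extension = os.path.splitext(path)[1]
    return f"{title} - {artist}{extension}"
```

The caller can catch `IdentifyError` and decide whether to skip the file, log it, or retry.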
Create a process to handle the main execution logic
We’ll create a main process to handle the control flow of our tool. Because the ShazamIO API takes some time to identify each song, the plan is to run several workers concurrently so the many files given to our application are identified and renamed quickly.
Here, multiple identifier workers will be created. These workers concurrently identify the songs and push each result onto our async song queue.
For our limited number of workers to traverse the long list of input files efficiently, a stride-based setup can be leveraged. Each worker is given a start index and a stride length equal to the total number of workers. This lets the workers stride through the array and process multiple songs without stepping on each other.
Here is an example of a stride-based traversal to illustrate this method:
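The stride idea can be shown with a few lines of Python. The helper name `stride_indices` is hypothetical: worker `w` out of `n` workers starts at index `w` and steps by `n`.

```python
from typing import List


def stride_indices(worker_id: int, num_workers: int, num_files: int) -> List[int]:
    """Indices handled by one worker: start at worker_id, step by num_workers."""
    return list(range(worker_id, num_files, num_workers))


# Three workers sharing eight files: no index is visited twice or skipped.
for worker in range(3):
    print(worker, stride_indices(worker, 3, 8))
# 0 [0, 3, 6]
# 1 [1, 4, 7]
# 2 [2, 5]
```

Because every index belongs to exactly one worker, no locking or shared counter is needed to divide the work.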
There will be just one renamer worker, however. This single worker takes files from the async queue and renames them. This way, we keep the renaming part simple and sequential.
Create coroutines to handle identifying songs and renaming files
Here we implement the identifier and renamer coroutines. As mentioned before, the identifier_coroutine will utilize a stride-based implementation.
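The overall arrangement (several striding identifier workers feeding one renamer through a queue) could be sketched as follows. This is an illustration, not the original implementation: the function names, the `None` sentinel used to stop the renamer, and the injected `identify` and `rename` callables are all assumptions.

```python
import asyncio
from typing import Awaitable, Callable, List


async def identifier(worker_id: int, num_workers: int, files: List[str],
                     identify: Callable[[str], Awaitable[str]],
                     queue: asyncio.Queue) -> None:
    # Stride through the file list: worker_id, worker_id + num_workers, ...
    for index in range(worker_id, len(files), num_workers):
        path = files[index]
        try:
            new_name = await identify(path)
        except Exception as error:  # a failed lookup only skips that file
            print(f"skipping {path}: {error}")
            continue
        await queue.put((path, new_name))


async def renamer(queue: asyncio.Queue, rename: Callable[[str, str], None]) -> None:
    # A single consumer keeps the renames sequential.
    while True:
        item = await queue.get()
        if item is None:  # sentinel: all identifiers are done
            break
        rename(*item)


async def run_pipeline(files: List[str], identify, rename, num_workers: int = 4) -> None:
    queue: asyncio.Queue = asyncio.Queue()
    renamer_task = asyncio.create_task(renamer(queue, rename))
    await asyncio.gather(
        *(identifier(i, num_workers, files, identify, queue) for i in range(num_workers))
    )
    await queue.put(None)  # tell the renamer to stop
    await renamer_task
```

In the real tool, `rename` would be the place to prompt the user, which is one reason to keep it in a single sequential worker.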
The throttling problem
The Shazam API unfortunately caps how many requests you can make in a given period. If this limit is exceeded, additional requests time out for a while. To avoid hitting the limit again, our identify function can wait a random amount of time before retrying. Because each identifier worker waits a different amount of time, the retries are spread out instead of all arriving at once when the timeout ends.
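A minimal sketch of the random wait, assuming a throttled request surfaces as an exception. The helper name, the attempt count, and the 1 to 10 second window are illustrative choices rather than values from the original tool. The `sleep` parameter exists only so the wait can be replaced in tests.

```python
import asyncio
import random
from typing import Awaitable, Callable


async def with_random_backoff(
    call: Callable[[], Awaitable],
    attempts: int = 5,
    min_wait: float = 1.0,
    max_wait: float = 10.0,
    sleep: Callable[[float], Awaitable] = asyncio.sleep,
):
    """Await call(); on failure, wait a random time and try again."""
    for attempt in range(attempts):
        try:
            return await call()
        except Exception:  # in practice, catch only the timeout error you observe
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller handle it
            # Each worker draws its own delay, so retries do not line up.
            await sleep(random.uniform(min_wait, max_wait))
```

A worker could then call something like `await with_random_backoff(lambda: identify(path))`, with `identify` standing in for whatever coroutine performs the lookup.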
Final Result
Some songs are still misidentified by ShazamIO, but for most cases this combo of tools fixed my music library.
Additional Thoughts and Improvements
- These tools prompt the user before every file name change, but this could quickly become tedious. Instead, implementing a -y option to skip prompts could be effective.
- For review, adding the option to generate a log of file name changes could also help if some name change occurred that was not intended.
- Using an API like ShazamIO that connects to a commercially owned service is a weak point, and creates issues such as the throttling problem above. Using a different API, or replacing it with a deep learning model that can run on local hardware or private servers, would be preferable. At the time of writing, I could not find a well-developed and maintained song identification model, but something like this should be feasible with current deep learning technology.