People Who Like This Also Like …

Distance Metrics for Fun and Profit

A while ago a friend of mine asked me how I would go about building a ‘People Who Like This Also Like …’ feature for a music startup he was working at. For each band or musician, he wanted to display a list of other artists that people might also be interested in.

At the time, I think I arrogantly responded with something like “Thats easy! I can think of like a dozen ways of calculating this!”. Which of course was profoundly unhelpful and probably slightly infuriating. Once he calmed down, I sketched out how I would calculate the distance between any two artists – and use that distance as a ranking function to build this feature.

Since then he has been encouraging me to write a blog post about this, and after a totally unreasonable delay I finally got around to finishing it up. So here is my step by step guide for the non data scientist, using Python with Pandas and SciPyto compute the distances, and D3.js for building gratuitously interactive visualizations.

Finding Similar Music using Matrix Factorization

In a previous post I wrote about how to build a ‘People Who Like This Also Like …’ feature for displaying lists of similar musicians. My goal was to show how simple Information Retrieval techniques can do a good job calculating lists of related artists. For instance, using BM25 distance on The Beatles shows the most similar artists being John Lennon and Paul McCartney.

One interesting technique I didn’t cover was using Matrix Factorization methods to reduce the dimensionality of the data before calculating the related artists. This kind of analysis can generate matches that are impossible to find with the techniques in my original post.

This post is a step by step guide on how to calculate related artists using a couple of different matrix factorization algorithms. The code is written in Python using Pandas and SciPy to do the calculations and D3.js to interactively visualize the results.

As part of writing this post, I also open sourced a high performance python version of the Implicit Alternating Least Squares matrix factorization algorithm. Most of the code here can be found in the examples directory of that project.