Welcome back! So far, you have learned how to represent music tracks and user preferences as vectors, and how to use cosine similarity to recommend tracks to users. In this lesson, we will take a new step: grouping similar tracks together using a technique called clustering.
Clustering helps us organize our music library by finding groups of tracks that are similar to each other. This is useful for many reasons. For example, you can use clusters to create playlists, suggest new genres to users, or simply explore your music collection in a more structured way. In this lesson, you will learn how to use the KMeans algorithm to cluster tracks based on their embeddings.
Before we start clustering, let’s quickly remind ourselves how we get the data and embeddings for our tracks. You have already seen how to load track data and generate embeddings in previous lessons. Here is a short code block that shows the basic setup:
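The exact setup comes from the earlier lessons; here is a minimal sketch of it, where the file path, column names, and the use of raw feature columns as embeddings are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Load the track catalog (path and column names are illustrative).
tracks_df = pd.read_csv("data/tracks.csv")

# One unique ID per track.
track_ids = tracks_df["track_id"].tolist()

# Stack the per-track embedding vectors into a matrix, one row per track.
# Here we pretend the embeddings are stored as numeric columns; in the
# earlier lessons they came from a dedicated embedding step.
feature_cols = ["tempo", "energy", "valence"]
track_embeddings_matrix = tracks_df[feature_cols].to_numpy()
```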
This code gives us two important things:
- `track_ids` and `track_embeddings_matrix`: These are the unique IDs for each track and their corresponding embedding vectors.
- `tracks_df`: This is a DataFrame containing all the details about each track.
We will use these as the starting point for clustering.
To confirm everything is wired up correctly, we run some isolated tests on our clustering system (you'll see these in the practice section, inside `src/tests/test_clustering.py`). The exact diagnostic output depends on your dataset, but the checks boil down to sanity assertions like these:
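```python
# Illustrative sanity checks, not the literal contents of the test file:
assert track_embeddings_matrix.shape[0] == len(track_ids)
assert len(track_ids) == len(tracks_df)

print(f"Loaded {len(track_ids)} tracks")
print(f"Embedding matrix shape: {track_embeddings_matrix.shape}")
```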
Clustering is a type of unsupervised learning, where we don’t start with any labels or categories. Instead, we ask the algorithm to find natural groupings in the data — that is, clusters.
Think of a cluster as a “cloud” of similar items in a multidimensional space. Each track is represented as a point in this space (based on its embedding), and clustering algorithms try to group nearby points together. The assumption is: if points are close, they’re likely to be similar in meaningful ways (e.g., mood, tempo, instrumentation).
Unlike classification, where we already know categories (like genre) and assign items to them, clustering figures out the categories for us. It answers: "What kinds of groups naturally exist in my data?"
For example, without knowing any genres up front, clustering might still discover groups like “slow instrumental tracks,” “fast electronic tracks,” or “melancholic acoustic songs” — purely based on numerical similarities.
This makes clustering especially useful for:
- Exploring datasets you don’t fully understand yet
- Discovering unexpected patterns
- Creating structure from messy or unlabeled data
KMeans is one of the simplest and most widely used clustering methods, which is why we start with it here.
KMeans is a popular clustering algorithm. It works by dividing your data into a set number of groups, called clusters. Each cluster contains tracks that are similar to each other based on their embeddings. Two tracks can be placed in the same cluster even if their genres differ, as long as their embeddings are close in vector space.
Here’s how KMeans works in simple terms:
- You choose how many clusters you want (for example, 3).
- The algorithm tries to group the tracks so that tracks in the same cluster are as similar as possible.
- Each track is assigned a cluster label (like 0, 1, or 2).
In the context of music tracks, this means that tracks with similar features (like genre, tempo, or mood) will end up in the same cluster. This makes it easier to find and recommend similar music.
Under the hood, KMeans follows a simple but powerful iterative process to find good cluster groupings:
- Initialization: It randomly picks `k` points from the dataset to act as the first "centroids" (these are like the centers of each cluster).
- Assignment Step: Every data point (in our case, every track embedding) is assigned to the nearest centroid, using Euclidean distance (straight-line distance in vector space).
- Update Step: Once all points are assigned, each centroid is moved to the center of the cluster — that is, it’s recalculated as the average of all points in that group.
- Repeat: Steps 2 and 3 are repeated until the centroids stop moving (or only move very little). This means the algorithm has converged to a stable solution.
In other words, KMeans doesn’t magically “know” what a cluster is. Instead, it optimizes a mathematical goal: minimizing the sum of squared distances within each cluster. It’s entirely based on numerical similarity, which is why good embeddings are so important: bad embeddings mean bad clusters, no matter how good your KMeans implementation is.
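To make the loop concrete, here is a from-scratch sketch in plain NumPy. This is for intuition only; in practice we rely on scikit-learn's far more robust implementation:

```python
import numpy as np

def kmeans_sketch(points, k, n_iters=100, seed=42):
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct points as the first centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid
        # (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (keep it in place if its cluster happens to be empty).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Converged: the centroids stopped moving.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```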
🤔 But How Many Clusters Should You Use?
Choosing the number of clusters (`n_clusters`) can be tricky. There's no perfect answer; it depends on the data. For a small number of tracks, 2–5 clusters might work. In real-world applications, you might:
- Experiment with different values and evaluate results visually (e.g., with PCA or t-SNE).
- Lean on quantitative guides such as the elbow method or the silhouette score, as in the sketch after this list.
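The elbow method fits KMeans for several values of k and watches the inertia (the sum of squared distances that KMeans minimizes). A minimal sketch, assuming the `track_embeddings_matrix` from the setup above:

```python
from sklearn.cluster import KMeans

# Fit KMeans for several cluster counts and record the inertia.
for k in range(2, 8):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(track_embeddings_matrix)
    print(f"k={k}: inertia={km.inertia_:.2f}")

# Look for the "elbow": the k where adding more clusters
# stops reducing inertia by much.
```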
Let’s walk through the main function that clusters tracks using KMeans: `assign_track_clusters`. The exact code lives in the practice workspace; the sketch below captures its key steps:
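```python
import pandas as pd
from sklearn.cluster import KMeans

def assign_track_clusters(track_ids, track_embeddings_matrix, n_clusters=3):
    # A minimal sketch; random_state and n_init are illustrative choices.
    # KMeans needs n_clusters <= number of samples, so cap the request.
    # (This is the graceful fallback discussed later in this lesson.)
    effective_clusters = min(n_clusters, len(track_ids))

    # Fit KMeans on the embeddings; fit_predict returns one label per track.
    kmeans = KMeans(n_clusters=effective_clusters, random_state=42, n_init=10)
    cluster_labels = kmeans.fit_predict(track_embeddings_matrix)

    # Pair each track ID with its assigned cluster label.
    return pd.DataFrame({"track_id": track_ids, "cluster": cluster_labels})
```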
Here’s an example of how you might use the `assign_track_clusters` function and what the output could look like; say, 5 tracks clustered into 3 groups:
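A sketch with made-up track IDs and tiny 2-D embeddings (real embeddings have far more dimensions):

```python
import numpy as np

track_ids = ["t1", "t2", "t3", "t4", "t5"]
track_embeddings_matrix = np.array([
    [0.10, 0.20],
    [0.15, 0.22],
    [0.90, 0.80],
    [0.88, 0.82],
    [0.50, 0.50],
])

clusters_df = assign_track_clusters(track_ids, track_embeddings_matrix, n_clusters=3)
print(clusters_df)
# Nearby embeddings (t1/t2 and t3/t4) should share a label; the exact
# label numbers (0, 1, 2) are arbitrary and can differ between runs.
```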
We also verify that each cluster contains a reasonable number of tracks:
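In the practice tests this check is done for you; a rough equivalent of the assertion might be:

```python
# Count how many tracks landed in each cluster.
cluster_sizes = clusters_df["cluster"].value_counts()
print(cluster_sizes)

# With 5 tracks and 3 clusters, each cluster should hold at least one track.
assert clusters_df["cluster"].nunique() == 3
```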
Even if the number of clusters you request is more than the number of tracks, the system handles it gracefully.
This safeguard exists because KMeans requires the number of clusters to be less than or equal to the number of samples. If you ask for 10 clusters with only 5 tracks, the system quietly reduces the request to 5; otherwise, scikit-learn would raise a ValueError. This fallback protects the system from crashing and keeps behavior sensible when testing on small datasets.
For example, reusing the toy data from above:
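```python
# Only 5 tracks, but we ask for 10 clusters.
many_clusters_df = assign_track_clusters(
    track_ids, track_embeddings_matrix, n_clusters=10
)

# The fallback caps the effective cluster count at the sample count,
# so at most 5 distinct labels can appear.
print(many_clusters_df["cluster"].nunique())  # 5
```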
With 5 tracks, the effective cluster count becomes 5, so the final cluster assignment table contains one row per track and at most 5 distinct cluster labels.
You’ve now clustered tracks into groups — but how can this help in real-world systems?
Here are a few practical examples:
- 🎧 Playlist Generation: Automatically group similar songs for mood- or genre-based playlists.
- 🔁 Exploration Interfaces: Let users explore clusters visually (“These tracks feel similar — wanna try more?”).
- 🧠 Cold Start Help: If you don’t know anything about a new user, recommending popular tracks from diverse clusters gives a safe and broad introduction.
- 🕵️ Anomaly Detection: Tracks that consistently fall into odd clusters might need review; they could have corrupt metadata or unusual embedding profiles.
Clustering isn’t just backend logic — it can directly shape UX and feature design.
In this lesson, you learned how to group music tracks into clusters using the KMeans algorithm. You saw how to prepare your data, run the clustering, and interpret the results. Clustering is a powerful tool for organizing and exploring your music library, and it can help you build better recommendation systems.
You are now ready to practice clustering tracks yourself. In the next exercises, you will get hands-on experience with these concepts. This will help you reinforce what you have learned and prepare you for more advanced topics in music recommendation. Good luck, and enjoy exploring your music clusters!
