Introduction: Why Predict Track Affinity?

Welcome back! In the previous lesson, you learned how to prepare training data from user session logs. Now, we are ready to take the next step: training a machine learning model that can predict which tracks a user is likely to enjoy. This is called predicting track affinity.

Track affinity is a measure of how much a user might like a particular track. By predicting this, we can recommend songs that users are more likely to enjoy, making the music player smarter and more personalized. In this lesson, you will learn how to train a simple but effective model to make these predictions.

Understanding the Training Data

Before training our model, let’s quickly review what kind of data we’re working with.

The function prepare_training_data() returns:

  • A feature matrix X, where each row is the concatenation of a user profile vector and a track embedding.
  • A label vector y, where:
    • 1 means the user listened to (liked) the track
    • 0 means the user did not listen to it (a negative sample)
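
To make one row of X concrete, here is a minimal sketch, assuming a 3-dimensional user profile and a 3-dimensional track embedding (the dimensions and values are purely illustrative):

```python
import numpy as np

# Hypothetical user profile and track embedding (sizes and values are illustrative)
user_profile = np.array([0.2, 0.8, 0.1])
track_embedding = np.array([0.5, 0.3, 0.9])

# One row of X: the two vectors concatenated end to end
feature_vector = np.concatenate([user_profile, track_embedding])
print(feature_vector)  # [0.2 0.8 0.1 0.5 0.3 0.9]
```

The corresponding entry of y is then 1 or 0 depending on whether this user actually listened to this track.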

This structure is important: our model will try to learn a function like f(user, track) → probability of liking.

Why Logistic Regression?

In this project, we use logistic regression because our task is binary classification:

  • 1 → The user is likely to enjoy the track (positive sample)
  • 0 → The user is unlikely to enjoy the track (negative sample)

Logistic regression is a simple yet effective model for this kind of task because:

  • It predicts probabilities, not just yes/no outcomes. This is useful for ranking recommendations by likelihood.
  • It’s interpretable — the learned weights show which features push predictions higher or lower.
  • It’s fast to train and works well even with small datasets, making it ideal for an educational setting before moving on to more complex models.
  • Its sigmoid output lies between 0 and 1, which maps naturally to the idea of “likelihood of liking.”
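
The last point is easy to see directly: logistic regression pushes a weighted sum of the features through the sigmoid function σ(z) = 1 / (1 + e⁻ᶻ). A minimal sketch:

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0.0))   # 0.5   -- the model is unsure
print(sigmoid(4.0))   # ~0.982 -- strong "will like"
print(sigmoid(-4.0))  # ~0.018 -- strong "will not like"
```

Large positive scores map close to 1, large negative scores close to 0, which is exactly what we want for ranking tracks by likelihood.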

While real-world recommendation systems may use deep learning or more complex models, logistic regression is a great starting point to:

  • Understand the full pipeline from data preparation to prediction.
  • Build intuition for how features influence recommendations.
  • Avoid overfitting when data is limited.

Training the Affinity Model

Now, let’s train a model that can predict whether a user will like a track. We will use a logistic regression model, which is a simple and popular choice for binary classification tasks (like predicting 1 or 0).

Here is the main function for training the model:
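The lesson's exact code is not reproduced here, so below is a sketch of what such a function could look like, assuming X and y come from `prepare_training_data()` as described above and that scikit-learn is available. The names `train_affinity_model` and `MIN_SAMPLES` are illustrative, not part of the original code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

MIN_SAMPLES = 10  # illustrative threshold for the data checks described below

def train_affinity_model(X, y):
    """Train a logistic regression affinity model, or return None if the data is unusable."""
    X, y = np.asarray(X), np.asarray(y)

    # Guard 1: too little data leads to overfitting or convergence failures
    if len(y) < MIN_SAMPLES:
        print(f"Not enough samples to train ({len(y)} < {MIN_SAMPLES}); skipping.")
        return None

    # Guard 2: a single class (all 1s or all 0s) gives the classifier nothing to learn
    if len(np.unique(y)) < 2:
        print("All labels are identical; skipping training.")
        return None

    # Hold out a test set so we can evaluate on data the model has not seen
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    return model, (X_test, y_test)
```

In the lesson's pipeline, the call would look roughly like `X, y = prepare_training_data()` followed by `train_affinity_model(X, y)`.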

Let’s break down what happens here:

  • We get our features (X) and labels (y) using the data preparation function.
  • If there isn’t enough data, or if all the labels are the same, we skip training and print a message. These checks prevent runtime errors and model training failures. For instance:
    • If you try to train with fewer than 10 samples, the model might overfit or fail to converge.
    • If the labels are all 1 or all 0, the classifier can’t learn anything meaningful — it's like trying to teach it to distinguish cats from... just more cats.
  • We split the data into training and test sets. This helps us check how well the model works on new data.

Evaluating and Saving the Model

After training, it’s important to check if the model is actually working well. We use two main metrics:

  • Accuracy: The percentage of correct predictions.
  • ROC AUC: A score that tells us how well the model can distinguish between positive and negative samples. A score closer to 1.0 is better.

However, accuracy alone can be misleading — especially if your dataset is imbalanced (e.g., many more negative samples than positive). That’s why we also calculate ROC AUC, which evaluates how well the model separates the two classes regardless of their ratio. A score of 0.5 means the model is guessing; a score closer to 1.0 means strong separation.
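
Assuming a fitted model and a held-out test set, both metrics can be computed with scikit-learn. The synthetic data below is a stand-in, since the lesson's real features are not shown here:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with the same 0/1 label structure as the lesson
X, y = make_classification(n_samples=200, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
# ROC AUC is computed from predicted probabilities, not hard 0/1 labels
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

print(f"Accuracy: {accuracy:.2f}")
print(f"ROC AUC:  {auc:.2f}")
```

Note that `roc_auc_score` takes the probability of the positive class (`predict_proba(...)[:, 1]`); passing hard predictions would throw away exactly the ranking information the metric is meant to measure.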

Here’s how we save the trained model so we can use it later:
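The lesson's exact helpers aren't shown here, but a typical implementation uses joblib, which serializes scikit-learn models efficiently (the default file name is illustrative):

```python
import joblib

def save_model(model, path="affinity_model.joblib"):
    """Serialize the trained model to disk."""
    joblib.dump(model, path)

def load_model(path="affinity_model.joblib"):
    """Load a previously saved model back into memory."""
    return joblib.load(path)
```

With this in place, a training script can call `save_model(model)` once and later processes can call `load_model()` to score new user-track pairs without retraining.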

  • save_model writes the trained model to a file, so you don’t have to retrain it every time — saving you compute time and ensuring consistent predictions. This becomes especially important once you deploy the system or start batch-generating recommendations.
  • load_model lets you load the model back into your program when you need it.

Example Output:
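
The lesson's actual output is not included here; a run of the training-evaluate-save pipeline might print something of this shape (all numbers are placeholders, not real results):

```
Accuracy: 0.85
ROC AUC:  0.91
Model saved to affinity_model.joblib
```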

Summary and What’s Next

In this lesson, you learned how to train a logistic regression model to predict track affinity using prepared user and track data. You also saw how to evaluate the model’s performance and save it for future use.

Next, you will get a chance to practice these steps yourself. You’ll train your own model, check its accuracy, and save it — just like we did here. Good luck, and have fun experimenting with your own music recommendation model!
