Welcome to the course. In this lesson, we will focus on preparing training data from session logs — a crucial step in building a system that can predict what music a user might like next.
Training data is the foundation of any machine learning model. For our smart music player, we want to teach the model to recognize patterns in what users listen to so it can make smart recommendations. Session logs are records of what tracks users have listened to. By turning these logs into structured training data, we give our model the information it needs to learn user preferences.
By the end of this lesson, you will understand how to transform raw session logs into a format that a machine learning model can use.
Before we dive in, let’s briefly remind ourselves of the project setup and where our data comes from. Our project uses several data sources:
- Session logs: These are stored in a file called `sessions.csv` and record which users listened to which tracks.
- Track data: Information about all available tracks.
- User profiles: Vectors that summarize a user's listening history.
- Track embeddings: Vectors that represent the features of each track.
Here is a quick example of how we load the necessary modules and data in our project:
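A minimal sketch of this setup (`sessions.csv` matches the file described above; the `tracks.csv` file name and variable names are illustrative):

```python
import pandas as pd

# Session logs: which users listened to which tracks.
sessions_df = pd.read_csv("sessions.csv")

# Track data: metadata for all available tracks
# ("tracks.csv" is an illustrative file name).
tracks_df = pd.read_csv("tracks.csv")
```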
This code sets up the basic data we need for preparing our training data. If you are using the CodeSignal environment, these modules and files should already be available.
Data Check: If the session log file path doesn’t exist, it’s good practice to print a warning so it’s clear that no training data will be loaded. You can also print the first few rows of the DataFrame with `sessions_df.head()` when the file exists — this helps remind you what data you’re working with before moving on.
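A sketch of that check, using the same `sessions.csv` path as above:

```python
import os

import pandas as pd

SESSIONS_PATH = "sessions.csv"

if not os.path.exists(SESSIONS_PATH):
    # Make it obvious that no training data will be loaded.
    print(f"Warning: {SESSIONS_PATH} not found; no training data will be loaded.")
else:
    sessions_df = pd.read_csv(SESSIONS_PATH)
    print(sessions_df.head())  # remind yourself what the data looks like
```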
To train a model to predict user preferences, we need to show it examples of both what a user likes and what they don’t like (or at least, what they haven’t listened to yet).
- Positive samples: These are track-user pairs where the user has listened to the track. In our session logs, these are easy to find.
- Negative samples: These are track-user pairs where the user has not listened to the track. We create these by pairing users with tracks they haven’t played.
Note that we don’t include all unlistened tracks as negative samples — this would create an overwhelming number of negatives compared to positives and hurt model training. Instead, we sample a limited number using the `negative_sample_ratio` parameter, which controls how many negatives are generated per positive. This helps maintain a healthy balance in the training dataset. Keep in mind that if a user has listened to nearly all tracks, the number of unlistened tracks might be very small or even zero. In such cases, the function gracefully avoids creating excess negative samples — meaning fewer (or no) negatives will be generated, even if the `negative_sample_ratio` is set high. This behavior prevents index errors or sampling failures but might result in class imbalance for some users.
A very high ratio can create class imbalance (far more 0s than 1s), which often inflates accuracy while hurting recall for positives. It can also slow training and bias the model toward predicting “not liked.” Practical tips: start with 1–3, cap by available unlistened tracks (already handled), and monitor label counts. If you must go higher, consider class weights or downsampling negatives during training.
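Here is a small sketch of that capping behavior (the track IDs and standalone logic are illustrative; the real function wraps this inside the full pipeline):

```python
import random

listened = {"t1", "t2", "t3"}                   # 3 positives for this user
all_track_ids = ["t1", "t2", "t3", "t4", "t5"]
negative_sample_ratio = 2                        # asks for up to 6 negatives

unlistened = [t for t in all_track_ids if t not in listened]
# Cap by what is actually available to avoid sampling errors.
n_negatives = min(len(listened) * negative_sample_ratio, len(unlistened))
negatives = random.sample(unlistened, n_negatives)
print(n_negatives)  # 2: only two unlistened tracks exist
```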
Why do we need both? If we only show the model what users like, it won’t learn to tell the difference between liked and unliked tracks. By including both, we help the model learn what makes a track appealing to a user.
What Happens If We Only Include Positives?
Let’s try preparing the training data with `negative_sample_ratio=0`, meaning no negative examples are added:
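Something like the following, assuming the function signature sketched later in this lesson:

```python
import numpy as np

X, y = prepare_training_data(negative_sample_ratio=0)

# With no negatives, every label is 1; the model never sees a
# contrasting "not listened" example.
print(np.unique(y))  # [1]
```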
For each user-track pair, we need to create a feature vector that the model can use. This vector combines information about the user and the track.
- User profile vector: Summarizes the user’s listening history.
- Track embedding: Represents the features of the track.
We combine these two vectors into one by simply joining them together (concatenation). This concatenation ensures that the model sees both the user’s musical taste and the track’s characteristics side by side. Over time, it learns how different combinations of user preferences and track features affect listening behavior — essentially, it’s learning a function f(user, track) → like or not.
Here’s a simplified example:
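```python
import numpy as np

# Short illustrative vectors (real profile and embedding vectors are longer).
user_profile = np.array([0.2, 0.8, 0.5])
track_embedding = np.array([0.9, 0.1, 0.4])

# Concatenate user and track vectors into one feature vector.
feature_vector = np.concatenate([user_profile, track_embedding])
print(feature_vector)
```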
Output:
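```
[0.2 0.8 0.5 0.9 0.1 0.4]
```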
This combined vector is what we use as input for our model. Each row in our training data will be one of these feature vectors, and each will have a label: 1 for positive, 0 for negative.
To avoid confusion, it’s important to understand that both the user profile vector and the track embedding must have a fixed and consistent length. For example, if the user profile vector has length 10 and the track embedding has length 12, the resulting feature vector will have 22 dimensions — which is exactly what you’ll observe in the training data. If their shapes are mismatched or inconsistent between users or tracks, model training will fail with shape-related errors. Always ensure that the user and track vectors are computed using the same embedding logic.
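A quick way to verify this consistency (placeholder zero vectors stand in for the real profile and embedding):

```python
import numpy as np

user_profile = np.zeros(10)      # fixed-length user profile vector
track_embedding = np.zeros(12)   # fixed-length track embedding

feature_vector = np.concatenate([user_profile, track_embedding])
print(feature_vector.shape)  # (22,) for every user-track pair
```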
Now, let’s look at how we put all these ideas together in the `prepare_training_data` function. This function creates the training data for our model by:
- Loading all tracks and their embeddings.
- Reading the session logs to find which users listened to which tracks.
- For each user:
  - Generating their profile vector.
  - Creating positive samples for tracks they listened to.
  - Creating negative samples for tracks they did not listen to.
  - Combining user and track vectors into feature vectors.
- Returning the features (`X`) and labels (`y`).
Here is the main part of the function:
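A sketch of that logic, assuming illustrative helper names (`load_all_tracks`, `build_user_profile`, `get_track_embedding`) and column names (`user_id`, `track_id`) from the project setup:

```python
import os
import random

import numpy as np
import pandas as pd


def prepare_training_data(sessions_path="sessions.csv", negative_sample_ratio=2):
    # Warn and return empty arrays if the session log is missing.
    if not os.path.exists(sessions_path):
        print(f"Warning: {sessions_path} not found; no training data loaded.")
        return np.empty((0, 0)), np.empty((0,))

    sessions_df = pd.read_csv(sessions_path)
    all_track_ids = load_all_tracks()              # illustrative helper
    features, labels = [], []

    for user_id, user_sessions in sessions_df.groupby("user_id"):
        listened = set(user_sessions["track_id"])
        user_vector = build_user_profile(user_id)  # illustrative helper

        # Positive samples: tracks the user actually played.
        for track_id in listened:
            track_vector = get_track_embedding(track_id)  # illustrative helper
            features.append(np.concatenate([user_vector, track_vector]))
            labels.append(1)

        # Negative samples: unlistened tracks, capped by availability
        # so a high ratio never causes sampling errors.
        unlistened = [t for t in all_track_ids if t not in listened]
        n_negatives = min(len(listened) * negative_sample_ratio, len(unlistened))
        for track_id in random.sample(unlistened, n_negatives):
            track_vector = get_track_embedding(track_id)
            features.append(np.concatenate([user_vector, track_vector]))
            labels.append(0)

    return np.array(features), np.array(labels)
```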
In this lesson, you learned how to turn raw session logs into structured training data for a machine learning model. We covered:
- The importance of both positive and negative samples
- How to build feature vectors by combining user and track information
- How the `prepare_training_data` function works step by step, including its checks for edge cases such as missing session files, empty track lists, or users with no listen history. These checks are verified in the test suite you’ll use shortly, ensuring robustness even when the data is sparse or incomplete.
This foundation is essential for building models that can predict user preferences. In the next practice exercises, you will get hands-on experience running and testing this data preparation process yourself. Be sure to pay attention to how the data is structured and how the function handles different scenarios, as this will help you build more robust recommendation systems in the future.
