Lesson 3
Setting Content-Based Recommendations Baseline with Linear Regression
Introduction to More Complex Content-Based Recommendations

In previous lessons, you learned about content-based recommendation systems and how they rely on user and item profiles. We covered how to extract content features such as likes, clicks, and genres, and how to compute similarities using straightforward methods like the dot product. This lesson will build on those foundations to guide you through a more complex example, using advanced techniques like regression models to generate recommendations.

We'll explore how to simulate user preferences, calculate genre similarities, and predict song ratings, offering you a glimpse into the practical applications of these systems in real-world scenarios, such as music streaming services. Let's dive into this sophisticated example step by step.

Recap of Initial Setup

As a reminder from our previous lessons, let's quickly revisit how to load and merge datasets. We begin by using Python's pandas library to read from JSON files and create a merged DataFrame that contains both track and author information. Here's a code block demonstrating this process:

Python
import pandas as pd

# Load data from JSON files
tracks_df = pd.read_json('tracks.json')
authors_df = pd.read_json('authors.json')

# Merge the dataframes on the common 'author_id' field
merged_df = pd.merge(tracks_df, authors_df, on='author_id', how='inner')

By executing this code, we create a unified view of our music tracks, integrating both track details and author information, which will serve as a foundation for our recommendation system.
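
For reference, here is one possible shape for the two input files. The field names and values below are assumptions made for illustration (chosen so the later steps in this lesson can be reproduced end to end); your real data will differ.

Python
import json

# Hypothetical contents for tracks.json: three tracks with engagement metrics and a genre
tracks = [
    {"track_id": 1, "title": "Song A", "author_id": 10, "genre": "Rock",
     "likes": 150, "clicks": 400, "full_listens": 120},
    {"track_id": 2, "title": "Song B", "author_id": 11, "genre": "Pop",
     "likes": 90, "clicks": 300, "full_listens": 80},
    {"track_id": 3, "title": "Song C", "author_id": 12, "genre": "Jazz",
     "likes": 40, "clicks": 120, "full_listens": 35},
]

# Hypothetical contents for authors.json: one author per track, with listener counts
authors = [
    {"author_id": 10, "author_name": "Author X", "author_listeners": 5000},
    {"author_id": 11, "author_name": "Author Y", "author_listeners": 8000},
    {"author_id": 12, "author_name": "Author Z", "author_listeners": 2000},
]

with open('tracks.json', 'w') as f:
    json.dump(tracks, f)
with open('authors.json', 'w') as f:
    json.dump(authors, f)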

Simulating User Preferences

To offer personalized recommendations, we need to simulate user preferences. Let's define a hypothetical user's listening history, quantifying their genre preferences and listening behavior.

Python
# Simulate user listening history or preferences
user_features = {
    "rock_preference": 5,   # On a scale of 1-5
    "pop_preference": 4,    # On a scale of 1-5
    "jazz_preference": 2,   # On a scale of 1-5
    "listens": 50,          # Total listens
    "likes": 30             # Total likes
}

# Create a profile for the user
user_profile = pd.DataFrame([user_features])

Here, we've created a simple user profile indicating that our hypothetical user enjoys rock the most, followed by pop, and has a moderate affinity for jazz. This profile will be used to tailor recommendations to their tastes.

Calculating Genre Similarities: Part 1

Next, let's map music genres into numerical vectors and compute genre similarities using cosine_similarity. This allows us to quantitatively compare the user's genre preferences with those of the tracks available.

The genre_map is a dictionary that maps each genre to a unique numerical vector representation. Each vector can be seen as a one-hot encoding, where a genre is represented by a vector with a 1 in the position that corresponds to the specific genre, and 0s elsewhere. This means each genre has a distinct and orthogonal representation, useful for calculating similarities.

Here’s how we define it:

Python
import numpy as np

genre_map = {
    "Rock": np.array([1, 0, 0]),
    "Pop": np.array([0, 1, 0]),
    "Jazz": np.array([0, 0, 1])
}

The vectors are set up to reflect a one-hot encoding for simplicity. For example, the vector for "Rock" is [1, 0, 0], indicating the presence of rock and the absence of pop and jazz. This representation will help us calculate the similarity between the user’s genre preferences and each track's genre.

This time, to calculate similarities, let's use another metric: cosine similarity. Cosine similarity measures the cosine of the angle between two vectors in a multidimensional space. A value of 1 means the vectors point in exactly the same direction, while 0 means they are orthogonal (no similarity). High cosine similarity signifies that the preferences are closely aligned.
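
Concretely, the cosine similarity of two vectors is their dot product divided by the product of their lengths. Here is a minimal sketch that checks the formula by hand on two small example vectors (the numbers are arbitrary) and confirms it matches scikit-learn's cosine_similarity:

Python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Arbitrary example vectors: a user preference vector and one-hot genre vectors
prefs = np.array([5, 4, 2])
rock = np.array([1, 0, 0])
jazz = np.array([0, 0, 1])

# Cosine similarity by hand: dot product divided by the product of the vector lengths
manual = np.dot(prefs, rock) / (np.linalg.norm(prefs) * np.linalg.norm(rock))
print(manual)  # ≈ 0.745

# The same value from scikit-learn, which expects 2D inputs
print(cosine_similarity(prefs.reshape(1, -1), rock.reshape(1, -1))[0, 0])  # ≈ 0.745

# Orthogonal one-hot genre vectors have zero similarity with each other
print(cosine_similarity(rock.reshape(1, -1), jazz.reshape(1, -1))[0, 0])   # 0.0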

Calculating Genre Similarities: Part 2

Now, let's build the code snippet to calculate similarities between the user profile and songs.

Python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Function to map genre preferences for similarity calculation
def map_genre_to_similarity(df):
    genre_map = {
        "Rock": np.array([1, 0, 0]),
        "Pop": np.array([0, 1, 0]),
        "Jazz": np.array([0, 0, 1])
    }
    genre_features = df['genre'].apply(lambda x: genre_map[x])
    return genre_features.tolist()

# Calculate similarity between user's genre preferences and tracks' genres
track_genre_features = np.array(map_genre_to_similarity(merged_df))
user_genre_preferences = np.array([user_profile.iloc[0]['rock_preference'],
                                   user_profile.iloc[0]['pop_preference'],
                                   user_profile.iloc[0]['jazz_preference']]).reshape(1, -1)
similarities = cosine_similarity(track_genre_features, user_genre_preferences).flatten()

# Attach similarity scores to the tracks
merged_df['similarity'] = similarities

Here, iloc retrieves values by position from the DataFrame; for example, user_profile.iloc[0] selects the first row of the profile, from which we read each genre preference. Using the cosine_similarity function, we measure how close the user's preference vector is to each track's genre vector, effectively scoring each track's potential appeal to the user. Higher scores indicate a closer match to the user's tastes.

Also note two NumPy methods we use to keep the array shapes compatible (a short sketch follows the list):

  • reshape: Adjusts the dimensions of your arrays. Here, it's used to turn a single row of user preferences into a 2D array required by the cosine_similarity function.
  • flatten: Converts a multidimensional array into a 1D array, which helps in simplifying the structure for further analysis or appending results.
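
Here is a minimal sketch of what iloc, reshape, and flatten do in this context. It reuses the user_profile DataFrame built earlier, and the similarity scores at the end are made up for illustration:

Python
import numpy as np

# iloc selects by position: the first (and only) row of the profile, as a pandas Series
row = user_profile.iloc[0]
print(row['rock_preference'])       # 5

# reshape turns a 1D array of three preferences into a 2D array with one row,
# which is the input shape cosine_similarity expects
prefs = np.array([row['rock_preference'], row['pop_preference'], row['jazz_preference']])
print(prefs.reshape(1, -1).shape)   # (1, 3)

# flatten collapses a 2D result back into a simple 1D array of scores
scores_2d = np.array([[0.74], [0.55], [0.27]])   # made-up similarity scores
print(scores_2d.flatten())          # [0.74 0.55 0.27]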

Standardizing Features and Applying Regression Model

Before making predictions, it's vital to standardize our features to ensure balanced input for our regression model.

Standardization rescales features so that they have a mean of 0 and a standard deviation of 1. This process eliminates biases in model training that could arise due to features being on different scales, ensuring each feature has equal influence.
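
As a quick illustration of what StandardScaler computes, here is a minimal sketch on a single made-up column of values:

Python
import numpy as np
from sklearn.preprocessing import StandardScaler

likes = np.array([[150.0], [90.0], [40.0]])   # made-up values for one feature column

# Standardize by hand: subtract the mean, divide by the (population) standard deviation
manual = (likes - likes.mean()) / likes.std()
print(manual.ravel())                                   # ≈ [ 1.26 -0.07 -1.19]

# StandardScaler applies exactly the same transformation
print(StandardScaler().fit_transform(likes).ravel())    # same values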

For our content-based recommendation baseline, we will use a linear regression model. Linear regression models the linear relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data. The model is trained by minimizing the squared differences between predicted and actual values, essentially finding the line of best fit.

Python
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Standardize numerical features
scaler = StandardScaler()
numeric_columns = ["likes", "clicks", "full_listens", "author_listeners", "similarity"]
track_features_scaled = scaler.fit_transform(merged_df[numeric_columns])

# Add the user's ratings for the three tracks (synthetic values for this example)
merged_df['rating'] = [4, 5, 3]

# Train a simple regression model
X = track_features_scaled
y = merged_df['rating']

reg_model = LinearRegression()
reg_model.fit(X, y)

Here, .fit is an essential part of the model training process. It adjusts the parameters of the regression model (the coefficients and intercept of the linear equation) based on the input features X and the target values y. By fitting the model on the user's ratings for known tracks, it learns how to predict ratings for new tracks from their standardized content features.
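
If you want to see what .fit actually learned, you can inspect the fitted parameters: coef_ holds one weight per column in numeric_columns, and intercept_ is the bias term. A prediction is simply the intercept plus the weighted sum of the scaled features.

Python
# Inspect the learned parameters of the fitted model
print(reg_model.coef_)       # one weight each for likes, clicks, full_listens, author_listeners, similarity
print(reg_model.intercept_)  # the intercept (bias) term

# A prediction is the intercept plus the weighted sum of the scaled features,
# which is exactly what reg_model.predict(X) computes
print(X @ reg_model.coef_ + reg_model.intercept_)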

Predicting Test Song Ratings

Finally, we define a test song, which the model was not trained on, process its features similarly, and use the model to predict its rating, completing our recommendation system process.

Python
# Define a test song and calculate its similarity
test_song = {
    "likes": 120,
    "clicks": 350,
    "full_listens": 110,
    "author_listeners": 6000,
    "genre": "Rock"
}

# Map genre for test song
test_song_genre_feature = np.array(map_genre_to_similarity(pd.DataFrame([test_song]))).reshape(1, -1)

# Calculate similarity for the test song
test_song_similarity = cosine_similarity(test_song_genre_feature, user_genre_preferences).flatten()

# Prepare test song features with column names for scaling
test_song_features = pd.DataFrame({
    "likes": [test_song['likes']],
    "clicks": [test_song['clicks']],
    "full_listens": [test_song['full_listens']],
    "author_listeners": [test_song['author_listeners']],
    "similarity": [test_song_similarity[0]],
})

# Scale test song features
test_song_features_scaled = scaler.transform(test_song_features)

# Predict rating for the test song
predicted_test_rating = reg_model.predict(test_song_features_scaled)
print(f"Predicted rating for the test song: {predicted_test_rating[0]}")

By defining features for a new track and calculating its similarity to user preferences, our regression model predicts the track's rating. The use of .predict applies the model's learned patterns to estimate a rating based on the test song's features. Similarly, we can predict ratings for multiple songs and recommend ones with the highest rating.
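
To go from a single prediction to actual recommendations, you can score a whole batch of candidate tracks and keep the highest-rated ones. The sketch below reuses the scaler, regression model, and helper function from earlier; candidate_df is a hypothetical DataFrame of unseen tracks with the same columns as the training data.

Python
# Hypothetical: candidate_df holds unseen tracks with the columns
# "likes", "clicks", "full_listens", "author_listeners", and "genre"
candidate_genres = np.array(map_genre_to_similarity(candidate_df))
candidate_df['similarity'] = cosine_similarity(candidate_genres, user_genre_preferences).flatten()

# Reuse the scaler fitted on the training tracks, then predict a rating per candidate
candidate_scaled = scaler.transform(candidate_df[numeric_columns])
candidate_df['predicted_rating'] = reg_model.predict(candidate_scaled)

# Recommend the top three candidates by predicted rating
top_recommendations = candidate_df.sort_values('predicted_rating', ascending=False).head(3)
print(top_recommendations)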

Summary and Preparation for Practice

In this lesson, you've successfully integrated advanced content-based recommendation concepts, from simulating user preferences to predicting track ratings with a regression model. You've combined data merging, similarity calculations, and regression insights to create a concrete recommendation system.

As you move on to practice exercises, use this lesson as a framework for applying similar techniques to your unique datasets and user scenarios. This practical experience will consolidate your understanding and proficiency, enabling you to build sophisticated content-based recommendation systems independently.
