Advanced Content Recommendations

Introduction to More Complex Content-Based Recommendations

In previous lessons, you learned about content-based recommendation systems and how they rely on user and item profiles. We covered how to extract content features such as likes, clicks, and genres, and how to compute similarities using straightforward methods like the dot product. This lesson will build on those foundations to guide you through a more complex example, using advanced techniques like regression models to generate recommendations. We'll explore how to simulate user preferences, calculate genre similarities, and predict song ratings, offering you a glimpse into the practical applications of these systems in real-world scenarios, such as music streaming services. Let's dive into this sophisticated example step by step.

Representing User and Track Data in C++

Before we proceed, let's recall how to represent user and track data in C++. Instead of dictionaries or dataframes, we use structs to define the features of users and tracks, and std::vector to hold collections of values. Here is how we can define user and track profiles:

```cpp
#include <iostream>
#include <vector>
#include <string>

struct UserProfile {
    int rock_preference;  // Scale 1-5
    int pop_preference;   // Scale 1-5
    int jazz_preference;  // Scale 1-5
    int listens;          // Total listens
    int likes;            // Total likes
};

struct Track {
    std::string name;
    std::string genre;
    int likes;
    int clicks;
    int full_listens;
    int author_listeners;
};
```

This setup allows us to store and manipulate user and track information efficiently in C++.

Simulating User Preferences

To offer personalized recommendations, we need to simulate user preferences. In C++, we can create a user profile by initializing a UserProfile struct with the desired values.

```cpp
// Simulate user listening history or preferences
UserProfile user = {
    5,   // rock_preference
    4,   // pop_preference
    2,   // jazz_preference
    50,  // listens
    30   // likes
};
```

Here, we've created a simple user profile indicating that our hypothetical user enjoys rock the most, followed closely by pop, and has comparatively little interest in jazz (2 on the 1-5 scale). This profile will be used to tailor recommendations to their tastes.

Calculating Genre Similarities: Part 1

Next, let's map music genres into numerical vectors so that we can compute genre similarities. In C++, we can represent these vectors with std::vector<int>. Each genre is one-hot encoded: exactly one element is set to 1 and the rest are 0.

```cpp
#include <map>

// Define genre vectors using one-hot encoding, in the order (rock, pop, jazz)
std::map<std::string, std::vector<int>> genre_map = {
    {"Rock", {1, 0, 0}},
    {"Pop",  {0, 1, 0}},
    {"Jazz", {0, 0, 1}}
};
```

For example, the vector for "Rock" is {1, 0, 0}, indicating the presence of rock and the absence of pop and jazz. This representation will help us calculate the similarity between the user's genre preferences and each track's genre.
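To see why one-hot encoding is convenient here, note that the dot product of a one-hot genre vector with a preference vector simply picks out the preference for that genre. The sketch below illustrates this under the assumption that preferences are stored in the same (rock, pop, jazz) order as the one-hot positions; `kGenreMap` and `preference_for_genre` are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One-hot genre vectors, in the order (rock, pop, jazz).
const std::map<std::string, std::vector<int>> kGenreMap = {
    {"Rock", {1, 0, 0}},
    {"Pop",  {0, 1, 0}},
    {"Jazz", {0, 0, 1}}
};

// Hypothetical helper: dot product of the genre's one-hot vector with the
// preference vector. Because only one position is 1, the result is the
// preference stored at that position.
// at() throws std::out_of_range if the genre is not in the map.
int preference_for_genre(const std::string& genre,
                         const std::vector<int>& preferences) {
    const std::vector<int>& one_hot = kGenreMap.at(genre);
    assert(one_hot.size() == preferences.size());
    int result = 0;
    for (std::size_t i = 0; i < one_hot.size(); ++i) {
        result += one_hot[i] * preferences[i];
    }
    return result;
}
```

With the lesson's user, whose preferences are (5, 4, 2), calling `preference_for_genre("Pop", {5, 4, 2})` returns 4.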

Calculating Genre Similarities: Part 2

To compare the user's genre preferences with each track's genre, we need to compute the similarity between two vectors. One common metric is cosine similarity: the dot product of the two vectors divided by the product of their norms. In C++, we can implement this calculation manually. First, let's define a function to compute cosine similarity between two vectors:

```cpp
#include <cmath>

// Compute the dot product of two vectors (assumed to have equal length)
double dot_product(const std::vector<int>& a, const std::vector<int>& b) {
    double result = 0.0;
    for (size_t i = 0; i < a.size(); ++i) {
        result += a[i] * b[i];
    }
    return result;
}

// Compute the norm (magnitude) of a vector
double norm(const std::vector<int>& v) {
    double sum = 0.0;
    for (int val : v) {
        sum += val * val;
    }
    return std::sqrt(sum);
}

// Compute cosine similarity between two vectors
double cosine_similarity(const std::vector<int>& a, const std::vector<int>& b) {
    double dot = dot_product(a, b);
    double norm_a = norm(a);
    double norm_b = norm(b);
    if (norm_a == 0 || norm_b == 0) return 0.0;  // Avoid division by zero
    return dot / (norm_a * norm_b);
}
```

Now, let's create a list of tracks and calculate the similarity between the user's genre preferences and each track's genre:

```cpp
// Example tracks: name, genre, likes, clicks, full_listens, author_listeners
std::vector<Track> tracks = {
    {"Song A", "Rock", 100, 300, 90, 5000},
    {"Song B", "Pop",   80, 250, 70, 4000},
    {"Song C", "Jazz",  60, 200, 50, 3000}
};

// User's genre preferences as a vector, in the same order as the one-hot encoding
std::vector<int> user_genre_preferences = {
    user.rock_preference,
    user.pop_preference,
    user.jazz_preference
};

// Calculate and print similarity for each track
// (the loop below belongs inside a function such as main())
std::vector<double> similarities;
for (const auto& track : tracks) {
    // at() throws for an unknown genre instead of silently inserting an empty vector
    const std::vector<int>& track_genre_vector = genre_map.at(track.genre);
    double sim = cosine_similarity(track_genre_vector, user_genre_preferences);
    similarities.push_back(sim);
    std::cout << "Similarity between user and " << track.name << ": " << sim << std::endl;
}
```

In this code, we manually calculate the cosine similarity between the user's genre preferences and each track's genre vector. Higher scores indicate a closer match to the user's tastes.
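As a worked check of the cosine similarity calculation, here is the arithmetic for the lesson's user, whose preference vector is u = (5, 4, 2), against each one-hot genre vector g. The symbols g and u are notation introduced here for illustration.

```latex
\cos(g, u) = \frac{g \cdot u}{\lVert g \rVert \, \lVert u \rVert},
\qquad \lVert g \rVert = 1,
\qquad \lVert u \rVert = \sqrt{5^2 + 4^2 + 2^2} = \sqrt{45} \approx 6.708

\begin{aligned}
\text{Rock: } & \frac{5}{\sqrt{45}} \approx 0.745 \\
\text{Pop: }  & \frac{4}{\sqrt{45}} \approx 0.596 \\
\text{Jazz: } & \frac{2}{\sqrt{45}} \approx 0.298
\end{aligned}
```

Because every genre vector is one-hot, the similarity is just the user's preference for that genre divided by the norm of the whole preference vector, so the ranking of genres by similarity matches the ranking of the raw preferences.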

Standardizing Features and Applying a Simple Regression Model

Before making predictions, it's important to standardize our features so that features on very different scales (for example, author_listeners in the thousands and similarity below 1) become comparable. Standardization means subtracting the mean and dividing by the standard deviation for each feature. Since we don't have external libraries, we'll implement standardization manually for a small dataset. Let's assume we want to use the following features for each track:

- likes
- clicks
- full_listens
- author_listeners
- similarity (calculated above)

We'll also add a synthetic rating for each track, standing in for the user's actual rating.

```cpp
// Add synthetic ratings for demonstration
std::vector<double> ratings = {4.0, 5.0, 3.0};  // User's real ratings for tracks

// Collect features for standardization
std::vector<std::vector<double>> features;
for (size_t i = 0; i < tracks.size(); ++i) {
    features.push_back({
        static_cast<double>(tracks[i].likes),
        static_cast<double>(tracks[i].clicks),
        static_cast<double>(tracks[i].full_listens),
        static_cast<double>(tracks[i].author_listeners),
        similarities[i]
    });
}

// Standardize features
std::vector<double> means(5, 0.0);
std::vector<double> stds(5, 0.0);

// Calculate means
for (int j = 0; j < 5; ++j) {
    for (size_t i = 0; i < features.size(); ++i) {
        means[j] += features[i][j];
    }
    means[j] /= features.size();
}

// Calculate standard deviations (population form: divide by n)
for (int j = 0; j < 5; ++j) {
    for (size_t i = 0; i < features.size(); ++i) {
        stds[j] += (features[i][j] - means[j]) * (features[i][j] - means[j]);
    }
    stds[j] = std::sqrt(stds[j] / features.size());
}

// Apply standardization; a constant feature (std of 0) is mapped to 0
std::vector<std::vector<double>> features_scaled = features;
for (size_t i = 0; i < features.size(); ++i) {
    for (int j = 0; j < 5; ++j) {
        if (stds[j] != 0)
            features_scaled[i][j] = (features[i][j] - means[j]) / stds[j];
        else
            features_scaled[i][j] = 0.0;
    }
}
```

Now, let's fit a simple linear regression model manually. To keep the example small, we use ordinary least squares with a single feature, the standardized similarity. The other standardized features are computed but not used by this model; extending the fit to several features requires matrix operations.

```cpp
// For demonstration, we'll use a simple linear regression with one feature (similarity)
// In practice, you can extend this to multiple features using matrix operations

// Calculate coefficients for y = a * similarity + b
double sum_x = 0.0, sum_y = 0.0, sum_xx = 0.0, sum_xy = 0.0;
for (size_t i = 0; i < features_scaled.size(); ++i) {
    double x = features_scaled[i][4];  // similarity
    double y = ratings[i];
    sum_x += x;
    sum_y += y;
    sum_xx += x * x;
    sum_xy += x * y;
}
double n = features_scaled.size();
double denominator = n * sum_xx - sum_x * sum_x;

// If every similarity is identical the denominator is 0; in that case
// fall back to predicting the mean rating instead of dividing by zero
double a = 0.0;
double b = sum_y / n;
if (denominator != 0.0) {
    a = (n * sum_xy - sum_x * sum_y) / denominator;
    b = (sum_y * sum_xx - sum_x * sum_xy) / denominator;
}

std::cout << "Fitted regression: rating = " << a << " * similarity + " << b << std::endl;
```

This code fits a simple linear regression model using the standardized similarity feature. You can extend this to multiple features with more advanced techniques.
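The standardization and least-squares steps above can also be packaged as small reusable functions. The following is a minimal sketch rather than a definitive implementation: `standardize`, `fit_line`, and `LinearFit` are hypothetical names, the standard deviation is the population form (dividing by n) to match the lesson, and returning the mean of y when all x values are identical is an assumed fallback.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result type: y = slope * x + intercept.
struct LinearFit {
    double slope;
    double intercept;
};

// Standardize one feature column: subtract the mean, divide by the
// population standard deviation. A constant column maps to all zeros.
std::vector<double> standardize(const std::vector<double>& values) {
    assert(!values.empty());
    double mean = 0.0;
    for (double v : values) mean += v;
    mean /= values.size();

    double variance = 0.0;
    for (double v : values) variance += (v - mean) * (v - mean);
    const double sd = std::sqrt(variance / values.size());

    std::vector<double> scaled(values.size(), 0.0);
    if (sd == 0.0) return scaled;
    for (std::size_t i = 0; i < values.size(); ++i) {
        scaled[i] = (values[i] - mean) / sd;
    }
    return scaled;
}

// Ordinary least squares for a single feature.
LinearFit fit_line(const std::vector<double>& x, const std::vector<double>& y) {
    assert(!x.empty() && x.size() == y.size());
    double sum_x = 0.0, sum_y = 0.0, sum_xx = 0.0, sum_xy = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sum_x += x[i];
        sum_y += y[i];
        sum_xx += x[i] * x[i];
        sum_xy += x[i] * y[i];
    }
    const double n = static_cast<double>(x.size());
    const double denominator = n * sum_xx - sum_x * sum_x;
    if (denominator == 0.0) {
        // Assumed fallback: all x identical, so predict the mean of y.
        return {0.0, sum_y / n};
    }
    return {(n * sum_xy - sum_x * sum_y) / denominator,
            (sum_y * sum_xx - sum_x * sum_xy) / denominator};
}
```

One consequence worth noticing: a standardized feature has mean 0, so the fitted intercept equals the mean of the ratings. For the lesson's ratings {4.0, 5.0, 3.0}, that intercept is 4.0 (up to floating-point rounding).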

Predicting Test Song Ratings

Finally, let's define a test song, process its features, and use our regression model to predict its rating. Note that the test song is standardized with the means and standard deviations computed from the original tracks, not with statistics of its own.

```cpp
// Define a test song
Track test_song = {"Test Song", "Rock", 120, 350, 110, 6000};

// Map genre for test song (at() throws if the genre is unknown)
const std::vector<int>& test_song_genre_vector = genre_map.at(test_song.genre);

// Calculate similarity for the test song
double test_song_similarity = cosine_similarity(test_song_genre_vector, user_genre_preferences);

// Prepare test song features
std::vector<double> test_song_features = {
    static_cast<double>(test_song.likes),
    static_cast<double>(test_song.clicks),
    static_cast<double>(test_song.full_listens),
    static_cast<double>(test_song.author_listeners),
    test_song_similarity
};

// Standardize test song features using the means and stds from the training tracks
std::vector<double> test_song_features_scaled(5, 0.0);
for (int j = 0; j < 5; ++j) {
    if (stds[j] != 0)
        test_song_features_scaled[j] = (test_song_features[j] - means[j]) / stds[j];
    else
        test_song_features_scaled[j] = 0.0;
}

// Predict rating using the regression model (using only similarity feature for simplicity)
double predicted_rating = a * test_song_features_scaled[4] + b;
std::cout << "Predicted rating for the test song: " << predicted_rating << std::endl;
```

By defining features for a new track and calculating its similarity to the user's preferences, we can use the regression model to predict the track's rating. In the same way, you can predict ratings for multiple songs and recommend the ones with the highest predicted ratings.
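Ranking several songs by predicted rating could be sketched as follows. This assumes each candidate's similarity has already been standardized with the training means and standard deviations; `Candidate` and `rank_candidates` are hypothetical names introduced for illustration, and `slope` and `intercept` play the roles of a and b from the fitted model.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical container: a candidate track and its standardized similarity.
struct Candidate {
    std::string name;
    double similarity_scaled;
};

// Returns (name, predicted rating) pairs sorted from the highest to the
// lowest predicted rating, where rating = slope * similarity_scaled + intercept.
std::vector<std::pair<std::string, double>> rank_candidates(
        const std::vector<Candidate>& candidates,
        double slope, double intercept) {
    std::vector<std::pair<std::string, double>> ranked;
    ranked.reserve(candidates.size());
    for (const Candidate& c : candidates) {
        ranked.emplace_back(c.name, slope * c.similarity_scaled + intercept);
    }
    // stable_sort keeps the input order for candidates with equal predictions.
    std::stable_sort(ranked.begin(), ranked.end(),
                     [](const std::pair<std::string, double>& lhs,
                        const std::pair<std::string, double>& rhs) {
                         return lhs.second > rhs.second;
                     });
    return ranked;
}
```

The first entries of the returned vector are the songs to recommend. Keeping the predicted rating alongside each name makes it easy to apply a cutoff, such as recommending only songs predicted above a chosen rating.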

Summary and Preparation for Practice

In this lesson, you've successfully integrated advanced content-based recommendation concepts, from simulating user preferences to predicting track ratings with a regression model. You've combined data representation, similarity calculations, and regression insights to create a concrete recommendation system in C++. As you move on to practice exercises, use this lesson as a framework for applying similar techniques to your unique datasets and user scenarios. This practical experience will consolidate your understanding and proficiency, enabling you to build sophisticated content-based recommendation systems independently.
