In the world of recommendation systems, one of the keys to success is understanding the similarity between users or items. This understanding forms the backbone of making accurate recommendations. Similarity measures allow us to identify users with similar preferences, improving the quality and relevance of recommendations.
In this lesson, we will explore Pearson Correlation, a tool used to measure similarity based on patterns in ratings. By the end, you will be able to implement this measure and understand its application in recommendation systems.
Before we dive in, let's quickly recap the setup from previous lessons. We will be using Python and numpy for this lesson. If you've done these steps before, consider this a helpful reminder.
Here's a simple code block that sets up numpy and creates user rating datasets:

```python
import numpy as np

# Example user ratings
user1_ratings = np.array([5, 3, 4, 2, 1, 5, 3, 4, 2, 1, 5, 3, 4, 2, 1])
user2_ratings = np.array([5, 3, 4, 5, 1, 3, 3, 4, 2, 1, 5, 2, 2, 2, 1])
```
Each index in `user1_ratings` and `user2_ratings` corresponds to the rating of the same item by both users.
These arrays can be extracted from the user-item matrix, but this time we will simply define them like this for brevity. If a rating is missing for one user in the user-item matrix, that item should be excluded from the calculation. This ensures that only ratings for items both users have rated are compared.
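The exclusion rule described above can be sketched as follows. This is a minimal illustration, not the lesson's own code: the matrix `user_item_matrix`, the helper `co_rated`, and the convention of marking missing ratings with `np.nan` are all assumptions made for the example.

```python
import numpy as np

# Hypothetical user-item matrix: rows are users, columns are items.
# np.nan marks an item the user has not rated (an assumed convention).
user_item_matrix = np.array([
    [5.0, 3.0, np.nan, 2.0, 1.0],
    [4.0, np.nan, 4.0, 1.0, 2.0],
])

def co_rated(matrix, user_a, user_b):
    """Return both users' ratings restricted to items that both have rated."""
    ratings_a, ratings_b = matrix[user_a], matrix[user_b]
    both_rated = ~np.isnan(ratings_a) & ~np.isnan(ratings_b)
    return ratings_a[both_rated], ratings_b[both_rated]

a, b = co_rated(user_item_matrix, 0, 1)
print(a)  # [5. 2. 1.]
print(b)  # [4. 1. 2.]
```

Items 2 and 3 (indices 1 and 2) are dropped because one of the two users has no rating for them; the remaining arrays have equal length and can be compared item by item.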
Pearson Correlation measures the strength and direction of a linear relationship between two sets of data. It's a popular tool in recommendation systems because it helps gauge the similarity between users based on their rating trends, rather than their absolute ratings.
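The formula for Pearson Correlation is:

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

- $x_i$, $y_i$ are the individual ratings.
- $\bar{x}$, $\bar{y}$ are the mean ratings for each user.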
- The numerator sums the product of the differences from the mean.
- The denominator normalizes this sum by the square root of the product of the two users' summed squared differences.
The Pearson correlation coefficient ranges from -1 to 1:
- A coefficient of 1 indicates a perfect positive linear relationship.
- A coefficient of -1 indicates a perfect negative linear relationship.
- A coefficient of 0 indicates no linear correlation.
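To make these three reference values concrete, here is a minimal sketch using NumPy's `np.corrcoef`. The rating arrays are made-up illustrations rather than data from the lesson. Note that the "harsh" user rates every item two points lower yet still correlates perfectly with the base user, which is what is meant by comparing trends rather than absolute ratings.

```python
import numpy as np

base = np.array([3, 3, 4, 5, 5])

harsh = base - 2                        # same trend, every rating 2 points lower
reversed_taste = 6 - base               # high where base is low, and vice versa
unrelated = np.array([2, 4, 3, 2, 4])   # chosen so its covariance with base is zero

# np.corrcoef returns a 2x2 correlation matrix; entry [0, 1] is the coefficient.
r_same = np.corrcoef(base, harsh)[0, 1]               # close to 1
r_opposite = np.corrcoef(base, reversed_taste)[0, 1]  # close to -1
r_none = np.corrcoef(base, unrelated)[0, 1]           # close to 0

print(r_same, r_opposite, r_none)
```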
Let's break down the implementation of the Pearson Correlation function to understand how to calculate it step by step.

```python
# Function to calculate Pearson correlation between two users
def pearson_correlation(ratings1, ratings2):
    n = len(ratings1)
    assert n == len(ratings2)  # Check if both arrays have the same length

    # Calculate means
    mean1 = np.mean(ratings1)
    mean2 = np.mean(ratings2)

    # Calculate the difference from the mean
    diff1 = ratings1 - mean1
    diff2 = ratings2 - mean2

    # Calculate numerator and denominator
    numerator = np.sum(diff1 * diff2)
    denominator = np.sqrt(np.sum(diff1 ** 2) * np.sum(diff2 ** 2))

    if denominator == 0:
        return 0  # Prevent division by zero
    else:
        return numerator / denominator
```
- Calculate Means: We compute the mean rating for each user to understand their general preference level.
- Difference from Mean: We find how each rating deviates from the mean.
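- Numerator and Denominator: The numerator sums the element-wise products of the two users' differences; the denominator is the square root of the product of their summed squared differences. If the denominator is zero (for example, when a user gives every item the same rating), the function returns 0 instead of dividing by zero.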
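As a sanity check, the hand-written calculation can be compared with NumPy's built-in `np.corrcoef`. The sketch below restates the lesson's function in compact form so that it runs on its own, and reuses the example ratings defined earlier; the variable names `manual` and `builtin` are introduced here only for illustration.

```python
import numpy as np

def pearson_correlation(ratings1, ratings2):
    """Compact restatement of the lesson's function."""
    diff1 = ratings1 - np.mean(ratings1)
    diff2 = ratings2 - np.mean(ratings2)
    denominator = np.sqrt(np.sum(diff1 ** 2) * np.sum(diff2 ** 2))
    return 0 if denominator == 0 else np.sum(diff1 * diff2) / denominator

user1_ratings = np.array([5, 3, 4, 2, 1, 5, 3, 4, 2, 1, 5, 3, 4, 2, 1])
user2_ratings = np.array([5, 3, 4, 5, 1, 3, 3, 4, 2, 1, 5, 2, 2, 2, 1])

manual = pearson_correlation(user1_ratings, user2_ratings)
# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the coefficient.
builtin = np.corrcoef(user1_ratings, user2_ratings)[0, 1]

print(np.isclose(manual, builtin))  # True
```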
Let's apply this function to our previous example user ratings and see how it works in practice.
```python
# Calculate and print Pearson correlation
pearson_similarity = pearson_correlation(user1_ratings, user2_ratings)
print(f"Pearson Correlation: {pearson_similarity:.2f}")
```

Output:

```
Pearson Correlation: 0.70
```
In this context, a Pearson Correlation of 0.70 indicates a fairly strong positive relationship between the rating trends of the two users. This suggests that they have similar tastes, which makes it easier to recommend new items they might both enjoy.
The term 'correlation' refers to a statistical measure of the extent to which two variables change together. For instance, two users may give identical ratings to more items than another pair does and still have a lower correlation, because correlation reflects how the ratings move together across all items.
Let's illustrate with a third user:
```python
# Additional user ratings
user1_ratings = np.array([2, 2, 3, 4, 4])
user2_ratings = np.array([2, 5, 3, 1, 4])
user3_ratings = np.array([1, 1, 2, 4, 3])

# Calculate Pearson correlations
pearson_similarity_12 = pearson_correlation(user1_ratings, user2_ratings)
pearson_similarity_13 = pearson_correlation(user1_ratings, user3_ratings)

print(f"Pearson Correlation between User 1 and User 2: {pearson_similarity_12:.2f}")  # -0.32
print(f"Pearson Correlation between User 1 and User 3: {pearson_similarity_13:.2f}")  # 0.96
```
Although User 1 gives the same rating as User 2 on three items and the same rating as User 3 on only one, User 1's ratings move much more closely with User 3's. This illustrates that correlation captures the trend in ratings rather than the number of identical ratings.
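The comparison can be checked directly. The sketch below assumes that "common ratings" means items on which two users gave exactly the same score; the names `identical_12` and `identical_13` are introduced only for this illustration.

```python
import numpy as np

user1_ratings = np.array([2, 2, 3, 4, 4])
user2_ratings = np.array([2, 5, 3, 1, 4])
user3_ratings = np.array([1, 1, 2, 4, 3])

# Count the items on which each pair gave exactly the same rating.
identical_12 = int(np.sum(user1_ratings == user2_ratings))
identical_13 = int(np.sum(user1_ratings == user3_ratings))

print(identical_12)  # 3
print(identical_13)  # 1
```

User 2 matches User 1 exactly on three items, but disagrees sharply on the other two (5 versus 2, and 1 versus 4), which is enough to push the correlation below zero. User 3 matches on only one item but is consistently about one point lower, so the trend is almost the same.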
This is neither an advantage nor a disadvantage of the measure, but you should be aware of it when using it. Other metrics, such as cosine similarity computed on raw ratings, do not center each user's ratings on their mean and so also respond to absolute rating levels.
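To see how the two measures can disagree, here is a minimal sketch that computes cosine similarity on the raw ratings of User 1 and User 2 and compares it with Pearson Correlation. The helper `cosine_similarity` is written for this example and is not part of the lesson's code. The sketch also shows that Pearson Correlation equals the cosine similarity of the mean-centered ratings.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

user1 = np.array([2, 2, 3, 4, 4])
user2 = np.array([2, 5, 3, 1, 4])

raw_cosine = cosine_similarity(user1, user2)
# Subtracting each user's mean first turns cosine similarity into Pearson.
centered_cosine = cosine_similarity(user1 - user1.mean(), user2 - user2.mean())
pearson = np.corrcoef(user1, user2)[0, 1]

print(f"Raw cosine similarity: {raw_cosine:.2f}")       # 0.83
print(f"Centered cosine:       {centered_cosine:.2f}")  # -0.32
print(f"Pearson correlation:   {pearson:.2f}")          # -0.32
```

Because all ratings are positive, raw cosine similarity stays high even though the two users' rating trends run in opposite directions.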
Throughout this lesson, you've learned to calculate and apply Pearson Correlation to measure user similarity based on their ratings. This measure is a powerful tool in crafting accurate and personalized recommendations.
With these concepts and coding steps in mind, prepare yourself for the practice exercises that follow. Try implementing the code on your own and explore how varying datasets influence the correlation results. Remember, this hands-on experience will strengthen your understanding and competence in developing effective recommendation systems.