Lesson 5
Implementing Factorization Machines
Introduction to Factorization Machines

Welcome to this lesson on factorization machines, an important model in the realm of recommendation systems. Factorization machines (FMs) excel at capturing interactions between variables, making them a powerful tool for both regression and classification tasks. For instance, they can predict a rating (regression) or estimate the likelihood that a user will respond to a recommendation (classification).

Review of Dataset Preparation

Before we delve into the implementation of a factorization machine, let's briefly revisit the dataset preparation process from the previous lesson. Even though we won't repeat the entire code here, it's crucial to remember the structure we've established.

In the prior lesson, you learned how to load JSON files and create a user-item interaction matrix using dummy variables. Additionally, you enriched the dataset with auxiliary features like user preferences and genre similarity. These steps laid the groundwork for accurately predicting ratings in a recommendation system. Recall the importance of these preparatory steps as we move forward.

Theory Behind Factorization Machines

Factorization machines leverage interactions between variables by decomposing them into simpler, latent factors. Mathematically, the prediction for a factorization machine can be expressed as:

\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j

Here's what each component represents:

  • \hat{y}(\mathbf{x}): The predicted value.
  • w_0: The global bias term.
  • w_i: The weight associated with the feature x_i.
  • x_i: The individual features of the input vector \mathbf{x}.
  • \langle \mathbf{v}_i, \mathbf{v}_j \rangle: The dot product between the latent vectors of two features, capturing their pairwise interaction.
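
Computing every pairwise interaction directly would require a double loop over all feature pairs. A well-known algebraic identity lets us rewrite the interaction term so that it needs only one pass over the features per latent factor, and this is the form the implementation later in this lesson relies on:

\sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j = \frac{1}{2} \sum_{f=1}^{k} \left[ \left( \sum_{i=1}^{n} v_{i,f} x_i \right)^2 - \sum_{i=1}^{n} v_{i,f}^2 x_i^2 \right]

where k is the number of latent factors (n_factors in the code).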
Latent Vectors

Latent vectors are fundamental components in factorization machines used to capture complex pairwise interactions between features. Each column in the dataset is represented by a latent vector, and the interaction between different columns is determined by the dot product of these latent vectors.

Let's consider the dataset with the columns described:

  • The columns user1, user2, user3, item1, item2, item3 are one-hot encoded features indicating which user-item pair the row represents.
  • uf1, uf2 are additional user features (e.g., user attributes).
  • if1, if2 are additional item features (e.g., item attributes).
  • r is the rating given by the user to the item.

In the context of this dataset, each feature column has an associated latent vector, which is initialized randomly and learned during training. These latent vectors help to infer interactions between features that are not explicitly represented in the data.

For demonstration, let's consider a latent factor size of 2 for simplicity, though in practice, n_factors can be larger.

Feature | Latent Vector
--------|----------------------
user1   | [v_{u1,1}, v_{u1,2}]
user2   | [v_{u2,1}, v_{u2,2}]
user3   | [v_{u3,1}, v_{u3,2}]
item1   | [v_{i1,1}, v_{i1,2}]
item2   | [v_{i2,1}, v_{i2,2}]
item3   | [v_{i3,1}, v_{i3,2}]
uf1     | [v_{uf1,1}, v_{uf1,2}]
uf2     | [v_{uf2,1}, v_{uf2,2}]
if1     | [v_{if1,1}, v_{if1,2}]
if2     | [v_{if2,1}, v_{if2,2}]

For an example row where user2 and item3 are the active one-hot columns, the features participating in interactions are user2, item3, uf1, uf2, if1, and if2.

  1. Dot Product Calculation: The interaction between any two features (e.g., user2 and item3) is captured by the dot product of their latent vectors.

    \langle \mathbf{v}_{u2}, \mathbf{v}_{i3} \rangle = v_{u2,1} \cdot v_{i3,1} + v_{u2,2} \cdot v_{i3,2}

    For each pair of interacting features, this dot product evaluates how much they co-influence each other in predicting the rating.

  2. Extending to All Feature Pairs: The interaction contributions are summed over all pairs of non-zero features, each weighted by the product of the corresponding feature values:

    \text{interaction\_term} = \sum_{\text{pairs } (i, j)} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j

  3. Example Calculation: Given \mathbf{v}_{u2} = [0.5, 0.3] and \mathbf{v}_{i3} = [0.4, 0.7], the dot product is:

    \langle \mathbf{v}_{u2}, \mathbf{v}_{i3} \rangle = 0.5 \cdot 0.4 + 0.3 \cdot 0.7 = 0.2 + 0.21 = 0.41
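
To make this concrete, here is a minimal NumPy check (purely illustrative, using the values from the example above) that reproduces the dot product:

Python
import numpy as np

# Latent vectors for user2 and item3 from the example above
v_u2 = np.array([0.5, 0.3])
v_i3 = np.array([0.4, 0.7])

# Pairwise interaction strength between user2 and item3
interaction = np.dot(v_u2, v_i3)
print(interaction)  # approximately 0.41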

Latent vectors "encode" features, allowing compact representations of complex relationships beyond direct correlations. They effectively project user, item, and auxiliary feature information into a shared latent space, where their interactions can be learned and used for making predictions.

Thus, by updating these latent vectors through the fit method, the factorization machine fine-tunes its understanding of how different features interact, boosting its predictive performance.

Implementing the Factorization Machine Model: Part 1

Let's move on to the implementation of the factorization machine model. We'll break this into parts to ensure clarity.

First, let's define the __init__ method to initialize the required data.

Python
import numpy as np

class SimpleFactorizationMachine:
    def __init__(self, n_factors, n_features, learning_rate=0.01, epochs=100, reg=0.01):
        self.n_factors = n_factors
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.reg = reg

        self.w0 = 0  # Global bias term
        self.W = np.zeros(n_features)  # Linear coefficients
        self.V = np.random.normal(0, 0.1, (n_features, n_factors))  # Interaction factors

In the __init__ method, we initialize several key parameters of the factorization machine.

  • n_factors: This defines the number of components in each latent vector. It represents the dimensionality of the latent space for each feature, capturing the complexity of interactions.
  • n_features: This is the total number of features in the dataset.

Together, n_factors and n_features define the dimensions of the interaction matrix V, which is of size (n_features, n_factors). Each row in this matrix corresponds to a feature, and each column corresponds to a component of the latent vector for that feature.

The learning_rate, epochs, and reg are hyperparameters governing the learning process. The w0 is the global bias, W stores linear coefficients for features, and V contains the interaction factors, initialized with small random values.
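
As a quick illustration (the numbers here are hypothetical), instantiating the model with 10 features and 2 latent factors produces a weight vector of length 10 and an interaction matrix of shape (10, 2):

Python
# Hypothetical example: 10 features, 2 latent factors per feature
fm = SimpleFactorizationMachine(n_factors=2, n_features=10)
print(fm.W.shape)  # (10,)
print(fm.V.shape)  # (10, 2)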

Gradient Descent

Before explaining the fit method, let's briefly discuss gradient descent, a cornerstone optimization algorithm for training models. The principle behind gradient descent is to minimize a function by iteratively moving in the opposite direction of the gradient of the function. We move in the opposite direction because the gradient indicates the direction of the steepest increase, while we aim to minimize the loss function. In the context of our factorization machine, we adjust model parameters to minimize the error between predicted and actual outputs.

The gradient descent update rule for a parameter \theta is given by:

\theta := \theta - \alpha \frac{\partial}{\partial \theta} J(\theta)

Here, \alpha is the learning rate, dictating the step size of each iteration, and J(\theta) is the cost function measuring prediction error.
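
To see the update rule in action, here is a small standalone sketch (a toy example, unrelated to the factorization machine itself) that minimizes J(\theta) = (\theta - 3)^2 by gradient descent:

Python
# Toy example: minimize J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
theta = 0.0
alpha = 0.1  # learning rate

for _ in range(100):
    grad = 2 * (theta - 3)  # derivative of the cost at the current theta
    theta -= alpha * grad   # step in the opposite direction of the gradient

print(theta)  # converges toward 3, the minimizer of J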

Implementing the Factorization Machine Model: Part 2

Next, we define the fit method that uses gradient descent to train the algorithm.

Python
def fit(self, X, y):
    m, n = X.shape
    for epoch in range(self.epochs):
        for i in range(m):
            # Calculate linear terms by combining global bias and feature coefficients
            linear_terms = self.w0 + np.dot(X[i], self.W)

            # Calculate interaction terms using the per-factor reformulation
            interaction_term = sum(
                (np.dot(X[i], self.V[:, f]) ** 2 - np.dot(X[i] ** 2, self.V[:, f] ** 2)) / 2
                for f in range(self.n_factors))

            # Combine linear terms and interaction terms to get the prediction
            predictions = linear_terms + interaction_term
            # Calculate prediction error
            err = predictions - y[i]

            # Update global bias using gradient descent
            self.w0 -= self.learning_rate * err
            # Update linear coefficients with regularization
            self.W -= self.learning_rate * (err * X[i] + self.reg * self.W)

            # Update interaction factors for each latent dimension
            for f in range(self.n_factors):
                V_f = self.V[:, f]
                # Gradient of the interaction term w.r.t. V[:, f]:
                # X[i] * (X[i] @ V_f) minus the self-interaction X[i] ** 2 * V_f
                self.V[:, f] -= self.learning_rate * (
                    err * (X[i] * (X[i] @ V_f) - X[i] ** 2 * V_f) + self.reg * V_f)

  • Initialize Parameters and Loop: We begin by determining the shape of the data (m, n) and iterating over each epoch to train the model.
  • Calculate Linear Terms: The linear terms are computed by summing the global bias w0 and the dot product of the feature coefficients W with the data instance X[i].
  • Calculate Interaction Terms: For each latent factor f, the interaction contribution is half the difference between the squared dot product of X[i] with V[:, f] and the dot product of the squared feature values with the squared factor values, the efficient per-factor reformulation of the pairwise sum shown earlier.
  • Compute Predictions and Error: Combine linear and interaction terms for predictions, then compute the error by subtracting actual ratings from predicted values.
  • Update Global Bias: The global bias is updated with the gradient of the error.
  • Update Linear Coefficients: Linear coefficients are adjusted using gradient descent, with regularization to prevent overfitting.
  • Update Interaction Factors: Each latent factor column is updated using the prediction error, with regularization to keep the factors small. The gradient of the interaction term with respect to V[:, f] has two parts:
    • The term X[i] * (X[i] @ V_f) multiplies each feature value by the factor-f weighted sum of all feature values, capturing how every active feature contributes to the interactions.
    • The term X[i] ** 2 * V_f removes each feature's interaction with itself, since the model only considers interactions between distinct features.

Subtracting the self-interaction ensures the latent vectors learn genuine pairwise effects, while the regularization term self.reg * V_f keeps them from growing too large and overfitting.
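
If you want to convince yourself that the per-factor formula used to compute interaction_term really equals the naive sum over all feature pairs, the following sketch (with randomly generated values, purely for checking) should print two numbers that agree up to floating-point error:

Python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_factors = 6, 3
x = rng.random(n_features)                       # a random feature vector
V = rng.normal(0, 0.1, (n_features, n_factors))  # random latent vectors

# Naive sum over all distinct feature pairs
naive = sum(np.dot(V[i], V[j]) * x[i] * x[j]
            for i in range(n_features) for j in range(i + 1, n_features))

# Reformulated per-factor version used in fit and predict
fast = sum((np.dot(x, V[:, f]) ** 2 - np.dot(x ** 2, V[:, f] ** 2)) / 2
           for f in range(n_factors))

print(naive, fast)  # the two values should match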

Implementing the Factorization Machine Model: Part 3

Finally, we define the predict method, which uses the model's learned coefficients to make predictions.

Python
def predict(self, X):
    m, n = X.shape
    y_pred = np.zeros(m)
    for i in range(m):
        # Calculate linear terms by combining global bias and feature coefficients
        linear_terms = self.w0 + np.dot(X[i], self.W)

        # Calculate interaction terms for each data instance
        interaction_term = sum(
            (np.dot(X[i], self.V[:, f]) ** 2 - np.dot(X[i] ** 2, self.V[:, f] ** 2)) / 2
            for f in range(self.n_factors))

        # Sum linear and interaction terms for the prediction
        y_pred[i] = linear_terms + interaction_term
    return y_pred

  • Initialize Predictions Array: Start by creating an array to store predictions for each data instance.
  • Calculate Linear Terms: For each instance, compute linear terms by adding the global bias to the dot product of features with their coefficients.
  • Calculate Interaction Terms: Iterate over all latent factors to compute the interaction term for each instance.
  • Store Predictions: For each instance, sum the linear and interaction terms to calculate and store the predicted value.
Making Predictions and Evaluating Model Performance

Now that the factorization machine is ready, let's see it in action.

Python
import numpy as np
from data_extraction import extract_dataset
from sklearn.model_selection import train_test_split

# Extract dataset
df = extract_dataset()

# Split dataset into train and test
X = df.drop(columns=['rating']).values
y = df['rating'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define and train factorization machine model
fm_model = SimpleFactorizationMachine(n_factors=3, n_features=X_train.shape[1], learning_rate=0.001, epochs=1000)
fm_model.fit(X_train, y_train)

# Test the model
y_pred = fm_model.predict(X_test)

# Evaluate model using Mean Absolute Error
mae = np.mean(np.abs(y_test - y_pred))
print(f'Mean Absolute Error: {mae:.4f}')
# Mean Absolute Error: 0.9364

The Mean Absolute Error (MAE) is defined as:

\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|

where n is the number of observations, y_i is the actual value, and \hat{y}_i is the predicted value.

Here, we extract the dataset, split it into training and testing partitions, initialize the factorization machine, and fit it on the training data. Predictions are then made on the test set and evaluated with the Mean Absolute Error (MAE). On a small dataset like this one the error may be relatively high; larger datasets typically yield better performance.
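
If you prefer not to compute the metric by hand, scikit-learn provides an equivalent helper; the snippet below (assuming the same y_test and y_pred arrays as above) produces the same value:

Python
from sklearn.metrics import mean_absolute_error

# Same computation as np.mean(np.abs(y_test - y_pred))
mae = mean_absolute_error(y_test, y_pred)
print(f'Mean Absolute Error: {mae:.4f}')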

Conclusion and Summary

In this lesson, we successfully implemented and evaluated a factorization machine model for recommendation systems. We've gone from initializing parameters, through training, to making predictions and evaluating performance. This concludes our exploration of factorization machines and marks the end of this course module.

Congratulations on completing the course! The skills you've acquired here form a strong foundation for building and understanding recommendation systems. Continue exploring other models and refine your expertise in this dynamic field. Well done!
