Welcome to the first lesson of "Training Neural Networks: the Backpropagation Algorithm"! This is the third course in our "Neural networks from scratch" path. In the previous courses, we started with defining single neurons, worked our way through layers and the Multi-Layer Perceptron (MLP) architecture, implemented various activation functions like `ReLU` and `sigmoid`, and explored proper weight initialization techniques.
So far, we've created neural networks that can make predictions, but there's a critical question we haven't addressed: how do we know if those predictions are any good? More importantly, how can we systematically improve them? This is where loss functions come in.
In this course, we'll finally tackle the most exciting part of neural networks: training them to learn from data. This is admittedly the most mathematically intensive part of our journey, as we'll be working with concepts like gradients, derivatives, and the backpropagation algorithm. But don't worry! We'll build up these concepts gradually and provide intuitive explanations alongside the mathematics.
Our first step on this journey is to understand how to measure the error of our network's predictions, which is the focus of today's lesson on loss functions and specifically Mean Squared Error (MSE).
Before we dive into specific loss functions, let's understand what they are and why they're crucial.
A loss function (sometimes called a cost function or objective function) measures how far our model's predictions deviate from the true values. It quantifies the "wrongness" of our predictions into a single number that we aim to minimize through training. The lower the loss, the better our model is performing.
Think of a loss function as a kind of "fitness score" or "report card" for our neural network: a high loss value means the predictions are far from the truth and performance is poor, while a low loss value means the predictions are close to the truth and performance is good; in fact, a perfect model would have a loss of zero, indicating its predictions exactly match the true values.
Loss functions are essential because they:
- Provide direction: they tell us whether changes to our model are helping or hurting.
- Enable optimization: their mathematical properties allow us to use algorithms to minimize them.
- Quantify performance: they give us a consistent way to measure and compare model quality.
Different tasks require different loss functions. For example, binary classification problems often use Binary Cross Entropy (BCE), while regression problems (predicting continuous values) typically use Mean Squared Error (MSE), which is what we'll focus on today and for the remainder of this course path.
Mean Squared Error is one of the most widely used loss functions for regression tasks. It's intuitive, mathematically well-behaved, and computationally efficient.
The mathematical formula for MSE is:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(y_{\text{true},i} - y_{\text{pred},i}\right)^2$$

Where:
- `n` is the number of samples
- `y_true` is the actual target value
- `y_pred` is our model's prediction
- `Σ` represents the sum over all samples
In words, MSE:
- Takes the difference between each predicted value and the corresponding true value;
- Squares each difference (making all values positive and penalizing larger errors more heavily);
- Calculates the average of these squared differences.
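For example, with made-up numbers: if the true values are `[1, 5]` and the predictions are `[2, 3]`, the differences are `[-1, 2]`, the squared differences are `[1, 4]`, and the MSE is `(1 + 4) / 2 = 2.5`.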
Why square the differences? There are several good reasons:
- It treats positive and negative errors equally.
- It penalizes larger errors disproportionately more (an error of 2 is penalized 4 times as much as an error of 1).
- It creates a smooth function that's easy to differentiate (critical for the gradient-based optimization we'll learn about).
Now that we understand MSE conceptually, let's implement it in code. The beauty of MSE is that its implementation is remarkably straightforward, especially with `mathjs`'s vectorized operations:
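A minimal sketch of such a function, assuming `mathjs` is installed and imported as `math` (the lesson's exact code may differ slightly):

```javascript
const math = require('mathjs');

function mseLoss(yTrue, yPred) {
  // Element-wise difference between true values and predictions
  const diff = math.subtract(yTrue, yPred);
  // Square each difference element-wise
  const squared = math.dotPow(diff, 2);
  // Average all squared differences into a single loss value
  return math.mean(squared);
}
```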
Let's break down how this works:
- `math.subtract(yTrue, yPred)` calculates the difference between each true value and prediction
- `math.dotPow(diff, 2)` squares each difference
- `math.mean(squared)` calculates the average of all squared differences
This implementation works for both single samples and batches of data thanks to `mathjs`'s broadcasting capabilities. Whether `yTrue` and `yPred` are single values or entire arrays, the function will compute the appropriate MSE.
The elegance of this implementation highlights why `mathjs` is so valuable for numerical computing: what would require explicit loops in standard JavaScript is accomplished in a single, efficient operation.
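For instance, the same `mseLoss` function sketched above handles a single prediction or an entire batch without modification (the numbers below are arbitrary):

```javascript
// Single prediction: (1.0 - 0.5)^2 = 0.25
console.log(mseLoss([1.0], [0.5]));               // 0.25

// Batch of predictions: mean of [0.01, 0.01, 0.04] ≈ 0.02
console.log(mseLoss([1, 2, 3], [1.1, 1.9, 3.2])); // ~0.02
```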
To demonstrate how we can use our MSE loss function with a neural network, let's first set up a simple `MLP` using the architecture we developed in the previous course. We'll create a network with two layers: a hidden layer with `ReLU` activation and an output layer with `linear` activation (appropriate for regression tasks).
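The setup might look roughly like this. The `DenseLayer` and `MLP` classes come from the previous course; the module path and constructor signature shown here (input size, output size, activation name) are assumptions and may differ from your own implementation:

```javascript
// Assumed import path and constructor signature -- adjust to match your classes.
const { DenseLayer, MLP } = require('./mlp');

const model = new MLP([
  new DenseLayer(3, 5, 'relu'),   // hidden layer: 3 inputs -> 5 neurons, ReLU
  new DenseLayer(5, 1, 'linear'), // output layer: 5 inputs -> 1 neuron, linear
]);
```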
We're building a neural network with:
- An input layer that accepts 3 features
- A hidden layer with 5 neurons and `ReLU` activation
- An output layer with 1 neuron and `linear` activation (because we're doing regression)
Using `linear` activation for the output layer is a common choice for regression problems since we want to predict unbounded continuous values. We're reusing our `DenseLayer` and `MLP` classes from the previous course, which include the weight initialization strategies we learned about.
Now that we have our network set up, let's make a prediction and calculate the MSE loss for a single sample:
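A sketch of this step is shown below; the input features and the target are made-up values for illustration, and the `forward` call assumes the interface from the previous course:

```javascript
// Hypothetical single sample: 3 input features and one true target value.
const x = [0.5, 0.1, 0.9];
const yTrue = [0.8];

// Forward pass through the network to get a prediction
const yPred = model.forward(x);

console.log('Input:', x);
console.log('True value:', yTrue);
console.log('Predicted value:', yPred);

// Compute the MSE loss and round it to 4 decimal places
console.log('MSE Loss:', math.round(mseLoss(yTrue, yPred), 4));
```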
This code:
- Passes our input sample through the network using the `forward` method
- Prints the input, true value, and predicted value
- Calculates the MSE loss between the true and predicted values
- Prints the loss value rounded to 4 decimal places
With the hypothetical sample above, the output would look something like the following (the exact prediction varies with the random weight initialization):
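```
Input: [ 0.5, 0.1, 0.9 ]
True value: [ 0.8 ]
Predicted value: [ 0.00002 ]
MSE Loss: 0.64
```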
Notice how our untrained network's prediction (approximately 0.00002) is very different from the true value (0.8), resulting in a relatively high MSE of 0.64. This is expected since we haven't trained the network yet — its weights are still initialized randomly. The purpose of training will be to adjust those weights to minimize this loss.
In practice, neural networks are rarely trained or evaluated on a single example at a time. Instead, we process multiple examples simultaneously in "batches," which improves computational efficiency and training stability. Let's see how our MSE function handles a batch of data:
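A sketch of this batch computation, again with made-up input and target values and reusing the `model`, `mseLoss`, and `math` names from the earlier sketches:

```javascript
// Hypothetical batch of 3 samples (3 features each) with matching targets.
const xBatch = [
  [0.5, 0.1, 0.9],
  [0.2, 0.7, 0.3],
  [0.9, 0.4, 0.6],
];
const yTrueBatch = [[0.5], [0.7], [0.4]];

// Forward the entire batch through the network at once
const yPredBatch = model.forward(xBatch);

// MSE averaged over every prediction in the batch, rounded to 4 decimal places
console.log('Batch MSE Loss:', math.round(mseLoss(yTrueBatch, yPredBatch), 4));
```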
This code:
- Creates a batch of 3 samples, each with 3 features
- Creates a corresponding batch of 3 true values
- Passes the entire batch through the network at once
- Calculates the MSE loss across the entire batch
With the hypothetical batch above, the output would be something like:
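```
Batch MSE Loss: 0.31
```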
Note that the batch MSE (0.31) is different from our single sample MSE (0.64). This is because the batch MSE is the average error across all samples in the batch. Some predictions might be closer to their true values than others, leading to a different overall average.
The ability to process batches efficiently is one of the key advantages of using `mathjs` and proper vectorization in our implementation. Without any code changes, our `mseLoss` function handles both single samples and batches seamlessly.
Congratulations! We've taken our first step into the world of neural network training by understanding and implementing the Mean Squared Error loss function. This gives us a way to quantify how good (or bad) our network's predictions are, which is the foundation for learning from data. We've seen how MSE can be efficiently calculated for both individual samples and batches, and how an untrained network produces high loss values that we'll aim to reduce through training.
In the upcoming practice exercises, you'll have the opportunity to experiment with the MSE loss function and see how it behaves with different predictions. After that, we'll continue our journey by exploring how to use this loss measure to actually improve our network through training. We'll learn about gradients, which tell us how changes in each weight affect the loss, and the backpropagation algorithm, which efficiently computes these gradients across all layers of the network. Keep learning!
