Introduction

Welcome to the first lesson of "Training Neural Networks: the Backpropagation Algorithm"! This is the third course in our "Neural networks from scratch" path. In the previous courses, we started by defining single neurons, worked our way through layers and the Multi-Layer Perceptron (MLP) architecture, implemented various activation functions like ReLU and sigmoid, and explored proper weight initialization techniques.

So far, we've created neural networks that can make predictions, but there's a critical question we haven't addressed: how do we know if those predictions are any good? More importantly, how can we systematically improve them? This is where loss functions come in.

In this course, we'll finally tackle the most exciting part of neural networks: training them to learn from data. This is admittedly the most mathematically intensive part of our journey, as we'll be working with concepts like gradients, derivatives, and the backpropagation algorithm. But don't worry! We'll build up these concepts gradually and provide intuitive explanations alongside the mathematics.

Our first step on this journey is to understand how to measure the error of our network's predictions, which is the focus of today's lesson on loss functions and specifically Mean Squared Error (MSE).

Understanding Loss Functions

Before we dive into specific loss functions, let's understand what they are and why they're crucial.

A loss function (sometimes called a cost function or objective function) measures how far our model's predictions deviate from the true values. It quantifies the "wrongness" of our predictions into a single number that we aim to minimize through training. The lower the loss, the better our model is performing.

Think of a loss function as a kind of "fitness score" or "report card" for our neural network: a high loss value means the predictions are far from the truth and performance is poor, while a low loss value means the predictions are close to the truth and performance is good; in fact, a perfect model would have a loss of zero, indicating its predictions exactly match the true values.

Loss functions are essential because they:

  1. Provide direction: they tell us whether changes to our model are helping or hurting.
  2. Enable optimization: their mathematical properties allow us to use algorithms to minimize them.
  3. Quantify performance: they give us a consistent way to measure and compare model quality.

Different tasks require different loss functions. For example, binary classification problems often use Binary Cross Entropy (BCE), while regression problems (predicting continuous values) typically use Mean Squared Error (MSE), which is what we'll focus on today and for the remainder of this course path.

Mean Squared Error: Mathematical Foundation

Mean Squared Error is one of the most widely used loss functions for regression tasks. It's intuitive, mathematically well-behaved, and computationally efficient.

The mathematical formula for MSE is:

MSE = \frac{1}{n} \sum (y_{true} - y_{pred})^2

Where:

  • n is the number of samples
  • y_true is the actual target value
  • y_pred is our model's prediction
  • Σ represents the sum over all samples

In words, MSE:

  1. Takes the difference between each predicted value and the corresponding true value;
  2. Squares each difference (making all values positive and penalizing larger errors more heavily);
  3. Calculates the average of these squared differences.
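
For example, with y_true = [3, 5] and y_pred = [2, 7]: the differences are [1, -2], the squared differences are [1, 4], and the MSE is (1 + 4) / 2 = 2.5.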

Why square the differences? There are several good reasons:

  • It treats positive and negative errors equally.
  • It penalizes larger errors disproportionately more (an error of 2 is penalized 4 times as much as an error of 1).
  • It creates a smooth function that's easy to differentiate (critical for the gradient-based optimization we'll learn about).

Implementing the MSE Loss Function

Now that we understand MSE conceptually, let's implement it in code. The beauty of MSE is that its implementation is remarkably straightforward, especially with mathjs's vectorized operations:

Let's break down how this works:

  1. math.subtract(yTrue, yPred) calculates the difference between each true value and prediction
  2. math.dotPow(diff, 2) squares each difference
  3. math.mean(squared) calculates the average of all squared differences

This implementation works for both single samples and batches of data thanks to mathjs's broadcasting capabilities. Whether yTrue and yPred are single values or entire arrays, the function will compute the appropriate MSE.

The elegance of this implementation highlights why mathjs is so valuable for numerical computing — what would require explicit loops in standard JavaScript is accomplished in a single, efficient operation.

Setting Up Our Neural Network

To demonstrate how we can use our MSE loss function with a neural network, let's first set up a simple MLP using the architecture we developed in the previous course. We'll create a network with two layers: a hidden layer with ReLU activation and an output layer with linear activation (appropriate for regression tasks).

We're building a neural network with:

  • An input layer that accepts 3 features
  • A hidden layer with 5 neurons and ReLU activation
  • An output layer with 1 neuron and linear activation (because we're doing regression)

Using linear activation for the output layer is a common choice for regression problems since we want to predict unbounded continuous values. We're reusing our DenseLayer and MLP classes from the previous course, which include the weight initialization strategies we learned about.

Calculating Loss for a Single Prediction

Now that we have our network set up, let's make a prediction and calculate the MSE loss for a single sample:

This code:

  1. Passes our input sample through the network using the forward method
  2. Prints the input, true value, and predicted value
  3. Calculates the MSE loss between the true and predicted values
  4. Prints the loss value rounded to 4 decimal places

The output would look something like:
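
An illustrative run (the input values are made up here, and the exact numbers vary with the random weight initialization):

```
Input: [0.5, -0.2, 0.1]
True value: 0.8
Predicted value: 0.00002
MSE Loss: 0.6400
```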

Notice how our untrained network's prediction (approximately 0.00002) is very different from the true value (0.8), resulting in a relatively high MSE of 0.64. This is expected since we haven't trained the network yet — its weights are still initialized randomly. The purpose of training will be to adjust those weights to minimize this loss.

Working with Batches of Data

In practice, neural networks are rarely trained or evaluated on a single example at a time. Instead, we process multiple examples simultaneously in "batches," which improves computational efficiency and training stability. Let's see how our MSE function handles a batch of data:

This code:

  1. Creates a batch of 3 samples, each with 3 features
  2. Creates a corresponding batch of 3 true values
  3. Passes the entire batch through the network at once
  4. Calculates the MSE loss across the entire batch

The output would be:
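
An illustrative run for the batch (again, exact numbers depend on the random initialization):

```
Batch MSE Loss: 0.3100
```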

Note that the batch MSE (0.31) is different from our single sample MSE (0.64). This is because the batch MSE is the average error across all samples in the batch. Some predictions might be closer to their true values than others, leading to a different overall average.

The ability to process batches efficiently is one of the key advantages of using mathjs and proper vectorization in our implementation. Without any code changes, our mseLoss function handles both single samples and batches seamlessly.

Conclusion and Next Steps

Congratulations! We've taken our first step into the world of neural network training by understanding and implementing the Mean Squared Error loss function. This gives us a way to quantify how good (or bad) our network's predictions are, which is the foundation for learning from data. We've seen how MSE can be efficiently calculated for both individual samples and batches, and how an untrained network produces high loss values that we'll aim to reduce through training.

In the upcoming practice exercises, you'll have the opportunity to experiment with the MSE loss function and see how it behaves with different predictions. After that, we'll continue our journey by exploring how to use this loss measure to actually improve our network through training. We'll learn about gradients, which tell us how changes in each weight affect the loss, and the backpropagation algorithm, which efficiently computes these gradients across all layers of the network. Keep learning!
