Welcome back to lesson 4 of “Building and Applying Your Neural Network Library”! You’ve made great progress so far. We started by modularizing our core components for layers and activations, then organized our training pipeline with modular loss functions and optimizers. Most recently, we built a high-level orchestration layer with our `Model` and `SequentialModel` classes, providing a clean API for building and training neural networks.
Now it’s time to put our complete neural network library to work on a real-world problem! While our XOR examples were perfect for learning, real machine learning applications involve working with actual datasets that come with their own challenges: multiple features, varying scales, missing values, and the need for proper data preprocessing.
In this lesson, we’ll prepare a California Housing dataset sample — a classic regression problem that predicts house prices based on various demographic and geographic features. We’ll learn essential data handling techniques, including loading real datasets, understanding feature characteristics, splitting data properly for evaluation, and applying feature scaling to ensure our neural network can learn effectively. This foundation will set us up perfectly for our final lesson, where we’ll apply our complete neural network library to solve this practical prediction problem.
Real datasets come with complexities that require careful handling before we can apply machine learning algorithms effectively:
- Feature diversity is one key challenge — real datasets often contain features measured in completely different units and scales. For example, our California Housing dataset includes features like median income, house age, average rooms per household, and geographic coordinates. These dramatically different scales can cause problems for neural networks, which work best when all inputs are in similar ranges.
- Data splitting becomes crucial when working with real datasets. Unlike our toy examples, where we could evaluate on the same data we trained on, real applications require us to reserve some data for testing. This allows us to get an honest estimate of how our model will perform on new, unseen data — the true test of machine learning success.
- Preprocessing requirements also become more sophisticated. We need to standardize features so they have similar scales, handle the train-test split properly to avoid data leakage, and ensure our preprocessing steps are applied consistently between training and testing phases.
For this lesson, you will be provided with the California Housing dataset in CSV format, located at `data/california_housing.csv`. This file contains several records, each with multiple features and a target value representing the median house value.
To load and parse the CSV file, you can use the PapaParse library, which is a popular and easy-to-use CSV parser for JavaScript. Here’s how you can use PapaParse to load the dataset, extract features and targets, and inspect the data:
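Here is a minimal sketch of that loading step, assuming a Node.js environment with the papaparse package installed; the variable names are illustrative rather than fixed parts of our library:

```javascript
const fs = require("fs");
const Papa = require("papaparse");

// Read the CSV file into a string and parse it with PapaParse.
// header: true uses the first row as column names; dynamicTyping: true
// converts numeric strings into numbers automatically.
const csvText = fs.readFileSync("data/california_housing.csv", "utf8");
const { data: rows } = Papa.parse(csvText, {
  header: true,
  dynamicTyping: true,
  skipEmptyLines: true,
});

// Treat every column except the target (MedHouseVal) as an input feature.
const targetName = "MedHouseVal";
const featureNames = Object.keys(rows[0]).filter((name) => name !== targetName);

const features = rows.map((row) => featureNames.map((name) => row[name]));
const targets = rows.map((row) => row[targetName]);

// Inspect the data: record count, feature names, and one sample row.
console.log(`Loaded ${rows.length} records with ${featureNames.length} features`);
console.log("Features:", featureNames.join(", "));
console.log("First record:", features[0], "-> target:", targets[0]);
```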
We have several housing records with multiple features each. The features include median income (`MedInc`), house age (`HouseAge`), average rooms per household (`AveRooms`), and geographic coordinates (`Latitude`, `Longitude`). Notice the dramatic scale differences: median income ranges around 3–8, house age is in years (21–52), while geographic coordinates are around 37–38 for latitude and -122 for longitude. The target variable (`MedHouseVal`) represents median house values in units of $100,000.
To evaluate our model honestly, we need to split our data into training and testing sets. We can shuffle the data and then slice it into two parts. For reproducibility, we’ll use a simple seeded random shuffle.
Here’s a basic implementation:
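One way to sketch this step uses a small seeded pseudo-random number generator (mulberry32 here) to drive a Fisher-Yates shuffle over the features and targets arrays from the loading step above. The helper and variable names are illustrative; only the randomState parameter is something the rest of this lesson refers to:

```javascript
// Small seeded PRNG (mulberry32) so the shuffle is reproducible.
function mulberry32(seed) {
  return function () {
    let t = (seed += 0x6d2b79f5);
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Shuffle features and targets together, then slice into train/test parts.
function trainTestSplit(features, targets, testSize = 0.2, randomState = 42) {
  const random = mulberry32(randomState);
  const indices = features.map((_, i) => i);

  // Fisher-Yates shuffle of the index array.
  for (let i = indices.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    [indices[i], indices[j]] = [indices[j], indices[i]];
  }

  const testCount = Math.floor(indices.length * testSize);
  const testIdx = indices.slice(0, testCount);
  const trainIdx = indices.slice(testCount);

  return {
    XTrain: trainIdx.map((i) => features[i]),
    yTrain: trainIdx.map((i) => targets[i]),
    XTest: testIdx.map((i) => features[i]),
    yTest: testIdx.map((i) => targets[i]),
  };
}

const { XTrain, yTrain, XTest, yTest } = trainTestSplit(features, targets, 0.2, 42);
console.log(`Training samples: ${XTrain.length}, test samples: ${XTest.length}`);
```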
We’ve allocated 80% of our data for training and 20% for testing. The `randomState` parameter ensures reproducible results — we’ll get the same split every time we run the code. This consistency is important for comparing different models or hyperparameters fairly.
The key principle here is that our model will never see the test data during training. This separation allows us to evaluate how well our neural network generalizes to new, unseen examples.
Neural networks are particularly sensitive to the scale of input features. When features have very different ranges, the network may have difficulty learning effectively. Let’s apply standardization (z-score normalization) to transform our features: each value is rescaled by subtracting the feature’s mean and dividing by its standard deviation, giving every feature zero mean and unit variance.
To build a standard scaler, we’ll compute the mean and standard deviation for each feature using only the training data, then use these statistics to scale both the training and test sets; this avoids data leakage. Here’s a standard scaler implementation:
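A sketch of such a scaler, written as a small class with fit and transform methods and applied to the split arrays from the previous step (the class and variable names are illustrative assumptions, not a fixed part of our library):

```javascript
class StandardScaler {
  // Compute per-feature mean and standard deviation from the training data only.
  fit(X) {
    const nFeatures = X[0].length;
    this.means = new Array(nFeatures).fill(0);
    this.stds = new Array(nFeatures).fill(0);

    for (const row of X) {
      row.forEach((value, j) => (this.means[j] += value));
    }
    this.means = this.means.map((sum) => sum / X.length);

    for (const row of X) {
      row.forEach((value, j) => (this.stds[j] += (value - this.means[j]) ** 2));
    }
    // Guard against zero variance so we never divide by zero.
    this.stds = this.stds.map((sum) => Math.sqrt(sum / X.length) || 1);

    return this;
  }

  // Apply z-score normalization, (value - mean) / std, using the fitted statistics.
  transform(X) {
    return X.map((row) =>
      row.map((value, j) => (value - this.means[j]) / this.stds[j])
    );
  }
}

// Fit on the training features only, then transform both splits.
const featureScaler = new StandardScaler().fit(XTrain);
const XTrainScaled = featureScaler.transform(XTrain);
const XTestScaled = featureScaler.transform(XTest);

// The target can be scaled the same way by wrapping each value in a one-element row.
const targetScaler = new StandardScaler().fit(yTrain.map((y) => [y]));
const yTrainScaled = targetScaler.transform(yTrain.map((y) => [y])).map((r) => r[0]);
const yTestScaled = targetScaler.transform(yTest.map((y) => [y])).map((r) => r[0]);
```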
Notice the crucial distinction: we use only the training data to compute the means and standard deviations, and then apply those statistics to both training and test sets. This ensures we don’t introduce data leakage — the test data statistics don’t influence our preprocessing parameters.
Let’s verify that our scaling worked correctly by examining the statistical properties of our transformed data:
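One way to check this, assuming the scaled arrays from the previous sketch (the columnStats helper below is illustrative):

```javascript
// Compute per-feature mean and standard deviation for a 2D array of rows.
function columnStats(X) {
  const nFeatures = X[0].length;
  const means = new Array(nFeatures).fill(0);
  const variances = new Array(nFeatures).fill(0);

  X.forEach((row) => row.forEach((value, j) => (means[j] += value / X.length)));
  X.forEach((row) =>
    row.forEach((value, j) => (variances[j] += (value - means[j]) ** 2 / X.length))
  );

  return {
    means: means.map((m) => m.toFixed(3)),
    stds: variances.map((v) => Math.sqrt(v).toFixed(3)),
  };
}

const trainStats = columnStats(XTrainScaled);
console.log("Scaled training feature means:", trainStats.means);
console.log("Scaled training feature stds: ", trainStats.stds);
```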
Perfect! Our scaled features now have means of approximately 0 and standard deviations of 1 across all dimensions. A sample row that originally had values ranging from 1.02 to 2401 now has standardized values in a much narrower range. Similarly, our target variable has been scaled to have zero mean and unit variance.
This transformation puts all our features on equal footing, allowing our neural network to learn effectively without being dominated by features that happen to have larger numerical values. The network can now focus on the actual patterns and relationships in the data rather than fighting against scale differences.
Excellent work! We’ve successfully prepared a sample of the California Housing dataset for neural network training by implementing all the essential preprocessing steps. You now understand how to load real-world datasets, handle the challenges of multi-feature problems, and apply proper data splitting and scaling techniques. Our dataset is ready with training and test samples, all properly standardized for effective neural network learning.
The skills you’ve developed in this lesson — data loading, splitting, and preprocessing — are fundamental to any machine learning project. In the upcoming practice exercises, you’ll get hands-on experience implementing these preprocessing steps yourself, building confidence in your ability to handle real-world data preparation challenges. And then, we’ll be ready to apply our neural network library on this dataset!
