Welcome back to lesson 4 of “Building and Applying Your Neural Network Library”! You’ve made great progress so far. We started by modularizing our core components for layers and activations, then organized our training pipeline with modular loss functions and optimizers. Most recently, we built a high-level orchestration layer with our `Model` and `SequentialModel` classes, providing a clean API for building and training neural networks.
Now it’s time to put our complete neural network library to work on a real-world problem! While our XOR examples were perfect for learning, real machine learning applications involve working with actual datasets that come with their own challenges: multiple features, varying scales, missing values, and the need for proper data preprocessing.
In this lesson, we’ll prepare a California Housing dataset sample — a classic regression problem that predicts house prices based on various demographic and geographic features. We’ll learn essential data handling techniques, including loading real datasets, understanding feature characteristics, splitting data properly for evaluation, and applying feature scaling to ensure our neural network can learn effectively. This foundation will set us up perfectly for our final lesson, where we’ll apply our complete neural network library to solve this practical prediction problem.
Real datasets come with complexities that require careful handling before we can apply machine learning algorithms effectively:
- Feature diversity is one key challenge — real datasets often contain features measured in completely different units and scales. For example, our California Housing dataset includes features like median income, house age, average rooms per household, and geographic coordinates. These dramatically different scales can cause problems for neural networks, which work best when all inputs are in similar ranges.
- Data splitting becomes crucial when working with real datasets. Unlike our toy examples, where we could evaluate on the same data we trained on, real applications require us to reserve some data for testing. This allows us to get an honest estimate of how our model will perform on new, unseen data — the true test of machine learning success.
- Preprocessing requirements also become more sophisticated. We need to standardize features so they have similar scales, handle the train-test split properly to avoid data leakage, and ensure our preprocessing steps are applied consistently between training and testing phases.
For this lesson, you will be provided with the California Housing dataset in CSV format, located at `data/california_housing.csv`. This file contains several records, each with multiple features and a target value representing the median house value.
To load and parse the CSV file, you can use the PapaParse library, which is a popular and easy-to-use CSV parser for JavaScript. Here’s how you can use PapaParse to load the dataset, extract features and targets, and inspect the data:
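Here is a minimal sketch of that loading step, assuming a Node.js environment with the papaparse package installed; the variable names are illustrative rather than fixed parts of our library:

```javascript
const fs = require("fs");
const Papa = require("papaparse");

// Read the CSV file into a string and parse it with PapaParse.
// header: true uses the first row as column names; dynamicTyping: true
// converts numeric strings into numbers automatically.
const csvText = fs.readFileSync("data/california_housing.csv", "utf8");
const { data: rows } = Papa.parse(csvText, {
  header: true,
  dynamicTyping: true,
  skipEmptyLines: true,
});

// Treat every column except the target (MedHouseVal) as an input feature.
const targetName = "MedHouseVal";
const featureNames = Object.keys(rows[0]).filter((name) => name !== targetName);

const features = rows.map((row) => featureNames.map((name) => row[name]));
const targets = rows.map((row) => row[targetName]);

// Inspect the data: record count, feature names, and one sample row.
console.log(`Loaded ${rows.length} records with ${featureNames.length} features`);
console.log("Features:", featureNames.join(", "));
console.log("First record:", features[0], "-> target:", targets[0]);
```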
We have several housing records with multiple features each. The features include median income (`MedInc`), house age (`HouseAge`), average rooms per household (`AveRooms`), and geographic coordinates (`Latitude`, `Longitude`). Notice the dramatic scale differences: median income ranges around 3–8, house age is in years (21–52), while geographic coordinates are around 37–38 for latitude and -122 for longitude. The target variable (`MedHouseVal`) represents median house values in units of $100,000.
To evaluate our model honestly, we need to split our data into training and testing sets. We can shuffle the data and then slice it into two parts. For reproducibility, we’ll use a simple seeded random shuffle.
Here’s a basic implementation:
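One way to sketch this step uses a small seeded pseudo-random number generator (mulberry32 here) to drive a Fisher-Yates shuffle over the features and targets arrays from the loading step above. The helper and variable names are illustrative; only the randomState parameter is something the rest of this lesson refers to:

```javascript
// Small seeded PRNG (mulberry32) so the shuffle is reproducible.
function mulberry32(seed) {
  return function () {
    let t = (seed += 0x6d2b79f5);
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Shuffle features and targets together, then slice into train/test parts.
function trainTestSplit(features, targets, testSize = 0.2, randomState = 42) {
  const random = mulberry32(randomState);
  const indices = features.map((_, i) => i);

  // Fisher-Yates shuffle of the index array.
  for (let i = indices.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    [indices[i], indices[j]] = [indices[j], indices[i]];
  }

  const testCount = Math.floor(indices.length * testSize);
  const testIdx = indices.slice(0, testCount);
  const trainIdx = indices.slice(testCount);

  return {
    XTrain: trainIdx.map((i) => features[i]),
    yTrain: trainIdx.map((i) => targets[i]),
    XTest: testIdx.map((i) => features[i]),
    yTest: testIdx.map((i) => targets[i]),
  };
}

const { XTrain, yTrain, XTest, yTest } = trainTestSplit(features, targets, 0.2, 42);
console.log(`Training samples: ${XTrain.length}, test samples: ${XTest.length}`);
```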
We’ve allocated 80% of our data for training and 20% for testing. The `randomState` parameter ensures reproducible results — we’ll get the same split every time we run the code. This consistency is important for comparing different models or hyperparameters fairly.
The key principle here is that our model will never see the test data during training. This separation allows us to evaluate how well our neural network generalizes to new, unseen examples.
Neural networks are particularly sensitive to the scale of input features. When features have very different ranges, the network may have difficulty learning effectively. Let’s apply standardization (z-score normalization) to transform our features: each value is rescaled by subtracting the feature’s mean and dividing by its standard deviation, giving every feature zero mean and unit variance.
To build a standard scaler, we’ll compute the mean and standard deviation for each feature using only the training data, then use these statistics to scale both the training and test sets; this avoids data leakage. Here’s a standard scaler implementation:
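A sketch of such a scaler, written as a small class with fit and transform methods and applied to the split arrays from the previous step (the class and variable names are illustrative assumptions, not a fixed part of our library):

```javascript
class StandardScaler {
  // Compute per-feature mean and standard deviation from the training data only.
  fit(X) {
    const nFeatures = X[0].length;
    this.means = new Array(nFeatures).fill(0);
    this.stds = new Array(nFeatures).fill(0);

    for (const row of X) {
      row.forEach((value, j) => (this.means[j] += value));
    }
    this.means = this.means.map((sum) => sum / X.length);

    for (const row of X) {
      row.forEach((value, j) => (this.stds[j] += (value - this.means[j]) ** 2));
    }
    // Guard against zero variance so we never divide by zero.
    this.stds = this.stds.map((sum) => Math.sqrt(sum / X.length) || 1);

    return this;
  }

  // Apply z-score normalization, (value - mean) / std, using the fitted statistics.
  transform(X) {
    return X.map((row) =>
      row.map((value, j) => (value - this.means[j]) / this.stds[j])
    );
  }
}

// Fit on the training features only, then transform both splits.
const featureScaler = new StandardScaler().fit(XTrain);
const XTrainScaled = featureScaler.transform(XTrain);
const XTestScaled = featureScaler.transform(XTest);

// The target can be scaled the same way by wrapping each value in a one-element row.
const targetScaler = new StandardScaler().fit(yTrain.map((y) => [y]));
const yTrainScaled = targetScaler.transform(yTrain.map((y) => [y])).map((r) => r[0]);
const yTestScaled = targetScaler.transform(yTest.map((y) => [y])).map((r) => r[0]);
```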
Notice the crucial distinction: we use only the training data to compute the means and standard deviations, and then apply those statistics to both training and test sets. This ensures we don’t introduce data leakage — the test data statistics don’t influence our preprocessing parameters.
Let’s verify that our scaling worked correctly by examining the statistical properties of our transformed data:
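One way to check this, assuming the scaled arrays from the previous sketch (the columnStats helper below is illustrative):

```javascript
// Compute per-feature mean and standard deviation for a 2D array of rows.
function columnStats(X) {
  const nFeatures = X[0].length;
  const means = new Array(nFeatures).fill(0);
  const variances = new Array(nFeatures).fill(0);

  X.forEach((row) => row.forEach((value, j) => (means[j] += value / X.length)));
  X.forEach((row) =>
    row.forEach((value, j) => (variances[j] += (value - means[j]) ** 2 / X.length))
  );

  return {
    means: means.map((m) => m.toFixed(3)),
    stds: variances.map((v) => Math.sqrt(v).toFixed(3)),
  };
}

const trainStats = columnStats(XTrainScaled);
console.log("Scaled training feature means:", trainStats.means);
console.log("Scaled training feature stds: ", trainStats.stds);
```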
Perfect! Our scaled features now have means of approximately 0 and standard deviations of 1 across all dimensions. A sample row that originally had values ranging from 1.02 to 2401 now has standardized values in a much narrower range. Similarly, our target variable has been scaled to have zero mean and unit variance.
This transformation puts all our features on equal footing, allowing our neural network to learn effectively without being dominated by features that happen to have larger numerical values. The network can now focus on the actual patterns and relationships in the data rather than fighting against scale differences.
Excellent work! We’ve successfully prepared a sample of the California Housing dataset for neural network training by implementing all the essential preprocessing steps. You now understand how to load real-world datasets, handle the challenges of multi-feature problems, and apply proper data splitting and scaling techniques. Our dataset is ready with training and test samples, all properly standardized for effective neural network learning.
The skills you’ve developed in this lesson — data loading, splitting, and preprocessing — are fundamental to any machine learning project. In the upcoming practice exercises, you’ll get hands-on experience implementing these preprocessing steps yourself, building confidence in your ability to handle real-world data preparation challenges. And then, we’ll be ready to apply our neural network library on this dataset!
