Introduction

Welcome back to lesson 4 of "Building and Applying Your Neural Network Library"! You've accomplished so much on this journey. We began by extracting our core components into clean, reusable modules for layers and activations. Then, we organized our training components into dedicated modules for loss functions and optimizers. Most recently, we built a powerful orchestration layer with our Model and SequentialModel classes, which provide a clean, high-level API for building and training neural networks.

Now it's time to put our complete neural network library to work on a real-world problem! While our XOR examples have been perfect for learning and testing, real machine learning applications involve working with actual datasets that come with their own challenges: multiple features, varying scales, missing values, and the need for proper data preprocessing.

In this lesson, we'll prepare the California Housing dataset — a classic regression problem that predicts house prices based on various demographic and geographic features. We'll learn essential data handling techniques, including loading real datasets, understanding feature characteristics, splitting data properly for evaluation, and applying feature scaling to ensure our neural network can learn effectively. This foundation will set us up perfectly for our final lesson, where we'll apply our complete neural network library to solve this practical prediction problem.

Understanding Real-World Datasets

Real datasets come with complexities that require careful handling before we can apply machine learning algorithms effectively:

  • Feature diversity is one key challenge — real datasets often contain features measured in completely different units and scales. For example, our California Housing dataset includes features like median income (measured in tens of thousands of dollars), house age (measured in years), and geographic coordinates (latitude and longitude). These dramatically different scales can cause problems for neural networks, which work best when all inputs are in similar ranges.

  • Data splitting becomes crucial when working with real datasets. Unlike our toy examples, where we could evaluate on the same data we trained on, real applications require us to reserve some data for testing. This allows us to get an honest estimate of how our model will perform on new, unseen data — the true test of machine learning success.

  • Preprocessing requirements also become more sophisticated. We need to standardize features so they have similar scales, handle the train-test split properly to avoid data leakage, and ensure our preprocessing steps are applied consistently between training and testing phases.

Loading the California Housing Dataset

Let's start by loading and exploring our dataset. The California Housing dataset is conveniently available through scikit-learn, making it perfect for our educational purposes. Here's how we load and examine its basic structure.
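A minimal sketch of this loading step, using scikit-learn's fetch_california_housing (the housing, X, and y variable names are illustrative choices, not necessarily the lesson's exact code):

```python
from sklearn.datasets import fetch_california_housing

# Download the dataset on first use (scikit-learn caches it locally)
housing = fetch_california_housing()
X, y = housing.data, housing.target

print("Feature names:", housing.feature_names)
print("Data shape:", X.shape)    # (20640, 8)
print("Target shape:", y.shape)  # (20640,)
print("First sample:", X[0])
print("First target:", y[0])     # median house value in units of $100,000
```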

This code loads the dataset and reveals important characteristics. The output shows us:

We have 20,640 housing records with 8 features each. The features include median income (MedInc), house age (HouseAge), average rooms per household (AveRooms), and geographic coordinates (Latitude, Longitude). Looking at the sample data, notice the dramatic scale differences: median income values fall roughly between 3 and 8 (in tens of thousands of dollars), house ages span 21-52 years, while geographic coordinates sit around 37-38 for latitude and -122 for longitude. The target variable (MedHouseVal) represents median house values in units of $100,000, so our first sample shows a house worth approximately $452,600.

Splitting Data for Proper Evaluation

Now we need to split our data into training and testing sets. This step is crucial for getting an honest evaluation of our model's performance. We'll use scikit-learn's train_test_split function, which handles the splitting process properly.
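A sketch of the split, assuming the X and y arrays from the loading step (the 80/20 ratio and random_state=42 match the output described below):

```python
from sklearn.model_selection import train_test_split

# Reserve 20% of the samples for testing; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Training samples:", X_train.shape[0])  # 16512
print("Test samples:", X_test.shape[0])       # 4128
```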

The output confirms our split:

We've allocated 80% of our data (16,512 samples) for training and 20% (4,128 samples) for testing. The random_state=42 parameter ensures reproducible results — we'll get the same split every time we run the code. This consistency is important for comparing different models or hyperparameters fairly.

The key principle here is that our model will never see the test data during training. This separation allows us to evaluate how well our neural network generalizes to new, unseen examples, which is the ultimate goal of machine learning.

Feature Scaling for Neural Networks

Neural networks are particularly sensitive to the scale of input features. When features have very different ranges, the network may have difficulty learning effectively. Let's apply standardization (also called z-score normalization) to transform our features.
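Standardization maps each feature value x to (x - mean) / std, with the statistics computed from the training set. A minimal sketch using scikit-learn's StandardScaler (variable names are illustrative):

```python
from sklearn.preprocessing import StandardScaler

# Learn each feature's mean and standard deviation from the training set,
# then apply the same transformation to both sets
feature_scaler = StandardScaler()
X_train_scaled = feature_scaler.fit_transform(X_train)  # fit + transform
X_test_scaled = feature_scaler.transform(X_test)        # transform only
```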

Notice the crucial distinction here: we use fit_transform() on the training data, which both learns the scaling parameters (mean and standard deviation) and applies the transformation. For the test data, we use only transform() with the same scaler object. This ensures we don't introduce data leakage — the test data statistics don't influence our preprocessing parameters.

We also scale our target variable, which often helps with regression problems.
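A second scaler handles the target; the reshape is needed because StandardScaler expects 2D input (again, a sketch with illustrative names):

```python
from sklearn.preprocessing import StandardScaler

# Fit a separate scaler on the training targets only
target_scaler = StandardScaler()
y_train_scaled = target_scaler.fit_transform(y_train.reshape(-1, 1)).ravel()
y_test_scaled = target_scaler.transform(y_test.reshape(-1, 1)).ravel()
```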

Scaling the target variable helps our neural network learn more effectively by keeping the output values in a reasonable range, typically around zero with unit variance.

Verifying Our Preprocessing

Let's verify that our scaling worked correctly by examining the statistical properties of our transformed data.
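One way to check, assuming the scaled arrays from the previous steps (a sketch, not the lesson's exact verification code):

```python
import numpy as np

# After standardization, per-feature means should be ~0 and stds ~1
print("Feature means:", np.round(X_train_scaled.mean(axis=0), 4))
print("Feature stds: ", np.round(X_train_scaled.std(axis=0), 4))
print("Target mean:", round(y_train_scaled.mean(), 4),
      "| std:", round(y_train_scaled.std(), 4))
print("First scaled sample:", np.round(X_train_scaled[0], 2))
```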

The output confirms our scaling is working correctly:

Perfect! Our scaled features now have means of approximately 0 and standard deviations of 1 across all dimensions. The sample that originally had values ranging from 1.02 to 322 now has standardized values ranging from -1.37 to 1.27. Similarly, our target variable has been scaled to zero mean and unit variance.

This transformation puts all our features on equal footing, allowing our neural network to learn effectively without being dominated by features that happen to have larger numerical values. The network can now focus on the actual patterns and relationships in the data rather than fighting against scale differences.

Conclusion and Next Steps

Excellent work! We've successfully prepared the California Housing dataset for neural network training by implementing all the essential preprocessing steps. You now understand how to load real-world datasets, handle the challenges of multi-feature problems, and apply proper data splitting and scaling techniques. Our dataset is ready with 16,512 training samples and 4,128 test samples, all properly standardized for effective neural network learning.

The skills you've developed in this lesson — data loading, splitting, and preprocessing — are fundamental to any machine learning project. In the upcoming practice exercises, you'll get hands-on experience implementing these preprocessing steps yourself, building confidence in your ability to handle real-world data preparation challenges. And then, we'll be ready to apply our neuralnets package to this dataset!
