Welcome back to lesson 4 of "Building and Applying Your Neural Network Library"! You've accomplished so much on this journey. We began by modularizing our core components into clean, reusable modules for layers and activations. Then, we organized our training components by creating dedicated modules for loss functions and optimizers. Most recently, we built a powerful orchestration layer with our Model and SequentialModel classes that provide a clean, high-level API for building and training neural networks.
Now it's time to put our complete neural network library to work on a real-world problem! While our XOR examples have been perfect for learning and testing, real machine learning applications involve working with actual datasets that come with their own challenges: multiple features, varying scales, missing values, and the need for proper data preprocessing.
In this lesson, we'll prepare the California Housing dataset — a classic regression problem that predicts house prices based on various demographic and geographic features. We'll learn essential data handling techniques, including loading real datasets, understanding feature characteristics, splitting data properly for evaluation, and applying feature scaling to ensure our neural network can learn effectively. This foundation will set us up perfectly for our final lesson, where we'll apply our complete neural network library to solve this practical prediction problem.
Real datasets come with complexities that require careful handling before we can apply machine learning algorithms effectively:
- Feature diversity is one key challenge — real datasets often contain features measured in completely different units and scales. For example, our California Housing dataset includes features like median income (measured in tens of thousands of dollars), house age (measured in years), and geographic coordinates (latitude and longitude). These dramatically different scales can cause problems for neural networks, which work best when all inputs are in similar ranges.
- Data splitting becomes crucial when working with real datasets. Unlike our toy examples, where we could evaluate on the same data we trained on, real applications require us to reserve some data for testing. This allows us to get an honest estimate of how our model will perform on new, unseen data — the true test of machine learning success.
- Preprocessing requirements also become more sophisticated. We need to standardize features so they have similar scales, handle the train-test split properly to avoid data leakage, and ensure our preprocessing steps are applied consistently between training and testing phases.
Let's start by loading and exploring our dataset. We'll use R's reticulate package to access Python's scikit-learn library, which provides the California Housing dataset in a convenient format:
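A minimal sketch of this step is shown below; the object names (`housing`, `X`, `y`, `feature_names`) are our own illustrative choices rather than anything prescribed by the lesson, and the code assumes scikit-learn is installed in the active Python environment:

```r
# Sketch: load the California Housing data through reticulate.
# Assumes scikit-learn is available in the Python environment reticulate uses.
library(reticulate)

datasets <- import("sklearn.datasets")
housing <- datasets$fetch_california_housing()

X <- housing$data             # 20,640 x 8 feature matrix
y <- housing$target           # median house values, in units of $100,000
feature_names <- unlist(housing$feature_names)

cat("Samples:", nrow(X), " Features:", ncol(X), "\n")
cat("Feature names:", paste(feature_names, collapse = ", "), "\n")
cat("First sample:", round(X[1, ], 3), "\n")
cat("First target:", y[1], "\n")
```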
This code loads the dataset and reveals important characteristics. The output shows us:
We have 20,640 housing records with 8 features each. The features include median income (MedInc), house age (HouseAge), average rooms per household (AveRooms), and geographic coordinates (Latitude, Longitude). Looking at the sample data, notice the dramatic scale differences: median income is around 8.3, house age is 41 years, while the geographic coordinates sit around 37.88 for latitude and -122.23 for longitude. The target variable (MedHouseVal) represents median house values in units of $100,000, so this sample's value of 4.526 corresponds to $452,600.
Now we need to split our data into training and testing sets. This is crucial for getting an honest evaluation of our model's performance. We'll use the caret package's createDataPartition() function, which provides stratified sampling for better train-test splits:
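Here is one way that split might look; the names `train_idx`, `X_train`, `y_train`, `X_test`, and `y_test` are illustrative, and they carry forward the `X` and `y` objects from the loading step above:

```r
# Sketch: stratified 80/20 train-test split with caret.
library(caret)

set.seed(42)  # make the split reproducible
train_idx <- createDataPartition(y, p = 0.8, list = FALSE)

X_train <- X[train_idx, ]
y_train <- y[train_idx]
X_test  <- X[-train_idx, ]
y_test  <- y[-train_idx]

cat("Training samples:", nrow(X_train), "\n")
cat("Test samples:    ", nrow(X_test), "\n")
```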
The output confirms our split:
We've allocated 80% of our data (16,512 samples) for training and 20% (4,128 samples) for testing. Calling set.seed(42) ensures reproducible results — we get the same split every time we run the code. The createDataPartition() function from caret provides stratified sampling, which helps ensure our train and test sets have similar distributions of the target variable.
The key principle here is that our model will never see the test data during training. This separation allows us to evaluate how well our neural network generalizes to new, unseen examples, which is the ultimate goal of machine learning.
Neural networks are particularly sensitive to the scale of input features. When features have very different ranges, the network may have difficulty learning effectively. Let's apply standardization (also called z-score normalization) using the caret package's preprocessing functions:
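A sketch of that standardization step, assuming the split objects from above:

```r
# Sketch: fit the standardization parameters (mean and SD) on the training
# features only, then apply the same transformation to both splits.
preproc <- preProcess(X_train, method = c("center", "scale"))

X_train_scaled <- predict(preproc, X_train)
X_test_scaled  <- predict(preproc, X_test)
```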
Notice the crucial distinction here: we use preProcess() on the training data, which calculates the scaling parameters (mean and standard deviation) without applying the transformation. Then we use predict() to apply these parameters to both training and test data. This ensures we don't introduce data leakage — the test data statistics don't influence our preprocessing parameters.
We also scale our target variable, which often helps with regression problems:
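One simple way to do this, using the training-set mean and standard deviation (again, the variable names here are illustrative):

```r
# Sketch: standardize the target using training-set statistics only,
# so no information leaks in from the test targets.
y_mean <- mean(y_train)
y_sd   <- sd(y_train)

y_train_scaled <- (y_train - y_mean) / y_sd
y_test_scaled  <- (y_test - y_mean) / y_sd
```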
Scaling the target variable helps our neural network learn more effectively by keeping the output values in a reasonable range, typically around zero with unit variance.
Let's verify that our scaling worked correctly by examining the statistical properties of our transformed data:
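A quick check along these lines confirms the transformation behaved as expected:

```r
# Sketch: scaled features should have mean ~0 and SD ~1 on the training set.
cat("Feature means: ", round(colMeans(X_train_scaled), 4), "\n")
cat("Feature SDs:   ", round(apply(X_train_scaled, 2, sd), 4), "\n")
cat("First scaled sample:", round(X_train_scaled[1, ], 2), "\n")
cat("Target mean:", round(mean(y_train_scaled), 4),
    " Target SD:", round(sd(y_train_scaled), 4), "\n")
```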
The output confirms our scaling is working correctly:
Perfect! Our scaled features now have means of approximately 0 and standard deviations of 1 across all dimensions. The sample whose original values ranged from -122.23 to 1.02 now has standardized values between -0.84 and 2.34. Similarly, our target variable has been scaled to zero mean and unit variance.
This transformation puts all our features on equal footing, allowing our neural network to learn effectively without being dominated by features that happen to have larger numerical values. The network can now focus on the actual patterns and relationships in the data rather than fighting against scale differences.
Excellent work! We've successfully prepared the California Housing dataset for neural network training by implementing all the essential preprocessing steps. You now understand how to use reticulate to access Python's scikit-learn datasets, handle the challenges of multi-feature problems, and apply proper data splitting and scaling techniques using the caret package. Our dataset is ready with 16,512 training samples and 4,128 test samples, all properly standardized for effective neural network learning.
The skills you've developed in this lesson — data loading with reticulate, splitting with caret, and preprocessing with standardization — are fundamental to any machine learning project. In the upcoming practice exercises, you'll get hands-on experience implementing these preprocessing steps yourself, building confidence in your ability to handle real-world data preparation challenges. Then, we'll be ready to apply our neural network package to this dataset!
