Introduction & Overview

Welcome back! In the previous lesson, you explored the California housing dataset and learned how to inspect, summarize, and visualize your data. Your exploration revealed several important insights: extreme outliers in multiple features (like AveOccup with a maximum of 1,243 people per household, AveRooms with 141.91 rooms, and AveBedrms with 34.07 bedrooms), artificial capping in the target variable, and strong correlations between certain features. These findings directly inform the data preprocessing steps you need to take.

In this lesson, you will learn how to transform your raw data into a form that is suitable for modeling. This process is called data preprocessing, and it is one of the most important stages in any machine learning workflow. We will focus on four key tasks: creating meaningful new features from existing data, splitting your data into training and testing sets for fair model evaluation, systematically handling outliers across all features while avoiding data leakage, and saving your processed datasets for future use. Each step builds directly on the insights from your exploratory data analysis.

Feature Engineering: Creating Meaningful Derived Features

Feature engineering is the process of creating new input features from your existing data. This can help your model capture important patterns that might not be obvious from the original features alone. From our correlation analysis in the previous lesson, we saw that AveRooms and AveBedrms were highly correlated (0.85), and both relate to household space. We can create a more meaningful feature by dividing AveRooms by AveOccup to measure space per person, a potentially important factor in determining house values.
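A minimal sketch of what this might look like, assuming the dataset is loaded into a pandas DataFrame (the variable names here, such as df, are our own):

```python
from sklearn.datasets import fetch_california_housing

# Load the California housing data as a single DataFrame
# (the frame includes the MedHouseVal target column)
housing = fetch_california_housing(as_frame=True)
df = housing.frame

# Rooms per occupant: average rooms divided by average occupancy
df["RoomsPerHousehold"] = df["AveRooms"] / df["AveOccup"]

print(df[["AveRooms", "AveOccup", "RoomsPerHousehold"]].head())
```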

The printed output shows the new feature calculation in action. Notice that the RoomsPerHousehold values (around 2-3 rooms per person) are realistic and interpretable.

Splitting the Data into Training and Testing Sets

Before we handle outliers, we need to split our data into training and testing sets. This is crucial for avoiding data leakage—a common mistake where information from the test set influences the preprocessing of the training set. The training set is used to fit your model, while the testing set is used to evaluate how well your model performs on new, unseen data.

Based on our exploration, we know we have 20,640 samples to work with. Let's use the train_test_split function from scikit-learn to split this data, keeping 80% for training and 20% for testing:
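A minimal sketch, continuing from the df above (the X and y variable names are our own):

```python
from sklearn.model_selection import train_test_split

# Separate the features from the target
X = df.drop(columns="MedHouseVal")
y = df["MedHouseVal"]

# Hold out 20% for testing; random_state makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
```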

The random_state=42 parameter ensures that our data split is reproducible—running this code multiple times will always produce the same training and test sets. This is important for consistent results across different experiments and when sharing your work with others.

Printing the set sizes confirms the split worked correctly: 16,512 training samples and 4,128 test samples. This 80/20 split gives us plenty of data for training while reserving a substantial test set for reliable performance evaluation.

Outlier Handling: Avoiding Data Leakage

Outliers are data points that are much higher or lower than most of your data. In the previous lesson, our data exploration revealed extreme outliers in multiple features that could negatively impact model performance. Now that we've split our data, we can handle these outliers properly using a technique called capping.

Capping means setting a maximum limit on your data values. We'll use the 95th percentile as our limit—this is the value below which 95% of your data falls, meaning only the most extreme 5% of values get reduced. For example, if the 95th percentile of AveRooms is 7.65, then any house with more than 7.65 average rooms gets "clipped" down to exactly 7.65. This removes extreme outliers while preserving the vast majority of your data.

The critical step is calculating these limits correctly to avoid data leakage. We must calculate the 95th percentiles using only the training data, then apply those same thresholds to both training and test sets. If we used the entire dataset to calculate thresholds, we'd be using information from the test set to preprocess our training data, which would give us overly optimistic performance estimates.
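One way to implement this with pandas (a sketch; the choice to cap every feature except the geographic coordinates follows the points listed below):

```python
# Cap every feature except the geographic coordinates
cap_cols = [col for col in X_train.columns if col not in ("Latitude", "Longitude")]

# Compute the 95th-percentile thresholds from the TRAINING data only
caps = X_train[cap_cols].quantile(0.95)

# Apply the same thresholds to both sets, clipping only the upper tail
X_train[cap_cols] = X_train[cap_cols].clip(upper=caps, axis=1)
X_test[cap_cols] = X_test[cap_cols].clip(upper=caps, axis=1)
```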

This approach:

  • Prevents data leakage by using only training data to determine thresholds
  • Handles all outliers systematically rather than cherry-picking specific features
  • Uses a consistent threshold (95th percentile) across all features
  • Preserves geographic information by excluding coordinates from capping
  • Applies the same transformation to both training and test sets for consistency

By following this systematic approach, we ensure that our model training will be based on clean, realistic data while maintaining the integrity of our evaluation process.

Verifying Our Preprocessing Results

Before saving our data, let's verify that our preprocessing steps worked as expected by examining the final statistics of our processed training dataset:
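A quick check might look like this (a sketch using pandas summary methods):

```python
# Summary statistics for the capped training features
print(X_train.describe().round(2))

# Confirm no missing values were introduced
print(f"Missing values: {X_train.isna().sum().sum()}")
```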

The resulting summary shows the dramatic impact of our systematic outlier handling and confirms the effectiveness of our preprocessing:

  • MedInc is now capped at 7.31 (down from 15.00)
  • AveRooms is capped at 7.65 (down from 141.91)
  • AveBedrms is capped at 1.28 (down from 34.07)
  • Population is capped at 3,282 (down from 35,682)
  • AveOccup is capped at 4.33 (down from 1,243.33)

All 16,512 training samples remain with no missing values. Our systematic approach has successfully removed extreme outliers while preserving the vast majority of our data.

Combining Features and Target for Export

Now let's prepare our processed datasets for saving. We'll combine the features and target variables back together for each dataset, creating complete datasets that are ready to use in future modeling work:
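A sketch of this step (the train_df and test_df names are our own):

```python
# Reattach the target so each dataset is self-contained
train_df = X_train.copy()
train_df["MedHouseVal"] = y_train

test_df = X_test.copy()
test_df["MedHouseVal"] = y_test

print(f"Training dataset shape: {train_df.shape}")
print(f"Test dataset shape: {test_df.shape}")
```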

The printed shapes confirm that your datasets are properly structured: each has 10 columns, consisting of 9 feature columns (including our new RoomsPerHousehold feature) plus 1 target column (MedHouseVal). This structure makes it easy to load and use the data in future experiments.

Saving the Processed Data for Future Use

Now that your data is properly preprocessed and split, you should save these datasets to files. This allows you to reuse the same data splits in future modeling work without having to repeat all the preprocessing steps. It also ensures consistency across different experiments and makes it easier to share your prepared data with others.
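For example (the data folder and file names here are assumptions; use whatever layout your project follows):

```python
import os

# Create the output folder if it doesn't exist (hypothetical layout)
os.makedirs("data", exist_ok=True)

# index=False keeps the pandas row index out of the saved files
train_df.to_csv("data/train_processed.csv", index=False)
test_df.to_csv("data/test_processed.csv", index=False)
```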

After running this code, the processed training and test files will appear in your project's data folder.

By saving your processed data, you create a checkpoint in your workflow. The preprocessing steps you've applied—feature engineering, data splitting, and systematic outlier handling—are now permanently captured in these files.

Summary & Preparation for Practices

In this lesson, you learned how to systematically prepare your data for machine learning while avoiding common pitfalls like data leakage. You created a meaningful new feature (RoomsPerHousehold) that captures the relationship between space and occupancy. You then split your data into training and testing sets BEFORE handling outliers—a crucial step for preventing data leakage. You took a clean, systematic approach to outlier handling by calculating the 95th percentiles from the training data only, then applying those thresholds to both datasets. Finally, you saved your cleaned data for future use.

The key insight here is that the order of operations matters. By splitting your data before calculating outlier thresholds, you ensure that no information from the test set influences your preprocessing decisions. This gives you a more realistic assessment of how your models will perform on truly unseen data.

In the upcoming practice exercises, you will apply these techniques yourself on different datasets. Remember, the goal is not just to follow the steps, but to understand why each step matters for your specific data and modeling objectives.
