Introduction to Data Preprocessing

Welcome back to our data exploration journey! In the previous lessons, you learned how to load datasets and perform exploratory data analysis (EDA) to understand the structure and relationships in your data. You identified numerical and categorical features, visualized distributions, and examined correlations. These steps have given you valuable insights into your dataset.

Now that you understand what your data looks like, it's time to prepare it for modeling. This critical step is called data preprocessing, and it addresses common issues that can prevent machine learning algorithms from performing well. In this lesson, we'll focus on three essential preprocessing tasks:

  1. Splitting your data into training and test sets
  2. Handling missing values in both numerical and categorical features
  3. Encoding categorical features into a numerical format that machine learning algorithms can understand

By the end of this lesson, you'll have a complete preprocessing pipeline that prepares your data for the modeling phase we'll tackle in the next unit.

Train-Test Split: Why and How

Before we preprocess our data, it’s best practice to split the data into training and test sets. This split allows us to evaluate our model’s performance on unseen data, simulating how it would perform in the real world. The training set is used to fit the model, while the test set is reserved for final evaluation.

Here’s how you can perform a train-test split using scikit-learn:
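A minimal sketch of the split, assuming the features live in a DataFrame `X` and the target in a Series `y` (the column names and values below are placeholders, not from a real dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset; in practice, X and y come from your loaded DataFrame.
X = pd.DataFrame({"size": [50, 60, 70, 80, 90], "rooms": [1, 2, 2, 3, 3]})
y = pd.Series([100, 120, 140, 160, 180])

# Hold out 20% of the rows for final evaluation;
# random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

With `test_size=0.2`, roughly 80% of the rows go to `X_train` and 20% to `X_test`; fixing `random_state` means you get the same split every time you run the code.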

After splitting, you should perform all preprocessing steps (imputation, encoding, etc.) using only the training data to compute statistics (like median or mode), and then apply those transformations to both the training and test sets. This ensures that your model evaluation remains unbiased and realistic.

Handling Missing Values in Numerical Features

Missing values are a common problem in real-world datasets. They can occur for various reasons: data entry errors, equipment malfunctions, or simply because the information wasn't available. Whatever the cause, missing values can significantly impact model performance if not handled properly.

Let's first check if our dataset has any missing values:
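One common way to check, sketched here on a small placeholder training frame (the `area` and `city` columns are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Small illustrative training frame with a few gaps (placeholder data).
X_train = pd.DataFrame({
    "area": [50.0, np.nan, 70.0, 80.0],
    "city": ["Paris", "Lyon", None, "Paris"],
})

# Count missing values per column, keeping only columns that have any.
missing_counts = X_train.isnull().sum()
print(missing_counts[missing_counts > 0])
```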

Calling `isnull().sum()` on the training set reports which columns have missing values and how many are missing in each. For numerical features, common strategies for filling these gaps include:

  • Replacing with the mean (average)
  • Replacing with the median (middle value)
  • Replacing with the mode (most frequent value)

Median imputation is often preferred over the mean because it's less sensitive to outliers.
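A quick numeric illustration of that robustness, using made-up values: a single extreme entry drags the mean far away from the typical values but leaves the median untouched.

```python
import pandas as pd

incomes = pd.Series([30, 32, 35, 38, 40])          # fairly typical values
with_outlier = pd.Series([30, 32, 35, 38, 1000])   # one extreme entry

print(incomes.mean(), incomes.median())            # 35.0 35.0
print(with_outlier.mean(), with_outlier.median())  # 227.0 35.0
```

Imputing with 227 would clearly misrepresent the typical value here, which is why the median is often the safer default.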

Why Use the Median of the Training Set?

It’s important to use the median calculated only from the training set when filling missing values in both the training and test sets. This prevents data leakage—the accidental use of information from the test set during training—which can lead to overly optimistic model performance. By using only the training set statistics, we ensure our preprocessing mimics real-world scenarios, where the model never sees the test data during training.

Here’s how to implement median imputation correctly:
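A sketch of the pattern, with a placeholder `area` column standing in for any numerical feature:

```python
import numpy as np
import pandas as pd

# Placeholder numerical column with missing entries in both splits.
X_train = pd.DataFrame({"area": [50.0, np.nan, 70.0, 90.0]})
X_test = pd.DataFrame({"area": [np.nan, 60.0]})

# Compute the median on the TRAINING data only...
train_median = X_train["area"].median()

# ...then apply it to both splits to avoid data leakage.
X_train["area"] = X_train["area"].fillna(train_median)
X_test["area"] = X_test["area"].fillna(train_median)
```

Note that the test set's own median is never computed; the test rows are filled with the training median, exactly as new, unseen data would be at prediction time.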

Handling Missing Values in Categorical Features

Categorical features require a different approach for handling missing values. Since categorical data represents discrete categories rather than continuous values, using statistical measures like the mean or median doesn't make sense.

For categorical features, the most common approach is to replace missing values with the mode (most frequent category) from the training set.

Again, using the mode from the training set for both training and test data helps prevent data leakage and ensures consistency.
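The same train-only pattern for a categorical column might look like this (the `city` column and its values are placeholders):

```python
import pandas as pd

X_train = pd.DataFrame({"city": ["Paris", "Lyon", None, "Paris"]})
X_test = pd.DataFrame({"city": [None, "Lyon"]})

# mode() can return several values if there is a tie; take the first.
train_mode = X_train["city"].mode()[0]

# Fill both splits with the training-set mode.
X_train["city"] = X_train["city"].fillna(train_mode)
X_test["city"] = X_test["city"].fillna(train_mode)
```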

Encoding Categorical Features with LabelEncoder

Most machine learning algorithms require numerical input data. However, categorical features contain text values that represent different categories. To use these features in our models, we need to convert them to numbers—a process called encoding.

One of the simplest encoding techniques is Label Encoding, which assigns a unique integer to each category. For example, if a Color feature has values ["Red", "Blue", "Green"], Label Encoding would map each color to an integer code. Note that scikit-learn's LabelEncoder assigns codes in sorted order, so "Blue" becomes 0, "Green" becomes 1, and "Red" becomes 2.

To keep preprocessing realistic, we fit the encoder using only the training data, then apply that same mapping to the test set. This matches the same train-only principle we used for medians and modes. If the test set or future data contains a category that never appeared in training, we need a fallback strategy. In this course, we'll map unseen categories to -1 so the pipeline can still run consistently.
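Following that description, here is one way to sketch train-only fitting with a -1 fallback (the color values are placeholders):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train_colors = pd.Series(["Red", "Blue", "Green", "Blue"])
test_colors = pd.Series(["Green", "Yellow"])  # "Yellow" never appeared in training

# Fit the encoder on the training data only.
encoder = LabelEncoder()
train_encoded = encoder.fit_transform(train_colors)

# Apply the training mapping to the test set,
# sending unseen categories to -1.
known = set(encoder.classes_)
test_encoded = test_colors.map(
    lambda c: encoder.transform([c])[0] if c in known else -1
)
```

Because LabelEncoder sorts categories, the training codes here are Blue=0, Green=1, Red=2; the unseen "Yellow" in the test set falls back to -1 instead of raising an error.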

It is also helpful to know where this simple approach fits best. Label encoding is often acceptable for tree-based models, which split on values without assuming the codes represent meaningful distances. For linear models, however, these numeric codes can accidentally imply an order that does not really exist. In many real projects, one-hot encoding is a better choice for nominal categorical features. Here, we'll continue with LabelEncoder because it keeps the preprocessing pipeline compact and easy to follow.

Creating a Complete Preprocessing Pipeline

Now that we've learned how to split the data, handle missing values, and encode categorical features, let's put everything together into a complete preprocessing pipeline. This pipeline will take our raw data and transform it into a format suitable for machine learning.

You can think of the full preprocessing pipeline as this sequence:

  1. Split the original dataset into training and test sets.
  2. Identify numerical and categorical columns from the training data.
  3. Fill missing numerical values using medians from the training set.
  4. Fill missing categorical values using modes from the training set.
  5. Fit categorical encoders on the training set only.
  6. Apply the same transformations to both training and test data.
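The steps above can be sketched as a single function. This is one possible implementation under the assumptions of this lesson (median/mode imputation, LabelEncoder with a -1 fallback); the DataFrame and column names are placeholders:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder


def preprocess(df, target):
    """Split, impute, and encode a raw DataFrame.
    All statistics come from the training set only."""
    X = df.drop(columns=[target])
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    X_train, X_test = X_train.copy(), X_test.copy()

    num_cols = X_train.select_dtypes(include="number").columns
    cat_cols = X_train.select_dtypes(exclude="number").columns

    # Numerical features: median from the training set.
    for col in num_cols:
        median = X_train[col].median()
        X_train[col] = X_train[col].fillna(median)
        X_test[col] = X_test[col].fillna(median)

    # Categorical features: mode from the training set,
    # then label encoding with -1 for unseen categories.
    for col in cat_cols:
        mode = X_train[col].mode()[0]
        X_train[col] = X_train[col].fillna(mode)
        X_test[col] = X_test[col].fillna(mode)
        enc = LabelEncoder()
        X_train[col] = enc.fit_transform(X_train[col])
        known = set(enc.classes_)
        X_test[col] = X_test[col].map(
            lambda c: enc.transform([c])[0] if c in known else -1
        )

    return X_train, X_test, y_train, y_test
```

The key design choice is that every statistic (median, mode, encoder mapping) is computed inside the training split and merely *applied* to the test split, so no information leaks from test to train.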

Let's verify that our preprocessing was successful by checking for any remaining missing values and examining the transformed data:
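A quick sanity check might look like the following; here `X_train` and `X_test` are small placeholder frames standing in for the real transformed outputs of your pipeline:

```python
import pandas as pd

# Placeholder preprocessed frames; in practice, use your pipeline's outputs.
X_train = pd.DataFrame({"area": [50.0, 70.0], "city": [0, 1]})
X_test = pd.DataFrame({"area": [60.0], "city": [1]})

# No missing values should remain anywhere...
print(X_train.isnull().sum().sum(), X_test.isnull().sum().sum())  # 0 0

# ...and every column should now be numeric.
print(X_train.dtypes)
```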

The output should show zero missing values, and the sample data should display numerical values for all features, including those that were originally categorical.

Summary

In this lesson, you learned essential data preprocessing techniques that prepare your data for machine learning:

  1. Splitting your data into training and test sets to ensure unbiased model evaluation
  2. Handling missing values in numerical features using median imputation from the training set
  3. Handling missing values in categorical features using mode imputation from the training set
  4. Encoding categorical features using LabelEncoder fitted on the training data
  5. Creating a complete preprocessing pipeline that applies these transformations consistently

These preprocessing steps are crucial for building effective machine learning models. Clean, well-formatted data allows algorithms to learn meaningful patterns and make accurate predictions.

In the upcoming practice exercises, you'll have the opportunity to apply these techniques to different datasets with various missing value patterns and categorical features. This hands-on practice will help solidify your understanding of data preprocessing.

In the next unit, we'll build on this foundation by creating baseline regression models. We'll use the preprocessed data to train simple models that establish performance benchmarks. These baselines will serve as a starting point for more advanced modeling techniques.

Remember that data preprocessing is not a one-size-fits-all process. The techniques we've covered today are common approaches, but the best preprocessing strategy depends on your specific dataset and problem. As you gain more experience, you'll develop intuition for which techniques work best in different situations.
