Lesson Introduction

Hello there! Today, we're going to talk about handling missing values in datasets for machine learning. Why is this important? Imagine you are building a model to predict house prices, but some houses are missing information about their size or the number of bedrooms. These missing values can affect the performance of your model. In this lesson, you'll learn why data might be missing, different ways to handle it, and how to use Python libraries to do so.

By the end of this lesson, you'll know why handling missing values is crucial, understand different strategies to deal with them, and be able to use Python tools to handle missing data efficiently.

Understanding Missing Values

Why does data go missing? There are many reasons:

  1. Human Error: Sometimes, people forget to fill in all the fields when entering data.
  2. System Error: Occasionally, the system that collects the data might have problems.
  3. Other Reasons: Data may be intentionally left out for privacy reasons.

There are three common types of missing data:

  • MCAR (Missing Completely at Random): The data is missing randomly without any pattern.
  • MAR (Missing at Random): There is a pattern, but it is not related to the missing data itself.
  • MNAR (Missing Not at Random): There is a pattern related to why the data is missing.

Strategies for Handling Missing Values

Handling missing values can be done in several ways:

  1. If the missing data is a small percentage, you might just delete those rows or columns. But be careful: if you remove too much data, you might lose important information.

  2. You can also replace the missing values with some constant value like the mean, median, or mode. This method is often more suitable because it still keeps the data structure.
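As a quick sketch of both options on a toy pandas Series (the values are made up for illustration):

```python
import pandas as pd

s = pd.Series([10.0, None, 30.0])

# Option 1: delete the incomplete entries
print(s.dropna())          # keeps 10.0 and 30.0

# Option 2: fill the gap with a constant such as the mean
print(s.fillna(s.mean()))  # the missing entry becomes (10.0 + 30.0) / 2 = 20.0
```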

Dropping Missing Values: Part 1

Dropping missing values is straightforward with pandas DataFrames. Let's recall it quickly.

Let's consider a simple dataset in which Charlie's Score is missing and another row is missing its Name, and remove the incomplete rows using the dropna() function:
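As a sketch, assuming a small DataFrame with Name and Score columns (the specific names and numbers are invented for illustration):

```python
import pandas as pd

# Hypothetical dataset: Charlie's Score is missing, and one row has no Name
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", None],
    "Score": [85.0, 92.0, None, 78.0],
})

# dropna() removes every row that contains at least one missing value
print(df.dropna())
```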

Charlie's row is removed because it contained a null Score, and the row with a missing Name is removed as well: by default, dropna() drops every row that contains at least one missing value.

Dropping Missing Values: Part 2

To check only specific columns for missing values, pass the subset argument to dropna(), listing the columns to inspect. Here's an example:
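Using the same hypothetical Name/Score DataFrame as before, a sketch might look like this:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", None],
    "Score": [85.0, 92.0, None, 78.0],
})

# Only the Score column is checked: rows with a missing Name survive
print(df.dropna(subset=["Score"]))
```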

This time the fourth row is kept: although it contains a missing value in the Name column, only rows with a missing Score are dropped.

Using `scikit-learn` to Impute Missing Values: Part 1

One of the easiest ways to handle missing values in Python is by using the SimpleImputer class from the sklearn.impute module. Let's break it down.

The SimpleImputer has a few strategies you can use:

  • mean: Replaces missing values with the mean of each column.
  • median: Replaces missing values with the median of each column.
  • most_frequent: Replaces missing values with the most frequent value in each column.
  • constant: Replaces missing values with a constant value you provide.
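A quick sketch of all four strategies on a tiny one-column array (the numbers are arbitrary):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [np.nan], [2.0], [2.0]])

# Each behavior is selected via the strategy argument
print(SimpleImputer(strategy="mean").fit_transform(X))                    # NaN -> 5/3
print(SimpleImputer(strategy="median").fit_transform(X))                  # NaN -> 2.0
print(SimpleImputer(strategy="most_frequent").fit_transform(X))           # NaN -> 2.0
print(SimpleImputer(strategy="constant", fill_value=0).fit_transform(X))  # NaN -> 0.0
```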

Let's walk through some code that handles missing values using the SimpleImputer.

First, we need a dataset. We'll use the pandas library to create one with some missing values:
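A sketch of such a dataset (the column names and values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Two numeric features, each with one missing entry
df = pd.DataFrame({
    "Feature1": [1.0, 2.0, np.nan, 4.0],
    "Feature2": [7.0, np.nan, 9.0, 10.0],
})
print(df)
```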

Note that we use np.nan here instead of None. None is a Python singleton object representing missing values across all data types, while np.nan is a floating-point "Not a Number" value from the numpy library, specifically used for numeric data. None is versatile and not tied to any library, but it may cause errors in operations unless explicitly handled. In contrast, np.nan is tailored for numerical computations, supporting vectorized operations in numpy and pandas, making it more suitable for handling missing numerical values.

Using `scikit-learn` to Impute Missing Values: Part 2

Here, we use the SimpleImputer from sklearn.impute to handle the missing values. In this case, we'll use the mean strategy, meaning the missing values are replaced with the mean value of the corresponding column. Note that missing values won't be taken into account when calculating the mean.

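Continuing with the hypothetical Feature1/Feature2 DataFrame, a sketch:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "Feature1": [1.0, 2.0, np.nan, 4.0],
    "Feature2": [7.0, np.nan, 9.0, 10.0],
})

# fit computes each column's mean (NaNs are ignored); transform fills the gaps
imputer = SimpleImputer(strategy="mean")
imputed = imputer.fit_transform(df)
print(imputed)  # NaN in Feature1 -> (1+2+4)/3, NaN in Feature2 -> (7+9+10)/3
```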

Converting the `NumPy` Array Back to a `DataFrame`

The result of the imputation is a NumPy array. Let's convert it back to a DataFrame for better readability.
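A sketch, reusing the hypothetical DataFrame and imputer from above:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "Feature1": [1.0, 2.0, np.nan, 4.0],
    "Feature2": [7.0, np.nan, 9.0, 10.0],
})

imputed_array = SimpleImputer(strategy="mean").fit_transform(df)

# Wrap the NumPy array in a DataFrame, reusing the original column names
df_imputed = pd.DataFrame(imputed_array, columns=df.columns)
print(df_imputed)
```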

Notice how we use df.columns to assign the same column names we had before.

Using `scikit-learn` to Impute Missing Values for Specific Columns

Sometimes, you may want to impute only specific columns in your dataset. You can achieve this by selecting those columns and applying the SimpleImputer to them. Here's how you can do it.

Let's use the same dataset that we created earlier:
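One possible sketch, imputing only the hypothetical Feature1 column:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "Feature1": [1.0, 2.0, np.nan, 4.0],
    "Feature2": [7.0, np.nan, 9.0, 10.0],
})

# Select Feature1 with a list of columns so it stays 2-D, as the imputer expects
imputer = SimpleImputer(strategy="mean")
df[["Feature1"]] = imputer.fit_transform(df[["Feature1"]])
print(df)  # Feature2 still contains its NaN
```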

In this example, the missing value in Feature1 is replaced by the mean of the other values in that column. The Feature2 column remains unchanged. This approach allows you to target specific columns that need imputation while leaving others untouched.

In the same manner, you can impute values into any subset of columns.

Lesson Summary

Great job! 🎉 You've learned why handling missing values is crucial, discovered different strategies to tackle missing data, and practiced using SimpleImputer to handle missing values in a sample dataset. Missing data is a common issue, but now you have the tools to manage it and improve the quality of your datasets.

Now that you've learned the theory, it's time to get hands-on practice! In the practice session, you'll handle missing values in various datasets, experimenting with different imputation strategies, and observing the outcomes. This practice will help solidify your understanding and make you more confident in managing missing data for your machine learning projects. Let's get started! 🚀
