Hello there! Today, we're going to talk about handling missing values in datasets for machine learning. Why is this important? Imagine you are building a model to predict house prices, but some houses are missing information about their size or the number of bedrooms. These missing values can affect the performance of your model. In this lesson, you'll learn why data might be missing, different ways to handle it, and how to use Python libraries to do so.
By the end of this lesson, you'll know why handling missing values is crucial, understand different strategies to deal with them, and be able to use Python tools to handle missing data efficiently.
Why does data go missing? There are many reasons:
- Human Error: Sometimes, people forget to fill in all the fields when entering data.
- System Error: Occasionally, the system that collects the data might have problems.
- Other Reasons: Data may be intentionally left out for privacy reasons.
There are three common types of missing data:
- MCAR (Missing Completely at Random): The data is missing randomly without any pattern.
- MAR (Missing at Random): The missingness follows a pattern, but the pattern depends only on the observed data, not on the missing values themselves.
- MNAR (Missing Not at Random): The missingness depends on the missing values themselves (for example, people with high incomes declining to report their income).
Handling missing values can be done in several ways:
- Removal: If the missing data is a small percentage, you might just delete those rows or columns. But be careful: if you remove too much data, you might lose important information.
- Imputation: You can also replace the missing values with a statistic such as the mean, median, or mode, or with a constant value. This method is often preferable because it preserves the size and structure of the dataset.
Dropping missing values is straightforward with pandas DataFrames. Let's recall how it works.
Let's consider this simple dataset:
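A sketch of such a dataset (the names and scores here are made up for illustration):

```python
import pandas as pd

# Made-up example data: Charlie's score is missing,
# and the fourth row is missing a name.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", None],
    "Score": [85.0, 90.0, None, 75.0],
})
print(df)
```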
Let's remove rows with None values using the dropna() function:
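A minimal sketch, using a small made-up dataset of names and scores:

```python
import pandas as pd

# Made-up dataset: Charlie's score and the fourth row's name are missing
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", None],
    "Score": [85.0, 90.0, None, 75.0],
})

# dropna() drops every row that contains at least one missing value
cleaned = df.dropna()
print(cleaned)  # only Alice's and Bob's rows survive
```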
In the output, Charlie's row is removed because it contained a null value (the missing score), and the row with the missing name is removed as well.
To scan only specific columns for missing values, you can pass the `subset` argument to `dropna()`, specifying which columns to check. Here's an example:
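A sketch with the same kind of made-up dataset, this time checking only the `Score` column:

```python
import pandas as pd

# Made-up dataset: Charlie's score and the fourth row's name are missing
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", None],
    "Score": [85.0, 90.0, None, 75.0],
})

# Only rows with a missing Score are dropped; missing Names are ignored
cleaned = df.dropna(subset=["Score"])
print(cleaned)  # Charlie's row is gone, but the row with the missing name stays
```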
In the output, the fourth row is not removed. Though it contains a missing value in the `Name` column, this time we only remove rows with a missing `Score`.
One of the easiest ways to handle missing values in Python is by using the `SimpleImputer` class from the `sklearn.impute` module. Let's break it down.
The `SimpleImputer` has a few strategies you can use:
- `mean`: Replaces missing values with the mean of each column.
- `median`: Replaces missing values with the median of each column.
- `most_frequent`: Replaces missing values with the most frequent value in each column.
- `constant`: Replaces missing values with a constant value you provide.
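As a quick sketch of how each strategy behaves, here is a toy one-column array (the numbers are made up) run through all four strategies:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# A single made-up column with one missing value
X = np.array([[1.0], [1.0], [2.0], [np.nan]])

# mean fills the NaN with 4/3; median and most_frequent fill it with 1.0
for strategy in ["mean", "median", "most_frequent"]:
    imputer = SimpleImputer(strategy=strategy)
    print(strategy, imputer.fit_transform(X).ravel())

# constant requires the fill value to be provided explicitly
imputer = SimpleImputer(strategy="constant", fill_value=0.0)
print("constant", imputer.fit_transform(X).ravel())
```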
Let's walk through some code that handles missing values using the `SimpleImputer`.
First, we need a dataset. We'll use the `pandas` library to create one with some missing values.
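A sketch with made-up numbers, using two columns named `Feature1` and `Feature2`:

```python
import numpy as np
import pandas as pd

# Made-up numbers; each column has one missing value
df = pd.DataFrame({
    "Feature1": [2.0, 4.0, np.nan, 6.0],
    "Feature2": [10.0, np.nan, 30.0, 20.0],
})
print(df)
```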
Note that we use `np.nan` here instead of `None`. `None` is a Python singleton object representing missing values across all data types, while `np.nan` is a floating-point "Not a Number" value from the `numpy` library, specifically used for numeric data. `None` is versatile and not tied to any library, but it may cause errors in operations unless explicitly handled. In contrast, `np.nan` is tailored for numerical computations, supporting vectorized operations in `numpy` and `pandas`, making it more suitable for handling missing numerical values.
Here, we use the `SimpleImputer` from `sklearn.impute` to handle the missing values. In this case, we'll use the `mean` strategy, meaning the missing values are replaced with the mean value of the corresponding column. Note that missing values won't be taken into account when calculating the mean.
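A sketch, re-creating the made-up dataset so the snippet runs on its own:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up dataset with one missing value per column
df = pd.DataFrame({
    "Feature1": [2.0, 4.0, np.nan, 6.0],
    "Feature2": [10.0, np.nan, 30.0, 20.0],
})

# fit computes each column's mean (ignoring NaNs); transform fills the NaNs
imputer = SimpleImputer(strategy="mean")
imputed = imputer.fit_transform(df)
print(imputed)
# Feature1's NaN becomes 4.0 (the mean of 2, 4, 6);
# Feature2's NaN becomes 20.0 (the mean of 10, 30, 20)
```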
The result of the imputation is a NumPy array. Let's convert it back to a `DataFrame` for better readability.
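A sketch of the conversion, with the imputation repeated so the snippet is self-contained:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up dataset with one missing value per column
df = pd.DataFrame({
    "Feature1": [2.0, 4.0, np.nan, 6.0],
    "Feature2": [10.0, np.nan, 30.0, 20.0],
})

imputed = SimpleImputer(strategy="mean").fit_transform(df)

# fit_transform returns a plain NumPy array; wrap it back into a
# DataFrame, reusing the original column names
df_imputed = pd.DataFrame(imputed, columns=df.columns)
print(df_imputed)
```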
Notice how we use `df.columns` to assign the same column names we had before.
Sometimes, you may want to impute only specific columns in your dataset. You can achieve this by selecting those columns and applying the `SimpleImputer` to them. Here's how you can do it.
Let's use the same dataset that we created earlier.
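A sketch, re-creating the same made-up dataset and imputing only the `Feature1` column:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up dataset with one missing value per column
df = pd.DataFrame({
    "Feature1": [2.0, 4.0, np.nan, 6.0],
    "Feature2": [10.0, np.nan, 30.0, 20.0],
})

# Select Feature1 as a one-column DataFrame (note the double brackets),
# impute it, and write the result back; Feature2 is left untouched
imputer = SimpleImputer(strategy="mean")
df[["Feature1"]] = imputer.fit_transform(df[["Feature1"]])
print(df)  # Feature1's NaN is now 4.0; Feature2 still has its NaN
```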
In this example, the missing value in `Feature1` is replaced by the mean of the other values in that column. The `Feature2` column remains unchanged. This approach allows you to target specific columns that need imputation while leaving others untouched.
In the same manner, you can impute values into any subset of columns.
Great job! 🎉 You've learned why handling missing values is crucial, discovered different strategies to tackle missing data, and practiced using `SimpleImputer` to handle missing values in a sample dataset. Missing data is a common issue, but now you have the tools to manage it and improve the quality of your datasets.
Now that you've learned the theory, it's time to get hands-on practice! In the practice session, you'll handle missing values in various datasets, experiment with different imputation strategies, and observe the outcomes. This practice will help solidify your understanding and make you more confident in managing missing data for your machine learning projects. Let's get started! 🚀
