Introduction

Welcome to our Handling Missing Values lesson. Missing values in datasets can complicate analysis, and handling them incorrectly can lead to inaccurate results. In this lesson, we'll learn how to manage these values using Python's pandas library.

Missing Data in Datasets

Missing data in datasets is common. It occurs when no value is stored for a variable in one or more observations. It can introduce bias, make some functions inapplicable, and obscure meaningful patterns in the data. Consider a dataset of student scores:
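A minimal sketch of such a dataset follows. The names and scores are illustrative values assumed for this example, chosen so that "Charlie" has no score.

```python
import pandas as pd

# Illustrative data (assumed values): Charlie's score is missing.
data = {
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Score": [85, 88, None, 92],
}
df = pd.DataFrame(data)

# The numeric Score column is stored as floats, so None is shown as NaN.
print(df)
```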

"Charlie" has a missing score. It is entered as None, which pandas stores and displays as NaN in a numeric column.

Identifying Missing Values with Pandas

Before handling missing values, we must identify them. The pandas methods isnull() and notnull() do this. isnull() returns a DataFrame of the same shape in which each cell is True if the original value is missing and False otherwise.

From our student scores data:
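As a sketch, assuming the same illustrative four-student DataFrame (the values are assumptions for this example), isnull() can be applied like this:

```python
import pandas as pd

# Illustrative data (assumed values): Charlie's score is missing.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Score": [85, 88, None, 92],
})

# Boolean DataFrame with the same shape as df: True where a value is missing.
missing_mask = df.isnull()
print(missing_mask)

# Count the missing values in each column.
print(missing_mask.sum())
```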

The missing value for "Charlie" returns True when isnull() is used. notnull() works the same way but returns the opposite values: True marks a value that is present, and False marks a missing one.
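A short sketch of notnull() on the same illustrative data (assumed values) shows the inverted mask:

```python
import pandas as pd

# Illustrative data (assumed values): Charlie's score is missing.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Score": [85, 88, None, 92],
})

# True where a value is present, False where it is missing.
present_mask = df.notnull()
print(present_mask)

# notnull() is the element-wise negation of isnull().
print(present_mask.equals(~df.isnull()))
```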

Handling Missing Values: Removal, Part 1

After identifying missing values, the next step is handling them. The right strategy depends on the nature of our data and the purpose of the analysis. A common strategy is to remove rows that contain missing values using the dropna() method:
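A minimal sketch of dropna(), assuming an illustrative five-row dataset in which "Charlie" has no score and the fourth row has no name (all values are assumptions for this example):

```python
import pandas as pd

# Illustrative data (assumed values): row 3 lacks a score, row 4 lacks a name.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", None, "David"],
    "Score": [85, 88, None, 79, 92],
})

# Drop every row that has at least one missing value in any column.
cleaned = df.dropna()
print(cleaned)
```

dropna() returns a new DataFrame and leaves df unchanged, so the result has to be assigned if it is needed later.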

"Charlie"'s row is removed because it contains a missing score. The row with a missing name is removed as well.

Handling Missing Values: Removal, Part 2

To check only specific columns for missing values, pass their names to dropna() through the subset argument. Here's an example:
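As a sketch, assuming an illustrative five-row dataset where the third row lacks a score and the fourth row lacks a name (values are assumptions for this example), restricting the check to the Score column might look like this:

```python
import pandas as pd

# Illustrative data (assumed values): row 3 lacks a score, row 4 lacks a name.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", None, "David"],
    "Score": [85, 88, None, 79, 92],
})

# Only the Score column is checked; a missing Name does not cause removal.
cleaned = df.dropna(subset=["Score"])
print(cleaned)
```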

As you can see, the fourth row is not removed. Although it has a missing value in the Name column, this time we only remove rows with a missing Score.

Handling Missing Values: Replacement

Another strategy is to fill missing values with a specific value or a calculated value such as the mean, median, or mode. The fillna() function can achieve this:
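A minimal sketch of fillna() with a constant, assuming the illustrative four-student data (values are assumptions for this example):

```python
import pandas as pd

# Illustrative data (assumed values): Charlie's score is missing.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Score": [85, 88, None, 92],
})

# Replace every missing value with 0; df itself is left unchanged.
filled = df.fillna(0)
print(filled)
```

Filling with 0 is simple, but it lowers any average computed from the column, so it suits cases where a missing score really should count as zero.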

"Charlie"'s score is replaced with 0.

Additionally, you can use forward fill (ffill) to propagate the previous valid value into a gap, or backward fill (bfill) to propagate the next valid value:
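The sketch below shows both directions on the illustrative four-student data (values are assumptions for this example), using the DataFrame methods ffill() and bfill():

```python
import pandas as pd

# Illustrative data (assumed values): Charlie's score is missing.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Score": [85, 88, None, 92],
})

# Forward fill: the gap takes the previous valid value (Bob's 88.0).
forward = df.ffill()
print(forward)

# Backward fill: the gap takes the next valid value (David's 92.0).
backward = df.bfill()
print(backward)
```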

With backward fill, "Charlie"'s score is filled using the next available score, which is 92.0 from "David". In recent pandas versions (2.1 and later), passing method="ffill" or method="bfill" to fillna() is deprecated in favor of the dedicated methods df.ffill() and df.bfill().

Handling Missing Values in Real-world Scenarios

In real-world work, the right strategy for handling missing values depends on the nature of the data and the goal of the analysis. If we're analyzing average student scores, it may be better to fill missing values with the mean of the non-missing values. Here's an example:
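A minimal sketch of mean imputation, assuming the illustrative four-student data (values are assumptions for this example):

```python
import pandas as pd

# Illustrative data (assumed values): Charlie's score is missing.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Score": [85, 88, None, 92],
})

# mean() skips missing values by default, so this averages 85, 88 and 92.
mean_score = df["Score"].mean()

# Fill the gap with that mean and assign the result back to the column.
df["Score"] = df["Score"].fillna(mean_score)
print(df)
```

Because the filled value equals the mean of the observed scores, the column average is the same before and after filling.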

Summary

Handling missing values produces cleaner datasets, which offer a better basis for data analysis. You've now learned how to identify, remove, and fill missing values with pandas! Are you ready to practice what we've covered? A hands-on approach is the best way to learn and understand data analysis. You're doing great! Let's keep going!
