Welcome to our Handling Missing Values lesson. Missing values in datasets can complicate data analysis, and handling them incorrectly can lead to inaccurate results. So, we'll learn how to manage these values using Python's Pandas library.
Missing data in datasets is common. It occurs when no value is stored for a variable in a given observation. It can cause bias, make some functions inapplicable, and obscure insightful data patterns. Consider a dataset of student scores in which "Charlie" has a missing score (None).
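As a minimal sketch, such a dataset could be built like this. Only Charlie's missing score and David's score of 92 come from this lesson; the other names and scores are assumed for illustration.

```python
import pandas as pd

# Illustrative data: apart from Charlie's None and David's 92,
# the names and scores are assumptions made for this sketch.
data = {
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Score": [85, 88, None, 92],
}
df = pd.DataFrame(data)

# In a numeric column, pandas stores None as NaN, so Score becomes a float column
print(df)
```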
Before handling missing values, we must identify them. Pandas' functions isnull() and notnull() can perform this task. isnull() returns a DataFrame where each cell is either True or False depending on that cell's null status.
From our student scores data:
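Here is a small sketch of isnull() in action. The DataFrame is rebuilt so the snippet runs on its own, and every value except Charlie's missing score is assumed.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Score": [85, 88, None, 92],  # assumed values, except Charlie's None
})

# Same shape as df: True where a value is missing, False otherwise
null_mask = df.isnull()

# Summing the mask counts the missing values in each column
missing_per_column = df.isnull().sum()

print(null_mask)
print(missing_per_column)
```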
The None (missing) value for "Charlie" returns True when isnull() is used. notnull() works similarly but returns exactly the opposite values: True marks a value that is present.
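The opposite check can be sketched the same way. The data is again assumed apart from Charlie's missing score, and using the mask to filter rows is just one possible use.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Score": [85, 88, None, 92],  # assumed values, except Charlie's None
})

# True where a value is present, False where it is missing
present_mask = df.notnull()

# A boolean mask on one column can also select the rows that have a score
rows_with_score = df[df["Score"].notnull()]

print(present_mask)
print(rows_with_score)
```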
After identifying missing values, the next step is handling them. The strategy depends on the nature of our data and analysis purpose. A common strategy is to remove rows with None values using the dropna() function:
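A minimal sketch of dropna() follows. It assumes a variant of the student data whose fourth row has a missing name; that row's score and the other present values are made up for illustration.

```python
import pandas as pd

# Assumed variant of the student data: Charlie has no score,
# and the fourth row has no name.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", None],
    "Score": [85, 88, None, 92],
})

# With no arguments, dropna() drops every row that has at least one missing value
cleaned = df.dropna()

print(cleaned)
```

dropna() returns a new DataFrame here, so df itself still contains all four rows.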
"Charlie"'s row is removed because it contains a null value. The row with a missing name is removed as well.
To check only specific columns for missing values with dropna(), use the subset argument. Here's an example:
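The sketch below restricts the check to the Score column. It assumes the same illustrative data as before, with a missing score for Charlie and a missing name in the fourth row.

```python
import pandas as pd

# Assumed data: Charlie has no score, and the fourth row has no name
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", None],
    "Score": [85, 88, None, 92],
})

# Only the Score column is checked, so a missing Name does not drop a row
cleaned_scores = df.dropna(subset=["Score"])

print(cleaned_scores)
```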
As you can see, the fourth row is not removed. Although it contains a missing value in the Name column, this time we only remove rows with a missing Score.
Another strategy is to fill missing values with a specific value or a calculated value such as the mean, median, or mode. The fillna() function can achieve this. For instance, filling with 0 replaces "Charlie"'s missing score with 0.
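Filling with 0 could look like the sketch below. The data is assumed apart from Charlie's missing score, and passing a dictionary to fillna() is a choice made here to limit the fill to the Score column.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Score": [85, 88, None, 92],  # assumed values, except Charlie's None
})

# The dictionary maps a column name to the value used to fill its missing entries
filled = df.fillna({"Score": 0})

print(filled)
```

Whether 0 is a sensible replacement depends on the analysis: here it would pull the average score down.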
Additionally, you can use forward fill (ffill) to propagate the previous valid value or backward fill (bfill) to propagate the next one:
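Both directions can be sketched with the dedicated ffill() and bfill() methods. Bob's 88 is an assumed value; Charlie's missing score and David's 92 follow the lesson.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Score": [85, 88, None, 92],  # Alice's and Bob's scores are assumed
})

# Forward fill: copy the last valid value downward (Bob's score here)
forward_filled = df.ffill()

# Backward fill: copy the next valid value upward (David's score here)
backward_filled = df.bfill()

print(forward_filled)
print(backward_filled)
```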
Here, with backward fill, "Charlie"'s score is filled using the next available score, which is 92.0 from "David". Recent pandas versions provide the dedicated methods df.ffill() and df.bfill() for this.
In practice, the right strategy for handling missing values depends on the nature of the data and the goal of the analysis. If we're analyzing average student scores, it may be better to fill missing values with the mean of the non-missing values. Here's an example:
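A minimal sketch of mean imputation is shown below, with scores that are assumed apart from Charlie's missing value and David's 92.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Score": [85, 88, None, 92],  # Alice's and Bob's scores are assumed
})

# mean() skips missing values by default, so this averages 85, 88 and 92
mean_score = df["Score"].mean()

# Fill only the Score column with that mean
filled = df.fillna({"Score": mean_score})

print(filled)
```

Filling with the mean leaves the average of the Score column unchanged, which is why it suits this kind of analysis.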
Handling missing values creates clean datasets, which offer a better basis for data analysis. You've now learned how to handle missing values in Python datasets! Are you ready to practice what we've covered? A hands-on approach is the best way to learn and understand data analysis. You're doing great! Let's keep going!
