Let's tackle a common data problem: missing values! These are blank or empty cells in your dataset.
Look at this car table:
See those empty cells? Those are missing values, and they're everywhere in real data.
Engagement Message
What might cause these blanks to appear?
Missing values happen for many reasons. Sometimes people skip survey questions, equipment fails to record measurements, or data gets lost during transfers.
Other times, the missing value actually means something. For example, a blank "Income" field might indicate privacy concerns.
Engagement Message
Can you think of a situation where someone might intentionally leave data blank?
Missing values create problems for machine learning models. Most models can't handle blank cells - they need actual numbers or categories to work with.
It's like trying to calculate the average age of a group when some people won't tell you their age!
Engagement Message
What would happen if you tried to calculate average car year with that blank cell?
You have two main strategies to handle missing values: remove the problematic rows entirely, or fill in the blanks with reasonable guesses.
Removing rows is simple, but you lose information. Filling blanks keeps all your data, but requires making assumptions.
Engagement Message
Which approach sounds better for a dataset with very few examples?
Let's explore removal first. If someone didn't provide their car's year, you could delete that entire row from your dataset.
Here's what the table looks like after removing rows with missing data:
As you can see, this approach does not work well if you have only a small dataset or if many rows have missing values. Removing too many rows can shrink your dataset significantly and leave you with too little data to analyze.
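To make row removal concrete, here is a minimal Python sketch. The car rows are made-up values for illustration (not the lesson's table), `None` stands in for a blank cell, and `drop_incomplete_rows` is a hypothetical helper name.

```python
# Hypothetical car table; None marks a missing value.
cars = [
    {"brand": "Toyota", "year": 2019, "price": 18000},
    {"brand": "Honda", "year": None, "price": 16500},
    {"brand": "Ford", "year": 2017, "price": None},
    {"brand": "Toyota", "year": 2018, "price": 15500},
]


def drop_incomplete_rows(rows):
    """Keep only the rows in which every field has a value."""
    return [
        row for row in rows
        if all(value is not None for value in row.values())
    ]


complete_cars = drop_incomplete_rows(cars)
print(f"Kept {len(complete_cars)} of {len(cars)} rows")  # Kept 2 of 4 rows
```

In this tiny example, two blank cells cost half of the rows, which is exactly the risk with small datasets.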
Engagement Message
When might removing rows be a bad idea?
The alternative is imputation: filling missing values with educated guesses. You might use the average (for numbers) or the most common value (for categories).
For our missing car year, you could fill it with the average year of all other cars in your dataset.
Here's how the table might look after imputing missing values (let's say we fill the missing year with 2018 and the missing price with $16,750):
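As a rough sketch of how imputation could be done in Python: the rows below are made up, chosen so that the averages come out to the 2018 and $16,750 used above. The `impute` helper and its `strategy` parameter are hypothetical names, and `None` stands in for a blank cell.

```python
from statistics import mean, mode

# Hypothetical car table; None marks a missing value.
cars = [
    {"brand": "Toyota", "year": 2019, "price": 18000},
    {"brand": "Honda", "year": None, "price": 16500},
    {"brand": "Ford", "year": 2017, "price": None},
    {"brand": None, "year": 2018, "price": 15500},
    {"brand": "Toyota", "year": 2018, "price": 17000},
]


def impute(rows, column, strategy):
    """Fill blanks in one column with strategy(known values)."""
    known = [row[column] for row in rows if row[column] is not None]
    fill_value = strategy(known)
    return [
        {**row, column: fill_value if row[column] is None else row[column]}
        for row in rows
    ]


# Numbers: use the average of the known values.
filled = impute(cars, "year", lambda values: round(mean(values)))
filled = impute(filled, "price", mean)
# Categories: use the most common known value.
filled = impute(filled, "brand", mode)

for row in filled:
    print(row)
```

Every row survives, but three of the cells now hold guesses rather than observations.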
Engagement Message
Which value would you impute for a missing car brand?
Here's a useful rule of thumb: if less than 5% of your data is missing, removal usually works fine. If 10-20% is missing, consider imputation. Between those ranges, either approach can work, so weigh how much data you can afford to lose.
If more than 30% of a column is missing, that feature might not be reliable enough to use at all!
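The thresholds above could be turned into a small helper like the one below. This is only a sketch: the function names and sample data are hypothetical, and percentages that fall between the stated ranges are labeled as a judgment call rather than forced into one strategy.

```python
def missing_fraction(rows, column):
    """Share of rows (0.0 to 1.0) whose value in `column` is blank."""
    return sum(row[column] is None for row in rows) / len(rows)


def suggest_strategy(fraction):
    """Map a missing-value fraction to the rule of thumb from the lesson."""
    if fraction == 0:
        return "nothing to do"
    if fraction < 0.05:
        return "remove rows"
    if 0.10 <= fraction <= 0.20:
        return "impute"
    if fraction > 0.30:
        return "consider dropping the column"
    # 5-10% and 20-30% are not covered by the rule of thumb.
    return "judgment call"


# Hypothetical data: 20 rows, with the first 3 ages missing (15%).
people = [{"age": None if i < 3 else 30 + i} for i in range(20)]

fraction = missing_fraction(people, "age")
print(f"{fraction:.0%} missing: {suggest_strategy(fraction)}")
```

Treat the result as a starting point. Why the values are missing matters as much as how many are missing.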
Engagement Message
What would you do with a column where 90% of values are missing?
Type
Sort Into Boxes
Practice Question
Let's practice choosing the right strategy for different missing value scenarios. Sort each situation into the best approach.
Labels
- First Box Label: Remove Rows
- Second Box Label: Fill Missing Values
First Box Items
- 2% missing prices
- Equipment failure
Second Box Items
- 25% missing ages
- Survey skip pattern
- 45% missing income
- Random blank cells
