Section 1 - Instruction

Let's tackle a common data problem: missing values! These are blank or empty cells in your dataset.

Look at this car table:

Brand    Year    Price
Toyota   2018    $15,000
Ford             $18,500
BMW      2017

See those empty cells? Those are missing values, and they're everywhere in real data.
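As a minimal sketch, here is one way the car table could be written in plain Python and checked for blanks. Using a list of dictionaries with None for each blank cell is an assumption of this example, and the helper name count_missing is illustrative.

```python
# The car table from this lesson; None marks a blank cell.
cars = [
    {"Brand": "Toyota", "Year": 2018, "Price": 15000},
    {"Brand": "Ford", "Year": None, "Price": 18500},
    {"Brand": "BMW", "Year": 2017, "Price": None},
]


def count_missing(rows):
    """Count the blank (None) cells in each column."""
    counts = {column: 0 for column in rows[0]}
    for row in rows:
        for column, value in row.items():
            if value is None:
                counts[column] += 1
    return counts


missing_counts = count_missing(cars)
print(missing_counts)  # {'Brand': 0, 'Year': 1, 'Price': 1}
```

Data libraries such as pandas offer similar checks built in, but the idea is the same: find out where the blanks are before deciding what to do about them.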

Engagement Message

What might cause these blanks to appear?

Section 2 - Instruction

Missing values happen for many reasons. Sometimes people skip survey questions, equipment fails to record measurements, or data gets lost during transfers.

Other times, the missing value actually means something - for example, a blank "Income" field might indicate privacy concerns.

Engagement Message

Can you think of a situation where someone might intentionally leave data blank?

Section 3 - Instruction

Missing values create problems for machine learning models. Most models can't handle blank cells - they need actual numbers or categories to work with.

It's like trying to calculate the average age of a group when some people won't tell you their age!
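The age analogy can be sketched in a few lines of Python. The ages below are made up for illustration, and the two ways of marking a blank (None and NaN) are assumptions about how the data might be stored.

```python
import math

# Hypothetical ages for a small group; the third person did not answer.
ages_with_none = [25, 31, None, 40]
ages_with_nan = [25, 31, math.nan, 40]

# A None cell cannot be added to numbers, so the calculation fails outright.
try:
    average = sum(ages_with_none) / len(ages_with_none)
except TypeError as error:
    average = None
    print("Cannot compute the average:", error)

# A NaN cell raises no error, but it silently turns the result into NaN.
nan_average = sum(ages_with_nan) / len(ages_with_nan)
print(math.isnan(nan_average))  # True
```

Either way, the blank has to be dealt with before the number means anything.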

Engagement Message

What would happen if you tried to calculate average car year with that blank cell?

Section 4 - Instruction

You have two main strategies to handle missing values: remove the problematic rows entirely, or fill in the blanks with reasonable guesses.

Removing rows is simple, but you lose information. Filling blanks keeps all of your data, but it requires making assumptions.

Engagement Message

Which approach sounds better for a dataset with very few examples?

Section 5 - Instruction

Let's explore removal first. If someone didn't provide their car's year, you could delete that entire row from your dataset.

Here's what the table looks like after removing rows with missing data:

Brand    Year    Price
Toyota   2018    $15,000

Only one of the three rows survives. Removal works poorly when your dataset is small or when many rows have missing values, because it can leave you with too little data to analyze.
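Row removal on the car table could be sketched like this in plain Python. Representing blanks as None and the helper name drop_incomplete_rows are assumptions of the example.

```python
# The car table from this lesson; None marks a blank cell.
cars = [
    {"Brand": "Toyota", "Year": 2018, "Price": 15000},
    {"Brand": "Ford", "Year": None, "Price": 18500},
    {"Brand": "BMW", "Year": 2017, "Price": None},
]


def drop_incomplete_rows(rows):
    """Keep only the rows in which no cell is blank (None)."""
    return [row for row in rows if None not in row.values()]


complete_cars = drop_incomplete_rows(cars)
print(complete_cars)  # [{'Brand': 'Toyota', 'Year': 2018, 'Price': 15000}]
print(len(complete_cars), "of", len(cars), "rows kept")  # 1 of 3 rows kept
```

Printing how many rows were kept is a cheap way to notice when removal is throwing away most of the dataset.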

Engagement Message

When might removing rows be a bad idea?

Section 6 - Instruction

The alternative is imputation - filling missing values with educated guesses. You might use the average (for numbers) or the most common value (for categories).

You could fill our missing car year with the average year of the other cars in your dataset.

Here's how the table might look after imputing missing values. The missing year becomes 2018 (the average of 2018 and 2017 is 2017.5, rounded to a whole year), and the missing price becomes $16,750 (the average of $15,000 and $18,500):

Brand    Year    Price
Toyota   2018    $15,000
Ford     2018    $18,500
BMW      2017    $16,750
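The imputed values can be reproduced with a short sketch using Python's statistics module. The helper names known_values and impute are illustrative, and the brand list at the end is hypothetical data added only to show the most-common-value case.

```python
from statistics import mean, mode

# The car table from this lesson; None marks a blank cell.
cars = [
    {"Brand": "Toyota", "Year": 2018, "Price": 15000},
    {"Brand": "Ford", "Year": None, "Price": 18500},
    {"Brand": "BMW", "Year": 2017, "Price": None},
]


def known_values(rows, column):
    """Collect the non-missing values of one column."""
    return [row[column] for row in rows if row[column] is not None]


def impute(rows, column, fill_value):
    """Return copies of the rows with blanks in one column filled in."""
    return [
        {**row, column: fill_value if row[column] is None else row[column]}
        for row in rows
    ]


# Numeric columns: fill with the mean of the known values.
# The mean year is 2017.5; round() sends exact halves to the nearest even
# integer, which gives 2018 here.
year_fill = round(mean(known_values(cars, "Year")))
price_fill = mean(known_values(cars, "Price"))  # 16750

filled_cars = impute(impute(cars, "Year", year_fill), "Price", price_fill)
for row in filled_cars:
    print(row)

# Categorical columns: fill with the most common known value.
# These brands are hypothetical, not part of the lesson's table.
survey_brands = ["Toyota", "Ford", "Toyota", None]
brand_fill = mode([brand for brand in survey_brands if brand is not None])
print(brand_fill)  # Toyota
```

The original rows are left untouched, so the filled table can be compared against the raw one.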

Engagement Message

Which value would you impute for a missing car brand?

Section 7 - Instruction

Here's a rough rule of thumb: if less than 5% of your data is missing, removal usually works fine. If 10-20% is missing, consider imputation. Treat these thresholds as guides rather than hard cutoffs.

If more than 30% of a column is missing, that feature might not be reliable enough to use at all!
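These thresholds could be sketched as a small helper. The function names are illustrative, and the "judgment call" answer for the ranges the lesson does not cover (5-10% and 20-30%) is an assumption of this sketch, not part of the rule.

```python
def missing_share(values):
    """Fraction of entries in a column that are blank (None)."""
    return sum(value is None for value in values) / len(values)


def suggest_strategy(share):
    """Map a missing share (0.0 to 1.0) to the lesson's rough guidance."""
    if share < 0.05:
        return "remove rows"
    if 0.10 <= share <= 0.20:
        return "impute"
    if share > 0.30:
        return "consider dropping the column"
    # Ranges the lesson leaves open (assumption of this sketch).
    return "judgment call"


# The Year column from the car table: one blank out of three values.
year_column = [2018, None, 2017]
share = missing_share(year_column)
print(f"{share:.0%} missing -> {suggest_strategy(share)}")

# Hypothetical shares for comparison.
print(suggest_strategy(0.02))
print(suggest_strategy(0.15))
```

With only three rows, a single blank already counts as a third of the column, which is one reason percentages are more meaningful on larger datasets.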

Engagement Message

What would you do with a column where 90% of values are missing?

Section 8 - Practice

Type

Sort Into Boxes

Practice Question

Let's practice choosing the right strategy for different missing value scenarios. Sort each situation into the best approach.

Labels

  • First Box Label: Remove Rows
  • Second Box Label: Fill Missing Values

First Box Items

  • 2% missing prices
  • Equipment failure

Second Box Items

  • 25% missing ages
  • Survey skip pattern
  • 45% missing income
  • Random blank cells