Section 1 - Instruction

Now that you can spot messy data, let's learn how to fix it systematically! Random cleaning approaches often create more problems than they solve.

Professional data analysts follow proven strategies that protect data integrity while fixing issues.

Engagement Message

What is one risk of randomly deleting rows with missing data?

Section 2 - Instruction

For missing values, you have three main strategies: deletion, imputation, or leaving them as-is.

Deletion removes the affected rows or columns. Imputation fills the gaps with estimates. Sometimes leaving the gaps empty is actually the right choice.
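
Here is a minimal sketch of the three strategies in plain Python. The sample rows, the field names, and the choice of the median as the estimate are assumptions made for illustration, not a prescribed method.

```python
from statistics import median

# Hypothetical sample data: None marks a missing age.
rows = [
    {"name": "Ana", "age": 34},
    {"name": "Ben", "age": None},
    {"name": "Cy", "age": 28},
    {"name": "Di", "age": 41},
]

# Strategy 1: deletion. Drop every row whose age is missing.
deleted = [r for r in rows if r["age"] is not None]

# Strategy 2: imputation. Fill gaps with an estimate (here, the median
# of the known ages; the mean or a model-based guess are alternatives).
known_ages = [r["age"] for r in rows if r["age"] is not None]
fill_value = median(known_ages)
imputed = [
    {**r, "age": r["age"] if r["age"] is not None else fill_value}
    for r in rows
]

# Strategy 3: leave as-is. Keep None in the data and skip it only
# when a calculation needs a number.
mean_known_age = sum(known_ages) / len(known_ages)
```

Note that none of the three strategies modifies `rows` itself, so you can compare the results before committing to one.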

Engagement Message

Which strategy would you choose if only 2% of ages were missing?

Section 3 - Instruction

Outliers are extreme values that seem out of place - like a $50,000 salary in a dataset where every other salary is around $50.

First, verify whether the value is real (a CEO's salary?) or an error (a missing decimal point?). Real outliers might stay; errors get fixed or removed.
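
Before you can verify an outlier, you have to find it. The sketch below flags candidates using the 1.5 x IQR rule, which is one common rule of thumb and an assumption here, not the only option. The salary values are made up. Notice that the code only flags values for review; it does not delete anything.

```python
from statistics import quantiles

# Hypothetical salaries: most sit near 50, one is far away.
salaries = [48, 50, 52, 49, 51, 50, 50000]

# Quartiles of the data; the interquartile range (IQR) is Q3 - Q1.
q1, _, q3 = quantiles(salaries, n=4)
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles is a candidate outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Flag for manual review rather than removing automatically.
flagged = [s for s in salaries if s < lower_fence or s > upper_fence]
```

Each flagged value still needs a human check against the source before you decide to keep, fix, or remove it.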

Engagement Message

What's one check you could do to verify whether a $50,000 salary is valid?

Section 4 - Instruction

Formatting inconsistencies need standardization. Pick one format and convert everything to match.

For example, choose "United States" and convert all "USA" and "US" entries to match this standard format.
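
A small lookup table is often enough for this kind of standardization. In this sketch, the `CANONICAL` mapping and the `standardize_country` helper are hypothetical names, and the list of variants is an assumption you would extend for your own data.

```python
# Map every known variant (lowercased, without periods) to one standard name.
CANONICAL = {
    "usa": "United States",
    "us": "United States",
    "united states": "United States",
}

def standardize_country(value):
    """Return the standard country name, or the trimmed input if unknown."""
    key = value.strip().lower().replace(".", "")
    return CANONICAL.get(key, value.strip())

raw = ["USA", "US", " us ", "U.S.A.", "United States", "Canada"]
cleaned = [standardize_country(v) for v in raw]
```

Unknown values such as "Canada" pass through unchanged, so nothing is silently rewritten to the wrong country.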

Engagement Message

Can you give one benefit of standardizing country names?

Section 5 - Instruction

Sometimes you need to exclude data entirely. This requires clear, defensible reasoning that you can explain to stakeholders.

"We excluded 2019 data because the tracking system changed" is defensible. "We excluded it because results looked weird" is not.
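
One way to keep exclusions defensible is to store the reason next to the rule itself. This is a minimal sketch with made-up records; the `EXCLUSIONS` table and the field names are hypothetical.

```python
# Hypothetical records; the years and values are made up for illustration.
records = [
    {"year": 2018, "visits": 120},
    {"year": 2019, "visits": 45},
    {"year": 2020, "visits": 130},
]

# Each exclusion rule carries the reason you would give a stakeholder.
EXCLUSIONS = {2019: "tracking system changed"}

# Rows that stay in the analysis.
kept = [r for r in records if r["year"] not in EXCLUSIONS]

# Rows that were set aside, each labeled with why.
excluded = [
    {**r, "reason": EXCLUSIONS[r["year"]]}
    for r in records
    if r["year"] in EXCLUSIONS
]
```

If you cannot write a clear reason into the table, that is a sign the exclusion may not be defensible.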

Engagement Message

What is one defensible reason for excluding a whole month's data?

Section 6 - Instruction

Document every cleaning decision you make! Future you (and your team) will thank you when someone asks "Why did we do that?"
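
Documentation can be as simple as a running log that you fill in while you clean. The sketch below assumes a hypothetical `log_decision` helper, and the entries (including the row counts) are invented examples.

```python
# A running record of every cleaning decision.
cleaning_log = []

def log_decision(issue, action, reason, rows_affected):
    """Append one cleaning decision to the log."""
    cleaning_log.append({
        "issue": issue,
        "action": action,
        "reason": reason,
        "rows_affected": rows_affected,
    })

# Hypothetical entries; the counts are made up for illustration.
log_decision("missing ages", "imputed with median", "few values missing", 2)
log_decision("country names", "standardized to 'United States'", "mixed formats", 15)

# Print a plain-text summary that can be shared with the team.
for entry in cleaning_log:
    print(f"{entry['issue']}: {entry['action']} ({entry['reason']}, "
          f"{entry['rows_affected']} rows)")
```

Saving this log alongside the cleaned dataset lets anyone answer "Why did we do that?" without guessing.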
