Now that you can spot messy data, let's learn how to fix it systematically! Ad hoc cleaning often creates more problems than it solves.
Professional data analysts follow proven strategies that protect data integrity while fixing issues.
Engagement Message
What is one risk of randomly deleting rows with missing data?
For missing values, you have three main strategies: deletion, imputation, or leaving them as-is.
Deletion removes the affected rows or columns. Imputation fills gaps with estimates, such as the mean or median. Sometimes leaving values empty is actually the right choice.
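Here is a minimal Python sketch of the three strategies side by side. The `ages` list is made-up sample data, and filling with the median is just one common imputation choice, not the only one.

```python
from statistics import median

# Hypothetical sample: ages with gaps recorded as None.
ages = [34, None, 29, 41, None, 38]

# Strategy 1: deletion. Drop the entries that have no age.
deleted = [a for a in ages if a is not None]

# Strategy 2: imputation. Fill each gap with the median of the known ages
# (an assumed choice; the mean or a model-based estimate are alternatives).
fill_value = median(deleted)
imputed = [fill_value if a is None else a for a in ages]

# Strategy 3: leave as-is, but measure how much is missing.
missing_share = ages.count(None) / len(ages)

print("after deletion:", deleted)
print("after imputation:", imputed)
print(f"share missing: {missing_share:.0%}")
```

Knowing the share of missing values first usually makes the choice between the three strategies easier.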
Engagement Message
Which strategy would you choose if only 2% of ages were missing?
Outliers are extreme values that seem out of place, like a $50,000 salary in a dataset where the other salaries are around $50.
First, verify whether it's real (a CEO's salary?) or an error (a missing decimal point?). Real outliers might stay; errors get fixed or removed.
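One way to find candidates for that check is to flag values that sit far from the typical value. The sketch below uses made-up salaries and an assumed rule of thumb (more than 10 times the median); the threshold is illustrative, not a standard.

```python
from statistics import median

# Hypothetical salaries; one entry is far from the rest.
salaries = [50, 48, 52, 50000, 51, 49]

# Assumed rule of thumb: flag anything more than 10x the median for review.
THRESHOLD = 10
typical = median(salaries)
flagged = [s for s in salaries if s > THRESHOLD * typical]

# Flagged values are only candidates. Check each one against the source
# before deciding to keep, correct, or remove it.
print(f"median: {typical}, flagged for review: {flagged}")
```

Note that the code only flags values. Deciding whether a flagged value is real or an error still takes a human check.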
Engagement Message
What's one check you could do to verify whether a $50,000 salary is valid?
Formatting inconsistencies need standardization. Pick one format and convert everything to match.
For example, choose "United States" as the standard and convert every "USA" and "US" entry to it.
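A lookup table is a simple way to apply that kind of rule consistently. This is a small sketch: the `COUNTRY_ALIASES` table and the `standardize_country` helper are hypothetical names, and the list of variants is assumed.

```python
# Assumed variants, all mapped to the chosen standard "United States".
COUNTRY_ALIASES = {
    "usa": "United States",
    "us": "United States",
    "united states": "United States",
}

def standardize_country(value):
    """Return the standard name for known variants, else the trimmed input."""
    trimmed = value.strip()
    return COUNTRY_ALIASES.get(trimmed.lower(), trimmed)

countries = ["USA", "US", "United States", " usa ", "Canada"]
cleaned = [standardize_country(c) for c in countries]
print(cleaned)
```

Values that are not in the table pass through unchanged, so unknown spellings stay visible instead of being silently rewritten.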
Engagement Message
Can you give one benefit of standardizing country names?
Sometimes you need to exclude data entirely. This requires clear, defensible reasoning that you can explain to stakeholders.
"We excluded 2019 data because the tracking system changed" is defensible. "We excluded it because results looked weird" is not.
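In code, it helps to keep the reason right next to the exclusion rule so the two never get separated. A minimal sketch, with made-up records and a hypothetical `EXCLUSIONS` table:

```python
# Hypothetical yearly records.
records = [
    {"year": 2018, "visits": 120},
    {"year": 2019, "visits": 310},
    {"year": 2020, "visits": 135},
]

# Each excluded year carries its stated reason.
EXCLUSIONS = {2019: "tracking system changed"}

kept = [r for r in records if r["year"] not in EXCLUSIONS]

# Keep the excluded rows too, tagged with why they were left out.
excluded = [
    {**r, "reason": EXCLUSIONS[r["year"]]}
    for r in records
    if r["year"] in EXCLUSIONS
]

print("kept:", kept)
print("excluded:", excluded)
```

Because the excluded rows are set aside rather than thrown away, the decision can be revisited later.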
Engagement Message
What is one defensible reason for excluding a whole month's data?
Document every cleaning decision you make! Future you (and your team) will thank you when someone asks "Why did we do that?"
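A cleaning log does not need to be fancy. The sketch below keeps one entry per decision; the `log_decision` helper, the field names, and the row counts are all hypothetical.

```python
import json
from datetime import date

cleaning_log = []

def log_decision(step, reason, rows_affected):
    """Append one cleaning decision to the log."""
    cleaning_log.append({
        "date": date.today().isoformat(),
        "step": step,
        "reason": reason,
        "rows_affected": rows_affected,
    })

# Example entries with made-up row counts.
log_decision("impute age", "small share of ages missing; filled with median", 4)
log_decision("standardize country", "merged 'USA' and 'US' into 'United States'", 57)

print(json.dumps(cleaning_log, indent=2))
```

Saving this output alongside the cleaned dataset gives your team a ready answer to "Why did we do that?"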
