Welcome to working with real-world data! Here's the truth: most business data you'll encounter is messy, incomplete, and inconsistent.
Unlike textbook examples, real data comes with problems that need fixing before analysis.
Engagement Message
What’s one clue that told you a dataset was "off" or incomplete?
Let's start with missing values - gaps where data should exist but doesn't. Imagine a customer database where some people didn't provide their age or income.
These blank cells leave holes in your analysis and can lead to wrong conclusions, especially when the customers who skipped a field differ from those who filled it in.
Engagement Message
Can you think of why someone might skip providing their income?
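To make missing values concrete, here is a minimal sketch that counts the gaps in each field of a tiny customer list. The records, the field names, and the choice to treat both None and blank text as "missing" are assumptions made up for illustration; real datasets may mark gaps in other ways.

```python
# Hypothetical customer records; None or "" marks a value the customer did not provide.
customers = [
    {"name": "Ana", "age": 34, "income": 52000},
    {"name": "Ben", "age": None, "income": 61000},
    {"name": "Cara", "age": 29, "income": None},
    {"name": "Dev", "age": None, "income": ""},
]


def is_missing(value):
    """Treat None and blank strings as missing (an assumption for this sketch)."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def missing_counts(rows):
    """Count missing values per field across all rows."""
    counts = {}
    for row in rows:
        for field, value in row.items():
            counts[field] = counts.get(field, 0) + (1 if is_missing(value) else 0)
    return counts


counts = missing_counts(customers)
for field, n in counts.items():
    # Show each field's gap count and its share of all rows.
    print(f"{field}: {n} of {len(customers)} missing ({n / len(customers):.0%})")
```

A per-field count like this is usually the first thing worth checking, because it tells you which columns you can lean on and which ones need attention before analysis.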
Next up: duplicate records. These happen when the same customer, transaction, or product appears multiple times in your dataset.
For example, if John Smith appears twice with slightly different spellings, your analysis might count him as two different customers.
Engagement Message
How might duplicates distort your count of customers?
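The John Smith example can be sketched in a few lines. This version assumes that an email address identifies a customer, which is a simplification chosen for illustration; the records and helper names are hypothetical, and real duplicate detection often needs more than one field.

```python
# Hypothetical records: the same customer entered twice with small differences.
records = [
    {"name": "John Smith", "email": "john.smith@example.com"},
    {"name": "Jon Smith ", "email": "John.Smith@Example.com"},
    {"name": "Maria Lopez", "email": "maria.lopez@example.com"},
]


def email_key(record):
    """Normalize the email so letter case and stray spaces do not hide a match."""
    return record["email"].strip().lower()


def group_by_key(rows, key_func):
    """Group rows that share the same key."""
    groups = {}
    for row in rows:
        groups.setdefault(key_func(row), []).append(row)
    return groups


groups = group_by_key(records, email_key)
# Any key with more than one row is a likely duplicate.
duplicates = {key: rows for key, rows in groups.items() if len(rows) > 1}

print("Rows in dataset:", len(records))
print("Distinct customers by email:", len(groups))
for key, rows in duplicates.items():
    print(key, "->", [row["name"] for row in rows])
```

Counting rows reports three customers, while grouping by the normalized email reports two. That gap is exactly the overcount the John Smith example describes.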
Finally, inconsistencies - when the same information is recorded differently across your data. Think "USA", "United States", and "US" all meaning the same country.
Or dates written as "Jan 15" in one place and "1/15" in another.
Engagement Message
What problems might this create when analyzing sales by country?
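Here is a small sketch of how the "USA" / "United States" / "US" problem shows up in a sales total, and one possible way to tidy it. The sales figures and the alias table are invented for illustration; a real project would need a mapping that covers the variants actually present in its data.

```python
# Hypothetical sales rows with the same country written three ways.
sales = [
    {"country": "USA", "amount": 120.0},
    {"country": "United States", "amount": 80.0},
    {"country": "US", "amount": 50.0},
    {"country": "Canada", "amount": 70.0},
]

# Assumed mapping from lowercase variants to one canonical label.
COUNTRY_ALIASES = {
    "usa": "United States",
    "us": "United States",
    "united states": "United States",
}


def canonical_country(raw):
    """Return the canonical label, or the trimmed input if no alias is known."""
    cleaned = raw.strip()
    return COUNTRY_ALIASES.get(cleaned.lower(), cleaned)


def total_by(rows, label_func):
    """Sum amounts per country label, after applying label_func to each label."""
    totals = {}
    for row in rows:
        label = label_func(row["country"])
        totals[label] = totals.get(label, 0.0) + row["amount"]
    return totals


raw_totals = total_by(sales, lambda label: label)  # labels used as recorded
clean_totals = total_by(sales, canonical_country)  # labels standardized first

print("As recorded: ", raw_totals)
print("Standardized:", clean_totals)
```

With the labels as recorded, sales for one country are split across three separate totals. After standardizing, they combine into a single figure that can be compared fairly against other countries.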
Why does this matter for business decisions? Messy data leads to wrong insights, which lead to poor decisions.
If 30% of your customers' ages are missing, can you trust an analysis of how preferences vary by age?
Engagement Message
What business decision might go wrong with incomplete customer data?
The good news: recognizing these issues is the first step to fixing them. In upcoming units, we'll learn systematic approaches to cleaning data.
