We've covered two crucial data cleaning steps: handling missing values and scaling numeric features. Now it's time to practice deciding which technique to use in different situations.
Engagement Message
Are you ready?
Type
Multiple Choice
Practice Question
Your dataset has 10,000 rows. The 'Age' value is missing in 200 of them (2% of the data). Since so few rows are affected, what is the simplest strategy to handle this?
A. Impute the missing values with the mean age. B. Remove the 200 rows with missing ages. C. Delete the entire 'Age' column. D. Replace missing ages with zero.
Suggested Answers
- A
- B - Correct
- C
- D
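To see what removing rows looks like in practice, here is a minimal sketch in plain Python. The toy rows, the column name `age`, and the helper `drop_missing` are all made up for illustration, and `None` stands in for a missing value.

```python
# Hypothetical toy data: 'age' is None where the value is missing.
rows = [
    {"name": "Ana", "age": 34},
    {"name": "Ben", "age": None},
    {"name": "Cy", "age": 51},
    {"name": "Dee", "age": 28},
]


def drop_missing(rows, column):
    """Return only the rows where `column` has a value."""
    return [row for row in rows if row[column] is not None]


cleaned = drop_missing(rows, "age")

# Share of the data lost by dropping rows. In the question above this
# would be 200 / 10,000 = 2%; in this tiny example it is much larger.
missing_share = 1 - len(cleaned) / len(rows)
print(len(cleaned), missing_share)
```

Dropping rows is usually only reasonable when the share of lost data is small, which is why it is worth checking that share before deciding.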
Type
Sort Into Boxes
Practice Question
When should you use min-max scaling versus standardization? Sort these use cases.
Labels
- First Box Label: Use Min-Max Scaling
- Second Box Label: Use Standardization
First Box Items
- Input for neural networks
- You need values between 0 and 1
Second Box Items
- Data has significant outliers
- Algorithm is sensitive to feature range
- Feature follows a Gaussian distribution
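The two scaling methods can be sketched with the standard library alone. This is an illustration rather than a reference implementation: the sample values are invented, and the standardization here uses the population standard deviation, which is an assumption (some tools use the sample version instead).

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift and rescale values to have mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical feature with one large outlier (500).
values = [10, 20, 30, 40, 500]

scaled = min_max_scale(values)
z_scores = standardize(values)

print(scaled)
print(z_scores)
```

With min-max scaling, the single outlier defines the top of the range, so the four ordinary values end up squeezed close to 0. Standardized values are not forced into a fixed range, which is one reason standardization is often suggested when outliers are present.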
Type
Fill In The Blanks
Markdown With Blanks
Let's practice imputation. Fill in the blanks with the correct imputed values.
For a numeric 'Temperature' column with a mean of 25, a missing value would be imputed with [[blank:25]]. For a categorical 'Shirt Color' column where 'Blue' is the most frequent, a missing value would be imputed with [[blank:Blue]].
Suggested Answers
- 25
- Blue
- 0
- Red
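The same two imputation rules can be sketched in a few lines of Python: the mean for a numeric column and the most frequent value for a categorical one. The sample lists and the helper `impute` are hypothetical, and `None` marks a missing value.

```python
from statistics import mean, mode


def impute(values, fill):
    """Replace every missing (None) entry with the given fill value."""
    return [fill if v is None else v for v in values]


# Hypothetical columns with one missing entry each.
temperatures = [20, 30, None, 25]
shirt_colors = ["Blue", "Red", None, "Blue"]

# Compute the statistics from the observed (non-missing) values only.
observed_temps = [t for t in temperatures if t is not None]
observed_colors = [c for c in shirt_colors if c is not None]

temps_filled = impute(temperatures, mean(observed_temps))
colors_filled = impute(shirt_colors, mode(observed_colors))

print(temps_filled)
print(colors_filled)
```

Note that the mean and the mode are computed before filling, so the imputed value never influences the statistic it is based on.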
Type
Swipe Left or Right
Practice Question
Which scaling method is generally preferred for these scenarios? Swipe each one to the correct method.
Labels
- Left Label: Min-Max Scaling
- Right Label: Standardization
Left Label Items
- When you need values between 0 and 1
- For data that should stay within fixed boundaries
- When working with percentages or ratings
Right Label Items
- When your data has many outliers
- For most machine learning models
- When your data values are spread normally around an average
Type
Multiple Choice
Practice Question
You have a feature for 'Years of Experience' ranging from 0 to 40, and another for 'Annual Salary' with values as high as 500,000. Why is scaling important here?
A. To make the data easier to read. B. To prevent the 'Annual Salary' feature from dominating the model's calculations. C. To convert the data into integers. D. To remove any missing values.
Suggested Answers
- A
- B - Correct
- C
- D
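A small distance calculation shows the effect. This is a sketch under stated assumptions: the three people are invented, the model is assumed to compare rows by Euclidean distance, and the scaling simply divides by an assumed maximum for each feature (40 years, 500,000 salary) rather than applying a full min-max formula.

```python
import math

# Hypothetical people: (years_of_experience, annual_salary)
a = (2, 100_000)
b = (38, 101_000)  # very different experience, nearly the same salary
c = (3, 140_000)   # nearly the same experience, different salary


def scale(person, max_years=40, max_salary=500_000):
    """Divide each feature by its assumed maximum so both fall in 0..1."""
    years, salary = person
    return (years / max_years, salary / max_salary)


# Unscaled: salary differences are so large that experience barely matters.
raw_ab = math.dist(a, b)
raw_ac = math.dist(a, c)

# Scaled: both features contribute on a comparable footing.
scaled_ab = math.dist(scale(a), scale(b))
scaled_ac = math.dist(scale(a), scale(c))

print(raw_ab < raw_ac)        # a looks closer to b on raw values
print(scaled_ac < scaled_ab)  # a looks closer to c after scaling
```

On the raw numbers, a 36-year gap in experience counts for less than a 1,000 difference in salary. After scaling, the experience gap is no longer drowned out.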
