Section 1 - Instruction

We've covered two crucial data cleaning steps: handling missing values and scaling numeric features. Now it's time to practice deciding which technique to use in different situations.

Engagement Message

Are you ready?

Section 2 - Practice

Type

Multiple Choice

Practice Question

Your dataset has 10,000 rows. The 'Age' value is missing in 200 of them (2% of the data). What is the simplest common strategy for handling this?

A. Impute the missing values with the mean age.
B. Remove the 200 rows with missing ages.
C. Delete the entire 'Age' column.
D. Replace missing ages with zero.

Suggested Answers

  • A
  • B - Correct
  • C
  • D
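
To make option B concrete, here is a minimal sketch of row removal in plain Python. The records and the field names `name` and `age` are hypothetical, and `None` is assumed to mark a missing value; a real project would more likely use a dataframe library for this.

```python
# Hypothetical toy records; None marks a missing 'age'.
rows = [
    {"name": "Ana", "age": 34},
    {"name": "Ben", "age": None},
    {"name": "Cho", "age": 29},
    {"name": "Dev", "age": 41},
]

# Keep only the rows where 'age' is present.
complete_rows = [row for row in rows if row["age"] is not None]

# Fraction of rows lost by dropping. When this is small (2% in the
# question above), removal costs little data.
dropped_fraction = 1 - len(complete_rows) / len(rows)

print(len(complete_rows), dropped_fraction)
```

In this toy list a quarter of the rows are dropped, which would usually be too much; the question's 2% is the kind of case where removal is a reasonable first choice.
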
Section 3 - Practice

Type

Sort Into Boxes

Practice Question

When should you use min-max scaling versus standardization? Sort these use cases.

Labels

  • First Box Label: Use Min-Max Scaling
  • Second Box Label: Use Standardization

First Box Items

  • Input for neural networks
  • You need values between 0 and 1

Second Box Items

  • Data has significant outliers
  • Algorithm is sensitive to feature range
  • Feature follows a Gaussian distribution
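
The two methods being sorted can be sketched as follows. This is an illustrative implementation using only the standard library, with hypothetical function names and sample values; it assumes the input values are not all identical, since both formulas would then divide by zero.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift and rescale values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical feature with one large outlier.
values = [10, 20, 30, 40, 1000]

scaled = min_max_scale(values)
z_scores = standardize(values)

print(scaled)
print(z_scores)
```

With the outlier present, min-max scaling squeezes the four ordinary values into a narrow band near 0, because the outlier alone defines the top of the range. Standardization does not force values into fixed bounds.
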
Section 4 - Practice

Type

Fill In The Blanks

Markdown With Blanks

Let's practice imputation. Fill in the blanks with the correct imputed values.

For a numeric 'Temperature' column with a mean of 25, a missing value would be imputed with [[blank:25]]. For a categorical 'Shirt Color' column where 'Blue' is the most frequent, a missing value would be imputed with [[blank:Blue]].

Suggested Answers

  • 25
  • Blue
  • 0
  • Red
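
The same two imputations can be sketched in plain Python. The sample lists and the helper names `present` and `impute` are assumptions made for illustration, and `None` is assumed to mark a missing entry.

```python
from statistics import mean, mode

# Hypothetical columns; None marks a missing entry.
temperatures = [20, 30, None, 25, 25]
shirt_colors = ["Blue", "Red", None, "Blue", "Green"]

def present(values):
    """Return only the non-missing entries."""
    return [v for v in values if v is not None]

def impute(values, fill_value):
    """Replace every missing entry with fill_value."""
    return [fill_value if v is None else v for v in values]

# Numeric column: fill with the mean of the observed values.
temp_fill = mean(present(temperatures))
# Categorical column: fill with the most frequent observed value.
color_fill = mode(present(shirt_colors))

print(impute(temperatures, temp_fill))
print(impute(shirt_colors, color_fill))
```
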
Section 5 - Practice

Type

Swipe Left or Right

Practice Question

Which scaling method is generally preferred for these scenarios? Swipe each one to the correct method.

Labels

  • Left Label: Min-Max Scaling
  • Right Label: Standardization

Left Label Items

  • When you need values between 0 and 1
  • For data that should stay within fixed boundaries
  • When working with percentages or ratings

Right Label Items

  • When your data has many outliers
  • For most machine learning models
  • When your data values are spread normally around an average
Section 6 - Practice

Type

Multiple Choice

Practice Question

You have a feature for 'Years of Experience' ranging from 0 to 40, and another for 'Annual Salary' ranging from $30,000 to $500,000. Why is scaling important here?

A. To make the data easier to read.
B. To prevent the 'Annual Salary' feature from dominating the model's calculations.
C. To convert the data into integers.
D. To remove any missing values.

Suggested Answers

  • A
  • B - Correct
  • C
  • D
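
One way to see the effect described in option B is to compare distances before and after scaling. The sketch below assumes a distance-based model (Euclidean distance) and three hypothetical people; the feature ranges are taken from the question.

```python
import math

# Hypothetical people as (years_of_experience, annual_salary).
a = (2, 60_000)
b = (38, 61_000)   # very different experience, similar salary
c = (3, 90_000)    # similar experience, different salary

def min_max(value, lo, hi):
    """Map value from the range [lo, hi] onto [0, 1]."""
    return (value - lo) / (hi - lo)

def scale(person):
    """Min-max scale both features using the ranges from the question."""
    years, salary = person
    return (min_max(years, 0, 40), min_max(salary, 30_000, 500_000))

# Raw distances: the salary difference swamps the experience difference.
raw_ab = math.dist(a, b)
raw_ac = math.dist(a, c)

# Scaled distances: both features contribute on a comparable footing.
scaled_ab = math.dist(scale(a), scale(b))
scaled_ac = math.dist(scale(a), scale(c))

print(raw_ab < raw_ac)        # is a closer to b on raw values?
print(scaled_ab < scaled_ac)  # is a still closer to b after scaling?
```

On raw values, `a` looks closest to `b` even though they differ by 36 years of experience, because the salary axis has far larger numbers. After scaling, `a` is closest to `c`.
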