Welcome to scaling numeric features! Imagine comparing heights in meters versus millimeters - same data, but vastly different numbers.
Machine learning models can get confused when features have different scales, leading to poor predictions.
Engagement Message
Can you name another pair of features you've seen with very different numeric ranges?
Here's the problem: models often calculate distances between data points. A feature with large numbers (like income) will dominate features with small numbers (like age).
It's like shouting over whispers - the loud feature drowns out the quiet ones!
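To see the "shouting" effect in numbers, here is a minimal sketch that compares Euclidean distances between three made-up people. The names, incomes, and ages are hypothetical values chosen only for illustration.

```python
import numpy as np

# Hypothetical people described by (income in dollars, age in years).
person_a = np.array([50_000.0, 25.0])
person_b = np.array([51_000.0, 25.0])  # $1,000 more income, same age
person_c = np.array([50_000.0, 65.0])  # same income, 40 years older


def euclidean_distance(p, q):
    """Straight-line distance between two feature vectors."""
    return float(np.sqrt(np.sum((p - q) ** 2)))


dist_income_gap = euclidean_distance(person_a, person_b)
dist_age_gap = euclidean_distance(person_a, person_c)

print(dist_income_gap)  # 1000.0
print(dist_age_gap)     # 40.0
```

On the raw numbers, a modest $1,000 income gap looks 25 times larger than a 40-year age gap, simply because income is measured in bigger units.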
Engagement Message
Which difference will influence distance more: $1,000 in income or 1 year in age?
Scaling transforms all features to similar ranges so they contribute equally. There are two main approaches: min-max scaling and standardization.
Min-max scaling squeezes all values into a 0-1 range. Standardization centers values around zero with consistent spread.
Engagement Message
Do any of these approaches sound familiar?
Min-max scaling uses this formula: (value - minimum) ÷ (maximum - minimum)
Let's say we have car ages: 2, 5, 8 years. Minimum = 2, Maximum = 8.
For age 5: (5 - 2) ÷ (8 - 2) = 3 ÷ 6 = 0.5
Engagement Message
What would age 2 become after min-max scaling?
Let's complete our car age example:
- Age 2: (2 - 2) ÷ (8 - 2) = 0 ÷ 6 = 0
- Age 5: (5 - 2) ÷ (8 - 2) = 3 ÷ 6 = 0.5
- Age 8: (8 - 2) ÷ (8 - 2) = 6 ÷ 6 = 1
Notice how all values now fall between 0 and 1?
Engagement Message
Why does the oldest car scale to 1 and the youngest to 0?
You can scale an entire array at once using NumPy:
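A minimal sketch of what that might look like, reusing the car ages from the hand calculation. The variable names `car_ages` and `scaled_ages` are just illustrative choices.

```python
import numpy as np

car_ages = np.array([2, 5, 8])

# (value - minimum) / (maximum - minimum), applied to every element at once
scaled_ages = (car_ages - car_ages.min()) / (car_ages.max() - car_ages.min())

print(scaled_ages)  # [0.  0.5 1. ]
```

If every value in the array were identical, the denominator would be zero, so real code would need to handle that case separately.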
Engagement Message
Notice how all the values are now between 0 and 1, just like when we did it by hand?
Standardization uses: (value - mean) ÷ standard deviation
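Same car ages: 2, 5, 8. Mean = 5. The squared differences from the mean are 9, 0, and 9, which average to 6, so the (population) standard deviation is √6 ≈ 2.45.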
For age 5: (5 - 5) ÷ 2.45 = 0 ÷ 2.45 = 0
Engagement Message
This centers the average value at zero. What would age 8 become?
Let's complete our car age example using standardization:
- Age 2: (2 - 5) ÷ 2.45 = (-3) ÷ 2.45 ≈ -1.22
- Age 5: (5 - 5) ÷ 2.45 = 0 ÷ 2.45 = 0
- Age 8: (8 - 5) ÷ 2.45 = 3 ÷ 2.45 ≈ 1.22
Now the average car age maps to 0, cars younger than average are negative, and cars older than average are positive.
Engagement Message
Why do you think standardization can help when features have very different spreads or outliers?
You can also standardize an entire array at once using NumPy:
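A minimal sketch, again using the same car ages. It relies on NumPy's default `std()`, which computes the population standard deviation (`ddof=0`) and so matches the 2.45 used in the hand calculation. The variable names are illustrative.

```python
import numpy as np

car_ages = np.array([2, 5, 8])

mean_age = car_ages.mean()  # 5.0
std_age = car_ages.std()    # population standard deviation, about 2.45

# (value - mean) / standard deviation, applied to every element at once
standardized_ages = (car_ages - mean_age) / std_age

print(np.round(standardized_ages, 2))  # approximately -1.22, 0, 1.22
```

Passing `ddof=1` to `std()` would use the sample standard deviation instead, which gives slightly different results on small arrays like this one.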
Now the average value is 0 and the standard deviation is 1!
Engagement Message
Does this make sense?
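Use min-max scaling when you want values bounded between 0 and 1. It's a common choice for neural network inputs and for features whose expected range you know in advance.
Use standardization when your data is roughly normally distributed or contains outliers, since its output isn't pinned to the most extreme values (though extreme values still pull on the mean and standard deviation). It's a common default for algorithms such as k-nearest neighbors, support vector machines, and linear or logistic regression.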
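To get a feel for how the two methods react to an extreme value, here is a small sketch that applies both to the same array. The 40-year-old car is a hypothetical outlier added purely for illustration.

```python
import numpy as np

# Hypothetical car ages: three typical cars plus one 40-year-old outlier.
car_ages = np.array([2.0, 5.0, 8.0, 40.0])

min_max_scaled = (car_ages - car_ages.min()) / (car_ages.max() - car_ages.min())
standardized = (car_ages - car_ages.mean()) / car_ages.std()

print(np.round(min_max_scaled, 2))
print(np.round(standardized, 2))
```

In this sketch the three typical cars land between 0 and about 0.16 after min-max scaling, because the outlier defines the top of the range. Standardization has no fixed bounds, but the outlier still shifts the mean and inflates the standard deviation, so neither method removes its influence entirely.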
Engagement Message
Which would work better for test scores ranging 0-100?
Type
Multiple Choice
Practice Question
Let's practice min-max scaling! Scale these house prices to the 0-1 range: $100,000, $150,000, $200,000.
What's the scaled value for $150,000?
Formula: (value - min) ÷ (max - min)
A. 0.25
B. 0.5
C. 0.75
D. 1.0
Suggested Answers
- A
- B - Correct
- C
- D
