Lesson Introduction

Hey there! Today, we're going to learn about feature scaling. You might be wondering, what is feature scaling, and why should we care? Simply put, feature scaling is like making sure all the ingredients in your recipe are measured in the same unit. Imagine trying to mix pounds of flour and teaspoons of salt without converting one to the other — it wouldn't make sense, right?

Our goal is to understand why feature scaling is crucial in machine learning and to learn how to do it using Python and a library called scikit-learn.

What is Feature Scaling?

Feature scaling ensures that all your data features contribute equally when building a machine learning model. Without scaling, features with large values can dominate, leading to biased outcomes. For example, when predicting house prices, if one feature is measured in the thousands (like square footage) and another in single digits (like the number of rooms), the model may effectively ignore the smaller-scale feature simply because its numbers are smaller, not because it matters less.

There are two common types:

  1. Standardization: Transforms data to have a mean ($\mu$) of 0 and a standard deviation ($\sigma$) of 1.

    Formula: $z = \frac{x - \mu}{\sigma}$, where $x$ is the original feature value, $\mu$ is the mean of the feature, and $\sigma$ is the standard deviation of the feature.

  2. Normalization: Rescales data to range between 0 and 1.

    Formula: $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$, where $x$ is the original feature value, $\min(x)$ is the minimum value of the feature, and $\max(x)$ is the maximum value of the feature.

Today, we'll cover both standardization using `StandardScaler` and normalization using `MinMaxScaler` from scikit-learn.

Example of Feature Scaling with `StandardScaler`

Let's create a small sample dataset to see how feature scaling works.
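Here's a minimal sketch of such a dataset, assuming two hypothetical columns, `Feature1` and `Feature2`, whose values match the ranges described below:

```python
import pandas as pd

# A small dataset with two features on very different scales
data = pd.DataFrame({
    'Feature1': [1, 2, 3, 4],
    'Feature2': [10, 20, 30, 40]
})
print(data)
```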

Output:
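```
   Feature1  Feature2
0         1        10
1         2        20
2         3        30
3         4        40
```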

Before scaling, Feature1 ranges from 1 to 4, and Feature2 ranges from 10 to 40. Let's scale this dataset using `StandardScaler`.

Applying Feature Scaling with `StandardScaler`

We’ll use the `StandardScaler` to perform the scaling. Its `fit_transform` method first computes the mean and standard deviation of each feature, then applies the standardization to the data.
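A sketch of that step, continuing with the `data` DataFrame from above:

```python
from sklearn.preprocessing import StandardScaler

# fit computes each column's mean and standard deviation;
# transform then standardizes the values; fit_transform does both in one step
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)  # returns a NumPy array
```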

Continuing from where we left off, we need to convert this scaled data back to a DataFrame for better readability.
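One way to do that, reusing the column names from the original DataFrame:

```python
# Wrap the NumPy array back into a labeled DataFrame
scaled_df = pd.DataFrame(scaled_data, columns=data.columns)
print(scaled_df)
```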

Output:
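```
   Feature1  Feature2
0 -1.341641 -1.341641
1 -0.447214 -0.447214
2  0.447214  0.447214
3  1.341641  1.341641
```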

Scaling Double-Check

Let's check if the data is scaled correctly. We will calculate mean and standard deviation for both features:
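A quick sketch of that check. Note that `StandardScaler` uses the population standard deviation, so we pass `ddof=0` to match it; you should see means of 0 and standard deviations of 1 (up to tiny floating-point rounding):

```python
print("Mean:")
print(scaled_df.mean())
print("Standard deviation:")
print(scaled_df.std(ddof=0))  # ddof=0 matches StandardScaler's population std
```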

Here is the output:
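```
Mean:
Feature1    0.0
Feature2    0.0
dtype: float64
Standard deviation:
Feature1    1.0
Feature2    1.0
dtype: float64
```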

The mean of each feature in the scaled DataFrame is 0 and the standard deviation is 1, putting both features on a common scale so the machine learning model can treat them equally.

Example of Feature Scaling with `MinMaxScaler`

Let's also apply feature scaling using the `MinMaxScaler` to see how normalization works. The good news is that the API is identical to `StandardScaler`'s: you simply swap in the other scaler class and everything else stays the same, as sketched below.
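Continuing with the same `data` DataFrame:

```python
from sklearn.preprocessing import MinMaxScaler

# Same fit/transform pattern, different scaler
min_max_scaler = MinMaxScaler()
normalized_data = min_max_scaler.fit_transform(data)
```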

Convert the normalized data back to a DataFrame for better readability and verify the range.
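As before, a minimal sketch:

```python
# Wrap the result back into a labeled DataFrame
normalized_df = pd.DataFrame(normalized_data, columns=data.columns)
print(normalized_df)
```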

Output:
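```
   Feature1  Feature2
0  0.000000  0.000000
1  0.333333  0.333333
2  0.666667  0.666667
3  1.000000  1.000000
```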

Scaling Double-Check

Let's validate the results:
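One way to check each column's minimum and maximum:

```python
print("Min:")
print(normalized_df.min())
print("Max:")
print(normalized_df.max())
```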

Output:
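```
Min:
Feature1    0.0
Feature2    0.0
dtype: float64
Max:
Feature1    1.0
Feature2    1.0
dtype: float64
```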

The minimum of each feature in the scaled DataFrame is 0, and the maximum is 1, ensuring that all data points fall within this range.

Lesson Summary

Great job! You learned what feature scaling is and why it is essential in machine learning. By scaling your features, you ensure that all of them contribute equally to the model. You also got hands-on with Python, using `StandardScaler` and `MinMaxScaler` from scikit-learn to both standardize and normalize a sample dataset.

Now it's time to move on to some practice exercises. You'll get the chance to apply what you learned and become even more confident in your ability to scale features effectively. Let's get started!
