Hey there! Today, we're going to learn about feature scaling. You might be wondering, what is feature scaling, and why should we care? Simply put, feature scaling is like making sure all the ingredients in your recipe are measured in the same unit. Imagine trying to mix pounds of flour and teaspoons of salt without converting one to the other — it wouldn't make sense, right?
Our goal is to understand why feature scaling is crucial in machine learning and to learn how to do it using Python and a library called scikit-learn.
Feature scaling ensures that all your features contribute equally when building a machine learning model. Without scaling, features with large values can dominate, leading to biased outcomes. For example, if you're predicting house prices and one feature is in the thousands (like square footage) while another is in single digits (like the number of rooms), the model might overlook the smaller values just because they seem less relevant.
There are two common types:
- Standardization: Transforms data to have a mean ($\mu$) of 0 and a standard deviation ($\sigma$) of 1.
  Formula: $z = \frac{x - \mu}{\sigma}$, where $x$ is the original feature value, $\mu$ is the mean of the feature, and $\sigma$ is the standard deviation of the feature.
- Normalization: Rescales data to range between 0 and 1.
  Formula: $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$, where $x$ is the original feature value, $x_{\min}$ is the minimum value of the feature, and $x_{\max}$ is the maximum value of the feature.
Today, we'll focus on both standardization using `StandardScaler` and normalization using `MinMaxScaler` from scikit-learn.
Let's create a small sample dataset to see how feature scaling works.
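Here's a minimal sketch of what that could look like, assuming a tiny DataFrame where `Feature1` runs from 1 to 4 and `Feature2` runs from 10 to 40 (the exact values are just for illustration):

```python
import pandas as pd

# A tiny dataset with two features on very different scales
df = pd.DataFrame({
    'Feature1': [1, 2, 3, 4],
    'Feature2': [10, 20, 30, 40]
})
print(df)
```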
Output:
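```
   Feature1  Feature2
0         1        10
1         2        20
2         3        30
3         4        40
```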
Before scaling, `Feature1` ranges from 1 to 4, and `Feature2` ranges from 10 to 40. Let's scale this dataset using `StandardScaler`.
We'll use the `StandardScaler` to perform the scaling. The `fit_transform` method will calculate the mean and standard deviation for scaling, and then apply the scaling to the data.
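Here's a sketch of that step, picking up from the `df` we created above:

```python
from sklearn.preprocessing import StandardScaler

# fit_transform learns each column's mean and standard deviation,
# then applies z = (x - mean) / std to every value
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)
```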
Continuing from where we left off, we need to convert this scaled data back to a DataFrame for better readability.
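Since `fit_transform` returns a plain NumPy array, one way to do this is:

```python
# Wrap the scaled array back into a DataFrame, reusing the original column names
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)
print(scaled_df)
```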
Output:
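```
   Feature1  Feature2
0 -1.341641 -1.341641
1 -0.447214 -0.447214
2  0.447214  0.447214
3  1.341641  1.341641
```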
Let's check if the data is scaled correctly. We will calculate mean and standard deviation for both features:
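One way to check (note that `StandardScaler` divides by the population standard deviation, so we pass `ddof=0`; pandas defaults to the sample version):

```python
print("Mean:")
print(scaled_df.mean())
print("Standard deviation:")
# ddof=0 matches the population standard deviation used by StandardScaler
print(scaled_df.std(ddof=0))
```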
Here is the output:
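```
Mean:
Feature1    0.0
Feature2    0.0
dtype: float64
Standard deviation:
Feature1    1.0
Feature2    1.0
dtype: float64
```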
The mean of each feature in the scaled DataFrame is 0, and the standard deviation is 1. This makes it easier for the machine learning model to treat all features equally.
Let's also apply feature scaling using the `MinMaxScaler` to see how normalization works. The good news is that using the `MinMaxScaler` is exactly the same as using the `StandardScaler`: you literally just change the scaler's name and everything works!
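Here's the same workflow with the scaler swapped out, again a sketch that reuses the `df` from earlier:

```python
from sklearn.preprocessing import MinMaxScaler

# Same pattern as before: only the scaler class changes.
# fit_transform learns each column's min and max, then applies
# x' = (x - min) / (max - min) to every value
min_max_scaler = MinMaxScaler()
normalized_data = min_max_scaler.fit_transform(df)
```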
Convert the normalized data back to a DataFrame for better readability and verify the range.
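Just like before:

```python
# Wrap the normalized array back into a DataFrame with the original column names
normalized_df = pd.DataFrame(normalized_data, columns=df.columns)
print(normalized_df)
```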
Output:
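```
   Feature1  Feature2
0  0.000000  0.000000
1  0.333333  0.333333
2  0.666667  0.666667
3  1.000000  1.000000
```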
Let's validate the results:
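Checking the minimum and maximum of each column:

```python
print("Minimum:")
print(normalized_df.min())
print("Maximum:")
print(normalized_df.max())
```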
Output:
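```
Minimum:
Feature1    0.0
Feature2    0.0
dtype: float64
Maximum:
Feature1    1.0
Feature2    1.0
dtype: float64
```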
The minimum of each feature in the scaled DataFrame is 0, and the maximum is 1, ensuring that all data points fall within this range.
Great job! You learned what feature scaling is and why it is essential in machine learning. By scaling your features, you ensure that all of them contribute equally to the model. You also got hands-on with Python, `StandardScaler`, and `MinMaxScaler` from scikit-learn to both standardize and normalize a sample dataset.
Now it's time to move on to some practice exercises. You'll get the chance to apply what you learned and become even more confident in your ability to scale features effectively. Let's get started!
