Ready for another deep dive? Today, we'll explore Data Transformation and Scaling Techniques, an essential part of the data cleaning and preprocessing process for machine learning. We will learn how to transform numerical data to different ranges using various scaling techniques, such as Standard Scaling, Min-Max Scaling, and Robust Scaling.
Data scaling is crucial because many machine learning algorithms, especially those based on distances or gradient descent, perform more effectively when numerical features are on the same scale. Without scaling, variables with larger ranges may dominate others in the model, which can reduce its accuracy.
For example, imagine having two features, age and income, in your Titanic dataset. Age varies between 0 and 100, while income may range from 0 into the thousands. A machine learning model could be biased towards income because of its higher magnitude, leading to poor model performance.
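To make the age-versus-income example concrete, here is a minimal sketch of how an unscaled feature can dominate a distance calculation, which is what distance-based models such as k-nearest neighbors rely on. The three passengers and their values are made up for illustration and are not taken from the actual dataset.

```python
import math

# Hypothetical passengers as (age in years, income in currency units).
# These values are invented for illustration only.
a = (25, 50_000)
b = (65, 50_500)  # very different age, similar income
c = (26, 52_000)  # almost the same age, somewhat different income


def euclidean(p, q):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


dist_ab = euclidean(a, b)
dist_ac = euclidean(a, c)

# Without scaling, the income difference swamps the age difference.
print(f"a-b: {dist_ab:.1f}, a-c: {dist_ac:.1f}")
print("a looks closer to b than to c:", dist_ab < dist_ac)
```

Even though passenger b is 40 years older than passenger a, the raw distance treats b as the nearer neighbor, because a difference of a few hundred in income outweighs any plausible difference in age.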
Ready to dive in? Let's go!
Before we move into the hands-on part, let's briefly discuss three popular techniques to standardize numerical data.
- Standard Scaler: It scales data to have zero mean and unit variance. It does not change the shape of the distribution, and it works best when the data is approximately normally distributed. In other words, when the values of a particular feature roughly follow a bell curve, a Standard Scaler is a good option to standardize the feature.
- Min-Max Scaler: Also known as normalization, this technique rescales data to a fixed range, usually 0 to 1 (a different range, such as -1 to 1, can be chosen instead). It's commonly used for algorithms that don't assume any distribution of the data. This means if your data doesn't follow a specific shape or form, you might consider using Min-Max Scaler. Because it depends on the minimum and maximum values, it is sensitive to outliers.
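As a rough illustration of the arithmetic behind the Standard Scaler and Min-Max Scaler, the sketch below implements both with only the Python standard library. The function names `standard_scale` and `min_max_scale` and the sample `ages` list are hypothetical, chosen for this example; in practice, libraries such as scikit-learn provide `StandardScaler` and `MinMaxScaler` classes for the same purpose.

```python
import statistics

# Made-up sample of ages, for illustration only.
ages = [22.0, 38.0, 26.0, 35.0, 54.0]


def standard_scale(values):
    """Shift to zero mean and rescale to unit variance: (x - mean) / std.

    Uses the population standard deviation. Assumes the values are not
    all identical, otherwise the division would fail.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Linearly map values so that min -> range start and max -> range end.

    Assumes max(values) > min(values).
    """
    lo, hi = min(values), max(values)
    new_lo, new_hi = feature_range
    return [new_lo + (v - lo) / (hi - lo) * (new_hi - new_lo) for v in values]


z_scores = standard_scale(ages)
scaled = min_max_scale(ages)
scaled_sym = min_max_scale(ages, feature_range=(-1.0, 1.0))

print("standard:", [round(z, 3) for z in z_scores])
print("min-max [0, 1]:", [round(s, 3) for s in scaled])
print("min-max [-1, 1]:", [round(s, 3) for s in scaled_sym])
```

Note that the target range of the min-max transformation is a parameter you choose; it does not change automatically based on whether the data contains negative values.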
