Topic Overview

Ready for another deep dive? Today, we'll explore Data Transformation and Scaling Techniques, an essential part of data cleaning and preprocessing for machine learning. We will learn how to transform numerical data to different ranges using various scaling techniques, such as Standard Scaling, Min-Max Scaling, and Robust Scaling.

Data scaling is crucial because many machine learning algorithms perform more effectively when numerical features are on the same scale. Without scaling, features with larger ranges can dominate the others, reducing the model's accuracy.

For example, imagine having two features — age and income — in your Titanic dataset. Age varies between 0 and 100, while income may range from 0 to thousands. A machine learning model could be biased towards income because of its higher magnitude, leading to poor model performance.

Ready to dive in? Let's go!

Introduction to Data Scaling

Before we move into the hands-on part, let's briefly discuss three popular techniques for scaling numerical data.

  • Standard Scaler: Scales data to have zero mean and unit variance. It's best used when the data is approximately normally distributed. In other words, when the values of a particular feature follow a bell curve, a Standard Scaler is a good option for standardizing that feature.

  • Min-Max Scaler: Also known as normalization, this technique scales data to a fixed range, typically 0 to 1 (other ranges, such as -1 to 1, can be specified). It's commonly used with algorithms that don't assume any particular distribution of the data. This means if your data doesn't follow a specific shape or form, you might consider using the Min-Max Scaler.

  • Robust Scaler: As its name suggests, this scaler is robust to outliers. It uses the Interquartile Range (IQR) to scale data, and it's suitable when the dataset contains outliers. Outliers are data points that significantly deviate from other observations. They can be problematic because they can affect the results of a data analysis.

There's no "one size fits all" scaler. You'll need to choose the appropriate scaler based on your data's characteristics and your machine-learning algorithm's requirements.

Standard Scaling

We'll start with the Standard Scaler. It scales data based on its mean (μ) and standard deviation (σ), using the z-score formula: z = (x - μ) / σ.

Let's try it on the age column of the Titanic dataset:
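A minimal sketch with scikit-learn's StandardScaler; here a small hand-picked sample stands in for the full Titanic age column:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small sample standing in for the Titanic "age" column
df = pd.DataFrame({"age": [22.0, 38.0, 26.0, 35.0, 54.0]})

# fit_transform expects 2D input, hence the double brackets
scaler = StandardScaler()
df["age_scaled"] = scaler.fit_transform(df[["age"]])

print(df)
```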

Note how the transformed age values are not easily interpretable: they've been converted into their respective z-scores. The important thing to understand is that the transformed data is standardized and can be readily fed into a machine learning model.

Standard Scaling Visualization

Min-Max Scaling

Next, we'll explore Min-Max Scaling, which scales your data to a specified range. The formula used here is: x_new = (x - x_min) / (x_max - x_min). This formula essentially resizes your data to fit within the range of 0 to 1.

Let's apply Min-Max Scaler on the fare column:
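A sketch along the same lines, using scikit-learn's MinMaxScaler; again, a small made-up sample stands in for the Titanic fare column:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Small sample standing in for the Titanic "fare" column
df = pd.DataFrame({"fare": [7.25, 71.28, 7.93, 53.10, 8.05]})

scaler = MinMaxScaler()
df["fare_scaled"] = scaler.fit_transform(df[["fare"]])

print(df)
```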

All fare values are now within the range of 0 to 1, with the smallest fare being 0 and the largest being 1. Intermediate fare values are spread out proportionally between 0 and 1.

Robust Scaling

Last but not least, we have Robust Scaling, which is useful when dealing with outliers. It centers data on the median and scales it by the interquartile range (IQR), so extreme values have little influence on the scaling parameters.

Let's apply it to the fare column:
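A sketch with scikit-learn's RobustScaler, using a made-up fare sample that includes one extreme outlier to show the robustness in action:

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Sample fares including one extreme outlier (512.33)
df = pd.DataFrame({"fare": [7.25, 8.05, 7.93, 53.10, 512.33]})

# RobustScaler subtracts the median and divides by the IQR
scaler = RobustScaler()
df["fare_scaled"] = scaler.fit_transform(df[["fare"]])

print(df)
```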

The fare values now reflect how many IQRs each value lies from the median. This scaling method is resilient to outliers: they no longer distort how the bulk of the data is scaled, although they remain recognizably extreme values after the transformation.

Wrapping up the Lesson

You should now understand why data scaling is essential in machine learning and how to implement three common data scaling techniques in Python: Standard Scaling, Min-Max Scaling, and Robust Scaling.

Remember, the choice of scaling technique depends on the nature of your data and the specific requirements of your machine learning algorithm. Each scaler has its strengths: Standard Scaler works best with normally distributed data, Min-Max Scaler works with data of any distribution, and Robust Scaler handles datasets that contain outliers.

Ready for Practice?

Great work on assimilating the essentials of data transformation and scaling! Let's move to the next part—practice! The exercises are designed to deepen your understanding of data scaling techniques. You'll code, implement the learning, and apply these techniques to various data distributions. So roll up your sleeves and get ready for some coding action! You can expect to gain much deeper insights and develop your data scaling expertise during the practice session, so don't miss it!
