Lesson Introduction

Welcome to our enlightening session on Normalization and Standardization of Passenger Data. These two techniques play a crucial role in preparing your data for machine learning algorithms. During this lesson, our focus will particularly be on the historical Titanic dataset, where we will practice cleaning, normalizing, and standardizing certain features, such as passenger ages and fares. By the end of this lesson, you should have a solid understanding of normalization and standardization and be able to apply these techniques in any data preprocessing assignment using Python and Pandas.

Understanding Normalization

Normalization is a critical preprocessing step that scales the numerical data in a dataset to a fixed range, usually from 0 to 1. It does not change the shape of a feature's distribution; it simply brings all values onto a similar scale so that no feature dominates just because it is measured in larger units. This is why normalization plays a significant role in algorithms that use a distance measure.

To better illustrate how normalization works, let's apply it to the 'age' column of our Titanic dataset. Normalization will transform the age values so that they fall within a range from 0 to 1:
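Below is a minimal sketch of that step. It assumes the dataset is loaded through seaborn's built-in copy of the Titanic data (which exposes lowercase 'age' and 'fare' columns); the original lesson may load it from a CSV instead, and the 'age_normalized' column name is purely illustrative.

```python
import seaborn as sns

# Load the Titanic dataset; seaborn ships a copy with 'age' and 'fare' columns
titanic = sns.load_dataset('titanic')

# Drop rows with missing ages so the min/max arithmetic is well defined
titanic = titanic.dropna(subset=['age'])

# Min-max normalization: subtract the minimum, then divide by the range (max - min)
titanic['age_normalized'] = (
    (titanic['age'] - titanic['age'].min())
    / (titanic['age'].max() - titanic['age'].min())
)

print(titanic[['age', 'age_normalized']].head())
```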

In this code snippet, we first subtract the minimum age from each age value and then divide by the range of ages (the maximum minus the minimum). The ages are now scaled to the range [0, 1], which some machine-learning models find easier to work with.

Understanding Standardization

Unlike normalization, standardization does not scale the data to a limited range. Instead, standardization subtracts the feature's mean from each value and then divides by the feature's standard deviation, transforming the values so that they have a mean of 0 and a standard deviation of 1. This method is often used when you want to compare data that was measured on different scales.

Let's apply standardization to the 'fare' column of the Titanic dataset. This column represents how much each passenger paid for their ticket:
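Here is a comparable sketch, again assuming the seaborn copy of the dataset; 'fare_standardized' is an illustrative column name.

```python
import seaborn as sns

titanic = sns.load_dataset('titanic')

# Standardization: subtract the mean, then divide by the standard deviation
titanic['fare_standardized'] = (
    (titanic['fare'] - titanic['fare'].mean()) / titanic['fare'].std()
)

# Sanity check: the mean should be (approximately) 0 and the standard deviation 1
print(titanic[['fare', 'fare_standardized']].head())
print(titanic['fare_standardized'].mean(), titanic['fare_standardized'].std())
```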

Now the 'fare' column has been rescaled so that the fares have a mean of 0 and a standard deviation of 1. Notice that the values are not confined to the [0, 1] range the way normalized data is.

Implementing Normalization with Pandas

Armed with an understanding of normalization, let's dig a little deeper. Rather than writing the formula by hand, we can use the MinMaxScaler class from scikit-learn's sklearn.preprocessing module, which works neatly with pandas DataFrames:
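One way to write that step is sketched below, under the same assumption that the data comes from seaborn's copy of the dataset.

```python
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler

# Load the data and drop rows with missing ages
titanic = sns.load_dataset('titanic').dropna(subset=['age'])

# MinMaxScaler defaults to feature_range=(0, 1)
scaler = MinMaxScaler()

# fit_transform expects 2-D input, so we pass a one-column DataFrame
# and flatten the resulting (n, 1) array back into a regular column
titanic['age_normalized'] = scaler.fit_transform(titanic[['age']]).ravel()

print(titanic[['age', 'age_normalized']].head())
```

Note that fit_transform both learns the column's minimum and maximum and applies the scaling in a single call; on new data you would reuse the fitted scaler and call transform only.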

MinMaxScaler scales and translates each feature individually so that it falls within the given range on the data it was fit on, in our case between 0 and 1.

Implementing Standardization with Pandas

To standardize our data, we'll make use of the StandardScaler class from the sklearn.preprocessing module, which standardizes features by subtracting the mean and scaling to unit variance:
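A sketch under the same assumptions (seaborn copy of the dataset, illustrative column names) follows.

```python
import seaborn as sns
from sklearn.preprocessing import StandardScaler

titanic = sns.load_dataset('titanic')

scaler = StandardScaler()

# As with MinMaxScaler, pass a one-column DataFrame and flatten the result
titanic['fare_standardized'] = scaler.fit_transform(titanic[['fare']]).ravel()

# The standardized fares now have a mean of (approximately) 0 and unit variance
print(titanic[['fare', 'fare_standardized']].head())
print(titanic['fare_standardized'].mean(), titanic['fare_standardized'].std())
```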

StandardScaler applies this transformation to each feature independently. Notice how our standardized fares now have a mean of 0 and a standard deviation of 1.

Comparing Normalization and Standardization

Choose normalization when your data needs to be bounded within a specific range (0 to 1, for example) and is not heavily influenced by outliers. This is particularly useful for algorithms that are sensitive to the scale of the data, such as neural networks and k-nearest neighbors. On the other hand, standardization is more effective when your data has a Gaussian distribution, and you are dealing with algorithms that assume this, such as linear regression, logistic regression, and linear discriminant analysis.

Now that you have experienced both normalization and standardization, it's safe to say that each technique is useful, but under different circumstances. Both exist to handle features with very different ranges; which one you select depends on the algorithm you deploy and the distribution you want the transformed data to have. Remember that not all algorithms benefit from normalization or standardization.

Lesson Summary and Practice

Give yourself a pat on the back as you've made it through the session on data preprocessing techniques! We explored the concepts of normalization and standardization, their practical applications, and how to implement these techniques using Python and Pandas. It's key to remember that these techniques are vital tools in enhancing the performance of your machine-learning models.

Next up, we have some hands-on practice sessions to get your hands dirty with real-world datasets. Remember, the best way to absorb knowledge is by applying it practically. Looking forward to seeing you in the next session! Happy learning!
