Data Preprocessing: Mastering Normalization and Standardization Techniques

Lesson Introduction

Welcome to our session on Normalization and Standardization of Passenger Data. These two techniques play a crucial role in preparing your data for machine learning algorithms. In this lesson, we will focus on the historical Titanic dataset, where we will practice cleaning, normalizing, and standardizing features such as passenger ages and fares. By the end of this lesson, you should have a solid understanding of normalization and standardization and be able to apply both techniques to data preprocessing tasks using Python and Pandas.

Understanding Normalization

Normalization is a preprocessing step that rescales the numerical data in a dataset to a fixed range, usually from 0 to 1. Bringing all features onto a comparable scale keeps features with large numeric ranges from dominating features with small ranges. Because the rescaling is linear, it does not change the shape of a feature's distribution, so any skewness in the data is preserved. Normalization therefore matters most for algorithms that rely on a distance measure, such as k-nearest neighbors.
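To see why scale matters for distance-based algorithms, here is a minimal sketch. The three passengers and their age and fare values are made up for illustration and are not taken from the Titanic dataset; `euclidean` is a hypothetical helper defined only for this example.

```python
import pandas as pd

# Made-up values for three hypothetical passengers (not real Titanic rows).
df = pd.DataFrame(
    {"age": [25.0, 60.0, 26.0], "fare": [10.0, 12.0, 90.0]},
    index=["a", "b", "c"],
)

# Min-max normalization applied column by column: (x - min) / (max - min)
scaled = (df - df.min()) / (df.max() - df.min())

def euclidean(frame, i, j):
    """Euclidean distance between rows i and j of a DataFrame."""
    return ((frame.loc[i] - frame.loc[j]) ** 2).sum() ** 0.5

# Raw distances: 'fare' has the larger spread, so it dominates.
raw_ab = euclidean(df, "a", "b")   # large age gap, small fare gap
raw_ac = euclidean(df, "a", "c")   # small age gap, large fare gap

# Scaled distances: both features now contribute on the same scale.
scaled_ab = euclidean(scaled, "a", "b")
scaled_ac = euclidean(scaled, "a", "c")

print(f"raw:    a-b = {raw_ab:.2f}, a-c = {raw_ac:.2f}")
print(f"scaled: a-b = {scaled_ab:.2f}, a-c = {scaled_ac:.2f}")
```

On the raw values, passenger a appears much closer to b than to c simply because fares span a wider numeric range than ages. After normalization, a large age gap and a large fare gap count about equally.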

To illustrate how normalization works, let's apply it to the 'age' column of our Titanic dataset, transforming the age values so that they fall within the range from 0 to 1.

Min-max normalization first subtracts the minimum age from each age value and then divides the result by the range of ages (maximum minus minimum). The youngest passenger maps to 0, the oldest maps to 1, and every other age falls in between. Columns on a common scale are easier for some machine-learning models to process.
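The computation described above can be sketched as follows. This is a minimal example under stated assumptions: the small DataFrame is a hypothetical stand-in for the Titanic data, with made-up ages and one missing value, and the column name `age_normalized` is chosen here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data (made-up ages).
# None marks a missing age and becomes NaN in the column.
df = pd.DataFrame({"age": [22.0, 38.0, 26.0, None, 35.0, 2.0, 80.0]})

# pandas skips NaN by default when computing min and max.
age_min = df["age"].min()
age_max = df["age"].max()

# Min-max normalization: subtract the minimum, divide by the range.
df["age_normalized"] = (df["age"] - age_min) / (age_max - age_min)

print(df)
```

Missing ages stay missing after this step, so they still need to be handled (for example, filled or dropped) as part of cleaning.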
