Introduction

Welcome to the world of Data Cleaning and Transformation. Real-life data isn't always tidy; it contains inconsistencies, missing data points, outliers, and even incorrect values! To extract meaningful insights or build reliable machine learning models, we first clean and transform the data.

In this session, we handle inconsistencies and outliers and apply data transformations to make a dataset ready for analysis. Now, let's start this exploratory journey!

Why is Data Cleaning and Transformation Necessary?

Why clean and transform data? Simple: unclean or inconsistent data can skew analysis or predictions. Weather data with missing temperatures, for instance, can lead to misleading climate predictions. The real world is full of such examples of analysis gone awry due to unclean data.

Recognizing Inconsistencies in Data

Let's delve into spotting inconsistencies. For instance, XL, X-L, and xl all represent the same clothing size but are recorded differently. Python's pandas library comes in handy here: listing the distinct values of a column makes such variants easy to spot.
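
As a minimal sketch of this idea, the snippet below builds a small DataFrame and lists the distinct spellings in one column. The column name size and the sample values are assumptions made up for illustration; they do not come from a real dataset.

```python
import pandas as pd

# Hypothetical sample data: the same sizes recorded in several spellings
df = pd.DataFrame({'size': ['XL', 'X-L', 'xl', 'M', 'L', 'XL', 'm']})

# unique() lists every distinct spelling present in the column
distinct_sizes = df['size'].unique()
print(distinct_sizes)

# value_counts() shows how often each spelling occurs
size_counts = df['size'].value_counts()
print(size_counts)
```

Seeing XL, X-L, and xl listed as separate values is the signal that the column needs standardizing.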

Dealing with Inconsistencies in Data

To resolve inconsistencies, replace each variant with a single standard value.
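
One way to do this in pandas is the replace method with a dictionary that maps each variant to the chosen standard label. This is a sketch under assumed data: the column name size, the sample values, and the mapping are hypothetical.

```python
import pandas as pd

# Hypothetical sample data with inconsistent size labels
df = pd.DataFrame({'size': ['XL', 'X-L', 'xl', 'M', 'L', 'XL', 'm']})

# Map each known variant to the standard label; other values stay unchanged
df['size'] = df['size'].replace({'X-L': 'XL', 'xl': 'XL', 'm': 'M'})

# Only the standard labels remain
print(df['size'].unique())
```

An explicit mapping keeps the change auditable: only the variants you list are rewritten, so an unexpected spelling remains visible instead of being silently merged.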

Detecting and Filtering Outliers

Scanning for outliers, or exceptional values, is the next step. Outliers can distort the analytical outcome. One common method to detect outliers is using the Interquartile Range (IQR).

As a short reminder, the IQR method treats any value below $Q_1 - 1.5 \cdot IQR$ or above $Q_3 + 1.5 \cdot IQR$ as an outlier, where:

  • $Q_1$ – The first quartile
  • $Q_3$ – The third quartile
  • $IQR$ – The Interquartile Range, $Q_3 - Q_1$

Let's use the IQR method to identify and filter out outliers in a dataset.
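
The sketch below applies the rule with pandas. The column name value and the numbers are hypothetical, chosen so that a single value, 9, sits far from the rest.

```python
import pandas as pd

# Hypothetical measurements; 9 sits far from the other values
df = pd.DataFrame({'value': [2, 3, 3, 4, 4, 5, 9]})

# Quartiles and the interquartile range
q1 = df['value'].quantile(0.25)
q3 = df['value'].quantile(0.75)
iqr = q3 - q1

# Bounds outside which a value counts as an outlier
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Keep only the rows that fall inside the bounds
filtered = df[(df['value'] >= lower_bound) & (df['value'] <= upper_bound)]
print(filtered)
```

With this data the first quartile is 3.0 and the third quartile is 4.5, so the bounds are 0.75 and 6.75.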

The value 9 is considered an outlier and is excluded from the filtered dataset.

Data Transformation

Data transformation is needed when data must be adjusted to suit a specific analysis or model-building exercise, for example to bring skewed data closer to a normal distribution or to put variables measured on different scales onto a common one. The scikit-learn library comes in handy for this. Although this course is not about scikit-learn, it is widely used with pandas DataFrames, so it is worth a brief look.
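
Here is a minimal sketch of standardizing one column with scikit-learn's StandardScaler. The DataFrame contents are hypothetical; the column names Feature1, Feature2, and Feature2_scaled are chosen to match the explanation in this section.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data with two features on different scales
df = pd.DataFrame({
    'Feature1': [0.5, 0.6, 0.9],
    'Feature2': [10, 12, 15],
})

scaler = StandardScaler()

# Double brackets pass a 2D DataFrame; by default the result is a
# 2D array with one column, so take that column for the new Series
df['Feature2_scaled'] = scaler.fit_transform(df[['Feature2']])[:, 0]
print(df)
```

Note that StandardScaler divides by the population standard deviation, so checking the result with pandas requires std(ddof=0) rather than the default sample standard deviation.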

StandardScaler is simple to use: create a StandardScaler object, then call its fit_transform method on the data. We select the column with double square brackets, [['Feature2']], so that it is passed as a two-dimensional DataFrame (scikit-learn transformers expect 2D input) rather than a one-dimensional Series. In the output, we see a new column, Feature2_scaled, which holds the values of Feature2 scaled so that their mean is 0 and their standard deviation is 1.

Lesson Summary and Practice

Kudos! You've completed the Data Cleaning and Transformation lesson. You learned about handling data inconsistencies and outliers and performing data transformations using Python's pandas and scikit-learn libraries.

Practice exercises are next, focusing on consolidating concepts through application. They will reinforce your understanding and hone your skills. So, ready, set, explore!
