Detecting and Addressing Outliers

Welcome back to the Foundations of Feature Engineering course! So far, you have become acquainted with handling missing data and preparing your dataset for analysis. Now, we shift our focus to another crucial aspect of data preprocessing: detecting and addressing outliers.

An outlier is an observation that deviates significantly from the rest of the data. Outliers can skew your analysis, affecting measures like the mean and standard deviation, which can lead to incorrect insights. Identifying and handling outliers is therefore a fundamental part of feature engineering that enhances data integrity. In this lesson, you will learn how to detect outliers with the Interquartile Range (IQR) method and explore strategies for handling them.

Impact of Outliers on Data Integrity

Outliers can dramatically alter the interpretation of data. Consider the following small dataset illustrating salaries:

ID    Salary
1     50,000
2     52,000
3     49,000
4     51,000
5     50,500
6     1,000,000

Here, the outlier salary of 1,000,000 dramatically skews the mean: without it, the average salary is 50,500, a realistic central tendency for this group; with it, the mean jumps to 208,750, roughly four times the typical salary. By effectively identifying and managing outliers, you can achieve a more accurate representation of the data. Outliers can also distort predictive models, reducing their accuracy and performance, so addressing them ensures robust analytical results.
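You can verify this shift with a few lines of plain Python using the salaries from the table above:

```python
# Salaries from the table above, with and without the extreme value
salaries = [50_000, 52_000, 49_000, 51_000, 50_500]
outlier = 1_000_000

mean_without = sum(salaries) / len(salaries)                 # 50,500
mean_with = (sum(salaries) + outlier) / (len(salaries) + 1)  # 208,750

print(f"Mean without outlier: {mean_without:,.0f}")
print(f"Mean with outlier:    {mean_with:,.0f}")
```

A single extreme value quadruples the mean, even though five of the six salaries sit within a 3,000 band.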

Checking for Outliers Using the Describe Method

Before delving into specific methods for detecting outliers, it can be helpful to start with a broad view of the data using the describe method. This statistical summary provides key metrics such as the mean, standard deviation, minimum, and maximum values of each numerical column, which can quickly highlight potential outliers.

Let's use the describe method on our Titanic dataset to gain initial insights:
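The call itself is a one-liner. Here is a minimal, self-contained sketch; note that the five-row DataFrame below is only a stand-in, since in the lesson `titanic` is the full dataset loaded from a file:

```python
import pandas as pd

# Stand-in for the full Titanic DataFrame (891 rows in the real dataset);
# in the lesson, `titanic` would be loaded from a file instead.
titanic = pd.DataFrame({
    "age":  [22.0, 38.0, 26.0, 35.0, 80.0],
    "fare": [7.25, 71.2833, 7.925, 53.1, 512.3292],
})

# Summary statistics (count, mean, std, min, quartiles, max)
# for every numerical column
print(titanic.describe())
```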

By running this code, we obtain a statistical summary of the numerical columns, which will help us identify potential outliers.

The output provides the count, mean, standard deviation, minimum, maximum, and percentile values for each numerical column. You might notice the 25%, 50%, and 75% rows, which refer to percentiles, a concept we'll explore in detail soon. For now, observe that the max value for fare is 512.3292, far above the 75th percentile of 31.0. While we haven't yet covered how these percentiles (quartiles) are calculated, such a large gap suggests potential outliers in the fare column. A similar observation can be made for the age column, where the max value of 80 sits well above the 75th percentile. In the upcoming sections, we'll delve deeper into quartiles and how they're used to detect outliers more precisely.

Understanding the Interquartile Range (IQR) Method

The Interquartile Range (IQR) method is a widely used statistical technique for detecting outliers. It helps identify data points that deviate significantly from the norm by focusing on the spread of the middle 50% of the dataset, known as the interquartile range. Let’s explore how this method works with a simple example.

Imagine you have these ages: [22, 23, 24, 24, 25, 25, 30, 35, 90]

Looking at these numbers, you might notice that 90 seems unusually high compared to the other ages. The Interquartile Range (IQR) method helps us confirm this mathematically in three simple steps:

  1. Find the Quartiles: Order your data from smallest to largest and divide it into four equal parts:

    • Q1: Marks the end of the first quarter (24 in our example)
    • Q3: Marks the end of the third quarter (30 in our example)
  2. Calculate the IQR: Subtract Q1 from Q3 to find how spread out the middle 50% of your data is:

    • IQR = Q3 - Q1 = 30 - 24 = 6

    This tells us that the middle 50% of ages in our dataset span a range of just 6 years.
  3. Set Boundaries for Outliers: Multiply the IQR by 1.5 (a standard multiplier that statisticians have found works well for most datasets) and extend that distance below Q1 and above Q3:

    • Lower Boundary = Q1 - 1.5 * IQR = 24 - (1.5 * 6) = 15
    • Upper Boundary = Q3 + 1.5 * IQR = 30 + (1.5 * 6) = 39

Measure          Result
Q1               24
Q3               30
IQR              6
Lower Boundary   15
Upper Boundary   39

Now we can clearly see that 90 is an outlier because it's well above our upper boundary of 39. The IQR method helps us mathematically verify what we initially suspected by looking at the data.
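The three steps above can be reproduced with pandas, whose quantile method uses the same linear-interpolation convention as this example:

```python
import pandas as pd

ages = pd.Series([22, 23, 24, 24, 25, 25, 30, 35, 90])

q1 = ages.quantile(0.25)   # 24.0
q3 = ages.quantile(0.75)   # 30.0
iqr = q3 - q1              # 6.0
lower = q1 - 1.5 * iqr     # 15.0
upper = q3 + 1.5 * iqr     # 39.0

# Flag every age outside the [lower, upper] range
outliers = ages[(ages < lower) | (ages > upper)]
print(list(outliers))      # [90]
```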

Calculating Outlier Bounds with Pandas

We have previously understood the concept of detecting outliers using the Interquartile Range (IQR) method. Now, let's apply this understanding to the Titanic dataset using Pandas to find outliers programmatically.

First, we will create a function that calculates outlier bounds using the quantile method in Pandas. This method helps us determine the quartiles of a dataset, which are key to identifying outliers.

In this function, we calculate the first quartile (Q1) and the third quartile (Q3) by applying the quantile method at the 0.25 and 0.75 points, respectively. These values help us calculate the IQR and subsequently the bounds for detecting outliers. With this function ready, we can proceed to check for outliers in our dataset.

Checking for Outliers in Numerical Columns

With the function ready, let's move on to detect outliers in the numerical columns of the Titanic dataset:
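One way this detection loop might look is sketched below. The small DataFrame and the repeated function definition are included only so the snippet runs on its own; the counts on the real Titanic dataset will of course differ:

```python
import pandas as pd

# Stand-in data; in the lesson this is the full Titanic DataFrame.
titanic = pd.DataFrame({
    "age":  [22, 23, 24, 24, 25, 25, 30, 35, 90],
    "fare": [7.25, 7.9, 8.05, 10.5, 13.0, 15.0, 26.0, 31.0, 512.33],
})

def calculate_outlier_bounds(df, column):
    q1, q3 = df[column].quantile(0.25), df[column].quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

for column in ["age", "fare"]:
    lower, upper = calculate_outlier_bounds(titanic, column)
    # Rows whose value falls outside [lower, upper] are outliers
    outliers = titanic[(titanic[column] < lower) | (titanic[column] > upper)]
    pct = len(outliers) / len(titanic) * 100
    print(f"{column}: {len(outliers)} outliers ({pct:.2f}% of rows)")
```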

This snippet iterates over the specified numerical columns (age and fare), applies the calculate_outlier_bounds function, and identifies outliers as those below the lower bound or above the upper bound. It then prints the count and percentage of outliers for each column. Running this code provides us with the number and percentage of outliers in each column.

From the output, we can see that there are 11 outliers in age, making up 1.23% of the data, and 116 outliers in fare, which constitute 13.02% of the data. This confirms that outliers are present in both columns, with fare having a larger proportion of outliers. Next, we'll address these outliers by capping them.

Managing Outliers in Numerical Columns

Now that we've identified the outliers in the numerical columns of the Titanic dataset, let's address them using a technique called capping. Capping involves setting any outlier values to the calculated lower or upper bounds, thus keeping these extreme values from skewing our data analysis.

Here’s how to cap the outlier values in our dataset:
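A self-contained sketch of the capping step is shown below; as before, the small DataFrame and repeated function definition are stand-ins so the snippet runs on its own:

```python
import pandas as pd

# Stand-in data; in the lesson this is the full Titanic DataFrame.
titanic = pd.DataFrame({
    "age":  [22, 23, 24, 24, 25, 25, 30, 35, 90],
    "fare": [7.25, 7.9, 8.05, 10.5, 13.0, 15.0, 26.0, 31.0, 512.33],
})

def calculate_outlier_bounds(df, column):
    q1, q3 = df[column].quantile(0.25), df[column].quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

for column in ["age", "fare"]:
    lower, upper = calculate_outlier_bounds(titanic, column)
    # clip() caps values: anything below `lower` becomes `lower`,
    # anything above `upper` becomes `upper`
    titanic[column] = titanic[column].clip(lower=lower, upper=upper)

# Verify that the extreme values have been pulled in to the bounds
print(titanic[["age", "fare"]].describe())
```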

For each column, we reuse the calculate_outlier_bounds function to determine the lower and upper boundaries. We then employ the clip method in Pandas, which restricts the data range to the specified bounds. Values below the lower bound are set to the lower bound, and values above the upper bound are set to the upper bound. After applying the changes, we use the describe method to display the summary statistics. This allows us to verify that the extreme values have been capped.

Comparing these statistics with the earlier ones, we can observe that the max values for age and fare have been reduced to 64.8125 and 65.6344, respectively, which are the calculated upper bounds. This indicates that the outliers have been effectively capped, leading to a more consistent data distribution. By capping the outliers, we ensure the dataset maintains its overall structure while mitigating the impact of abnormal values that could interfere with the data's integrity and the subsequent analysis.

Summary and Preparation for Practice Exercises

In this lesson, you have learned how to identify and handle outliers effectively using the IQR method and capping strategies. Recognizing and managing outliers is essential in maintaining the integrity of your dataset and ensuring accurate representation in your analyses. As you move forward to the practice exercises, you will apply these concepts, reinforcing your knowledge and experience with data preprocessing. These practical activities will deepen your understanding of outlier detection and handling techniques, equipping you with the skills necessary for robust feature engineering. Keep practicing and refining these methods as you advance in your feature engineering journey!
