Welcome back to the Foundations of Feature Engineering course! So far, you have become acquainted with handling missing data and preparing your dataset for analysis. Now, we shift our focus to another crucial aspect of data preprocessing: detecting and addressing outliers.
An outlier is an observation that deviates significantly from the rest of the data. Outliers can skew your analysis, affecting measures like mean and standard deviation, which can lead to incorrect insights. Thus, identifying and handling outliers is a fundamental part of feature engineering to enhance data integrity. In this lesson, you will learn how to use the Interquartile Range (IQR) method to detect outliers and strategies for handling them.
Outliers can dramatically alter the interpretation of data. Consider the following small dataset illustrating salaries:
| ID | Salary    |
|----|-----------|
| 1  | 50,000    |
| 2  | 52,000    |
| 3  | 49,000    |
| 4  | 51,000    |
| 5  | 50,500    |
| 6  | 1,000,000 |
Here, the outlier salary of 1,000,000 significantly skews the mean, inflating the average salary beyond the typical range of this group. The mean jumps from a realistic central tendency of 50,500 to an exaggerated 208,750, a value that describes no one in the group. By effectively identifying and managing outliers, you can achieve a more accurate data representation. Outliers can also distort predictive models, leading to reduced accuracy and suboptimal performance; addressing them ensures robust analytical results.
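To see the effect numerically, here is a quick sketch (the variable names are illustrative) that computes the mean with and without the extreme salary, and shows how much more robust the median is:

```python
import pandas as pd

# Salaries from the example table above; the last value is the outlier
salaries = pd.Series([50_000, 52_000, 49_000, 51_000, 50_500, 1_000_000])

print(salaries.mean())      # 208750.0 -- pulled far above the typical range
print(salaries[:5].mean())  # 50500.0  -- the realistic central tendency
print(salaries.median())    # 50750.0  -- the median barely moves
```

A single extreme value quadruples the mean, while the median stays near the typical salary, which is why the mean alone can be misleading in the presence of outliers.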
Before delving into specific methods for detecting outliers, it can be helpful to start with a broad view of the data using the `describe` method. This statistical summary provides key metrics such as the mean, standard deviation, minimum, and maximum values of each numerical column, which can quickly highlight potential outliers.

Let's use the `describe` method on our Titanic dataset to gain initial insights:
```python
import pandas as pd

# Load the dataset
df = pd.read_csv("titanic.csv")

# Display summary statistics for numerical columns
print("Summary statistics for numerical columns:")
print(df.describe())
```
By running this code, we obtain a statistical summary of the numerical columns, which will help us identify potential outliers.
```
Summary statistics for numerical columns:
         survived      pclass         age       sibsp       parch        fare
count  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean     0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std      0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min      0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%      0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%      0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%      1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max      1.000000    3.000000   80.000000    8.000000    6.000000  512.329200
```
The output provides the count, mean, standard deviation, minimum, maximum, and percentile values for each numerical column. You might notice terms like `25%`, `50%`, and `75%`, which refer to percentiles, a concept we'll explore in detail soon. For now, observe that the `max` value for `fare` is 512.3292, significantly higher than the 75th percentile of 31.0000. While we haven't yet covered how these percentiles (quartiles) are calculated, this large gap suggests the presence of potential outliers in the `fare` column. A similar observation can be made for the `age` column, where the `max` value of 80 sits well above the 75th percentile of 38. In the upcoming sections, we'll delve deeper into quartiles and how they're used to detect outliers more precisely.
The Interquartile Range (IQR) method is a widely used statistical technique for detecting outliers. It helps identify data points that deviate significantly from the norm by focusing on the spread of the middle 50% of the dataset, known as the interquartile range. Let’s explore how this method works with a simple example.
Imagine you have these ages: `[22, 23, 24, 24, 25, 25, 30, 35, 90]`
Looking at these numbers, you might notice that 90 seems unusually high compared to the other ages. The Interquartile Range (IQR) method helps us confirm this mathematically in three simple steps:
1. **Find the Quartiles**: Order your data from smallest to largest and divide it into four equal parts:
   - Q1 marks the end of the first quarter (24 in our example)
   - Q3 marks the end of the third quarter (30 in our example)
2. **Calculate the IQR**: Subtract Q1 from Q3 to find how spread out the middle 50% of your data is:
   - IQR = Q3 - Q1 = 30 - 24 = 6

   This tells us that most ages in our dataset vary within a range of 6 years.
3. **Set Boundaries for Outliers**: Extend 1.5 times the IQR beyond each quartile (1.5 is a standard multiplier that statisticians have found works well for most datasets):
   - Lower Boundary = Q1 - 1.5 * IQR = 24 - (1.5 * 6) = 15
   - Upper Boundary = Q3 + 1.5 * IQR = 30 + (1.5 * 6) = 39
| Measure        | Result |
|----------------|--------|
| Q1             | 24     |
| Q3             | 30     |
| IQR            | 6      |
| Lower Boundary | 15     |
| Upper Boundary | 39     |
Now we can clearly see that 90 is an outlier because it's well above our upper boundary of 39. The IQR method helps us mathematically verify what we initially suspected by looking at the data.
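The three steps above can be verified with a small sketch in Pandas (using the example ages; `quantile` with its default linear interpolation lands exactly on 24 and 30 here):

```python
import pandas as pd

ages = pd.Series([22, 23, 24, 24, 25, 25, 30, 35, 90])

# Step 1: find the quartiles
Q1 = ages.quantile(0.25)  # 24.0
Q3 = ages.quantile(0.75)  # 30.0

# Step 2: calculate the IQR
IQR = Q3 - Q1             # 6.0

# Step 3: set the boundaries
lower = Q1 - 1.5 * IQR    # 15.0
upper = Q3 + 1.5 * IQR    # 39.0

# Only 90 falls outside the [15, 39] boundaries
print(ages[(ages < lower) | (ages > upper)].tolist())  # [90]
```

This mirrors the hand calculation exactly, which is reassuring before we apply the same logic to a real dataset.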
We have previously understood the concept of detecting outliers using the Interquartile Range (IQR) method. Now, let's apply this understanding to the Titanic dataset using Pandas to find outliers programmatically.
First, we will create a function that calculates outlier bounds using the `quantile` method in Pandas. This method helps us determine the quartiles of a dataset, which are key to identifying outliers.
```python
# Function to calculate outlier bounds using the IQR method
def calculate_outlier_bounds(df, column):
    # Calculate the first quartile (Q1) and third quartile (Q3) using the quantile method
    Q1 = df[column].quantile(0.25)  # 25th percentile
    Q3 = df[column].quantile(0.75)  # 75th percentile
    # Compute the Interquartile Range (IQR) as the difference between Q3 and Q1
    IQR = Q3 - Q1
    # Determine the lower and upper bounds for outliers
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return lower_bound, upper_bound
```
In this function, we calculate the first quartile (`Q1`) and the third quartile (`Q3`) by applying the `quantile` method at the 0.25 and 0.75 points, respectively. These values give us the IQR and, from it, the bounds for detecting outliers. With this function ready, we can proceed to check for outliers in our dataset.
With the function ready, let's move on to detect outliers in the numerical columns of the Titanic dataset:
```python
# Check for outliers in numerical columns
for column in ['age', 'fare']:
    # Calculate outlier bounds for the current column
    lower, upper = calculate_outlier_bounds(df, column)
    # Identify the outliers by checking values outside the calculated bounds
    outliers = df[(df[column] < lower) | (df[column] > upper)]
    # Display outliers information for the current column
    print(f"Outliers in {column}:")
    print(f"Number of outliers: {len(outliers)}")
    print(f"Percentage: {(len(outliers)/len(df)*100):.2f}%\n")
```
This snippet iterates over the specified numerical columns (`age` and `fare`), applies the `calculate_outlier_bounds` function, and identifies outliers as those below the lower bound or above the upper bound. It then prints the count and percentage of outliers for each column.
```
Outliers in age:
Number of outliers: 11
Percentage: 1.23%

Outliers in fare:
Number of outliers: 116
Percentage: 13.02%
```
From the output, we can see that there are 11 outliers in `age`, making up 1.23% of the data, and 116 outliers in `fare`, which constitute 13.02% of the data. This confirms that outliers are present in both columns, with `fare` having a larger proportion of outliers. Next, we'll address these outliers by capping them.
Now that we've identified the outliers in the numerical columns of the Titanic dataset, let's address them using a technique called capping. Capping involves setting any outlier values to the calculated lower or upper bounds, thus keeping these extreme values from skewing our data analysis.
Here’s how to cap the outlier values in our dataset:
```python
for column in ['age', 'fare']:
    # Calculate outlier bounds for the current column
    lower, upper = calculate_outlier_bounds(df, column)
    # Use the clip method to cap the values at the lower and upper bounds
    df[column] = df[column].clip(lower=lower, upper=upper)

# Display summary statistics after managing outliers
print("Summary statistics after capping outliers:")
print(df[['age', 'fare']].describe())
```
For each column, we reuse the `calculate_outlier_bounds` function to determine the lower and upper boundaries. We then employ the `clip` method in Pandas, which restricts the data range to the specified bounds: values below the lower bound are set to the lower bound, and values above the upper bound are set to the upper bound. After applying the changes, we use the `describe` method to display the summary statistics, which allows us to verify that the extreme values have been capped.
```
Summary statistics after capping outliers:
              age        fare
count  714.000000  891.000000
mean    29.622700   24.046813
std     14.316665   20.481625
min      0.420000    0.000000
25%     20.125000    7.910400
50%     28.000000   14.454200
75%     38.000000   31.000000
max     64.812500   65.634400
```
Comparing these statistics with the earlier ones, we can observe that the `max` values for `age` and `fare` have been reduced to 64.8125 and 65.6344, respectively, which are the calculated upper bounds. This indicates that the outliers have been effectively capped, leading to a more consistent data distribution. By capping the outliers, we preserve the dataset's overall structure while mitigating the impact of extreme values that could compromise the data's integrity and the subsequent analysis.
In this lesson, you have learned how to identify and handle outliers effectively using the IQR method and capping strategies. Recognizing and managing outliers is essential in maintaining the integrity of your dataset and ensuring accurate representation in your analyses. As you move forward to the practice exercises, you will apply these concepts, reinforcing your knowledge and experience with data preprocessing. These practical activities will deepen your understanding of outlier detection and handling techniques, equipping you with the skills necessary for robust feature engineering. Keep practicing and refining these methods as you advance in your feature engineering journey!