Welcome back to the course on Foundations of Feature Engineering! In our previous lesson, we explored the fundamentals of feature engineering using the Titanic dataset. We examined the dataset's structure, looked at basic statistics, and learned how to use the pandas library to manipulate data effectively. Now that you have a clear understanding of these basics, we will delve deeper into a critical aspect of feature engineering: handling missing data.
Missing data can be a significant obstacle in data analysis and can drastically impact the performance of machine learning models. In today's lesson, you will learn how to detect, quantify, and handle missing data to enhance the quality and effectiveness of your datasets.
To begin with, it's essential to identify where missing data exists within a dataset. In pandas, this can be achieved using the `isnull()` method, which detects missing values and returns a DataFrame of the same shape with Boolean values, where `True` marks a missing entry. Combined with the `sum()` method, it provides an efficient way to count the missing values in each column.
Let's see this in action with the Titanic dataset:
```python
import pandas as pd

# Load the Titanic dataset
df = pd.read_csv("titanic.csv")

# Display missing values in each column
print("Missing values in each column:")
print(df.isnull().sum())
```
The output will show the number of missing values in each column. For example:
```text
Missing values in each column:
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
```
This output indicates that there are missing values, particularly in the `age`, `deck`, `embarked`, and `embark_town` columns. Identifying these is the first step in managing them effectively.
Understanding the extent of missing data helps in deciding how to handle it. By calculating the percentage of missing values for each column, you can assess the impact of missing information on your dataset.
Here's how to do it using the Titanic dataset:
```python
# Display percentage of missing values
print("\nPercentage of missing values:")
print((df.isnull().sum() / len(df) * 100).round(2))
```
This will yield an output like:
```text
Percentage of missing values:
survived        0.00
pclass          0.00
sex             0.00
age            19.87
sibsp           0.00
parch           0.00
fare            0.00
embarked        0.22
class           0.00
who             0.00
adult_male      0.00
deck           77.22
embark_town     0.22
alive           0.00
alone           0.00
dtype: float64
```
The output quantifies the missing data for each feature, revealing varying levels of missingness. The `deck` column has a significant 77.22% of its data missing, indicating a substantial lack of information. The `age` column shows a moderate level of incompleteness, with 19.87% missing data. Meanwhile, both the `embarked` and `embark_town` columns have only 0.22% of their data missing, reflecting minor gaps that should have a negligible impact on the overall dataset. These insights are crucial for planning strategies to address the missing values effectively.
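One way to turn these percentages into a plan is to group columns by how much data they are missing, flagging heavily missing columns as candidates for placeholders (or removal) and lightly missing ones for simple imputation. Below is a minimal sketch of that idea; the 50% threshold is an illustrative choice, not a rule from this lesson:

```python
# Percentage of missing values per column
missing_pct = df.isnull().sum() / len(df) * 100

# Illustrative split: columns above 50% missing need special treatment,
# columns with some (but less) missingness are candidates for imputation
heavily_missing = missing_pct[missing_pct > 50].index.tolist()
lightly_missing = missing_pct[(missing_pct > 0) & (missing_pct <= 50)].index.tolist()

print("Heavily missing columns:", heavily_missing)   # e.g. ['deck']
print("Lightly missing columns:", lightly_missing)   # e.g. ['age', 'embarked', 'embark_town']
```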
Imputation is a common technique for handling missing data by filling in the missing entries with estimated values. Depending on the nature of your data and the missingness mechanism, you may choose different imputation strategies. Common approaches include using the median for numerical data and the mode for categorical data.
Let's apply imputation to the Titanic dataset:
```python
# Impute missing 'age' values with the median age
df['age'] = df['age'].fillna(df['age'].median())

# Impute missing 'embarked' values with the most frequent value (mode)
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])

# Impute missing 'embark_town' values with the most frequent value (mode)
df['embark_town'] = df['embark_town'].fillna(df['embark_town'].mode()[0])
```
In the code above, the `fillna()` method fills the missing `age` values with the median of the available ages, a choice that keeps the column's central value stable and is robust to outliers. For the categorical `embarked` and `embark_town` columns, the imputation uses the `mode()` method to find the most frequently occurring category in each column (`mode()` returns a Series, so `[0]` selects the top value). This approach helps ensure that the imputed values align naturally with the existing data while preserving the existing categorical distributions.
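As an aside, the same median and most-frequent imputations can be expressed with scikit-learn's `SimpleImputer`, which is convenient when the fill values must be learned on training data and reused on new data. This is a minimal sketch of that alternative (it replaces, rather than follows, the `fillna` calls above, and assumes scikit-learn is installed):

```python
from sklearn.impute import SimpleImputer

# Median imputation for the numerical 'age' column
num_imputer = SimpleImputer(strategy="median")
df[['age']] = num_imputer.fit_transform(df[['age']])

# Most-frequent imputation for the categorical columns
cat_imputer = SimpleImputer(strategy="most_frequent")
df[['embarked', 'embark_town']] = cat_imputer.fit_transform(df[['embarked', 'embark_town']])
```

The advantage of a fitted imputer is that the learned statistics can later be applied to a validation or test set with `transform()`, which avoids leaking information from data the model should not have seen.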
In some scenarios, particularly with categorical data, you might find it more appropriate to use placeholders to indicate missing values. This can preserve the uniqueness of the missing entries without skewing the feature distribution.
For the `deck` column in the Titanic dataset, we can use `'Unknown'` as a placeholder:
```python
# Fill missing 'deck' values with 'Unknown' as a placeholder
df['deck'] = df['deck'].fillna('Unknown')
```
Using `'Unknown'` as the fill value retains the information that certain data points are genuinely missing without introducing arbitrary values that could affect the analysis.
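A related idea, not part of the original lesson code, is to record the missingness itself as a feature, since the fact that a passenger's deck was unrecorded may itself carry signal. A minimal sketch under that assumption, deriving the flag from the placeholder we just inserted (the `deck_missing` column name is illustrative):

```python
# Flag rows whose 'deck' value was originally missing (now marked 'Unknown')
df['deck_missing'] = (df['deck'] == 'Unknown').astype(int)
```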
After handling the missing values, it's important to verify that they have been appropriately addressed:
```python
# Verify missing values have been handled
print("\nRemaining missing values:")
print(df.isnull().sum())
```
The verification output should display zero missing values across all columns:
```text
Remaining missing values:
survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
deck           0
embark_town    0
alive          0
alone          0
dtype: int64
```
This verification step confirms that all previously missing data in the dataset has been effectively managed. With no remaining missing values, the data is now in a suitable state for further feature engineering and analysis.
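If this check runs inside a preprocessing script rather than a notebook, it can be worth turning the visual inspection into a hard check so the pipeline fails loudly if anything was missed. A small optional addition, not part of the original lesson:

```python
# Raise an error if any missing values remain anywhere in the DataFrame
assert df.isnull().sum().sum() == 0, "Unexpected missing values remain after imputation"
```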
In this lesson, you have learned how to effectively detect, quantify, and address missing data within a dataset. Handling missing data is an essential skill in feature engineering, ensuring that your data is robust and ready for analysis. As you advance to the upcoming practice exercises, remember the techniques we've covered: identifying missing data, imputing numerical features with the median and categorical features with the mode, and strategically applying placeholders such as `'Unknown'` for categorical features. These exercises will give you the opportunity to apply the skills you've acquired and deepen your understanding of data cleaning. Keep practicing and refining your approach as you continue to enhance your feature engineering capabilities!