Lesson 2
Identifying and Handling Missing Data
Handling Missing Data in Feature Engineering

Welcome back to the course on Foundations of Feature Engineering! In our previous lesson, we explored the fundamentals of feature engineering using the Titanic dataset. We examined the dataset's structure, looked at basic statistics, and understood how to use the pandas library to manipulate data effectively. Now that you have a clear understanding of these basics, we will delve deeper into a critical aspect of feature engineering: handling missing data.

Missing data can be a significant obstacle in data analysis and can drastically impact the performance of machine learning models. In today's lesson, you will learn how to detect, quantify, and handle missing data to enhance the quality and effectiveness of your datasets.

Detecting Missing Data

To begin with, it's essential to identify where missing data exists within a dataset. In pandas, this can be achieved using the isnull() function, which detects missing values, returning a DataFrame of the same shape with Boolean values indicating True for missing entries. Combined with the sum() function, it provides an efficient way to count missing values in each column.

Let's see this in action with the Titanic dataset:

Python
import pandas as pd

# Load the Titanic dataset
df = pd.read_csv("titanic.csv")

# Display missing values in each column
print("Missing values in each column:")
print(df.isnull().sum())

The output will show the number of missing values in each column. For example:

Plain text
Missing values in each column:
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

This output shows that the missing values are concentrated in four columns: age (177), deck (688), embarked (2), and embark_town (2). Identifying these is the first step in managing them effectively.
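Beyond counting gaps per column, it often helps to inspect the affected rows themselves. The sketch below uses a small toy DataFrame (rather than the full Titanic data) to keep the example self-contained; the same boolean masks built from isnull() work on any DataFrame:

```python
import pandas as pd

# Toy data mimicking two Titanic columns, with deliberate gaps
toy = pd.DataFrame({
    "age": [22.0, None, 38.0],
    "embarked": ["S", "C", None],
})

# Select only the rows where 'embarked' is missing
missing_embarked = toy[toy["embarked"].isnull()]
print(missing_embarked)

# any(axis=1) flags rows with a gap in ANY column
rows_with_gaps = toy[toy.isnull().any(axis=1)]
print(len(rows_with_gaps))  # 2
```

Looking at the actual rows can reveal patterns, such as whether several columns tend to be missing together.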

Quantifying Missing Data

Understanding the extent of missing data helps in deciding how to handle it. By calculating the percentage of missing values for each column, you can assess the impact of missing information on your dataset.

Here's how to do it using the Titanic dataset:

Python
# Display percentage of missing values
print("\nPercentage of missing values:")
print((df.isnull().sum() / len(df) * 100).round(2))

This will yield an output like:

Plain text
Percentage of missing values:
survived        0.00
pclass          0.00
sex             0.00
age            19.87
sibsp           0.00
parch           0.00
fare            0.00
embarked        0.22
class           0.00
who             0.00
adult_male      0.00
deck           77.22
embark_town     0.22
alive           0.00
alone           0.00
dtype: float64

The output quantifies the missing data for each feature, revealing varying levels of missingness. The deck column has a significant 77.22% of its data missing, indicating a substantial lack of information. The age column shows a moderate level of incompleteness with 19.87% missing data. Meanwhile, both the embarked and embark_town columns have only 0.22% of their data missing, reflecting minor gaps that may have a negligible impact on the overall dataset. These insights are crucial for planning strategies to address the missing values effectively.
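Once you have these percentages, a simple threshold can flag columns that deserve special treatment. Here is a minimal sketch using a toy DataFrame; the 50% cutoff is an illustrative choice, not a fixed rule:

```python
import pandas as pd

# Toy data: 'deck' is mostly missing, 'age' partially missing
toy = pd.DataFrame({
    "age":  [22.0, None, 38.0, 26.0],
    "deck": [None, None, None, "C"],
    "fare": [7.25, 71.28, 7.92, 8.05],
})

# Percentage of missing values per column
missing_pct = toy.isnull().sum() / len(toy) * 100

# Columns above the cutoff are candidates for placeholders
# (or removal) rather than straightforward imputation
high_missing = missing_pct[missing_pct > 50].index.tolist()
print(high_missing)  # ['deck']
```

In this toy example, deck (75% missing) crosses the threshold while age (25% missing) does not, mirroring the decision we make for the real Titanic columns below.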

Imputation Techniques for Handling Missing Data

Imputation is a common technique for handling missing data by filling in the missing entries with estimated values. Depending on the nature of your data and the missingness mechanism, you may choose different imputation strategies. Common approaches include using the median for numerical data and the mode for categorical data.

Let's apply imputation to the Titanic dataset:

Python
# Impute missing 'age' with the median age
df['age'] = df['age'].fillna(df['age'].median())

# Impute missing 'embarked' with the most frequent value (mode)
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])

# Impute missing 'embark_town' with the most frequent value (mode)
df['embark_town'] = df['embark_town'].fillna(df['embark_town'].mode()[0])

In the code above, the fillna method fills the missing age values with the median of the available entries, which is robust to outliers and helps preserve the age distribution. For the categorical embarked and embark_town columns, the imputation uses the mode() method to find the most frequently occurring category in each column. This approach helps ensure that the imputed values align naturally with the existing data while preserving important categorical distributions.
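One detail worth noting: mode() returns a Series rather than a single value, because a column can have several equally frequent values. That is why the code indexes it with [0] to take the first mode. A minimal sketch with toy data:

```python
import pandas as pd

s = pd.Series(["S", "C", "S", None])

# mode() ignores missing values and returns a Series of the
# most frequent entries; [0] picks the first one
most_common = s.mode()[0]
print(most_common)  # S

filled = s.fillna(most_common)
print(filled.isnull().sum())  # 0
```

If two categories were tied for most frequent, mode() would return both, and [0] would silently pick the first in sorted order, so it is worth checking for ties when the choice matters.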

Using Placeholders for Missing Data

In some scenarios, particularly with categorical data, you might find it more appropriate to use placeholders to indicate missing values. This can preserve the uniqueness of the missing entries without skewing the feature distribution.

For the deck column in the Titanic dataset, we can use 'Unknown' as a placeholder:

Python
# Fill missing 'deck' values with 'Unknown' as a placeholder
df['deck'] = df['deck'].fillna('Unknown')

Setting 'Unknown' allows you to retain the information that certain data points are genuinely missing without introducing arbitrary values that could affect analysis.
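After filling with a placeholder, value_counts() shows 'Unknown' as its own category, so the missingness stays visible in later analysis. A small self-contained sketch:

```python
import pandas as pd

# Toy 'deck' column with two gaps
deck = pd.Series(["C", None, None, "E"])

deck_filled = deck.fillna("Unknown")

# The placeholder now appears as an explicit category
counts = deck_filled.value_counts()
print(counts["Unknown"])  # 2
```

One caveat: if the column has a pandas Categorical dtype, you must first register the placeholder with cat.add_categories("Unknown") before calling fillna, or pandas will raise an error; a plain object column, as in this sketch, needs no extra step.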

Verification of Missing Data Handling

After handling the missing values, it's important to verify that they have been appropriately addressed:

Python
# Verify missing values have been handled
print("\nRemaining missing values:")
print(df.isnull().sum())

The verification output should display zero missing values across all columns:

Plain text
Remaining missing values:
survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
deck           0
embark_town    0
alive          0
alone          0
dtype: int64

This verification step confirms that all previously missing data in the dataset has been effectively managed. With no remaining missing values, the data is now in a suitable state for further feature engineering and analysis.
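For a quick programmatic check, chaining a second sum() collapses the per-column counts into a single total, which should be zero once everything is handled. A minimal sketch on a small, already-cleaned toy DataFrame:

```python
import pandas as pd

# Toy data standing in for the cleaned Titanic DataFrame
df_clean = pd.DataFrame({
    "age":  [22.0, 28.0],       # imputed with the median
    "deck": ["Unknown", "C"],   # placeholder applied
})

# First sum() counts gaps per column; second collapses to one number
total_missing = df_clean.isnull().sum().sum()
print(total_missing)  # 0
```

Checking this single total in a script or notebook is an easy way to catch a column that was accidentally skipped during cleaning.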

Summary and Practice Preparation

In this lesson, you have learned how to effectively detect, quantify, and address missing data within a dataset. Handling missing data is an essential skill in feature engineering, ensuring that your data is robust and ready for analysis. As you advance to the upcoming practice exercises, remember the techniques we've covered: identifying missing data, imputing numerical features with the median and categorical features with the mode, and strategically applying placeholders for categorical features with heavy missingness. These exercises will give you the opportunity to apply the skills you've acquired and deepen your understanding of data cleaning. Keep practicing and refining your approach as you continue to enhance your feature engineering capabilities!

Enjoy this lesson? Now it's time to practice with Cosmo!
Practice is how you turn knowledge into actual skills.