Handling Missing Values and Encoding in Data Preprocessing

Lesson Introduction

Welcome to this lesson on handling missing values and encoding categorical variables! In data science, preparing your data is crucial before analysis or building models. Think of it like a chef preparing ingredients before cooking: the dataset has to be complete and in a format that models can understand. By the end, you'll be ready to clean and prepare your dataset for analysis.

Identifying Missing Values

First, we need to identify missing values. These can occur due to data entry errors or incomplete data collection, and finding them is the first step in ensuring data integrity. Pandas provides a convenient check in the isnull() method, which returns a boolean DataFrame of the same shape as the original, with True wherever a value is missing.

Let's identify missing values in our dataset:
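
As a minimal sketch, assuming a small hypothetical DataFrame named df (the column names follow the lesson, but the values are made up for illustration), the check might look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the values are invented for illustration.
df = pd.DataFrame({
    "age": [25, np.nan, 47, 33],
    "income": [52000, 61000, np.nan, np.nan],
    "gender": ["F", None, "M", "F"],
})

# isnull() returns a boolean DataFrame of the same shape:
# True marks a missing entry, False marks a present one.
missing_mask = df.isnull()

# Summing the boolean mask counts the missing entries in each column.
missing_counts = missing_mask.sum()
print(missing_counts)
```

Summing the mask per column is usually easier to read than the full boolean table, especially on wide datasets.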

Output:

Note that other columns are omitted for brevity.

Handling Missing Numerical Values with Median

For certain numerical columns like age, income, and credit_score, filling missing values with the median is a common strategy. The median is often preferred over the mean in real-world datasets because it is less sensitive to outliers and skewed data: a few extreme values can pull the mean far from the typical observation, while the median stays near the middle of the data.

Here's how to fill these specific numerical values with the median:
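
One possible way to do this, sketched on a hypothetical DataFrame (the values are invented; only the column names come from the lesson):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; note the outlier in income.
df = pd.DataFrame({
    "age": [25, np.nan, 47, 33],
    "income": [52000, 61000, np.nan, 250000],
    "credit_score": [700, 650, np.nan, 720],
})

median_cols = ["age", "income", "credit_score"]

# median() skips missing values by default and returns one median per column.
medians = df[median_cols].median()

# Passing a Series to fillna() fills each column with the value
# stored under that column's name.
df[median_cols] = df[median_cols].fillna(medians)
print(df)
```

In this sample the missing income is filled with 61000, the middle of the three observed values, so the 250000 outlier does not inflate the fill value the way it would with the mean.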

Handling Missing Numerical Values with Zero

For other numerical columns like num_purchases, time_on_site, and num_visits, filling missing values with zero is appropriate.

Here's how to fill these columns with zero:
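
A minimal sketch, again assuming a hypothetical DataFrame with invented values:

```python
import numpy as np
import pandas as pd

# Hypothetical activity data; a missing entry is assumed to mean "no activity".
df = pd.DataFrame({
    "num_purchases": [3, np.nan, 0, 7],
    "time_on_site": [12.5, 4.0, np.nan, np.nan],
    "num_visits": [np.nan, 2, 5, 1],
})

zero_cols = ["num_purchases", "time_on_site", "num_visits"]

# Replace every missing value in the selected columns with 0.
df[zero_cols] = df[zero_cols].fillna(0)
print(df)
```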

This approach is useful only when zero is a meaningful value in the context of the data, for example when it logically represents the absence of activity or measurement (such as zero purchases or zero visits). Use zero only when it accurately reflects a possible real-world scenario; in other contexts it can introduce bias or misrepresent the data.

Example of bad usage:
Suppose you have a column income where missing values are filled with zero. In most real-world datasets, an income of exactly zero is rare, so it is seldom a plausible value for a typical individual. Filling missing income values with zero would incorrectly suggest that those individuals have no income, which can distort analysis and model predictions. In such cases, using the median or another appropriate statistic is preferred.
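
The distortion can be seen in a small sketch. The income values here are hypothetical and chosen only to make the arithmetic easy to follow:

```python
import numpy as np
import pandas as pd

# Hypothetical incomes with two missing entries.
income = pd.Series([40000, 50000, 60000, np.nan, np.nan], name="income")

# Mean of the values that were actually observed (NaN is skipped).
observed_mean = income.mean()

# Filling with zero pretends two people earn nothing.
zero_filled_mean = income.fillna(0).mean()

# Filling with the median keeps the fill value at a typical income.
median_filled_mean = income.fillna(income.median()).mean()

print(observed_mean, zero_filled_mean, median_filled_mean)
```

With these numbers, zero-filling drags the average income from 50000 down to 30000, while median-filling leaves it at 50000.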

Handling Missing Categorical Values with "Unknown"

For categorical columns like gender, region, preferred_category, and referral_source, filling missing values with "Unknown" is a practical strategy. This explicitly introduces a new category, which keeps missing values separate from observed ones so a model can treat missingness as its own signal. Here's how to fill these specific categorical values with "Unknown":
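
A minimal sketch on a hypothetical DataFrame (column names from the lesson, values invented):

```python
import pandas as pd

# Hypothetical categorical data with some missing entries.
df = pd.DataFrame({
    "gender": ["F", None, "M", "F"],
    "region": ["North", "South", None, None],
    "preferred_category": [None, "Books", "Toys", "Books"],
    "referral_source": ["Ad", None, "Friend", "Ad"],
})

unknown_cols = ["gender", "region", "preferred_category", "referral_source"]

# Replace missing entries with an explicit "Unknown" label.
df[unknown_cols] = df[unknown_cols].fillna("Unknown")
print(df)
```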

Handling Other Missing Categorical Values with Mode

For the remaining categorical columns, filling missing values with the mode (the most frequent value) is a common approach. This assumes the most common category is a reasonable substitute for the missing entries. Here's how to fill these remaining categorical values with the mode:
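
The lesson does not name the remaining columns, so the sketch below uses two hypothetical ones, device_type and membership_level, purely for illustration:

```python
import pandas as pd

# Hypothetical columns and values; neither appears in the lesson's dataset.
df = pd.DataFrame({
    "device_type": ["mobile", "desktop", None, "mobile"],
    "membership_level": ["basic", None, "basic", "gold"],
})

mode_cols = ["device_type", "membership_level"]

for col in mode_cols:
    # mode() ignores missing values and returns a Series, because several
    # values can tie for most frequent; iloc[0] takes the first of them.
    most_frequent = df[col].mode().iloc[0]
    df[col] = df[col].fillna(most_frequent)

print(df)
```

When several categories tie, mode() returns them in sorted order, so taking the first one is an arbitrary but reproducible choice.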

Encoding Categorical Variables

After handling missing values, we encode categorical variables. Most machine learning models require numerical input, so we convert categorical data into a numerical format. One-hot encoding is a popular method: each category becomes its own binary column, with a 1 marking the rows that belong to that category. Let's apply one-hot encoding:
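
As a minimal sketch, assuming a hypothetical two-column DataFrame with invented values:

```python
import pandas as pd

# Hypothetical data: one numerical column and one categorical column.
df = pd.DataFrame({
    "age": [25, 33, 47],
    "region": ["North", "South", "North"],
})

# Replace "region" with one 0/1 column per category.
# dtype=int requests integer indicators instead of the default dtype,
# which differs between pandas versions.
encoded = pd.get_dummies(df, columns=["region"], dtype=int)
print(encoded)
```

The original region column is dropped and replaced by region_North and region_South, while age passes through unchanged.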

We use pd.get_dummies() for one-hot encoding, transforming categorical variables into binary columns, making the data suitable for models.

Lesson Summary and Practice Introduction

Congratulations! You've learned to handle missing values using different strategies and encode categorical variables. These are essential skills in data preprocessing, ensuring your data is clean and ready for analysis or modeling. By filling missing values and encoding data, you're ready to build robust models.

Now, it's time to practice. In the upcoming session, you'll apply these techniques to a new dataset in the CodeSignal IDE. This hands-on experience will reinforce your understanding and prepare you for real-world data science challenges. Let's get started!
