Welcome back! In the last lesson, you learned how to build a complete insurance cost prediction model using both numerical and categorical features. You also practiced encoding categorical variables and building a modeling pipeline. Now, you are ready to take the next step: preparing your data for real-world challenges.
In practice, data is rarely perfect. Customer databases often contain missing values, outliers, duplicates, and inconsistencies. If these issues are not addressed, your models may produce unreliable results or even fail to run. That is why data cleaning is a critical step in any data science project.
In this lesson, you will learn how to clean PredictHealth's customer database so it is ready for modeling. You will inspect the data for problems, handle missing values, remove duplicate records, detect and treat outliers, and normalize numerical features. By the end, you will have a clean dataset that is ready for building robust predictive models. This lesson will build directly on your previous work, but with a focus on making your data as reliable as possible.
Before you can clean your data, you need to know what problems exist. In real-world datasets, it is common to find missing values and outliers. Missing values are simply empty cells in your data, while outliers are values that are unusually high or low compared to the rest of the data.
Let's look at an example. Suppose you have a copy of the insurance data, but it has been made "messy" for demonstration purposes. This messy dataset has missing values added to all columns (about a 6% missing rate) and some extreme outliers in the `charges` and `bmi` columns.
Here is how you can check for missing values in the dataset. The sketch below assumes the messy data has been loaded into a pandas DataFrame; the variable and file names are illustrative:
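```python
import pandas as pd

# Load the messy copy of the insurance data
df = pd.read_csv("insurance_messy.csv")

# Count missing values in each column
print(df.isnull().sum())
```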
The output might look like this:
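```
age          78
sex          81
bmi          80
children     79
smoker       82
region       77
charges      80
dtype: int64
```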
This tells you how many missing values are in each column. Even a few missing values can cause problems for your model, especially if they are in important columns. Notice that in this messy dataset, all columns have missing values, which is common in real-world scenarios.
Outliers are another issue. The messy dataset contains some extreme outliers: insurance charges as high as 200,000, and BMI values between 50 and 70. These values are far outside the normal range and can distort your model's understanding of the data, leading to poor predictions.
Once you have identified missing values, you need to decide how to handle them. For numerical features like `age`, `bmi`, and `children`, a common approach is to fill in missing values with the median of that column. The median is less affected by outliers than the mean, so it is a robust choice.
Here is how you can fill missing values for numerical features:
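```python
# Fill missing values in numerical columns with the column median,
# which is less sensitive to outliers than the mean
for col in ["age", "bmi", "children"]:
    df[col] = df[col].fillna(df[col].median())
```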
For categorical features such as `sex`, `smoker`, and `region`, you can fill missing values with the most common value, also known as the mode. This ensures that the filled value is a valid category:
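```python
# Fill missing values in categorical columns with the mode (most frequent value)
for col in ["sex", "smoker", "region"]:
    df[col] = df[col].fillna(df[col].mode()[0])
```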
If the target variable (`charges`) is missing, it is best to remove those rows entirely, since you cannot train or evaluate a model without a target value:
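```python
# Drop rows where the target variable is missing
df = df.dropna(subset=["charges"])

# Verify that no missing values remain
print(df.isnull().sum())
```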
The output will be:
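```
age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
```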
After completing these steps, your dataset will be free of missing values and ready for the next cleaning step.
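Duplicate records are another common issue: the same customer can appear more than once, which biases the model toward those repeated rows. Here is a minimal sketch of how you can find and remove exact duplicates with pandas:

```python
# Count exact duplicate rows, then drop them
num_duplicates = df.duplicated().sum()
print(f"Duplicate rows found: {num_duplicates}")

df = df.drop_duplicates()
```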
Outliers are values that are much higher or lower than most of the data. They can have a big impact on your model, especially in regression tasks. One common way to detect outliers is the Interquartile Range (IQR) method. The IQR is the range between the 25th percentile (Q1) and the 75th percentile (Q3) of the data. Any value more than 1.5 times the IQR below Q1 or above Q3 is considered an outlier.
Here is one way to write a function that detects outliers using the IQR method and caps them at the boundary values instead of dropping the rows:
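```python
def cap_outliers_iqr(df, column):
    """Cap values more than 1.5 * IQR below Q1 or above Q3 at those boundaries."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr

    # Count values that fall outside the bounds
    num_outliers = ((df[column] < lower_bound) | (df[column] > upper_bound)).sum()

    # Cap (clip) the outliers at the bounds instead of removing the rows
    df[column] = df[column].clip(lower=lower_bound, upper=upper_bound)

    print(f"{column}: {num_outliers} outliers capped")
    return df
```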
You can apply this function to numerical columns like `age`, `bmi`, and `charges`:
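```python
for col in ["age", "bmi", "charges"]:
    df = cap_outliers_iqr(df, col)
```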
The output will show how many outliers were found and capped in each column (the exact counts depend on the messy data, so the numbers below are only illustrative):
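```
age: 0 outliers capped
bmi: 14 outliers capped
charges: 127 outliers capped
```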
This step helps keep your data realistic and prevents extreme values from skewing your model.
Real-world categorical data often contains inconsistencies like different casing ("Male" vs "male") and extra whitespace. These can create artificial categories that hurt model performance.
Here's how to standardize categorical values:
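```python
# Remove extra whitespace and standardize casing in the categorical columns
for col in ["sex", "smoker", "region"]:
    df[col] = df[col].str.strip().str.lower()

# Check the resulting categories
for col in ["sex", "smoker", "region"]:
    print(f"{col}: {sorted(df[col].unique())}")
```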
The output will be:
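```
sex: ['female', 'male']
smoker: ['no', 'yes']
region: ['northeast', 'northwest', 'southeast', 'southwest']
```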
This ensures all categorical values are lowercase and have no extra spaces, making them consistent for modeling.
After handling missing values, duplicates, and outliers, it is a good idea to normalize your numerical features. Normalization scales all values to a similar range, usually between 0 and 1. This is especially important when your features have very different scales, as it helps the model treat all features fairly.
You can use the `MinMaxScaler` from scikit-learn to normalize your data:
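```python
from sklearn.preprocessing import MinMaxScaler

# Scale the numerical features to the [0, 1] range;
# the target column (charges) is deliberately left out
numerical_cols = ["age", "bmi", "children"]
scaler = MinMaxScaler()
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])
```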
After normalization, you can check the results:
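```python
# Confirm that each normalized column now spans the [0, 1] range
print(df[numerical_cols].agg(["min", "max"]))
```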
The output will show that all normalized values are between 0 and 1:
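```
     age  bmi  children
min  0.0  0.0       0.0
max  1.0  1.0       1.0
```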
Notice that we don't normalize the `charges` column, since it's our target variable and we want to keep it in its original scale for interpretability.
You have now completed all the key steps in cleaning your dataset. You started by inspecting the data for missing values and outliers, filled in or removed problematic values, dropped duplicate records, capped extreme values, and standardized categorical values. Finally, you normalized the numerical features so they are on the same scale, ensuring your data is clean and consistent.
The cleaned dataset is now ready for preprocessing steps like categorical encoding (which you learned in the previous lesson) before modeling. You can check its final shape:
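```python
# Final shape of the cleaned dataset: (rows, columns)
print(f"Cleaned dataset shape: {df.shape}")
```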
The output will show your cleaned dataset dimensions (the exact row count depends on how many rows were dropped, so the numbers below are illustrative):
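```
Cleaned dataset shape: (1258, 7)
```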
All data quality issues have been resolved and the data is ready for modeling.
In this lesson, you learned how to clean a real-world customer database for predictive modeling. You practiced identifying and handling missing values, removing duplicate records, detecting and capping outliers, standardizing categorical values, and normalizing numerical features. Each of these steps is essential for building reliable and accurate models.
You are now ready to apply these data cleaning techniques in hands-on exercises. As you practice, remember that clean data is the foundation of every successful data science project. Good luck, and I look forward to seeing your progress in the next section!
