Introduction and Lesson Overview

Welcome back to PredictHealth's Multi-Factor Cost Models course. In our previous lesson, you successfully built your first multiple regression model using numerical variables — age and BMI — to predict insurance costs. You learned how to prepare data, train models, and interpret coefficients in a business context. That model achieved solid performance, but we left some important information on the table.

In real-world insurance pricing, some of the most significant cost drivers aren't numerical at all — they're categorical. Think about it: whether someone smokes or not can dramatically impact their health risks and insurance costs. Similarly, gender and geographic region can influence pricing due to different risk profiles and healthcare costs across areas.

In this lesson, you will learn how to incorporate these categorical variables — smoking status, gender, and region — into your regression models. This means moving from a simple two-feature model to a comprehensive model that captures the full picture of what drives insurance costs. You'll discover how to prepare categorical data for machine learning, build sophisticated preprocessing pipelines, and interpret the results to make better business decisions.

By the end of this lesson, you'll have a much more powerful and realistic insurance cost prediction model that PredictHealth could actually use in practice. Let's dive into the world of categorical variables and see how they transform our modeling capabilities.

Understanding Categorical Variables in Insurance Data

As a reminder, our insurance dataset contains several types of variables. In the previous lesson, we focused on the numerical ones: age and BMI. Now, let's examine the categorical features that we haven't used yet: sex, smoker, and region.

These categorical variables represent different categories or groups rather than measurable quantities. The sex column contains values like "male" and "female," the smoker column has "yes" and "no," and the region column includes areas like "northeast," "southeast," "southwest," and "northwest."

Why do these categories matter so much in insurance pricing? Each category represents a different risk profile. For example, statistical data show that smokers typically have higher healthcare costs due to smoking-related health issues. Geographic regions might have different healthcare costs, lifestyle factors, or even natural disaster risks that affect insurance claims. Gender can also correlate with different health patterns and life expectancy rates.

The challenge is that machine learning algorithms, including linear regression, work with numbers, not text categories. When we fed age and BMI into our previous model, those were already numbers that the algorithm could use directly. But how do we handle "male" versus "female" or "yes" versus "no" for smoking status? This is where categorical data preprocessing becomes essential.

Unlike numerical features, where the values have a natural mathematical relationship (age 30 is greater than age 25), categorical variables don't have this inherent ordering. We need a different approach to convert these categories into a format that our regression model can understand and use effectively.

Preparing Categorical Data with One-Hot Encoding

The solution to working with categorical data is called one-hot encoding. This technique converts each category into its own binary column, where 1 means the category applies and 0 means it doesn't.

Let's see how this works with a simple example. If we have a smoker column with values "yes" and "no," one-hot encoding creates two new columns: smoker_yes and smoker_no. For a person who smokes, smoker_yes would be 1 and smoker_no would be 0. For a non-smoker, it's the opposite.

However, there's an important detail here: we typically "drop the first" category to avoid redundancy. If we know that smoker_yes is 0, we automatically know that smoker_no must be 1. Keeping both columns would give our model redundant information and could cause mathematical problems known as multicollinearity (sometimes called the dummy variable trap). So we usually keep just smoker_yes — when it's 1, the person smokes; when it's 0, they don't.
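
To make this concrete, here is a minimal sketch using scikit-learn's OneHotEncoder on a toy smoker column (the customer values are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# A toy smoker column with four illustrative customers
smoker = pd.DataFrame({"smoker": ["yes", "no", "no", "yes"]})

# drop="first" removes the first category alphabetically ("no"),
# leaving a single smoker_yes indicator column.
# sparse_output=False requires scikit-learn 1.2+; use sparse=False on older versions.
encoder = OneHotEncoder(drop="first", sparse_output=False)
encoded = encoder.fit_transform(smoker)

print(encoder.get_feature_names_out())  # ['smoker_yes']
print(encoded)                          # [[1.] [0.] [0.] [1.]]
```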

Here's how we set up the preprocessing for our mixed data types using scikit-learn's ColumnTransformer:
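
A minimal sketch of that setup (the variable names here are my own, but the column names match the dataset described above):

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# The numerical and categorical columns from our insurance dataset
numerical_features = ["age", "bmi"]
categorical_features = ["sex", "smoker", "region"]

preprocessor = ColumnTransformer(
    transformers=[
        # Numerical columns pass through unchanged
        ("num", "passthrough", numerical_features),
        # Categorical columns become binary indicator columns,
        # with the first category of each dropped to avoid redundancy
        ("cat", OneHotEncoder(drop="first"), categorical_features),
    ]
)
```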

The ColumnTransformer allows us to apply different preprocessing steps to different types of columns. For numerical features, we use "passthrough", which means they go through unchanged. For categorical features, we apply OneHotEncoder to convert them into the binary format our model needs.

Building Comprehensive Modeling Pipelines

Now that we have our preprocessing set up, we need to combine it with our regression model. This is where scikit-learn's Pipeline becomes incredibly useful. A pipeline chains together preprocessing steps and the final model into one cohesive workflow.

Here's how we create our complete modeling pipeline:
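
A sketch of that pipeline, reusing the preprocessor defined above:

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Chain the preprocessing and the model into a single estimator
model = Pipeline(steps=[
    ("preprocessor", preprocessor),  # the ColumnTransformer from the previous step
    ("regressor", LinearRegression()),
])
```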

The beauty of using a pipeline is that it handles all the preprocessing automatically. When we call fit() on the pipeline, it first applies the preprocessing to transform our mixed data types, then trains the linear regression model on the processed data. When we make predictions, it applies the same preprocessing to new data before generating predictions.

This approach ensures consistency between training and prediction, eliminates the risk of data leakage (where information from the test set accidentally influences preprocessing), and makes our code much cleaner and more maintainable. In a real business environment like PredictHealth, this reproducible workflow is essential for reliable model deployment.

Training and Evaluating the Enhanced Model

Let's train our comprehensive model and see how it performs compared to our previous numerical-only model:
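
Here is one way that training and evaluation step might look, assuming the dataset lives in a file named insurance.csv with a charges column as the target (both names are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Assumed file and column names; adjust to match your copy of the dataset
df = pd.read_csv("insurance.csv")
X = df[["age", "bmi", "sex", "smoker", "region"]]
y = df["charges"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model.fit(X_train, y_train)      # preprocessing and training in one call
y_pred = model.predict(X_test)   # same preprocessing applied automatically

r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"R²: {r2:.2f}")           # roughly 0.75 on this dataset
print(f"RMSE: {rmse:,.2f}")
```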

Compare this to our previous model that used only age and BMI, which had an R² of about 0.15. The improvement to 0.75 means our model now explains 75% of the variation in insurance costs, compared to just 15% before. That is a dramatic improvement, and in an industry where even small gains in prediction accuracy translate to significant business value, a jump of this size matters enormously.

The lower RMSE also indicates that our predictions are more accurate on average. This enhanced accuracy comes from capturing the important categorical factors that influence insurance costs — factors that our numerical-only model completely missed.

Once you've trained and validated your model, it's crucial to save it for future use. The joblib.dump() function saves the entire pipeline — including both the preprocessing steps and the trained regression model — into a single file. This means PredictHealth can load this saved model later to make predictions on new customers without having to retrain from scratch. To load the model later, simply use loaded_model = joblib.load('insurance_cost_model.pkl').
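
A minimal sketch of that round trip, using the filename mentioned above:

```python
import joblib

# Persist the entire pipeline: preprocessing steps plus the trained model
joblib.dump(model, "insurance_cost_model.pkl")

# Later, in a pricing application, restore it without retraining
loaded_model = joblib.load("insurance_cost_model.pkl")
```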

Interpreting Categorical Variable Coefficients

Understanding what our model learned requires us to examine the coefficients, but with categorical variables, this becomes more complex. After one-hot encoding, we have more features than we started with, and we need to extract their names properly.
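
One way to pull out the expanded feature names and pair them with their coefficients, using the step names from the pipeline sketched earlier:

```python
# The fitted preprocessor knows the names of the columns it produced.
# ColumnTransformer prefixes each with its transformer name,
# e.g. "num__age" and "cat__smoker_yes".
feature_names = model.named_steps["preprocessor"].get_feature_names_out()
coefficients = model.named_steps["regressor"].coef_

for name, coef in zip(feature_names, coefficients):
    print(f"{name}: {coef:,.2f}")
```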

The categorical coefficients tell us fascinating stories about insurance costs. The smoker_yes coefficient of 23,615.96 means that being a smoker increases insurance costs by approximately $23,616 compared to being a non-smoker, all else being equal. This massive impact demonstrates why smoking status is such a critical factor in insurance pricing.

The sex_male coefficient of 131.31 suggests that being male increases costs by about $131 compared to being female. The regional coefficients are all negative, which means that compared to the baseline region (northeast, which was dropped), all other regions have lower insurance costs.

Making Predictions with Customer Categories

Let's see how our comprehensive model works with a real customer example:
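
A sketch of that prediction step with a hypothetical smoker (the specific customer values are illustrative):

```python
import pandas as pd

# A hypothetical new customer, in the same raw format as the training data;
# the pipeline handles all the encoding internally
new_customer = pd.DataFrame([{
    "age": 45,
    "bmi": 32.5,
    "sex": "male",
    "smoker": "yes",
    "region": "southeast",
}])

predicted_cost = model.predict(new_customer)[0]
print(f"Predicted insurance cost: ${predicted_cost:,.2f}")
```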

This prediction demonstrates the power of our comprehensive model. The high predicted cost of nearly $40,000 is driven primarily by the smoking status, which adds over $23,000 to the base cost. The model automatically handles all the categorical encoding behind the scenes — we simply provide the customer information in its natural format, and the pipeline takes care of the rest.

Summary and Practice Preparation

In this lesson, you've made a significant leap forward in your modeling capabilities. You started with a simple numerical model and transformed it into a comprehensive insurance pricing model that incorporates the most important categorical factors: smoking status, gender, and region.

You learned how to handle the technical challenges of categorical data through one-hot encoding, how to build robust preprocessing pipelines that handle mixed data types, and how to interpret the business meaning of categorical coefficients. Most importantly, you saw how adding these categorical variables improved your model's predictive power and business relevance.

The techniques you've mastered — ColumnTransformer, Pipeline, and OneHotEncoder — are fundamental tools in the data scientist's toolkit. These same approaches work whether you're predicting insurance costs, house prices, or any other business outcome that involves both numerical and categorical factors.

You're now ready to practice these skills with hands-on exercises, where you'll apply categorical variable modeling to different scenarios and datasets. In the upcoming practice problems, you'll get to experiment with different categorical encodings, compare model performance, and make business recommendations based on your enhanced models. Great work on completing this important lesson in your journey toward mastering multi-factor cost models!
