Building Your First Insurance Cost Prediction Model with Simple Linear Regression

Introduction & Lesson Overview

Welcome back! In the last lesson, you learned how to identify which factors most strongly influence insurance costs in the PredictHealth dataset. You explored correlations, visualized relationships, and compared group averages to spot the most important predictors of insurance charges. Now, you are ready to take the next step: building your very first cost prediction model.

In this lesson, you will learn how to use simple linear regression to predict insurance charges based on a single feature: age. By the end of this lesson, you will know how to prepare your data for modeling, train a regression model, evaluate its performance, visualize the results, and use the model to make predictions for new customers. This is a key skill in data science and will help you understand how predictive models work in real-world insurance scenarios.

Understanding Simple Linear Regression

Simple linear regression is a basic but powerful tool in data science. It helps you predict a numeric outcome (like insurance charges) using just one input feature (like age). The model fits a straight line to the data, which can be described by the equation:

$y = \text{slope} \cdot x + \text{intercept}$

Here, y is the predicted value (insurance charges), x is the input feature (age), the slope shows how much y changes for each unit increase in x, and the intercept is the value of y when x is zero.
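To make the equation concrete, here is a small worked example. The slope and intercept values are hypothetical, chosen only to illustrate the arithmetic; they are not the coefficients of the PredictHealth model.

```python
def predict_charges(age: float, slope: float, intercept: float) -> float:
    """Evaluate the line y = slope * x + intercept for a single age."""
    return slope * age + intercept


# Hypothetical coefficients, for illustration only.
slope = 250.0       # change in charges for each additional year of age
intercept = 3000.0  # value of the line at age 0

# 250 * 45 + 3000 = 14250
print(predict_charges(45, slope, intercept))  # 14250.0
```

Note that the intercept is just where the line crosses age 0. It does not need to be a realistic charge, since no customer in an insurance dataset is that age.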

In the context of PredictHealth, you will use age as the predictor. This means you are asking, "If I know a customer's age, how much can I expect their insurance cost to be?" The regression model will find the best-fitting line through the data so you can make these predictions.

Data Preparation For Regression

Before building a prediction model, it is important to select the right variables. In previous lessons, you discovered that age is one of the features most strongly related to insurance charges. For this first regression model, you will use age as the input (feature) and charges as the output (target) you want to predict.

To prepare the data, you need to create two variables: X for the feature and y for the target. In this case, X will be a DataFrame containing the age column, and y will be a Series containing the charges column. Here is how you can do this:
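A minimal sketch of this step is shown below. It assumes the data is already loaded into a pandas DataFrame named df with age and charges columns. The synthetic data generated here is only a stand-in so the example runs on its own; it is not the PredictHealth dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (an assumption, NOT the PredictHealth dataset).
rng = np.random.default_rng(0)
age = rng.integers(18, 65, size=200)
charges = 3000 + 250 * age + rng.normal(0, 4000, size=200)
df = pd.DataFrame({"age": age, "charges": charges})

# Feature: double brackets return a DataFrame (2-D) with one column.
X = df[["age"]]

# Target: single brackets return a Series (1-D).
y = df["charges"]

print(X.shape, y.shape)  # (200, 1) (200,)
```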

Here, the age column is the feature and the charges column is the target. Note the double square brackets around the column name when selecting X: they keep X as a two-dimensional DataFrame, which is the shape scikit-learn expects for features, whereas single brackets would return a one-dimensional Series. This step sets up your data for the regression model.

Splitting The Data & Model Training

To build a reliable model, you need to test how well it works on new, unseen data. This is why you split your data into two parts: a training set and a testing set. The training set is used to teach the model, and the testing set is used to check how well the model predicts new data.

You can use the train_test_split function from scikit-learn to do this. In the example below, 80% of the data is used for training and 20% for testing. Setting random_state=42 ensures you get the same split every time you run the code, which is helpful for reproducibility.
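The split described above might look like the following sketch. The DataFrame df is again a synthetic stand-in (an assumption), so the row counts will differ from those of the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, NOT the PredictHealth dataset).
rng = np.random.default_rng(0)
age = rng.integers(18, 65, size=200)
charges = 3000 + 250 * age + rng.normal(0, 4000, size=200)
df = pd.DataFrame({"age": age, "charges": charges})

X = df[["age"]]
y = df["charges"]

# Hold out 20% of the rows for testing; fix the seed for a repeatable split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))  # 160 40
```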

Now that you have your training and testing sets, you can create and train the linear regression model. The LinearRegression class from scikit-learn makes this easy. You fit the model using the training data:
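A minimal training sketch, assuming the same setup as before. The data is a synthetic stand-in rather than the PredictHealth dataset, so the fitted values are for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, NOT the PredictHealth dataset).
rng = np.random.default_rng(0)
age = rng.integers(18, 65, size=200)
charges = 3000 + 250 * age + rng.normal(0, 4000, size=200)
df = pd.DataFrame({"age": age, "charges": charges})

X_train, X_test, y_train, y_test = train_test_split(
    df[["age"]], df["charges"], test_size=0.2, random_state=42
)

# Create the model and fit it on the training data only.
model = LinearRegression()
model.fit(X_train, y_train)

# After fitting, the learned slope and intercept are stored on the model.
print(model.coef_[0], model.intercept_)
```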

When you call fit, the model finds the slope and intercept that minimize the sum of squared differences between the actual and predicted charges in the training set, a method known as ordinary least squares. This process is called training the model. It is important because it allows the model to "learn" the relationship between age and insurance charges.
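To see what fit is doing, the sketch below computes the least-squares slope and intercept by hand and compares them with the values scikit-learn stores in coef_ and intercept_. The data is a synthetic stand-in (an assumption), and the variable names are chosen for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, NOT the PredictHealth dataset).
rng = np.random.default_rng(0)
age = rng.integers(18, 65, size=200)
charges = 3000 + 250 * age + rng.normal(0, 4000, size=200)
df = pd.DataFrame({"age": age, "charges": charges})

X_train, X_test, y_train, y_test = train_test_split(
    df[["age"]], df["charges"], test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# Closed-form least-squares solution for a single feature.
x = X_train["age"].to_numpy(dtype=float)
t = y_train.to_numpy()
slope = np.sum((x - x.mean()) * (t - t.mean())) / np.sum((x - x.mean()) ** 2)
intercept = t.mean() - slope * x.mean()

# The fitted line, written out as an equation.
print(f"charges = {slope:.2f} * age + {intercept:.2f}")

# Both comparisons should print True.
print(np.isclose(slope, model.coef_[0]))
print(np.isclose(intercept, model.intercept_))
```

The slope is read as the expected change in charges for each additional year of age, and the intercept as the value of the line at age zero.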

Evaluating Model Performance

After training the model, it is important to check how well it predicts insurance charges on the test data. You can do this by making predictions and then comparing them to the actual values.

First, use the model to predict charges for the test set:
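A sketch of the prediction step, assuming a model trained as described earlier. The data is a synthetic stand-in, so the printed values will not match the lesson's.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, NOT the PredictHealth dataset).
rng = np.random.default_rng(0)
age = rng.integers(18, 65, size=200)
charges = 3000 + 250 * age + rng.normal(0, 4000, size=200)
df = pd.DataFrame({"age": age, "charges": charges})

X_train, X_test, y_train, y_test = train_test_split(
    df[["age"]], df["charges"], test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# Predict charges for customers the model has not seen during training.
y_pred = model.predict(X_test)

# One prediction per test row, returned as a NumPy array.
print(y_pred.shape)  # (40,)
print(y_pred[:5])
```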

You can use several metrics to measure performance:

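Mean Squared Error (MSE): The average squared difference between actual and predicted values. $\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

Root Mean Squared Error (RMSE): The square root of the MSE, expressed in the same units as the charges. $\text{RMSE} = \sqrt{\text{MSE}}$

R² (coefficient of determination): The share of the variance in charges that the model explains. $R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$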
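These metrics can be computed with scikit-learn's metrics module, as in the sketch below. The data is a synthetic stand-in (an assumption), so the numbers it prints say nothing about how the model performs on the PredictHealth data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, NOT the PredictHealth dataset).
rng = np.random.default_rng(0)
age = rng.integers(18, 65, size=200)
charges = 3000 + 250 * age + rng.normal(0, 4000, size=200)
df = pd.DataFrame({"age": age, "charges": charges})

X_train, X_test, y_train, y_test = train_test_split(
    df[["age"]], df["charges"], test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# MSE: average squared error, in squared currency units.
mse = mean_squared_error(y_test, y_pred)

# RMSE: square root of MSE, back in the same units as charges.
rmse = np.sqrt(mse)

# R²: share of the variance in charges explained by the model (at most 1).
r2 = r2_score(y_test, y_pred)

print(f"MSE:  {mse:,.2f}")
print(f"RMSE: {rmse:,.2f}")
print(f"R²:   {r2:.3f}")
```

RMSE is often the easiest of the three to interpret, because it reads as a typical prediction error in the same units as the charges.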

Visualizing The Regression Line

Visualizing your model’s predictions helps you understand how well it fits the data. A common way to do this is to plot the actual charges versus age as a scatter plot, and then add the regression line showing the predicted charges.

Here is how you can create this plot:
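One way to draw such a plot with matplotlib is sketched below. The data is a synthetic stand-in (an assumption), and the labels and title are choices made for this example.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, NOT the PredictHealth dataset).
rng = np.random.default_rng(0)
age = rng.integers(18, 65, size=200)
charges = 3000 + 250 * age + rng.normal(0, 4000, size=200)
df = pd.DataFrame({"age": age, "charges": charges})

X_train, X_test, y_train, y_test = train_test_split(
    df[["age"]], df["charges"], test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

fig, ax = plt.subplots(figsize=(8, 5))

# Blue dots: actual charges for each test customer.
ax.scatter(X_test["age"], y_test, color="blue", alpha=0.6, label="Actual charges")

# Red line: predicted charges. All predictions lie on one straight line,
# so the points do not need to be sorted by age before plotting.
ax.plot(X_test["age"], y_pred, color="red", label="Regression line")

ax.set_xlabel("Age")
ax.set_ylabel("Charges")
ax.set_title("Age vs. insurance charges (test set)")
ax.legend()

# In a script, call plt.show() to display the figure;
# notebooks usually render it automatically.
```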

In this plot, each blue dot represents an actual customer from the test set, showing their age and insurance charges. The red line shows the model’s predicted charges for each age. If the line fits the points well, it means the model is making good predictions. In this case, you may notice that the points are quite spread out, which matches the low R² value you saw earlier. This tells you that while age is related to charges, there are other important factors not included in this simple model.

Making Predictions For New Data

One of the most useful things about a regression model is that you can use it to predict insurance costs for new customers. For example, suppose you want to estimate the insurance cost for a 45-year-old customer. You can use the trained model to make this prediction:
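A sketch of this prediction is shown below. Because the model here is trained on synthetic stand-in data (an assumption), the printed amount will differ from the figure the lesson's dataset produces.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, NOT the PredictHealth dataset).
rng = np.random.default_rng(0)
age = rng.integers(18, 65, size=200)
charges = 3000 + 250 * age + rng.normal(0, 4000, size=200)
df = pd.DataFrame({"age": age, "charges": charges})

X_train, X_test, y_train, y_test = train_test_split(
    df[["age"]], df["charges"], test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# New data must have the same column name and 2-D shape used in training.
new_customer = pd.DataFrame({"age": [45]})

# predict returns an array; take the single value for our one customer.
predicted = model.predict(new_customer)[0]

print(f"Predicted charges for a 45-year-old: ${predicted:,.2f}")
```

Passing a DataFrame with the same age column keeps the input consistent with what the model saw during training.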

With the lesson's data, the prediction comes out to about $14,762.29. This means that, based on your model, a 45-year-old customer would be expected to pay roughly that amount in insurance charges. Because the model uses age alone, treat this figure as a rough estimate. Even so, it is a simple but powerful way to use data to make informed decisions in real-world scenarios.

Summary & Preparation For Practice Exercises

In this lesson, you learned how to build your first cost prediction model using simple linear regression. You prepared the data by selecting age as the feature and charges as the target, split the data into training and testing sets, and trained a linear regression model. You evaluated the model’s performance using MSE, RMSE, and R², and learned how to interpret the model’s coefficients and equation. You also visualized the regression line and made predictions for new customers.

These are essential skills for any data scientist or analyst working with predictive models. In the next set of practice exercises, you will get hands-on experience with these steps. Try preparing the data, training the model, evaluating its performance, and making predictions on your own. Each time you practice, you will become more confident in building and applying regression models to real-world data.

Congratulations on reaching this important milestone in your learning journey! Keep up the great work, and get ready to put your new skills into practice.
