Welcome back! In the previous lessons, you learned how to explore the California housing dataset, engineer new features, handle outliers, and split your data into training and testing sets. These are essential steps that set the foundation for building reliable machine learning models.
In this lesson, we will take the next big step: training your first machine learning model using the preprocessed data. By the end of this lesson, you will know how to build, evaluate, and save a simple regression model using Scikit-Learn. This is a key milestone, as it marks your transition from preparing data to actually making predictions. The skills you learn here will be useful in many real-world projects and will prepare you for more advanced modeling techniques in the future.
Before we start building a model, let's quickly review the data you have prepared so far. In the last lesson, you created new features such as RoomsPerHousehold, capped extreme values to handle outliers, and split your data into training and testing sets. You then saved these processed datasets as CSV files, which makes them easy to load for modeling.
The training data contains 16,512 samples, and the test data contains 4,128 samples. Each sample includes the following features:

- MedInc
- HouseAge
- AveRooms
- AveBedrms
- Population
- AveOccup
- Latitude
- Longitude
- RoomsPerHousehold
The target variable we want to predict is MedHouseVal, which represents the median house value in each district. By keeping your features and target variable organized, you are now ready to train a model that can learn from this data.
The first step in training our model is to load the preprocessed training data. On CodeSignal, all necessary libraries are pre-installed, so you can focus on the code itself. Here is how we can load the data and prepare it for training:
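A minimal sketch of this step might look like the following. The filename `train_data.csv` is a hypothetical stand-in for whatever name you gave the processed CSV in the previous lesson, and the small DataFrame written at the top exists only so the example is self-contained; in the course environment you would load your saved file directly.

```python
import pandas as pd

# Feature columns from the previous lesson
FEATURES = ["MedInc", "HouseAge", "AveRooms", "AveBedrms", "Population",
            "AveOccup", "Latitude", "Longitude", "RoomsPerHousehold"]

# Stand-in for the processed training CSV saved in the previous lesson
sample = pd.DataFrame(
    [[8.3, 41.0, 6.98, 1.02, 322.0, 2.56, 37.88, -122.23, 6.98, 4.526],
     [8.3, 21.0, 6.24, 0.97, 2401.0, 2.11, 37.86, -122.22, 6.24, 3.585]],
    columns=FEATURES + ["MedHouseVal"],
)
sample.to_csv("train_data.csv", index=False)  # hypothetical filename

# Load the preprocessed training data
train_df = pd.read_csv("train_data.csv")

# Separate the features (X_train) from the target variable (y_train)
X_train = train_df.drop(columns=["MedHouseVal"])
y_train = train_df["MedHouseVal"]

print(f"Training samples: {len(X_train)}")
print(f"Features: {list(X_train.columns)}")
```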
In this code, we use pandas to load the training data from a CSV file. We then separate the features (X_train) from the target variable (y_train). The features are all the columns except MedHouseVal, which is our target variable that we want to predict.
When we run this code, we will see output confirming the number of training samples (16,512) and the nine features being used, which confirms that our data is loaded correctly and ready for training.
Now that our data is loaded, we need to create a machine learning model. In this course, we will use linear regression because its simplicity keeps the workflow manageable, especially when we later work with tools like SageMaker, while still allowing us to learn the core concepts of model training and prediction.
Linear regression is one of the most fundamental algorithms for predicting a continuous target variable like house value, and the concepts you learn here—such as fitting a model and interpreting coefficients—will apply to more complex models and platforms as well.
The Scikit-Learn library makes it easy for us to work with a linear regression model. Here is how we create a linear regression model using Scikit-Learn:
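Creating the model is a sketch of just two lines: import the class and instantiate it.

```python
from sklearn.linear_model import LinearRegression

# Create an (as yet untrained) linear regression model
model = LinearRegression()
print(model)
```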
At this point, we have created an instance of the LinearRegression class, but the model hasn't learned anything yet. The actual learning happens in the next step, when we fit the model to our training data.
Now comes the crucial step: fitting (or training) our model on the training data. This is where the model actually learns the relationship between the features and the target variable. During fitting, the linear regression algorithm finds the best coefficients (weights) for each feature that minimize the prediction error.
Here is how we fit our model to the training data:
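The call itself is a single line. In the sketch below, the small synthetic `X_train` and `y_train` are stand-ins so the example runs on its own; in the lesson they come from the preprocessed CSV loaded earlier.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in training data (the lesson uses the loaded housing features)
rng = np.random.default_rng(42)
X_train = rng.normal(size=(100, 3))
y_train = 2.0 * X_train[:, 0] - 1.0 * X_train[:, 1] + 0.5 \
    + rng.normal(scale=0.1, size=100)

model = LinearRegression()
model.fit(X_train, y_train)  # the model learns one weight per feature

# After fitting, the learned coefficients and intercept can be inspected
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
```

Inspecting `model.coef_` like this is how you would later interpret which features push predicted house values up or down.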
The fit method takes two arguments: the features (X_train) and the target values (y_train). During this process, the algorithm analyzes the training data to learn patterns and relationships. Once fitting is complete, our model is ready to make predictions on new data.
After training our model, it is important to evaluate how well it fits the training data. This helps us understand whether our model is learning useful patterns or if it might be underfitting or overfitting.
Two common metrics for regression problems are Mean Squared Error (MSE) and R² (R-squared). MSE measures the average squared difference between the predicted and actual values; a lower MSE means better performance. R² measures how much of the variation in the target variable is explained by the model; an R² value closer to 1 means the model explains more of the variance.
Here is how we can calculate these metrics using Scikit-Learn:
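A sketch of the evaluation step is below. As before, the synthetic `X_train` and `y_train` are stand-ins so the snippet runs on its own; with the real housing data you would reuse the arrays loaded earlier, and the numbers printed would differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Stand-in data (the lesson uses the preprocessed housing data)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))
y_train = 3.0 * X_train[:, 0] + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X_train, y_train)

# Predict on the training data and compute both metrics
y_pred = model.predict(X_train)
mse = mean_squared_error(y_train, y_pred)
r2 = r2_score(y_train, y_pred)

print(f"Training MSE: {mse:.2f}")
print(f"Training R²: {r2:.4f}")
```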
When we run this code on the full training set, we might see a training MSE of about 0.44 and a training R² of about 0.6707. This tells us that the average squared error of our model's predictions is 0.44, and that about 67% of the variance in house values is explained by our model. These numbers give us a baseline for how well our model is performing on the training data. In future lessons, we will learn how to evaluate our model on the test set to see how well it generalizes to new data.
Once we have trained and evaluated our model, it is good practice to save it for future use. This allows us to reuse the trained model without having to retrain it every time, which can save time and ensure consistency in our results.
We can use the joblib library to save our model to a file. Here is how we can do it:
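The saving step is a single `joblib.dump` call. The sketch below first trains a small model on stand-in data so it is self-contained, then saves it and loads it back to show the round trip:

```python
import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Train a small model on stand-in data so the example runs on its own
rng = np.random.default_rng(1)
X_train = rng.normal(size=(50, 2))
y_train = X_train @ np.array([1.5, -2.0]) + 0.3

model = LinearRegression().fit(X_train, y_train)

# Save the trained model to disk
joblib.dump(model, "trained_model.joblib")

# Later (even in a different script) the model can be loaded and reused
loaded_model = joblib.load("trained_model.joblib")
print(loaded_model.predict(X_train[:1]))
```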
This code saves our trained model to a file called trained_model.joblib. Later, we can load this file and use the model to make predictions on new data without retraining. Saving models is especially important in real-world projects, where we may want to deploy our model to a web application or share it with others.
In this lesson, we learned how to train a machine learning model using the preprocessed California housing data. We loaded the training data, created a simple linear regression model, fitted it to our data, evaluated its performance using MSE and R², and saved the trained model for future use. These are essential skills for any machine learning practitioner and form the basis for more advanced modeling techniques.
You are now ready to move on to practice exercises, where you will apply what you have learned and gain hands-on experience. Keep up the great work! Each step you take brings you closer to building more accurate and powerful models.
