Introduction to Baseline Models

Now that your data is clean and properly formatted, it's time to build your first machine learning models. In this lesson, we'll focus on creating baseline models—simple models that serve as a point of comparison for more complex models you might build later.

We'll implement two different types of baseline models: Linear Regression and LightGBM. By comparing these two approaches, you'll see how different algorithms handle the same data and which might be more suitable for your specific problem.

Let's begin by preparing our data for modeling!

What is a Baseline Model?

A baseline model is a simple model that helps you understand the minimum level of performance you should expect. It provides a benchmark against which to measure improvements as you try more advanced models.

Model Evaluation Metric: RMSE

To evaluate our models, we'll use the Root Mean Squared Error (RMSE), a common metric for regression problems. RMSE measures the average magnitude of the errors in our predictions, with higher weight given to larger errors. Mathematically, it's defined as:

\text{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} ( \hat{y}_i - y_i )^2 }
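The formula above translates directly into a few lines of NumPy. Here is a minimal sanity check with small, made-up arrays (the values are illustrative only):

```python
import numpy as np

# Illustrative arrays; in practice these come from your model and test set
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# RMSE: square the errors, average them, then take the square root
rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
print(rmse)
```

Because the errors are squared before averaging, a single large miss raises RMSE more than several small ones, which is exactly the "higher weight to larger errors" behavior described above.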

Preparing Preprocessed Data for Modeling

In the previous lesson, we cleaned our data by handling missing values and encoding categorical features. Now, we need to organize this preprocessed data into the format required for training machine learning models.

For supervised learning tasks like regression, we need to separate our data into:

  • Features (X): The input variables our model will use to make predictions
  • Target (y): The variable we're trying to predict

Let's start by loading our preprocessed data:
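A sketch of this step follows. In the lesson's environment the helper comes from the scripts module (`from scripts import preprocess`); since that module isn't reproduced here, a stand-in with the same general behavior is inlined, and the file path and column names are illustrative:

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for `from scripts import preprocess`: fills missing numeric
    # values with the column median and integer-encodes categorical columns.
    df = df.copy()
    for col in df.columns:
        if df[col].dtype == object:
            df[col] = df[col].astype("category").cat.codes
        else:
            df[col] = df[col].fillna(df[col].median())
    return df

# Illustrative rows standing in for pd.read_csv("data/train.csv")
raw = pd.DataFrame({
    "id": [1, 2, 3],
    "Genre": ["News", "Comedy", None],
    "Episode_Length_minutes": [30.0, None, 45.0],
})
df = preprocess(raw)
print(df)
```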

Notice that we're using the preprocess function from our scripts module, which encapsulates all the preprocessing steps we learned in the previous lesson. This is good practice, as it keeps our code organized and reusable.

Now, let's prepare our features and target variables:
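A minimal sketch of the split, including a held-out test set for later evaluation. The small DataFrame and the target column name (`Listening_Time_minutes`) are assumptions for illustration; in the lesson, df is the preprocessed dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small illustrative frame standing in for the preprocessed dataset
df = pd.DataFrame({
    "id": range(6),
    "Episode_Length_minutes": [30.0, 45.0, 60.0, 20.0, 50.0, 40.0],
    "Host_Popularity_percentage": [70.0, 55.0, 80.0, 40.0, 65.0, 60.0],
    "Listening_Time_minutes": [25.0, 30.0, 50.0, 15.0, 42.0, 33.0],  # assumed target column
})

X = df.drop(columns=["id", "Listening_Time_minutes"])  # features
y = df["Listening_Time_minutes"]                       # target

# Hold out a test set so both models are evaluated on unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```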

In this code, we:

  1. Create our feature matrix X by dropping the id column (which isn't useful for prediction) and the target column
  2. Create our target vector y from the column we want to predict

Building a Linear Regression Baseline

Linear Regression is one of the simplest and most interpretable machine learning algorithms. It models the relationship between features and the target variable as a linear equation:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \epsilon
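A minimal Linear Regression baseline with scikit-learn might look like the sketch below. The data here is synthetic so the example is self-contained; in the lesson you would use the X_train/X_test split prepared above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic, nearly linear data standing in for the real dataset
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the linear model and evaluate with RMSE on the held-out test set
lr = LinearRegression().fit(X_train, y_train)
preds = lr.predict(X_test)
rmse = np.sqrt(np.mean((preds - y_test) ** 2))
print(f"Linear Regression RMSE: {rmse:.4f}")
```

Because the synthetic data really is linear plus a little noise, the model fits it well; on real data the residual error will be larger.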

Building a LightGBM Baseline

LightGBM (Light Gradient Boosting Machine) is a gradient boosting framework that uses tree-based learning algorithms. Unlike Linear Regression, which models relationships as linear equations, tree-based models can capture non-linear patterns and interactions between features.

Gradient boosting works by building an ensemble of decision trees sequentially, with each tree correcting the errors made by the previous ones. This approach often results in more accurate predictions, especially for complex datasets.

Let's implement a LightGBM model and evaluate it on the test set:

This code follows a similar pattern to our Linear Regression implementation:

  1. Imports the lightgbm library
  2. Creates a LightGBM regressor and fits it to our training data
  3. Makes predictions on the test data (X_test)
  4. Calculates the RMSE on the test set (y_test)

When you run this code, you might see output similar to:

Notice that LightGBM provides some additional information about its training process. The most important part is the RMSE value, which in this example is actually higher than what we achieved with Linear Regression. That tells us something important: a more complex model does not automatically perform better. Model performance is data-dependent, and sometimes a simpler model can be the stronger baseline.

Comparing Model Performance

To make it easier to compare our models, let's create a DataFrame that shows their RMSE values side by side (on the test set):
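A minimal version of that table might look like this. The two RMSE numbers below are placeholders purely to show the table's shape; substitute the values your own models produced:

```python
import pandas as pd

# Placeholder values standing in for the RMSEs computed earlier (illustrative only)
lr_rmse, lgbm_rmse = 0.95, 1.02

results = pd.DataFrame(
    {"RMSE": [lr_rmse, lgbm_rmse]},
    index=["Linear Regression", "LightGBM"],
)
print(results)
```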

This code creates a DataFrame with model names as the index and RMSE values as a column. When you run it, you might see output similar to:

This comparison shows that, for this particular example, Linear Regression performs better than LightGBM on the test set because it achieves the lower RMSE. That does not mean Linear Regression is always the better choice. On a different dataset, or after tuning, LightGBM may outperform it. The key lesson is to compare models using the same evaluation metric on the same test data rather than assuming the more complex model will always win.

Beyond just comparing RMSE values, it's also valuable to understand which features are driving our predictions. LightGBM provides a feature_importances_ attribute that tells us how much each feature contributes to the model's predictions:

This code:

  1. Creates a DataFrame with feature names and their importance scores
  2. Sorts the DataFrame by importance in descending order
  3. Displays the top 5 most important features

When you run this code, you might see output similar to:

Summary

In this lesson, you learned how to build and evaluate baseline regression models:

  1. Prepared preprocessed data by separating features and target variables.
  2. Built and evaluated a Linear Regression model using RMSE on the test set.
  3. Built and evaluated a LightGBM model using the same metric on the test set.
  4. Compared both models using test-set RMSE and saw that the better baseline depends on the data rather than the model's complexity alone.
  5. Used LightGBM’s feature importance to identify the most influential features, such as Episode_Length_minutes and Host_Popularity_percentage.

In the upcoming practice exercises, you’ll apply these steps to new datasets: building baseline models, comparing their RMSE on the test set, and analyzing feature importance. This will reinforce your understanding of baseline modeling and prepare you for more advanced techniques.
