Introduction to Baseline Models

Now that your data is clean and properly formatted, it's time to build your first machine learning models. In this lesson, we'll focus on creating baseline models—simple models that serve as a point of comparison for more complex models you might build later.

We'll implement two different types of baseline models: Linear Regression and LightGBM. By comparing these two approaches, you'll see how different algorithms handle the same data and which might be more suitable for your specific problem.

Let's begin by preparing our data for modeling!

What is a Baseline Model?

A baseline model is a simple model that helps you understand the minimum level of performance you should expect. It provides a benchmark against which to measure improvements as you try more advanced models.

Model Evaluation Metric: RMSE

To evaluate our models, we'll use the Root Mean Squared Error (RMSE), a common metric for regression problems. RMSE measures the average magnitude of the errors in our predictions, with higher weight given to larger errors. Mathematically, it's defined as:

\text{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} ( \hat{y}_i - y_i )^2 }
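The formula above translates directly into a few lines of NumPy. Here is a minimal sanity check with small, made-up arrays (the values are illustrative only):

```python
import numpy as np

# Illustrative arrays; in practice these come from your model and test set
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# RMSE: square the errors, average them, then take the square root
rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
print(rmse)
```

Because the errors are squared before averaging, a single large miss raises RMSE more than several small ones, which is exactly the "higher weight to larger errors" behavior described above.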

Preparing Preprocessed Data for Modeling

In the previous lesson, we cleaned our data by handling missing values and encoding categorical features. Now, we need to organize this preprocessed data into the format required for training machine learning models.

For supervised learning tasks like regression, we need to separate our data into:

  • Features (X): The input variables our model will use to make predictions
  • Target (y): The variable we're trying to predict

Let's start by loading our preprocessed data:
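A sketch of this step follows. In the lesson's environment the helper comes from the scripts module (`from scripts import preprocess`); since that module isn't reproduced here, a stand-in with the same general behavior is inlined, and the file path and column names are illustrative:

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for `from scripts import preprocess`: fills missing numeric
    # values with the column median and integer-encodes categorical columns.
    df = df.copy()
    for col in df.columns:
        if df[col].dtype == object:
            df[col] = df[col].astype("category").cat.codes
        else:
            df[col] = df[col].fillna(df[col].median())
    return df

# Illustrative rows standing in for pd.read_csv("data/train.csv")
raw = pd.DataFrame({
    "id": [1, 2, 3],
    "Genre": ["News", "Comedy", None],
    "Episode_Length_minutes": [30.0, None, 45.0],
})
df = preprocess(raw)
print(df)
```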

Notice that we're using the preprocess function from our scripts module, which encapsulates all the preprocessing steps we learned in the previous lesson. This is good practice, as it keeps our code organized and reusable.

Now, let's prepare our features and target variables:
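A minimal sketch of the split, including a held-out test set for later evaluation. The small DataFrame and the target column name (`Listening_Time_minutes`) are assumptions for illustration; in the lesson, df is the preprocessed dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Small illustrative frame standing in for the preprocessed dataset
df = pd.DataFrame({
    "id": range(6),
    "Episode_Length_minutes": [30.0, 45.0, 60.0, 20.0, 50.0, 40.0],
    "Host_Popularity_percentage": [70.0, 55.0, 80.0, 40.0, 65.0, 60.0],
    "Listening_Time_minutes": [25.0, 30.0, 50.0, 15.0, 42.0, 33.0],  # assumed target column
})

X = df.drop(columns=["id", "Listening_Time_minutes"])  # features
y = df["Listening_Time_minutes"]                       # target

# Hold out a test set so both models are evaluated on unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```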

In this code, we:

  1. Create our feature matrix X by dropping the id column (which isn't useful for prediction) and the target column
  2. Create our target vector y from the column we want to predict

Building a Linear Regression Baseline

Linear Regression is one of the simplest and most interpretable machine learning algorithms. It models the relationship between features and the target variable as a linear equation:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \epsilon
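A minimal Linear Regression baseline with scikit-learn might look like the sketch below. The data here is synthetic so the example is self-contained; in the lesson you would use the X_train/X_test split prepared above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic, nearly linear data standing in for the real dataset
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the linear model and evaluate with RMSE on the held-out test set
lr = LinearRegression().fit(X_train, y_train)
preds = lr.predict(X_test)
rmse = np.sqrt(np.mean((preds - y_test) ** 2))
print(f"Linear Regression RMSE: {rmse:.4f}")
```

Because the synthetic data really is linear plus a little noise, the model fits it well; on real data the residual error will be larger.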

Building a LightGBM Baseline

LightGBM (Light Gradient Boosting Machine) is a gradient boosting framework that uses tree-based learning algorithms. Unlike Linear Regression, which models relationships as linear equations, tree-based models can capture non-linear patterns and interactions between features.

Gradient boosting works by building an ensemble of decision trees sequentially, with each tree correcting the errors made by the previous ones. This approach often results in more accurate predictions, especially for complex datasets.

Let's implement a LightGBM model and evaluate it on the test set:

This code follows a similar pattern to our Linear Regression implementation:

  1. Imports the lightgbm library
  2. Creates a LightGBM regressor and fits it to our training data
  3. Makes predictions on the test data (X_test)
  4. Calculates the RMSE on the test set (y_test)

When you run this code, you might see output similar to:

Notice that LightGBM provides some additional information about its training process. The most important part is the RMSE value, which in this example is actually higher than what we achieved with Linear Regression. That tells us something important: a more complex model does not automatically perform better. Model performance is data-dependent, and sometimes a simpler model can be the stronger baseline.

Comparing Model Performance

To make it easier to compare our models, let's create a DataFrame that shows their RMSE values side by side (on the test set):
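A minimal version of that table might look like this. The two RMSE numbers below are placeholders purely to show the table's shape; substitute the values your own models produced:

```python
import pandas as pd

# Placeholder values standing in for the RMSEs computed earlier (illustrative only)
lr_rmse, lgbm_rmse = 0.95, 1.02

results = pd.DataFrame(
    {"RMSE": [lr_rmse, lgbm_rmse]},
    index=["Linear Regression", "LightGBM"],
)
print(results)
```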

This code creates a DataFrame with model names as the index and RMSE values as a column. When you run it, you might see output similar to:

This comparison shows that, for this particular example, Linear Regression performs better than LightGBM on the test set because it achieves the lower RMSE. That does not mean Linear Regression is always the better choice. On a different dataset, or after tuning, LightGBM may outperform it. The key lesson is to compare models using the same evaluation metric on the same test data rather than assuming the more complex model will always win.

Beyond just comparing RMSE values, it's also valuable to understand which features are driving our predictions. LightGBM provides a feature_importances_ attribute that tells us how much each feature contributes to the model's predictions:

This code:

  1. Creates a DataFrame with feature names and their importance scores
  2. Sorts the DataFrame by importance in descending order
  3. Displays the top 5 most important features

When you run this code, you might see output similar to:

Summary

In this lesson, you learned how to build and evaluate baseline regression models:

  1. Prepared preprocessed data by separating features and target variables.
  2. Built and evaluated a Linear Regression model using RMSE on the test set.
  3. Built and evaluated a LightGBM model using the same metric on the test set.
  4. Compared both models using test-set RMSE and saw that the better baseline depends on the data rather than the model's complexity alone.
  5. Used LightGBM’s feature importance to identify the most influential features, such as Episode_Length_minutes and Host_Popularity_percentage.

In the upcoming practice exercises, you’ll apply these steps to new datasets: building baseline models, comparing their RMSE on the test set, and analyzing feature importance. This will reinforce your understanding of baseline modeling and prepare you for more advanced techniques.
