Introduction

Welcome to the first lesson of our Foundations of Gradient Boosting course! This is the beginning of an exciting journey that will take you from the basics of tree-based models all the way to mastering advanced gradient boosting techniques. Whether you're looking to enhance your machine learning skills or dive into one of the most powerful prediction methods available today, you're in the right place!

Before we begin, let's ensure you have the necessary foundation. We expect you to be familiar with basic Python programming, pandas for data manipulation, and fundamental machine learning concepts like training and testing sets. If you're comfortable with these topics, you're ready to explore the world of gradient boosting!

Our learning path consists of three comprehensive courses designed to build your expertise step by step:

  1. Foundations of Gradient Boosting: We'll start by loading real-world data and building baseline decision trees, then progress through Random Forests, boosting concepts, and complete gradient boosting implementation with feature importance analysis.

  2. XGBoost for Beginners: You'll master the popular XGBoost library, learning to build models, control complexity, implement early stopping, and optimize parameters through automated tuning.

  3. LightGBM Made Simple: We'll explore Microsoft's efficient LightGBM library, focusing on its speed advantages, native categorical feature support, unique leaf-wise growth strategy, and parameter optimization.

By the end of this learning path, you'll confidently build, tune, and interpret gradient boosting models using multiple industry-standard libraries, equipped with the skills to tackle real-world classification problems effectively. Today's lesson focuses on building your first tree: we'll establish our foundation by working with actual banking data and training a baseline decision tree classifier.

Understanding Decision Trees

Decision trees are among the most intuitive machine learning algorithms, mimicking the way humans naturally make decisions. Imagine you're deciding whether to go outside: you might first check if it's raining, then consider the temperature, and finally decide based on these factors. A decision tree works similarly, asking a series of yes/no questions about data features to reach a prediction.

The beauty of decision trees lies in their interpretability. Each internal node represents a question about a feature, each branch represents an answer, and each leaf represents a final prediction. For classification tasks, the tree learns to split the data in ways that best separate different classes, using measures like the Gini index or entropy to determine the most informative questions to ask.
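To make the splitting criterion concrete, here is a minimal sketch of how Gini impurity can be computed for a node's class labels. The gini_impurity helper and the example label lists below are illustrative only, not part of the lesson's code:

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

# A pure node (all one class) has impurity 0; a 50/50 split has impurity 0.5.
print(gini_impurity([0, 0, 0, 0]))  # 0.0
print(gini_impurity([0, 0, 1, 1]))  # 0.5
print(gini_impurity([0, 0, 0, 1]))  # 0.375
```

The tree evaluates candidate splits and prefers the one that lowers the weighted impurity of the resulting child nodes the most.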

In our journey toward gradient boosting, decision trees serve as the fundamental building blocks. While a single tree might not be the most accurate predictor, understanding how individual trees work is crucial because gradient boosting combines many trees to create powerful ensemble models. Let's start by building our first tree and establishing a baseline performance that we'll improve upon in subsequent lessons.

Loading Real-World Data

To make our learning experience practical and relevant, we'll work with the Bank Marketing dataset from the UCI Machine Learning Repository. This dataset contains information about direct marketing campaigns of a Portuguese bank, where the goal is to predict whether clients will subscribe to a term deposit.

The ucimlrepo library provides a convenient way to access datasets directly from the UCI repository. If you were working in your own environment, you could install it with pip install ucimlrepo, but in our CodeSignal environment, all necessary libraries are already pre-configured for you. We specify the dataset ID (222 for Bank Marketing) and fetch the complete dataset. The fetch_ucirepo function returns an object containing features and targets separately, so we use pd.concat() with axis=1 to combine them into a single DataFrame for easier manipulation.
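A minimal sketch of what that loading step can look like (the DataFrame name df is our choice; the fetch_ucirepo call and dataset ID come straight from the description above):

```python
import pandas as pd
from ucimlrepo import fetch_ucirepo

# Fetch the Bank Marketing dataset (UCI ID 222)
bank_marketing = fetch_ucirepo(id=222)

# Combine features and targets into a single DataFrame for easier manipulation
df = pd.concat([bank_marketing.data.features, bank_marketing.data.targets], axis=1)
```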

Exploring Our Dataset

Before building any model, it's essential to understand our data thoroughly. Let's examine the structure and characteristics of our dataset to make informed decisions about preprocessing and feature selection.

The info() method provides crucial insights about our dataset structure, including column names, data types, and missing values. This exploration reveals that our dataset contains 45,211 records with 17 columns, including both categorical and numeric features. Some columns have missing values, which we'll need to consider in our preprocessing steps.
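Continuing with the df DataFrame from the loading sketch, the exploration step might look like this:

```python
# Summarize column names, dtypes, and non-null counts
df.info()

# Peek at the first few rows to see actual values
print(df.head())
```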

Here's what we discover about our data: several columns are marked as object dtype, which indicates categorical variables. For example, marital contains values like 'married', 'single', and 'divorced', while housing contains 'yes' or 'no' values. Also worth noting is that some features, such as contact and poutcome, have a significant number of missing values, which we'll need to handle carefully.

Preparing Our Data with Mixed Feature Types

Real-world datasets typically contain both numeric and categorical features, and learning to handle both is essential for building robust models. Let's carefully select and preprocess our features to create a comprehensive yet manageable dataset.

We carefully select three numeric features that represent meaningful customer attributes: client age, account balance, and number of campaign contacts. Note that we deliberately exclude duration (call duration) because it represents data leakage — the length of a marketing call is not known before the call is made and is largely determined by the outcome we're trying to predict.

For categorical features, we select four variables with no missing values: marital status, credit default history, housing loan status, and personal loan status. These represent important customer characteristics that are available before any marketing contact.
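A sketch of this selection step, assuming the dataset's standard column names (age, balance, and campaign for the numeric features; marital, default, housing, and loan for the categorical ones) and reusing the df DataFrame from the earlier sketch:

```python
# Numeric features available before any marketing contact
numeric_features = ['age', 'balance', 'campaign']

# Categorical features with no missing values
categorical_features = ['marital', 'default', 'housing', 'loan']

# Keep only the selected columns for modeling
features = df[numeric_features + categorical_features].copy()
```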

Simple Categorical Encoding

Machine learning algorithms require numeric input, so we need to convert our categorical features to numeric format. Let's use a simple but effective approach:
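Here is a sketch of that encoding step, reusing the features, numeric_features, and categorical_features names from the previous sketch:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Encode each categorical column to integer codes
encoded = features[categorical_features].copy()
for col in categorical_features:
    encoded[col] = LabelEncoder().fit_transform(encoded[col])

# Combine numeric features and encoded categorical features into one feature matrix
X = pd.concat([features[numeric_features], encoded], axis=1)
```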

The LabelEncoder converts categorical values to integers assigned in sorted order (e.g., 'divorced' → 0, 'married' → 1, 'single' → 2). While this approach assumes ordinal relationships that may not exist, it's simple and often works well with tree-based models, which can handle the resulting splits effectively. We apply the encoder to each categorical column and then combine the encoded categorical features with our numeric features into a single feature matrix.

Target Variable Preparation

Now let's prepare our target variable for the classification task:
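A sketch of this step, assuming the target column is named y as in the standard Bank Marketing schema:

```python
# Convert 'yes'/'no' target labels to 1/0 integers
y = df['y'].map({'yes': 1, 'no': 0})
```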

The target variable originally contains string values 'yes' and 'no', so we convert them to integers (1 and 0) using the map() function. This binary encoding is the standard format expected by scikit-learn classifiers.

Splitting Our Data

Before training our model, we need to divide our data into training and testing sets. This separation is crucial for obtaining an unbiased evaluation of our model's performance on unseen data.
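Reusing the X and y built in the earlier sketches, the split might look like this:

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the data for testing; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```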

The train_test_split function randomly divides our data into 80% for training and 20% for testing. We set random_state=42 to ensure reproducible results, meaning we'll get the same split every time we run the code. This reproducibility is essential for comparing different models and approaches consistently.

Building Our First Decision Tree

Now comes the exciting part: creating our first decision tree classifier! We'll use scikit-learn's DecisionTreeClassifier, which implements the popular CART (Classification and Regression Trees) algorithm.
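A minimal sketch of the training step, using the train/test split from above:

```python
from sklearn.tree import DecisionTreeClassifier

# Build a tree with default parameters (Gini criterion, no depth limit)
model = DecisionTreeClassifier()

# Fit the tree: recursively split the training data until stopping criteria are met
model.fit(X_train, y_train)
```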

The DecisionTreeClassifier with default parameters will automatically determine the best way to split our data at each node. It examines all possible splits for each feature and chooses the one that provides the best separation between classes according to the Gini impurity criterion. The fit() method performs the actual training, building the tree structure by recursively splitting the data until stopping criteria are met.

Understanding Classification Performance

After training our model, we need to assess how well it performs on data it hasn't seen before. However, simply looking at accuracy can be misleading, especially with imbalanced datasets like ours, so let's use a comprehensive evaluation approach that reveals the full picture of our model's performance:
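A sketch of this evaluation step, reusing the trained model and test split from the earlier sketches (the variable names are ours):

```python
from sklearn.metrics import accuracy_score, classification_report

# Predict on the held-out test set
y_pred = model.predict(X_test)

# Overall accuracy plus per-class precision, recall, F1-score, and support
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```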

The classification_report function provides several key metrics for each class:

  • Precision: Of all instances predicted as positive, how many were actually positive? High precision means few false positives.
  • Recall: Of all actual positive instances, how many did we correctly identify? High recall means few false negatives.
  • F1-score: The harmonic mean of precision and recall, providing a balanced measure of performance.

The report tells a much more nuanced story than accuracy alone! While our overall accuracy is 82%, the class-wise breakdown reveals important insights:

  • Class 0 (no subscription): The model performs well with 90% precision and 90% recall, correctly identifying most customers who won't subscribe.
  • Class 1 (subscription): The model struggles significantly with only 26% precision and 26% recall, missing many customers who would actually subscribe.

The class imbalance is evident from the support column: we have 7,952 negative examples but only 1,091 positive examples. This imbalance explains why the model achieves high overall accuracy by being very good at predicting the majority class while struggling with the minority class.

Why This Matters for Real-World Applications

Understanding these metrics is crucial for real-world applications. In a banking context, missing potential subscribers (low recall for class 1) means lost revenue opportunities, while incorrectly targeting non-subscribers (low precision for class 1) means wasted marketing resources. This analysis helps us understand that while our model achieves respectable overall accuracy, it has significant room for improvement in identifying the minority class.

This comprehensive evaluation approach sets the foundation for understanding model performance throughout this course. As we progress to ensemble methods and gradient boosting, we'll continue using these detailed metrics to gain deeper insights into how different techniques address the challenges of imbalanced classification problems.

Conclusion and Next Steps

Congratulations on building your first decision tree classifier! We've successfully loaded real-world banking data, explored its structure, performed preprocessing for both numeric and categorical features, and created a model that achieves 82% overall accuracy. More importantly, we've learned to look beyond simple accuracy to understand the nuanced performance characteristics revealed by precision, recall, and F1-scores.

The comprehensive evaluation we've conducted demonstrates the essential workflow of responsible machine learning: data loading, exploration, preprocessing, model training, and thorough evaluation. We've also learned to handle mixed data types and avoid data leakage by carefully selecting features. The detailed classification report has revealed both the strengths and limitations of our single decision tree, particularly its struggle with the minority class in this imbalanced dataset.

While our single decision tree provides a solid foundation, you'll soon discover how ensemble methods like Random Forests and gradient boosting can significantly improve upon these results, especially for challenging class imbalance scenarios. Get ready to apply what you've learned through hands-on practice exercises that will deepen your understanding and help you experiment with different aspects of decision tree modeling!
