Greetings! In this session on Decision Trees, we will implement a full Decision Tree from scratch in C++. Decision Trees are a form of Supervised Machine Learning in which data is repeatedly split according to certain parameters.
A Decision Tree has a tree-like structure with each internal node denoting a test on an attribute, each branch representing an outcome of the test, and each terminal node (leaf) holding a class label. Here are the parts of a Decision Tree:
- Root Node: This houses the entire dataset.
- Internal Nodes: These make decisions based on conditions.
- Edges/Branches: These connections link a node to its children and represent the outcomes of its test.
- Leaves: These are terminal nodes for making predictions.
Which attribute a node splits on depends on how well the split purifies the data, as measured by an impurity metric such as the Gini index.
The tree-building process begins with the full dataset at the root node, iteratively partitioning the data based on chosen attributes. Each child node becomes a new root that can be split further. This recursive process continues until predefined stopping criteria are met.
Standard stopping criteria include:
- Maximum Tree Depth: Limiting the maximum depth of the tree.
- Minimum Node Records: Stop partitioning a node once it contains fewer than a threshold number of records.
- Node Purity: Stop if all instances at a node belong to the same class.
These criteria keep the tree from growing without bound, which helps prevent overfitting.
Now, we'll use C++ to build the decision tree. We'll rely on the existing get_split function from the previous lesson to find the optimal split for our data.
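Since the previous lesson's code isn't reproduced in this text, the sketch below shows one way those pieces might look. Only the name get_split comes from the lesson; the Row/Dataset aliases, the Node struct, the test_split and gini_index helpers, and the exhaustive Gini-based search inside get_split are all our assumptions:

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Assumed layout: each row holds feature values followed by the class label.
using Row = std::vector<double>;
using Dataset = std::vector<Row>;

// A node is either an internal test (index/value and two children) or a leaf.
struct Node {
    int index = -1;                  // feature tested: row[index] < value
    double value = 0.0;              // split threshold
    Dataset left_rows, right_rows;   // partitions awaiting further splitting
    Node* left = nullptr;            // 'Yes' branch
    Node* right = nullptr;           // 'No' branch
    bool is_leaf = false;
    double label = 0.0;              // prediction when is_leaf is true
};

// Partition rows on a single feature/threshold pair.
std::pair<Dataset, Dataset> test_split(int index, double value, const Dataset& rows) {
    Dataset left, right;
    for (const Row& row : rows)
        (row[index] < value ? left : right).push_back(row);
    return {left, right};
}

// Gini impurity of a candidate split, weighted by group size.
double gini_index(const std::vector<Dataset>& groups, const std::vector<double>& classes) {
    double n_instances = 0.0;
    for (const Dataset& g : groups) n_instances += g.size();
    double gini = 0.0;
    for (const Dataset& g : groups) {
        if (g.empty()) continue;
        double score = 0.0;
        for (double c : classes) {
            double count = 0.0;
            for (const Row& row : g)
                if (row.back() == c) count += 1.0;
            double p = count / g.size();
            score += p * p;
        }
        gini += (1.0 - score) * (g.size() / n_instances);
    }
    return gini;
}

// Try every value of every feature as a threshold; keep the lowest-Gini split.
Node* get_split(const Dataset& rows) {
    std::vector<double> classes;
    for (const Row& row : rows)
        if (std::find(classes.begin(), classes.end(), row.back()) == classes.end())
            classes.push_back(row.back());
    Node* node = new Node();
    double best_gini = std::numeric_limits<double>::max();
    for (std::size_t index = 0; index + 1 < rows[0].size(); ++index) {
        for (const Row& row : rows) {
            auto groups = test_split(static_cast<int>(index), row[index], rows);
            double gini = gini_index({groups.first, groups.second}, classes);
            if (gini < best_gini) {
                best_gini = gini;
                node->index = static_cast<int>(index);
                node->value = row[index];
                node->left_rows = std::move(groups.first);
                node->right_rows = std::move(groups.second);
            }
        }
    }
    return node;
}
```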
Here is how a terminal node is created:
The create_terminal function determines the most common class value in a group of rows and assigns that value as the final decision for that subset of data.
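The listing isn't included above, so here is a plausible sketch of create_terminal under the assumptions from the previous snippet; having it return a leaf Node* is our design choice:

```cpp
#include <map>

// Build a leaf that predicts the most common class label among `rows`.
Node* create_terminal(const Dataset& rows) {
    std::map<double, int> counts;
    for (const Row& row : rows)
        ++counts[row.back()];
    Node* leaf = new Node();
    leaf->is_leaf = true;
    int best = 0;
    for (const auto& [label, count] : counts) {
        if (count > best) {
            best = count;
            leaf->label = label;
        }
    }
    return leaf;
}
```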
Let's proceed to the actual tree building:
This function begins the tree-building process: it finds the best split for the entire training set and then recursively grows the tree from that root.
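The text doesn't show the listing, so here is a minimal sketch of this entry point; we call it build_tree (an assumed name) and give it the max_depth and min_size hyperparameters that back the stopping criteria:

```cpp
// Grows children for a node; defined in the next listing.
void recurse_split(Node* node, int max_depth, int min_size, int depth);

// Entry point: split the full dataset once, then grow the tree recursively.
Node* build_tree(const Dataset& train, int max_depth, int min_size) {
    Node* root = get_split(train);
    recurse_split(root, max_depth, min_size, 1);
    return root;
}
```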
The recurse_split function is responsible for creating child nodes: it checks the stopping criteria and, where splitting should continue, calls get_split again on each partition.
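Here is a sketch of how recurse_split might look under the assumptions above. It enforces the empty-split, maximum-depth, and minimum-size rules; we leave out an explicit purity check, which is why a pure branch can still end in a redundant split with two identical leaves, as in the printed tree later on:

```cpp
void recurse_split(Node* node, int max_depth, int min_size, int depth) {
    Dataset left = std::move(node->left_rows);
    Dataset right = std::move(node->right_rows);
    node->left_rows.clear();
    node->right_rows.clear();

    // No effective split (one side empty): both branches end in the same leaf.
    if (left.empty() || right.empty()) {
        node->left = node->right = create_terminal(left.empty() ? right : left);
        return;
    }
    // Maximum depth reached: terminate both branches.
    if (depth >= max_depth) {
        node->left = create_terminal(left);
        node->right = create_terminal(right);
        return;
    }
    // Left child: make a leaf if it is too small, otherwise split again.
    if (static_cast<int>(left.size()) <= min_size) {
        node->left = create_terminal(left);
    } else {
        node->left = get_split(left);
        recurse_split(node->left, max_depth, min_size, depth + 1);
    }
    // Right child: same rule.
    if (static_cast<int>(right.size()) <= min_size) {
        node->right = create_terminal(right);
    } else {
        node->right = get_split(right);
        recurse_split(node->right, max_depth, min_size, depth + 1);
    }
}
```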
We can then build and print the decision tree based on the dataset and chosen parameters:
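As a final sketch, here is a print_tree and a main that tie everything together. The dataset and the max_depth/min_size values are invented for illustration and chosen so that the printed tree matches the one discussed next:

```cpp
#include <cstdio>

// Print the tree recursively, indenting one space per level of depth.
void print_tree(const Node* node, int depth = 0) {
    if (node->is_leaf) {
        std::printf("%*s[%g]\n", depth, "", node->label);
        return;
    }
    std::printf("%*s[X%d < %.3f]\n", depth, "", node->index + 1, node->value);
    print_tree(node->left, depth + 1);
    print_tree(node->right, depth + 1);
}

int main() {
    // Hypothetical dataset: columns are X1, X2, and the class label.
    Dataset train = {
        {5.0, 7.0, 0.0},  {2.0, 3.0, 0.0},  {8.0, 4.0, 0.0},
        {10.0, 4.0, 1.0}, {13.0, 8.0, 0.0}, {15.0, 6.0, 1.0},
        {18.0, 12.0, 0.0},
    };
    Node* tree = build_tree(train, /*max_depth=*/2, /*min_size=*/1);
    print_tree(tree);  // note: the tree's nodes are never freed in this sketch
    return 0;
}
```

Running this prints:

```
[X1 < 10.000]
 [X1 < 5.000]
  [0]
  [0]
 [X2 < 8.000]
  [1]
  [0]
```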
In the print_tree output, [X1 < 10.000] checks whether feature X1 is less than 10.000. The left branch ([X1 < 5.000]) and its leaves ([0] and [0]) show the test and predictions when the answer is 'Yes'; the right branch ([X2 < 8.000]) and its leaves ([1] and [0]) cover the 'No' case. Each level of indentation corresponds to one level of depth in the tree.
Congratulations! You've now built a decision tree from scratch in C++. Proceed to the practice to reinforce your understanding. Keep progressing and keep implementing! Happy tree building!
