Greetings! In this session on Decision Trees, we will implement a full Decision Tree from scratch in C++. Decision Trees are a form of Supervised Machine Learning in which data is repeatedly split according to certain parameters.
A Decision Tree has a tree-like structure with each internal node denoting a test on an attribute, each branch representing an outcome of the test, and each terminal node (leaf) holding a class label. Here are the parts of a Decision Tree:
- Root Node: This houses the entire dataset.
- Internal Nodes: These make decisions based on conditions.
- Edges/Branches: These connections link a node to its children and represent the outcomes of its test.
- Leaves: These are terminal nodes for making predictions.
Which attribute a node splits on depends on how well the split purifies the data, as measured by an impurity metric such as the Gini index.
The tree-building process begins with the full dataset at the root node, iteratively partitioning the data based on chosen attributes. Each child node becomes a new root that can be split further. This recursive process continues until predefined stopping criteria are met.
Standard stopping criteria include:
- Maximum Tree Depth: Limiting the maximum depth of the tree.
- Minimum Node Records: Stop partitioning a node once it contains fewer than a threshold number of records.
- Node Purity: Stop if all instances at a node belong to the same class.
These criteria keep the tree from growing without bound, which helps prevent overfitting.
Now, we'll use C++ to build the decision tree. We'll rely on the existing get_split function from the previous lesson to find the optimal split for our data.
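Since the previous lesson's code isn't reproduced in this text, the sketch below shows one way those pieces might look. Only the name get_split comes from the lesson; the Row/Dataset aliases, the Node struct, the test_split and gini_index helpers, and the exhaustive Gini-based search inside get_split are all our assumptions:

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Assumed layout: each row holds feature values followed by the class label.
using Row = std::vector<double>;
using Dataset = std::vector<Row>;

// A node is either an internal test (index/value and two children) or a leaf.
struct Node {
    int index = -1;                  // feature tested: row[index] < value
    double value = 0.0;              // split threshold
    Dataset left_rows, right_rows;   // partitions awaiting further splitting
    Node* left = nullptr;            // 'Yes' branch
    Node* right = nullptr;           // 'No' branch
    bool is_leaf = false;
    double label = 0.0;              // prediction when is_leaf is true
};

// Partition rows on a single feature/threshold pair.
std::pair<Dataset, Dataset> test_split(int index, double value, const Dataset& rows) {
    Dataset left, right;
    for (const Row& row : rows)
        (row[index] < value ? left : right).push_back(row);
    return {left, right};
}

// Gini impurity of a candidate split, weighted by group size.
double gini_index(const std::vector<Dataset>& groups, const std::vector<double>& classes) {
    double n_instances = 0.0;
    for (const Dataset& g : groups) n_instances += g.size();
    double gini = 0.0;
    for (const Dataset& g : groups) {
        if (g.empty()) continue;
        double score = 0.0;
        for (double c : classes) {
            double count = 0.0;
            for (const Row& row : g)
                if (row.back() == c) count += 1.0;
            double p = count / g.size();
            score += p * p;
        }
        gini += (1.0 - score) * (g.size() / n_instances);
    }
    return gini;
}

// Try every value of every feature as a threshold; keep the lowest-Gini split.
Node* get_split(const Dataset& rows) {
    std::vector<double> classes;
    for (const Row& row : rows)
        if (std::find(classes.begin(), classes.end(), row.back()) == classes.end())
            classes.push_back(row.back());
    Node* node = new Node();
    double best_gini = std::numeric_limits<double>::max();
    for (std::size_t index = 0; index + 1 < rows[0].size(); ++index) {
        for (const Row& row : rows) {
            auto groups = test_split(static_cast<int>(index), row[index], rows);
            double gini = gini_index({groups.first, groups.second}, classes);
            if (gini < best_gini) {
                best_gini = gini;
                node->index = static_cast<int>(index);
                node->value = row[index];
                node->left_rows = std::move(groups.first);
                node->right_rows = std::move(groups.second);
            }
        }
    }
    return node;
}
```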
Here is how a terminal node is created:
The create_terminal function determines the most common class value in a group of rows and assigns that value as the final decision for that subset of data.
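The listing isn't included above, so here is a plausible sketch of create_terminal under the assumptions from the previous snippet; having it return a leaf Node* is our design choice:

```cpp
#include <map>

// Build a leaf that predicts the most common class label among `rows`.
Node* create_terminal(const Dataset& rows) {
    std::map<double, int> counts;
    for (const Row& row : rows)
        ++counts[row.back()];
    Node* leaf = new Node();
    leaf->is_leaf = true;
    int best = 0;
    for (const auto& [label, count] : counts) {
        if (count > best) {
            best = count;
            leaf->label = label;
        }
    }
    return leaf;
}
```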
Let's proceed to the actual tree building:
This function begins the tree-building process: it finds the best split for the entire training set and then recursively grows the tree from that root.
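The text doesn't show the listing, so here is a minimal sketch of this entry point; we call it build_tree (an assumed name) and give it the max_depth and min_size hyperparameters that back the stopping criteria:

```cpp
// Grows children for a node; defined in the next listing.
void recurse_split(Node* node, int max_depth, int min_size, int depth);

// Entry point: split the full dataset once, then grow the tree recursively.
Node* build_tree(const Dataset& train, int max_depth, int min_size) {
    Node* root = get_split(train);
    recurse_split(root, max_depth, min_size, 1);
    return root;
}
```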
The recurse_split function is responsible for creating child nodes: it checks the stopping criteria and, where splitting should continue, calls get_split again on each partition.
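Here is a sketch of how recurse_split might look under the assumptions above. It enforces the empty-split, maximum-depth, and minimum-size rules; we leave out an explicit purity check, which is why a pure branch can still end in a redundant split with two identical leaves, as in the printed tree later on:

```cpp
void recurse_split(Node* node, int max_depth, int min_size, int depth) {
    Dataset left = std::move(node->left_rows);
    Dataset right = std::move(node->right_rows);
    node->left_rows.clear();
    node->right_rows.clear();

    // No effective split (one side empty): both branches end in the same leaf.
    if (left.empty() || right.empty()) {
        node->left = node->right = create_terminal(left.empty() ? right : left);
        return;
    }
    // Maximum depth reached: terminate both branches.
    if (depth >= max_depth) {
        node->left = create_terminal(left);
        node->right = create_terminal(right);
        return;
    }
    // Left child: make a leaf if it is too small, otherwise split again.
    if (static_cast<int>(left.size()) <= min_size) {
        node->left = create_terminal(left);
    } else {
        node->left = get_split(left);
        recurse_split(node->left, max_depth, min_size, depth + 1);
    }
    // Right child: same rule.
    if (static_cast<int>(right.size()) <= min_size) {
        node->right = create_terminal(right);
    } else {
        node->right = get_split(right);
        recurse_split(node->right, max_depth, min_size, depth + 1);
    }
}
```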
We can then build and print the decision tree based on the dataset and chosen parameters:
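As a final sketch, here is a print_tree and a main that tie everything together. The dataset and the max_depth/min_size values are invented for illustration and chosen so that the printed tree matches the one discussed next:

```cpp
#include <cstdio>

// Print the tree recursively, indenting one space per level of depth.
void print_tree(const Node* node, int depth = 0) {
    if (node->is_leaf) {
        std::printf("%*s[%g]\n", depth, "", node->label);
        return;
    }
    std::printf("%*s[X%d < %.3f]\n", depth, "", node->index + 1, node->value);
    print_tree(node->left, depth + 1);
    print_tree(node->right, depth + 1);
}

int main() {
    // Hypothetical dataset: columns are X1, X2, and the class label.
    Dataset train = {
        {5.0, 7.0, 0.0},  {2.0, 3.0, 0.0},  {8.0, 4.0, 0.0},
        {10.0, 4.0, 1.0}, {13.0, 8.0, 0.0}, {15.0, 6.0, 1.0},
        {18.0, 12.0, 0.0},
    };
    Node* tree = build_tree(train, /*max_depth=*/2, /*min_size=*/1);
    print_tree(tree);  // note: the tree's nodes are never freed in this sketch
    return 0;
}
```

Running this prints:

```
[X1 < 10.000]
 [X1 < 5.000]
  [0]
  [0]
 [X2 < 8.000]
  [1]
  [0]
```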
In the print_tree output, [X1 < 10.000] checks whether feature X1 is less than 10.000. The left branch ([X1 < 5.000]) and its leaves ([0] and [0]) show the test and predictions when the answer is 'Yes'; the right branch ([X2 < 8.000]) and its leaves ([1] and [0]) cover the 'No' case. Each level of indentation corresponds to one level of depth in the tree.
Congratulations! You've now built a decision tree from scratch in C++. Proceed to the practice to reinforce your understanding. Keep progressing and keep implementing! Happy tree building!
