Greetings! In this session on Decision Trees, we will implement a full Decision Tree from scratch in Python. Decision Trees are a type of supervised machine learning algorithm in which data is repeatedly split according to the values of chosen attributes.
A Decision Tree has a tree-like structure with each internal node denoting a test on an attribute, each branch representing an outcome of the test, and each terminal node (leaf) holding a class label. Here are the parts of a Decision Tree:
- Root Node: This houses the entire dataset.
- Internal Nodes: These test an attribute against a condition and route records down the matching branch.
- Edges/Branches: These connect nodes and represent the outcomes of a test.
- Leaves: These are terminal nodes that hold the class labels used for prediction.
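For illustration, here is one possible in-memory representation of these parts in Python. This dict-based convention and the small predict helper are our own sketch, not a fixed standard:

```python
# A tiny hand-built tree. Internal nodes are dicts holding a test
# (attribute index plus threshold) and two children; leaves are
# plain class labels. This is one common from-scratch convention.
tree = {
    "index": 0, "value": 2.5,   # root: test "is attribute 0 < 2.5?"
    "left": "no",               # leaf reached when the test is true
    "right": {                  # internal node for the other branch
        "index": 1, "value": 7.0,
        "left": "no",
        "right": "yes",
    },
}

def predict(node, row):
    """Follow edges until a leaf (a plain label) is reached."""
    if not isinstance(node, dict):
        return node  # leaf: return its class label
    if row[node["index"]] < node["value"]:
        return predict(node["left"], row)
    return predict(node["right"], row)

print(predict(tree, [1.0, 3.0]))  # attribute 0 is 1.0 < 2.5 -> "no"
print(predict(tree, [4.0, 9.0]))  # 4.0 >= 2.5, then 9.0 >= 7.0 -> "yes"
```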
The attribute chosen at each split is the one that best purifies the data, i.e., produces child nodes whose records belong as much as possible to a single class. This is typically measured with criteria such as Gini impurity or information gain.
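As a minimal sketch of one such measure, here is Gini impurity (the function name `gini_impurity` is ours for illustration):

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a node: the probability that two labels
    drawn at random from the node disagree. 0.0 means a pure node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini_impurity(["yes", "yes", "no", "no"]))  # 0.5 (maximally mixed)
print(gini_impurity(["yes", "yes", "yes"]))       # 0.0 (pure)
```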
The tree-building process begins with the full dataset at the root node and recursively partitions the data based on the chosen attributes. Each child node is then treated as the root of a subtree that can be split further. This recursion continues until predefined stopping criteria are met.
Standard stopping criteria include the following (expressed in code after the list):
- Maximum Tree Depth: Stop splitting once the tree reaches a preset maximum depth.
- Minimum Node Records: Stop partitioning a node that holds fewer than a threshold number of records.
- Node Purity: Stop if all instances at a node belong to the same class.
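As a sketch, these checks might be gathered into one small helper; `should_stop` and its default values are illustrative, not part of the previous lesson's code:

```python
def should_stop(node_labels, depth, max_depth=5, min_records=10):
    """Return True when any stopping criterion is met."""
    if depth >= max_depth:               # maximum tree depth reached
        return True
    if len(node_labels) < min_records:   # too few records to split
        return True
    if len(set(node_labels)) == 1:       # node is pure: one class left
        return True
    return False
```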
These criteria keep the tree from growing overly complex, which helps prevent overfitting.
Now, we'll use Python to build the decision tree. We'll rely on the existing function from the previous lesson to find the optimal split for our data.
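Here is a minimal sketch of the recursive builder. It assumes the previous lesson's split finder is available as `get_best_split(dataset)` and that it returns a dict with the chosen attribute `'index'`, split `'value'`, and the two partitions under `'groups'`; if your version uses different names, adjust accordingly. Rows are lists whose last element is the class label.

```python
from collections import Counter

def to_leaf(group):
    """Turn a group of rows into a terminal node: the majority class."""
    labels = [row[-1] for row in group]
    return Counter(labels).most_common(1)[0][0]

def grow(node, max_depth, min_size, depth):
    """Recursively split the children of an internal node."""
    left, right = node["groups"]
    del node["groups"]
    # A one-sided split: both children become the same leaf.
    if not left or not right:
        node["left"] = node["right"] = to_leaf(left + right)
        return
    # Stopping criterion: maximum depth reached.
    if depth >= max_depth:
        node["left"], node["right"] = to_leaf(left), to_leaf(right)
        return
    # Grow each child, unless it is too small or already pure.
    for side, group in (("left", left), ("right", right)):
        labels = {row[-1] for row in group}
        if len(group) < min_size or len(labels) == 1:
            node[side] = to_leaf(group)
        else:
            node[side] = get_best_split(group)  # from the previous lesson
            grow(node[side], max_depth, min_size, depth + 1)

def build_tree(train, max_depth=5, min_size=10):
    """Start from the full dataset at the root and grow recursively."""
    root = get_best_split(train)  # from the previous lesson (assumed name)
    grow(root, max_depth, min_size, 1)
    return root
```

Note the design choice: internal nodes are plain dicts and leaves are plain labels, matching the representation sketched earlier, so the same `predict` helper can walk the finished tree.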
