Welcome to our exploration of Implementing Bagging. This lesson expands upon your machine learning toolkit by introducing you to the bagging technique and illustrating its use with decision trees. You will also gain hands-on experience with these concepts through a Python implementation. So, let's embark on our bagging adventure!
Bagging, or bootstrap aggregating, is an ensemble learning technique that aims to reduce the variance of a machine learning model. The essence of bagging involves generating multiple subsets from the original dataset and then using these subsets to train separate models. Note that the subsets are drawn with replacement, so a single subset can contain duplicate data points. The final prediction is made by aggregating the predictions of the individual models. Essentially, it is a vote: for classification, the final prediction is the class chosen by the majority of the models.
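To see what sampling with replacement looks like in practice, here is a tiny, self-contained sketch using numpy (which our implementation imports anyway): drawing 10 indices from a 10-row dataset typically produces duplicates, while some rows are left out entirely.

```python
import numpy as np

# Sampling 10 indices from a 10-row dataset *with replacement*:
# some indices will typically repeat, and others will be missing.
rng = np.random.default_rng(0)
indices = rng.choice(10, size=10, replace=True)
print(indices)  # duplicates are likely in this array
```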
We will use decision trees as our base models. Capable of supporting both categorical and continuous input variables, decision trees follow sequential, hierarchical decision rules to output a final decision.
Our Python implementation relies on several libraries: numpy for numerical computations on multi-dimensional arrays, sklearn for machine learning and statistical modeling tools, and scipy for statistical functions.
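As a rough sketch, the imports might look like this (the exact set depends on the implementation, but these cover everything used below):

```python
import numpy as np
from scipy import stats
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
```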
First, we load our dataset. For this lesson, we use the widely recognized iris dataset, a staple of data science and machine learning, which we split into training and test data. It contains measurements of 150 iris flowers from three different species - setosa, versicolor, and virginica - namely the lengths and widths of their sepals and petals.
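One way the loading and splitting step might look (the 80/20 split and the random_state value are illustrative choices, not fixed by the lesson):

```python
# Load the iris dataset and hold out a test set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # illustrative split parameters
)
```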
The variable n_models, set here as 100, determines the number of decision tree classifiers we plan to build.
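In code, this is a single line:

```python
n_models = 100  # number of decision tree classifiers in the ensemble
```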
Next, we define our helper functions, bootstrapping and predict, which are pivotal to constructing our bagging model.
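A minimal sketch of these two helpers, assuming the data is stored in numpy arrays (note that the keepdims argument to scipy.stats.mode requires SciPy 1.9 or newer):

```python
def bootstrapping(X, y):
    # Draw n_samples indices *with replacement*, so duplicates can appear.
    n_samples = X.shape[0]
    indices = np.random.choice(n_samples, size=n_samples, replace=True)
    return X[indices], y[indices]


def predict(models, X):
    # Stack each model's predictions (one row per model), then take the
    # mode (majority vote) across models for every sample.
    predictions = np.array([model.predict(X) for model in models])
    majority, _ = stats.mode(predictions, axis=0, keepdims=False)
    return majority
```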
Subsequently, we iteratively train our decision tree models, make predictions, and calculate the model's accuracy using sklearn's accuracy_score() function.
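The training loop might look like the following sketch: each tree is fit on its own bootstrapped sample, and the fitted models are collected for voting later.

```python
# Train n_models trees, each on its own bootstrapped dataset.
models = []
for _ in range(n_models):
    X_boot, y_boot = bootstrapping(X_train, y_train)
    tree = DecisionTreeClassifier()
    tree.fit(X_boot, y_boot)
    models.append(tree)
```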
The bootstrapping function generates a bootstrapped dataset by sampling rows from the training data at random, with replacement.
The predict function consolidates the predictions of the individually trained models to deliver the final decision: we take the mode (the most frequent prediction) as the final answer.
We could just as easily use another base model instead of decision trees; they are chosen here only as an example.
After implementing a model, we must evaluate its performance. In our bagging model, the accuracy score serves as a performance metric: the ratio of correct predictions to the total number of predictions. We utilize sklearn's accuracy_score() function to calculate this metric and gauge the performance of our model.
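Putting it all together, the evaluation step might look like this:

```python
# Aggregate the ensemble's votes on the test set and score them.
y_pred = predict(models, X_test)
print(f"Bagging accuracy: {accuracy_score(y_test, y_pred):.3f}")
```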
Congratulations! You've successfully navigated the basics of bagging with decision trees. You've learned about the fundamentals of bagging, implemented a bagging algorithm using decision trees in Python, and assessed the model's accuracy. Your understanding of these concepts will be further solidified through exercises in the next section. Have fun practicing!
