Welcome to our exploration of Implementing Bagging. This lesson expands upon your machine learning toolkit by introducing you to the bagging technique and illustrating its use with decision trees. You will also gain hands-on experience with these concepts through a Python implementation. So, let's embark on our bagging adventure!
Bagging, or bootstrap aggregating, is an ensemble learning technique that aims to reduce the variance of a machine learning model. The essence of bagging is to generate multiple subsets of the original dataset and train a separate model on each one. Because the subsets are drawn with replacement, a single subset can contain duplicate data points. The final prediction is made by aggregating the predictions of the individual models. Essentially, the models vote: for classification, the final prediction is the class chosen by the majority of models.
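To make the sampling step concrete, here is a minimal sketch of drawing one bootstrap subset with numpy (the toy data and seed are placeholders, not part of the lesson's code):

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)  # a toy dataset of ten points, identified by index

# A bootstrap subset is the same size as the original but is sampled WITH
# replacement, so duplicates are expected and some points are left out.
subset = rng.choice(data, size=len(data), replace=True)
print(subset)
```

On average, each bootstrap subset contains only about 63% of the distinct original points; the rest of its slots are filled by duplicates.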
We will use decision trees as our base models. Decision trees can handle both categorical and continuous input variables and apply a sequence of hierarchical decision rules to arrive at a final prediction.
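As a quick illustration of the base model on its own, a single tree can be trained with sklearn's DecisionTreeClassifier (the random_state here is an arbitrary choice for reproducibility):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# A single tree learns a hierarchy of threshold rules on the features.
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X, y)
print(tree.predict(X[:5]))  # predicted classes for the first five samples
```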
Our Python implementation relies on several libraries: numpy for numerical computations on multi-dimensional arrays, sklearn for machine learning and statistical modeling tools, and scipy for statistical functions.
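For an implementation along these lines, the imports might look as follows (the exact list depends on the code that follows, so treat this as a sketch):

```python
import numpy as np
from scipy import stats
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
```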
First, we load our dataset and split it into training and test data. For this lesson, we use the widely recognized iris dataset, a staple of data science and machine learning. It contains four measurements, the lengths and widths of the sepals and petals, for 150 iris flowers from three species: setosa, versicolor, and virginica.
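A sketch of the loading and splitting step (the 30% test size and the seed are assumptions here, not values prescribed by the lesson):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the 150 iris samples (4 measurements each) and their species labels.
X, y = load_iris(return_X_y=True)

# Hold out 30% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)  # (105, 4) (45, 4)
```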
The variable n_models, set here as 100, determines the number of decision tree classifiers we plan to build.
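Putting the pieces together, one possible end-to-end sketch of the bagging loop looks like this (the hyperparameters and seeds are illustrative choices, not the lesson's required values):

```python
import numpy as np
from scipy import stats
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load and split the data as before.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

n_models = 100  # number of decision trees in the ensemble
rng = np.random.default_rng(42)

# Train each tree on its own bootstrap sample of the training set.
models = []
for _ in range(n_models):
    idx = rng.integers(0, len(X_train), size=len(X_train))  # with replacement
    tree = DecisionTreeClassifier()
    tree.fit(X_train[idx], y_train[idx])
    models.append(tree)

# Stack every tree's predictions, then take a majority vote per test sample.
all_preds = np.array([model.predict(X_test) for model in models])
y_pred = stats.mode(all_preds, axis=0, keepdims=False).mode

print("Ensemble accuracy:", accuracy_score(y_test, y_pred))
```

Here scipy's stats.mode carries out the majority vote across the model axis: each column of all_preds holds 100 votes for one test sample, and the most frequent class wins.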
