Hello, and welcome to this lesson on univariate statistical tests for feature selection in machine learning. How you handle a dataset's features can drastically impact your model's performance: by intelligently choosing the most relevant features, we can improve accuracy, reduce overfitting, and shorten training time. A key technique for achieving this is univariate feature selection. We will apply SelectKBest to select the most informative features from our dataset. By the end of this session, you will know how to use univariate feature selection in Python and appreciate its strengths and limitations.
Univariate statistical tests examine each feature independently to determine the strength of the relationship between the feature and the response variable. These tests are simple to run and understand, and they often provide good intuition about your features. The scikit-learn library provides the SelectKBest class, which scores every feature with a statistical test and keeps a specified number of them.
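As a first look at the API, here is a minimal sketch of the general pattern. The synthetic data and the choice of f_classif (scikit-learn's default ANOVA F-test scorer) are purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data for illustration: 100 samples, 10 features
X, y = make_classification(n_samples=100, n_features=10, random_state=0)

# Score every feature with the ANOVA F-test and keep the top 3
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)  # (100, 3)
```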
The SelectKBest class simply retains the k features of X with the highest scores. In this lesson, we'll use the chi-squared statistical test, which requires non-negative features, to select the k best features. The chi-square test determines whether there's a significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table.
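To make that idea concrete, here is a minimal sketch of a chi-square test on a small, made-up contingency table, using scipy.stats.chi2_contingency (the counts are hypothetical):

```python
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows are two groups,
# columns are observed counts in two categories
observed = [[10, 20],
            [30, 40]]

# chi2_contingency compares the observed counts against the
# frequencies expected if rows and columns were independent
chi2_stat, p_value, dof, expected = chi2_contingency(observed)

print(chi2_stat)  # chi-squared statistic
print(p_value)    # a small p-value suggests a significant difference
print(expected)   # expected frequencies under independence
```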
For this tutorial, we'll use the Iris dataset from Fisher's classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems. The Iris dataset is bundled with scikit-learn, so it doesn't require downloading any files from an external website. It's a beginner-friendly dataset containing measurements for 150 iris flowers from three different species.
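Putting the pieces together, here is a minimal sketch of how the workflow might look on Iris, whose measurements are all non-negative and therefore valid inputs for the chi-squared scorer. The choice of k=2 is just for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Load the Iris dataset bundled with scikit-learn
X, y = load_iris(return_X_y=True)
print(X.shape)  # (150, 4): 150 flowers, 4 measurements each

# Keep the 2 features with the highest chi-squared scores
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

print(X_new.shape)         # (150, 2)
print(selector.scores_)    # chi-squared score for each original feature
```

The scores_ attribute is useful for inspection: it shows how strongly each individual feature relates to the target, which is exactly the per-feature intuition univariate selection is meant to provide.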
