Welcome back to our lesson on Feature Selection with Statistical Tests. In the previous lesson, you diligently prepared the Titanic dataset, addressed missing values, encoded categorical variables, and saved the refined dataset. Now, we progress from preparation to feature selection, a critical step in enhancing model performance. This lesson will introduce you to using statistical tests as a method for selecting features. These tests help us spotlight the most impactful features, streamlining our dataset. Let's dive into how statistical tests, specifically the chi-square test, can assist us in this objective.
In this section, we introduce the chi-square test, a fundamental statistical tool used to evaluate the relationship between categorical features and the target variable. The chi-square test helps us determine whether the variations we observe are due to chance or to an actual association between variables. In simpler terms, it checks whether each feature is meaningfully related to the outcome, or target, variable.
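Before handing this work to scikit-learn, it can help to see the test in isolation. Here is a minimal sketch using SciPy on a made-up contingency table (the counts are hypothetical, not Titanic data): the statistic grows as the observed counts drift away from what independence would predict, and a small p-value suggests the feature and the target are related.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows are the two values of a binary
# feature, columns are the two classes of the target variable.
observed = np.array([
    [30, 10],
    [20, 40],
])

# chi2_contingency compares the observed counts against the counts we would
# expect if the feature and the target were independent.
stat, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-square statistic: {stat:.2f}")
print(f"p-value: {p_value:.4f}")
```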
To implement this, we use the `SelectKBest` class from the `sklearn` library. This tool leverages the chi-square test to automatically pick the "K" most important features for us. The "K" here is a number you decide, representing how many top features you want to select. This is particularly useful in datasets with categorical variables, such as the Titanic dataset, where both the features and the target variable (whether someone survived) are categorical.
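To make the mechanics concrete, here is a small, self-contained sketch on hypothetical toy data (three non-negative features and a binary target). Note that `k` can also be set to `"all"` to keep every feature while still computing the scores, which is handy for exploration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy data: 6 samples, 3 non-negative features, binary target.
X_toy = np.array([
    [0, 1, 5],
    [1, 0, 3],
    [0, 1, 4],
    [1, 0, 1],
    [0, 1, 6],
    [1, 0, 2],
])
y_toy = np.array([1, 0, 1, 0, 1, 0])

# Keep the K=2 highest-scoring features (k="all" would keep them all).
toy_selector = SelectKBest(score_func=chi2, k=2)
X_toy_selected = toy_selector.fit_transform(X_toy, y_toy)
print(X_toy_selected.shape)  # (6, 2)
```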
With our prepared Titanic dataset, we are ready to implement feature selection using the chi-square test. Let's begin by loading the dataset with Pandas and separating it into features (`X`) and the target variable (`y`). The chi-square test will help us identify which features have the strongest relationships with the target variable, `survived`.
```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Load the updated dataset
df = pd.read_csv("titanic_updated.csv")

# Drop 'survived' to create feature set X
X = df.drop(columns=['survived'])

# Assign 'survived' to target variable y
y = df['survived']

# Initialize SelectKBest to select top 5 features using chi-square
selector = SelectKBest(score_func=chi2, k=5)

# Fit and transform X to select top features
X_selected = selector.fit_transform(X, y)
```
In this code, we load the prepared `"titanic_updated.csv"` file into a DataFrame. We then separate our dataset into features (`X`) by dropping the `survived` column, which serves as our target variable (`y`). We utilize `SelectKBest` with `chi2` to select the top 5 features. The `fit_transform` method performs the selection based on the chi-square scores.
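One detail worth checking at this point: `fit_transform` returns a plain NumPy array rather than a DataFrame, with one column per selected feature. A quick sanity check, continuing from the code above (the exact row count depends on your prepared file):

```python
# X is the full feature DataFrame; X_selected is the reduced NumPy array.
print(type(X_selected))   # <class 'numpy.ndarray'>
print(X.shape)            # (n_samples, 13) -- all candidate features
print(X_selected.shape)   # (n_samples, 5)  -- only the top 5 remain
```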
To understand the chi-square test results, we need to examine the feature scores it generates. This step involves printing out the scores for each feature:
```python
# Display feature scores
print("Feature scores:")
for feature, score in zip(X.columns, selector.scores_):
    print(f"{feature}: {score:.2f}")
```
In this code, we use the built-in `zip` function to pair each feature name from `X.columns` with its corresponding score from `selector.scores_`, a NumPy array storing the chi-square scores computed during feature selection. We format and print the feature names alongside their scores to provide a clear view of their significance.
```text
Feature scores:
pclass: 30.87
sex: 92.70
age: 21.65
sibsp: 2.58
parch: 10.10
fare: 4518.32
embarked: 10.20
class: 54.47
who: 27.54
adult_male: 109.86
embark_town: 10.20
alive: 549.00
alone: 14.64
```
These scores measure the strength of each feature's association with the target variable. Higher scores suggest stronger associations, making those features more likely to be useful in predicting the target variable.
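If you want the ranking at a glance, you can sort the scores; `SelectKBest` also stores the corresponding p-values in `selector.pvalues_`, which give a complementary view (smaller p-values mean stronger evidence of an association). This snippet continues from the fitted selector above:

```python
# Rank features by chi-square score (highest first) and show the p-values.
ranked = sorted(
    zip(X.columns, selector.scores_, selector.pvalues_),
    key=lambda item: item[1],
    reverse=True,
)
for feature, score, p_value in ranked:
    print(f"{feature}: score={score:.2f}, p-value={p_value:.4f}")
```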
Once the scores have been interpreted, the next task is to isolate the top features based on these scores. We use the `get_support()` method of the `SelectKBest` instance to identify which features have been selected. This method returns a boolean array indicating whether each feature is included in the final set.
```python
# Get selected feature names
selected_features = X.columns[selector.get_support()].tolist()
```
In this line, `selector.get_support()` yields an array of `True` and `False` values, where `True` corresponds to features that have been selected. We use these boolean values to filter `X.columns`, generating a list of selected feature names, and call `.tolist()` to convert the result from a Pandas Index object to a standard Python list.
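As a side note, on recent scikit-learn versions (1.0 and later) the fitted selector can report the retained column names directly; since we fit it on a DataFrame, `get_feature_names_out()` should return the same five names:

```python
# Equivalent shortcut on scikit-learn 1.0+: ask the fitted selector directly.
print(selector.get_feature_names_out())
```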
After retrieving the selected feature names, we display them to understand which variables will be retained for model training and predictions:
```python
# Display top 5 selected features
print("\nSelected top 5 features:", selected_features)
```
```text
Selected top 5 features: ['sex', 'fare', 'class', 'adult_male', 'alive']
```
These features, being the most statistically significant, will form the streamlined dataset, enhancing the predictive model’s performance and efficiency.
Finally, by creating a DataFrame with the selected features, we can view a sample of the data, allowing us to visualize how our feature selection has streamlined the dataset.
```python
# Create DataFrame with the selected top 5 features
X_selected_df = pd.DataFrame(X_selected, columns=selected_features)

# Display sample of selected features
print(X_selected_df.head())
```
Sample of selected features:
```text
   sex     fare  class  adult_male  alive
0  1.0   7.2500    2.0         1.0    0.0
1  0.0  71.2833    0.0         0.0    1.0
2  0.0   7.9250    2.0         0.0    1.0
3  0.0  53.1000    0.0         0.0    1.0
4  1.0   8.0500    2.0         1.0    0.0
```
This output provides a quick view of how the reduced dataset looks, with only the most significant features retained for further analysis or modeling.
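As a rough sketch of a typical next step (not part of this lesson's code), you could compare a simple classifier trained on all features against one trained only on the selected subset; the model choice here, `LogisticRegression`, is just an illustrative assumption.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Compare cross-validated accuracy on all features vs. the selected subset.
model = LogisticRegression(max_iter=1000)
score_all = cross_val_score(model, X, y, cv=5).mean()
score_selected = cross_val_score(model, X_selected_df, y, cv=5).mean()

print(f"All features:      {score_all:.3f}")
print(f"Selected features: {score_selected:.3f}")
```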
In this lesson, you gained skills in selecting features using the chi-square test—enabling you to identify and retain the most impactful features in a dataset. By applying statistical tests, you’ve moved a step closer to crafting efficient predictive models. As you transition to the exercises, focus on practicing these techniques using different datasets and evaluate how your choice of features impacts the predictive performance. This hands-on experience will reinforce the lessons learned and prepare you for mastering the art of feature selection, driving your future data projects toward success.