Welcome to our lesson on Feature Ranking with Random Forests. In previous lessons, we delved into data preparation and the selection of key features using statistical tests on the Titanic dataset. Having laid that groundwork, we will now transition from statistical methods to leveraging machine learning models, specifically Random Forests, to rank the features by importance.
Feature ranking is crucial for understanding which features most influence the predictions of a model, allowing more strategic decisions in model refinement. Random Forests, being an ensemble of decision trees, provide a robust mechanism for assessing feature importance. Let’s explore how this method can enhance our feature selection process.
Let's begin by reading the "titanic_updated.csv" file into a DataFrame using Pandas. As a recap, we define our feature set `X` by dropping the `survived` column and assign `survived` to `y`, which is our target variable.
```python
import pandas as pd

# Load the updated dataset
df = pd.read_csv("titanic_updated.csv")

# Separate the features from the target variable
X = df.drop(columns=['survived'])
y = df['survived']
```
Random Forest is a machine learning technique that builds multiple decision trees to make predictions. Think of it as a group of experts (trees) that work together to make decisions. Each tree provides a prediction, and the Random Forest takes a "vote" from all trees to determine the final prediction, making it more accurate and less prone to errors compared to a single decision tree.
To get started with Random Forests in Scikit-learn, we first initialize the `RandomForestClassifier`. Here, `n_estimators` controls the number of decision trees in the forest, and `random_state` ensures the model gives consistent results every time you run it by controlling the randomness in the bootstrapping of samples and in the feature selection within each tree.
```python
from sklearn.ensemble import RandomForestClassifier

# Initialize the Random Forest model
model = RandomForestClassifier(n_estimators=100, random_state=42)
```
After setting up the model with `RandomForestClassifier`, we train it on our data using the `fit` method, feeding it our feature data `X` and target variable `y`.
```python
# Train the Random Forest model on the dataset
model.fit(X, y)
```
This process allows the model to learn patterns in the data, which it will later use to determine the importance of each feature, helping us understand which features are most significant in predicting our target variable.
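If you want to confirm the model trained as expected, a quick check like the following works. Note that accuracy measured on the training data is optimistic; a held-out test set gives a more honest estimate:

```python
# The fitted forest holds one decision tree per estimator
print(f"Number of trees: {len(model.estimators_)}")

# Accuracy on the training data (optimistic; use a test split for real evaluation)
print(f"Training accuracy: {model.score(X, y):.3f}")
```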
Once our Random Forest model is trained, we can determine how important each feature is to the model's predictions. This is achieved by extracting the feature importance scores, which tell us the influence each feature has on the final prediction.
To do this, we use the `feature_importances_` attribute of our trained model. This gives us an array of scores, one for each feature in our dataset. We'll organize these scores into a Pandas DataFrame for better visualization and analysis.
```python
# Extract feature importance from the trained model
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
})
```
The code above creates a DataFrame that pairs each feature with its corresponding importance score. This setup simplifies the subsequent analysis of which features contribute most to the model predictions, preparing us for the sorting and ranking process.
Now that we have a DataFrame containing features and their respective importance scores, the next step is to sort this DataFrame. Sorting helps in easily identifying the features that have the greatest impact. We will sort the features by their importance scores in descending order.
```python
# Sort the features based on importance in descending order
feature_importance_sorted = feature_importance.sort_values('importance', ascending=False)

# Display the sorted feature importance ranking
print("Feature Importance Ranking:")
print(feature_importance_sorted)
```
What this does is arrange our features from the most important to the least important, making it straightforward to see which features the model relies on most for making predictions. For instance, if the feature `age` has a high importance score, it means changes in `age` significantly affect the model's predictions.
Understanding this order helps you make informed decisions about which features might be worth focusing on or preserving during model refinement and feature selection.
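For example, you might keep only the top-ranked features when refitting a model. Here is a minimal sketch; the cutoff of five features is an arbitrary choice for illustration:

```python
# Keep the five highest-ranked features (cutoff chosen purely for illustration)
top_features = feature_importance_sorted['feature'].head(5).tolist()
X_reduced = X[top_features]
print(f"Reduced feature set: {top_features}")
```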
```text
Feature Importance Ranking:
        feature  importance
11        alive    0.632187
9    adult_male    0.083743
1           sex    0.071968
8           who    0.054135
5          fare    0.040671
0        pclass    0.029113
2           age    0.029102
7         class    0.025879
3         sibsp    0.009835
12        alone    0.006287
6      embarked    0.005996
10  embark_town    0.005747
4         parch    0.005339
```
In the output above, features such as `alive` and `adult_male` are among the most significant, indicating they heavily influence the model's predictions. On the other hand, features like `parch` and `embark_town` are less impactful. This detailed ranking provides clear insight into which features have the strongest relationships with the survival outcome, guiding your focus for further analysis and refinement.
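Beyond the printed table, a horizontal bar chart makes the ranking easier to scan at a glance. Here is a minimal sketch that assumes matplotlib is available in your environment:

```python
import matplotlib.pyplot as plt

# Plot importance scores, with the most important feature at the top
plt.barh(feature_importance_sorted['feature'], feature_importance_sorted['importance'])
plt.gca().invert_yaxis()
plt.xlabel('Importance')
plt.title('Random Forest Feature Importance')
plt.tight_layout()
plt.show()
```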
In this lesson, we explored how Random Forests can serve as a powerful tool for ranking features based on their importance in model predictions. By identifying which features truly matter, you enhance your model's efficiency and interpretability. This process builds upon the statistical feature selection we've covered in previous lessons, providing a deeper level of analysis.
As you move forward to practice exercises, I encourage you to apply these concepts and experiment with different datasets. The hands-on experience will reinforce what you’ve learned and prepare you for more advanced feature engineering techniques in upcoming lessons. Keep exploring and refining your skills as you continue your journey in mastering feature selection and machine learning.