Lesson 3
Feature Ranking with Random Forests
Introduction to Feature Ranking with Random Forests

Welcome to our lesson on Feature Ranking with Random Forests. In previous lessons, we delved into data preparation and the selection of key features using statistical tests on the Titanic dataset. Having laid that groundwork, we will now transition from statistical methods to leveraging machine learning models, specifically Random Forests, to rank the features by importance.

Feature ranking is crucial for understanding which features most influence the predictions of a model, allowing more strategic decisions in model refinement. Random Forests, being an ensemble of decision trees, provide a robust mechanism for assessing feature importance. Let’s explore how this method can enhance our feature selection process.

Loading the Dataset and Preparing Features

Let's begin by reading the "titanic_updated.csv" file into a DataFrame using Pandas. As a recap, we define our feature set X by dropping the survived column and assign survived to y, which is our target variable.

```python
import pandas as pd

# Load the updated dataset
df = pd.read_csv("titanic_updated.csv")

# Prepare features
X = df.drop(columns=['survived'])
y = df['survived']
```
Understanding and Implementing Random Forests

Random Forest is a machine learning technique that builds multiple decision trees to make predictions. Think of it as a group of experts (trees) that work together to make decisions. Each tree provides a prediction, and the Random Forest takes a "vote" from all trees to determine the final prediction, making it more accurate and less prone to errors compared to a single decision tree.
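The voting idea can be sketched on synthetic data (the dataset and names like `X_demo` below are illustrative assumptions, not part of the lesson's Titanic setup). One nuance: Scikit-learn's forest actually averages the trees' predicted class probabilities rather than counting hard votes, but on a clear-cut sample the two agree:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

# Each tree in the ensemble casts its own "vote" for a sample
sample = X_demo[:1]
votes = [tree.predict(sample)[0] for tree in forest.estimators_]
majority = max(set(votes), key=votes.count)

# The forest's final prediction agrees with the majority vote here
print("Majority vote:", majority, "| Forest prediction:", forest.predict(sample)[0])
```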

To get started with Random Forests using Scikit-learn, we should first initialize the RandomForestClassifier. Here, n_estimators controls the number of decision trees in the forest, and random_state ensures the model gives consistent results every time you run it by controlling the randomness in the bootstrapping of samples and the feature selection process within each tree.

```python
from sklearn.ensemble import RandomForestClassifier

# Initialize the Random Forest model
model = RandomForestClassifier(n_estimators=100, random_state=42)
```
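To see concretely what `random_state` buys us, a small sketch (again on synthetic stand-in data, an assumption for illustration) confirms that two identically configured forests produce identical importance scores:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Same random_state -> same bootstraps and feature draws -> identical results
m1 = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)
m2 = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)

print(np.allclose(m1.feature_importances_, m2.feature_importances_))
```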

After setting up the model with RandomForestClassifier, we train it with our data using the fit method, which involves feeding it our feature data X and target variable y.

```python
# Train the Random Forest model on the dataset
model.fit(X, y)
```

This process allows the model to learn patterns in the data, which it will later use to determine the importance of each feature, helping us understand which features are most significant in predicting our target variable.
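As a quick sanity check that `fit` actually learned something, you can score the trained model on its own training data; a Random Forest typically fits the training set almost perfectly. The synthetic data below is an illustrative stand-in for the Titanic features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=1)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_demo, y_demo)

# Training accuracy should be very high for a fitted forest
print("Training accuracy:", model.score(X_demo, y_demo))
```

High training accuracy alone does not prove the model generalizes, but it does confirm the fit succeeded before we inspect importances.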

Extracting Feature Importances

Once our Random Forest model is trained, we can determine how important each feature is to the model's predictions. This is achieved by extracting the feature importance scores, which tell us the influence each feature has on the final prediction.

To do this, we use the feature_importances_ attribute of our trained model. This gives us an array of scores corresponding to each feature in our dataset. We’ll organize these scores into a Pandas DataFrame for better visualization and analysis.

```python
# Extract feature importance from the trained model
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
})
```

The code above creates a DataFrame that pairs each feature with its corresponding importance score. This setup simplifies the subsequent analysis of which features contribute most to the model predictions, preparing us for the sorting and ranking process.
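Two properties of `feature_importances_` are worth keeping in mind: it returns one non-negative score per column, and Scikit-learn normalizes the scores to sum to 1, so each value reads as a share of the model's total reliance. A short sketch on synthetic stand-in data (an illustrative assumption):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=1)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)

importances = model.feature_importances_
# One non-negative score per feature, normalized to sum to 1
print("Scores:", len(importances), "| Sum:", importances.sum())
```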

Analyzing and Interpreting Feature Importances

Now that we have a DataFrame containing features and their respective importance scores, the next step is to sort this DataFrame. Sorting helps in easily identifying the features that have the greatest impact. We will sort the features by their importance scores in descending order.

```python
# Sort the features based on importance in descending order
feature_importance_sorted = feature_importance.sort_values('importance', ascending=False)

# Display the sorted feature importance ranking
print("Feature Importance Ranking:")
print(feature_importance_sorted)
```

This arranges our features from most to least important, making it straightforward to see which features the model relies on most for making predictions. For instance, if the feature age has a high importance score, it means the model's trees frequently used age to split the data, so age contributes substantially to the predictions.

Understanding this order helps you make informed decisions about which features might be worth focusing on or preserving during model refinement and feature selection.

```
Feature Importance Ranking:
        feature  importance
11        alive    0.632187
9    adult_male    0.083743
1           sex    0.071968
8           who    0.054135
5          fare    0.040671
0        pclass    0.029113
2           age    0.029102
7         class    0.025879
3         sibsp    0.009835
12        alone    0.006287
6      embarked    0.005996
10  embark_town    0.005747
4         parch    0.005339
```

In the output above, features such as alive and adult_male are among the most significant, indicating they heavily influence model predictions. (In fact, alive is essentially a yes/no restatement of survived, so its dominance also illustrates how importance scores can expose redundant or leaky features.) On the other hand, features like parch and embark_town are less impactful. This detailed ranking provides clear insights into which features have the strongest relationships with the survival outcome, guiding your focus for further analysis and refinement.
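A common follow-up is to keep only the features whose importance clears a threshold. Here is a sketch on synthetic stand-in data (the feature names `f0`–`f7` and the mean-importance threshold are illustrative assumptions, not the lesson's Titanic columns):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 8 features, only 3 of which are informative
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=3, random_state=0)
X_df = pd.DataFrame(X_demo, columns=[f"f{i}" for i in range(8)])

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_df, y_demo)

# Keep only features whose importance exceeds the mean importance
importances = model.feature_importances_
kept = X_df.columns[importances > importances.mean()]
print("Selected features:", list(kept))
```

Scikit-learn also offers `SelectFromModel` to wrap this pattern, but the manual filter above makes the logic explicit.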

Conclusion and Preparation for Practice

In this lesson, we explored how Random Forests can serve as a powerful tool for ranking features based on their importance in model predictions. By identifying which features truly matter, you enhance your model's efficiency and interpretability. This process builds upon the statistical feature selection we've covered in previous lessons, providing a deeper level of analysis.

As you move forward to practice exercises, I encourage you to apply these concepts and experiment with different datasets. The hands-on experience will reinforce what you’ve learned and prepare you for more advanced feature engineering techniques in upcoming lessons. Keep exploring and refining your skills as you continue your journey in mastering feature selection and machine learning.
