This lesson will focus on tackling redundant or correlated features in a dataset. These features provide similar, overlapping information that can potentially affect the performance of machine learning models. Learning to handle such features is critical to data cleaning and preprocessing.
Why is handling redundant or correlated features necessary, you ask? Here's the reason: machine learning models are grounded in mathematics, and many of them assume that the predictors are not strongly interdependent. When features carry nearly the same information, a situation known as multicollinearity, the underlying calculations can become unstable. By identifying and eliminating redundant or correlated features, we can ensure that each feature in our dataset offers unique and valuable information, which improves the predictive model's performance.
In statistics, correlation is a term that indicates the degree to which two variables move in relation to each other. If two features are highly correlated, they carry similar information.
In the context of our Titanic dataset, let's consider the `pclass` (passenger class) and `fare` (ticket cost) columns. Intuitively, passengers in the highest class (1st) would have paid a higher fare. Therefore, these two columns are likely to be strongly correlated.
To quantify this relationship, we use the correlation coefficient, a value between -1 and 1. If the correlation coefficient is close to 1, it indicates a strong positive correlation. Conversely, a coefficient near -1 indicates a strong negative correlation. A coefficient close to zero suggests little or no linear correlation.
To calculate the correlation between features in our dataset, we use the `corr()` method provided by the Pandas library. Let's see how it's done:
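A minimal, self-contained sketch, assuming the data is loaded from Seaborn's built-in Titanic dataset into a DataFrame named `titanic_df` (how you load the data may differ):

```python
import seaborn as sns

# Load the Titanic dataset into a DataFrame (assumed source; yours may differ)
titanic_df = sns.load_dataset("titanic")

# Compute pairwise correlation coefficients for the numeric columns only
correlation_matrix = titanic_df.corr(numeric_only=True)
print(correlation_matrix)
```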
Here, `titanic_df.corr(numeric_only=True)` returns a DataFrame with the correlation coefficients between all pairs of numeric columns in `titanic_df`.
This correlation matrix displays the relationship between each pair of numerical columns. For instance, the correlation between `fare` and `pclass` is `-0.549500`. The negative sign indicates a negative correlation: as the fare increases, the `pclass` value decreases (toward 1st class), which is consistent with our initial assumption.
A correlation matrix can be difficult to read and understand, especially when we have many features. To ease this process, we can visualize the matrix using a heatmap with the help of the `seaborn` library.
The `heatmap()` function in Seaborn provides a graphical representation of the correlation matrix where colors represent values:
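A minimal sketch, reusing the `correlation_matrix` computed above; `annot=True` is an illustrative choice that prints each coefficient inside its cell:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Render the correlation matrix as a color-coded grid;
# annot=True writes the coefficient values inside the cells
sns.heatmap(correlation_matrix, annot=True)
plt.show()
```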
In the heatmap, a dark color represents a strong negative correlation, and a light color represents a strong positive correlation.
For example, `fare` and `pclass` are displayed in a dark color, meaning they have a strong negative correlation close to `-0.55`. This observation from the heatmap aligns with our initial assumption: as the `pclass` value decreases (from 3rd class to 1st class), the ticket fare increases.
When two features are highly correlated, they carry similar information; hence, one can be removed without losing important information.
Here's how to drop a column in a Pandas DataFrame:
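A short sketch, continuing with the `titanic_df` DataFrame from above:

```python
# Drop the fare column; axis=1 targets columns (axis=0 would target rows)
clean_df = titanic_df.drop("fare", axis=1)

# Verify that fare is no longer present
print(clean_df.columns)
```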
In this case, we are dropping the `fare` column. The `axis=1` parameter indicates that we want to drop a column (to drop a row, we would use `axis=0`). The resulting `clean_df` DataFrame contains all the original columns except `fare`.
We removed the `fare` column because it's highly correlated with `pclass`: ticket prices commonly depend on the passenger class.
Congratulations! You've learned how to handle redundant or correlated features in a dataset—a crucial step in preparing your data for machine learning models. You've tackled correlation, heatmaps, and the practical aspect of dropping a highly correlated feature.
This skill helps streamline the features used for modeling, which can have a huge impact on the final model's performance and interpretability.
Now, it's time to solidify your understanding through hands-on practice! The following exercises will provide real-world challenges to test and augment your newly gained knowledge. Let's dive deep into data cleaning with Python!
