Welcome to the lesson on Dimensionality Reduction with PCA. As you've progressed through the course, you've learned to clean and choose important features in data using various methods, such as statistical tests and models like Random Forests. In this lesson, we shift focus to Principal Component Analysis (PCA), a technique that helps shrink the size of your data while keeping most of its important information.
This method is particularly helpful when you have a dataset with lots of features, as it can reduce complexity and potentially improve the performance of your models. By identifying and keeping only the essential parts, PCA allows you to simplify your data efficiently for further analysis.
Principal Component Analysis (PCA) is a bit different from other feature selection tools. Instead of choosing or ranking existing features, PCA changes your data into a set of new variables called principal components. These components are uncorrelated with each other and each one captures a portion of the original data's variation.
Since variation in data reflects the information it holds, the principal components are ordered from most to least variation, meaning they show the most important perspectives of your original data. This transformation helps remove repetitive information and focuses on identifying patterns in large datasets, making PCA a valuable tool for reducing complexity.
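To make the idea concrete, here is a minimal NumPy sketch (separate from the lesson's Titanic workflow, using a small made-up matrix `X_demo`) of what PCA does under the hood: the principal component directions are the eigenvectors of the feature covariance matrix, ordered by the variance they capture.

```python
import numpy as np

# Small made-up data matrix: 5 samples, 3 features (purely illustrative)
X_demo = np.array([[2.5, 2.4, 0.5],
                   [0.5, 0.7, 2.1],
                   [2.2, 2.9, 0.9],
                   [1.9, 2.2, 1.2],
                   [3.1, 3.0, 0.3]])

# Center each feature, then compute the covariance matrix
X_centered = X_demo - X_demo.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# Eigenvectors of the covariance matrix are the principal component directions;
# eigenvalues give the variance captured along each direction
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]   # sort from most to least variance
components = eigenvectors[:, order]

# Project the centered data onto the components to get the transformed data
X_transformed = X_centered @ components
print(eigenvalues[order])
```

Libraries like scikit-learn wrap this whole procedure for you, which is what we use for the rest of the lesson.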
Before applying PCA, we need to prepare our data by excluding the target variable. Unlike other methods, PCA doesn't require a target variable because it focuses solely on transforming the features to capture as much information as possible through the principal components. Let's revisit our Titanic dataset and prepare the feature set by excluding the `'survived'` column.
```python
import pandas as pd

# Load the updated dataset
df = pd.read_csv("titanic_updated.csv")

# Prepare the feature set by excluding the target variable 'survived'
X = df.drop(columns=['survived'])
```
With the feature set prepared, we can proceed to the next step: scaling.
It's important to standardize your data before applying PCA to ensure that each feature contributes equally to the analysis. Without proper scaling, features with larger ranges could dominate the principal components. We'll use `StandardScaler()` from `sklearn` to transform the features onto a standard scale with a mean of 0 and a variance of 1, ensuring an even contribution to PCA's calculations.
```python
from sklearn.preprocessing import StandardScaler

# Initialize a StandardScaler to standardize the features
scaler = StandardScaler()

# Fit the scaler to the feature set and transform it
X_scaled = scaler.fit_transform(X)
```
By scaling the features, we ensure that variance is measured on a comparable footing across all features, which is crucial for accurate PCA results.
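As an optional sanity check, assuming `X_scaled` from the step above, you can confirm that each scaled feature now has a mean of roughly 0 and a standard deviation of roughly 1:

```python
import numpy as np

# Each column of X_scaled should have a mean of about 0 and a standard deviation of about 1
print("Column means:", np.round(X_scaled.mean(axis=0), 6))
print("Column standard deviations:", np.round(X_scaled.std(axis=0), 6))
```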
With our data standardized, we can now implement PCA to extract principal components from the normalized dataset.
```python
from sklearn.decomposition import PCA

# Initialize PCA object for transformation
pca = PCA()

# Fit the PCA model to the scaled data and transform it to get principal components
X_pca = pca.fit_transform(X_scaled)
```
In this code:

- We import the `PCA` class from `sklearn.decomposition`.
- We initialize a PCA object, `pca`.
- Using the `fit_transform` method, we fit the PCA model to the standardized data `X_scaled` and transform it. This results in `X_pca`, which contains the principal components of the original dataset.
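If you want to inspect what `fit_transform` produced, a quick check (assuming `X` and `X_pca` from the steps above) shows that there is one principal component per original feature and that the components are uncorrelated, with pairwise covariances that are numerically zero:

```python
import numpy as np

# X_pca has one column per principal component (same count as the original features)
print("Original shape:", X.shape)
print("Transformed shape:", X_pca.shape)

# Covariances between different principal components should be numerically zero
component_cov = np.cov(X_pca, rowvar=False)
off_diagonal = component_cov - np.diag(np.diag(component_cov))
print("Largest absolute off-diagonal covariance:", np.abs(off_diagonal).max())
```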
To understand the importance of each principal component, we calculate and print the explained variance ratio.
```python
# Calculate explained variance ratio for each principal component
explained_variance = pca.explained_variance_ratio_

# Print explained variance ratio for each component
print("Explained variance ratio by component:")
for i, var in enumerate(explained_variance):
    print(f"PC{i+1}: {var:.3f}")
```
This code:

- Retrieves the `explained_variance_ratio_` attribute from the PCA model, which indicates the proportion of the dataset's total variance explained by each principal component.
- Iterates over the `explained_variance` array, printing each principal component number and its corresponding explained variance ratio, formatted to three decimal places.
The output shows how much variance is captured by each principal component:
```text
Explained variance ratio by component:
PC1: 0.281
PC2: 0.200
PC3: 0.141
PC4: 0.138
PC5: 0.073
PC6: 0.045
PC7: 0.042
PC8: 0.033
PC9: 0.024
PC10: 0.019
PC11: 0.004
PC12: 0.000
PC13: 0.000
```
The explained variance ratio reveals how much of the total variance in the dataset is represented by each principal component:
- PC1 explains 28.1% of the variance.
- PC2 explains 20.0%.
- PC3 explains 14.1%.
- PC4 explains 13.8%.
Together, the first three components account for approximately 62.2% of the total variance (28.1% + 20.0% + 14.1%). Including the fourth component increases the cumulative variance explained to about 76%.
To decide how many principal components are sufficient, consider the trade-off between simplification and information loss. A common practice is to retain enough components to capture around 95% of the total variance. That said, if a coarser summary is acceptable, the first three or four components (roughly 62% to 76% of the variance here) could already be enough to simplify the dataset while maintaining most of its variability, since each subsequent component contributes progressively less to the overall variance. Either way, this reduction in dimensions facilitates more efficient modeling and analysis without significantly sacrificing important data characteristics.
Once you've analyzed the variance explained by each principal component, you can decide on how many components to retain in order to capture enough of the dataset's variability. To achieve approximately 95% of the total variance explained, the first eight principal components suffice.
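You can verify this with a cumulative sum of the explained variance ratios; here is a minimal sketch, assuming the fitted `pca` object from earlier:

```python
import numpy as np

# Cumulative share of variance captured by the first k components
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 95%
n_for_95 = int(np.argmax(cumulative_variance >= 0.95)) + 1
print(f"Components needed for 95% variance: {n_for_95}")
print(f"Variance captured by the first 8 components: {cumulative_variance[7]:.3f}")
```

With that confirmed, we select those eight components from `X_pca`: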
```python
# Define the number of principal components to retain
num_components = 8

# Select only the specified number of principal components from the transformed data
selected_components = X_pca[:, :num_components]

# Define column names for the new DataFrame that will hold the selected principal components
pc_columns = [f"PC{i+1}" for i in range(num_components)]

# Create a new DataFrame with the selected principal components and their respective column names
df_selected_pcs = pd.DataFrame(selected_components, columns=pc_columns)

# Display the first few rows of the new DataFrame to verify its contents
print(df_selected_pcs.head())
```
In this implementation, we set `num_components` to 8, aligning with the cumulative variance analysis to retain a substantial portion of the dataset's variability. By selecting the first eight principal components from `X_pca`, we ensure that the reduced dataset maintains critical information while reducing complexity. Column names are designated accordingly for clarity. The resulting DataFrame, `df_selected_pcs`, encapsulates these principal components, providing an efficient and manageable dataset for further analysis or modeling.
The output of the first few rows of this new DataFrame showcases the transformed data:
```text
        PC1       PC2       PC3  ...       PC6       PC7       PC8
0 -1.639001  1.115591  0.297646  ... -0.587665 -0.323175  0.093521
1  4.161394 -1.290348  0.963495  ... -0.829857 -0.782164 -0.188088
2  0.513964  0.934232 -1.936783  ... -0.124144  0.225471  0.502783
3  3.017344 -0.119894 -1.853349  ... -0.847779 -0.768767 -0.207961
4 -2.229038 -0.261702 -0.442905  ...  0.104084  0.236256  0.238923
```
This output illustrates how the original dataset is represented in the reduced feature space of principal components, maintaining most of the data's variability while simplifying dimensionality.
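To give a rough sense of what "further analysis or modeling" could look like, here is a hedged sketch that reattaches the `survived` column we set aside earlier and fits a simple classifier on the retained components; the use of `LogisticRegression` and an 80/20 split is purely illustrative, not a step prescribed by this lesson.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Reattach the target column we excluded before running PCA
y = df['survived']

# Split the principal-component features and the target, then fit a simple classifier
X_train, X_test, y_train, y_test = train_test_split(
    df_selected_pcs, y, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```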
After analyzing the explained variance ratio, you may decide to specify the number of principal components to retain directly during the PCA transformation. This approach allows you to control the dimensionality of your dataset based on the desired level of variance capture.
```python
from sklearn.decomposition import PCA

# Set the number of principal components to retain
n_components = 8

# Initialize PCA object with the specified number of components
pca_limited = PCA(n_components=n_components)

# Fit the PCA model to the scaled data and transform it to get the specified number of principal components
X_pca_limited = pca_limited.fit_transform(X_scaled)
```
In this code, we set `n_components` to 8, indicating that we want to retain the first eight principal components. We initialize a new PCA object, `pca_limited`, with the `n_components` parameter. The `fit_transform` method is used to transform the data, resulting in `X_pca_limited`, which contains only the specified number of principal components.
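A quick way to confirm that these eight components still capture roughly 95% of the total variance is to sum the fitted model's explained variance ratios:

```python
# Total variance retained by the eight components kept in pca_limited
print(f"Variance retained: {pca_limited.explained_variance_ratio_.sum():.3f}")
```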
To create a new DataFrame with these specified principal components, we can follow a similar process as before:
```python
# Define column names for the new DataFrame that will hold the specified principal components
pc_columns_limited = [f"PC{i+1}" for i in range(n_components)]

# Create a new DataFrame with the specified principal components and their respective column names
df_pca_limited = pd.DataFrame(X_pca_limited, columns=pc_columns_limited)

# Display the first few rows of the new DataFrame to verify its contents
print(df_pca_limited.head())
```
This implementation defines column names for the new DataFrame based on the number of components retained, creates `df_pca_limited`, a DataFrame containing the specified principal components, and displays its first few rows to verify the transformation and ensure the dataset is ready for further analysis or modeling.
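As a side note, scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components`, in which case it automatically keeps however many components are needed to reach that fraction of explained variance. A minimal sketch, assuming `X_scaled` from earlier:

```python
from sklearn.decomposition import PCA

# Keep however many components are needed to explain at least 95% of the variance
pca_auto = PCA(n_components=0.95)
X_pca_auto = pca_auto.fit_transform(X_scaled)

# The number of retained components is chosen automatically during fitting
print(f"Components retained: {pca_auto.n_components_}")
print(f"Total variance explained: {pca_auto.explained_variance_ratio_.sum():.3f}")
```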
Throughout this lesson, you've learned how to implement Principal Component Analysis to reduce data dimensionality efficiently. By standardizing your data, applying PCA, and interpreting explained variance, you simplified the Titanic dataset without sacrificing essential information. This step into dimensionality reduction refines your ability to handle complex datasets, an important skill as you further practice with exercises. Embrace this newfound expertise by experimenting with PCA on different datasets, which will solidify your understanding and prepare you for the remaining topics in the course.