Dimensionality Reduction with PCA

Welcome to the lesson on Dimensionality Reduction with PCA. As you've progressed through the course, you've learned to clean data and select important features using various methods, such as statistical tests and models like Random Forests. In this lesson, we shift focus to Principal Component Analysis (PCA), a technique that reduces the number of features in your data while keeping most of its important information.

This method is particularly helpful when you have a dataset with lots of features, as it can reduce complexity and potentially improve the performance of your models. By identifying and keeping only the essential parts, PCA allows you to simplify your data efficiently for further analysis.

Understanding Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a bit different from other feature selection tools. Instead of choosing or ranking existing features, PCA transforms your data into a new set of variables called principal components. Each component is a linear combination of the original features; the components are uncorrelated with one another, and each captures a portion of the original data's variance.

Since variance in data reflects the information it holds, the principal components are ordered from most to least variance, so the first few components capture the most important perspectives on your original data. This transformation removes redundant information and highlights the dominant patterns in large datasets, making PCA a valuable tool for reducing complexity.

Preparing the Feature Set for PCA

Before applying PCA, we need to prepare our data by excluding the target variable. Unlike other methods, PCA doesn't require a target variable because it focuses solely on transforming the features to capture as much information as possible through the principal components. Let's revisit our Titanic dataset and prepare the feature set by excluding the 'survived' column.
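A minimal sketch of this preparation is shown below. It loads the data fresh through seaborn and does its own light cleanup (column selection, dropping missing values, one-hot encoding), so treat it as a stand-in for however your course environment provides the cleaned Titanic DataFrame:

```python
import seaborn as sns
import pandas as pd

# Load the Titanic data (a stand-in for the course's prepared dataset)
titanic = sns.load_dataset('titanic')

# Keep a few informative columns, drop rows with missing values,
# and one-hot encode the categorical columns so every feature is numeric
cols = ['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']
df = pd.get_dummies(titanic[cols].dropna(), columns=['sex', 'embarked'], drop_first=True)

# Exclude the target variable -- PCA only transforms the features
X = df.drop(columns=['survived'])
print(X.head())
```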

With the feature set prepared, we can proceed to the next step: scaling.

Scaling Features for PCA

It’s important to standardize your data before applying PCA to ensure that each feature contributes equally to the analysis. Without proper scaling, features with larger ranges could dominate the PCA. We'll use StandardScaler() from sklearn to transform these features into a standard scale with a mean of 0 and a variance of 1, ensuring an even contribution to PCA's calculations.
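Assuming the feature set from the previous step is stored in X, the scaling step might look like this:

```python
from sklearn.preprocessing import StandardScaler

# Standardize every feature to mean 0 and variance 1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```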

By scaling the features, we achieve balanced and unbiased variance measurement, which is crucial for accurate PCA results.

Implementing PCA on the Dataset

With our data standardized, we can now implement PCA to extract principal components from the normalized dataset.
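A sketch of this step with scikit-learn:

```python
from sklearn.decomposition import PCA

# Fit PCA to the standardized features and project them onto the principal components
pca = PCA()
X_pca = pca.fit_transform(X_scaled)
```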

In this code:

  • We import the PCA class from sklearn.decomposition.
  • We initialize a PCA object pca.
  • Using the fit_transform method, we fit the PCA model to the standardized data X_scaled and transform it. This results in X_pca, which contains the principal components of the original dataset.

Interpreting the Explained Variance Ratio

To understand the importance of each principal component, we calculate and print the explained variance ratio.
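One way to do this looks like the following (the exact print format is a reasonable choice rather than a requirement):

```python
# Proportion of the total variance captured by each principal component
explained_variance = pca.explained_variance_ratio_

# Print each component's share, formatted to three decimal places
for i, ratio in enumerate(explained_variance, start=1):
    print(f"PC{i}: {ratio:.3f}")
```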

This code:

  • Retrieves the explained_variance_ratio_ attribute from the PCA model, which indicates the proportion of the dataset's total variance explained by each principal component.
  • Iterates over the explained_variance array, printing out each principal component number and its corresponding explained variance ratio, formatted to three decimal places.

The output shows how much variance is captured by each principal component:
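```
PC1: 0.281
PC2: 0.200
PC3: 0.141
PC4: 0.138
...
```

(Abridged to the first four components, which are the values discussed below; the format follows the sketch above.)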

Expressed as percentages of the dataset's total variance:

  • PC1 explains 28.1% of the variance.
  • PC2 explains 20.0%.
  • PC3 explains 14.1%.
  • PC4 explains 13.8%.

Together, the first three components account for approximately 62.2% of the total variance (28.1% + 20.0% + 14.1%). Including the fourth component increases the cumulative variance explained to about 76%.

To decide how many principal components are sufficient, consider the trade-off between simplification and information loss. A common practice is to retain enough components to capture around 95% of the total variance, though a lower threshold can be acceptable when simplicity matters more than completeness. In this scenario, the first three or four components, which together capture roughly 62-76% of the variance, may already be enough for exploratory work, since each subsequent component contributes progressively less. This kind of reduction makes modeling and analysis more efficient without significantly sacrificing important data characteristics.
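If you prefer to check such a threshold programmatically rather than by eye, a cumulative sum of the explained variance ratios does the job. The snippet below is a sketch of that idea, and it is how you would arrive at the eight-component figure used in the next section:

```python
import numpy as np

# Running total of variance explained as components are added
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative)

# Smallest number of components whose cumulative share reaches 95%
n_components_95 = int(np.argmax(cumulative >= 0.95)) + 1
print(f"Components needed for 95% of the variance: {n_components_95}")
```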

Creating a New DataFrame with Principal Components

Once you've analyzed the variance explained by each principal component, you can decide on how many components to retain in order to capture enough of the dataset's variability. To achieve approximately 95% of the total variance explained, the first eight principal components suffice.
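A sketch of that step is shown below; the PC1, PC2, ... column names are simply one clear naming choice:

```python
import pandas as pd

# Keep the first eight principal components (roughly 95% of the total variance)
num_components = 8
selected_pcs = X_pca[:, :num_components]

# Name the columns PC1 ... PC8 for clarity
pc_columns = [f"PC{i}" for i in range(1, num_components + 1)]
df_selected_pcs = pd.DataFrame(selected_pcs, columns=pc_columns)

print(df_selected_pcs.head())
```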

In this implementation, we set num_components to 8, aligning with the cumulative variance analysis to retain a substantial portion of the dataset's variability. By selecting the first eight principal components from X_pca, we ensure that the reduced dataset maintains critical information while simplifying the complexity. Column names are designated accordingly for clarity. The resulting DataFrame, df_selected_pcs, encapsulates these principal components, providing an efficient and manageable dataset for further analysis or modeling.

Displaying the first few rows of this new DataFrame illustrates how the original dataset is represented in the reduced feature space of principal components, maintaining most of the data's variability while simplifying dimensionality.

Specifying the Number of Components During Transformation

After analyzing the explained variance ratio, you may decide to specify the number of principal components to retain directly during the PCA transformation. This approach allows you to control the dimensionality of your dataset based on the desired level of variance capture.
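A sketch of this approach:

```python
from sklearn.decomposition import PCA

# Ask PCA for only the first eight components up front
pca_limited = PCA(n_components=8)
X_pca_limited = pca_limited.fit_transform(X_scaled)
```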

In this code, we set n_components to 8, indicating that we want to retain the first eight principal components. We initialize a new PCA object pca_limited with the n_components parameter. The fit_transform method is used to transform the data, resulting in X_pca_limited, which contains only the specified number of principal components.

To create a new DataFrame with these specified principal components, we can follow a similar process as before:
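```python
import pandas as pd

# Name one column per retained component (naming mirrors the earlier sketch)
pc_columns = [f"PC{i}" for i in range(1, pca_limited.n_components_ + 1)]

df_pca_limited = pd.DataFrame(X_pca_limited, columns=pc_columns)
print(df_pca_limited.head())
```

Here, pca_limited.n_components_ (the number of components the fitted model kept) drives the column naming, so the DataFrame stays in sync with whatever value you passed to n_components.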

This implementation defines column names for the new DataFrame based on the number of components retained. It creates df_pca_limited, a DataFrame containing the specified principal components, and displays the first few rows of df_pca_limited to verify the transformation and ensure the dataset is ready for further analysis or modeling.

Conclusion and Summary

Throughout this lesson, you've learned how to implement Principal Component Analysis to reduce data dimensionality efficiently. By standardizing your data, applying PCA, and interpreting the explained variance, you simplified the Titanic dataset without sacrificing essential information. This step into dimensionality reduction sharpens your ability to handle complex datasets, a skill you'll continue to build in the upcoming exercises. Experimenting with PCA on different datasets will solidify your understanding and prepare you for the remaining topics in the course.
