Let's dive into Principal Component Analysis (PCA), a technique often used in machine learning to simplify complex data while keeping important details. PCA transforms a dataset with many correlated features into a dataset of uncorrelated components, ordered by how much of the data's variance each one captures. Think of it like organizing a messy room and putting everything in clear, separate bins.
We can start using PCA by creating our own small dataset. For this lesson, we'll make a 3D (three-dimensional) dataset of 200 points:
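The lesson's original snippet is not reproduced here, so the following is only a minimal sketch of how such a dataset could be built with NumPy. The seed, the mixing matrix, and the offsets are illustrative assumptions chosen so that the three features come out correlated, which gives PCA something to untangle.

```python
import numpy as np

# Assumption: the seed, mixing matrix, and offsets below are illustrative,
# not values taken from the lesson.
rng = np.random.default_rng(42)

n_points = 200
# Start from independent standard-normal values, one column per hidden factor.
latent = rng.normal(size=(n_points, 3))

# Mix the factors so the resulting features share information (are correlated).
mixing = np.array([
    [2.0, 0.0, 0.0],
    [1.5, 0.5, 0.0],
    [1.0, 0.3, 0.2],
])
offsets = np.array([5.0, -3.0, 10.0])

# Each row of X is one 3D point; each column is one feature.
X = latent @ mixing.T + offsets

print(X.shape)  # (200, 3)
```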
Before applying PCA, we need to standardize the features so that a feature measured on a larger scale does not dominate the result. This means rescaling every feature so that its average value is 0 and its spread (standard deviation) is 1:
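A minimal sketch of this standardization, assuming the dataset is a NumPy array with one row per point and one column per feature. The stand-in data and the names `X` and `X_standardized` are hypothetical and are there only so the snippet runs on its own.

```python
import numpy as np

# Stand-in data (hypothetical values): 200 points, 3 features on different scales.
rng = np.random.default_rng(0)
X = rng.normal(loc=[5.0, -3.0, 10.0], scale=[2.0, 0.5, 4.0], size=(200, 3))

# Per-feature statistics: axis=0 aggregates over the rows (points).
mean = np.mean(X, axis=0)
std = np.std(X, axis=0)

# Center each feature at 0 and scale it to unit standard deviation.
X_standardized = (X - mean) / std
```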
The standardization computes each feature's average (np.mean) and spread (np.std), then subtracts the average from every point and divides by the spread.
