Welcome to this crucial lesson! Today, we're exploring the data preprocessing steps you need to master before setting up any Machine Learning model, including neural networks. The preprocessing phase is especially important when dealing with image data, so let's look at why.
Many Machine Learning models require the data to be in a format where each row represents a sample and each column represents a feature. In our scenario, however, each image arrives as a 2D array (think of it as a grid of pixel values), so we need to convert it into a 1D array. This preprocessing step is widely known as flattening.

In Python, this conversion is accomplished with NumPy's reshape operation.
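As a minimal sketch, here is how flattening might look on a hypothetical stack of grayscale images (the array shapes below are illustrative, not from the original lesson):

```python
import numpy as np

# Hypothetical batch of 100 grayscale images, each 8x8 pixels
images = np.random.rand(100, 8, 8)
n_samples = images.shape[0]

# Flatten each 8x8 image into a 64-element feature vector.
# -1 tells NumPy to infer that dimension (here 8 * 8 = 64).
data = images.reshape(n_samples, -1)
print(data.shape)  # (100, 64)
```

Each row of `data` is now one sample, and each column one pixel feature, which is the layout most models expect.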
In a call like images.reshape(n_samples, -1), n_samples denotes the number of samples in our dataset, and -1 tells NumPy to infer the length of that dimension: it computes the size that keeps the total number of elements the same as in the original array.
After reshaping our data into a compatible format, the next preprocessing step is to split it into training and test sets. This step is vital because it lets us assess the model's performance on unseen data, helping us detect overfitting.
Scikit-learn provides a convenient way to do this with the train_test_split() function, which randomly divides the data into training and testing sets.
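A short sketch of a typical split, using made-up data shapes for illustration (the 25% test size and fixed random_state are assumed defaults for this example, not values from the lesson):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical flattened data: 100 samples with 64 features, plus labels
X = np.random.rand(100, 64)
y = np.random.randint(0, 10, size=100)

# Hold out 25% of the samples for testing; random_state makes
# the split reproducible across runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)  # (75, 64) (25, 64)
```

The model is then fit on X_train/y_train only, and X_test/y_test are reserved for the final evaluation.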
