Welcome to this crucial lesson! Today, we're exploring the data preprocessing steps you need to master before setting up any Machine Learning model, including neural networks. The preprocessing phase is especially critical when dealing with image data, so let's delve into why.
Many Machine Learning models require the data to be in a format where each row represents a sample and each column represents a feature. In our scenario, however, each image presents as a 2D array (think of it as a grid of pixel values), so we need to convert each one into a 1D array. This preprocessing step is widely known as flattening.
In Python, this conversion can be accomplished using the `reshape` operation in NumPy. Here's how it's achieved:
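Below is a minimal sketch; the `images` array and its dimensions are hypothetical stand-ins for whatever image data you're working with:

```python
import numpy as np

# Hypothetical stand-in for real image data: 100 grayscale images, 8x8 pixels each.
images = np.random.rand(100, 8, 8)

n_samples = len(images)
# Flatten each 2D image into a single row of pixel values: (100, 8, 8) -> (100, 64)
X = images.reshape((n_samples, -1))
```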
Here, `n_samples` denotes the number of samples in our dataset, and `-1` tells NumPy to infer the length of that dimension: it calculates the size that keeps the total number of elements the same as in the original array.
After reshaping our data into a more compatible format, the next preprocessing step is to split it into training and test sets. This step is vital because it enables us to assess our model's performance on unseen data, thereby helping us detect overfitting.
Scikit-learn provides an efficient way to carry out this operation: the `train_test_split()` function, which splits our data into training and testing sets. The usage is shown below:
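Here's a sketch continuing the example above; the labels `y` and the 20% test size are illustrative choices:

```python
from sklearn.model_selection import train_test_split
import numpy as np

# Hypothetical labels, one per image (digits 0-9 for illustration).
y = np.random.randint(0, 10, size=n_samples)

# Hold out 20% of the samples for testing; shuffle=False keeps the original order.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)
```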
This function divides our dataset into training and testing sets according to the proportion specified by `test_size`. Its `shuffle` parameter decides whether the data should be shuffled before splitting; in this case, we're opting not to shuffle the data.
Once we've reshaped and split our data, the next imperative step is to standardize it for neural networks. Standardization puts all features in our dataset on the same scale, ensuring each feature is treated equally by the model, which typically improves the model's performance.
The `StandardScaler` from the scikit-learn library enables us to standardize our data efficiently:
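A sketch continuing with `X_train` and `X_test` from the split above:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Learn the per-feature mean and standard deviation from the training data, then scale it.
X_train = scaler.fit_transform(X_train)
# Apply the same training statistics to the test data (no refitting).
X_test = scaler.transform(X_test)
```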
This sequence first calculates the per-feature mean and standard deviation of the training data, then subtracts each mean and divides by the corresponding standard deviation, effectively scaling the data. The test set is transformed with the statistics learned from the training set, which keeps information about the test data from leaking into the model.
After concluding the preprocessing steps of reshaping, splitting, and standardization, it is helpful to examine our dataset's characteristics in detail. Python offers numerous ways to explore them, even with simple print statements.
Consider the following commands, which print the shapes of our training and test sets:
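Continuing with the arrays from the sketches above:

```python
print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)
```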
With the hypothetical 100-sample example above, the output would be:
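```
Training set shape: (80, 64)
Test set shape: (20, 64)
```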
The output from these commands provides insight into the number of samples and features present in our training and testing datasets.
Congratulations on reaching the end of this lesson on Data Preprocessing and Transformation! You've acquired essential skills in reshaping image data, splitting it into training and testing sets, standardizing it, and combining these steps into a complete preprocessing workflow.
In the upcoming practice exercises, you will gain hands-on experience with these concepts, strengthening your command of these techniques and deepening your understanding.
We are almost ready to start learning about neural networks! For now, let's practice some of what we've learned above.
