Introduction

In today's lesson, our focus is on preprocessing the Iris dataset for TensorFlow. We will explore several techniques: data splitting, feature scaling, and one-hot encoding. This foundation is invaluable in machine learning, as it builds an understanding of how data must be transformed before we feed it to a neural network. Let's get into it!

Overview of the Iris Dataset

Before we delve into data preprocessing, it is imperative to understand the data we are processing. The Iris dataset comprises measurements of 150 Iris flowers from three different species. Each sample includes the following four features:

  • Sepal length (cm): e.g., 5.1, 4.9, 4.7, etc.
  • Sepal width (cm): e.g., 3.5, 3.0, 3.2, etc.
  • Petal length (cm): e.g., 1.4, 1.4, 1.3, etc.
  • Petal width (cm): e.g., 0.2, 0.2, 0.2, etc.

Additionally, each sample has a class label representing the Iris species. The targets in the dataset are encoded as one of the following integers:

  • Iris setosa: 0
  • Iris versicolor: 1
  • Iris virginica: 2

With these measurements and labels, the Iris dataset becomes a multivariate dataset often used for machine learning introductions.

Insight into Data Preprocessing

Data preprocessing is a crucial step in machine learning. It is the process of converting or mapping data from its initial form into another format to prepare it for the next processing phase. This conversion makes it easier for algorithms to extract information from the data, thereby improving their predictive ability. The preprocessing steps we will cover in today's lesson are loading, splitting, scaling, and encoding the data.

Step 1: Loading the Dataset

Before diving into preprocessing, let's start by loading the Iris dataset. We use the load_iris function from scikit-learn for this purpose. It returns the feature matrix X and the target vector y.
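A minimal sketch of this step might look as follows (the print statements are our addition, for inspecting the dataset's dimensions):

```python
from sklearn.datasets import load_iris

# Load the Iris dataset: X is the feature matrix, y is the target vector
X, y = load_iris(return_X_y=True)

print("Shape of X:", X.shape)
print("Shape of y:", y.shape)
```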

The output will be:
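```
Shape of X: (150, 4)
Shape of y: (150,)
```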

Here, X contains 150 samples, each with 4 features (sepal length, sepal width, petal length, and petal width). The y vector contains 150 class labels, with each label representing one of the three Iris species. This initial step helps us understand the dimensions of our dataset before we proceed with further processing.

Step 2: Splitting into Training and Testing Sets

With the data loaded, the first true preprocessing step is data splitting. We divide the dataset into two parts: a training set and a testing set. The training set is used to train the model, while the testing set validates its performance; splitting the data this way lets us check that our model generalizes to new, unseen data. We use scikit-learn's train_test_split function for this purpose. Its stratify parameter ensures that the proportion of each class in the split datasets matches the original dataset. For our example, we will use 70% of the data for training and 30% for testing.
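A sketch of this step, assuming a fixed random_state for reproducibility (the specific seed value is our choice):

```python
from sklearn.model_selection import train_test_split

# Hold out 30% of the samples for testing; stratify=y keeps the
# class proportions in both splits identical to the full dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
```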

Step 3: Feature Scaling

After splitting the data, we perform feature scaling to normalize the range of the input features. This step is crucial because it ensures all input features are on the same scale, preventing features with larger ranges from dominating those with smaller ones. We achieve this using the StandardScaler from scikit-learn, which standardizes features by centering the data to a mean of 0 and scaling it to unit variance. The fit method calculates the mean and standard deviation from the training data only; fitting on the training set and merely transforming the test set prevents information about the test data from leaking into the model.
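Here is a sketch of the scaling step under those assumptions:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Learn the mean and standard deviation from the training data only,
# then apply the same transformation to both splits
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```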

Step 4: Target Encoding

The final preprocessing step is target encoding. The target variables in the Iris dataset are categorical and must be converted into a format that our machine learning model can utilize. This is done with one-hot encoding, which transforms each categorical label into a binary (0 or 1) vector. For example, a sample labeled 1 (Iris versicolor) is represented as [0, 1, 0] after one-hot encoding. We use the OneHotEncoder from scikit-learn to perform this transformation; its fit method learns the unique categories present in the training data, which are then used for encoding.
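A sketch of the encoding step (note that the dense-output flag is named sparse_output in scikit-learn 1.2 and later, and sparse in earlier versions):

```python
from sklearn.preprocessing import OneHotEncoder

# sparse_output=False returns a dense NumPy array instead of a sparse matrix
encoder = OneHotEncoder(sparse_output=False)

# OneHotEncoder expects a 2D array, so reshape the 1D label vectors
y_train_encoded = encoder.fit_transform(y_train.reshape(-1, 1))
y_test_encoded = encoder.transform(y_test.reshape(-1, 1))
```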

Data Preprocessing in Practice

Below, the preprocessing steps we have covered (loading, splitting, scaling, and encoding) are encapsulated in a single function. This modularization allows us to import the processed data in another file, where we will develop our model.
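One way to package these steps, assuming the function name preprocess_data (our choice; the lesson's actual code may name it differently):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder

def preprocess_data():
    # Step 1: load the features and labels
    X, y = load_iris(return_X_y=True)

    # Step 2: stratified 70/30 train/test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42
    )

    # Step 3: standardize features using statistics from the training set
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # Step 4: one-hot encode the class labels
    encoder = OneHotEncoder(sparse_output=False)
    y_train = encoder.fit_transform(y_train.reshape(-1, 1))
    y_test = encoder.transform(y_test.reshape(-1, 1))

    return X_train, X_test, y_train, y_test
```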

Loading and Printing Preprocessed Data

After defining the function that preprocesses the data, we can load the preprocessed data and print a sample of the training input and target.
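Assuming the preprocess_data function sketched above, this might look like:

```python
# Load the preprocessed data and inspect one training sample
X_train, X_test, y_train, y_test = preprocess_data()

print("First training sample (scaled):", X_train[0])
print("First training target (one-hot):", y_train[0])
```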

The output of the above code will look similar to the following (the exact feature values depend on how the samples are shuffled during the split):
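```
First training sample (scaled): [-1.02  0.85 -1.3  -1.33]
First training target (one-hot): [1. 0. 0.]
```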

This output illustrates the results of our preprocessing steps — scaling of feature data to ensure a standardized dataset and one-hot encoding of target variables to prepare them for machine learning models.

Lesson Summary and Practice

In conclusion, we have successfully preprocessed the Iris dataset, making it ready for machine learning modeling with TensorFlow. We loaded, split, scaled, and encoded the data using Python. This foundational knowledge is essential for a Machine Learning Engineer: well-prepared data improves accuracy and leads to more efficient TensorFlow models.

Next, we will have exercises to consolidate these preprocessing steps. The exercises aim to enhance your understanding and application of data preprocessing and prepare you for more challenging tasks in the future. Happy learning!
