Lesson 1
Categorical Data Encoding Techniques

Welcome to the first lesson of the Shaping and Transforming Features course. In this lesson, we'll explore the significance of categorical data encoding. Categorical data represent distinct categories or labels that are not numerical. While they may appear straightforward to the human eye, these types of data require special handling when it comes to feeding them into machine learning models.

Encoding categorical data is essential because it transforms textual labels into a numerical form that models can interpret. This step is a crucial part of the feature engineering process. Throughout this lesson, we'll use the Titanic dataset, which will help clarify how encoding techniques are applied in real-world scenarios.

Understanding Categorical Data and Data Encoding

Categorical data refers to variables that contain label values rather than numeric ones. These can be divided into two types: nominal and ordinal. Nominal data are categories with no specific order (e.g., gender or car brands), while ordinal data have a clear rank or order (e.g., education level).

Data encoding is the process of converting categorical data into a numerical format so that machine learning models can process it. Encoding is essential because many algorithms require numerical input to perform mathematical calculations. Two common encoding methods are one-hot encoding and label encoding.

  • One-Hot Encoding transforms each category into a binary vector where each category corresponds to a column with boolean values. For example, for a color attribute with values like red, green, blue, one-hot encoding would create three columns: red, green, and blue, with a '1' in the appropriate column.

  • Label Encoding assigns a unique integer to each category value. Using the previous example of colors, red might be encoded as 0, green as 1, and blue as 2. However, this method might introduce artificial ordinal relationships, which can be misleading if the categories do not have a natural rank.

These encoding techniques prepare categorical data for numeric-based machine learning algorithms, ensuring the data is represented correctly without losing valuable categorical information.
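To make the contrast concrete, here is a minimal sketch of both methods applied to a toy color column (an invented example, not from the Titanic dataset), using only pandas built-ins:

```python
import pandas as pd

# Toy nominal feature: color labels with no natural order
colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one boolean column per category
onehot = pd.get_dummies(colors["color"])
print(onehot)

# Label encoding: one integer per category
# (categories are sorted alphabetically: blue=0, green=1, red=2)
labels = colors["color"].astype("category").cat.codes
print(labels.tolist())  # [2, 1, 0, 1]
```

Notice how the label-encoded version implies that red (2) is somehow "greater than" blue (0), which is exactly the artificial ordering the lesson warns about.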

Identifying and Selecting Categorical Features

Before transforming categorical data, we first need to identify and select the categorical features within a dataset. In our case, we'll use the Titanic dataset. You can easily identify categorical columns using pandas by selecting columns with object data types.

Here's how you can accomplish this task with pandas:

Python
import pandas as pd

# Load the dataset
df = pd.read_csv("titanic.csv")

# Select the categorical features
categorical_features = df.select_dtypes(include=['object'])

# Display first 5 rows of categorical data
print("Selected Categorical Data (first 5 rows):")
print(categorical_features.head())

The code above selects columns of type object, typically indicating text data such as sex, embarked, and alive in the Titanic dataset.

Plain text
Selected Categorical Data (first 5 rows):
      sex embarked  class    who deck  embark_town alive
0    male        S  Third    man  NaN  Southampton    no
1  female        C  First  woman    C    Cherbourg   yes
2  female        S  Third  woman  NaN  Southampton   yes
3  female        S  First  woman    C  Southampton   yes
4    male        S  Third    man  NaN  Southampton    no

The output displays the first five rows of the selected categorical features, showing distinct string categories that exist within each column. As we proceed through the lesson, some features we will encode include alive, which indicates survival status, sex, which specifies gender, and embark_town, denoting the port of embarkation.
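One caveat worth knowing: select_dtypes(include=['object']) only catches text columns stored as plain Python objects. If a column has already been converted to pandas' dedicated category dtype, it would be missed. A small sketch with an invented DataFrame (not the lesson's CSV) shows how to capture both:

```python
import pandas as pd

# Hypothetical frame mixing object, category, and numeric dtypes
df = pd.DataFrame({
    "sex": ["male", "female"],                    # object dtype
    "class": pd.Categorical(["Third", "First"]),  # category dtype
    "fare": [7.25, 71.28],                        # numeric dtype
})

# Include both dtypes so no categorical column is missed
cats = df.select_dtypes(include=["object", "category"])
print(list(cats.columns))  # ['sex', 'class']
```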

One-Hot Encoding Using Pandas

One-hot encoding is a common technique for converting categorical data into a form that can be provided to machine learning algorithms. This is done by converting each category value into a new categorical column and assigning a 1 or 0 (True/False).

You can easily perform one-hot encoding in pandas with the get_dummies() function. Let’s demonstrate this by encoding the alive column to see how this method works:

Python
# One-hot encoding using pandas
onehot_encoded_df = pd.get_dummies(categorical_features[['alive']], drop_first=True)
print("\nOne-hot Encoding using pandas for 'alive':")
print(onehot_encoded_df.head())

The get_dummies() function creates a new column for each category in the data and marks it with a 1 or 0. The option drop_first=True helps prevent a problem known as multicollinearity, which occurs when columns are redundant: with a full set of dummy columns, any one column can be perfectly predicted from the others. By dropping the first category, we remove that redundant column while preserving all the information in the data.

Plain text
One-hot Encoding using pandas for 'alive':
   alive_yes
0      False
1       True
2       True
3       True
4      False

In this output, you can see that the alive column is now represented as a boolean (True or False) instead of the original "yes" or "no". Recent versions of pandas return boolean dummy columns by default (older versions used integer 0/1), and either representation works directly in machine learning models.
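If you prefer explicit 0/1 integers over booleans, get_dummies() accepts a dtype argument. A short sketch with a small made-up alive column (assuming a reasonably recent pandas version):

```python
import pandas as pd

# Made-up miniature version of the 'alive' column
df = pd.DataFrame({"alive": ["no", "yes", "yes"]})

# dtype=int asks get_dummies for 0/1 integers instead of booleans
encoded = pd.get_dummies(df[["alive"]], drop_first=True, dtype=int)
print(encoded["alive_yes"].tolist())  # [0, 1, 1]
```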

One-Hot Encoding Using Scikit-learn

In the previous section, we used pandas to perform one-hot encoding, which converted categorical data into a numerical format that can be easily used by machine learning models. Another way to do this is by using the Scikit-learn library, which offers a powerful tool for more advanced encoding needs.

Let's continue with the Titanic dataset and see how to use Scikit-learn's OneHotEncoder to encode the sex column:

Python
from sklearn.preprocessing import OneHotEncoder

# Create an instance of OneHotEncoder using Scikit-learn
onehot_encoder_sklearn = OneHotEncoder(drop='first', sparse_output=False)

# Fit and transform the 'sex' column
onehot_encoded_array = onehot_encoder_sklearn.fit_transform(categorical_features[['sex']])

# Convert the array to a DataFrame
onehot_encoded_df_sklearn = pd.DataFrame(
    onehot_encoded_array,
    columns=onehot_encoder_sklearn.get_feature_names_out(['sex'])
)
print("\nOne-hot Encoding using sklearn for 'sex':")
print(onehot_encoded_df_sklearn.head())
  1. Initializing OneHotEncoder: The first step is to create an instance of OneHotEncoder. The option drop='first' is similar to what we used with pandas and helps avoid multicollinearity by dropping the first category. The sparse_output=False option ensures that the encoded data is easier to read by returning a dense array instead of a sparse matrix.

  2. Transforming the Data: We then use fit_transform() on the sex column to convert categorical values into numerical format. This method returns a NumPy array, which consists of the one-hot encoded values.

  3. Converting to DataFrame: Finally, we convert the transformed array back into a DataFrame. This step is important because working with DataFrames allows for easier handling and integration within data analysis workflows, making it simpler to visualize and manipulate the data alongside the rest of your dataset.

Plain text
One-hot Encoding using sklearn for 'sex':
   sex_male
0       1.0
1       0.0
2       0.0
3       0.0
4       1.0

In the output, you can see that the sex column is now encoded as numeric values (1.0 for male and 0.0 otherwise), making it straightforward for machine learning tools to process. This approach is flexible and highly useful when working with larger datasets or when integrating various transformations into machine learning pipelines.

Label Encoding with Scikit-learn

While one-hot encoding is excellent for handling categorical data, another method called label encoding can be useful, especially for ordinal data or when simplicity is preferred. In label encoding, each category is assigned a unique integer, which can make it easier for some models to work with.

Let's see how we can use Scikit-learn's LabelEncoder to encode the embark_town column from the Titanic dataset:

Python
from sklearn.preprocessing import LabelEncoder

# Create an instance of LabelEncoder using Scikit-learn
label_encoder = LabelEncoder()

# Fit and transform the 'embark_town' column
categorical_features['embark_town_encoded'] = label_encoder.fit_transform(categorical_features['embark_town'])

# Print the label-encoded column
print("\nLabel Encoding using sklearn for 'embark_town':")
print(categorical_features[['embark_town_encoded']].head())
  1. Initializing LabelEncoder: First, we create an instance of LabelEncoder, which will transform the categorical labels into integers.

  2. Transforming the Data: Using fit_transform(), we convert the embark_town column's text categories into numeric values. We then assign the result to a new column, embark_town_encoded, which holds these integer values alongside the original data.

Plain text
Label Encoding using sklearn for 'embark_town':
   embark_town_encoded
0                    2
1                    0
2                    2
3                    2
4                    2

As you can see in the output, each town is assigned a unique integer value. While this method is straightforward, it's important to be cautious, as it might introduce unintended ordinal relationships among categories that aren't naturally ranked. Label encoding is often best for ordinal categorical data where the order of categories holds meaning, or it can be combined with other encoding techniques to achieve more balanced data representation.
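For truly ordinal features, Scikit-learn's OrdinalEncoder lets you spell out the category order explicitly, so the integers reflect the real ranking rather than alphabetical order. A minimal sketch using an invented passenger-class column (not the lesson's exact workflow):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up ordinal feature: passenger class has a natural rank
df = pd.DataFrame({"class": ["Third", "First", "Second", "First"]})

# Pass an explicit ordering so First=0, Second=1, Third=2
encoder = OrdinalEncoder(categories=[["First", "Second", "Third"]])
df["class_encoded"] = encoder.fit_transform(df[["class"]]).ravel()
print(df["class_encoded"].tolist())  # [2.0, 0.0, 1.0, 0.0]
```

Unlike LabelEncoder, which is intended for target labels and sorts categories alphabetically, OrdinalEncoder is designed for feature columns and accepts this explicit ordering.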

Summary and Preparing for Practice

In this lesson, we delved into essential techniques for encoding categorical data, a pivotal aspect of feature engineering. We explored two common methods: one-hot encoding and label encoding, using both pandas and Scikit-learn libraries. Each method serves different purposes, and choosing the right one depends on the nature of your dataset and the requirements of your predictive models.

As you proceed to the practice exercises, you will apply these encoding techniques hands-on, reinforcing your understanding and learning to navigate common data preprocessing challenges. These exercises will help solidify how to transform categorical features into valuable inputs for your machine learning algorithms. Happy encoding!
