Introduction to Categorical Data

Hello, Space Voyager! Today, we're venturing through a fascinating territory: Categorical Data Encoding! Categorical data describes group membership or traits rather than quantities, in features such as "gender", "marital status", or "hometown". Because most machine-learning models expect numeric input, we convert categories into numbers using techniques such as Label Encoding and One-Hot Encoding.

Concept of Label Encoding

Label Encoding maps each category to an integer from 0 through N-1, where N is the number of unique categories. It suits ordinal data, where the categories have a natural order, such as "Small", "Medium", and "Large".

To illustrate, here is a Python list of shirt sizes:
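A minimal sketch of such a list follows. The particular values and the names `sizes` and `unique_sizes` are assumptions chosen for illustration, not the original listing.

```python
# Hypothetical sample data: shirt sizes, which have a natural order
sizes = ["Small", "Medium", "Large", "Medium", "Small"]

# The distinct categories, regardless of how often each one occurs
unique_sizes = set(sizes)
```

Here N, the number of unique categories, is 3, so Label Encoding would use the integers 0 through 2.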

Python's Pandas library can be used to assign 0 to "Small", 1 to "Medium", and 2 to "Large":
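One way this could look is sketched below. The sample values, the column names `size` and `size_encoded`, and the variable name `size_mapping` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data (values are assumed for illustration)
sizes = ["Small", "Medium", "Large", "Medium", "Small"]
df = pd.DataFrame({"size": sizes})

# Explicit mapping that preserves the natural order of the sizes
size_mapping = {"Small": 0, "Medium": 1, "Large": 2}

# Series.map looks up each value in the dictionary;
# a value missing from the mapping would become NaN
df["size_encoded"] = df["size"].map(size_mapping)
```

Writing the mapping by hand keeps the order under our control, so "Small" < "Medium" < "Large" is reflected in the numbers.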

In this example, the mapping is defined as a dictionary, the most natural structure for it. We then apply it to the column with the pandas Series .map method, which replaces each value with its dictionary entry.

Concept of One-Hot Encoding

One-Hot Encoding creates a new column for each category, placing a 1 in the column that matches the row's category and 0s elsewhere. It's preferred for nominal data, where order doesn't matter, such as "Red", "Green", "Blue".

You can perform one-hot encoding with the pandas function pd.get_dummies():
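A minimal sketch, assuming a small hypothetical DataFrame with a single `color` column; the data and the `color` prefix are illustrative choices.

```python
import pandas as pd

# Hypothetical sample data: colors have no inherent order
df = pd.DataFrame({"color": ["Red", "Green", "Blue", "Green"]})

# One indicator column per distinct color. dtype=int requests 0/1 values,
# since recent pandas versions return booleans by default.
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)
```

Each row of `one_hot` contains exactly one 1, in the column named after that row's color.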

Why One-Hot Encoding?

Because One-Hot Encoding turns each category value into its own column holding a 1 or 0 (True/False), it does not impose an ordinal relationship where none exists. Labels like 'Red', 'Blue', and 'Green' are a typical case: each category is distinct, and there is no order among them. Label Encoding them would assign numbers such as 0, 1, and 2, which a model can read as a ranking or as distances between colors, while One-Hot Encoding avoids that. This matters most for models that treat inputs as magnitudes, such as linear or distance-based models.
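To make the point about implied order concrete, the sketch below compares the numeric gap between colors under both encodings. The specific integer labels and the names `label_gaps` and `one_hot_gaps` are assumptions for illustration.

```python
from itertools import combinations
from math import dist

# Hypothetical encodings of the same three nominal categories
label = {"Red": 0, "Green": 1, "Blue": 2}
one_hot = {"Red": (1, 0, 0), "Green": (0, 1, 0), "Blue": (0, 0, 1)}

# Gap between every pair of colors under label encoding
label_gaps = {(a, b): abs(label[a] - label[b]) for a, b in combinations(label, 2)}

# Euclidean distance between every pair under one-hot encoding
one_hot_gaps = {(a, b): dist(one_hot[a], one_hot[b]) for a, b in combinations(one_hot, 2)}
```

With the label encoding, "Red" ends up twice as far from "Blue" as from "Green", a relationship that exists only because of the numbers chosen. With the one-hot vectors, every pair of colors is the same distance apart.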

Categorical Data Encoding Pitfalls

Finally, let's address the potential pitfalls of encoding. Label Encoding can create an unintended order, which may mislead our model. One-Hot Encoding adds one column per unique category, so a feature with many categories can inflate the dataset and slow down training. Consider merging select categories or using different encoding techniques to combat these issues.

For instance, the 'Species' feature in an 'Animal Shelter' dataset can be restructured to address such problems. Instead of Label Encoding or One-Hot Encoding each unique species like 'Dog', 'Cat', 'Rabbit', and 'Bird', we can merge 'Dog' and 'Cat' into a new category 'Pet', and 'Rabbit' and 'Bird' into 'Other'. This technique reduces our feature's unique categories, making it more model-friendly.
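The merge described above can be sketched as follows. The sample rows, the `SpeciesGroup` column name, and the `species_groups` dictionary are assumptions for illustration; only the grouping itself comes from the text.

```python
import pandas as pd

# Hypothetical animal shelter data
df = pd.DataFrame({"Species": ["Dog", "Cat", "Rabbit", "Bird", "Dog"]})

# Grouping taken from the text: Dog/Cat -> Pet, Rabbit/Bird -> Other
species_groups = {"Dog": "Pet", "Cat": "Pet", "Rabbit": "Other", "Bird": "Other"}
df["SpeciesGroup"] = df["Species"].map(species_groups)

# Compare how many one-hot columns each version of the feature needs
columns_before = pd.get_dummies(df["Species"]).shape[1]
columns_after = pd.get_dummies(df["SpeciesGroup"]).shape[1]
```

In this small sample the one-hot representation shrinks from four columns to two. The trade-off is that the model can no longer tell a dog from a cat, so merging is only worthwhile when that distinction is not needed.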

Wrapping Up

Bravo, Voyager! You've navigated through the realm of Categorical Data Encoding, learning Label and One-Hot Encoding and gaining insights into pitfalls and best practices. These acquired skills are bona fide assets! Now, gear up for some hands-on work, where you'll practice your new encoding skills on real-world datasets. See you there!
