Hello, Space Voyager! Today, we're venturing through a fascinating territory: Categorical Data Encoding! Categorical data consists of groups or traits such as "gender", "marital status", or "hometown". We convert categories into numbers using Label and One-Hot Encoding techniques for our machine-learning mates.
Label Encoding maps categories to numbers ranging from 0 through N-1, where N represents the unique category count. It's beneficial for ordered data like "Small", "Medium", and "Large".
To illustrate, take a Python list of shirt sizes. Python's pandas library can be used to assign 0 to "Small", 1 to "Medium", and 2 to "Large". We define the mapping in the most natural way for it, as a dictionary, and then apply it using the .map method of the DataFrame column.
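The steps above can be sketched as follows. This is a minimal illustration, not the lesson's original listing: the sample values, the column names Size and Size_encoded, and the variable names are assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample data (assumed for illustration).
sizes = ["Small", "Medium", "Large", "Medium", "Small"]
df = pd.DataFrame({"Size": sizes})

# Explicit mapping that preserves the natural order of the sizes.
size_mapping = {"Small": 0, "Medium": 1, "Large": 2}

# Series.map looks up each value in the dictionary;
# any value missing from the mapping would become NaN.
df["Size_encoded"] = df["Size"].map(size_mapping)

print(df)
```

Writing the dictionary by hand keeps the order under our control, which matters for ordered data: an automatic encoder might number the categories alphabetically instead.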
One-Hot Encoding creates additional columns for each category, placing a 1 for the appropriate category and 0s elsewhere. It's preferred for nominal data, where order doesn't matter, such as "Red", "Green", "Blue".
You can perform one-hot encoding with the pandas function pd.get_dummies().
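A minimal sketch of that call is shown here. The Color column and its values are assumed for illustration, and dtype=int is an optional choice made so the output shows 1 and 0 rather than True and False.

```python
import pandas as pd

# Hypothetical nominal feature (assumed for illustration).
df = pd.DataFrame({"Color": ["Red", "Green", "Blue", "Green"]})

# One indicator column per distinct color.
# dtype=int yields 1/0; recent pandas versions default to True/False.
one_hot = pd.get_dummies(df, columns=["Color"], dtype=int)

print(one_hot)
```

Each row contains exactly one 1, in the column that matches its original color.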
Because One-Hot Encoding converts each category value into a new column and assigns a 1 or 0 (True/False) value to that column, it does not impose an ordinal relationship among categories where there is none. This is often the case with labels like 'Red', 'Blue', and 'Green': each category is distinct, and there is no order. Converting these labels into numbers with Label Encoding would imply an order, while One-Hot Encoding does not. That makes One-Hot Encoding the safer choice for nominal features when training machine learning models.
Finally, let's address the potential pitfalls of encoding. Label Encoding can create an unintended order, which may mislead our model. One-Hot Encoding adds one column per unique category, so a feature with many unique categories can inflate the dataset and slow down our model. Consider merging select categories or using different encoding techniques to combat these issues.
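To make the second pitfall concrete, here is a small sketch. The City feature with 1,000 distinct values is a made-up example; the point is only that the one-hot result is as wide as the number of unique categories.

```python
import pandas as pd

# Hypothetical high-cardinality feature: 1,000 distinct city identifiers.
cities = pd.Series([f"city_{i}" for i in range(1000)], name="City")

one_hot = pd.get_dummies(cities)

# One column per unique category, so the width equals the category count.
print(one_hot.shape)  # (1000, 1000)
```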
For instance, the 'Species' feature in an 'Animal Shelter' dataset can be restructured to address such problems. Instead of Label Encoding or One-Hot Encoding each unique species like 'Dog', 'Cat', 'Rabbit', and 'Bird', we can merge 'Dog' and 'Cat' into a new category 'Pet', and 'Rabbit' and 'Bird' into 'Other'. This technique reduces our feature's unique categories, making it more model-friendly.
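One way to sketch this merge is with a dictionary and .map, as in the label-encoding step. The sample rows and the Species_grouped column name are assumptions for the example, not part of an actual 'Animal Shelter' dataset.

```python
import pandas as pd

# Hypothetical sample rows (assumed for illustration).
df = pd.DataFrame({"Species": ["Dog", "Cat", "Rabbit", "Bird", "Dog"]})

# Collapse four species into the two broader groups described above.
species_groups = {"Dog": "Pet", "Cat": "Pet", "Rabbit": "Other", "Bird": "Other"}
df["Species_grouped"] = df["Species"].map(species_groups)

print(df["Species_grouped"].nunique())  # 2
```

The grouped column can then be one-hot encoded into two columns instead of four.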
Bravo, Voyager! You've navigated through the realm of Categorical Data Encoding, mastering Label and One-Hot Encoding, and gaining insights on pitfalls and best practices. These acquired skills are bona fide assets! Now, gear up for some hands-on work, where you'll practice your newly learned encoding skills with real-world datasets. See you there!
