Welcome to our lesson on handling categorical data with Pandas! We're diving into a critical aspect of data manipulation. Data comes in various types, and one of the most important is categorical data: data divided into specific categories.
By the end of this lesson, you'll understand how to convert columns in a DataFrame to categorical types, why it's important, and how to verify the conversion. We'll also see an example of encoding categorical data efficiently. Let's get started!
Categorical data can be divided into groups or categories. It's like sorting toys into different bins: one for cars, one for dolls, and one for blocks. In real-life data, examples include gender (male or female), class (first, second, third), or colors (red, blue, green).
In Pandas, categorical data can make computations faster and save memory. It's like organizing toys so you can find the one you need quickly!
Starting with this lesson, we will occasionally work with real data, not just toy examples. Meet the famous Titanic dataset, which contains information about the Titanic's passengers and whether they survived. The dataset mainly comprises passenger demographics and travel details, which can be used to predict passenger survival. For instance, it includes features such as the ticket fare, the passenger's class, and the passenger's age.
This dataset has multiple categorical columns. The most straightforward example is the `'sex'` column, which contains either `"male"` or `"female"`.
So why convert data to categorical types?
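- Memory Efficiency: A categorical column stores each distinct value once and represents every row with a small integer code, so it usually takes less memory than the same data held as strings. The saving is largest when a column has few distinct values relative to its number of rows.
- Performance: Operations such as grouping and comparison are often faster on categorical data than on string data because they work on the integer codes.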
- Clarity: It indicates that a column contains specific categories rather than free text.
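To make the memory point concrete, here is a minimal sketch that compares the footprint of the same values stored as strings and as a categorical. The Series is hypothetical (100,000 rows with two distinct values), not the Titanic data, and the exact byte counts will vary with your Pandas version.

```python
import pandas as pd

# Hypothetical data: 100,000 rows with only two distinct string values.
sex_as_object = pd.Series(["male", "female"] * 50_000, dtype="object")
sex_as_category = sex_as_object.astype("category")

# deep=True also counts the Python string objects, not just the pointers to them.
object_bytes = sex_as_object.memory_usage(deep=True)
category_bytes = sex_as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

The categorical version should come out far smaller, because each row holds a one-byte code instead of a reference to a full string.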
Let's see a practical example using the Titanic dataset, which contains passenger details like gender and class. By converting columns like `sex` and `class` to categorical types, we can make operations more efficient.
Let's convert DataFrame columns to categorical types using the Titanic dataset. We'll use the `.astype()` method in Pandas.
We start by loading the Titanic dataset and printing its summary information. In this lesson, we will focus only on the `sex` and `class` columns, which contain each passenger's sex and ticket class, respectively.
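The loading step might look roughly like the sketch below. Since the lesson's data source is not shown here, the example builds a tiny stand-in DataFrame with made-up rows; the column names `survived` and `fare` and all values are assumptions for illustration. With a real copy of the dataset, you would typically replace the constructor with something like `pd.read_csv` on your own file.

```python
import pandas as pd

# Stand-in sample with made-up rows; not the real Titanic data.
# dtype="object" mirrors how string columns traditionally arrive from a CSV.
titanic = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "sex": pd.Series(["male", "female", "female", "male"], dtype="object"),
    "class": pd.Series(["third", "first", "third", "second"], dtype="object"),
    "fare": [7.25, 71.28, 7.92, 13.00],
})

# Summary of column names, non-null counts, and dtypes.
titanic.info()

# The two columns this lesson focuses on.
print(titanic[["sex", "class"]].dtypes)
```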
Now let's convert the `sex` and `class` columns and reprint the DataFrame information.
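A minimal sketch of this conversion, again on a small made-up stand-in for the Titanic data rather than the real dataset:

```python
import pandas as pd

# Stand-in sample with made-up rows; not the real Titanic data.
titanic = pd.DataFrame({
    "sex": pd.Series(["male", "female", "female", "male"], dtype="object"),
    "class": pd.Series(["third", "first", "third", "second"], dtype="object"),
})

# Convert each column to the categorical dtype.
titanic["sex"] = titanic["sex"].astype("category")
titanic["class"] = titanic["class"].astype("category")

# Reprint the summary; both columns should now be listed as "category".
titanic.info()

# The distinct values Pandas found for each column.
print(titanic["sex"].cat.categories)
print(titanic["class"].cat.categories)
```

If you prefer a single call, `titanic.astype({"sex": "category", "class": "category"})` returns a converted copy of the whole DataFrame.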
Notice how `sex` and `class` changed from `object` to `category`. This confirms the conversion was successful. Pandas now treats these columns as categorical data, optimizing memory and performance.
Sometimes you must convert categorical data to numeric codes for machine learning models. Let's see how to encode the `sex` column with label encoding. It is the simplest encoding: it replaces each category with an integer, for example `female` with `0` and `male` with `1`.
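`cat.codes` is an attribute available through the `.cat` accessor of a categorical Series. It returns an integer code for each value, which is that value's position in the list of categories. For example, if the categories are `['male', 'female']` in that order, `male` becomes `0` and `female` becomes `1`. Note that `astype('category')` sorts the categories by default, so on the `sex` column it produces `['female', 'male']`, which maps `female` to `0` and `male` to `1`. Missing values receive the code `-1`.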
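Here is a short sketch of label encoding with `cat.codes`, using a made-up four-row Series in place of the real `sex` column. It also shows one way to control which category gets which code; the explicit ordering is an illustrative choice, not something the lesson prescribes.

```python
import pandas as pd

# Made-up values standing in for the Titanic "sex" column.
raw = pd.Series(["male", "female", "female", "male"], dtype="object")

# Default conversion: categories are sorted, so "female" precedes "male".
sex = raw.astype("category")
print(list(sex.cat.categories))   # ['female', 'male']
print(sex.cat.codes.tolist())     # [1, 0, 0, 1]

# To get male -> 0 and female -> 1, state the category order explicitly.
custom_dtype = pd.CategoricalDtype(categories=["male", "female"])
sex_custom = raw.astype(custom_dtype)
print(sex_custom.cat.codes.tolist())  # [0, 1, 1, 0]
```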
Now, let's see an example of one-hot encoding. This encoding will create a separate column for each category.
The `pd.get_dummies` function creates a separate DataFrame with the encoded values, performing the one-hot encoding. Next, we append this new DataFrame to the original one using the `concat` function. One-hot encoding creates a new column for each category of `class` (e.g., `class_first`, `class_second`, `class_third`), with binary values indicating whether each record belongs to that category.
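The two steps can be sketched as follows, once more on a made-up stand-in for the Titanic data. Passing `dtype=int` is a choice made here so the indicators print as 0 and 1; without it, recent Pandas versions return `True`/`False` instead.

```python
import pandas as pd

# Stand-in sample with made-up rows; not the real Titanic data.
titanic = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "class": ["third", "first", "third", "second"],
})

# One indicator column per distinct value of "class", named class_<value>.
class_dummies = pd.get_dummies(titanic["class"], prefix="class", dtype=int)

# axis=1 places the new columns next to the originals, aligned on the row index.
titanic_encoded = pd.concat([titanic, class_dummies], axis=1)

print(titanic_encoded)
```

Each row has exactly one `1` among the three new columns, marking the class that passenger belongs to.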
Today, we've learned:
- What categorical data is: Data divided into specific categories.
- Why it's beneficial to convert to categorical types: For memory efficiency and better performance.
- How to perform the conversion: Using the `astype('category')` method in Pandas.
- Encoding examples: Label encoding and one-hot encoding to convert categories into numeric forms.
Now it's time to get hands-on! In the upcoming practice tasks, you'll apply what you've learned. You'll convert columns to categorical types and practice encoding them. This practice will solidify your understanding and build confidence in handling categorical data in Pandas. Let's dive in!
