Lesson Introduction

Welcome to our lesson on handling categorical data with Pandas! We're diving into a critical aspect of data manipulation. Data comes in various types, and one of the crucial types is categorical data — data divided into specific categories.

By the end of this lesson, you'll understand how to convert columns in a DataFrame to categorical types, why it's important, and how to verify the conversion. We'll also see an example of encoding categorical data efficiently. Let's get started!

Understanding Categorical Data

Categorical data can be divided into groups or categories. It's like sorting toys into different bins: one for cars, one for dolls, and one for blocks. In real-life data, examples include gender (male or female), class (first, second, third), or colors (red, blue, green).

In Pandas, categorical data can make computations faster and save memory. It's like organizing toys so you can find the one you need quickly!

Starting with this lesson, we will occasionally work with real data, not just toy examples. Meet the famous Titanic dataset, which contains information about the Titanic's passengers and whether they survived. It mainly comprises the passengers' demographics and travel details, which can be used to predict passenger survival. For instance, it includes features such as the ticket fare, the passenger's class, and the passenger's age.

This dataset has multiple categorical columns. The most straightforward example is the 'sex' column, which contains either "male" or "female".

Why Convert to Categorical Data

So why convert data to categorical types?

  1. Memory Efficiency: A categorical column stores each distinct value only once and represents every row with a small integer code, so it usually takes less memory than a string column when values repeat often.
  2. Performance: Operations such as comparisons and grouping are often faster than on string data because they work on the integer codes.
  3. Clarity: It indicates that a column contains specific categories rather than free text.
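The memory point can be checked with a quick sketch. The series below is a hypothetical stand-in for a column with many rows and only two distinct values, not the actual Titanic data, and the exact byte counts will vary with your Pandas version.

```python
import pandas as pd

# Hypothetical sample: 10,000 rows but only two distinct string values.
sex_as_strings = pd.Series(["male", "female"] * 5_000)
sex_as_category = sex_as_strings.astype("category")

# deep=True asks Pandas to count the string contents as well.
bytes_as_strings = sex_as_strings.memory_usage(deep=True)
bytes_as_category = sex_as_category.memory_usage(deep=True)

print(f"string column:   {bytes_as_strings} bytes")
print(f"category column: {bytes_as_category} bytes")
```

The saving comes from the repetition: the fewer distinct values a column has relative to its length, the more the categorical type helps.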

Let's see a practical example using the Titanic dataset, which contains passenger details like gender and class. By converting columns like sex and class to categorical types, we can make operations more efficient.

Identifying Categorical Data

Let's convert DataFrame columns to categorical types using the Titanic dataset. We'll use the .astype() method in Pandas.

First, we load the Titanic dataset and print its info. In this lesson, we will focus only on the sex and class columns, which contain each passenger's sex and ticket class, respectively.
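As a minimal sketch of this step, the snippet below builds a tiny hand-made DataFrame as a stand-in for the Titanic data and prints its info. The rows and the variable name titanic are assumptions for illustration; in practice you would load the full dataset from a file or a library instead.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the Titanic dataset.
titanic = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "sex": ["male", "female", "female", "male"],
    "class": ["Third", "First", "Third", "Second"],
    "age": [22.0, 38.0, 26.0, 35.0],
    "fare": [7.25, 71.28, 7.92, 8.05],
})

# Print column names, non-null counts, and dtypes.
titanic.info()

# Look at the two columns this lesson focuses on.
print(titanic[["sex", "class"]])
```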

How to Convert Columns to Categorical Types

Now let's convert the sex and class columns and reprint the DataFrame information.

Notice how sex and class changed from object to category, which confirms that the conversion was successful. Pandas now treats these columns as categorical data, optimizing memory and performance.
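The conversion and the check might look like the sketch below. It reuses a small hypothetical sample rather than the real dataset, and the variable name titanic is an assumption.

```python
import pandas as pd

# Hypothetical stand-in for the two Titanic columns of interest.
titanic = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "class": ["Third", "First", "Third", "Second"],
})

# Before: both columns hold plain strings.
print(titanic.dtypes)

# Convert each column with astype('category').
titanic["sex"] = titanic["sex"].astype("category")
titanic["class"] = titanic["class"].astype("category")

# After: both columns report the category dtype.
print(titanic.dtypes)
```

Calling titanic.info() again would show the same dtype change alongside the memory usage.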

Encoding Examples: Label Encoding

Sometimes you need to convert categorical data to numeric codes for machine learning models. Let's see how to encode the sex column with label encoding. It is the simplest encoding: it replaces each category with an integer, for example, female with 0 and male with 1.

cat.codes is available through the .cat accessor of a categorical Series and returns the integer code of each value's category. When Pandas infers the categories with astype('category'), it sorts them, so for the sex column the categories are ['female', 'male'], and female is encoded as 0 and male as 1. Missing values are encoded as -1.
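Here is a small sketch of label encoding with cat.codes. The sample values and the name sex_encoded are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample of the sex column, converted to the category dtype.
sex = pd.Series(["male", "female", "female", "male"]).astype("category")

# Inferred categories are sorted, so 'female' comes before 'male'.
print(list(sex.cat.categories))  # ['female', 'male']

# Each value is replaced by the position of its category.
sex_encoded = sex.cat.codes
print(sex_encoded.tolist())  # [1, 0, 0, 1]
```

Note that the codes follow the category order, not the order in which values first appear in the data.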

Encoding Examples: One-Hot Encoding

Now, let's see an example of one-hot encoding. This encoding will create a separate column for each category.

The pd.get_dummies function performs the one-hot encoding and returns a new DataFrame with the encoded values. Next, we append this new DataFrame to the original one column-wise using the pd.concat function with axis=1. One-hot encoding creates a new column for each category of class (e.g., class_first, class_second, class_third), with binary values indicating whether the record belongs to that category.
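A minimal sketch of these two steps follows. The sample rows and the names class_dummies and titanic_encoded are assumptions, and the exact column names depend on how the categories are spelled in your data.

```python
import pandas as pd

# Hypothetical stand-in for the two Titanic columns of interest.
titanic = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "class": ["Third", "First", "Third", "Second"],
})
titanic["class"] = titanic["class"].astype("category")

# One column per category; prefix='class' gives names like 'class_First'.
# Recent Pandas versions fill these columns with True/False values.
class_dummies = pd.get_dummies(titanic["class"], prefix="class")

# axis=1 places the new columns next to the original ones.
titanic_encoded = pd.concat([titanic, class_dummies], axis=1)
print(titanic_encoded)
```

Each row has exactly one of the new columns set, which is what gives one-hot encoding its name.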

Lesson Summary

Today, we've learned:

  1. What categorical data is: Data divided into specific categories.
  2. Why it's beneficial to convert to categorical types: For memory efficiency and better performance.
  3. How to perform the conversion: Using the astype('category') method in Pandas.
  4. Encoding examples: Label encoding and one-hot encoding to convert categories into numeric forms.

Now it's time to get hands-on! In the upcoming practice tasks, you'll apply what you've learned. You'll convert columns to categorical types and practice encoding them. This practice will solidify your understanding and build confidence in handling categorical data in Pandas. Let's dive in!
