In this lesson, we look at how to encode and transform the categorical data in a dataset. Generating numerical representations makes it possible to build models on datasets that contain categorical variables. This session introduces the different types of categorical encoding, explains when each is used, and shows how to apply them.
Understanding categorical variable encoding is essential for a wide array of machine-learning tasks. Most algorithms operate on numbers and cannot work with text labels the way we do. By converting text values into numbers, we translate the data into a format that algorithms can process - and that's what we will cover in this lesson.
Any guesses on the effects that a passenger's gender or embarkation point might have on their survival rates? We address these issues by using different types of encoding techniques to convert the gender and embarkation point details into a form that a machine learning model can understand.
While categories can be encoded in plain Python, for example with a dictionary lookup, the Pandas library shines with its efficiency and simplicity. Let's begin by loading our libraries and dataset.
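A minimal sketch of this loading step, assuming the dataset is the copy that ships with the seaborn library (`sns.load_dataset("titanic")`, which downloads it on first use). That source is an assumption, as is the hypothetical helper `load_titanic`; if seaborn or the network is unavailable, it falls back to a few made-up rows that only mimic the two columns of interest.

```python
import pandas as pd


def load_titanic() -> pd.DataFrame:
    """Return the Titanic data, or a tiny illustrative stand-in.

    Hypothetical helper: the seaborn source is an assumption.
    """
    try:
        import seaborn as sns  # optional dependency

        return sns.load_dataset("titanic")
    except Exception:
        # Fallback: made-up rows, NOT real passenger records.
        return pd.DataFrame(
            {
                "sex": ["male", "female", "female", "male"],
                "embarked": ["S", "C", "Q", "S"],
                "survived": [0, 1, 1, 0],
            }
        )


df = load_titanic()

# Peek at the two categorical columns this lesson focuses on.
print(df[["sex", "embarked"]].head())
```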
The code above loads the Titanic dataset so that we can transform it with the techniques shown in the following sections.
In this session, we focus on two categorical variables from the Titanic dataset: the passenger's gender and the embarkation point. Both columns hold text values that our algorithms cannot work with directly, so we apply different encoding techniques to convert them.
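To make the idea concrete, here is a minimal sketch of two common encodings in pandas, run on a few made-up rows rather than the real data. The column names `sex` and `embarked`, the new column `sex_encoded`, and the chosen integer codes are assumptions for illustration only.

```python
import pandas as pd

# Made-up rows that mimic the two categorical columns (not real records).
sample = pd.DataFrame(
    {
        "sex": ["male", "female", "female", "male"],
        "embarked": ["S", "C", "Q", "S"],
    }
)

# Integer (label) encoding: map each category to an arbitrary number.
sample["sex_encoded"] = sample["sex"].map({"male": 0, "female": 1})

# One-hot encoding: one 0/1 indicator column per category.
embarked_dummies = pd.get_dummies(
    sample["embarked"], prefix="embarked", dtype=int
)

# Combine the original columns with the new indicator columns.
encoded = pd.concat([sample, embarked_dummies], axis=1)
print(encoded)
```

The integer mapping is a natural fit for a two-valued column, while the one-hot columns avoid implying any order among the embarkation codes.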
