Introduction to Descriptive Statistics with the Titanic Dataset

Welcome back! This lesson is all about descriptive statistics and understanding the various characteristics of the Titanic dataset.

So, why do we need to study statistics when dealing with data? Well, statistics is a branch of mathematics dealing with data collection, organization, and interpretation. In data science, we use statistics to extract meaningful insights and knowledge from data.

Statistics helps us manage a dataset's complexity by reducing it to a simpler summary. These summaries guide how we present and visualize the data, and they inform the decisions we make later in data analysis or when building a machine learning model.

Take our current dataset, for instance, which comprises various demographics and passenger information; wouldn't it be interesting to know the average age or to gauge the variety in travelers' fares? Our lesson will focus on extracting these primary statistical features from our dataset, helping us better comprehend the Titanic voyage.

Overview of Descriptive Statistics

Descriptive statistics summarize and organize the characteristics of a dataset. A dataset is a collection of responses or observations from a sample or an entire population.

In pandas, there's a method called describe(), which by default calculates basic statistics for all numeric columns. These include continuous variables, i.e., variables that can take on any value within a specific range, as well as discrete counts. Its output provides the count, mean, standard deviation (std), min, quartiles (25%, 50%, and 75%), and max.

Firstly, let's import the libraries we will be using and load the dataset:
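As a minimal sketch of this step, the example below builds a small hand-made stand-in for the dataset. The variable name titanic and its six rows are assumptions made for illustration, not real passenger records. One common source for the full data is seaborn's load_dataset("titanic"), which needs seaborn installed and an internet connection; whether this lesson uses that source is also an assumption.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data (NOT real passenger records).
# With seaborn installed and network access, the full dataset could be
# loaded instead with:
#   import seaborn as sns
#   titanic = sns.load_dataset("titanic")
titanic = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, None, 59.0],  # None marks a missing age
    "fare": [7.25, 71.28, 7.92, 53.10, 8.46, 51.86],
    "sex": ["male", "female", "female", "male", "male", "male"],
    "class": ["Third", "First", "Third", "First", "Third", "Third"],
})

# head() returns the first five rows by default
first_rows = titanic.head()
print(first_rows)
```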

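Calling head() on the loaded DataFrame displays its first five rows, which gives a quick look at the available columns and the kind of values they hold.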

The describe() function can then be executed as follows:
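A minimal sketch of the call, again using a small hand-made stand-in DataFrame (the name titanic and its values are illustrative assumptions, not the real data):

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data (NOT real passenger records)
titanic = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, None, 59.0],  # None marks a missing age
    "fare": [7.25, 71.28, 7.92, 53.10, 8.46, 51.86],
    "sex": ["male", "female", "female", "male", "male", "male"],
    "class": ["Third", "First", "Third", "First", "Third", "Third"],
})

# By default, describe() summarizes only the numeric columns (age and fare)
summary = titanic.describe()
print(summary)

# Individual statistics can be read by row and column label
print(summary.loc["mean", "age"])
```

Note that the count for age is 5 rather than 6 in this sample, because the missing value is skipped.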

The output of describe() is a table with one column per numeric variable and one row per statistic: count, mean, std, min, 25%, 50%, 75%, and max.

The describe() function generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values.

What Else?

Notice how all the categorical columns, like 'sex' or 'class', are missing in the output. By default, describe() only includes columns with numerical data.

If you want to include all columns, you need to pass include='all' as an argument. Here is how to do it:
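The sketch below shows the include='all' argument on a small hand-made stand-in DataFrame (the name titanic and its values are illustrative assumptions, not the real data):

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data (NOT real passenger records)
titanic = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, None, 59.0],  # None marks a missing age
    "fare": [7.25, 71.28, 7.92, 53.10, 8.46, 51.86],
    "sex": ["male", "female", "female", "male", "male", "male"],
    "class": ["Third", "First", "Third", "First", "Third", "Third"],
})

# include="all" adds the non-numeric columns to the summary
summary_all = titanic.describe(include="all")
print(summary_all)

# Statistics that do not apply to a column's type appear as NaN,
# e.g. "mean" for sex or "unique" for age
print(summary_all.loc[["unique", "top", "freq"], ["sex", "class"]])
```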

Note that for categorical variables, the output reports different statistics: unique, top, and freq. 'unique' shows the number of distinct values in the column, 'top' shows the most frequent value, and 'freq' shows how many times that top value appears in the column.

Unveiling The Spread

Variability, also known as dispersion, is the extent to which data points differ from the center. Two commonly used measures are the range and interquartile range (IQR).

The range is the difference between a dataset's maximum and minimum values. However, it's sensitive to outliers; a single extremely high or low value can inflate the range. Here's how you calculate the range for the age column of the Titanic dataset:
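A minimal sketch of the range calculation, using a hand-made stand-in for the age column (the name titanic and the ages are illustrative assumptions, not the real data):

```python
import pandas as pd

# Hypothetical stand-in for the age column (NOT real passenger records)
titanic = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, None, 59.0],  # None marks a missing age
})

# max() and min() skip missing values by default
age_range = titanic["age"].max() - titanic["age"].min()
print(age_range)  # 59.0 - 22.0 = 37.0 for this sample
```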

The IQR measures statistical dispersion, or how far apart the data points are. It's the difference between the third quartile (75th percentile) and the first quartile (25th percentile), so it spans the middle 50% of your data. It's a more robust measure of dispersion than the range because extreme values have little effect on it. Here's how you can calculate it:
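A minimal sketch of the IQR calculation, using the same hand-made stand-in for the age column (the name titanic and the ages are illustrative assumptions, not the real data):

```python
import pandas as pd

# Hypothetical stand-in for the age column (NOT real passenger records)
titanic = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, None, 59.0],  # None marks a missing age
})

# quantile() ignores missing values and interpolates linearly by default
q1 = titanic["age"].quantile(0.25)  # first quartile
q3 = titanic["age"].quantile(0.75)  # third quartile
iqr = q3 - q1
print(q1, q3, iqr)  # 26.0 38.0 12.0 for this sample
```

In this sample the age of 59 widens the range to 37 but leaves the IQR at 12, which illustrates why the IQR is less sensitive to extreme values.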

Determining The Central Position

Central tendency measures help you find the center of your dataset. The mean and the median are two of the most common measures of central tendency.

The mean, or average, is the sum of all data points divided by the number of data points.

The median is the middle value once the data points are arranged in numerical order. When there is an even number of data points, it is the average of the two middle values.
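Both measures can be sketched in pandas as follows, using a hand-made stand-in for the age column (the name titanic and the ages are illustrative assumptions, not the real data):

```python
import pandas as pd

# Hypothetical stand-in for the age column (NOT real passenger records)
titanic = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, None, 59.0],  # None marks a missing age
})

# Both methods skip missing values by default
mean_age = titanic["age"].mean()      # (22 + 38 + 26 + 35 + 59) / 5
median_age = titanic["age"].median()  # middle of 22, 26, 35, 38, 59
print(mean_age, median_age)  # 36.0 35.0 for this sample
```

The mean sits slightly above the median here because the single high age of 59 pulls the average upward, while the median depends only on the ordering of the values.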

Wrapping Up

You've just taken your first steps into the realm of descriptive statistics! In this lesson, you've learned about the usefulness of statistics in data analysis and how we can summarize our Titanic dataset via central tendency and dispersion measures.

Understanding these statistical characteristics gives you a sound foundation for building meaningful data visualizations and, later, for making effective predictions from this dataset.

Ready to Practice?

With the theory presented, let's put it into practice! The practice exercise will help you revisit everything covered in this lesson while drawing statistical insights from our Titanic dataset.
