
Introduction to Descriptive Statistics and Probability Distributions

Welcome to the first lesson in our "Statistics with SciPy" course. In this lesson, you will gain a foundational understanding of descriptive statistics and probability distributions and learn how to use SciPy, a powerful Python library, to work with these concepts.

Descriptive statistics summarize and describe the main features of a dataset, helping to simplify and present data in an informative way. Probability distributions, on the other hand, describe how the values of a random variable are distributed. Together, these concepts are central to data analysis and decision-making. SciPy's scipy.stats module provides tools for both: functions that compute summary statistics and objects that represent probability distributions.

Descriptive Statistics Using SciPy: Part 1

Let's delve into the core descriptive statistics: mean, median, mode, and standard deviation. We'll use a sample dataset to illustrate these concepts.

First, we need to understand what these terms mean:

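Mean is the arithmetic average of a dataset: the sum of the values divided by their count.
Median is the middle value when the data is sorted (for an even number of values, the average of the two middle values).
Mode is the most frequently occurring value.
Standard Deviation measures the amount of variation or dispersion of the values around the mean.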

To calculate these, we'll use SciPy. Let's start with defining our sample dataset.
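A minimal sketch of such a dataset follows. The values are hypothetical, not the lesson's original data; they are chosen so that 4 is the most frequent value, in line with the mode discussed later in this lesson.

```python
import numpy as np

# Hypothetical sample values (an assumption for illustration),
# chosen so that 4 occurs more often than any other value.
data = np.array([1, 2, 2, 3, 4, 4, 4, 5, 6])

print(data)
```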

In this code snippet, we're using NumPy to create an array named data that holds our sample dataset, which we will analyze using SciPy.

Descriptive Statistics Using SciPy: Part 2

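Now, let's calculate the mean, median, standard deviation, and mode using SciPy and NumPy functions:

stats.tmean(data) calculates the trimmed mean; with no limits given, nothing is trimmed and the result is the ordinary arithmetic mean of the dataset.
np.percentile(data, 50) is a NumPy function that finds the 50th percentile, which with the default linear interpolation equals the median.
stats.tstd(data) computes the sample standard deviation (it uses ddof=1 by default), which measures how far the elements of the dataset are spread from the mean.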
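The three calls above can be put together as in the following sketch. It assumes the same hypothetical dataset as before (illustrative values, not the lesson's original data), so the printed numbers apply only to that dataset.

```python
import numpy as np
from scipy import stats

# Hypothetical sample values (an assumption for illustration).
data = np.array([1, 2, 2, 3, 4, 4, 4, 5, 6])

mean_value = stats.tmean(data)          # arithmetic mean (no trimming limits given)
median_value = np.percentile(data, 50)  # 50th percentile, i.e. the median
std_value = stats.tstd(data)            # sample standard deviation (ddof=1)

print(f"Mean: {mean_value:.3f}")
print(f"Median: {median_value:.3f}")
print(f"Standard deviation: {std_value:.3f}")
```

Note that np.std(data) uses ddof=0 by default, so it returns a slightly smaller value than stats.tstd(data) unless ddof=1 is passed explicitly.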


Next, let's find the mode using SciPy:

stats.mode(data) returns a result object with two fields: mode, the most frequent value, and count, the number of times it occurs. In recent SciPy versions these are scalars for one-dimensional input; older versions return them as length-one arrays. To extract the most frequently occurring value, we use .mode on the result. If two or more values share the highest frequency, the smallest of them is returned. In this case, the mode of the dataset is 4, which appears more often than any other value.
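A short sketch of the mode calculation, again assuming the hypothetical dataset used for illustration. Because the shape of the result fields differs between SciPy versions, np.atleast_1d is used here as a defensive choice so the same code works whether .mode is a scalar or a length-one array.

```python
import numpy as np
from scipy import stats

# Hypothetical sample values (an assumption for illustration).
data = np.array([1, 2, 2, 3, 4, 4, 4, 5, 6])

mode_result = stats.mode(data)

# .mode and .count may be scalars or length-one arrays depending on the
# installed SciPy version; np.atleast_1d handles both cases.
most_common = np.atleast_1d(mode_result.mode)[0]
occurrences = np.atleast_1d(mode_result.count)[0]

print(f"Mode: {most_common} (appears {occurrences} times)")
```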

Understanding Probability Distributions

Probability distributions describe how values are spread for a random variable. An important type is the normal distribution, often depicted as a bell curve, where data is symmetrically distributed, with most observations clustering around the central peak. For a normal distribution with mean \mu and variance \sigma^2 (standard deviation \sigma), the probability density function is:

f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
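To make the formula concrete, the sketch below evaluates it directly with NumPy and compares the result with stats.norm.pdf. The helper name normal_pdf is hypothetical and introduced only for this illustration; in SciPy, loc corresponds to \mu and scale to \sigma (the standard deviation, not the variance).

```python
import numpy as np
from scipy import stats


def normal_pdf(x, mu, sigma):
    """Evaluate the normal PDF directly from the formula (illustrative helper)."""
    variance = sigma ** 2
    return np.exp(-((x - mu) ** 2) / (2 * variance)) / np.sqrt(2 * np.pi * variance)


x = np.linspace(-3, 3, 7)
manual = normal_pdf(x, mu=0.0, sigma=1.0)
reference = stats.norm.pdf(x, loc=0.0, scale=1.0)

# The hand-written formula and SciPy's implementation should agree.
print(np.allclose(manual, reference))
```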

Working with Probability Distributions in SciPy

SciPy provides tools for generating and analyzing probability distributions. Here's how to work with the normal distribution:

Generate random samples from a normal distribution:

stats.norm(loc=0, scale=1) creates a "frozen" normal distribution object with mean 0 and standard deviation 1; here it is stored in a variable named norm_dist.
norm_dist.rvs(size=1000) draws 1000 random samples (variates) from that distribution.
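The two calls above can be combined as in this sketch. The random_state argument is an addition made here so the output is reproducible; the seed value 42 is arbitrary.

```python
from scipy import stats

# Frozen standard normal distribution: mean 0, standard deviation 1.
norm_dist = stats.norm(loc=0, scale=1)

# Draw 1000 samples; random_state fixes the seed (an assumption added
# here for reproducibility).
samples = norm_dist.rvs(size=1000, random_state=42)

print(samples[:5])
print(f"Sample mean: {samples.mean():.3f}")
print(f"Sample standard deviation: {samples.std(ddof=1):.3f}")
```

With 1000 samples, the sample mean and standard deviation should land close to, but not exactly on, the theoretical values of 0 and 1.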

Plot the distribution using a histogram:

The histogram shows the distribution of the random samples, resembling the bell curve of a normal distribution. The overlaid line is the theoretical probability density function (PDF), which makes it easy to compare the spread and central tendency of the samples with the distribution they were drawn from.
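A plot like the one described can be produced along the following lines. This is a sketch that assumes Matplotlib is installed; the bin count, labels, seed, and output file name are arbitrary choices made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

norm_dist = stats.norm(loc=0, scale=1)
samples = norm_dist.rvs(size=1000, random_state=42)

fig, ax = plt.subplots()

# density=True rescales the bars so the total histogram area is 1,
# which puts the histogram on the same scale as the PDF.
counts, bin_edges, _ = ax.hist(
    samples, bins=30, density=True, alpha=0.6, label="Samples"
)

# Overlay the theoretical PDF across the range of the samples.
x = np.linspace(samples.min(), samples.max(), 200)
ax.plot(x, norm_dist.pdf(x), linewidth=2, label="Theoretical PDF")

ax.set_xlabel("Value")
ax.set_ylabel("Density")
ax.set_title("Samples from a standard normal distribution")
ax.legend()

# Save to a file; in an interactive session, plt.show() can be used instead.
fig.savefig("normal_histogram.png")
```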

Other Probability Distributions

SciPy provides many other probability distributions in the scipy.stats module, more than 50 in total, covering both continuous and discrete random variables. Explore them in the documentation.

Here are a few commonly used ones:

expon – An exponential continuous random variable.
uniform – A uniform continuous random variable.
bernoulli – A Bernoulli discrete random variable.
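These distributions follow the same pattern as stats.norm: create a frozen distribution, then call methods such as rvs or mean on it. The sketch below uses parameter values picked purely for illustration.

```python
from scipy import stats

# Parameter values are arbitrary choices for illustration.
expon_dist = stats.expon(scale=2.0)               # exponential with mean 2
uniform_dist = stats.uniform(loc=0.0, scale=1.0)  # uniform on [0, 1]
bernoulli_dist = stats.bernoulli(0.3)             # success probability 0.3

distributions = [
    ("expon", expon_dist),
    ("uniform", uniform_dist),
    ("bernoulli", bernoulli_dist),
]

for name, dist in distributions:
    samples = dist.rvs(size=1000, random_state=0)
    print(
        f"{name}: theoretical mean {dist.mean():.2f}, "
        f"sample mean {samples.mean():.2f}"
    )
```

For stats.uniform, loc is the lower bound and scale is the width of the interval, so the distribution covers [loc, loc + scale].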

Summary and Preparation for Practice

In this lesson, you learned how to calculate descriptive statistics using SciPy and gained an introduction to probability distributions with a focus on the normal distribution. You also saw how to generate random samples from a distribution and compare their histogram with the theoretical PDF using SciPy's tools.

These skills are foundational in data analysis, allowing you to summarize data effectively and understand underlying patterns. In upcoming practice exercises, you will have the opportunity to apply what you've learned, further reinforcing your understanding and proficiency in using SciPy for statistical analysis. Keep exploring these examples to gain confidence in working with data!

Next Lesson: Hypothesis Testing with SciPy
