Welcome to the first lesson in our "Statistics with SciPy" course. In this lesson, you will gain a foundational understanding of descriptive statistics and probability distributions and learn how to use SciPy, a powerful Python library, to work with these concepts.
Descriptive statistics summarize and describe the main features of a dataset, helping to simplify and present data in an informative way. Probability distributions, on the other hand, describe how the values of a random variable are distributed. Together, these concepts are pivotal for data analysis and decision-making processes. SciPy's `scipy.stats` module provides functions for both: computing summary statistics and working with probability distributions.
Let's delve into the core descriptive statistics: mean, median, mode, and standard deviation. We'll use a sample dataset to illustrate these concepts.
First, we need to understand what these terms mean:
- Mean is the arithmetic average: the sum of the values divided by their count.
- Median is the middle value of the sorted data (the average of the two middle values when the count is even).
- Mode is the most frequently occurring value.
- Standard deviation measures how widely the values are spread around the mean.
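To make the arithmetic behind these definitions explicit, here is a minimal sketch in plain Python, without SciPy. The list `values` is a hypothetical toy dataset chosen so that each result is easy to verify by hand; it is not the lesson's sample data. The standard deviation shown is the sample version, which divides by n - 1.

```python
from collections import Counter
import math

# Hypothetical toy data, chosen so every result can be checked by hand
values = [2, 4, 4, 5, 10]

# Mean: sum of the values divided by how many there are
mean = sum(values) / len(values)

# Median: middle value of the sorted data
# (average of the two middle values if the count is even)
ordered = sorted(values)
mid = len(ordered) // 2
median = ordered[mid] if len(ordered) % 2 else (ordered[mid - 1] + ordered[mid]) / 2

# Mode: the value with the highest count
mode = Counter(values).most_common(1)[0][0]

# Sample standard deviation: square root of the summed squared
# deviations from the mean, divided by n - 1
squared_devs = sum((v - mean) ** 2 for v in values)
sample_std = math.sqrt(squared_devs / (len(values) - 1))

print(mean, median, mode, sample_std)  # 5.0 4 4 3.0
```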
To calculate these, we'll use SciPy. Let's start with defining our sample dataset.
```python
from scipy import stats
import numpy as np

# Sample data
data = np.array([1, 2, 2, 3, 4, 4, 4, 5, 6, 6])
```
In this code snippet, we're using NumPy to create an array named `data` that holds our sample dataset, which we will analyze using SciPy.
Now, let's calculate the mean, median, and standard deviation. SciPy supplies the mean and standard deviation, and NumPy supplies the median:

```python
# Calculate the mean and standard deviation with SciPy, and the median with NumPy
mean_value = stats.tmean(data)
median_value = np.percentile(data, 50)  # The median is the 50th percentile
std_deviation = stats.tstd(data)  # Sample standard deviation (divides by n - 1)

print(f"Mean: {mean_value}, Median: {median_value}, Standard Deviation: {std_deviation:.4f}")
```

- `stats.tmean(data)` calculates the average of the dataset. With no trimming limits given, it is the ordinary arithmetic mean.
- `np.percentile(data, 50)` finds the 50th percentile, which is the median; `np.median(data)` gives the same result.
- `stats.tstd(data)` computes the sample standard deviation, which measures how far the elements of the dataset are spread from the average.

Output:

```text
Mean: 3.7, Median: 4.0, Standard Deviation: 1.7029
```
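One detail worth checking is which standard deviation is being computed. The sketch below compares `stats.tstd`, which by default divides by n - 1 (the sample standard deviation), with `np.std`, which by default divides by n (the population standard deviation). The `np.std` calls are an addition for comparison and are not part of the lesson's original code.

```python
import numpy as np
from scipy import stats

data = np.array([1, 2, 2, 3, 4, 4, 4, 5, 6, 6])

# stats.tstd divides the summed squared deviations by n - 1 (ddof=1 by default)
sample_std = stats.tstd(data)

# np.std matches it only when ddof=1 is passed explicitly
numpy_sample_std = np.std(data, ddof=1)

# With its default ddof=0, np.std divides by n instead
population_std = np.std(data)

print(f"Sample standard deviation:     {sample_std:.4f}")      # 1.7029
print(f"Population standard deviation: {population_std:.4f}")  # 1.6155
```

The two values differ noticeably on a dataset this small, so it helps to state which one is reported.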
Next, let's find the mode using SciPy:
```python
# Calculate mode
mode_value = stats.mode(data).mode

print(f"Mode: {mode_value}")  # 4
```
`stats.mode(data)` returns a result object with two fields: `mode`, the most frequent value, and `count`, the number of times it occurs. For one-dimensional input, recent SciPy releases return these as scalars, while older releases return one-element arrays. To extract the most frequently occurring value, we use `.mode` on the result. If two or more values share the highest frequency, the smallest of them is returned. In this case, the mode of the dataset is `4`, which appears three times.
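The tie-breaking rule is easy to miss, so here is a small sketch of it. The array `tied` is hypothetical data in which two values share the highest count. Listing every tied value with `np.unique` is one possible approach, not something `stats.mode` does for you.

```python
import numpy as np
from scipy import stats

# Hypothetical data in which 2 and 4 both appear three times
tied = np.array([1, 2, 2, 2, 3, 4, 4, 4, 5])

result = stats.mode(tied)

# np.ravel handles both return styles: a scalar in recent SciPy releases,
# a one-element array in older ones
smallest_mode = int(np.ravel(result.mode)[0])
print(smallest_mode)  # 2, the smallest of the tied values

# np.unique returns the sorted distinct values and how often each occurs,
# which lets us keep every value that reaches the maximum count
distinct, counts = np.unique(tied, return_counts=True)
all_modes = distinct[counts == counts.max()]
print(all_modes)  # [2 4]
```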
Probability distributions describe how values are spread for a random variable. An important type is the normal distribution, often depicted as a bell curve, where data is symmetrically distributed, with most observations clustering around the central peak. The formula for the probability density function of a normal distribution is:
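$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

where:
- $\mu$ is the mean,
- $\sigma^2$ is the variance (which is the square of the standard deviation $\sigma$, a measure of dispersion),
- $x$ is the value of the random variable.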
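As a quick check of the normal density definition, the sketch below evaluates it term by term and compares the result with `stats.norm.pdf`. The helper `normal_pdf` is a hypothetical function written for illustration; it is not part of SciPy.

```python
import numpy as np
from scipy import stats


def normal_pdf(x, mu, sigma):
    """Normal density: exp(-(x - mu)^2 / (2 sigma^2)) / sqrt(2 pi sigma^2).

    Illustrative helper, not a SciPy function.
    """
    variance = sigma ** 2
    return np.exp(-((x - mu) ** 2) / (2 * variance)) / np.sqrt(2 * np.pi * variance)


x = np.array([-1.0, 0.0, 1.0])

manual = normal_pdf(x, mu=0.0, sigma=1.0)
scipy_values = stats.norm.pdf(x, loc=0.0, scale=1.0)

print(manual)
print(np.allclose(manual, scipy_values))  # True
```

Note that SciPy's `scale` parameter is the standard deviation, not the variance. For the standard normal distribution, the density peaks at the mean with a value of 1/sqrt(2π), about 0.3989.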
In statistics, understanding probability distributions helps in modeling and predicting data behavior in real-world scenarios.
SciPy provides tools for generating and analyzing probability distributions. Here's how to work with the normal distribution:
Generate random variables from a normal distribution:
```python
# Generate random variables from a normal distribution
norm_dist = stats.norm(loc=0, scale=1)
random_vars = norm_dist.rvs(size=1000)
```
- `stats.norm(loc=0, scale=1)` creates a normal distribution with mean `0` and standard deviation `1`.
- `norm_dist.rvs(size=1000)` generates `1000` random variables.
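Because `rvs` draws new values on every call, results change from run to run. One way to make them repeatable, sketched below, is to pass the `random_state` argument. The seed `42` is an arbitrary choice, and the summary statistics are printed only as a sanity check that the sample is close to the requested mean and standard deviation.

```python
import numpy as np
from scipy import stats

norm_dist = stats.norm(loc=0, scale=1)

# random_state fixes the seed so repeated runs draw the same values
# (42 is an arbitrary choice)
samples = norm_dist.rvs(size=1000, random_state=42)
samples_again = norm_dist.rvs(size=1000, random_state=42)

print(np.array_equal(samples, samples_again))  # True

# The sample statistics should be close to loc=0 and scale=1, but not exactly equal
print(f"Sample mean: {samples.mean():.3f}")
print(f"Sample standard deviation: {samples.std(ddof=1):.3f}")
```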
Plot the distribution using a histogram:
```python
import matplotlib.pyplot as plt

# Plot histogram of the random variables
plt.hist(random_vars, bins=30, density=True, alpha=0.6, color='g')

# Plot the probability density function (PDF) of the normal distribution
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = stats.norm.pdf(x, loc=0, scale=1)
plt.plot(x, p, 'k', linewidth=2)

plt.title('Histogram of Random Variables from Normal Distribution')
plt.xlabel('Value')
plt.ylabel('Density')
plt.show()
```
The histogram shows the distribution of the random variables, resembling the bell curve of a normal distribution. The overlaid line is the theoretical probability density function (PDF), helping to visualize the data's spread and central tendency.
`scipy` provides functions for various probability distributions. In fact, it has more than `50` of them! Explore them in the documentation.
Here are some of the most common ones:
- `expon` – An exponential continuous random variable.
- `uniform` – A uniform continuous random variable.
- `bernoulli` – A Bernoulli discrete random variable.
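These distributions follow the same pattern as `stats.norm`: create a frozen distribution, then call methods on it. The sketch below is a minimal illustration; the parameter values (`scale=2.0`, the interval from 0 to 10, and a success probability of 0.3) are arbitrary choices for the example.

```python
from scipy import stats

# Exponential: scale is the mean of the distribution (1 / rate)
expon_dist = stats.expon(scale=2.0)
print(f"Exponential mean: {expon_dist.mean():.2f}")  # 2.00

# Uniform on the interval [loc, loc + scale], here [0, 10]
uniform_dist = stats.uniform(loc=0, scale=10)
print(f"P(X <= 2.5) for uniform: {uniform_dist.cdf(2.5):.2f}")  # 0.25

# Bernoulli with success probability 0.3; discrete distributions
# expose a probability mass function (pmf) instead of a pdf
bernoulli_dist = stats.bernoulli(0.3)
print(f"P(X = 1) for Bernoulli: {bernoulli_dist.pmf(1):.2f}")  # 0.30
```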
In this lesson, you learned how to calculate descriptive statistics using SciPy and gained an introduction to probability distributions with a focus on the normal distribution. You also practiced generating random samples from a distribution and comparing their histogram with the theoretical PDF using SciPy's tools.
These skills are foundational in data analysis, allowing you to summarize data effectively and understand underlying patterns. In upcoming practice exercises, you will have the opportunity to apply what you've learned, further reinforcing your understanding and proficiency in using SciPy for statistical analysis. Keep exploring these examples to gain confidence in working with data!