Welcome to the lesson on regression analysis, which plays a crucial role in statistical modeling and prediction. Regression analysis helps us understand relationships between variables by fitting a mathematical model to the data. In this lesson, you'll learn how to perform simple linear regression using SciPy. This lesson builds on your previous understanding of descriptive statistics, probability distributions, and correlation, and it will prepare you for further exploration of statistical methods.
As a reminder, simple linear regression is a statistical technique that models the relationship between two variables by fitting a linear equation to the observed data. The goal is to find the best-fitting straight line, known as the regression line, which minimizes the sum of the squared vertical distances (the errors, or residuals) between the observed data points and the corresponding points on the line.
The formula for a simple linear regression line is:

$$y = \beta_0 + \beta_1 x + \varepsilon$$

Where:
- $y$ is the dependent variable.
- $x$ is the independent variable.
- $\beta_0$ is the y-intercept.
- $\beta_1$ is the slope of the line.
- $\varepsilon$ represents the random error term.
To explore simple linear regression, we need datasets `x` and `y`, representing the independent and dependent variables, respectively. Here's the sample data we'll work with:
```python
import numpy as np

# Sample data
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 1, 4, 3, 5])
```
In this example, `x` represents the independent variable, with values `[1, 2, 3, 4, 5]`, and `y` is the dependent variable, with values `[2, 1, 4, 3, 5]`. The aim is to model the relationship between `x` and `y`.
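Before handing the work to a library, it can help to see where the line comes from. The sketch below computes the least-squares slope and intercept by hand, using the standard closed-form estimates (slope as the sum of products of deviations divided by the sum of squared `x` deviations, intercept from the two means). The names `manual_slope` and `manual_intercept` are illustrative and not part of any library.

```python
import numpy as np

# Sample data from this lesson
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 1, 4, 3, 5])

# Deviations of each value from its mean
x_dev = x - x.mean()
y_dev = y - y.mean()

# Closed-form least-squares estimates (illustrative variable names)
manual_slope = np.sum(x_dev * y_dev) / np.sum(x_dev ** 2)
manual_intercept = y.mean() - manual_slope * x.mean()

print(f"slope = {manual_slope:.2f}, intercept = {manual_intercept:.2f}")
# prints: slope = 0.80, intercept = 0.60
```

These hand-computed values give a reference point for checking whatever a library routine returns on the same data.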
Let's proceed step-by-step to perform simple linear regression using SciPy.
Use `scipy.stats.linregress` to calculate the line of best fit. This function generates several outputs:
```python
from scipy import stats

# Simple linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
```
Here's a breakdown of the outputs:
- `slope`: The gradient of the regression line.
- `intercept`: The y-intercept of the regression line.
- `r_value`: The correlation coefficient, indicating the strength and direction of a linear relationship.
- `p_value`: The p-value for a hypothesis test whose null hypothesis is that the slope is zero.
- `std_err`: The standard error of the estimated gradient.
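To make these outputs concrete, here is a small sketch that prints each one for the sample data. Instead of unpacking the tuple, it keeps the returned object and reads its attributes (`slope`, `intercept`, `rvalue`, `pvalue`, `stderr`). The `r_squared` variable is an addition for illustration: squaring the correlation coefficient gives the share of variance in `y` explained by the line.

```python
import numpy as np
from scipy import stats

# Sample data from this lesson
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 1, 4, 3, 5])

# Keep the result object so each output can be read by name
result = stats.linregress(x, y)

# Coefficient of determination (illustrative, derived from rvalue)
r_squared = result.rvalue ** 2

print(f"slope     = {result.slope:.3f}")
print(f"intercept = {result.intercept:.3f}")
print(f"r         = {result.rvalue:.3f}")
print(f"r squared = {r_squared:.3f}")
print(f"p-value   = {result.pvalue:.3f}")
print(f"std err   = {result.stderr:.3f}")
```

For this sample the slope and the correlation coefficient are both 0.8, so the line explains 64% of the variance in `y`. The p-value lands above 0.10, which is a reminder that five points are too few to rule out a zero slope at conventional significance levels.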
The regression line equation can be expressed as $y = \text{slope} \cdot x + \text{intercept}$.
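Once the slope and intercept are known, the fitted line can be used to predict `y` for inputs that were not in the data. The following is a minimal sketch; `x_new` and `y_pred` are hypothetical names, and the new inputs 6 and 7 are chosen only for illustration.

```python
import numpy as np
from scipy import stats

# Sample data from this lesson
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 1, 4, 3, 5])

slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)

# Hypothetical new inputs, chosen only for illustration
x_new = np.array([6, 7])

# Apply the fitted line to the new inputs
y_pred = slope * x_new + intercept

print(y_pred)
```

With a slope of 0.8 and an intercept of 0.6, the predictions come out at roughly 5.4 and 6.2. Both inputs lie outside the observed range of `x`, so these are extrapolations and should be treated with more caution than predictions inside the range.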
Let's visualize the regression line along with the data points using Matplotlib:
```python
import matplotlib.pyplot as plt

# Plotting the scatter plot and regression line
plt.scatter(x, y, label='Data points')
plt.plot(x, slope * x + intercept, color='red', label='Linear fit')
plt.xlabel('Independent Variable X')
plt.ylabel('Dependent Variable Y')
plt.legend()
plt.title('Simple Linear Regression')
plt.show()
```
Here's what happens in this code:
- The `scatter` function plots the data points.
- The `plot` function draws the regression line obtained from the linear regression analysis.
- Labels and title are added for clarity.
The resulting plot shows the five data points as a scatter, with the fitted line drawn in red.
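The vertical gaps between the points and the line are the errors that the fit minimizes. As a rough numerical companion to the plot, the sketch below computes those residuals and uses them to recover the squared correlation coefficient. The names `residuals`, `sse`, `sst`, and `r_squared` are illustrative.

```python
import numpy as np
from scipy import stats

# Sample data from this lesson
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 1, 4, 3, 5])

slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)

# Residuals: observed values minus values on the fitted line
residuals = y - (slope * x + intercept)

# Sum of squared errors and total sum of squares
sse = np.sum(residuals ** 2)
sst = np.sum((y - y.mean()) ** 2)

# Share of variance explained; matches r_value squared for this model
r_squared = 1 - sse / sst

print("residuals:", np.round(residuals, 2))
print(f"SSE = {sse:.2f}, R squared = {r_squared:.2f}")
```

For a least-squares line with an intercept, the residuals sum to zero (up to floating-point error), and `r_squared` agrees with `r_value ** 2`.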
Now, let's use the famous Iris dataset to perform simple linear regression with real data:
```python
from sklearn.datasets import load_iris
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Load iris dataset
iris = load_iris()
x = iris.data[:, 0]  # sepal length
y = iris.data[:, 2]  # petal length

# Perform simple linear regression
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)

# Plotting the regression results
plt.scatter(x, y, label='Iris Data Points')
plt.plot(x, slope * x + intercept, color='red', label='Linear fit')
plt.xlabel('Sepal Length')
plt.ylabel('Petal Length')
plt.legend()
plt.title('Linear Regression on Iris Dataset')
plt.show()
```
In this example:
- We take the sepal length as the independent variable (`x`) and the petal length as the dependent variable (`y`).
- We apply the same process of calculating the regression using `scipy.stats.linregress`.
The resulting plot shows the regression line.
In this lesson, you've learned how to perform simple linear regression using SciPy. You explored practical steps to compute the regression line and visualize the relationship between two variables. Understanding regression analysis is essential for data-driven decision-making and predictive modeling.
As you move on to the practice exercises, apply these concepts to solve real-world problems where regression can provide insights. Keep practicing to strengthen your understanding, and look forward to more advanced topics in your learning journey.