Welcome to our lesson on Linear Regression Analysis! This technique is fundamental in machine learning for predicting values based on data. By the end of this lesson, you will understand what linear regression is, why it's useful, and how to build a linear regression model in Python with the popular scikit-learn library.
Imagine you're running a lemonade stand and want to predict future sales based on past data. Linear regression helps you figure out the trend and make educated guesses. Let's explore how it works.
Linear regression models the relationship between an output variable and one or more input variables by fitting a straight line to the observed data. The simplest form is simple linear regression, where we have one independent variable (input) and one dependent variable (output).
Let's say you have the following data on hours studied and the corresponding test scores:
- Hours studied: [1, 2, 3, 4, 5]
- Test scores: [2, 4, 5, 4, 5]
Our goal is to predict the test score for studying 6 hours. We'll start by visualizing the data:
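One way to draw it, as a minimal sketch assuming matplotlib is installed (the variable names and plot styling are illustrative choices, not part of the lesson's data):

```python
import matplotlib.pyplot as plt

# Data from the lesson
hours_studied = [1, 2, 3, 4, 5]
test_scores = [2, 4, 5, 4, 5]

# Draw each (hours, score) pair as a point
fig, ax = plt.subplots()
ax.scatter(hours_studied, test_scores, color="blue", label="Observed data")
ax.set_xlabel("Hours studied")
ax.set_ylabel("Test score")
ax.set_title("Hours studied vs. test scores")
ax.legend()
plt.show()
```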
A scatter plot of this data shows the relationship between hours studied and test scores. Now, let's introduce a line to approximate this relationship.
The general formula for a line is $y = mx + b$, where:
- $y$ is the dependent variable (output we predict, like sales).
- $x$ is the independent variable (input, like days).
- $m$ (slope) determines the line's steepness.
- $b$ (intercept) is where the line crosses the y-axis.
The best-fit line minimizes the error between the observed data points and the predicted values, measured as the sum of the squared differences between them. We can use scikit-learn, a powerful machine learning library in Python, to find this line easily.
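For reference, here is a sketch of what "minimizing the error" means for one input variable, assuming the usual ordinary least squares criterion (which is what scikit-learn's `LinearRegression` uses). The symbols $n$, $\bar{x}$, and $\bar{y}$ (the number of points and the means of the inputs and outputs) are notation introduced here, not part of the lesson:

$$
\min_{m,\,b} \sum_{i=1}^{n} \bigl(y_i - (m x_i + b)\bigr)^2,
\qquad
m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b = \bar{y} - m\,\bar{x}
$$

For the study data above, $\bar{x} = 3$ and $\bar{y} = 4$, so $m = 6 / 10 = 0.6$ and $b = 4 - 0.6 \cdot 3 = 2.2$.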
Let's calculate this in Python:
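A minimal sketch of that calculation, assuming NumPy and scikit-learn are installed (the variable names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# scikit-learn expects X as a 2-D array: one row per sample, one column per feature
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)  # hours studied
y = np.array([2, 4, 5, 4, 5])                 # test scores

# Create the model and fit it to the data
model = LinearRegression()
model.fit(X, y)

# Read the fitted slope and intercept off the model
slope = model.coef_[0]
intercept = model.intercept_

print(f"Slope: {slope:.2f}")          # Slope: 0.60
print(f"Intercept: {intercept:.2f}")  # Intercept: 2.20
```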
This code provides the slope and intercept of the best-fit line through the data points using scikit-learn. You will learn more about scikit-learn when you start exploring machine learning. For now, let's quickly review this code:
- `LinearRegression()` creates a linear regression model, which is capable of learning from the data and making predictions.
- The `.fit(X, y)` method trains the model, finding the coefficients of the best-fit line.
- `.coef_[0]` obtains the slope of the best-fit line. We need `[0]` because a model can have more than one input variable, so the model's `.coef_` is an array of coefficients, one per input. In our case, with a single input, the array holds one coefficient, and `[0]` retrieves it.
- `.intercept_` obtains the intercept of the best-fit line.
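To make the `[0]` point concrete, the sketch below compares the shape of `.coef_` for one input column and for two. The second column is a made-up feature with hypothetical values, added only to show the shape change:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

y = np.array([2, 4, 5, 4, 5])  # test scores from the lesson

# One input (hours studied): coef_ holds a single slope
X_one = np.array([[1], [2], [3], [4], [5]])
model_one = LinearRegression().fit(X_one, y)
print(model_one.coef_.shape)  # (1,)

# Two inputs: hours studied plus a hypothetical second column (invented values)
X_two = np.array([[1, 8], [2, 6], [3, 7], [4, 5], [5, 9]])
model_two = LinearRegression().fit(X_two, y)
print(model_two.coef_.shape)  # (2,)
```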
Now that we have the slope and intercept, let's plot the best-fit line on the original data:
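A possible version of that plot, as a self-contained sketch assuming matplotlib, NumPy, and scikit-learn are installed (it refits the model so it can run on its own; colors and labels are illustrative):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)  # hours studied
y = np.array([2, 4, 5, 4, 5])                 # test scores

model = LinearRegression().fit(X, y)
slope = model.coef_[0]
intercept = model.intercept_

# Points on the fitted line at each observed x value
y_line = slope * X.ravel() + intercept

fig, ax = plt.subplots()
ax.scatter(X.ravel(), y, color="blue", label="Observed data")
ax.plot(X.ravel(), y_line, color="red", label="Best-fit line")
ax.set_xlabel("Hours studied")
ax.set_ylabel("Test score")
ax.legend()
plt.show()
```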
This is the best-fit line: it minimizes the total squared vertical distance between the line and the data points.
With the best-fit line equation $y = mx + b$, we can predict new values. Let's predict the test score for 6 hours of study:
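A minimal sketch of that prediction, assuming NumPy and scikit-learn are installed. It shows two equivalent routes: applying the line equation by hand, and calling the model's `predict` method (variable names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)  # hours studied
y = np.array([2, 4, 5, 4, 5])                 # test scores

model = LinearRegression().fit(X, y)

hours = 6

# Option 1: apply the line equation directly with the fitted slope and intercept
predicted_manual = model.coef_[0] * hours + model.intercept_

# Option 2: let the model do it; predict() expects a 2-D array, like fit()
predicted_model = model.predict(np.array([[hours]]))[0]

print(f"Predicted test score for {hours} hours: {predicted_model:.1f}")
# Predicted test score for 6 hours: 5.8
```

Both routes give the same value because `predict` evaluates the same fitted line.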
Fantastic! You've learned the basics of linear regression and how to fit a model with scikit-learn in Python. We've explored predicting a test score based on hours studied by calculating and plotting the best-fit line.
Now it's time to put this knowledge into practice. In the next session, you'll implement linear regression on a new dataset and make predictions. Let's dive into those exercises and solidify your understanding!
