Lesson Introduction

Welcome to our lesson on Linear Regression Analysis! This technique is fundamental in machine learning for predicting values based on data. By the end of this lesson, you will understand what linear regression is, why it's useful, and how to fit a linear regression model in Python with the popular scikit-learn library.

Imagine you're running a lemonade stand and want to predict future sales based on past data. Linear regression helps you figure out the trend and make educated guesses. Let's explore how it works.

Understanding Linear Regression

Linear regression models the relationship between two variables by fitting a straight line to the observed data. The simplest form is simple linear regression, where we have one independent variable (input) and one dependent variable (output).

Real-Life Example

Let's say you have the following data on hours studied and the corresponding test scores:

  • Hours studied: [1, 2, 3, 4, 5]
  • Test scores: [2, 4, 5, 4, 5]

Our goal is to predict the test score for studying 6 hours. We'll start by visualizing the data:
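
A minimal sketch of such a plot, assuming matplotlib is installed. The variable names, colors, and labels are illustrative choices:

```python
import matplotlib.pyplot as plt

hours_studied = [1, 2, 3, 4, 5]
test_scores = [2, 4, 5, 4, 5]

# Draw one point per (hours, score) pair.
fig, ax = plt.subplots()
ax.scatter(hours_studied, test_scores, color="blue", label="Observed scores")
ax.set_xlabel("Hours studied")
ax.set_ylabel("Test score")
ax.set_title("Hours studied vs. test scores")
ax.legend()
plt.show()
```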

The scatter plot shows the relationship between hours studied and test scores. Now, let's introduce a line to approximate this relationship.

Plotting Multiple Lines

The general formula for a line is $y = mx + c$, where:

  • $y$ is the dependent variable (the output we predict, like sales).
  • $x$ is the independent variable (the input, like days).
  • $m$ (slope) determines the line's steepness.
  • $c$ (intercept) is where the line crosses the y-axis.
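
To get a feel for how the slope and intercept change a line, here is a minimal sketch that draws a few candidate lines over the study data. The three (m, c) pairs are arbitrary values chosen for illustration, not fitted results, and matplotlib is assumed to be installed:

```python
import matplotlib.pyplot as plt

hours_studied = [1, 2, 3, 4, 5]
test_scores = [2, 4, 5, 4, 5]


def line(x, m, c):
    """Evaluate y = m * x + c for a single x value."""
    return m * x + c


# Hypothetical (slope, intercept) pairs, chosen only for illustration.
candidates = [(0.5, 2.0), (1.0, 1.0), (0.8, 2.5)]

fig, ax = plt.subplots()
ax.scatter(hours_studied, test_scores, color="blue", label="Observed scores")
for m, c in candidates:
    ys = [line(x, m, c) for x in hours_studied]
    ax.plot(hours_studied, ys, label=f"y = {m}x + {c}")
ax.set_xlabel("Hours studied")
ax.set_ylabel("Test score")
ax.legend()
plt.show()
```

Some of these lines pass closer to the points than others, which is the motivation for looking for a single best-fit line.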

Using scikit-learn to Calculate the Best-Fit Line

The best-fit line minimizes the error between the observed data points and the predicted values. We can use scikit-learn, a powerful machine learning library in Python, to find this line easily.

Let's calculate this in Python:
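
A minimal sketch of this calculation, assuming NumPy and scikit-learn are installed and using the hours-studied data from above. The variable names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# scikit-learn expects a 2D feature array: one row per sample, one column per feature.
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2, 4, 5, 4, 5])

# Create the model and fit it to the data.
model = LinearRegression()
model.fit(X, y)

slope = model.coef_[0]
intercept = model.intercept_

print(f"Slope: {slope:.2f}")          # Slope: 0.60
print(f"Intercept: {intercept:.2f}")  # Intercept: 2.20
```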

This code computes the slope and intercept of the best-fit line through the data points using scikit-learn. You will learn more about scikit-learn when you start exploring machine learning. For now, let's quickly review the code:

  • LinearRegression() creates a linear regression model, which is capable of learning from the data and making predictions.
  • The .fit(X, y) method trains the model, finding the coefficients of the best-fit line.
  • .coef_[0] obtains the slope of the best-fit line. We need [0] because a model can have several input variables, so the model's .coef_ is an array with one coefficient per input. With a single input, as here, the array holds exactly one coefficient, and [0] retrieves it.
  • .intercept_ obtains the intercept of the best-fit line.
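
For a single input variable, the least-squares slope and intercept also have a closed form, which gives a way to sanity-check the fitted values: $m = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2$ and $c = \bar{y} - m\bar{x}$. A minimal sketch in plain Python, with illustrative variable names:

```python
hours_studied = [1, 2, 3, 4, 5]
test_scores = [2, 4, 5, 4, 5]

n = len(hours_studied)
mean_x = sum(hours_studied) / n
mean_y = sum(test_scores) / n

# Slope: co-variation of x and y divided by the variation of x.
numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours_studied, test_scores))
denominator = sum((x - mean_x) ** 2 for x in hours_studied)
slope = numerator / denominator

# Intercept: the least-squares line passes through the point (mean_x, mean_y).
intercept = mean_y - slope * mean_x

print(round(slope, 2), round(intercept, 2))  # 0.6 2.2
```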

Plot the Line with the Data

Now that we have the slope and intercept, let's plot the best-fit line on the original data:
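
A minimal plotting sketch, assuming matplotlib is installed. The slope 0.6 and intercept 2.2 are the best-fit values for this data, written in directly so the snippet runs on its own:

```python
import matplotlib.pyplot as plt

hours_studied = [1, 2, 3, 4, 5]
test_scores = [2, 4, 5, 4, 5]

# Slope and intercept of the best-fit line for this data.
slope, intercept = 0.6, 2.2

# Evaluate y = slope * x + intercept at each observed x.
fitted_scores = [slope * x + intercept for x in hours_studied]

fig, ax = plt.subplots()
ax.scatter(hours_studied, test_scores, color="blue", label="Observed scores")
ax.plot(hours_studied, fitted_scores, color="red", label="Best-fit line")
ax.set_xlabel("Hours studied")
ax.set_ylabel("Test score")
ax.legend()
plt.show()
```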

This is the best-fit line: it minimizes the sum of squared vertical distances between the line and the data points.
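
One common way to measure how well a line fits is the sum of squared vertical distances between the data points and the line, which is the quantity ordinary least squares minimizes. The sketch below compares the best-fit line against two alternatives. The alternative (m, c) pairs are hypothetical values picked only for comparison:

```python
hours_studied = [1, 2, 3, 4, 5]
test_scores = [2, 4, 5, 4, 5]


def sum_squared_error(m, c):
    """Sum of squared vertical distances between the data and y = m * x + c."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(hours_studied, test_scores))


best_fit = sum_squared_error(0.6, 2.2)
# Hypothetical alternatives, chosen only for comparison.
steeper = sum_squared_error(1.0, 1.0)
flatter = sum_squared_error(0.5, 2.0)

print(f"{best_fit:.2f} {steeper:.2f} {flatter:.2f}")  # 2.40 4.00 3.75
```

The best-fit line has the smallest error of the three, as expected.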

Making Predictions for New Values

With the best-fit line equation $y = 0.6x + 2.2$, we can predict new values. Let's predict the test score for 6 hours of study:

$$y(6) = 0.6 \cdot 6 + 2.2 = 3.6 + 2.2 = 5.8$$
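
The same prediction can be obtained from a fitted model with its predict method. A minimal sketch, assuming NumPy and scikit-learn are installed and refitting the model so the snippet is self-contained:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2, 4, 5, 4, 5])

model = LinearRegression().fit(X, y)

# predict expects a 2D array, so a single input of 6 hours is passed as [[6]].
predicted_score = model.predict(np.array([[6]]))[0]
print(f"Predicted score for 6 hours: {predicted_score:.1f}")  # 5.8
```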

Lesson Summary

Fantastic! You've learned the basics of linear regression and how to compute a best-fit line in Python using scikit-learn. We've explored predicting a test score based on hours studied by calculating and plotting the best-fit line.

Now it's time to put this knowledge into practice. In the next session, you'll implement linear regression on a new dataset and make predictions. Let's dive into those exercises and solidify your understanding!
