Machine Learning and Sklearn: An Introduction

Welcome! This lesson introduces machine learning and scikit-learn, the powerful Python library imported as sklearn. Machine learning, an application of artificial intelligence, enables systems to learn from data and improve without being explicitly programmed. It powers applications such as autonomous vehicles, voice recognition systems, and recommendation engines.

As an illustration, suppose you want to predict housing prices. This is a standard supervised learning problem: you train a model on past data for which the prices are already known. With sklearn, you can import the data, preprocess it, select an algorithm (such as linear regression), train the model on the training data, and make predictions, all without implementing the algorithms yourself.
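The workflow above can be sketched in a few lines. This is only an illustration under stated assumptions: the data is synthetic (generated with make_regression as a hypothetical stand-in for real housing records), and the scaling step, the 80/20 split, and the random_state values are choices made for this sketch rather than part of the lesson.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 1. "Import" the data: 200 synthetic rows with 3 numeric features each.
#    (Hypothetical stand-in for housing data; not a real dataset.)
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

# 2. Hold out 20% of the rows to check predictions on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 3. Preprocess (scale the features) and select an algorithm (linear regression).
model = make_pipeline(StandardScaler(), LinearRegression())

# 4. Train on the past data, then predict for the held-out rows.
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print(predictions.shape)  # (40,)
```

Each step maps onto one or two library calls, which is the point of the paragraph above: none of the underlying algorithms has to be written by hand.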

Importing the Iris Dataset

Datasets form the backbone of machine learning. In this course, we'll use the Iris dataset, which contains four measurements (sepal length, sepal width, petal length, and petal width) for 150 flowers, 50 from each of three species of iris.

Sklearn provides an easy-to-use load_iris function to import the Iris dataset. Let's see how it works:
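A minimal sketch of this loading step, assuming scikit-learn is installed. The variable names iris, X, and y follow the ones used in this lesson's text.

```python
from sklearn.datasets import load_iris

# Load the bundled Iris dataset; no download is needed.
iris = load_iris()

# Separate the feature matrix (X) from the target vector (y).
X = iris.data
y = iris.target
```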

Here, the load_iris() function returns the dataset, which we assign to the iris variable. We then separate the dataset into X, the features, and y, the target.

Furthermore, you can print the description of the dataset for more detailed insight using the DESCR attribute as follows:
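A short sketch of that call, reloading the dataset so the snippet runs on its own:

```python
from sklearn.datasets import load_iris

iris = load_iris()

# DESCR is a plain string holding the dataset's documentation.
print(iris.DESCR)
```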

This code prints a detailed description of the dataset and its attributes.

Exploring Sklearn's Functionality

With the data loaded, let's look at how Python and sklearn let us explore it. 'Features' and 'Target' are two critical terms related to the dataset. Here, 'Features' refers to the measured attributes of each Iris flower: sepal length, sepal width, petal length, and petal width. 'Target', on the other hand, refers to the species of the Iris flower, which we aim to predict based on these features.
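To see these names in the dataset itself, the object returned by load_iris exposes feature_names and target_names attributes. A brief sketch:

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Names of the four feature columns, in column order.
print(iris.feature_names)

# Names of the three species; the integers in iris.target index into this array.
print(iris.target_names)
```

The target vector stores each species as an integer (0, 1, or 2), so target_names is what maps a prediction back to a readable species name.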

The data and target attributes of the iris object hold the feature matrix and the response vector, respectively. Their shape attribute reports their dimensions: how many examples we have and how many features each example consists of. For Iris, the feature matrix has shape (150, 4) and the response vector has shape (150,).
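A small sketch that inspects both shapes, using the same X and y names as earlier in the lesson:

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data    # feature matrix
y = iris.target  # response vector

print(X.shape)  # (150, 4): 150 examples, 4 features each
print(y.shape)  # (150,): one label per example
```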

Preparing Dataset for Model Training

Before feeding our data to a machine learning model, we must split it into a training set and a test set. The model learns from the training set, while the test set, which the model never sees during training, is used to evaluate its performance. Sklearn makes this split convenient with the train_test_split function from the model_selection module.
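A sketch of the split, assuming an 80/20 ratio via test_size=0.2. The random_state value is an arbitrary choice made here so the shuffle is reproducible; it is not prescribed by the lesson.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# Shuffle the rows and hold out 20% of them as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
```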

Here, the train_test_split function has divided our data into a training set containing 80% of the original data (120 examples) and a test set containing the remaining 20% (30 examples).

Quick Look at Sklearn's Model Structure

Each machine learning model in sklearn is represented as a Python class. These classes share a common interface that includes methods for training the model (fit), making predictions (predict), and evaluating the model's performance (score).

In the next, more concrete lesson, you'll see how to apply these methods after selecting a specific type of machine learning model. For now, it is enough to know the general procedure: create the model, fit it on the training data, predict on new data, and score the results.
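The general pattern can be sketched as below. KNeighborsClassifier is used only as a stand-in so the snippet runs; the lesson does not prescribe a particular model, and the n_neighbors and random_state values are arbitrary choices for this sketch.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Create the model (any sklearn classifier exposes the same three methods).
model = KNeighborsClassifier(n_neighbors=3)

# fit: learn from the training data.
model.fit(X_train, y_train)

# predict: produce one predicted species label per test example.
predictions = model.predict(X_test)

# score: for classifiers, the mean accuracy on the given data.
accuracy = model.score(X_test, y_test)

print(predictions.shape)  # (30,)
print(accuracy)
```

Swapping in a different model class would change only the line that creates the model; the fit, predict, and score calls stay the same.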

Congratulations! You now understand what sklearn is, how to import data with it, how to prepare data for machine learning tasks, and the basic structure of sklearn models. The upcoming lessons will build on this foundation by introducing more specific machine learning models and optimization techniques. Keep practicing as we take these first steps into the exciting world of machine learning!
