Welcome! This lesson paves your path toward understanding machine learning and the powerful Python library, sklearn.
Machine learning, an application of artificial intelligence, enables systems to learn and improve without being explicitly programmed. It plays a key role in various sectors, such as autonomous vehicles, voice recognition systems, and recommendation engines.
As an illustration, suppose you aim to predict housing prices. This is a standard supervised learning problem in which you train your model on past data. With sklearn, you can import the data, preprocess it, select an algorithm (like linear regression), train the model on the training data, and make predictions, all without manually implementing the algorithms.
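To make that concrete, here is a minimal sketch of such a workflow; the house sizes and prices below are made-up numbers used purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: house sizes (square meters) and sale prices
sizes = np.array([[50], [80], [120], [200]])              # feature matrix (one feature)
prices = np.array([150_000, 230_000, 330_000, 520_000])   # target values

model = LinearRegression()
model.fit(sizes, prices)        # train the model on past data
print(model.predict([[100]]))   # predict the price of a 100 m^2 house
```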
Datasets form the backbone of machine learning. In this course, we'll use the Iris dataset, which consists of measurements — namely, sepal length, sepal width, petal length, and petal width — for 150 flowers representing three species of iris.
Sklearn provides an easy-to-use load_iris function to import the Iris dataset. Let's see how it works:
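A minimal version of that code (the variable names iris, X, and y are simply our choices) might look like this:

```python
from sklearn.datasets import load_iris

iris = load_iris()    # loads the dataset as a Bunch object
X = iris.data         # feature matrix: the four measurements per flower
y = iris.target       # target vector: species encoded as 0, 1, or 2
```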
Here, the load_iris() function loads the dataset and assigns it to the iris variable. We then separate the dataset into X for the features and y for the target.
Furthermore, you can print the description of the dataset for more detailed insight using the DESCR attribute as follows:
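For instance, assuming iris has been loaded as above:

```python
# Print the full, human-readable description bundled with the dataset
print(iris.DESCR)
```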
Running this prints a detailed description of the dataset and its attributes.
After loading the data, let's delve into how Python and sklearn enable us to explore it. 'Features' and 'Target' are two critical terms related to the dataset. Here, 'Features' refer to the attributes of the Iris flower: sepal length, sepal width, petal length, and petal width. 'Target', on the other hand, refers to the species of the Iris flower, which we aim to predict based on these features.
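If you want to see these names directly, the loaded iris object also exposes them (the output shown in the comments is abridged):

```python
print(iris.feature_names)  # e.g. ['sepal length (cm)', 'sepal width (cm)', ...]
print(iris.target_names)   # e.g. ['setosa' 'versicolor' 'virginica']
```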
The data and target attributes of the iris object hold the feature matrix and the response vector, respectively. The shape property gives information about their dimensionality: how many examples we have and how many features each example consists of.
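A short sketch of that inspection, continuing with the X and y defined earlier:

```python
print(X.shape)  # (150, 4): 150 examples, each with 4 features
print(y.shape)  # (150,): one species label per example
```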
The output shows that X contains 150 examples with 4 features each, while y holds the 150 corresponding species labels.
Before feeding our data to the machine learning model, we must split it into a training set and a test set. The training set teaches our model, while the test set evaluates its performance. Sklearn allows for a convenient split using the train_test_split function from the model_selection module.
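A typical call looks like the following sketch; the 80/20 split matches the lesson, while random_state=42 is just an arbitrary seed we picked so the split is reproducible:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
```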
Here, the train_test_split function has divided our data into a training set containing 80% of the original samples and a test set with the remaining 20%, which for the Iris dataset means 120 training examples and 30 test examples.
Each machine learning model in sklearn is represented as a Python class. These classes offer an interface that includes methods for building the model (fit), making predictions (predict), and evaluating the model's performance (score).
In the next, more concrete lesson, you'll see how to apply these methods after selecting a specific type of machine learning model. For now, understand that the procedure for using these models looks something like the following:
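As a rough sketch, with LogisticRegression standing in as a placeholder for whichever model you eventually choose:

```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=200)   # 1. create the model object
model.fit(X_train, y_train)                # 2. build (train) the model on the training data
predictions = model.predict(X_test)        # 3. make predictions on unseen data
accuracy = model.score(X_test, y_test)     # 4. evaluate performance (mean accuracy for classifiers)
print(accuracy)
```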
Congratulations! With the knowledge acquired from this lesson, you now understand what sklearn is, how to import data using it, the process of preparing data for machine learning tasks, and the basic structure of sklearn models. The upcoming lessons will build upon this foundation by introducing more specific machine learning models and optimization techniques. Keep practicing and continue learning as we take our first steps into the exciting world of machine learning. Keep up the good work!
