Now that you have learned how to reliably evaluate your models using cross-validation, it is time to take the next step in building robust machine learning workflows. In real-world projects, you rarely use raw data directly with your models. Instead, you often need to preprocess your data — such as scaling features or encoding categories — before fitting a model. If you tune your model’s hyperparameters without including these preprocessing steps, you risk introducing data leakage or making your workflow harder to manage.
This is where pipelines come in. A pipeline in scikit-learn allows you to chain together multiple steps, such as preprocessing and modeling, into a single object. This makes your code cleaner, helps prevent mistakes, and ensures that all steps are included during model evaluation and hyperparameter tuning. In this lesson, you will learn how to build a pipeline and tune its hyperparameters using `GridSearchCV`, so you can optimize your entire workflow, not just the model itself.
A pipeline in scikit-learn is a simple way to bundle preprocessing and modeling steps together. For example, you might want to scale your features before passing them to a Support Vector Machine (SVM) classifier. Instead of doing this in separate steps, you can use a pipeline to connect them.
Here is one way to create a pipeline that first scales the data using `StandardScaler` and then fits an SVM model:
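```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Each step is a (name, estimator) pair; the names are reused
# later when referring to each step's hyperparameters.
pipeline = Pipeline([
    ("scaler", StandardScaler()),  # step 1: standardize the features
    ("svm", SVC()),                # step 2: fit the SVM classifier
])
```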
In this code, the pipeline has two steps. The first step, named `scaler`, uses `StandardScaler` to standardize the features. The second step, named `svm`, uses the `SVC` model. Naming each step is important because you will use these names when tuning hyperparameters later. With this setup, every time you fit the pipeline, it will first scale the data and then train the SVM, making your workflow both simple and reliable.
Once you have a pipeline, you can tune its hyperparameters just as you would with a single model. However, when you want to tune parameters inside a pipeline, you need to tell `GridSearchCV` which step the parameter belongs to. This is done using the double underscore (`__`) syntax. For example, if you want to tune the `C` parameter of the SVM step, you would use `svm__C`.
Here is one way to set up a parameter grid for the pipeline (the specific `C` values below are illustrative):
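```python
# Keys follow the "<step name>__<parameter name>" pattern,
# so "svm__C" targets the C parameter of the "svm" step.
param_grid = {
    "svm__C": [0.1, 1, 10],
    "svm__kernel": ["linear", "rbf"],
}
```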
This grid tells `GridSearchCV` to try different values for the `C` parameter and to test both the `linear` and `rbf` kernels for the SVM. The `kernel` parameter in `SVC` determines the type of decision boundary the SVM will use. Common options include `'linear'` (a straight line or hyperplane), `'rbf'` (radial basis function, which allows for more flexible, curved boundaries), and `'poly'` (polynomial, which can model even more complex relationships). By tuning `kernel`, you can control how the SVM separates the classes in your data, which can have a big impact on model performance depending on the underlying patterns in your dataset. The double underscore connects the step name (`svm`) with the parameter name (`C` or `kernel`). You can add more parameters from other steps in the pipeline using the same pattern, as shown in the sketch below.
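For instance, `StandardScaler` has a `with_mean` option, so a hypothetical extension of the grid could tune it through the `scaler` step:

```python
# Hypothetical extension: parameters from two different steps in one grid.
param_grid = {
    "svm__C": [0.1, 1, 10],
    "svm__kernel": ["linear", "rbf"],
    "scaler__with_mean": [True, False],  # targets the "scaler" step
}
```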
Let’s put everything together with a complete example. Suppose you have your training data in `X_train` and `y_train`. You want to scale your features and train an SVM, but you are not sure which kernel or value of `C` will work best. Here is how you can use a pipeline and `GridSearchCV` to find the best settings (the candidate `C` values and the 5-fold cross-validation are example choices):
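```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Bundle scaling and the SVM into one estimator.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC()),
])

param_grid = {
    "svm__C": [0.1, 1, 10],           # illustrative candidate values
    "svm__kernel": ["linear", "rbf"],
}

# cv=5 (5-fold cross-validation) is an example choice. Within each fold,
# the scaler is fit on the training portion only, so there is no leakage.
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)
```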
When you run this code, `GridSearchCV` will automatically scale the data and try each combination of `C` and `kernel` for the SVM. For example, the output might look something like this (the exact score will vary with your data):
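```
Best parameters: {'svm__C': 1, 'svm__kernel': 'rbf'}
Best cross-validation score: 0.97
```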
This means that the best results were achieved using an SVM with `C=1` and the `rbf` kernel, after scaling the features. By using a pipeline, you ensured that scaling was always applied in the same way during both training and cross-validation, which helps prevent data leakage and ensures fair model evaluation.
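Because `GridSearchCV` refits the best pipeline on the full training set by default (`refit=True`), you can also use the fitted search object directly for prediction, and the same scaling is applied to new data automatically (here `X_test` is a hypothetical held-out set):

```python
# Predict with the refit best pipeline; X_test is a hypothetical test set.
y_pred = grid_search.predict(X_test)
```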
In this lesson, you learned how to build a pipeline in scikit-learn to combine preprocessing and modeling steps, and how to tune the hyperparameters of the entire pipeline using `GridSearchCV`. This approach helps you write cleaner code, avoid common mistakes, and get the most out of your models by optimizing both preprocessing and model parameters together.
In the upcoming practice exercises, you will get hands-on experience building and tuning your own pipelines. You will try different models, preprocessing steps, and parameter grids to see how pipeline tuning can improve your results. Take your time to experiment and see how changing different parts of the pipeline affects your model’s performance. Good luck!
