Now that you have learned how to reliably evaluate your models using cross-validation, it is time to take the next step in building robust machine learning workflows. In real-world projects, you rarely use raw data directly with your models. Instead, you often need to preprocess your data — such as scaling features or encoding categories — before fitting a model. If you tune your model’s hyperparameters without including these preprocessing steps, you risk introducing data leakage or making your workflow harder to manage.
This is where pipelines come in. A pipeline in scikit-learn allows you to chain together multiple steps, such as preprocessing and modeling, into a single object. This makes your code cleaner, helps prevent mistakes, and ensures that all steps are included during model evaluation and hyperparameter tuning. In this lesson, you will learn how to build a pipeline and tune its hyperparameters using `GridSearchCV`, so you can optimize your entire workflow, not just the model itself.
A pipeline in scikit-learn is a simple way to bundle preprocessing and modeling steps together. For example, you might want to scale your features before passing them to a Support Vector Machine (SVM) classifier. Instead of doing this in separate steps, you can use a pipeline to connect them.
Here is one way to create a pipeline that first scales the data using `StandardScaler` and then fits an SVM model:
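```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Each step is a (name, estimator) pair; the names are reused
# later when referring to each step's hyperparameters.
pipeline = Pipeline([
    ("scaler", StandardScaler()),  # step 1: standardize the features
    ("svm", SVC()),                # step 2: fit the SVM classifier
])
```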
In this code, the pipeline has two steps. The first step, named `scaler`, uses `StandardScaler` to standardize the features. The second step, named `svm`, uses the `SVC` model. Naming each step is important because you will use these names when tuning hyperparameters later. With this setup, every time you fit the pipeline, it will first scale the data and then train the SVM, making your workflow both simple and reliable.
Once you have a pipeline, you can tune its hyperparameters just as you would with a single model. However, when you want to tune parameters inside a pipeline, you need to tell `GridSearchCV` which step the parameter belongs to. This is done using the double underscore (`__`) syntax. For example, if you want to tune the `C` parameter of the SVM step, you would use `svm__C`.
Here is one way to set up a parameter grid for the pipeline (the specific `C` values below are illustrative):
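```python
# Keys follow the "<step name>__<parameter name>" pattern,
# so "svm__C" targets the C parameter of the "svm" step.
param_grid = {
    "svm__C": [0.1, 1, 10],
    "svm__kernel": ["linear", "rbf"],
}
```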
This grid tells `GridSearchCV` to try different values for the `C` parameter and to test both the `linear` and `rbf` kernels for the SVM. The `kernel` parameter in `SVC` determines the type of decision boundary the SVM will use. Common options include `'linear'` (a straight line or hyperplane), `'rbf'` (radial basis function, which allows for more flexible, curved boundaries), and `'poly'` (polynomial, which can model even more complex relationships). By tuning `kernel`, you can control how the SVM separates the classes in your data, which can have a big impact on model performance depending on the underlying patterns in your dataset. The double underscore connects the step name (`svm`) with the parameter name (`C` or `kernel`). You can add more parameters from other steps in the pipeline using the same pattern, as shown in the sketch below.
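For instance, `StandardScaler` has a `with_mean` option, so a hypothetical extension of the grid could tune it through the `scaler` step:

```python
# Hypothetical extension: parameters from two different steps in one grid.
param_grid = {
    "svm__C": [0.1, 1, 10],
    "svm__kernel": ["linear", "rbf"],
    "scaler__with_mean": [True, False],  # targets the "scaler" step
}
```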
Let’s put everything together with a complete example. Suppose you have your training data in `X_train` and `y_train`. You want to scale your features and train an SVM, but you are not sure which kernel or value of `C` will work best. Here is how you can use a pipeline and `GridSearchCV` to find the best settings (the candidate `C` values and the 5-fold cross-validation are example choices):
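```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Bundle scaling and the SVM into one estimator.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", SVC()),
])

param_grid = {
    "svm__C": [0.1, 1, 10],           # illustrative candidate values
    "svm__kernel": ["linear", "rbf"],
}

# cv=5 (5-fold cross-validation) is an example choice. Within each fold,
# the scaler is fit on the training portion only, so there is no leakage.
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)
```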
When you run this code, `GridSearchCV` will automatically scale the data and try each combination of `C` and `kernel` for the SVM. For example, the output might look something like this (the exact score will vary with your data):
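```
Best parameters: {'svm__C': 1, 'svm__kernel': 'rbf'}
Best cross-validation score: 0.97
```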
This means that the best results were achieved using an SVM with `C=1` and the `rbf` kernel, after scaling the features. By using a pipeline, you ensured that scaling was always applied in the same way during both training and cross-validation, which helps prevent data leakage and ensures fair model evaluation.
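Because `GridSearchCV` refits the best pipeline on the full training set by default (`refit=True`), you can also use the fitted search object directly for prediction, and the same scaling is applied to new data automatically (here `X_test` is a hypothetical held-out set):

```python
# Predict with the refit best pipeline; X_test is a hypothetical test set.
y_pred = grid_search.predict(X_test)
```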
In this lesson, you learned how to build a pipeline in scikit-learn to combine preprocessing and modeling steps, and how to tune the hyperparameters of the entire pipeline using `GridSearchCV`. This approach helps you write cleaner code, avoid common mistakes, and get the most out of your models by optimizing both preprocessing and model parameters together.
In the upcoming practice exercises, you will get hands-on experience building and tuning your own pipelines. You will try different models, preprocessing steps, and parameter grids to see how pipeline tuning can improve your results. Take your time to experiment and see how changing different parts of the pipeline affects your model’s performance. Good luck!
