Introduction

Welcome to the second lesson in our "Automating Retraining with Apache Airflow" course! In the previous lesson, we introduced Apache Airflow and built a simple "hello world" DAG. Now that we understand the basics, we're ready to take a significant step forward and explore how to structure more complex workflows specifically designed for machine learning pipelines.

Machine learning workflows typically involve several distinct stages, from data extraction to model deployment. In this lesson, we'll build a complete ML pipeline using Airflow's TaskFlow API, learning how to orchestrate the various steps needed to train and deploy a model. We'll see how Airflow can help us manage dependencies between tasks, pass data between steps, and implement conditional logic based on model performance.

By the end of this lesson, you'll understand how to design a functional DAG that represents an end-to-end ML training pipeline, giving you a solid foundation for implementing actual ML code in future lessons.

Understanding ML Workflows in Airflow

Before diving into code, let's consider what an ML pipeline typically involves and how we can represent it in Airflow. A standard machine learning workflow often includes these key stages:

  1. Data extraction: Collecting data from various sources.
  2. Data transformation: Cleaning, preprocessing, and feature engineering.
  3. Model training: Using the prepared data to train a machine learning model.
  4. Model validation: Evaluating the model's performance against a validation set.
  5. Model deployment: Deploying the model to production (if it meets quality thresholds).

Each of these stages can be represented as a task in our Airflow DAG, with clear dependencies between them. For example, we can't train a model until the data has been extracted and transformed, nor can we deploy a model until it has been trained and evaluated. Airflow is particularly well-suited for ML workflows because it:

  • Handles task scheduling and retries automatically;
  • Provides visibility into the execution of each step;
  • Allows us to pass data between tasks;
  • Enables conditional execution based on the results of previous tasks;
  • Maintains a record of runs for reproducibility and auditing.

Designing an ML Pipeline DAG

Let's start designing our ML pipeline by setting up the basic DAG structure:

This code should look familiar from our first lesson. We import the necessary modules and set up default arguments. Notice that we've increased the retry_delay to 5 minutes, which is more appropriate for ML tasks that might take longer to execute than simple examples.

Now, let's define our DAG using the @dag decorator:

We've named our DAG mlops_pipeline and set it to run daily. While scheduling a DAG to run at regular intervals (like daily) is common, Airflow also supports alternative triggering mechanisms: for example, you can use triggers to launch a DAG in response to external events, or sensors to wait for specific conditions (such as the arrival of a file or completion of another process) before starting the workflow. This flexibility allows you to tailor your ML pipeline's execution to your specific operational requirements.

The catchup=False parameter ensures that Airflow won't execute the DAG for past dates, which is important for ML pipelines where we typically only want to train with the most recent data. Finally, the tags help categorize our DAG in the Airflow UI, making it easier to find among many workflows.

Implementing Data Extraction and Transformation

Now let's implement the first two tasks of our ML pipeline:

This task simulates extracting data from a source system. In a real ML pipeline, this might involve querying a database, accessing an API, or reading files from a data lake. The task returns a dictionary with metadata about the extraction process, including a flag indicating success and the number of records extracted. This information will be useful for downstream tasks.

Next, let's implement the transformation task:

The transform_data task takes the result from the extraction task as an input parameter, demonstrating how the TaskFlow API makes it easy to pass data between tasks. The task checks if extraction was successful, performs its transformation (simulated in this case), and returns information about the features generated. Note how we're raising an exception if extraction failed — this ensures that our pipeline will fail appropriately if a critical step doesn't complete successfully.

Building Model Training and Validation Tasks

With our data prepared, let's implement the next stages of our ML pipeline:

The train_model task simulates training a machine learning model using the transformed data. In a real pipeline, this is where you would implement your model training code using frameworks like scikit-learn, TensorFlow, or PyTorch. Our simulated task returns a dictionary with the training results, including an accuracy metric that will be used for model validation.

Now for the validation task:

The validate_model task evaluates the model's performance against a predefined threshold (0.8 in this case). Rather than failing the task if the model doesn't meet the threshold, it returns a status indicating whether validation passed or failed. This approach gives us flexibility in how we handle underperforming models — we might choose to deploy them anyway with monitoring, or we might prevent deployment entirely.

Note: in general, model validation could be more complex and involve different metrics than simple accuracy. We'll delve into this topic in more detail in a later lesson in the course.

Implementing Conditional Deployment Logic

The final task in our ML pipeline is model deployment, which should only happen if the model passes validation:

This task introduces an important concept: the trigger_rule parameter. By setting trigger_rule="none_failed", we tell Airflow to run this task as long as no upstream task has failed (i.e., raised an exception), even if some were skipped. In particular, the deployment task still runs when the validation task returns a result indicating the model isn't good enough to deploy; the logic inside the task then decides whether to actually deploy the model based on that validation result.

Inside the task, we check the validation result and either deploy the model or log the failure. In both cases, the task itself succeeds, but the return value indicates whether deployment actually occurred. This pattern is useful for conditional logic that doesn't necessarily represent a failure of the pipeline itself.

Defining Task Dependencies

Now that we've defined all our tasks, we need to establish the relationships between them:

When we call extract_data(), it returns a value that we pass to transform_data(), creating a dependency between them. This pattern continues through the entire workflow, producing a linear sequence of tasks.

The final line training_pipeline() instantiates our DAG object, making it discoverable by the Airflow scheduler. This is a critical step — without it, our DAG won't be registered with Airflow.

Conclusion and Next Steps

In this lesson, you've built a complete ML pipeline using Apache Airflow's TaskFlow API. You've learned how to structure a DAG for machine learning workflows, create tasks that represent each stage of an ML pipeline, pass data between tasks, implement conditional logic, and use trigger rules to control task execution. These concepts form the foundation of production-ready ML pipelines that can reliably train and deploy models on a schedule.

In the upcoming practice exercises, you'll have the opportunity to apply these concepts by creating your own ML pipeline DAG. You'll experiment with different task configurations, implement conditional logic, and explore how Airflow can help make your ML workflows more robust and manageable. This hands-on experience will solidify your understanding of how Airflow can serve as a powerful orchestration tool for machine learning operations.
