Introduction

Welcome to the second lesson of our "The Python Data Model & Protocols" course! In the first lesson, we explored how dunder methods make custom objects integrate seamlessly with Python's core features. We built a Money class that behaved like a native Python type through the proper implementation of __eq__, __hash__, __repr__, and other special methods. Now we're ready to tackle another fundamental aspect of Python's data model: iteration protocols and generators.

Today, we'll discover how Python handles iteration behind the scenes and learn to build memory-efficient streaming pipelines using generators. Instead of loading entire datasets into memory, we'll process data lazily, one item at a time, using Python's elegant iteration machinery.

We'll construct a practical CSV processing system that demonstrates these concepts in action. By the end of this lesson, you'll understand how to build composable, streaming data transformations that can handle massive datasets without overwhelming your system's memory. This knowledge forms a crucial foundation for the advanced patterns we'll explore in later courses.

Understanding Python's Iteration Protocols

Before diving into generators, let's understand what makes iteration work in Python. When we write for item in container, Python follows a specific protocol to retrieve items one by one. This protocol is the backbone of Python's iteration system, and understanding it helps us create more efficient and Pythonic code.

Python recognizes two main approaches to iteration:

  1. The iterator protocol requires an object to implement __iter__() (which returns the iterator itself, conventionally return self) and __next__() (which returns the next item or raises StopIteration once the sequence is exhausted). For example, file objects implement this protocol.
  2. The iterable protocol only requires __iter__(), which can return any iterator. Most built-in collections like lists, tuples, and dictionaries are iterables.
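To make the two protocols concrete, here is a minimal sketch of what a for loop does under the hood (the names numbers and iterator are purely illustrative):

```python
# A for-loop desugars to the iterator protocol: one iter() call,
# then repeated next() calls until StopIteration is raised.
numbers = [10, 20, 30]

iterator = iter(numbers)   # calls numbers.__iter__()
print(next(iterator))      # calls iterator.__next__() -> 10
print(next(iterator))      # -> 20
print(next(iterator))      # -> 30

try:
    next(iterator)         # the iterator is now exhausted
except StopIteration:
    print("done")
```

Note that the list itself is an iterable (it has __iter__), while the object returned by iter() is the iterator that tracks position and implements __next__.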

The beauty of this system lies in its lazy evaluation: iterators produce items on demand rather than creating them all upfront. This means we can process potentially infinite sequences or massive datasets without exhausting our system's memory.
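As a quick illustration of lazy evaluation over a potentially infinite sequence (this sketch uses itertools, which is not part of the lesson's pipeline):

```python
import itertools

# itertools.count() is an infinite iterator; nothing is materialized here.
evens = (n * 2 for n in itertools.count())  # lazy generator expression

# Only the five items we actually ask for are ever computed.
first_five = list(itertools.islice(evens, 5))
print(first_five)  # [0, 2, 4, 6, 8]
```

Without laziness, "all even numbers" could never fit in memory; with it, we pay only for what we consume.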

Generators: Lazy Evaluation in Action

Generators are Python's most elegant way to create iterators. They're functions that use the yield keyword instead of return, and they automatically implement the iteration protocol for us. When a generator function is called, it returns a generator object that produces values lazily as we iterate over it.
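A minimal sketch showing these mechanics (countdown is an illustrative name, not part of the lesson's pipeline):

```python
def countdown(n):
    """Yield n, n-1, ..., 1 lazily."""
    while n > 0:
        yield n          # execution pauses here between items
        n -= 1

gen = countdown(3)
print(type(gen).__name__)  # generator
print(list(gen))           # [3, 2, 1]
```

Calling countdown(3) runs none of the function body; each next() call resumes execution until the next yield, which is exactly the lazy behavior the pipeline below relies on.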

These imports set up our streaming CSV processing system. Notice the Iterator type from typing: it describes objects that implement the iteration protocol. The from __future__ import annotations line postpones the evaluation of annotations, so we can write modern type-hint syntax that still parses on older Python versions.
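The lesson's import section is not reproduced here; based on the description and the later pipeline steps, it plausibly looks something like this sketch (csv, json, Decimal, and Callable are assumptions inferred from what the pipeline does):

```python
from __future__ import annotations  # postponed annotation evaluation

import csv                          # streaming CSV reading
import json                         # JSON output formatting
from decimal import Decimal         # precise monetary arithmetic
from typing import Callable, Iterator
```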

Building a CSV Row Generator

Our first generator reads CSV files lazily, yielding one row at a time instead of loading the entire file into memory. This approach scales beautifully: whether processing 100 rows or 100 million rows, our memory usage remains constant.

This generator function opens a CSV file and yields normalized dictionaries row by row. The key insight is the yield statement: instead of accumulating rows in a list, we yield each row immediately. The file stays open throughout iteration, and Python's context manager ensures proper cleanup when iteration completes. The normalization step strips whitespace from string values, creating cleaner data for downstream processing.
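A sketch of such a generator, assuming it uses csv.DictReader and strips whitespace from every string value (the name read_rows is illustrative; the lesson's exact signature is not shown):

```python
import csv
from typing import Dict, Iterator


def read_rows(path: str) -> Iterator[Dict[str, str]]:
    """Lazily yield one normalized row dict per CSV line."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Strip stray whitespace so downstream stages see clean values.
            yield {key: value.strip() for key, value in row.items()}
```

The with block keeps the file open exactly as long as iteration runs; when the generator is exhausted or discarded, the context manager closes the file.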

Creating Streaming Filters

Our next component filters rows based on specific criteria while maintaining the streaming approach. This demonstrates how generators can consume other iterators and produce filtered results without materializing intermediate collections.

This generator accepts an iterator of rows and yields only those matching our criteria. Notice how we copy each row into a new dictionary: this prevents unexpected mutations when multiple pipeline stages share references to the same row objects. The function signature shows both input and output as iterators, emphasizing the streaming nature of our pipeline components.
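The filter stage described above might be sketched like this (filter_rows, column, and value are illustrative names):

```python
from typing import Dict, Iterator


def filter_rows(
    rows: Iterator[Dict[str, str]],
    column: str,
    value: str,
) -> Iterator[Dict[str, str]]:
    """Yield copies of the rows whose `column` equals `value`."""
    for row in rows:
        if row.get(column) == value:
            # Copy the row so later stages never mutate a shared dict.
            yield dict(row)
```

Because both the input and output are iterators, this stage can be dropped anywhere into a generator chain without changing its memory profile.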

Transforming Data with Map Generators

The final component of our pipeline transforms specific columns by applying functions to their values. This map-style operation demonstrates how generators can modify data while preserving the streaming approach.

This generator applies a transformation function to a specific column in each row. The func parameter makes this component highly reusable: we can pass different functions to convert strings to numbers, format dates, normalize text, or perform any other column-specific transformation. Again, we create new dictionaries to maintain immutability and prevent surprising side effects in our pipeline.
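A sketch of this map-style stage, under the same assumptions as above (map_column is an illustrative name):

```python
from typing import Callable, Dict, Iterator


def map_column(
    rows: Iterator[Dict[str, object]],
    column: str,
    func: Callable[[object], object],
) -> Iterator[Dict[str, object]]:
    """Yield new rows with `func` applied to one column's value."""
    for row in rows:
        new_row = dict(row)               # fresh dict: no shared mutation
        new_row[column] = func(new_row[column])
        yield new_row
```

Passing int, Decimal, str.upper, or any other callable as func turns this one generator into a whole family of column transformations.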

Composing a Processing Pipeline

Now we'll see how these generator functions work together to create a powerful streaming pipeline. The main execution demonstrates the composition pattern that makes generator-based processing so elegant and efficient.

This code creates a sample CSV file and builds our streaming pipeline step by step. Each assignment creates a new generator that wraps the previous one, forming a chain of transformations. Notice that no actual processing happens yet: we're just composing the pipeline. The real work begins when we start iterating over the final generator.
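A self-contained sketch of that composition step (the sample data is illustrative: the names besides Bob and Eva are assumptions, chosen only to match the lesson's description of three USD rows plus one EUR and one GBP row):

```python
import csv
from decimal import Decimal
from pathlib import Path

# Write an illustrative sample CSV (hypothetical names and amounts).
Path("transactions.csv").write_text(
    "name,amount,currency\n"
    "Alice, 120.50 ,USD\n"
    "Bob,75.00,EUR\n"
    "Carol,42.25,USD\n"
    "Dave,300.00,USD\n"
    "Eva,18.99,GBP\n"
)


def read_rows(path):
    """Lazily yield whitespace-stripped row dicts from a CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            yield {k: v.strip() for k, v in row.items()}


# Each assignment wraps the previous generator; no row has been read yet.
rows = read_rows("transactions.csv")
usd = (dict(r) for r in rows if r["currency"] == "USD")
typed = ({**r, "amount": Decimal(r["amount"])} for r in usd)

# Only now, as we iterate, does data actually flow through the chain.
for row in typed:
    print(row["name"], row["amount"])
```

Until the final for loop runs, building rows, usd, and typed costs essentially nothing: composing the pipeline and executing it are two separate moments.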

Seeing the Complete Pipeline in Action

The final step consumes our pipeline, processing each row and accumulating results. This demonstrates both the power and the important characteristics of generator-based processing.

This loop processes each row through our entire pipeline: reading from CSV, filtering for USD currency, converting types, formatting for output, and accumulating a total. The generator chain ensures that only one row exists in memory at any given time, making this approach suitable for processing arbitrarily large datasets.
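The consuming loop might be sketched as follows; the rows here are illustrative stand-ins for what the earlier pipeline stages would emit (the lesson's exact data is not shown):

```python
import json
from decimal import Decimal

# Illustrative rows, as if produced by the upstream generator chain.
pipeline = iter([
    {"name": "Alice", "amount": Decimal("120.50"), "currency": "USD"},
    {"name": "Carol", "amount": Decimal("42.25"), "currency": "USD"},
    {"name": "Dave", "amount": Decimal("300.00"), "currency": "USD"},
])

total = Decimal("0")
for row in pipeline:  # pulls one row at a time through the chain
    total += row["amount"]
    # Decimal is not JSON-serializable, so convert it for printing.
    printable = {**row, "amount": str(row["amount"])}
    print(json.dumps(printable))

print(f"Total: {total}")
```

Only the row currently in flight plus the running total live in memory, which is why the same loop works unchanged on a million-row file.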

The output shows our pipeline in action: three USD transactions are processed and printed as JSON, followed by their total. Notice how the pipeline filtered out Bob's EUR transaction and Eva's GBP transaction, processed only the USD entries, and correctly calculated the sum using Decimal arithmetic for precision.

Conclusion and Next Steps

We've explored Python's iteration protocols and built a powerful streaming data processing system using generators. The key insights from this lesson are the lazy evaluation principle, the composability of generator functions, and the memory efficiency of streaming approaches.

Our CSV processing pipeline demonstrates how generators enable us to build complex data transformations that scale gracefully. Whether processing thousands or millions of rows, our memory footprint remains constant because we only hold one row at a time. This lazy evaluation approach, combined with the clean composition pattern, creates systems that are both efficient and maintainable.

As you may recall from our previous lesson on dunder methods, Python's data model provides elegant protocols for common operations. Iteration is no exception: the iterator protocol gives us a uniform way to traverse any sequence, while generators make implementing custom iterators remarkably straightforward.

In the upcoming practice exercises, you'll implement your own generators from scratch, debug common iteration pitfalls such as exhausted iterators, and build streaming pipelines that transform data efficiently and elegantly. Keep learning!
