Introduction

Welcome to the first lesson of Python Concurrency & Async I/O! You've made tremendous progress through the previous courses in this learning path. You've mastered the Python data model, built sophisticated class machinery with descriptors and metaclasses, and explored functional patterns, including composable error handling with Result types. Now we're shifting gears to explore one of Python's most powerful yet misunderstood domains: concurrent and asynchronous programming.

Throughout this course, we'll explore Python's different concurrency models and learn to write high-performance, robust asynchronous code. We'll begin by understanding the Global Interpreter Lock and when to use threads versus processes. We'll then dive deep into the asyncio event loop, learning to build producer-consumer pipelines, manage task groups with structured concurrency, and implement backpressure and retry logic for resilient systems. This journey will equip you with the tools to build applications that handle multiple tasks efficiently, whether you're processing data in parallel, handling network requests, or orchestrating complex workflows.

Today's lesson focuses on the fundamentals: differentiating CPU-bound and I/O-bound workloads and choosing the right concurrency tool. We'll benchmark threads against processes to observe the Global Interpreter Lock's effect firsthand: threads shine for I/O-bound tasks, while processes deliver true parallelism for CPU-bound work. By the end of this lesson, you'll understand when to reach for each tool and how to measure their impact on your specific workloads.

Understanding CPU-Bound and I/O-Bound Workloads

Before we explore Python's concurrency tools, we need to understand the two fundamental types of workloads: CPU-bound and I/O-bound. This distinction determines which concurrency approach will be effective for your particular problem.

CPU-bound workloads spend most of their time performing calculations and manipulating data in memory. Examples include image processing, mathematical computations, data compression, and cryptographic operations. These tasks keep the processor busy; adding more CPU cores allows them to run faster because the work can be distributed across cores. The bottleneck is the CPU itself.

I/O-bound workloads spend most of their time waiting for external resources: reading from disk, making network requests, or querying databases. The CPU is mostly idle during these operations, waiting for data to arrive. Examples include web scraping, API clients, and file processing pipelines. Adding more CPU cores doesn't help much because the bottleneck is the external resource, not the computation. What matters is the ability to work on other tasks while waiting.

This distinction matters because Python's concurrency tools behave very differently depending on the workload type. The Global Interpreter Lock affects CPU-bound tasks differently than I/O-bound tasks, and choosing the wrong tool can make your code slower rather than faster.

The Global Interpreter Lock

The Global Interpreter Lock, commonly called the GIL, is a mutex that protects access to Python objects in CPython, the standard Python implementation. Only one thread can execute Python bytecode at a time, even on multi-core systems. This design simplifies memory management and C extension integration but creates a critical constraint: threads cannot achieve true parallelism for CPU-bound tasks.

When one thread holds the GIL, all other threads must wait before they can execute Python code. The interpreter periodically switches between threads, giving each a turn with the GIL, but only one executes at any moment. For CPU-bound work, this means multiple threads offer no performance benefit; they might even slow things down due to context-switching overhead. The GIL serializes execution that we hoped would be parallel.

However, the GIL is released during blocking I/O operations: file reads, network requests, calls to time.sleep. While one thread waits for I/O, another can acquire the GIL and do useful work. This makes threads effective for I/O-bound workloads; the GIL isn't the bottleneck because threads spend most of their time waiting, not computing. When an I/O operation completes, the thread reacquires the GIL to process the results.

Python's multiprocessing module sidesteps the GIL entirely by creating separate processes, each with its own Python interpreter and memory space. Multiple processes can run truly in parallel on multiple cores because each has its own GIL. The tradeoff is overhead: starting processes is slower than starting threads, and sharing data between processes requires serialization. For CPU-bound tasks, this overhead is worthwhile; for I/O-bound tasks, threads are simpler and faster.

Python's Concurrency Tools

Python's concurrent.futures module provides a high-level interface for asynchronous execution using either threads or processes. The key abstractions are ThreadPoolExecutor and ProcessPoolExecutor, which present identical interfaces but use different underlying mechanisms.

Both executors manage a pool of workers: threads for ThreadPoolExecutor, processes for ProcessPoolExecutor. You submit tasks to the executor, which distributes them among workers and returns Future objects representing eventual results. This abstraction lets you swap between threads and processes by changing the executor class, making it easy to benchmark and compare.

The as_completed function takes a collection of futures and yields them as they finish, in completion order rather than submission order. This pattern is useful when you want to process results as soon as they're available rather than waiting for all tasks to complete. Combined with context managers for automatic cleanup, concurrent.futures provides a clean, safe interface for concurrent execution.

Simulating CPU-Intensive Work

To understand the GIL's impact, we need workloads that represent real scenarios. Let's start with a CPU-bound function that performs intensive calculations:
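A minimal version consistent with the description that follows might look like this (the name burn_cpu comes from later in the lesson; the exact body is a sketch):

```python
import math

def burn_cpu(n: int) -> float:
    """CPU-bound work: n iterations of modulo, addition, and square root."""
    acc = 0.0
    for i in range(n):
        # i % 1000 + 1 keeps the sqrt operand between 1 and 1000
        acc += math.sqrt(i % 1000 + 1)
    return acc
```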

This function performs n iterations of mathematical operations: modulo, addition, and square root. The math.sqrt call is particularly CPU-intensive. The modulo operation keeps values between 1 and 1000 to avoid floating-point overflow. The function accumulates results in acc and returns the final sum. This represents CPU-bound work because it performs continuous computation with no I/O or waiting.

The key characteristic: every CPU cycle is spent computing. The GIL prevents multiple threads from executing this function in parallel because each thread needs the GIL to run Python bytecode. With threads, only one burn_cpu executes at a time. With processes, multiple burn_cpu calls can run simultaneously on different cores.

Simulating I/O-Bound Work

Now let's create a function that simulates I/O-bound behavior, where the CPU spends most of its time waiting:
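A sketch matching the description that follows, assuming the fake_io name and the (total sleep, chunk count) parameters mentioned later in the lesson:

```python
import time

def fake_io(total_seconds: float, chunks: int) -> float:
    """Simulated I/O: sleep for total_seconds, split into `chunks` short waits."""
    for _ in range(chunks):
        time.sleep(total_seconds / chunks)  # time.sleep releases the GIL
    return total_seconds
```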

This function simulates I/O operations by sleeping for a total duration split into smaller chunks. The time.sleep call releases the GIL, allowing other threads to run during the wait. The chunks parameter splits the total sleep into multiple short sleeps rather than one long sleep, simulating multiple I/O operations like reading file chunks or making several API calls.

The critical behavior: during time.sleep, the GIL is released. If we have multiple threads calling fake_io, they can overlap their sleep periods. While one thread sleeps, others can sleep too, or perform work if they have any. This is why threads work well for I/O-bound tasks; the actual waiting happens outside the GIL's control. With processes, we gain no advantage because the bottleneck is the simulated I/O delay, not computation.

Building a Flexible Benchmark Function

To compare threads and processes fairly, we need a generic benchmarking function that works with both executor types:
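One way to write such a function, following the behavior described below (executor class passed in, context manager, submit-all, collect via as_completed, timed with perf_counter):

```python
import time
from concurrent.futures import as_completed

def bench(executor_cls, fn, tasks, workers):
    """Run fn over every argument tuple in tasks; return (elapsed_seconds, results)."""
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as pool:  # context manager ensures cleanup
        futures = [pool.submit(fn, *args) for args in tasks]
        results = [f.result() for f in as_completed(futures)]  # completion order
    return time.perf_counter() - start, results
```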

The function takes an executor class (not an instance), the target function, a list of argument tuples, and the worker count. It creates an executor using a context manager to ensure cleanup, submits all tasks at once, and collects results as they complete. The *args unpacking passes each tuple of arguments to the function correctly.

The time.perf_counter() call provides high-resolution timing suitable for benchmarking. We measure from just before executor creation to just after all results are collected, capturing the total time including executor startup, task distribution, execution, and result gathering. The function returns both the elapsed time and the list of results, allowing us to verify correctness and measure performance simultaneously.

Setting Up the Experiment

With our workload functions and benchmarking tool ready, we can set up a controlled experiment to compare threads and processes:
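The setup described below can be sketched as follows (the exact clamping expression is an assumption consistent with the stated 2-to-8 range):

```python
import os

# Clamp the worker count between 2 and 8 to keep benchmark times reasonable
workers = max(2, min(8, os.cpu_count() or 2))

# Twice as many tasks as workers, so we measure sustained throughput
cpu_tasks = [(1_000_000,)] * (workers * 2)  # one million burn_cpu iterations each
io_tasks = [(0.3, 10)] * (workers * 2)      # 0.3 s of total sleep in 10 chunks each
```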

We determine the worker count from the available CPU cores, clamping it between 2 and 8 to keep benchmark times reasonable. The os.cpu_count() function returns the number of logical processors.

The cpu_tasks list contains tuples, each specifying one million iterations for burn_cpu. We create twice as many tasks as workers to keep workers busy; if we only had as many tasks as workers, we'd measure single-batch performance rather than sustained throughput. Each task represents significant CPU work.

The io_tasks list contains tuples specifying 0.3 seconds of total sleep split into 10 chunks. Each task simulates 10 I/O operations totaling 300 milliseconds. With multiple workers, these sleeps can overlap significantly because the GIL is released during time.sleep. We use the same task count as the CPU benchmark for fair comparison.

Running the Benchmarks

Now we execute all four benchmark scenarios: CPU-bound with threads, CPU-bound with processes, I/O-bound with threads, and I/O-bound with processes:
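Putting the pieces together, the four runs might look like the following; the earlier definitions are repeated so this works as one standalone script, and the `__main__` guard is needed because ProcessPoolExecutor re-imports the main module on spawn-based platforms:

```python
import math
import os
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor, as_completed

def burn_cpu(n):
    acc = 0.0
    for i in range(n):
        acc += math.sqrt(i % 1000 + 1)
    return acc

def fake_io(total_seconds, chunks):
    for _ in range(chunks):
        time.sleep(total_seconds / chunks)
    return total_seconds

def bench(executor_cls, fn, tasks, workers):
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as pool:
        futures = [pool.submit(fn, *args) for args in tasks]
        results = [f.result() for f in as_completed(futures)]
    return time.perf_counter() - start, results

workers = max(2, min(8, os.cpu_count() or 2))
cpu_tasks = [(1_000_000,)] * (workers * 2)
io_tasks = [(0.3, 10)] * (workers * 2)

if __name__ == "__main__":
    cpu_t_sec, cpu_t_res = bench(ThreadPoolExecutor, burn_cpu, cpu_tasks, workers)
    cpu_p_sec, cpu_p_res = bench(ProcessPoolExecutor, burn_cpu, cpu_tasks, workers)
    io_t_sec, io_t_res = bench(ThreadPoolExecutor, fake_io, io_tasks, workers)
    io_p_sec, io_p_res = bench(ProcessPoolExecutor, fake_io, io_tasks, workers)
```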

Each call to bench returns the elapsed time and results list. The variable names encode the scenario: cpu_t_sec is CPU-bound with threads (time in seconds), cpu_p_sec is CPU-bound with processes, and so on. By keeping the task lists and worker counts identical, we ensure any performance differences reflect the executor type's characteristics, not workload variations.

The bench function handles all the complexity: creating executors, submitting tasks, collecting results, and timing execution. This keeps our benchmark code clean and focused on comparing the four scenarios. The results let us verify correctness (all tasks completed and returned expected values), while the times reveal performance characteristics.

Interpreting the Results

Let's examine the output to understand what we've learned about threads, processes, and the GIL:
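The lesson's exact print statements aren't shown here, but a hypothetical summarize helper, taking the timing variables named above, could format the comparison like this (the ratio directions are an assumption: threads/processes for CPU work, processes/threads for I/O work):

```python
def summarize(cpu_t_sec, cpu_p_sec, io_t_sec, io_p_sec,
              cpu_t_res, cpu_p_res, io_t_res, io_p_res):
    """Print result sums and timings, and return the two timing ratios."""
    print(f"CPU sums - threads: {sum(cpu_t_res):.1f}, processes: {sum(cpu_p_res):.1f}")
    print(f"I/O sums - threads: {sum(io_t_res):.1f}, processes: {sum(io_p_res):.1f}")
    cpu_ratio = cpu_t_sec / cpu_p_sec  # >1 means processes finished faster
    io_ratio = io_p_sec / io_t_sec     # ~1 means no meaningful difference
    print(f"CPU-bound - threads: {cpu_t_sec:.3f}s, "
          f"processes: {cpu_p_sec:.3f}s, ratio: {cpu_ratio:.2f}")
    print(f"I/O-bound - threads: {io_t_sec:.3f}s, "
          f"processes: {io_p_sec:.3f}s, ratio: {io_ratio:.2f}")
    return cpu_ratio, io_ratio
```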

We print the sums of results to verify correctness: both thread and process executors should produce identical totals for each workload. Then we print the execution times and calculate ratios to quantify the performance difference.

The first two lines confirm correctness: threads and processes both computed the same total from CPU tasks, validating our benchmark. The next two lines reveal the GIL's impact: threads took 1.381 seconds, while processes took 0.642 seconds, more than twice as fast. The ratio of 2.15 shows processes achieved roughly 2x speedup by running truly in parallel across cores, while threads serialized execution.

For I/O-bound tasks, the sums are again identical at 4.8 seconds of total simulated I/O time. But now the execution times are nearly equal: threads took 0.605 seconds, processes took 0.613 seconds. The ratio of 1.01 shows essentially no difference. Both completed in about 0.6 seconds despite 4.8 seconds of total sleep time because workers overlapped their I/O waits. Threads were slightly faster, likely because process creation overhead outweighs any benefit when the bottleneck is waiting, not computing.

Conclusion and Next Steps

You've completed the first lesson of the Python Concurrency & Async I/O course! Today, we explored the fundamental distinction between CPU-bound and I/O-bound workloads and discovered how the Global Interpreter Lock shapes Python's concurrency landscape. You implemented benchmarking infrastructure using concurrent.futures, created both CPU-intensive and I/O-simulation functions, and ran controlled experiments comparing ThreadPoolExecutor and ProcessPoolExecutor.

The key insight: threads and processes serve different purposes. For CPU-bound work, the GIL prevents threads from achieving parallelism; processes bypass the GIL and deliver true multi-core performance despite higher overhead. For I/O-bound work, threads excel because the GIL is released during waiting; processes offer no advantage and add unnecessary complexity. Choosing the right tool requires understanding your workload's bottleneck.

Throughout this lesson, you built the burn_cpu function for CPU-intensive simulations, the fake_io function for I/O-bound behavior, and the generic bench function that compares executor types objectively. You set up a controlled experiment with calibrated task lists and measured execution times that revealed the GIL's dramatic effect on CPU-bound parallelism and minimal impact on I/O-bound concurrency. These patterns will serve as a foundation for more sophisticated concurrent systems.

Moving forward, the next lesson will introduce Asyncio Foundations, where we'll explore event-loop-driven concurrency by building a basic producer-consumer pipeline. We'll learn about tasks, queues, and coordination patterns that enable efficient asynchronous programming for I/O-bound workloads. But before that exciting journey begins, dive into the upcoming practice exercises to solidify your understanding of threads, processes, and the GIL, and make these benchmarking patterns second nature!
