Welcome to "Deconstructing the Transformer Architecture"! You've just completed an incredible journey through Sequence Models & The Dawn of Attention, where you built the foundational understanding of attention mechanisms and created a standalone PyTorch module for scaled dot-product attention. Now you're ready to take the next major step in your exploration of modern NLP architectures.
In this course, you'll systematically build the complete Transformer architecture from the ground up. You'll start by enhancing your attention mechanism with Multi-Head Attention, then explore positional encodings, layer normalization, and feed-forward networks. By the end, you'll have a fully functional Transformer that can handle real sequence-to-sequence tasks. Our first lesson focuses on Multi-Head Attention, the mechanism that allows Transformers to attend to different types of information simultaneously across multiple representation subspaces.
While our single-head attention mechanism from the previous course works well, it has a fundamental limitation: it can only focus on one type of relationship at a time. Imagine reading a sentence where you need to simultaneously track grammatical dependencies, semantic relationships, and contextual references. A single attention head might excel at one of these tasks but struggle to capture all of them effectively.
Multi-Head Attention solves this by running multiple attention computations in parallel, each focusing on different aspects of the input relationships. Think of it as having multiple experts, each specializing in different types of patterns. One head might focus on local dependencies, another on long-range relationships, and yet another on specific semantic patterns. The key insight is that by running these computations in parallel and combining their results, you can capture much richer representations than any single attention mechanism could provide.
The mathematical foundation remains the same scaled dot-product attention you mastered previously, but now you apply it across multiple "heads" simultaneously. Each head operates on different learned projections of the input, allowing the model to attend to information from different representation subspaces at different positions. Mathematically, if you have `num_heads` heads and model dimension `d_model`, each head operates on dimension `d_k = d_model / num_heads`, ensuring computational efficiency while maintaining expressiveness.
Let's begin implementing Multi-Head Attention by establishing the core architecture. The first crucial component involves creating separate linear projections for queries, keys, and values, then splitting these projections across multiple attention heads:
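Below is a minimal sketch of what such a constructor might look like in PyTorch. The class name `MultiHeadAttention` and the attribute names (`W_q`, `W_k`, `W_v`, `W_o`, `d_k`) are illustrative choices, not necessarily the exact identifiers used in the course code:

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int, dropout: float = 0.1):
        super().__init__()
        # The model dimension must split evenly across heads.
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # dimension each head works with

        # Separate learned projections for queries, keys, and values,
        # plus a final projection that recombines the heads.
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

        self.dropout = nn.Dropout(dropout)
```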
The constructor establishes several critical design decisions. We ensure `d_model` is divisible by `num_heads` because we'll split the model dimension evenly across heads. Each head operates on `d_k = d_model / num_heads` dimensions, meaning if you have a 512-dimensional model with 8 heads, each head works with 64-dimensional projections. The four linear layers create separate learned transformations for queries, keys, values, and the final output projection that combines results from all heads.
Now you implement the core attention computation that will be applied independently to each head. This method encapsulates the scaled dot-product attention you learned previously, but is designed to handle the multi-head tensor structure:
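A sketch of that method, continuing the class above; the method name and the convention that a mask value of 0 marks a blocked position are assumptions:

```python
    # Continuing the MultiHeadAttention sketch from above.
    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        """Attention over tensors shaped (batch, num_heads, seq_len, d_k)."""
        # Scores: (batch, num_heads, seq_len, seq_len); broadcasting covers all heads at once.
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        if mask is not None:
            # Blocked positions get a large negative score so softmax sends them to ~0.
            scores = scores.masked_fill(mask == 0, float("-inf"))

        attn_weights = F.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)  # regularize the attention pattern

        # Weighted sum of values: (batch, num_heads, seq_len, d_k).
        output = torch.matmul(attn_weights, V)
        return output, attn_weights
```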
This implementation mirrors your previous scaled dot-product attention but operates on tensors with an additional head dimension. The key insight is that the same mathematical operations apply regardless of whether you're computing attention for one head or multiple heads simultaneously. The `transpose(-2, -1)` operation and matrix multiplications work seamlessly across the head dimension through broadcasting, allowing you to compute attention for all heads in parallel rather than using expensive loops. The dropout applied to the attention weights serves as a regularization mechanism, randomly zeroing out attention connections during training to prevent the model from over-relying on specific attention patterns and improve generalization.
The attention computation for each head follows the formula:
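$$
\text{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i
$$

where $Q_i$, $K_i$, and $V_i$ are the query, key, and value projections for head $i$, and $d_k$ is the per-head dimension.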
The heart of Multi-Head Attention lies in how you reshape tensors to create multiple heads and then recombine their outputs. This section handles the complex tensor manipulations that enable parallel attention computation:
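One way the `forward` method might tie these steps together, continuing the sketch above (the argument names and the mask-shape convention are assumptions):

```python
    # Continuing the MultiHeadAttention sketch from above.
    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        # Project, split d_model into (num_heads, d_k), and move the head axis
        # forward: (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_k).
        Q = self.W_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        if mask is not None:
            # Insert a head dimension so the mask broadcasts across all heads.
            mask = mask.unsqueeze(1)

        attn_output, attn_weights = self.scaled_dot_product_attention(Q, K, V, mask)

        # Reverse the split: (batch, num_heads, seq_len, d_k) -> (batch, seq_len, d_model).
        # .contiguous() is needed so .view() can reshape the transposed tensor.
        attn_output = (
            attn_output.transpose(1, 2)
            .contiguous()
            .view(batch_size, -1, self.d_model)
        )

        # The output projection learns how to combine information from all heads.
        return self.W_o(attn_output), attn_weights
```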
The reshaping operation is crucial: you transform tensors from `(batch_size, seq_len, d_model)` to `(batch_size, num_heads, seq_len, d_k)`. The `view` operation splits the model dimension into separate heads, while `transpose(1, 2)` moves the head dimension to the second position for efficient computation. This tensor manipulation is what enables the "multi-head" aspect: you're essentially creating multiple parallel attention computations from a single input.
The mask handling ensures compatibility with the multi-head structure by adding the necessary dimensions for proper broadcasting across all heads. After computing attention, you reverse the reshaping process: transposing back and using `view` to concatenate all head outputs into the original dimension. The `contiguous()` call ensures the tensor is stored in a contiguous block of memory, which is required before reshaping with `view`. Finally, the output projection learns how to best combine information from all heads, implementing the concatenation and linear transformation specified in the original Transformer paper.
Proper weight initialization and comprehensive testing are crucial for ensuring your Multi-Head Attention module functions correctly. Let's implement the initialization method and create a thorough testing framework:
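A sketch of such an initialization method, continuing the class above; it assumes the constructor calls `self._init_weights()` as its final step:

```python
    # Continuing the MultiHeadAttention sketch from above.
    def _init_weights(self):
        # Xavier/Glorot uniform scales initial weights by fan-in and fan-out,
        # which helps keep activations and gradients well behaved early in training.
        for layer in (self.W_q, self.W_k, self.W_v, self.W_o):
            nn.init.xavier_uniform_(layer.weight)
            nn.init.zeros_(layer.bias)
```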
Xavier uniform initialization helps ensure stable training by preventing vanishing or exploding gradients during the initial training phases. This initialization strategy considers the number of input and output connections to set appropriate initial weight magnitudes.
Now let's create tests to verify that your Multi-Head Attention implementation works correctly across different scenarios:
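A test along these lines covers the behaviors described below. The sizes (`batch_size=2`, `seq_len=8`, `d_model=64`, `num_heads=4`) are illustrative assumptions, chosen so the parameter count matches the 16,640 figure discussed shortly:

```python
def test_multi_head_attention():
    torch.manual_seed(0)
    batch_size, seq_len, d_model, num_heads = 2, 8, 64, 4  # illustrative sizes

    mha = MultiHeadAttention(d_model=d_model, num_heads=num_heads)
    x = torch.randn(batch_size, seq_len, d_model, requires_grad=True)

    # Self-attention: query, key, and value are all the same input x.
    output, attn_weights = mha(x, x, x)
    print("Input shape:", tuple(x.shape))
    print("Output shape:", tuple(output.shape))
    print("Attention weights shape:", tuple(attn_weights.shape))

    # Causal mask: a lower triangular matrix blocks attention to future positions.
    causal_mask = torch.tril(torch.ones(seq_len, seq_len)).unsqueeze(0)
    masked_output, _ = mha(x, x, x, mask=causal_mask)
    print("Masked output shape:", tuple(masked_output.shape))

    # Gradient flow: every parameter tensor should receive a gradient.
    masked_output.sum().backward()
    grads_ok = all(p.grad is not None for p in mha.parameters())
    print("All parameters received gradients:", grads_ok)

    total_params = sum(p.numel() for p in mha.parameters())
    print(f"Total parameters: {total_params:,}")


test_multi_head_attention()
```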
This testing framework validates critical aspects of the implementation. You verify that input and output dimensions are preserved, test both unmasked and causal-masked scenarios, and examine gradient flow across all parameters. The self-attention setup (where query, key, and value are all the same input `x`) is fundamental to Transformer architectures and provides a clear test case for your implementation.
When you run this complete test, it produces the following comprehensive output that confirms your implementation works correctly:
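Running the sketch test above with those illustrative sizes would print something like the following; the exact wording depends on your print statements, but the shapes and the parameter total follow directly from the configuration:

```
Input shape: (2, 8, 64)
Output shape: (2, 8, 64)
Attention weights shape: (2, 4, 8, 8)
Masked output shape: (2, 8, 64)
All parameters received gradients: True
Total parameters: 16,640
```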
This output confirms several crucial aspects of your implementation: input and output shapes are preserved, attention weights have the expected multi-head structure `(batch, heads, seq, seq)`, and gradient flow is verified across all parameters. The causal mask creates a lower triangular matrix that prevents positions from attending to future positions, essential for autoregressive generation tasks. The total parameter count of 16,640 reflects four projection matrices, each with 64 × 64 = 4,096 weights plus 64 bias terms, for 4 × 4,160 = 16,640 parameters in total.
Multi-Head Attention, while powerful, comes with significant computational costs that scale quadratically with sequence length. The time complexity is O(n² · d), where n is the sequence length and d is the model dimension, dominated by the computation of attention scores between all pairs of positions. Memory complexity follows a similar pattern, requiring O(h · n²) storage for the attention matrices across all h heads. For longer sequences (thousands of tokens), this quadratic scaling becomes prohibitive, making standard attention computationally expensive for tasks like document processing or long-form generation.
Congratulations! You've successfully implemented Multi-Head Attention, a cornerstone mechanism that enables Transformers to capture multiple types of relationships simultaneously. Your implementation demonstrates how parallel attention heads can attend to different representation subspaces, providing a much richer understanding than single-head mechanisms. The careful tensor reshaping and concatenation logic you've mastered form the foundation for more complex Transformer components.
In our next lesson, you'll explore positional encodings, the mechanism that gives Transformers their understanding of sequence order. Unlike RNNs that process sequences step by step, Transformers need explicit positional information, and you'll discover the elegant mathematical solutions that make this possible. Get ready to dive deeper into the architectural innovations that make Transformers so powerful!
