Introduction

Welcome back! It's great to see you again in our course "Sequence Models & The Dawn of Attention". As we advance to our second lesson, we're about to explore one of the most revolutionary concepts in modern deep learning: the attention mechanism.

In our previous lesson, we witnessed firsthand how LSTMs struggle with long-range dependencies, observing performance degradation as sequence lengths increased. This limitation wasn't merely a technical curiosity; it revealed a fundamental bottleneck that restricted the potential of sequence models. The core issue lies in forcing models to compress all relevant information into a single fixed-size hidden state, creating an information bottleneck that becomes more severe as sequences grow longer.

Today, we'll discover how attention mechanisms elegantly solve this problem by allowing models to selectively focus on different parts of the input sequence. Rather than relying on a single summary vector, attention enables direct access to any position in the sequence, fundamentally changing how we approach sequence modeling. We'll implement two foundational attention variants: Luong (multiplicative) and Bahdanau (additive) attention, understanding their mathematical foundations and practical differences. Let's dive in!

The Problem with Fixed Context Windows

The limitation we observed with LSTMs stems from a deeper architectural constraint: the fixed-size context bottleneck. Traditional sequence models, like RNNs and LSTMs, process information sequentially and attempt to compress all information encountered so far into a hidden state of a predetermined size. This hidden state must then serve as the sole basis for future predictions or for understanding the sequence as a whole.

Imagine trying to summarize an entire book in a single, short sentence while ensuring all crucial plot points, character developments, and themes are perfectly preserved. It's an incredibly difficult, if not impossible, task. As the book (or sequence) gets longer, more information needs to be crammed into that one sentence (the fixed-size hidden state), inevitably leading to information loss.

Consider a machine translation task. If we want to translate a long, complex sentence like "The cat, which had been lazily napping under the old oak tree in the sprawling garden all afternoon, suddenly awoke with a start," an LSTM would process this word by word. By the time it needs to translate "awoke," its hidden state must somehow retain the fact that "cat" is the subject, along with all the intervening descriptive clauses. This reliance on a compressed summary becomes increasingly problematic with longer sequences. Attention mechanisms offer a way out by allowing the model to look back at the entire input sequence at each step, rather than relying solely on a compressed summary.

The Query-Key-Value Paradigm

At its core, attention operates through three fundamental components: Queries (Q), Keys (K), and Values (V). This QKV formulation provides an elegant and powerful framework for computing relevance and retrieving information. The concept is inspired by information retrieval systems, where you use a query to search a database (composed of key-value pairs) to find relevant information.

Let's break down these components in the context of sequence models:

  • A Query represents the current point of interest or what the model is trying to figure out at a specific step. For example, in machine translation, a query might be related to the word currently being generated in the target language. It essentially asks, "Given my current context, what information from the input sequence is most relevant right now?"
  • Keys are associated with different parts of the input sequence. Each key corresponds to a specific element or position in the input. They act like "labels" or "indices" for the information contained in the input.
  • Values also correspond to the elements of the input sequence. They contain the actual information or content that we want to retrieve. Typically, for each key, there's an associated value.

The attention mechanism works by:

  1. Comparing the Query with all the Keys to calculate a set of scores. These scores determine how relevant each input part (represented by its Key) is to the current Query.
  2. Using these scores to compute a weighted sum of the Values.

This process allows the model to selectively focus on the most pertinent parts of the input sequence (those whose Keys best match the Query) and retrieve their corresponding Values.
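
To make this concrete, here is a tiny worked example with made-up numbers; the vectors, their sizes, and the dot-product scoring used here are purely illustrative:

```python
import torch

# One query and three key-value pairs, each a 2-dimensional vector (illustrative numbers).
query = torch.tensor([1.0, 0.0])
keys = torch.tensor([[1.0, 0.0],    # points in the same direction as the query
                     [0.0, 1.0],    # orthogonal to the query
                     [0.5, 0.5]])   # somewhere in between
values = torch.tensor([[10.0, 0.0],
                       [0.0, 10.0],
                       [5.0, 5.0]])

# Step 1: compare the query with every key (here, via dot-product similarity).
scores = keys @ query                    # tensor([1.0, 0.0, 0.5])

# Step 2: turn the scores into weights and take a weighted sum of the values.
weights = torch.softmax(scores, dim=0)   # roughly [0.51, 0.19, 0.31]
context = weights @ values               # roughly [6.6, 3.4]
print(weights, context)
```

The first key matches the query most closely, so the first value contributes most to the resulting context vector.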

Setting Up Sample Data for Attention

Let's set up some sample data to see what these tensors look like. We'll use PyTorch for our implementation:
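
A minimal sketch of such a function, assuming illustrative default sizes of batch_size=2, seq_len=5, and hidden_size=8:

```python
import torch

def create_sample_data(batch_size=2, seq_len=5, hidden_size=8):
    """Generate random query, key, and value tensors for experimenting with attention."""
    query = torch.randn(batch_size, hidden_size)            # one query vector per batch item
    keys = torch.randn(batch_size, seq_len, hidden_size)    # one key vector per input position
    values = torch.randn(batch_size, seq_len, hidden_size)  # one value vector per key
    return query, keys, values
```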

In this function, create_sample_data, we generate sample random tensors for our query, keys, and values:

  • query has a shape of (batch_size, hidden_size). This represents a single query vector for each item in our batch.
  • keys has a shape of (batch_size, seq_len, hidden_size). For each item in the batch, there's a sequence of seq_len key vectors.
  • values has a shape of (batch_size, seq_len, hidden_size), mirroring the shape of keys. Each key has a corresponding value.

These tensor shapes are typical for attention mechanisms. The hidden_size represents the dimensionality of our embeddings or feature vectors.

Calculating Attention Scores

The heart of any attention mechanism lies in computing attention scores. These scores quantify the relevance between a Query (representing what we're looking for) and each Key in the input sequence (representing different pieces of information). Think of it like searching for a specific topic (the query) in a library catalog (the keys); some entries will be highly relevant, others not at all.

Once we have these raw scores, they need to be transformed into something more usable. This is where the softmax function comes in. Applying softmax to the scores converts them into a set of attention weights. These weights have two important properties:

  1. Each weight is between 0 and 1.
  2. All weights for a given query (across all keys in the sequence) sum up to 1.

This means the attention weights form a probability distribution, indicating how much "attention" or importance the model should pay to each part of the input sequence when constructing its output. A higher weight for a particular key-value pair means that value will contribute more to the final result, known as the context vector.
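
A quick sanity check of these two properties, using a made-up set of raw scores:

```python
import torch

raw_scores = torch.tensor([2.0, 0.5, -1.0, 1.0])  # illustrative raw scores, one per key
weights = torch.softmax(raw_scores, dim=-1)

print(weights)        # every entry lies between 0 and 1
print(weights.sum())  # ~1.0: the weights form a probability distribution
```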

Different attention mechanisms use different scoring functions to calculate the initial raw scores. A common and intuitive one is the dot product, which measures the similarity in orientation between the query and key vectors. We'll see this in action with Luong attention.

Implementing Luong Attention

Luong attention, also known as multiplicative attention, is a popular and efficient attention mechanism. It gets its "multiplicative" name because it primarily uses dot products (a form of multiplication) to calculate attention scores. Let's see how to implement it.
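
A sketch of such an implementation in PyTorch, consistent with the tensor shapes from create_sample_data (the function name and exact structure here are illustrative):

```python
import torch

def luong_attention(query, keys, values):
    """Multiplicative (Luong) attention: dot-product scores, softmax weights, weighted sum."""
    # (batch_size, hidden_size) -> (batch_size, 1, hidden_size) for batch matrix multiplication.
    query = query.unsqueeze(1)
    # Dot product of the query with every key: (batch_size, 1, seq_len) -> (batch_size, seq_len).
    scores = torch.bmm(query, keys.transpose(1, 2)).squeeze(1)
    # Normalize the scores into attention weights that sum to 1 over the sequence.
    attention_weights = torch.softmax(scores, dim=-1)
    # Weighted sum of the values gives the context vector: (batch_size, hidden_size).
    context = torch.bmm(attention_weights.unsqueeze(1), values).squeeze(1)
    return context, attention_weights
```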

Let's break this down:

  1. Prepare Query: We first unsqueeze the query tensor at dimension 1. This changes its shape from (batch_size, hidden_size) to (batch_size, 1, hidden_size), making it compatible for batch matrix multiplication (torch.bmm) with the keys.
  2. Calculate Scores: The core of Luong attention is the dot product between the query and each key. We transpose the keys tensor's last two dimensions (from (batch_size, seq_len, hidden_size) to (batch_size, hidden_size, seq_len)). Then, torch.bmm computes the dot product of the query with every key in the sequence for each item in the batch. The result (after squeezing out the dimension of size 1) has a shape of (batch_size, seq_len), indicating how much the query aligns with each key.
  3. Compute Weights: Applying softmax over the sequence dimension turns the raw scores into attention weights that sum to 1.
  4. Build the Context Vector: Finally, the weights are used to take a weighted sum of the values, producing a context vector of shape (batch_size, hidden_size).

Implementing Bahdanau Attention

Bahdanau attention, also called additive attention, takes a slightly more complex approach to computing query-key similarity. Instead of a simple dot product, it uses a small feed-forward neural network (often a single linear layer after combining query and key) to calculate the scores. This allows for potentially more expressive relationships to be learned.
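
A sketch of an additive attention module along these lines; the layer names W_q, W_k, and v match the description below, but the exact module structure is an illustrative assumption:

```python
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    """Additive (Bahdanau) attention: a small feed-forward network scores each query-key pair."""

    def __init__(self, hidden_size):
        super().__init__()
        self.W_q = nn.Linear(hidden_size, hidden_size)  # transforms the query
        self.W_k = nn.Linear(hidden_size, hidden_size)  # transforms the keys
        self.v = nn.Linear(hidden_size, 1)              # maps the combined representation to a score

    def forward(self, query, keys, values):
        # (batch_size, hidden_size) -> (batch_size, 1, hidden_size) so it broadcasts over seq_len.
        q = self.W_q(query).unsqueeze(1)
        k = self.W_k(keys)                              # (batch_size, seq_len, hidden_size)
        # Additive combination, tanh non-linearity, then one score per position.
        scores = self.v(torch.tanh(q + k)).squeeze(-1)  # (batch_size, seq_len)
        attention_weights = torch.softmax(scores, dim=-1)
        # Weighted sum of the values gives the context vector: (batch_size, hidden_size).
        context = torch.bmm(attention_weights.unsqueeze(1), values).squeeze(1)
        return context, attention_weights
```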

In Bahdanau attention:

  1. Learnable Transformations: We define three linear layers: W_q to transform the query, W_k to transform the keys, and v to compute the final score from their combined representation. In a real model, these layers would be initialized once and their weights learned during training.
  2. Combine and Score: The transformed query is broadcast across the sequence and added to the transformed keys; the sum is passed through a tanh non-linearity, and v maps each combined vector to a single score per position.
  3. Weight and Sum: As with Luong attention, softmax converts the scores into attention weights, which are then used to compute a weighted sum of the values, producing the context vector.

Comparing Attention Mechanisms

Now let's examine how these two attention mechanisms behave in practice by running them on our sample data and analyzing their outputs:
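
A sketch of such a comparison, wiring together the pieces defined above (the seed and print statements are illustrative):

```python
import torch

def main():
    torch.manual_seed(0)  # illustrative seed, for reproducible random tensors
    hidden_size = 8
    query, keys, values = create_sample_data(batch_size=2, seq_len=5, hidden_size=hidden_size)

    print("Query shape:", query.shape)
    print("Keys shape:", keys.shape)
    print("Values shape:", values.shape)

    # Luong (multiplicative) attention.
    luong_context, luong_weights = luong_attention(query, keys, values)
    print("Luong context shape:", luong_context.shape)
    print("Luong attention weights shape:", luong_weights.shape)

    # Bahdanau (additive) attention with randomly initialized (untrained) layers.
    bahdanau = BahdanauAttention(hidden_size)
    bahdanau_context, bahdanau_weights = bahdanau(query, keys, values)
    print("Bahdanau context shape:", bahdanau_context.shape)
    print("Bahdanau attention weights shape:", bahdanau_weights.shape)

if __name__ == "__main__":
    main()
```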

When we run the main() function, we'll see the shapes of our tensors and the attention weights produced by each mechanism. The expected output is:
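
With the illustrative sizes used in the sketch above, the printed shapes would look like this (the attention weight values themselves depend on the random initialization):

```
Query shape: torch.Size([2, 8])
Keys shape: torch.Size([2, 5, 8])
Values shape: torch.Size([2, 5, 8])
Luong context shape: torch.Size([2, 8])
Luong attention weights shape: torch.Size([2, 5])
Bahdanau context shape: torch.Size([2, 8])
Bahdanau attention weights shape: torch.Size([2, 5])
```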

Looking at the output:

  • The tensor shapes confirm our understanding. The context vector has the same hidden_size as the query, and attention weights are distributed over the seq_len.

Conclusion and Next Steps

Today, we've taken a crucial step in our journey from traditional sequence models to the Transformer architecture. We explored how attention mechanisms solve the fixed-context bottleneck that limited RNNs and LSTMs, introducing the elegant Query-Key-Value paradigm that enables selective information retrieval from any sequence position.

Through implementing both Luong and Bahdanau attention, we discovered how different similarity functions produce distinct attention behaviors: multiplicative attention can create focused patterns, while additive attention with learnable transformations allows for more nuanced distributions. These mechanisms are foundational building blocks for more advanced attention systems. In our next lesson, we'll build upon these concepts to explore multi-head attention, discovering how parallel attention mechanisms can capture different types of relationships simultaneously, bringing us closer to the full Transformer. Until then, let's practice what we've learned today!
