Welcome back to Deconstructing the Transformer Architecture! In our second lesson, we're diving into the other essential components that make each Transformer block so powerful: the Position-wise Feed-Forward Network and the critical Add & Norm operations. While our previous lesson explored Multi-Head Attention and how multiple attention heads can capture diverse relationships across different representation subspaces, today we'll discover the complementary mechanisms that complete the Transformer's computational prowess.
While attention mechanisms handle the complex relationships between positions in a sequence, position-wise feed-forward networks provide the computational power to transform these attended representations. Think of attention as gathering the right information, and the feed-forward network as processing that information to extract meaningful patterns. Combined with residual connections and layer normalization, these components form the complete building blocks that enable Transformers to learn deep, stable representations. This lesson will guide you through implementing both the feed-forward networks and the Add & Norm components that make deep Transformer training possible.
Position-wise Feed-Forward Networks serve a fundamentally different purpose than attention mechanisms in the Transformer architecture. While attention focuses on where to look across the sequence, the FFN determines what to do with the information once it's been gathered. The term "position-wise" indicates that the same transformation is applied independently to each position in the sequence, meaning the network processes each token's representation separately without considering relationships between positions.
The architecture is elegantly simple: two linear transformations with a non-linear activation function between them. Mathematically, this can be expressed as $\text{FFN}(x) = \max(0,\, xW_1 + b_1)W_2 + b_2$ for ReLU activation, where the first linear layer typically expands the dimensionality (often to $4 \times d_{\text{model}}$), and the second layer projects back to the original model dimension. This expansion and contraction pattern allows the network to learn complex non-linear transformations while maintaining consistent dimensionality throughout the Transformer stack.
Let's implement the core `PositionwiseFeedForward` network with configurable activation functions and proper weight initialization:
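The snippet below is a minimal sketch of how such a constructor might look in PyTorch; the parameter names and defaults (`d_ff`, `dropout=0.1`, the `activation` string) are illustrative assumptions rather than the course's reference code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1, activation="relu"):
        super().__init__()
        # First projection expands from d_model to the wider d_ff dimension
        self.linear1 = nn.Linear(d_model, d_ff)
        # Second projection contracts back to d_model for dimensional consistency
        self.linear2 = nn.Linear(d_ff, d_model)
        # Dropout regularizes the intermediate activations
        self.dropout = nn.Dropout(dropout)
        # Configurable non-linearity: ReLU by default, GELU as an alternative
        self.activation = F.relu if activation == "relu" else F.gelu
```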
The constructor establishes the essential FFN architecture with two linear transformations. The first layer expands from `d_model` to `d_ff` dimensions, providing the network with increased representational capacity in the intermediate layer. The second layer projects back to `d_model`, ensuring dimensional consistency with the rest of the Transformer. The `dropout` layer provides regularization to prevent overfitting, and the configurable activation function allows experimentation with different non-linearities.
The choice of `d_ff` is crucial: it's typically set to $4 \times d_{\text{model}}$ in standard Transformer implementations. This 4x expansion provides sufficient capacity for complex transformations while maintaining computational efficiency. The expansion-contraction pattern gives the intermediate layer room to compute richer features, which the second projection then compresses back into the model dimension as a meaningful, compact representation.
The choice of activation function significantly impacts the FFN's behavior and training dynamics. Let's complete the implementation with proper weight initialization:
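Continuing the sketch, the complete class with the forward pass and a `_init_weights` helper might look like the following; this is again an illustrative assumption, with the initializer now called from the constructor:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1, activation="relu"):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)   # expand: d_model -> d_ff
        self.linear2 = nn.Linear(d_ff, d_model)   # contract: d_ff -> d_model
        self.dropout = nn.Dropout(dropout)
        self.activation = F.relu if activation == "relu" else F.gelu
        self._init_weights()

    def _init_weights(self):
        # Xavier uniform initialization keeps gradient magnitudes stable
        nn.init.xavier_uniform_(self.linear1.weight)
        nn.init.xavier_uniform_(self.linear2.weight)
        # Zero-initialized biases are a common default for linear layers
        nn.init.zeros_(self.linear1.bias)
        nn.init.zeros_(self.linear2.bias)

    def forward(self, x):
        # x: (batch_size, seq_len, d_model); each position is transformed independently
        x = self.activation(self.linear1(x))  # linear transformation + non-linearity
        x = self.dropout(x)                   # regularization
        return self.linear2(x)                # project back to d_model
```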
The forward pass implements the classic feed-forward pattern: linear transformation, activation, dropout, and final linear projection. Xavier uniform initialization ensures stable gradient flow during training by setting initial weights based on the number of input and output connections. The bias terms are initialized to zero, which is a common practice for linear layers in deep networks.
The activation function choice between `ReLU` and `GELU` represents different approaches to non-linearity. `ReLU` provides simple, efficient computation with sparse activation patterns, while `GELU` offers smoother gradients and has shown superior performance in many Transformer applications. GELU (Gaussian Error Linear Unit) is defined as $\text{GELU}(x) = x \cdot \Phi(x)$, where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution, providing a more nuanced activation profile than the hard threshold of ReLU.
The Add & Norm component implements two critical techniques that enable stable training of deep Transformer networks: residual connections and layer normalization:
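A minimal sketch of such an `AddNorm` module, assuming PyTorch's built-in `nn.LayerNorm` and dropout applied to the sublayer output (both illustrative choices):

```python
import torch
import torch.nn as nn


class AddNorm(nn.Module):
    """Residual connection followed by layer normalization."""

    def __init__(self, d_model, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer_output):
        # Add: the residual connection preserves the original input x
        # Norm: layer normalization stabilizes the combined activations
        return self.norm(x + self.dropout(sublayer_output))
```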
The `AddNorm` module encapsulates two fundamental techniques that solve different training challenges. Residual connections address the vanishing gradient problem by providing direct pathways for gradients to flow through the network. The mathematical operation $x + \text{Sublayer}(x)$ ensures that even if the sublayer learns to output zeros, the original input can still pass through unchanged.
Now let's see how these components work together in the complete Transformer encoder layer pattern and examine the practical benefits:
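Below is one way these pieces could be assembled into an encoder layer. PyTorch's built-in `nn.MultiheadAttention` is used here as a stand-in for the `MultiHeadAttention` module from the previous lesson, so the attention interface shown is an assumption:

```python
# Assumes torch/nn imports plus the PositionwiseFeedForward and AddNorm
# classes from the snippets above.
class TransformerEncoderLayer(nn.Module):
    """One encoder block: attention -> Add & Norm -> FFN -> Add & Norm."""

    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        # Built-in attention as a stand-in for the custom MultiHeadAttention
        self.attention = nn.MultiheadAttention(
            d_model, num_heads, dropout=dropout, batch_first=True
        )
        self.feed_forward = PositionwiseFeedForward(d_model, d_ff, dropout)
        self.add_norm1 = AddNorm(d_model, dropout)
        self.add_norm2 = AddNorm(d_model, dropout)

    def forward(self, x):
        # 1. Attention gathers information from across the sequence
        attn_output, _ = self.attention(x, x, x)
        # 2. First Add & Norm: residual connection plus layer normalization
        x = self.add_norm1(x, attn_output)
        # 3. FFN transforms each position independently
        ff_output = self.feed_forward(x)
        # 4. Second Add & Norm completes the block
        return self.add_norm2(x, ff_output)
```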
This implementation demonstrates the canonical Transformer encoder layer pattern: `MultiHeadAttention` followed by `AddNorm`, then `PositionwiseFeedForward` followed by another `AddNorm`. Each sublayer (attention and FFN) is wrapped with a residual connection and layer normalization, creating the characteristic two-step pattern that defines each Transformer block.
The sequential application shows how information flows through the block: first, the attention mechanism gathers relevant information from across the sequence; then, the Add & Norm operation stabilizes and normalizes this output while preserving the original input through the residual connection. Next, the FFN processes this normalized representation, and finally, another Add & Norm operation stabilizes the output while again preserving information flow through the residual connection.
When we run the complete test suite, we get the following output that demonstrates the effectiveness of our implementation:
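A short check along the following lines would produce that kind of output; the tensor shapes and print statements here are illustrative rather than the lesson's actual test suite:

```python
# Quick smoke test, assuming the classes sketched above are defined.
torch.manual_seed(0)
batch_size, seq_len, d_model = 2, 10, 512

# Input with a small positive mean to make the normalization effect visible
x = torch.randn(batch_size, seq_len, d_model) + 0.1

ffn = PositionwiseFeedForward(d_model, d_ff=2048)
block = TransformerEncoderLayer(d_model, num_heads=8, d_ff=2048)

print("FFN output shape:  ", tuple(ffn(x).shape))   # dimensionality preserved
out = block(x)
print("Block output shape:", tuple(out.shape))      # still (batch, seq_len, d_model)
print(f"Input  mean: {x.mean().item():.4f}, std: {x.std().item():.4f}")
# After the final Add & Norm, the output mean should sit near 0 and std near 1
print(f"Output mean: {out.mean().item():.4f}, std: {out.std().item():.4f}")
```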
This output reveals several important characteristics of our implementation. First, the shape preservation confirms that both the FFN and the complete Transformer block maintain consistent dimensionality throughout processing. More importantly, the statistics comparison shows the effect of layer normalization: while the input has a small positive mean, the output is centered around zero, and the standard deviation remains close to 1.0, indicating that layer normalization is successfully stabilizing the activations and preventing activation drift that could destabilize training.
We've successfully implemented the complete `PositionwiseFeedForward` network and `AddNorm` components that form the other half of each Transformer block! These components work in harmony with the `MultiHeadAttention` mechanism you built previously to create powerful, trainable deep architectures. The FFN provides the non-linear computational power, while the Add & Norm operations ensure stable gradient flow and consistent activation magnitudes throughout the network.
The combination of these elements demonstrates the elegant engineering behind Transformers: attention mechanisms for relationship modeling, feed-forward networks for non-linear processing, and residual connections with layer normalization for training stability. In our next lesson, we'll explore how Transformers handle sequence order through positional encodings, completing our understanding of the core architectural components that make these models so effective across diverse NLP tasks. Now, let's get ready for some practice!
