Transformer Architecture

Attention Is All You Need

1. The Revolution

In 2017, Google published "Attention Is All You Need," introducing the Transformer architecture. It replaced recurrent layers with pure attention mechanisms, enabling fully parallel training and direct modeling of long-range dependencies.

💡 Key Innovation: No recurrence, no convolution — just attention and feed-forward networks.

2. Architecture Overview

Encoder-Decoder Structure

Encoder (Left)

Input Embedding + Positional Encoding
Multi-Head Self-Attention
Add & Norm
Feed-Forward Network
Add & Norm

× N layers (usually 6)

Decoder (Right)

Output Embedding + Positional Encoding
Masked Multi-Head Self-Attention
Add & Norm
Multi-Head Cross-Attention
Add & Norm
Feed-Forward Network
Add & Norm
Linear + Softmax

× N layers (usually 6)
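The "masked" self-attention in the decoder stack prevents each position from attending to future tokens, so training can use teacher forcing without leaking the answer. A minimal sketch of the causal mask in NumPy (function name is illustrative, not from the paper's code):

```python
import numpy as np

def causal_mask(seq_len):
    """Boolean mask where True marks disallowed (future) positions.

    Position i may attend only to positions j <= i; masked scores are
    set to -inf before the softmax so they receive zero weight.
    """
    return np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)

mask = causal_mask(4)
print(mask.astype(int))
# Row i has zeros up to column i and ones after it
```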

3. Multi-Head Attention (The Core)

The heart of the Transformer. Instead of one attention mechanism, use multiple "heads" in parallel:

For each head h:
$$ \text{head}_h = \text{Attention}(QW_h^Q, KW_h^K, VW_h^V) $$

Where:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V $$

Combine all heads:
$$ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h) W^O $$

Why multiple heads? Each head learns its own projection matrices, so different heads can attend to different aspects of the input (for example, local syntactic structure versus long-range relationships).
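The three formulas above can be sketched directly in NumPy. This is a simplified illustration, assuming per-head weight matrices stored in arrays (the shapes and names here are for demonstration, not the paper's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """Project X per head, attend, concatenate heads, project back."""
    heads = [attention(X @ Wq, X @ Wk, X @ Wv)
             for Wq, Wk, Wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o

# Tiny example: 5 tokens, d_model = 8, 2 heads of size 4
rng = np.random.default_rng(0)
seq, d_model, n_heads, d_head = 5, 8, 2, 4
X = rng.normal(size=(seq, d_model))
W_q = rng.normal(size=(n_heads, d_model, d_head))
W_k = rng.normal(size=(n_heads, d_model, d_head))
W_v = rng.normal(size=(n_heads, d_model, d_head))
W_o = rng.normal(size=(n_heads * d_head, d_model))
out = multi_head_attention(X, W_q, W_k, W_v, W_o)
print(out.shape)  # (5, 8): same shape as the input, as required for residuals
```

Note that the output shape matches the input shape, which is what allows the residual connection around the sublayer.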

4. Positional Encoding

Since Transformers have no recurrence, they don't know word order by default. We add positional encodings to embeddings:

$$ PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) $$ $$ PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) $$

where pos = token position, i = dimension-pair index, and d_model = embedding dimension (e.g., 512).

This creates unique patterns for each position that the model can learn to use.
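The two formulas above interleave: sine fills the even dimensions and cosine the odd ones. A minimal NumPy sketch (function name is illustrative):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings, shape (max_len, d_model)."""
    pos = np.arange(max_len)[:, None]        # positions 0..max_len-1
    i = np.arange(d_model // 2)[None, :]     # dimension-pair index
    angle = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)              # even dimensions
    pe[:, 1::2] = np.cos(angle)              # odd dimensions
    return pe

pe = positional_encoding(50, 512)
print(pe.shape)  # (50, 512); added element-wise to the token embeddings
```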

5. Feed-Forward Networks

After attention, each position passes through a simple 2-layer network:

$$ \text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2 $$

Usually d_model = 512 and d_ff = 2048.
Applied to each position independently (same network, different inputs).
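The FFN formula is two matrix multiplications with a ReLU in between, applied row by row. A minimal sketch with illustrative weight shapes (the small initialization scale is an assumption for the demo):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward: max(0, x W1 + b1) W2 + b2."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1 = rng.normal(size=(d_model, d_ff)) * 0.02   # expand to d_ff
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.02   # project back to d_model
b2 = np.zeros(d_model)

x = rng.normal(size=(10, d_model))  # 10 positions, processed independently
y = ffn(x, W1, b1, W2, b2)
print(y.shape)  # (10, 512)
```

Because the matrix multiply treats each row independently, "applied to each position independently" falls out for free.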

6. Layer Normalization & Residual Connections

Two critical tricks for training deep networks:

Residual Connections

$$ \text{output} = \text{LayerNorm}(x + \text{Sublayer}(x)) $$

The input "skips" around the sublayer. Helps gradients flow during backprop.

Layer Normalization

$$ \text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sigma + \epsilon} + \beta $$

Normalizes across features (not batch). Stabilizes training.
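Both tricks combine into the post-norm residual pattern from the formula above: the sublayer's output is added back to its input, then normalized. A minimal sketch (function names are illustrative):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each position's feature vector to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / (sigma + eps) + beta

def sublayer_block(x, sublayer, gamma, beta):
    """Post-norm residual wrapper: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x), gamma, beta)

d_model = 8
x = np.arange(24, dtype=float).reshape(3, d_model)
gamma, beta = np.ones(d_model), np.zeros(d_model)

# Stand-in sublayer (doubling) just to show the wiring
out = sublayer_block(x, lambda h: h * 2, gamma, beta)
print(out.shape)  # (3, 8); each row now has mean ~0 and variance ~1
```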

7. Training Details

| Hyperparameter  | Base Model | Big Model |
|-----------------|------------|-----------|
| Layers (N)      | 6          | 6         |
| d_model         | 512        | 1024      |
| d_ff            | 2048       | 4096      |
| Attention Heads | 8          | 16        |
| Parameters      | 65M        | 213M      |

8. Transformer Variants

BERT (Bidirectional Encoder Representations from Transformers)

Uses only the encoder. Pre-trained on masked language modeling.

GPT (Generative Pre-trained Transformer)

Uses only the decoder. Pre-trained on next-token prediction.

T5 (Text-to-Text Transfer Transformer)

Full encoder-decoder. Treats all tasks as text-to-text.

9. Why Transformers Won

| Aspect            | RNN/LSTM             | Transformer              |
|-------------------|----------------------|--------------------------|
| Parallelization   | ❌ Sequential         | ✅ Fully parallel         |
| Long dependencies | ❌ Vanishing gradients| ✅ Direct connections     |
| Training speed    | Slow (days)          | Fast (hours on GPUs)     |
| Memory            | O(n)                 | O(n²) (attention matrix) |
| Interpretability  | Black box            | Attention weights        |

10. Modern Applications

Transformers now underpin large language models (GPT, BERT), computer vision (ViT), speech recognition, and protein structure prediction (AlphaFold 2).

🚀 Future: Transformers are becoming the universal architecture for sequence modeling across all domains.