Transformer Architecture
Attention Is All You Need
1. The Revolution
In 2017, Google published "Attention Is All You Need," introducing the Transformer architecture. It replaced recurrent layers with pure attention mechanisms, enabling:
- ✅ Parallel processing (vs sequential RNNs)
- ✅ Better long-range dependencies
- ✅ Faster training on modern GPUs
- ✅ State-of-the-art results on NLP tasks
2. Architecture Overview
Encoder-Decoder Structure
Encoder (Left)
× N layers (usually 6)
Decoder (Right)
× N layers (usually 6)
3. Multi-Head Attention (The Core)
The heart of the Transformer. Instead of one attention mechanism, use multiple "heads" in parallel:
$$ \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) $$
Where:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V $$
Combine all heads:
$$ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h) W^O $$
Why multiple heads? Each head can attend to different aspects:
- Head 1: Syntax (subject-verb agreement)
- Head 2: Semantics (word meaning relationships)
- Head 3: Long-range dependencies
- Heads 4-8: Other patterns the model discovers
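The formulas above can be sketched directly in NumPy. This is a minimal illustration, not an optimized implementation: the random projection matrices stand in for learned weights, and the shapes (8 heads, d_model = 512) follow the base model.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (seq, seq) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over keys
    return weights @ V                                   # weighted sum of values

def multi_head(x, heads=8, d_model=512):
    """Project x into `heads` subspaces, attend in each, concat, project back."""
    d_k = d_model // heads
    rng = np.random.default_rng(0)                       # random weights: illustration only
    outputs = []
    for _ in range(heads):
        Wq = rng.standard_normal((d_model, d_k)) * 0.02  # W_i^Q
        Wk = rng.standard_normal((d_model, d_k)) * 0.02  # W_i^K
        Wv = rng.standard_normal((d_model, d_k)) * 0.02  # W_i^V
        outputs.append(attention(x @ Wq, x @ Wk, x @ Wv))
    Wo = rng.standard_normal((d_model, d_model)) * 0.02  # W^O
    return np.concatenate(outputs, axis=-1) @ Wo
```

Note that each head works in a d_model/heads = 64-dimensional subspace, so the total cost is similar to one full-width attention.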
4. Positional Encoding
Since Transformers have no recurrence, they don't know word order by default. We add positional encodings to embeddings:
$$ PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \qquad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) $$
where pos = position, i = dimension index, dmodel = embedding dimension (e.g., 512)
This creates unique patterns for each position that the model can learn to use.
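A short NumPy sketch of the sinusoidal encoding: even dimensions get sines, odd dimensions get cosines, with one frequency per dimension pair.

```python
import numpy as np

def positional_encoding(seq_len, d_model=512):
    """Sinusoidal positional encodings: sin on even dims, cos on odd dims."""
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]          # (1, d_model/2) dimension pairs
    angles = pos / (10000 ** (2 * i / d_model))   # frequency shrinks with i
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```

At position 0 every sine dimension is 0 and every cosine dimension is 1; later positions trace out distinct sinusoidal patterns that are simply added to the token embeddings.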
5. Feed-Forward Networks
After attention, each position passes through a simple 2-layer network:
$$ \text{FFN}(x) = \max(0,\ xW_1 + b_1)W_2 + b_2 $$
Usually: dmodel = 512, dff = 2048
Applied to each position independently (same network, different inputs).
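In code this is just two matrix multiplies with a ReLU in between. The random weights below are placeholders for learned parameters; the dimensions match the base model.

```python
import numpy as np

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)                    # placeholder weights, not trained
W1 = rng.standard_normal((d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

def feed_forward(x):
    """Position-wise FFN: ReLU(x W1 + b1) W2 + b2, same weights at every position."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2
```

Because the input is shaped (positions, d_model), applying the same matrices to the whole batch of rows is exactly "same network, different inputs".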
6. Layer Normalization & Residual Connections
Two critical tricks for training deep networks:
Residual Connections
The input "skips" around the sublayer. Helps gradients flow during backprop.
Layer Normalization
Normalizes across features (not batch). Stabilizes training.
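Both tricks combine into the sublayer wrapper used throughout the original architecture (post-norm: normalize after adding the residual). A minimal sketch:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's feature vector to zero mean, unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_block(x, sublayer):
    """Post-norm residual as in the original paper: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))
```

The learned scale and bias of full layer normalization are omitted here for brevity; many later models (e.g., GPT-2) instead use pre-norm, normalizing before the sublayer.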
7. Training Details
| Hyperparameter | Base Model | Big Model |
|---|---|---|
| Layers (N) | 6 | 6 |
| dmodel | 512 | 1024 |
| dff | 2048 | 4096 |
| Attention Heads | 8 | 16 |
| Parameters | 65M | 213M |
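The base-model parameter count can be roughly sanity-checked from the table. This is a back-of-envelope sketch only: it counts weight matrices, ignores biases and layer-norm parameters, and assumes a shared ~37k BPE vocabulary as in the original English-German setup.

```python
def rough_param_count(n_layers=6, d_model=512, d_ff=2048, vocab=37000):
    """Back-of-envelope Transformer parameter count (weights only)."""
    attn = 4 * d_model * d_model            # W^Q, W^K, W^V, W^O
    ffn = 2 * d_model * d_ff                # W1, W2
    encoder_layer = attn + ffn
    decoder_layer = 2 * attn + ffn          # self-attention + cross-attention
    embed = vocab * d_model                 # shared input/output embedding
    return n_layers * (encoder_layer + decoder_layer) + embed
```

This lands around 63M, close to the 65M in the table; the gap is the biases, layer-norm scales, and rounding in the vocabulary assumption.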
8. Transformer Variants
BERT (Bidirectional Encoder Representations from Transformers)
Uses only the encoder. Pre-trained on masked language modeling.
- Task: Fill in the blank ("The cat sat on the [MASK]")
- Use case: Classification, NER, QA
GPT (Generative Pre-trained Transformer)
Uses only the decoder. Pre-trained on next-token prediction.
- Task: "The cat sat on the" → predict "mat"
- Use case: Text generation, few-shot learning
T5 (Text-to-Text Transfer Transformer)
Full encoder-decoder. Treats all tasks as text-to-text.
- Translation: "translate English to German: Hello" → "Hallo"
- Summarization, QA, all as text generation
9. Why Transformers Won
| Aspect | RNN/LSTM | Transformer |
|---|---|---|
| Parallelization | ❌ Sequential | ✅ Fully parallel |
| Long Dependencies | ❌ Vanishing gradients | ✅ Direct connections |
| Training Speed | Slow (days) | Fast (hours on GPUs) |
| Memory | O(n) | O(n²) (attention matrix) |
| Interpretability | Black box | Attention weights! |
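The O(n²) memory row is easy to quantify: the attention score matrix alone holds n² entries per head, so doubling the sequence length quadruples that cost. A quick illustration:

```python
def attention_matrix_bytes(n_tokens, bytes_per_float=4):
    """Memory for one head's (n x n) attention score matrix in float32."""
    return n_tokens ** 2 * bytes_per_float

# Quadratic growth: doubling the sequence length quadruples the matrix
assert attention_matrix_bytes(2048) == 4 * attention_matrix_bytes(1024)
```

This quadratic blow-up is why long-context variants (sparse, linear, or chunked attention) exist, while an RNN's state stays O(n).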
10. Modern Applications
- 🤖 NLP: BERT, GPT-3/4, T5, BART
- 🖼️ Computer Vision: Vision Transformer (ViT), DeiT
- 🎵 Audio: Wav2Vec, Speech Transformer
- 🧬 Biology: AlphaFold2 (protein folding)
- 🎮 RL: Decision Transformer
- 🎨 Multimodal: CLIP, DALL-E, Flamingo