Transformer Architecture

Attention Is All You Need

1. The Revolution

In 2017, Google published "Attention Is All You Need," introducing the Transformer architecture. It replaced recurrent layers with pure attention mechanisms, enabling fully parallel training and direct modeling of long-range dependencies.

💡 Key Innovation: No recurrence, no convolution — just attention and feed-forward networks.

2. Architecture Overview

Encoder-Decoder Structure

Encoder (Left)

Input Embedding + Positional Encoding
Multi-Head Self-Attention
Add & Norm
Feed-Forward Network
Add & Norm

× N layers (usually 6)

Decoder (Right)

Output Embedding + Positional Encoding
Masked Multi-Head Self-Attention
Add & Norm
Multi-Head Cross-Attention
Add & Norm
Feed-Forward Network
Add & Norm
Linear + Softmax

× N layers (usually 6)
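The "masked" self-attention in the decoder stack prevents each position from attending to future tokens, so training can use teacher forcing without leaking the answer. A minimal sketch of the causal mask in NumPy (function name is illustrative, not from the paper's code):

```python
import numpy as np

def causal_mask(seq_len):
    """Boolean mask where True marks disallowed (future) positions.

    Position i may attend only to positions j <= i; masked scores are
    set to -inf before the softmax so they receive zero weight.
    """
    return np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)

mask = causal_mask(4)
print(mask.astype(int))
# Row i has zeros up to column i and ones after it
```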

3. Multi-Head Attention (The Core)

The heart of the Transformer. Instead of one attention mechanism, use multiple "heads" in parallel:

For each head h:
$$ \text{head}_h = \text{Attention}(QW_h^Q, KW_h^K, VW_h^V) $$

Where:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V $$

Combine all heads:
$$ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h) W^O $$

Why multiple heads? Each head learns its own projection matrices, so different heads can attend to different aspects of the input (for example, local syntactic structure versus long-range relationships).
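The three formulas above can be sketched directly in NumPy. This is a simplified illustration, assuming per-head weight matrices stored in arrays (the shapes and names here are for demonstration, not the paper's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """Project X per head, attend, concatenate heads, project back."""
    heads = [attention(X @ Wq, X @ Wk, X @ Wv)
             for Wq, Wk, Wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o

# Tiny example: 5 tokens, d_model = 8, 2 heads of size 4
rng = np.random.default_rng(0)
seq, d_model, n_heads, d_head = 5, 8, 2, 4
X = rng.normal(size=(seq, d_model))
W_q = rng.normal(size=(n_heads, d_model, d_head))
W_k = rng.normal(size=(n_heads, d_model, d_head))
W_v = rng.normal(size=(n_heads, d_model, d_head))
W_o = rng.normal(size=(n_heads * d_head, d_model))
out = multi_head_attention(X, W_q, W_k, W_v, W_o)
print(out.shape)  # (5, 8): same shape as the input, as required for residuals
```

Note that the output shape matches the input shape, which is what allows the residual connection around the sublayer.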

4. Positional Encoding

Since Transformers have no recurrence, they don't know word order by default. We add positional encodings to embeddings:

$$ PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) $$ $$ PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) $$

where pos = token position, i = dimension-pair index, and d_model = embedding dimension (e.g., 512).

This creates unique patterns for each position that the model can learn to use.
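The two formulas above interleave: sine fills the even dimensions and cosine the odd ones. A minimal NumPy sketch (function name is illustrative):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings, shape (max_len, d_model)."""
    pos = np.arange(max_len)[:, None]        # positions 0..max_len-1
    i = np.arange(d_model // 2)[None, :]     # dimension-pair index
    angle = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)              # even dimensions
    pe[:, 1::2] = np.cos(angle)              # odd dimensions
    return pe

pe = positional_encoding(50, 512)
print(pe.shape)  # (50, 512); added element-wise to the token embeddings
```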

5. Feed-Forward Networks

After attention, each position passes through a simple 2-layer network:

$$ \text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2 $$

Usually d_model = 512 and d_ff = 2048.
Applied to each position independently (same network, different inputs).
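The FFN formula is two matrix multiplications with a ReLU in between, applied row by row. A minimal sketch with illustrative weight shapes (the small initialization scale is an assumption for the demo):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward: max(0, x W1 + b1) W2 + b2."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1 = rng.normal(size=(d_model, d_ff)) * 0.02   # expand to d_ff
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.02   # project back to d_model
b2 = np.zeros(d_model)

x = rng.normal(size=(10, d_model))  # 10 positions, processed independently
y = ffn(x, W1, b1, W2, b2)
print(y.shape)  # (10, 512)
```

Because the matrix multiply treats each row independently, "applied to each position independently" falls out for free.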

6. Layer Normalization & Residual Connections

Two critical tricks for training deep networks:

Residual Connections

$$ \text{output} = \text{LayerNorm}(x + \text{Sublayer}(x)) $$

The input "skips" around the sublayer. Helps gradients flow during backprop.

Layer Normalization

$$ \text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sigma + \epsilon} + \beta $$

Normalizes across features (not batch). Stabilizes training.
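Both tricks combine into the post-norm residual pattern from the formula above: the sublayer's output is added back to its input, then normalized. A minimal sketch (function names are illustrative):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each position's feature vector to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / (sigma + eps) + beta

def sublayer_block(x, sublayer, gamma, beta):
    """Post-norm residual wrapper: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x), gamma, beta)

d_model = 8
x = np.arange(24, dtype=float).reshape(3, d_model)
gamma, beta = np.ones(d_model), np.zeros(d_model)

# Stand-in sublayer (doubling) just to show the wiring
out = sublayer_block(x, lambda h: h * 2, gamma, beta)
print(out.shape)  # (3, 8); each row now has mean ~0 and variance ~1
```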

7. Training Details

| Hyperparameter  | Base Model | Big Model |
|-----------------|------------|-----------|
| Layers (N)      | 6          | 6         |
| d_model         | 512        | 1024      |
| d_ff            | 2048       | 4096      |
| Attention Heads | 8          | 16        |
| Parameters      | 65M        | 213M      |

8. Transformer Variants

BERT (Bidirectional Encoder Representations from Transformers)

Uses only the encoder. Pre-trained on masked language modeling.

GPT (Generative Pre-trained Transformer)

Uses only the decoder. Pre-trained on next-token prediction.

T5 (Text-to-Text Transfer Transformer)

Full encoder-decoder. Treats all tasks as text-to-text.

9. Why Transformers Won

| Aspect            | RNN/LSTM             | Transformer              |
|-------------------|----------------------|--------------------------|
| Parallelization   | ❌ Sequential         | ✅ Fully parallel         |
| Long dependencies | ❌ Vanishing gradients| ✅ Direct connections     |
| Training speed    | Slow (days)          | Fast (hours on GPUs)     |
| Memory            | O(n)                 | O(n²) (attention matrix) |
| Interpretability  | Black box            | Attention weights        |

10. Modern Applications

Transformers now underpin large language models (GPT, BERT), computer vision (ViT), speech recognition, and protein structure prediction (AlphaFold 2).

🚀 Future: Transformers are becoming the universal architecture for sequence modeling across all domains.