Attention Mechanism
The Foundation of Modern NLP
1. The Problem: Why Attention?
Traditional sequence-to-sequence models (e.g., RNN/LSTM encoder-decoders) compress the entire input sequence into a single fixed-size context vector, creating an information bottleneck that degrades performance on long sequences.
💡 Key Insight: Instead of compressing everything into one vector, attention lets the model focus on the relevant parts of the input at each decoding step.
2. Attention in 3 Steps
Step 1: Calculate Alignment Scores
$$ e_{ij} = \text{score}(s_{i-1}, h_j) $$
where $s_{i-1}$ is the previous decoder state, $h_j$ is the encoder hidden state at position $j$, and $T_x$ (used below) is the source sequence length.
Step 2: Compute Attention Weights
$$ \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})} $$
Step 3: Compute Context Vector
$$ c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j $$
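The three steps above can be sketched in NumPy. This is a minimal illustration assuming an additive (Bahdanau-style) score function; the parameter names `Wa`, `Ua`, `va` and all dimensions are illustrative, not from the original text.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis (Step 2).
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_step(s_prev, H, Wa, Ua, va):
    """One decoding step of additive attention.

    s_prev : (d_s,)     previous decoder state s_{i-1}
    H      : (T_x, d_h) encoder hidden states h_1..h_{T_x}
    Wa, Ua, va          learned parameters of the additive score (assumed form)
    """
    # Step 1: alignment scores e_ij = va^T tanh(Wa s_{i-1} + Ua h_j)
    e = np.tanh(s_prev @ Wa + H @ Ua) @ va   # shape (T_x,)
    # Step 2: attention weights alpha_ij = softmax over source positions
    alpha = softmax(e)                        # shape (T_x,), sums to 1
    # Step 3: context vector c_i = sum_j alpha_ij * h_j
    c = alpha @ H                             # shape (d_h,)
    return c, alpha

# Toy example with random parameters.
rng = np.random.default_rng(0)
T_x, d_s, d_h, d_a = 5, 4, 6, 8
s_prev = rng.normal(size=d_s)
H = rng.normal(size=(T_x, d_h))
Wa = rng.normal(size=(d_s, d_a))
Ua = rng.normal(size=(d_h, d_a))
va = rng.normal(size=d_a)

c, alpha = attention_step(s_prev, H, Wa, Ua, va)
print(alpha.sum())  # weights sum to 1
print(c.shape)
```

Because the weights come from a softmax, the context vector is always a convex combination of the encoder states, so its scale stays bounded regardless of source length.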