Attention Mechanism

The Foundation of Modern NLP

1. The Problem: Why Attention?

Traditional sequence-to-sequence models (built on RNNs/LSTMs) compress the entire input sequence into a single fixed-size context vector. This fixed-size vector becomes an information bottleneck: the longer the input, the more information is lost before decoding even begins.

💡 Key Insight: Instead of compressing everything into one vector, attention allows the model to focus on relevant parts of the input at each decoding step.

2. Attention in 3 Steps

Step 1: Calculate Alignment Scores

$$ e_{ij} = \text{score}(s_{i-1}, h_j) $$

Step 2: Compute Attention Weights

$$ \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})} $$

Step 3: Compute Context Vector

$$ c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j $$
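The three steps above can be sketched in a few lines of NumPy. This is an illustrative toy, not a trained model: it uses a simple dot-product score for Step 1 (Bahdanau-style attention uses a small learned feed-forward network instead), and the decoder state and encoder states are random placeholders.

```python
import numpy as np

def attention_step(s_prev, H):
    """One decoding step of attention (illustrative sketch).

    s_prev : shape (d,)   -- previous decoder state s_{i-1}
    H      : shape (T, d) -- encoder hidden states h_1..h_T
    """
    # Step 1: alignment scores e_ij = score(s_{i-1}, h_j)
    # (dot-product score here; the original paper learns this function)
    e = H @ s_prev                      # shape (T,)

    # Step 2: attention weights via softmax over the scores
    e = e - e.max()                     # subtract max for numerical stability
    alpha = np.exp(e) / np.exp(e).sum()  # shape (T,), sums to 1

    # Step 3: context vector c_i = sum_j alpha_ij * h_j
    c = alpha @ H                       # shape (d,)
    return alpha, c

# Toy example with random states
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))   # 5 encoder positions, hidden size 8
s = rng.normal(size=8)        # previous decoder state
alpha, c = attention_step(s, H)
print(alpha.sum())            # weights sum to 1
```

Because the weights form a probability distribution over input positions, the context vector `c` is a different weighted mix of the encoder states at every decoding step, which is exactly what removes the fixed-vector bottleneck.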

3. Interactive Attention Visualization