Attention Mechanism
The Foundation of Modern NLP
1. The Problem: Why Attention?
Traditional sequence-to-sequence models (e.g., RNN/LSTM encoder-decoders) compress the entire input sequence into a single fixed-size context vector, creating an information bottleneck that degrades performance on long sequences.
💡 Key Insight: Instead of compressing everything into one vector, attention lets the model focus on the relevant parts of the input at each decoding step.
2. Attention in 3 Steps
Step 1: Calculate Alignment Scores
$$ e_{ij} = \text{score}(s_{i-1}, h_j) $$
where $s_{i-1}$ is the previous decoder state, $h_j$ is the encoder hidden state at position $j$, and $T_x$ (used below) is the source sequence length.
Step 2: Compute Attention Weights
$$ \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})} $$
Step 3: Compute Context Vector
$$ c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j $$
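The three steps above can be sketched in NumPy. This is a minimal illustration assuming an additive (Bahdanau-style) score function; the parameter names `Wa`, `Ua`, `va` and all dimensions are illustrative, not from the original text.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis (Step 2).
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_step(s_prev, H, Wa, Ua, va):
    """One decoding step of additive attention.

    s_prev : (d_s,)     previous decoder state s_{i-1}
    H      : (T_x, d_h) encoder hidden states h_1..h_{T_x}
    Wa, Ua, va          learned parameters of the additive score (assumed form)
    """
    # Step 1: alignment scores e_ij = va^T tanh(Wa s_{i-1} + Ua h_j)
    e = np.tanh(s_prev @ Wa + H @ Ua) @ va   # shape (T_x,)
    # Step 2: attention weights alpha_ij = softmax over source positions
    alpha = softmax(e)                        # shape (T_x,), sums to 1
    # Step 3: context vector c_i = sum_j alpha_ij * h_j
    c = alpha @ H                             # shape (d_h,)
    return c, alpha

# Toy example with random parameters.
rng = np.random.default_rng(0)
T_x, d_s, d_h, d_a = 5, 4, 6, 8
s_prev = rng.normal(size=d_s)
H = rng.normal(size=(T_x, d_h))
Wa = rng.normal(size=(d_s, d_a))
Ua = rng.normal(size=(d_h, d_a))
va = rng.normal(size=d_a)

c, alpha = attention_step(s_prev, H, Wa, Ua, va)
print(alpha.sum())  # weights sum to 1
print(c.shape)
```

Because the weights come from a softmax, the context vector is always a convex combination of the encoder states, so its scale stays bounded regardless of source length.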