Here, all operations are element-wise. Parameters that consistently receive large gradients have their effective learning rate shrunk over time.
7. Walkthrough: Manual Calculation
We will trace the first two iterations of all five algorithms to see exactly how they differ.
Setup:
Function: \( J(\theta) = 0.5\theta_1^2 + 5\theta_2^2 \)
Gradient: \( \nabla J = [\theta_1, 10\theta_2] \)
Start Point: \( \theta_0 = [10, 1] \)
Learning Rate: \( \eta = 0.1 \)
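Before tracing the optimizers, the setup can be sanity-checked in a couple of lines of Python (the names `J` and `grad` are just for illustration):

```python
# Objective and gradient from the setup above:
# J(theta) = 0.5*theta1^2 + 5*theta2^2, grad J = [theta1, 10*theta2]
def J(t1, t2):
    return 0.5 * t1 ** 2 + 5 * t2 ** 2

def grad(t1, t2):
    return (t1, 10 * t2)

print(J(10.0, 1.0))     # -> 55.0
print(grad(10.0, 1.0))  # -> (10.0, 10.0)
```

Note the strong curvature imbalance: the \( \theta_2 \) direction is 10x steeper, which is what makes the optimizers behave so differently below.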
1. Vanilla Gradient Descent
Iteration 1:
1. Gradient: \( g_0 = \nabla J(\theta_0) \)
\( g_0 = [10, 10(1)] = [10, 10] \)
2. Update: \( \theta_1 = \theta_0 - \eta \cdot g_0 \)
\( \theta_1 = [10, 1] - 0.1 \cdot [10, 10] \)
\( \theta_1 = [10, 1] - [1, 1] = [9, 0] \)
Iteration 2:
1. Gradient: \( g_1 = \nabla J(\theta_1) \)
\( g_1 = [9, 10(0)] = [9, 0] \)
2. Update: \( \theta_2 = \theta_1 - \eta \cdot g_1 \)
\( \theta_2 = [9, 0] - 0.1 \cdot [9, 0] \)
\( \theta_2 = [9, 0] - [0.9, 0] = [8.1, 0] \)
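The two vanilla GD iterations above can be reproduced with a minimal sketch (scalar components, variable names are my own):

```python
# Vanilla gradient descent: theta <- theta - eta * grad J(theta)
eta = 0.1
t1, t2 = 10.0, 1.0           # start point theta_0
for _ in range(2):
    g1, g2 = t1, 10 * t2     # gradient [theta1, 10*theta2]
    t1, t2 = t1 - eta * g1, t2 - eta * g2
print(t1, t2)                # theta_2 = [8.1, 0]
```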
2. Momentum (EMA formulation)
Params: \( \beta = 0.9 \). Initialize \( v_0 = [0, 0] \).
Iteration 1:
1. Gradient: \( g_0 = [10, 10] \)
2. Velocity: \( v_1 = \beta v_0 + (1-\beta)g_0 \)
\( v_1 = 0.9[0,0] + 0.1[10,10] = [0,0] + [1,1] = [1, 1] \)
3. Update: \( \theta_1 = \theta_0 - \eta v_1 \)
\( \theta_1 = [10, 1] - 0.1[1, 1] \)
\( \theta_1 = [10, 1] - [0.1, 0.1] = [9.9, 0.9] \)
Iteration 2:
1. Gradient: \( g_1 = [9.9, 10(0.9)] = [9.9, 9.0] \)
2. Velocity: \( v_2 = \beta v_1 + (1-\beta)g_1 \)
\( v_2 = 0.9[1, 1] + 0.1[9.9, 9.0] \)
\( v_2 = [0.9, 0.9] + [0.99, 0.9] = [1.89, 1.8] \)
3. Update: \( \theta_2 = \theta_1 - \eta v_2 \)
\( \theta_2 = [9.9, 0.9] - 0.1[1.89, 1.8] \)
\( \theta_2 = [9.9, 0.9] - [0.189, 0.18] = [9.711, 0.72] \)
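The same trace for the EMA momentum update, as a short sketch:

```python
# Momentum (EMA form): v <- beta*v + (1-beta)*g; theta <- theta - eta*v
eta, beta = 0.1, 0.9
t1, t2 = 10.0, 1.0           # theta_0
v1, v2 = 0.0, 0.0            # velocity v_0
for _ in range(2):
    g1, g2 = t1, 10 * t2
    v1 = beta * v1 + (1 - beta) * g1
    v2 = beta * v2 + (1 - beta) * g2
    t1, t2 = t1 - eta * v1, t2 - eta * v2
print(t1, t2)                # theta_2 = [9.711, 0.72]
```

Because the velocity starts at zero, the EMA form takes tiny early steps; this is the same cold-start effect Adam's bias correction fixes below.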
3. Adagrad
Initialize the sum of squared gradients \( G_0 = [0, 0] \). Epsilon \( \epsilon = 10^{-8} \) (small enough to neglect in the arithmetic below).
Iteration 1:
1. Gradient: \( g_0 = [10, 10] \)
2. Accumulate Squares: \( G_1 = G_0 + g_0^2 \)
\( G_1 = [0,0] + [10^2, 10^2] = [100, 100] \)
3. Update: \( \theta_1 = \theta_0 - \frac{\eta}{\sqrt{G_1}} \odot g_0 \)
\( \theta_1 = [10, 1] - \frac{0.1}{\sqrt{[100, 100]}} \odot [10, 10] \)
\( \theta_1 = [10, 1] - [ \frac{0.1}{10}(10), \frac{0.1}{10}(10) ] \)
\( \theta_1 = [10, 1] - [0.1, 0.1] = [9.9, 0.9] \)
Iteration 2:
1. Gradient: \( g_1 = [9.9, 9.0] \)
2. Accumulate Squares: \( G_2 = G_1 + g_1^2 \)
\( G_2 = [100, 100] + [9.9^2, 9.0^2] \)
\( G_2 = [100, 100] + [98.01, 81] = [198.01, 181] \)
3. Update: \( \theta_2 = \theta_1 - \frac{\eta}{\sqrt{G_2}} \odot g_1 \)
\( \theta_2 = [9.9, 0.9] - \frac{0.1}{\sqrt{[198.01, 181]}} \odot [9.9, 9.0] \)
\( \theta_2 \approx [9.9, 0.9] - [ \frac{0.1}{14.07}(9.9), \frac{0.1}{13.45}(9.0) ] \)
\( \theta_2 \approx [9.9, 0.9] - [ 0.0071(9.9), 0.0074(9.0) ] \)
\( \theta_2 \approx [9.9, 0.9] - [0.07, 0.067] = [9.83, 0.833] \)
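The Adagrad trace as a sketch (here \( \epsilon \) is kept in the code even though it is negligible in the hand calculation):

```python
# Adagrad: G accumulates squared gradients; per-parameter step eta/sqrt(G)
eta, eps = 0.1, 1e-8
t1, t2 = 10.0, 1.0           # theta_0
G1, G2 = 0.0, 0.0            # accumulated squared gradients G_0
for _ in range(2):
    g1, g2 = t1, 10 * t2
    G1 += g1 ** 2
    G2 += g2 ** 2
    t1 -= eta / (G1 ** 0.5 + eps) * g1
    t2 -= eta / (G2 ** 0.5 + eps) * g2
print(t1, t2)                # theta_2 ~ [9.83, 0.833]
```

Since \( G \) only grows, the effective step size shrinks monotonically, which is Adagrad's well-known weakness on long runs.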
4. RMSprop
Params: \( \beta = 0.999 \) (matching Adam's \( \beta_2 \) for comparison; the usual RMSprop default is 0.9). Initialize \( E_0 = [0, 0] \).
Iteration 1:
1. Gradient: \( g_0 = [10, 10] \)
2. Moving Avg: \( E_1 = \beta E_0 + (1-\beta) g_0^2 \)
\( E_1 = 0.999[0,0] + 0.001[100, 100] = [0.1, 0.1] \)
3. Update: \( \theta_1 = \theta_0 - \frac{\eta}{\sqrt{E_1}} \odot g_0 \)
\( \theta_1 = [10, 1] - \frac{0.1}{\sqrt{[0.1, 0.1]}} \odot [10, 10] \)
\( \theta_1 = [10, 1] - [ \frac{0.1}{0.316}(10), \frac{0.1}{0.316}(10) ] \)
\( \theta_1 \approx [10, 1] - [ 0.316(10), 0.316(10) ] \)
\( \theta_1 = [10, 1] - [3.16, 3.16] = [6.84, -2.16] \)
Iteration 2:
1. Gradient: \( g_1 = [6.84, 10(-2.16)] = [6.84, -21.6] \)
2. Moving Avg: \( E_2 = 0.999 E_1 + 0.001 g_1^2 \)
\( E_2 = 0.999[0.1, 0.1] + 0.001[6.84^2, (-21.6)^2] \)
\( E_2 \approx [0.0999, 0.0999] + 0.001[46.7, 466.5] \)
\( E_2 \approx [0.0999, 0.0999] + [0.047, 0.467] = [0.147, 0.567] \)
3. Update: \( \theta_2 = \theta_1 - \frac{\eta}{\sqrt{E_2}} \odot g_1 \)
\( \theta_2 = [6.84, -2.16] - \frac{0.1}{\sqrt{[0.147, 0.567]}} \odot [6.84, -21.6] \)
\( \theta_2 \approx [6.84, -2.16] - [ \frac{0.1}{0.38}(6.84), \frac{0.1}{0.75}(-21.6) ] \)
\( \theta_2 \approx [6.84, -2.16] - [ 0.26(6.84), 0.133(-21.6) ] \)
\( \theta_2 \approx [6.84, -2.16] - [ 1.78, -2.87 ] = [5.06, 0.71] \)
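The RMSprop trace, sketched the same way (no \( \epsilon \) term, matching the hand calculation above):

```python
# RMSprop: E is an EMA of squared gradients; per-parameter step eta/sqrt(E)
eta, beta = 0.1, 0.999
t1, t2 = 10.0, 1.0           # theta_0
E1, E2 = 0.0, 0.0            # EMA of squared gradients E_0
for _ in range(2):
    g1, g2 = t1, 10 * t2
    E1 = beta * E1 + (1 - beta) * g1 ** 2
    E2 = beta * E2 + (1 - beta) * g2 ** 2
    t1 -= eta / E1 ** 0.5 * g1
    t2 -= eta / E2 ** 0.5 * g2
print(t1, t2)                # theta_2 ~ [5.05, 0.71]
```

Running this confirms the overshoot: with \( E_0 = 0 \) and \( \beta = 0.999 \), the first \( E \) is 1000x too small, so the first step is roughly \( \sqrt{1000} \approx 31.6 \) times larger than intended.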
5. Adam
Params: \( \beta_1 = 0.9, \beta_2 = 0.999 \). Init \( m_0=0, v_0=0 \).
Iteration 1:
1. Gradient: \( g_0 = [10, 10] \)
2. Update Moments:
\( m_1 = \beta_1 m_0 + (1-\beta_1) g_0 = 0.9(0) + 0.1[10,10] = [1, 1] \)
\( v_1 = \beta_2 v_0 + (1-\beta_2) g_0^2 = 0.999(0) + 0.001[100,100] = [0.1, 0.1] \)
3. Bias Correct (t=1):
\( \hat{m}_1 = m_1 / (1 - 0.9^1) = [1, 1] / 0.1 = [10, 10] \)
\( \hat{v}_1 = v_1 / (1 - 0.999^1) = [0.1, 0.1] / 0.001 = [100, 100] \)
4. Update: \( \theta_1 = \theta_0 - \frac{\eta}{\sqrt{\hat{v}_1}} \odot \hat{m}_1 \)
\( \theta_1 = [10, 1] - \frac{0.1}{\sqrt{[100, 100]}} \odot [10, 10] \)
\( \theta_1 = [10, 1] - [ \frac{0.1}{10}(10), \frac{0.1}{10}(10) ] \)
\( \theta_1 = [10, 1] - [0.1, 0.1] = [9.9, 0.9] \)
Iteration 2:
1. Gradient: \( g_1 = [9.9, 9.0] \)
2. Update Moments:
\( m_2 = 0.9[1, 1] + 0.1[9.9, 9.0] = [0.9, 0.9] + [0.99, 0.9] = [1.89, 1.8] \)
\( v_2 = 0.999[0.1, 0.1] + 0.001[9.9^2, 9.0^2] \)
\( v_2 \approx [0.1, 0.1] + 0.001[98, 81] = [0.1, 0.1] + [0.098, 0.081] = [0.198, 0.181] \)
3. Bias Correct (t=2):
Correction factors: \( 1-0.9^2 = 0.19 \), \( 1-0.999^2 \approx 0.002 \)
\( \hat{m}_2 = [1.89, 1.8] / 0.19 \approx [9.95, 9.47] \)
\( \hat{v}_2 = [0.198, 0.181] / 0.002 \approx [99, 90.5] \)
4. Update: \( \theta_2 = \theta_1 - \eta \frac{\hat{m}_2}{\sqrt{\hat{v}_2}} \)
\( \theta_2 = [9.9, 0.9] - 0.1 [ \frac{9.95}{\sqrt{99}}, \frac{9.47}{\sqrt{90.5}} ] \)
\( \theta_2 \approx [9.9, 0.9] - 0.1 [ 1.0, 1.0 ] \)
\( \theta_2 = [9.9, 0.9] - [0.1, 0.1] = [9.8, 0.8] \)
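Finally, the Adam trace, including the bias-correction step:

```python
# Adam: first/second moment EMAs, bias-corrected by (1 - beta^t)
eta, b1, b2 = 0.1, 0.9, 0.999
t1, t2 = 10.0, 1.0                   # theta_0
m1 = m2 = v1 = v2 = 0.0              # m_0, v_0
for t in range(1, 3):                # t = 1, 2
    g1, g2 = t1, 10 * t2
    m1 = b1 * m1 + (1 - b1) * g1
    m2 = b1 * m2 + (1 - b1) * g2
    v1 = b2 * v1 + (1 - b2) * g1 ** 2
    v2 = b2 * v2 + (1 - b2) * g2 ** 2
    mh1, mh2 = m1 / (1 - b1 ** t), m2 / (1 - b1 ** t)   # bias-corrected m
    vh1, vh2 = v1 / (1 - b2 ** t), v2 / (1 - b2 ** t)   # bias-corrected v
    t1 -= eta * mh1 / vh1 ** 0.5
    t2 -= eta * mh2 / vh2 ** 0.5
print(t1, t2)                        # theta_2 ~ [9.8, 0.8]
```

The division by \( 1 - \beta^t \) exactly undoes the zero initialization of the EMAs, which is why Adam's first step is a calm 0.1 rather than RMSprop's 3.16.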
Notice how Adam's bias correction kept the first step at a sensible size (0.1), while RMSprop's first step (3.16) overshot badly because its squared-gradient average \( E \) was initialized at zero and had not yet warmed up.