Here, all operations are element-wise. Parameters that consistently receive large gradients have their effective learning rate shrunk over time.
7. Walkthrough: Manual Calculation
We will trace the first two iterations of all five algorithms to see exactly how they differ.
Setup:
Function: \( J(\theta) = 0.5\theta_1^2 + 5\theta_2^2 \)
Gradient: \( \nabla J = [\theta_1, 10\theta_2] \)
Start Point: \( \theta_0 = [10, 1] \)
Learning Rate: \( \eta = 0.1 \)
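Before tracing the optimizers, the setup can be sanity-checked in a couple of lines of Python (the names `J` and `grad` are just for illustration):

```python
# Objective and gradient from the setup above:
# J(theta) = 0.5*theta1^2 + 5*theta2^2, grad J = [theta1, 10*theta2]
def J(t1, t2):
    return 0.5 * t1 ** 2 + 5 * t2 ** 2

def grad(t1, t2):
    return (t1, 10 * t2)

print(J(10.0, 1.0))     # -> 55.0
print(grad(10.0, 1.0))  # -> (10.0, 10.0)
```

Note the strong curvature imbalance: the \( \theta_2 \) direction is 10x steeper, which is what makes the optimizers behave so differently below.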
1. Vanilla Gradient Descent
Iteration 1:
1. Gradient: \( g_0 = \nabla J(\theta_0) \)
\( g_0 = [10, 10(1)] = [10, 10] \)
2. Update: \( \theta_1 = \theta_0 - \eta \cdot g_0 \)
\( \theta_1 = [10, 1] - 0.1 \cdot [10, 10] \)
\( \theta_1 = [10, 1] - [1, 1] = [9, 0] \)
Iteration 2:
1. Gradient: \( g_1 = \nabla J(\theta_1) \)
\( g_1 = [9, 10(0)] = [9, 0] \)
2. Update: \( \theta_2 = \theta_1 - \eta \cdot g_1 \)
\( \theta_2 = [9, 0] - 0.1 \cdot [9, 0] \)
\( \theta_2 = [9, 0] - [0.9, 0] = [8.1, 0] \)
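The two vanilla GD iterations above can be reproduced with a minimal sketch (scalar components, variable names are my own):

```python
# Vanilla gradient descent: theta <- theta - eta * grad J(theta)
eta = 0.1
t1, t2 = 10.0, 1.0           # start point theta_0
for _ in range(2):
    g1, g2 = t1, 10 * t2     # gradient [theta1, 10*theta2]
    t1, t2 = t1 - eta * g1, t2 - eta * g2
print(t1, t2)                # theta_2 = [8.1, 0]
```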
2. Momentum (EMA formulation)
Params: \( \beta = 0.9 \). Initialize \( v_0 = [0, 0] \).
Iteration 1:
1. Gradient: \( g_0 = [10, 10] \)
2. Velocity: \( v_1 = \beta v_0 + (1-\beta)g_0 \)
\( v_1 = 0.9[0,0] + 0.1[10,10] = [0,0] + [1,1] = [1, 1] \)
3. Update: \( \theta_1 = \theta_0 - \eta v_1 \)
\( \theta_1 = [10, 1] - 0.1[1, 1] \)
\( \theta_1 = [10, 1] - [0.1, 0.1] = [9.9, 0.9] \)
Iteration 2:
1. Gradient: \( g_1 = [9.9, 10(0.9)] = [9.9, 9.0] \)
2. Velocity: \( v_2 = \beta v_1 + (1-\beta)g_1 \)
\( v_2 = 0.9[1, 1] + 0.1[9.9, 9.0] \)
\( v_2 = [0.9, 0.9] + [0.99, 0.9] = [1.89, 1.8] \)
3. Update: \( \theta_2 = \theta_1 - \eta v_2 \)
\( \theta_2 = [9.9, 0.9] - 0.1[1.89, 1.8] \)
\( \theta_2 = [9.9, 0.9] - [0.189, 0.18] = [9.711, 0.72] \)
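The same trace for the EMA momentum update, as a short sketch:

```python
# Momentum (EMA form): v <- beta*v + (1-beta)*g; theta <- theta - eta*v
eta, beta = 0.1, 0.9
t1, t2 = 10.0, 1.0           # theta_0
v1, v2 = 0.0, 0.0            # velocity v_0
for _ in range(2):
    g1, g2 = t1, 10 * t2
    v1 = beta * v1 + (1 - beta) * g1
    v2 = beta * v2 + (1 - beta) * g2
    t1, t2 = t1 - eta * v1, t2 - eta * v2
print(t1, t2)                # theta_2 = [9.711, 0.72]
```

Because the velocity starts at zero, the EMA form takes tiny early steps; this is the same cold-start effect Adam's bias correction fixes below.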
3. Adagrad
Initialize the sum of squared gradients \( G_0 = [0, 0] \). Epsilon \( \epsilon = 10^{-8} \) (small enough to neglect in the arithmetic below).
Iteration 1:
1. Gradient: \( g_0 = [10, 10] \)
2. Accumulate Squares: \( G_1 = G_0 + g_0^2 \)
\( G_1 = [0,0] + [10^2, 10^2] = [100, 100] \)
3. Update: \( \theta_1 = \theta_0 - \frac{\eta}{\sqrt{G_1}} \odot g_0 \)
\( \theta_1 = [10, 1] - \frac{0.1}{\sqrt{[100, 100]}} \odot [10, 10] \)
\( \theta_1 = [10, 1] - [ \frac{0.1}{10}(10), \frac{0.1}{10}(10) ] \)
\( \theta_1 = [10, 1] - [0.1, 0.1] = [9.9, 0.9] \)
Iteration 2:
1. Gradient: \( g_1 = [9.9, 9.0] \)
2. Accumulate Squares: \( G_2 = G_1 + g_1^2 \)
\( G_2 = [100, 100] + [9.9^2, 9.0^2] \)
\( G_2 = [100, 100] + [98.01, 81] = [198.01, 181] \)
3. Update: \( \theta_2 = \theta_1 - \frac{\eta}{\sqrt{G_2}} \odot g_1 \)
\( \theta_2 = [9.9, 0.9] - \frac{0.1}{\sqrt{[198.01, 181]}} \odot [9.9, 9.0] \)
\( \theta_2 \approx [9.9, 0.9] - [ \frac{0.1}{14.07}(9.9), \frac{0.1}{13.45}(9.0) ] \)
\( \theta_2 \approx [9.9, 0.9] - [ 0.0071(9.9), 0.0074(9.0) ] \)
\( \theta_2 \approx [9.9, 0.9] - [0.07, 0.067] = [9.83, 0.833] \)
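The Adagrad trace as a sketch (here \( \epsilon \) is kept in the code even though it is negligible in the hand calculation):

```python
# Adagrad: G accumulates squared gradients; per-parameter step eta/sqrt(G)
eta, eps = 0.1, 1e-8
t1, t2 = 10.0, 1.0           # theta_0
G1, G2 = 0.0, 0.0            # accumulated squared gradients G_0
for _ in range(2):
    g1, g2 = t1, 10 * t2
    G1 += g1 ** 2
    G2 += g2 ** 2
    t1 -= eta / (G1 ** 0.5 + eps) * g1
    t2 -= eta / (G2 ** 0.5 + eps) * g2
print(t1, t2)                # theta_2 ~ [9.83, 0.833]
```

Since \( G \) only grows, the effective step size shrinks monotonically, which is Adagrad's well-known weakness on long runs.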
4. RMSprop
Params: \( \beta = 0.999 \) (matching Adam's \( \beta_2 \) for comparison; the usual RMSprop default is 0.9). Initialize \( E_0 = [0, 0] \).
Iteration 1:
1. Gradient: \( g_0 = [10, 10] \)
2. Moving Avg: \( E_1 = \beta E_0 + (1-\beta) g_0^2 \)
\( E_1 = 0.999[0,0] + 0.001[100, 100] = [0.1, 0.1] \)
3. Update: \( \theta_1 = \theta_0 - \frac{\eta}{\sqrt{E_1}} \odot g_0 \)
\( \theta_1 = [10, 1] - \frac{0.1}{\sqrt{[0.1, 0.1]}} \odot [10, 10] \)
\( \theta_1 = [10, 1] - [ \frac{0.1}{0.316}(10), \frac{0.1}{0.316}(10) ] \)
\( \theta_1 \approx [10, 1] - [ 0.316(10), 0.316(10) ] \)
\( \theta_1 = [10, 1] - [3.16, 3.16] = [6.84, -2.16] \)
Iteration 2:
1. Gradient: \( g_1 = [6.84, 10(-2.16)] = [6.84, -21.6] \)
2. Moving Avg: \( E_2 = 0.999 E_1 + 0.001 g_1^2 \)
\( E_2 = 0.999[0.1, 0.1] + 0.001[6.84^2, (-21.6)^2] \)
\( E_2 \approx [0.0999, 0.0999] + 0.001[46.7, 466.5] \)
\( E_2 \approx [0.0999, 0.0999] + [0.047, 0.467] = [0.147, 0.567] \)
3. Update: \( \theta_2 = \theta_1 - \frac{\eta}{\sqrt{E_2}} \odot g_1 \)
\( \theta_2 = [6.84, -2.16] - \frac{0.1}{\sqrt{[0.147, 0.567]}} \odot [6.84, -21.6] \)
\( \theta_2 \approx [6.84, -2.16] - [ \frac{0.1}{0.38}(6.84), \frac{0.1}{0.75}(-21.6) ] \)
\( \theta_2 \approx [6.84, -2.16] - [ 0.26(6.84), 0.133(-21.6) ] \)
\( \theta_2 \approx [6.84, -2.16] - [ 1.78, -2.87 ] = [5.06, 0.71] \)
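The RMSprop trace, sketched the same way (no \( \epsilon \) term, matching the hand calculation above):

```python
# RMSprop: E is an EMA of squared gradients; per-parameter step eta/sqrt(E)
eta, beta = 0.1, 0.999
t1, t2 = 10.0, 1.0           # theta_0
E1, E2 = 0.0, 0.0            # EMA of squared gradients E_0
for _ in range(2):
    g1, g2 = t1, 10 * t2
    E1 = beta * E1 + (1 - beta) * g1 ** 2
    E2 = beta * E2 + (1 - beta) * g2 ** 2
    t1 -= eta / E1 ** 0.5 * g1
    t2 -= eta / E2 ** 0.5 * g2
print(t1, t2)                # theta_2 ~ [5.05, 0.71]
```

Running this confirms the overshoot: with \( E_0 = 0 \) and \( \beta = 0.999 \), the first \( E \) is 1000x too small, so the first step is roughly \( \sqrt{1000} \approx 31.6 \) times larger than intended.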
5. Adam
Params: \( \beta_1 = 0.9, \beta_2 = 0.999 \). Init \( m_0=0, v_0=0 \).
Iteration 1:
1. Gradient: \( g_0 = [10, 10] \)
2. Update Moments:
\( m_1 = \beta_1 m_0 + (1-\beta_1) g_0 = 0.9(0) + 0.1[10,10] = [1, 1] \)
\( v_1 = \beta_2 v_0 + (1-\beta_2) g_0^2 = 0.999(0) + 0.001[100,100] = [0.1, 0.1] \)
3. Bias Correct (t=1):
\( \hat{m}_1 = m_1 / (1 - 0.9^1) = [1, 1] / 0.1 = [10, 10] \)
\( \hat{v}_1 = v_1 / (1 - 0.999^1) = [0.1, 0.1] / 0.001 = [100, 100] \)
4. Update: \( \theta_1 = \theta_0 - \frac{\eta}{\sqrt{\hat{v}_1}} \odot \hat{m}_1 \)
\( \theta_1 = [10, 1] - \frac{0.1}{\sqrt{[100, 100]}} \odot [10, 10] \)
\( \theta_1 = [10, 1] - [ \frac{0.1}{10}(10), \frac{0.1}{10}(10) ] \)
\( \theta_1 = [10, 1] - [0.1, 0.1] = [9.9, 0.9] \)
Iteration 2:
1. Gradient: \( g_1 = [9.9, 9.0] \)
2. Update Moments:
\( m_2 = 0.9[1, 1] + 0.1[9.9, 9.0] = [0.9, 0.9] + [0.99, 0.9] = [1.89, 1.8] \)
\( v_2 = 0.999[0.1, 0.1] + 0.001[9.9^2, 9.0^2] \)
\( v_2 \approx [0.1, 0.1] + 0.001[98, 81] = [0.1, 0.1] + [0.098, 0.081] = [0.198, 0.181] \)
3. Bias Correct (t=2):
Correction factors: \( 1-0.9^2 = 0.19 \), \( 1-0.999^2 \approx 0.002 \)
\( \hat{m}_2 = [1.89, 1.8] / 0.19 \approx [9.95, 9.47] \)
\( \hat{v}_2 = [0.198, 0.181] / 0.002 \approx [99, 90.5] \)
4. Update: \( \theta_2 = \theta_1 - \eta \frac{\hat{m}_2}{\sqrt{\hat{v}_2}} \)
\( \theta_2 = [9.9, 0.9] - 0.1 [ \frac{9.95}{\sqrt{99}}, \frac{9.47}{\sqrt{90.5}} ] \)
\( \theta_2 \approx [9.9, 0.9] - 0.1 [ 1.0, 1.0 ] \)
\( \theta_2 = [9.9, 0.9] - [0.1, 0.1] = [9.8, 0.8] \)
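Finally, the Adam trace, including the bias-correction step:

```python
# Adam: first/second moment EMAs, bias-corrected by (1 - beta^t)
eta, b1, b2 = 0.1, 0.9, 0.999
t1, t2 = 10.0, 1.0                   # theta_0
m1 = m2 = v1 = v2 = 0.0              # m_0, v_0
for t in range(1, 3):                # t = 1, 2
    g1, g2 = t1, 10 * t2
    m1 = b1 * m1 + (1 - b1) * g1
    m2 = b1 * m2 + (1 - b1) * g2
    v1 = b2 * v1 + (1 - b2) * g1 ** 2
    v2 = b2 * v2 + (1 - b2) * g2 ** 2
    mh1, mh2 = m1 / (1 - b1 ** t), m2 / (1 - b1 ** t)   # bias-corrected m
    vh1, vh2 = v1 / (1 - b2 ** t), v2 / (1 - b2 ** t)   # bias-corrected v
    t1 -= eta * mh1 / vh1 ** 0.5
    t2 -= eta * mh2 / vh2 ** 0.5
print(t1, t2)                        # theta_2 ~ [9.8, 0.8]
```

The division by \( 1 - \beta^t \) exactly undoes the zero initialization of the EMAs, which is why Adam's first step is a calm 0.1 rather than RMSprop's 3.16.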
Notice how Adam's bias correction kept the first step at a sensible size (0.1), while RMSprop's first step (3.16) overshot badly because its squared-gradient average \( E \) was initialized at zero and had not yet warmed up.