3D Loss Landscapes

Visualizing the Optimization Journey

1. What is a Loss Landscape?

A loss landscape is a visualization of the loss function as a surface over the model's parameters. The full landscape lives in a space with one dimension per parameter, but for two parameters (or a 2D slice of a larger space) we can plot it as a 3D surface where the two horizontal axes are the parameter values and the vertical axis (height) is the loss.
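The data behind such a plot is just the loss evaluated on a grid of parameter pairs. A minimal sketch, using an assumed toy quadratic loss (any scalar function of two parameters could be swapped in):

```python
import numpy as np

# Toy loss over two parameters: a simple quadratic bowl.
def loss(w1, w2):
    return w1**2 + 0.5 * w2**2

# Evaluate the loss on a grid. This (W1, W2, Z) triple is exactly what a
# 3D surface plot (e.g. matplotlib's plot_surface) consumes.
w1 = np.linspace(-2.0, 2.0, 50)
w2 = np.linspace(-2.0, 2.0, 50)
W1, W2 = np.meshgrid(w1, w2)
Z = loss(W1, W2)

print(Z.shape)   # one loss value per (w1, w2) grid point
print(Z.min())   # lowest point on this sampled surface
```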

2. Common Landscape Features

🏔️ Local Minima

Valleys where the gradient is zero but the loss is not the lowest achievable. Optimization can get stuck here. Common in non-convex problems.

🎯 Global Minimum

The deepest valley, the best possible solution. Finding it is the goal of optimization.

🏄 Saddle Points

Points where the gradient is zero but that are not minima: the surface curves up in some directions and down in others.

🌊 Plateaus

Flat regions with near-zero gradients. Optimization slows down dramatically here.
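These features can be told apart numerically from local derivatives: the gradient vanishes at minima, maxima, and saddles, and the signs of the Hessian's eigenvalues distinguish them. A sketch, where `numerical_hessian` and `classify` are illustrative helpers written for this example and the double-well loss is an assumed toy function (minima at (±1, 0), saddle at (0, 0)):

```python
import numpy as np

# Toy double-well loss: minima at (+-1, 0), a saddle between them at (0, 0).
def f(p):
    x, y = p
    return (x**2 - 1)**2 + y**2

def numerical_hessian(f, p, h=1e-4):
    """Central-difference Hessian of a scalar function at point p."""
    p = np.asarray(p, dtype=float)
    n = p.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            pp = p.copy(); pp[i] += h; pp[j] += h
            pm = p.copy(); pm[i] += h; pm[j] -= h
            mp = p.copy(); mp[i] -= h; mp[j] += h
            mm = p.copy(); mm[i] -= h; mm[j] -= h
            H[i, j] = (f(pp) - f(pm) - f(mp) + f(mm)) / (4 * h**2)
    return H

def classify(f, p):
    """Classify a critical point by the signs of the Hessian eigenvalues.
    Eigenvalues near zero would indicate a flat, plateau-like direction."""
    eig = np.linalg.eigvalsh(numerical_hessian(f, p))
    if np.all(eig > 0):
        return "minimum"
    if np.all(eig < 0):
        return "maximum"
    return "saddle"

print(classify(f, [0.0, 0.0]))   # saddle between the two wells
print(classify(f, [1.0, 0.0]))   # one of the minima
```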

3. Interactive 3D Loss Surface

Rotate and zoom the 3D surface. Watch an optimizer navigate from the starting point to the minimum.
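The trajectory the animation traces is ordinary gradient descent. A minimal numpy sketch, with the surface, start point, and learning rate all assumed for illustration:

```python
import numpy as np

# Plain gradient descent walking down a bowl-shaped surface.
def loss(p):
    return p[0]**2 + 0.5 * p[1]**2

def grad(p):
    return np.array([2.0 * p[0], 1.0 * p[1]])

p = np.array([1.8, -1.5])        # starting point on the surface
lr = 0.1                         # learning rate (assumed, not tuned)
path = [p.copy()]                # record the trajectory, step by step
for _ in range(100):
    p = p - lr * grad(p)
    path.append(p.copy())

print(path[0], loss(path[0]))    # start: high up on the surface
print(path[-1], loss(path[-1]))  # end: near the minimum at (0, 0)
```

Plotting `path` on top of the surface grid reproduces the descent trail the visualization animates.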

4. Famous Loss Functions

Rosenbrock Function (Banana Valley)

$$ f(x,y) = (1-x)^2 + 100(y-x^2)^2 $$

Has a long, narrow, curved valley. Finding the valley is easy, but converging to the minimum at (1, 1) is hard. Tests an optimizer's ability to navigate narrow curved valleys.
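The function and its analytic gradient are straightforward to implement directly from the formula above, and checking the known minimum is a good sanity test:

```python
import numpy as np

# Rosenbrock function as written above; global minimum at (1, 1) with f = 0.
def rosenbrock(x, y):
    return (1 - x)**2 + 100 * (y - x**2)**2

def rosenbrock_grad(x, y):
    dx = -2 * (1 - x) - 400 * x * (y - x**2)
    dy = 200 * (y - x**2)
    return np.array([dx, dy])

print(rosenbrock(1.0, 1.0))        # 0.0 at the minimum
print(rosenbrock_grad(1.0, 1.0))   # gradient vanishes at the minimum
print(rosenbrock(0.0, 0.0))        # inside the valley but far from (1, 1)
```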

Rastrigin Function

$$ f(x,y) = 20 + x^2 + y^2 - 10(\cos(2\pi x) + \cos(2\pi y)) $$

Highly multimodal, with many local minima surrounding the global minimum at (0, 0). Tests an optimizer's ability to escape local minima.
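Implementing the formula above makes the multimodality concrete: the global minimum sits at the origin, while points near every other integer lattice site are local minima with strictly higher loss.

```python
import numpy as np

# Rastrigin function as written above; global minimum f(0, 0) = 0.
def rastrigin(x, y):
    return (20 + x**2 + y**2
            - 10 * (np.cos(2 * np.pi * x) + np.cos(2 * np.pi * y)))

print(rastrigin(0.0, 0.0))   # global minimum
print(rastrigin(1.0, 0.0))   # a nearby local minimum, higher loss
```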

Saddle Point

$$ f(x,y) = x^2 - y^2 $$

Classic saddle at the origin: the gradient is zero there, but it is not a minimum. Tests an optimizer's behavior at saddle points.
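A short sketch of gradient descent on this surface shows why saddles are less of a trap than they look: started exactly on the x-axis the iterate would crawl into the saddle, but any tiny perturbation in y (here an assumed 1e-6) gets amplified every step, which is how gradient noise helps optimizers escape.

```python
import numpy as np

# Gradient of f(x, y) = x^2 - y^2.
def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])

p = np.array([1.0, 1e-6])    # almost exactly on the descent path into the saddle
lr = 0.1
for _ in range(100):
    p = p - lr * grad(p)     # x shrinks by 0.8x per step, y grows by 1.2x

print(p)   # x has collapsed toward 0; y has blown up past the saddle
```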

5. Why Deep Networks Have Complex Landscapes

Key Insights from Research:
  • High-dimensional spaces have exponentially more saddle points than local minima
  • In very deep networks, most critical points are saddles, not local minima
  • Local minima in deep networks tend to have similar loss values (loss plateau)
  • Wide minima (flat valleys) tend to generalize better than sharp minima
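The flat-vs-sharp intuition can be made concrete with a toy 1-D illustration (assumed for this sketch, not taken from the cited research): two minima with the same optimal loss but very different curvature, where a small parameter shift, standing in for train/test distribution shift, barely hurts the flat minimum and badly hurts the sharp one.

```python
flat  = lambda w: 0.01 * w**2    # wide, flat valley around w = 0
sharp = lambda w: 100.0 * w**2   # narrow, sharp valley around w = 0

shift = 0.1                      # small perturbation away from the minimum
print(flat(shift))               # loss stays tiny
print(sharp(shift))              # loss explodes
```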

6. Optimization Challenges

Challenge                Cause                        Solution
-----------------------  ---------------------------  ----------------------------------------------
Stuck in local minima    Multimodal landscape         Momentum, random restarts, simulated annealing
Slow on plateaus         Near-zero gradients          Adaptive learning rates (Adam, RMSprop)
Oscillation in valleys   High curvature differences   Momentum, learning rate decay
Exploding gradients      Steep regions                Gradient clipping, batch normalization
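Of these fixes, gradient clipping is the simplest to show in full. A minimal numpy sketch of clipping by global norm (the same idea as PyTorch's `torch.nn.utils.clip_grad_norm_`): rescale the gradient whenever its norm exceeds a threshold, so steep regions cannot produce giant steps.

```python
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    """Rescale grad so its Euclidean norm never exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, -40.0])          # exploding gradient, norm = 50
g_clipped = clip_by_norm(g, max_norm=1.0)
print(np.linalg.norm(g_clipped))     # norm capped at the threshold
```

The direction of the gradient is preserved; only its magnitude is capped, so the update still points downhill.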