3D Loss Landscapes

Visualizing the Optimization Journey

1. What is a Loss Landscape?

A loss landscape is a visualization of the loss function as a surface over the model's parameters. The full landscape lives in a space with one dimension per parameter, but for two parameters (or a 2D slice of a larger space) we can plot it as a 3D surface where the two horizontal axes are the parameter values and the vertical axis (height) is the loss.
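The data behind such a plot is just the loss evaluated on a grid of parameter pairs. A minimal sketch, using an assumed toy quadratic loss (any scalar function of two parameters could be swapped in):

```python
import numpy as np

# Toy loss over two parameters: a simple quadratic bowl.
def loss(w1, w2):
    return w1**2 + 0.5 * w2**2

# Evaluate the loss on a grid. This (W1, W2, Z) triple is exactly what a
# 3D surface plot (e.g. matplotlib's plot_surface) consumes.
w1 = np.linspace(-2.0, 2.0, 50)
w2 = np.linspace(-2.0, 2.0, 50)
W1, W2 = np.meshgrid(w1, w2)
Z = loss(W1, W2)

print(Z.shape)   # one loss value per (w1, w2) grid point
print(Z.min())   # lowest point on this sampled surface
```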

2. Common Landscape Features

🏔️ Local Minima

Valleys where the gradient is zero but the loss is not the lowest achievable. Optimization can get stuck here. Common in non-convex problems.

🎯 Global Minimum

The deepest valley, the best possible solution. Finding it is the goal of optimization.

🏄 Saddle Points

Points where the gradient is zero but that are not minima: the surface curves up in some directions and down in others.

🌊 Plateaus

Flat regions with near-zero gradients. Optimization slows down dramatically here.
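These features can be told apart numerically from local derivatives: the gradient vanishes at minima, maxima, and saddles, and the signs of the Hessian's eigenvalues distinguish them. A sketch, where `numerical_hessian` and `classify` are illustrative helpers written for this example and the double-well loss is an assumed toy function (minima at (±1, 0), saddle at (0, 0)):

```python
import numpy as np

# Toy double-well loss: minima at (+-1, 0), a saddle between them at (0, 0).
def f(p):
    x, y = p
    return (x**2 - 1)**2 + y**2

def numerical_hessian(f, p, h=1e-4):
    """Central-difference Hessian of a scalar function at point p."""
    p = np.asarray(p, dtype=float)
    n = p.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            pp = p.copy(); pp[i] += h; pp[j] += h
            pm = p.copy(); pm[i] += h; pm[j] -= h
            mp = p.copy(); mp[i] -= h; mp[j] += h
            mm = p.copy(); mm[i] -= h; mm[j] -= h
            H[i, j] = (f(pp) - f(pm) - f(mp) + f(mm)) / (4 * h**2)
    return H

def classify(f, p):
    """Classify a critical point by the signs of the Hessian eigenvalues.
    Eigenvalues near zero would indicate a flat, plateau-like direction."""
    eig = np.linalg.eigvalsh(numerical_hessian(f, p))
    if np.all(eig > 0):
        return "minimum"
    if np.all(eig < 0):
        return "maximum"
    return "saddle"

print(classify(f, [0.0, 0.0]))   # saddle between the two wells
print(classify(f, [1.0, 0.0]))   # one of the minima
```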

3. Interactive 3D Loss Surface

Rotate and zoom the 3D surface. Watch an optimizer navigate from the starting point to the minimum.
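The trajectory the animation traces is ordinary gradient descent. A minimal numpy sketch, with the surface, start point, and learning rate all assumed for illustration:

```python
import numpy as np

# Plain gradient descent walking down a bowl-shaped surface.
def loss(p):
    return p[0]**2 + 0.5 * p[1]**2

def grad(p):
    return np.array([2.0 * p[0], 1.0 * p[1]])

p = np.array([1.8, -1.5])        # starting point on the surface
lr = 0.1                         # learning rate (assumed, not tuned)
path = [p.copy()]                # record the trajectory, step by step
for _ in range(100):
    p = p - lr * grad(p)
    path.append(p.copy())

print(path[0], loss(path[0]))    # start: high up on the surface
print(path[-1], loss(path[-1]))  # end: near the minimum at (0, 0)
```

Plotting `path` on top of the surface grid reproduces the descent trail the visualization animates.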

4. Famous Loss Functions

Rosenbrock Function (Banana Valley)

$$ f(x,y) = (1-x)^2 + 100(y-x^2)^2 $$

Has a long, narrow, curved valley. Finding the valley is easy, but converging to the minimum at (1, 1) is hard. Tests an optimizer's ability to navigate narrow curved valleys.
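The function and its analytic gradient are straightforward to implement directly from the formula above, and checking the known minimum is a good sanity test:

```python
import numpy as np

# Rosenbrock function as written above; global minimum at (1, 1) with f = 0.
def rosenbrock(x, y):
    return (1 - x)**2 + 100 * (y - x**2)**2

def rosenbrock_grad(x, y):
    dx = -2 * (1 - x) - 400 * x * (y - x**2)
    dy = 200 * (y - x**2)
    return np.array([dx, dy])

print(rosenbrock(1.0, 1.0))        # 0.0 at the minimum
print(rosenbrock_grad(1.0, 1.0))   # gradient vanishes at the minimum
print(rosenbrock(0.0, 0.0))        # inside the valley but far from (1, 1)
```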

Rastrigin Function

$$ f(x,y) = 20 + x^2 + y^2 - 10(\cos(2\pi x) + \cos(2\pi y)) $$

Highly multimodal, with many local minima surrounding the global minimum at (0, 0). Tests an optimizer's ability to escape local minima.
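Implementing the formula above makes the multimodality concrete: the global minimum sits at the origin, while points near every other integer lattice site are local minima with strictly higher loss.

```python
import numpy as np

# Rastrigin function as written above; global minimum f(0, 0) = 0.
def rastrigin(x, y):
    return (20 + x**2 + y**2
            - 10 * (np.cos(2 * np.pi * x) + np.cos(2 * np.pi * y)))

print(rastrigin(0.0, 0.0))   # global minimum
print(rastrigin(1.0, 0.0))   # a nearby local minimum, higher loss
```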

Saddle Point

$$ f(x,y) = x^2 - y^2 $$

Classic saddle at the origin: the gradient is zero there, but it is not a minimum. Tests an optimizer's behavior at saddle points.
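A short sketch of gradient descent on this surface shows why saddles are less of a trap than they look: started exactly on the x-axis the iterate would crawl into the saddle, but any tiny perturbation in y (here an assumed 1e-6) gets amplified every step, which is how gradient noise helps optimizers escape.

```python
import numpy as np

# Gradient of f(x, y) = x^2 - y^2.
def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])

p = np.array([1.0, 1e-6])    # almost exactly on the descent path into the saddle
lr = 0.1
for _ in range(100):
    p = p - lr * grad(p)     # x shrinks by 0.8x per step, y grows by 1.2x

print(p)   # x has collapsed toward 0; y has blown up past the saddle
```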

5. Why Deep Networks Have Complex Landscapes

Key Insights from Research:
  • High-dimensional spaces have exponentially more saddle points than local minima
  • In very deep networks, most critical points are saddles, not local minima
  • Local minima in deep networks tend to have similar loss values (loss plateau)
  • Wide minima (flat valleys) tend to generalize better than sharp minima
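The flat-vs-sharp intuition can be made concrete with a toy 1-D illustration (assumed for this sketch, not taken from the cited research): two minima with the same optimal loss but very different curvature, where a small parameter shift, standing in for train/test distribution shift, barely hurts the flat minimum and badly hurts the sharp one.

```python
flat  = lambda w: 0.01 * w**2    # wide, flat valley around w = 0
sharp = lambda w: 100.0 * w**2   # narrow, sharp valley around w = 0

shift = 0.1                      # small perturbation away from the minimum
print(flat(shift))               # loss stays tiny
print(sharp(shift))              # loss explodes
```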

6. Optimization Challenges

Challenge                Cause                        Solution
-----------------------  ---------------------------  ----------------------------------------------
Stuck in local minima    Multimodal landscape         Momentum, random restarts, simulated annealing
Slow on plateaus         Near-zero gradients          Adaptive learning rates (Adam, RMSprop)
Oscillation in valleys   High curvature differences   Momentum, learning rate decay
Exploding gradients      Steep regions                Gradient clipping, batch normalization
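Of these fixes, gradient clipping is the simplest to show in full. A minimal numpy sketch of clipping by global norm (the same idea as PyTorch's `torch.nn.utils.clip_grad_norm_`): rescale the gradient whenever its norm exceeds a threshold, so steep regions cannot produce giant steps.

```python
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    """Rescale grad so its Euclidean norm never exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, -40.0])          # exploding gradient, norm = 50
g_clipped = clip_by_norm(g, max_norm=1.0)
print(np.linalg.norm(g_clipped))     # norm capped at the threshold
```

The direction of the gradient is preserved; only its magnitude is capped, so the update still points downhill.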