3D Loss Landscapes
Visualizing the Optimization Journey
1. What is a Loss Landscape?
A loss landscape is the loss function viewed as a surface over a model's parameter space. The true landscape lives in as many dimensions as there are parameters, but for two parameters we can plot it as a 3D surface where:
- X and Y axes: Two model parameters (e.g., weights)
- Z axis (height): Loss value
- Valleys: Low loss (good solutions)
- Hills/Peaks: High loss (poor solutions)
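To make this concrete, here is a minimal sketch (assuming NumPy; the toy quadratic loss is an arbitrary illustrative choice) that evaluates a two-parameter loss on a grid. The resulting array of heights is exactly the surface a 3D plot would render:

```python
import numpy as np

# A toy loss over two parameters w1, w2: a simple quadratic bowl
# with its minimum at (1, -2).
def loss(w1, w2):
    return (w1 - 1.0) ** 2 + 0.5 * (w2 + 2.0) ** 2

# Evaluate the loss on a grid: the X/Y axes are the parameters,
# and Z holds the height (loss value) at every grid point.
w1_vals = np.linspace(-4, 4, 200)
w2_vals = np.linspace(-4, 4, 200)
W1, W2 = np.meshgrid(w1_vals, w2_vals)
Z = loss(W1, W2)

print(Z.min())  # valley floor: the lowest loss on the grid (near 0 here)
```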
2. Common Landscape Features
🏔️ Local Minima
Valleys where the gradient is zero but the loss is higher than the global minimum. Optimization can get stuck here. Common in non-convex problems.
🎯 Global Minimum
The deepest valley - best possible solution. Finding it is the goal of optimization.
🏄 Saddle Points
Points where the gradient is zero but which are not minima: the surface curves up in some directions and down in others.
🌊 Plateaus
Flat regions with near-zero gradients. Optimization slows down dramatically here.
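These features can also be told apart numerically: wherever the gradient vanishes, the signs of the Hessian's eigenvalues distinguish minima, maxima, saddles, and flat directions. A minimal NumPy sketch, using the toy function f(x, y) = x² − y², whose gradient is zero at the origin:

```python
import numpy as np

# Hessian of f(x, y) = x**2 - y**2 at the origin, where the gradient vanishes.
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eig = np.linalg.eigvalsh(H)

if np.all(eig > 0):
    kind = "local minimum"       # curves up in every direction
elif np.all(eig < 0):
    kind = "local maximum"       # curves down in every direction
elif np.any(np.isclose(eig, 0)):
    kind = "degenerate (flat direction, as on a plateau)"
else:
    kind = "saddle point"        # mixed signs: up some ways, down others

print(eig, kind)                 # [-2.  2.] saddle point
```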
3. Interactive 3D Loss Surface
Rotate and zoom the 3D surface. Watch an optimizer navigate from the starting point to the minimum.
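The interaction can be approximated offline. Here is a minimal sketch, assuming NumPy and Matplotlib (the bowl-shaped loss, starting point, and learning rate are arbitrary illustrative choices), that draws a loss surface and overlays a gradient-descent path from the starting point down to the minimum:

```python
import numpy as np
import matplotlib.pyplot as plt

def loss(w1, w2):                      # toy bowl with one minimum at (1, -2)
    return (w1 - 1.0) ** 2 + 0.5 * (w2 + 2.0) ** 2

def grad(w1, w2):                      # analytic gradient of the loss above
    return np.array([2.0 * (w1 - 1.0), (w2 + 2.0)])

# Run gradient descent from a fixed starting point, recording the path.
w, lr, path = np.array([-3.0, 3.0]), 0.1, []
for _ in range(50):
    path.append(w.copy())
    w = w - lr * grad(*w)
path = np.array(path)

# Draw the surface plus the optimizer's trajectory on top of it.
W1, W2 = np.meshgrid(np.linspace(-4, 4, 100), np.linspace(-4, 4, 100))
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(W1, W2, loss(W1, W2), alpha=0.5, cmap="viridis")
ax.plot(path[:, 0], path[:, 1], loss(path[:, 0], path[:, 1]), "r.-")
ax.set_xlabel("w1"); ax.set_ylabel("w2"); ax.set_zlabel("loss")
plt.show()
```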
4. Famous Loss Functions
Rosenbrock Function (Banana Valley)
Has a long, narrow, curved valley. It is easy to reach the valley floor, but hard to converge to the global minimum at (1, 1). Tests an optimizer's ability to navigate narrow curved valleys.
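A minimal NumPy sketch of the Rosenbrock function with its standard coefficients (a = 1, b = 100):

```python
import numpy as np

def rosenbrock(x, y, a=1.0, b=100.0):
    # Minimum at (a, a**2) = (1, 1); the (y - x**2) term carves the banana valley.
    return (a - x) ** 2 + b * (y - x ** 2) ** 2

print(rosenbrock(1.0, 1.0))  # 0.0: the global minimum
print(rosenbrock(0.0, 0.0))  # 1.0: on the valley floor, yet still far from (1, 1)
```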
Rastrigin Function
Highly multimodal, with a regular grid of many local minima. Global minimum at (0, 0). Tests an optimizer's ability to escape local minima.
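A NumPy sketch of the 2D Rastrigin function (A = 10 is the conventional amplitude):

```python
import numpy as np

def rastrigin(x, y, A=10.0):
    # Global minimum of 0 at (0, 0); the cosine terms add a grid of local minima.
    return 2 * A + (x ** 2 - A * np.cos(2 * np.pi * x)) \
                 + (y ** 2 - A * np.cos(2 * np.pi * y))

print(rastrigin(0.0, 0.0))  # 0.0: the global minimum
print(rastrigin(1.0, 1.0))  # 2.0: near a local minimum that can trap an optimizer
```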
Saddle Point
A classic saddle at the origin: the gradient is zero there, but it is not a minimum. Tests an optimizer's behavior near saddle points.
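The practical danger shows up even in plain gradient descent: started exactly on the saddle's symmetry axis it converges to the saddle and stalls, while the tiniest perturbation lets it slide away. A small sketch with f(x, y) = x² − y² (note this toy function is unbounded below, so "escaping" means sliding off along y indefinitely):

```python
import numpy as np

def grad(w):                        # gradient of f(x, y) = x**2 - y**2
    return np.array([2 * w[0], -2 * w[1]])

def descend(w, lr=0.1, steps=100):  # plain gradient descent
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

print(descend(np.array([2.0, 0.0])))   # ~[0, 0]: converges to the saddle and stalls
print(descend(np.array([2.0, 1e-6])))  # the tiny y-offset grows: GD slides off the saddle
```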
5. Why Deep Networks Have Complex Landscapes
- High-dimensional spaces have exponentially more saddle points than local minima
- In very deep networks, most critical points are saddles, not local minima
- Local minima in deep networks tend to have similar loss values, so which minimum the optimizer reaches often matters little
- Wide minima (flat valleys) tend to generalize better than sharp minima (see the sketch below)
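The last point has a simple intuition: around a flat minimum, small parameter perturbations (a stand-in for the shift between training and test loss surfaces) barely change the loss, while around a sharp minimum they change it a lot. A toy 1D illustration, not a proof; the curvatures and perturbation size are arbitrary choices:

```python
# Two 1D minima with the same loss value but very different curvature.
sharp = lambda w: 50.0 * w ** 2      # narrow, sharp valley
flat  = lambda w: 0.5 * w ** 2       # wide, flat valley

# A small parameter perturbation stands in for train/test shift.
eps = 0.3
print(sharp(0.0), "->", sharp(eps))  # 0.0 -> 4.5: the loss blows up
print(flat(0.0),  "->", flat(eps))   # 0.0 -> 0.045: the loss barely moves
```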
6. Optimization Challenges
| Challenge | Cause | Solution |
|---|---|---|
| Stuck in Local Minima | Multimodal landscape | Momentum, random restarts, simulated annealing |
| Slow at Plateaus | Near-zero gradients | Adaptive learning rates (Adam, RMSprop) |
| Oscillation in Valleys | Curvature much steeper in some directions than others | Momentum, learning rate decay |
| Exploding Gradients | Steep regions | Gradient clipping, batch normalization |
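Two rows of the table can be sketched together in plain NumPy: momentum (against oscillation in narrow valleys) and global-norm gradient clipping (against exploding gradients). The learning rate, momentum coefficient, and clip threshold below are illustrative choices, not recommended defaults:

```python
import numpy as np

def sgd_momentum_clipped(grad_fn, w, lr=0.01, beta=0.9, clip=5.0, steps=2000):
    """Gradient descent with momentum and global-norm gradient clipping."""
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        norm = np.linalg.norm(g)
        if norm > clip:              # clipping: rescale overly steep gradients
            g = g * (clip / norm)
        v = beta * v + g             # momentum: smooth the descent direction
        w = w - lr * v
    return w

# A narrow valley: 10x steeper across (w[0]) than along (w[1]). Plain SGD
# zig-zags across such valleys; momentum averages the zig-zag away.
valley_grad = lambda w: np.array([10.0 * w[0], 1.0 * w[1]])
print(sgd_momentum_clipped(valley_grad, np.array([2.0, 2.0])))  # -> near [0, 0]
```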