Regularization Techniques
Preventing Overfitting
Overfitting: The Problem
Models that fit training data too well often perform poorly on new data. Regularization techniques prevent this by adding constraints or noise during training.
- Underfitting: high bias, high training loss
- Overfitting: low training loss, high test loss
- Just right: low training loss and low test loss
L1 Regularization (Lasso)
Adds a penalty proportional to the absolute value of the weights to the loss function:
$$L_{L1} = L_{original} + \lambda \sum_{i} |w_i|$$
$\lambda$ controls regularization strength; larger $\lambda$ penalizes weights more heavily.
- ✅ Feature selection: Forces some weights exactly to zero
- ✅ Interpretable: Shows which features matter
- ❌ Non-differentiable: At $w=0$
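The penalty term above is straightforward to compute directly. A minimal sketch with NumPy (the weight values and $\lambda$ here are made up for illustration):

```python
import numpy as np

def l1_penalty(weights, lam):
    """L1 (lasso) penalty: lam * sum of absolute weight values."""
    return lam * np.sum(np.abs(weights))

# Example: add the penalty to a hypothetical original loss of 0.8
w = np.array([0.5, -1.2, 0.0, 2.0])
total_loss = 0.8 + l1_penalty(w, lam=0.01)
```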
L2 Regularization (Ridge)
Adds a penalty proportional to the square of the weights:
$$L_{L2} = L_{original} + \lambda \sum_{i} w_i^2$$
- ✅ Smooth: Differentiable everywhere
- ✅ Shrinks: Weights toward zero gradually
- ✅ More common: Often works better than L1
- ❌ Doesn't eliminate: Weights become small but not zero
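The L2 term differs from L1 only in squaring the weights instead of taking absolute values. A matching sketch (again with made-up weights and $\lambda$):

```python
import numpy as np

def l2_penalty(weights, lam):
    """L2 (ridge) penalty: lam * sum of squared weights."""
    return lam * np.sum(weights ** 2)

w = np.array([0.5, -1.2, 0.0, 2.0])
penalty = l2_penalty(w, lam=0.01)
```

Note how the square downweights small weights: the 0.5 entry contributes 0.25 before scaling, less than its L1 contribution, which is why L2 shrinks small weights gently rather than zeroing them.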
Elastic Net
Combines L1 and L2:
$$L = L_{original} + \lambda_1 \sum_i |w_i| + \lambda_2 \sum_i w_i^2$$
Gets benefits of both: feature selection from L1, smoothness from L2.
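The combined penalty is just the sum of the two terms, each with its own strength (values below are illustrative):

```python
import numpy as np

def elastic_net_penalty(w, lam1, lam2):
    """Elastic net penalty: lam1 * L1 term + lam2 * L2 term."""
    return lam1 * np.sum(np.abs(w)) + lam2 * np.sum(w ** 2)

penalty = elastic_net_penalty(np.array([0.5, -1.2, 0.0, 2.0]),
                              lam1=0.01, lam2=0.01)
```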
Dropout
Randomly disable neurons during training. Forces network to learn redundant representations.
How It Works
During training: Randomly set each neuron's output to 0 with probability $p$ (e.g., $p = 0.5$)
During inference: Use all neurons, scaling activations by $(1-p)$ so expected values match training (the common "inverted dropout" variant instead scales by $1/(1-p)$ during training, leaving inference unchanged)
- ✅ Powerful: Often more effective than L1/L2
- ✅ Practical: Simple to implement
- ✅ Ensemble effect: Like training many models
- ⚠️ Not always needed: Skip if you have plenty of data
Early Stopping
Stop training when validation loss stops improving.
- ✅ Simple: Only one main hyperparameter (the patience window)
- ✅ Effective: Prevents overfitting directly
- ✅ Fast: Trains fewer epochs
Standard practice: Always monitor validation loss and use early stopping.
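The monitoring loop can be sketched as follows; `train_step` and `val_loss_fn` are assumed callables supplied by your training code, and the loss curve in the demo is fabricated:

```python
def train_with_early_stopping(train_step, val_loss_fn, max_epochs=100, patience=5):
    """Stop once validation loss fails to improve for `patience` consecutive epochs."""
    best, wait = float("inf"), 0
    for _ in range(max_epochs):
        train_step()
        val_loss = val_loss_fn()
        if val_loss < best:
            best, wait = val_loss, 0  # improvement: reset the patience counter
        else:
            wait += 1
            if wait >= patience:
                break  # validation loss has plateaued
    return best

# Demo with a fake validation-loss curve that bottoms out at 0.7:
losses = iter([1.0, 0.8, 0.7, 0.75, 0.76, 0.77, 0.2])
best = train_with_early_stopping(lambda: None, lambda: next(losses),
                                 max_epochs=10, patience=3)
```

Note that the late 0.2 value is never reached: training stops after three non-improving epochs, which is exactly the trade-off patience controls.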
Data Augmentation
Create more training examples through random transformations: crop, flip, rotate, zoom, add noise.
- ✅ More data: Network sees different versions
- ✅ Better generalization: More diverse examples
- ✅ Realistic: Real test data varies anyway
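Two of the listed transformations, horizontal flip and added noise, can be sketched in a few lines of NumPy (the flip probability and noise scale are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(img):
    """Random horizontal flip plus small Gaussian pixel noise (img: H x W float array)."""
    if rng.random() < 0.5:
        img = img[:, ::-1]  # horizontal flip
    return img + rng.normal(0.0, 0.05, size=img.shape)  # additive noise

sample = np.zeros((4, 4))
out = augment(sample)  # same shape, but a slightly different image each call
```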
Batch Normalization as Regularization
Batch normalization has a regularizing effect by:
- Adding noise from batch statistics
- Reducing dependence on specific initialization
- Allowing higher learning rates
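The noise comes from normalizing with per-batch statistics: each example's output depends on which other examples share its batch. A training-mode sketch (real implementations also track running statistics for inference, which is omitted here):

```python
import numpy as np

def batch_norm_train(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature over the current batch, then scale and shift."""
    mu = x.mean(axis=0)     # per-feature batch mean
    var = x.var(axis=0)     # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Each feature column comes out with ~zero mean and ~unit variance:
x = np.random.default_rng(0).normal(2.0, 3.0, size=(64, 5))
out = batch_norm_train(x)
```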
When to Use What
- L2 + Early Stopping: Classic combo, works well
- Dropout: For deep networks with lots of data
- Data Augmentation: Always, if possible
- Batch Norm: Standard in modern networks
- L1: When you want to identify important features
💡 Key insight: Regularization trades training performance for better test performance. There's no free lunch: reducing overfitting typically means accepting higher training loss.