Regularization Techniques
Preventing Overfitting
Overfitting: The Problem
Models that fit training data too well often perform poorly on new data. Regularization techniques prevent this by adding constraints or noise during training.
- Underfitting: high bias, high training loss
- Overfitting: low training loss, high test loss
- Just right: low training loss and low test loss
L1 Regularization (Lasso)
Adds a penalty proportional to the absolute value of the weights to the loss function:
$$L_{L1} = L_{original} + \lambda \sum_{i} |w_i|$$
$\lambda$ controls regularization strength; larger $\lambda$ penalizes weights more heavily.
- ✅ Feature selection: Forces some weights exactly to zero
- ✅ Interpretable: Shows which features matter
- ❌ Non-differentiable: At $w=0$
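The penalty term above is straightforward to compute directly. A minimal sketch with NumPy (the weight values and $\lambda$ here are made up for illustration):

```python
import numpy as np

def l1_penalty(weights, lam):
    """L1 (lasso) penalty: lam * sum of absolute weight values."""
    return lam * np.sum(np.abs(weights))

# Example: add the penalty to a hypothetical original loss of 0.8
w = np.array([0.5, -1.2, 0.0, 2.0])
total_loss = 0.8 + l1_penalty(w, lam=0.01)
```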
L2 Regularization (Ridge)
Adds a penalty proportional to the square of the weights:
$$L_{L2} = L_{original} + \lambda \sum_{i} w_i^2$$
- ✅ Smooth: Differentiable everywhere
- ✅ Shrinks: Weights toward zero gradually
- ✅ More common: Often works better than L1
- ❌ Doesn't eliminate: Weights become small but not zero
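The L2 term differs from L1 only in squaring the weights instead of taking absolute values. A matching sketch (again with made-up weights and $\lambda$):

```python
import numpy as np

def l2_penalty(weights, lam):
    """L2 (ridge) penalty: lam * sum of squared weights."""
    return lam * np.sum(weights ** 2)

w = np.array([0.5, -1.2, 0.0, 2.0])
penalty = l2_penalty(w, lam=0.01)
```

Note how the square downweights small weights: the 0.5 entry contributes 0.25 before scaling, less than its L1 contribution, which is why L2 shrinks small weights gently rather than zeroing them.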
Elastic Net
Combines L1 and L2:
$$L = L_{original} + \lambda_1 \sum_i |w_i| + \lambda_2 \sum_i w_i^2$$
Gets benefits of both: feature selection from L1, smoothness from L2.
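The combined penalty is just the sum of the two terms, each with its own strength (values below are illustrative):

```python
import numpy as np

def elastic_net_penalty(w, lam1, lam2):
    """Elastic net penalty: lam1 * L1 term + lam2 * L2 term."""
    return lam1 * np.sum(np.abs(w)) + lam2 * np.sum(w ** 2)

penalty = elastic_net_penalty(np.array([0.5, -1.2, 0.0, 2.0]),
                              lam1=0.01, lam2=0.01)
```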
Dropout
Randomly disable neurons during training. Forces network to learn redundant representations.
How It Works
During training: Randomly set each neuron's output to 0 with probability $p$ (e.g., $p = 0.5$)
During inference: Use all neurons, scaling activations by $(1-p)$ so expected values match training (the common "inverted dropout" variant instead scales by $1/(1-p)$ during training, leaving inference unchanged)
- ✅ Powerful: Often more effective than L1/L2
- ✅ Practical: Simple to implement
- ✅ Ensemble effect: Like training many models
- ⚠️ Not always needed: Skip if you have plenty of data
Early Stopping
Stop training when validation loss stops improving.
- ✅ Simple: Only one main hyperparameter (the patience window)
- ✅ Effective: Prevents overfitting directly
- ✅ Fast: Trains fewer epochs
Standard practice: Always monitor validation loss and use early stopping.
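The monitoring loop can be sketched as follows; `train_step` and `val_loss_fn` are assumed callables supplied by your training code, and the loss curve in the demo is fabricated:

```python
def train_with_early_stopping(train_step, val_loss_fn, max_epochs=100, patience=5):
    """Stop once validation loss fails to improve for `patience` consecutive epochs."""
    best, wait = float("inf"), 0
    for _ in range(max_epochs):
        train_step()
        val_loss = val_loss_fn()
        if val_loss < best:
            best, wait = val_loss, 0  # improvement: reset the patience counter
        else:
            wait += 1
            if wait >= patience:
                break  # validation loss has plateaued
    return best

# Demo with a fake validation-loss curve that bottoms out at 0.7:
losses = iter([1.0, 0.8, 0.7, 0.75, 0.76, 0.77, 0.2])
best = train_with_early_stopping(lambda: None, lambda: next(losses),
                                 max_epochs=10, patience=3)
```

Note that the late 0.2 value is never reached: training stops after three non-improving epochs, which is exactly the trade-off patience controls.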
Data Augmentation
Create more training examples through random transformations: crop, flip, rotate, zoom, add noise.
- ✅ More data: Network sees different versions
- ✅ Better generalization: More diverse examples
- ✅ Realistic: Real test data varies anyway
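Two of the listed transformations, horizontal flip and added noise, can be sketched in a few lines of NumPy (the flip probability and noise scale are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(img):
    """Random horizontal flip plus small Gaussian pixel noise (img: H x W float array)."""
    if rng.random() < 0.5:
        img = img[:, ::-1]  # horizontal flip
    return img + rng.normal(0.0, 0.05, size=img.shape)  # additive noise

sample = np.zeros((4, 4))
out = augment(sample)  # same shape, but a slightly different image each call
```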
Batch Normalization as Regularization
Batch normalization has a regularizing effect by:
- Adding noise from batch statistics
- Reducing dependence on specific initialization
- Allowing higher learning rates
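The noise comes from normalizing with per-batch statistics: each example's output depends on which other examples share its batch. A training-mode sketch (real implementations also track running statistics for inference, which is omitted here):

```python
import numpy as np

def batch_norm_train(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature over the current batch, then scale and shift."""
    mu = x.mean(axis=0)     # per-feature batch mean
    var = x.var(axis=0)     # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Each feature column comes out with ~zero mean and ~unit variance:
x = np.random.default_rng(0).normal(2.0, 3.0, size=(64, 5))
out = batch_norm_train(x)
```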
When to Use What
- L2 + Early Stopping: Classic combo, works well
- Dropout: For deep networks with lots of data
- Data Augmentation: Always, if possible
- Batch Norm: Standard in modern networks
- L1: When you want to identify important features
💡 Key insight: Regularization trades training performance for better test performance. There's no free lunch: reducing overfitting typically means accepting higher training loss.