Dimensionality Reduction
PCA, t-SNE, and UMAP
The Curse of Dimensionality
High-dimensional data causes problems: high computational cost, overfitting, and difficulty of visualization. Dimensionality reduction finds low-dimensional representations that preserve important structure.
Why Reduce?
- 🚀 Speed: Fewer features = faster training
- 🔍 Visualization: See 10,000D data in 2D
- 🧠 Remove noise: Keep signal, discard junk
- 📉 Prevent overfitting: Simpler models generalize better
Principal Component Analysis (PCA)
Goal: Find directions of maximum variance in data.
PCA finds vectors $v_1, v_2, \ldots, v_k$ that maximize:
$$\text{Var}(Xv_i)$$
Subject to: $\|v_i\| = 1$ and $v_i \perp v_j$ for $i \neq j$ (without the unit-norm constraint, the variance would be unbounded).
These are eigenvectors of the covariance matrix, ordered by eigenvalue (variance).
- Linear: Captures only linear structure in the data
- Interpretable: Components are linear combinations of original features
- Fast: Just eigendecomposition
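As a concrete sketch of the eigendecomposition view above (using NumPy and scikit-learn on synthetic data; the variable names and data are illustrative, not from the original notes), the components returned by `sklearn.decomposition.PCA` match the top eigenvectors of the covariance matrix, up to sign:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with correlated features (illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
Xc = X - X.mean(axis=0)  # center before computing covariance

# Eigendecomposition of the covariance matrix, sorted by descending eigenvalue
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# PCA's components are these eigenvectors (each defined only up to sign)
pca = PCA(n_components=2).fit(X)
for k in range(2):
    assert abs(abs(eigvecs[:, k] @ pca.components_[k]) - 1.0) < 1e-8
```

The projection onto the first $k$ components is then just `pca.transform(X)`, i.e. `Xc @ eigvecs[:, :k]`.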
t-SNE (t-Distributed Stochastic Neighbor Embedding)
Goal: Preserve local structure - nearby points stay nearby.
Unlike PCA, t-SNE is nonlinear and great for visualization.
- ✅ Excellent for exploratory visualization
- ✅ Reveals natural clusters
- ❌ Slower than PCA
- ❌ Impractical beyond 2–3 output dimensions
- ❌ Not suited to prediction: non-parametric, so new points can't be embedded without refitting
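A minimal t-SNE sketch using scikit-learn's digits dataset (the subsample size and `perplexity` value here are illustrative choices, not tuned):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# 64-dimensional handwritten-digit images; subsample for speed (illustrative)
X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]

# Embed into 2D; perplexity roughly controls the neighborhood size
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(emb.shape)  # one 2D point per input sample
```

Scatter-plotting `emb` colored by `y` typically shows the digit classes separating into clusters, which is the kind of local structure t-SNE is designed to preserve.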
UMAP (Uniform Manifold Approximation and Projection)
Goal: Preserve both local and global structure.
A modern alternative to t-SNE: faster, and it often preserves global structure better.
- ✅ Faster than t-SNE
- ✅ Better global structure
- ✅ Works for many dimensions
- ✅ Has theoretical grounding
Comparison
| Method | Type | Speed | Best For |
|---|---|---|---|
| PCA | Linear | ⚡ Very Fast | Prediction, linear data |
| t-SNE | Nonlinear | 🐢 Slow | Exploratory visualization |
| UMAP | Nonlinear | ⚡ Fast | Visualization + performance |