Dimensionality Reduction

PCA, t-SNE, and UMAP

The Curse of Dimensionality

High-dimensional data causes problems: computation gets expensive, models overfit, and visualization becomes impossible. Dimensionality reduction finds low-dimensional representations that preserve the important structure.

Why Reduce?

  • 🚀 Speed: Fewer features = faster training
  • 🔍 Visualization: See 10,000D data in 2D
  • 🧠 Remove noise: Keep signal, discard junk
  • 📉 Prevent overfitting: Simpler models generalize better

Principal Component Analysis (PCA)

Goal: Find directions of maximum variance in data.

PCA finds vectors $v_1, v_2, ..., v_k$ that maximize:

$$\text{Var}(Xv_i)$$

Subject to: $\|v_i\| = 1$ and $v_i \perp v_j$ for $i \neq j$ (without the unit-norm constraint the variance is unbounded).

These are exactly the eigenvectors of the covariance matrix, ordered by decreasing eigenvalue (the variance captured along each direction).
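The eigendecomposition above can be sketched directly in NumPy. The data below is a hypothetical toy matrix whose columns have deliberately unequal variances; any centered data matrix works the same way.

```python
import numpy as np

# Toy data: 200 samples in 5 dimensions, with most variance in two directions.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ np.diag([3.0, 2.0, 0.5, 0.3, 0.1])

# 1. Center the data (PCA assumes zero-mean columns).
Xc = X - X.mean(axis=0)

# 2. Covariance matrix and its eigendecomposition.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices, ascending order

# 3. Sort by decreasing eigenvalue; columns of V are the principal directions.
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], eigvecs[:, order]

# 4. Project onto the top k components.
k = 2
Z = Xc @ V[:, :k]                        # (200, 2) low-dimensional representation

print(Z.shape)                           # → (200, 2)
print(eigvals / eigvals.sum())           # fraction of variance per component
```

Note that the sample variance of the first projected coordinate equals the top eigenvalue, which is the "maximum variance" property in action.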

t-SNE (t-Distributed Stochastic Neighbor Embedding)

Goal: Preserve local structure - points that are neighbors in high dimensions stay neighbors in the embedding.

t-SNE converts pairwise distances into neighbor probabilities and minimizes the KL divergence between the high- and low-dimensional distributions. Unlike PCA it is nonlinear and great for visualization, but the embedding axes carry no intrinsic meaning and distances between far-apart clusters are not reliable.
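A minimal sketch using scikit-learn's `TSNE` (assuming scikit-learn is installed; the two-blob dataset is invented for illustration):

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical data: two well-separated Gaussian blobs in 50 dimensions.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, size=(40, 50))
blob_b = rng.normal(loc=8.0, size=(40, 50))
X = np.vstack([blob_a, blob_b])

# perplexity ~ effective number of neighbors; must be less than n_samples.
tsne = TSNE(n_components=2, perplexity=15, init="pca", random_state=0)
Z = tsne.fit_transform(X)    # (80, 2) embedding, typically used for a scatter plot

print(Z.shape)               # → (80, 2)
```

Results are sensitive to `perplexity` and the random seed, so it is common practice to try several settings before trusting the picture.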

UMAP (Uniform Manifold Approximation and Projection)

Goal: Preserve both local and global structure.

A modern alternative to t-SNE - faster, scales to larger datasets, and tends to preserve more of the global layout between clusters.

Comparison

| Method | Type      | Speed        | Best For                    |
|--------|-----------|--------------|-----------------------------|
| PCA    | Linear    | ⚡ Very Fast | Prediction, linear data     |
| t-SNE  | Nonlinear | 🐢 Slow      | Exploratory visualization   |
| UMAP   | Nonlinear | ⚡ Fast      | Visualization + performance |

Learn More

→ Linear Algebra
→ Probability