Probability for Machine Learning

Understanding Uncertainty and Distributions

1. Why Probability in ML?

Machine learning deals with uncertainty: noisy data, incomplete information, and predictions about the future. Probability gives us the mathematical framework to reason about and quantify uncertainty.

🎲 Model Uncertainty

How confident is the model? Bayesian methods provide probability distributions over predictions.

📊 Data Generation

Generative models (VAE, GAN) learn probability distributions to create new data.

🎯 Loss Functions

Cross-entropy and negative log-likelihood: these optimization objectives come directly from probability theory.

2. Probability Basics

Probability of event A: $$ 0 \leq P(A) \leq 1 $$ Sum rule (OR): $$ P(A \cup B) = P(A) + P(B) - P(A \cap B) $$ Product rule (AND): $$ P(A \cap B) = P(A|B) \cdot P(B) $$ Conditional probability: $$ P(A|B) = \frac{P(A \cap B)}{P(B)} $$
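The rules above can be checked concretely. This is a minimal sketch using a fair six-sided die as the sample space; the events A (even roll) and B (roll greater than 3) are illustrative choices.

```python
# Sum, product, and conditional-probability rules on a fair die.
from fractions import Fraction

omega = range(1, 7)                      # sample space: faces of a fair die
A = {x for x in omega if x % 2 == 0}     # event A: even roll -> {2, 4, 6}
B = {x for x in omega if x > 3}          # event B: roll > 3  -> {4, 5, 6}

def P(event):
    """Probability under the uniform distribution on omega."""
    return Fraction(len(event), len(omega))

# Sum rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
assert P(A | B) == P(A) + P(B) - P(A & B)

# Conditional probability: P(A|B) = P(A ∩ B) / P(B)
p_a_given_b = P(A & B) / P(B)

# Product rule: P(A ∩ B) = P(A|B) · P(B)
assert P(A & B) == p_a_given_b * P(B)

print(P(A | B))   # 2/3: the union {2, 4, 5, 6} covers four of six faces
```

Using `Fraction` keeps every probability exact, so the identities hold with `==` rather than a floating-point tolerance.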

3. Bayes' Theorem: The Foundation

$$ P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} $$

In machine learning context:

$$ P(\text{model}|\text{data}) = \frac{P(\text{data}|\text{model}) \cdot P(\text{model})}{P(\text{data})} $$
  • Posterior \( P(\text{model}|\text{data}) \): What we want - updated belief after seeing data
  • Likelihood \( P(\text{data}|\text{model}) \): How likely is this data given the model?
  • Prior \( P(\text{model}) \): Initial belief before seeing data
  • Evidence \( P(\text{data}) \): Normalizing constant (often dropped, since it does not depend on the model)
🔍 Example: Medical diagnosis. Given a positive test result (data), what's the probability of disease (model)? Even with a 99% accurate test, if the disease is rare (low prior), most positives are false positives!
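The medical-diagnosis example can be worked through numerically. A hedged sketch, assuming an illustrative prevalence of 1 in 1000 and a test with 99% sensitivity and 99% specificity:

```python
# Bayes' theorem for the rare-disease test (all figures illustrative).
prior = 0.001            # P(disease): 1 in 1000 people
sensitivity = 0.99       # P(positive | disease)
specificity = 0.99       # P(negative | no disease)

# Evidence P(positive), via the law of total probability:
# true positives plus false positives.
p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)

# Posterior: P(disease | positive)
posterior = sensitivity * prior / p_positive
print(f"P(disease | positive) = {posterior:.3f}")   # ≈ 0.090
```

Despite the "99% accurate" test, a positive result implies only about a 9% chance of disease: the false positives from the large healthy population swamp the true positives.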

4. Key Distributions

Gaussian (Normal) Distribution

$$ \mathcal{N}(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) $$

Parameters: Mean \( \mu \), Variance \( \sigma^2 \)
Use: Continuous variables, noise modeling, Central Limit Theorem

Bernoulli Distribution

$$ P(X=1) = p, \quad P(X=0) = 1-p $$

Parameter: Success probability \( p \)
Use: Binary outcomes (yes/no, click/no-click)

Categorical Distribution

$$ P(X=k) = p_k, \quad \sum_{k=1}^K p_k = 1 $$

Parameters: Probabilities for K classes
Use: Multi-class classification (digit recognition, sentiment analysis)

Exponential Distribution

$$ p(x|\lambda) = \lambda e^{-\lambda x} \quad \text{for } x \geq 0 $$

Parameter: Rate \( \lambda \)
Use: Time between events, survival analysis
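The four density/mass functions above translate directly into code. A minimal sketch evaluating each at a sample point, using only the standard library:

```python
# Each function implements the corresponding formula from the text.
import math

def gaussian_pdf(x, mu, sigma2):
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def bernoulli_pmf(x, p):
    return p if x == 1 else 1 - p

def categorical_pmf(k, probs):
    return probs[k]                       # assumes probs sums to 1

def exponential_pdf(x, lam):
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

print(gaussian_pdf(0.0, 0.0, 1.0))        # peak of the standard normal ≈ 0.3989
print(bernoulli_pmf(1, 0.3))              # 0.3
print(categorical_pmf(2, [0.2, 0.5, 0.3]))  # 0.3
print(exponential_pdf(0.0, 2.0))          # 2.0: density at zero equals the rate
```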

5. Interactive Distribution Explorer

In the interactive version of this page, you can adjust each distribution's parameters and watch its shape and properties (mean, variance) change in real time.

6. Expected Value and Variance

Expected Value (Mean): $$ \mathbb{E}[X] = \sum_x x \cdot P(X=x) \quad \text{(discrete)} $$ $$ \mathbb{E}[X] = \int x \cdot p(x) \, dx \quad \text{(continuous)} $$
Variance: $$ \text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 $$
Standard Deviation: $$ \sigma = \sqrt{\text{Var}(X)} $$
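For a discrete distribution both formulas are a short sum. A sketch using the fair-die distribution as an example, checking that the definition of variance agrees with the \( \mathbb{E}[X^2] - (\mathbb{E}[X])^2 \) shortcut:

```python
# Expected value and variance of a fair six-sided die.
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

mean = sum(x * p for x, p in zip(values, probs))                # E[X] = 3.5
ex2 = sum(x ** 2 * p for x, p in zip(values, probs))            # E[X^2]

# Variance two ways: from the definition, and from the shortcut formula.
var_def = sum((x - mean) ** 2 * p for x, p in zip(values, probs))
var_shortcut = ex2 - mean ** 2

assert abs(var_def - var_shortcut) < 1e-12
print(mean, var_def)   # 3.5 and 35/12 ≈ 2.9167
```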

7. Maximum Likelihood Estimation (MLE)

How do we learn model parameters from data? Find parameters that make the observed data most likely.

Likelihood: $$ L(\theta) = P(\text{data}|\theta) = \prod_{i=1}^n P(x_i|\theta) $$ Log-Likelihood: $$ \log L(\theta) = \sum_{i=1}^n \log P(x_i|\theta) $$ MLE: $$ \hat{\theta} = \arg\max_\theta \log L(\theta) $$

Example: Estimating mean of Gaussian from samples
The MLE solution is simply the sample mean: \( \hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i \)
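This result can be confirmed numerically: maximize the Gaussian log-likelihood over candidate values of \( \mu \) and check that the maximizer is the sample mean. A crude grid search is enough for a sketch (the data values and the fixed \( \sigma^2 = 1 \) are illustrative assumptions):

```python
# MLE for the mean of a Gaussian via grid search over mu.
import math

data = [2.1, 1.9, 2.4, 2.0, 2.6]   # illustrative samples

def log_likelihood(mu, sigma2=1.0):
    # Sum of log N(x | mu, sigma2) over the data points.
    return sum(-0.5 * math.log(2 * math.pi * sigma2)
               - (x - mu) ** 2 / (2 * sigma2) for x in data)

# Candidate means on a fine grid over [1.0, 3.0].
candidates = [i / 1000 for i in range(1000, 3001)]
mu_hat = max(candidates, key=log_likelihood)

sample_mean = sum(data) / len(data)
print(mu_hat, sample_mean)   # both ≈ 2.2
```

In practice the argmax is found analytically (set the derivative of the log-likelihood to zero), which yields the sample-mean formula above; the grid search just makes the agreement visible.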

8. Cross-Entropy Loss

The standard loss for classification comes from negative log-likelihood!

Binary Cross-Entropy: $$ L = -\frac{1}{n}\sum_{i=1}^n [y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i)] $$
Categorical Cross-Entropy: $$ L = -\frac{1}{n}\sum_{i=1}^n \sum_{k=1}^K y_{ik} \log \hat{y}_{ik} $$

Minimizing cross-entropy = maximizing likelihood = making predictions consistent with true distribution.
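The equivalence is easy to verify: binary cross-entropy computed from the formula matches the average negative log-likelihood of the labels under Bernoulli predictions. A sketch with illustrative labels and predicted probabilities:

```python
# Binary cross-entropy vs. Bernoulli negative log-likelihood.
import math

y_true = [1, 0, 1, 1]            # ground-truth labels
y_pred = [0.9, 0.2, 0.7, 0.6]    # predicted P(y = 1), illustrative values

# Binary cross-entropy, straight from the formula in the text.
bce = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
           for y, p in zip(y_true, y_pred)) / len(y_true)

# Same quantity viewed as average negative log-likelihood under Bernoulli(p_i):
# each term is -log of the probability the model assigned to the observed label.
nll = -sum(math.log(p if y == 1 else 1 - p)
           for y, p in zip(y_true, y_pred)) / len(y_true)

assert abs(bce - nll) < 1e-12
print(f"{bce:.4f}")   # ≈ 0.2990
```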

9. Key Concepts Summary

Independence

\( P(A \cap B) = P(A) \cdot P(B) \). Knowing A doesn't change the probability of B.

Conditional Independence

\( P(A \cap B | C) = P(A|C) \cdot P(B|C) \). Core assumption in Naive Bayes classifier.

Law of Total Probability

\( P(A) = \sum_i P(A|B_i)P(B_i) \). Marginalization - summing out variables.
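Marginalization can be sketched with a small joint table. The joint probabilities below are illustrative; the code recovers \( P(B_i) \) and \( P(A|B_i) \) from the table and checks the law of total probability for \( A = 1 \):

```python
# Law of total probability: P(A) = sum_i P(A|B_i) P(B_i).
joint = {                 # joint table P(A=a, B=b), illustrative numbers
    (0, 0): 0.1, (0, 1): 0.3,
    (1, 0): 0.2, (1, 1): 0.4,
}

# Marginal P(B=b): sum the joint over A.
p_b = {b: sum(p for (_, b2), p in joint.items() if b2 == b) for b in (0, 1)}

# Conditional P(A=a | B=b) = P(A=a, B=b) / P(B=b).
p_a_given_b = {(a, b): joint[(a, b)] / p_b[b] for (a, b) in joint}

# Law of total probability for A = 1 — matches the direct marginal 0.2 + 0.4.
p_a1 = sum(p_a_given_b[(1, b)] * p_b[b] for b in (0, 1))
print(p_a1)   # ≈ 0.6
```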