Probability for Machine Learning
Understanding Uncertainty and Distributions
1. Why Probability in ML?
Machine learning deals with uncertainty: noisy data, incomplete information, and predictions about the future. Probability gives us the mathematical framework to reason about and quantify uncertainty.
🎲 Model Uncertainty
How confident is the model? Bayesian methods provide probability distributions over predictions.
📊 Data Generation
Generative models (VAE, GAN) learn probability distributions to create new data.
🎯 Loss Functions
Cross-entropy and negative log-likelihood: these optimization objectives come directly from probability theory.
2. Probability Basics
3. Bayes' Theorem: The Foundation
$$ P(\text{model}|\text{data}) = \frac{P(\text{data}|\text{model}) \, P(\text{model})}{P(\text{data})} $$
In the machine learning context:
- Posterior \( P(\text{model}|\text{data}) \): What we want - updated belief after seeing data
- Likelihood \( P(\text{data}|\text{model}) \): How likely is this data given the model?
- Prior \( P(\text{model}) \): Initial belief before seeing data
- Evidence \( P(\text{data}) \): Normalizing constant (often ignored, since it does not depend on the model parameters)
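To make these roles concrete, here is a small numeric sketch of Bayes' theorem in Python. The scenario and numbers (a diagnostic test with 1% prevalence, 95% sensitivity, 90% specificity) are made up purely for illustration:

```python
# Toy Bayes' theorem example (hypothetical numbers, for illustration only)
prior = 0.01                  # P(condition)
likelihood = 0.95             # P(positive | condition), i.e. sensitivity
false_positive_rate = 0.10    # P(positive | no condition) = 1 - specificity

# Evidence P(positive) via the law of total probability
evidence = likelihood * prior + false_positive_rate * (1 - prior)

# Posterior P(condition | positive)
posterior = likelihood * prior / evidence
print(f"P(condition | positive test) = {posterior:.3f}")   # ~0.088
```

Even with a fairly accurate test, the posterior stays below 9% because the prior is so small; this is exactly the prior-times-likelihood trade-off Bayes' theorem formalizes.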
4. Key Distributions
Gaussian (Normal) Distribution
Parameters: Mean \( \mu \), Variance \( \sigma^2 \)
Use: Continuous variables, noise modeling, Central Limit Theorem
Bernoulli Distribution
Parameter: Success probability \( p \)
Use: Binary outcomes (yes/no, click/no-click)
Categorical Distribution
Parameters: Probabilities for K classes
Use: Multi-class classification (digit recognition, sentiment analysis)
Exponential Distribution
Parameter: Rate \( \lambda \)
Use: Time between events, survival analysis
5. Interactive Distribution Explorer
Adjust parameters to see how different distributions behave.
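As a stand-in for the interactive explorer, the following sketch (assuming NumPy is installed; parameter values are illustrative) draws samples from each of the four distributions above and prints their empirical mean and variance, so you can vary the parameters yourself:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Parameters to play with (illustrative values)
mu, sigma2 = 0.0, 1.0      # Gaussian
p = 0.3                    # Bernoulli
probs = [0.2, 0.5, 0.3]    # Categorical (K = 3 classes)
lam = 2.0                  # Exponential rate

samples = {
    "Gaussian":    rng.normal(mu, np.sqrt(sigma2), n),
    "Bernoulli":   rng.binomial(1, p, n),
    "Categorical": rng.choice(len(probs), size=n, p=probs),
    "Exponential": rng.exponential(1 / lam, n),   # NumPy uses scale = 1/lambda
}

for name, x in samples.items():
    print(f"{name:12s} mean = {x.mean():6.3f}  var = {x.var():6.3f}")
```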
6. Expected Value and Variance
Expected Value: $$ \mathbb{E}[X] = \sum_x x \, P(X = x) \;\; \text{(discrete)} \qquad \mathbb{E}[X] = \int x \, p(x) \, dx \;\; \text{(continuous)} $$
Variance: $$ \text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 $$
Standard Deviation: $$ \sigma = \sqrt{\text{Var}(X)} $$
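A quick Monte Carlo check of the identity \( \text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 \) (NumPy assumed; the exponential distribution used here is just an example):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=0.5, size=1_000_000)   # rate 2: mean 0.5, variance 0.25

mean = x.mean()
var_direct = np.mean((x - mean) ** 2)            # E[(X - E[X])^2]
var_shortcut = np.mean(x ** 2) - mean ** 2       # E[X^2] - (E[X])^2

print(var_direct, var_shortcut)                  # both close to 0.25
```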
7. Maximum Likelihood Estimation (MLE)
How do we learn model parameters from data? Find the parameters that make the observed data most likely: $$ \hat{\theta}_{\text{MLE}} = \arg\max_\theta \prod_{i=1}^n p(x_i \mid \theta) = \arg\max_\theta \sum_{i=1}^n \log p(x_i \mid \theta) $$
Example: Estimating the mean of a Gaussian from samples \( x_1, \dots, x_n \)
The MLE solution is simply the sample mean: \( \hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i \)
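A short sketch (NumPy assumed; data and parameter values are made up) that confirms the closed-form result: maximizing the Gaussian log-likelihood over a grid of candidate means recovers the sample mean.

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(loc=3.0, scale=1.5, size=500)   # true mu = 3.0 (illustrative)

def gaussian_log_likelihood(mu, x, sigma=1.5):
    # Sum of log N(x_i | mu, sigma^2), constants included for completeness
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2))

# Brute-force search over candidate means
grid = np.linspace(0, 6, 10_001)
lls = np.array([gaussian_log_likelihood(m, data) for m in grid])
mu_grid = grid[np.argmax(lls)]

print(f"grid-search MLE: {mu_grid:.4f}")
print(f"sample mean:     {data.mean():.4f}")      # agree up to grid resolution
```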
8. Cross-Entropy Loss
The standard loss for classification comes from negative log-likelihood!
Categorical Cross-Entropy: $$ L = -\frac{1}{n}\sum_{i=1}^n \sum_{k=1}^K y_{ik} \log \hat{y}_{ik} $$
Minimizing cross-entropy = maximizing likelihood = making the predicted distribution consistent with the true label distribution.
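A minimal NumPy sketch showing that the categorical cross-entropy with one-hot labels equals the average negative log-likelihood of the true classes (the labels and predicted probabilities below are made up):

```python
import numpy as np

# Hypothetical predictions for n = 3 examples and K = 3 classes (rows sum to 1)
y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
y_true = np.array([0, 1, 2])                       # true class indices

# One-hot encode the labels
y_onehot = np.eye(3)[y_true]

# Cross-entropy: -(1/n) * sum_i sum_k y_ik * log(y_hat_ik)
cross_entropy = -np.mean(np.sum(y_onehot * np.log(y_hat), axis=1))

# Negative log-likelihood of the true classes
nll = -np.mean(np.log(y_hat[np.arange(3), y_true]))

print(cross_entropy, nll)                          # identical values
```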
9. Key Concepts Summary
Independence
\( P(A \cap B) = P(A) \cdot P(B) \). Knowing whether A occurred doesn't change the probability of B.
Conditional Independence
\( P(A \cap B | C) = P(A|C) \cdot P(B|C) \). Core assumption in Naive Bayes classifier.
Law of Total Probability
\( P(A) = \sum_i P(A|B_i)P(B_i) \). Marginalization: summing out a variable over a partition \( \{B_i\} \).
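A tiny numeric check of the law of total probability on a made-up partition \( B_1, B_2, B_3 \) (all numbers illustrative):

```python
import numpy as np

# Hypothetical partition probabilities and conditionals
p_B = np.array([0.5, 0.3, 0.2])           # P(B_i), sums to 1
p_A_given_B = np.array([0.9, 0.4, 0.1])   # P(A | B_i)

# Marginalize out B: P(A) = sum_i P(A | B_i) P(B_i)
p_A = np.sum(p_A_given_B * p_B)
print(p_A)                                 # 0.9*0.5 + 0.4*0.3 + 0.1*0.2 = 0.59
```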