Calculus for Machine Learning
The Mathematics of Change and Optimization
Why Calculus?
Calculus is the mathematical foundation of machine learning. It enables us to:
- Find optimal parameters by computing gradients
- Measure sensitivity of outputs to input changes
- Optimize cost functions to train neural networks
- Understand convergence of learning algorithms
Without calculus, we couldn't train models to learn from data.
Derivatives: Measuring Change
A derivative tells us how a function changes at a specific point. It's the slope of the function at that point.
Formal Definition
The derivative of function $f$ at point $x$ is:
$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$

This represents the instantaneous rate of change at point $x$.
Simple Example: Quadratic Function
Consider $f(x) = x^2$
The derivative is:
$$f'(x) = 2x$$
At specific points:
- At $x = 0$: $f'(0) = 0$ (flat point, minimum)
- At $x = 1$: $f'(1) = 2$ (increasing with slope 2)
- At $x = 2$: $f'(2) = 4$ (steeper increase)
- At $x = -1$: $f'(-1) = -2$ (decreasing)
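A quick numerical sanity check, sketched in Python: a central-difference quotient with a small $h$ should approach the analytic derivative $f'(x) = 2x$ at each of the points above.

```python
# A central-difference approximation of the derivative: for small h,
# (f(x+h) - f(x-h)) / (2h) is close to the true slope f'(x) = 2x.
def f(x):
    return x ** 2

def numerical_derivative(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

for x in [0.0, 1.0, 2.0, -1.0]:
    print(x, round(numerical_derivative(f, x), 4))  # slopes 0.0, 2.0, 4.0, -2.0
```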
Partial Derivatives: Multiple Variables
In machine learning, we work with multivariate functions - functions with many input variables. A partial derivative shows how the function changes with respect to one variable, keeping others constant.
Notation
For function $f(x, y, z)$, the partial derivatives are:
$$\frac{\partial f}{\partial x}, \quad \frac{\partial f}{\partial y}, \quad \frac{\partial f}{\partial z}$$

Each one shows how $f$ changes along one dimension.
Example: Neural Network Cost Function
A cost function with two weights:
$$J(w_1, w_2) = (w_1 - 3)^2 + (w_2 + 2)^2$$
Partial derivatives:
- $$\frac{\partial J}{\partial w_1} = 2(w_1 - 3)$$
- $$\frac{\partial J}{\partial w_2} = 2(w_2 + 2)$$
These tell us how to move toward the minimum at $(w_1, w_2) = (3, -2)$: decrease $w_1$ when $w_1 > 3$, and decrease $w_2$ when $w_2 > -2$.
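The analytic partial derivatives above can be sketched and evaluated in Python (the test point $w_1 = 5$, $w_2 = 0$ is an arbitrary choice for illustration):

```python
# Partial derivatives of J(w1, w2) = (w1 - 3)^2 + (w2 + 2)^2,
# each treating the other weight as a constant.
def J(w1, w2):
    return (w1 - 3) ** 2 + (w2 + 2) ** 2

def dJ_dw1(w1, w2):
    return 2 * (w1 - 3)

def dJ_dw2(w1, w2):
    return 2 * (w2 + 2)

w1, w2 = 5.0, 0.0
print(dJ_dw1(w1, w2))  # 4.0 -> positive, so decrease w1
print(dJ_dw2(w1, w2))  # 4.0 -> positive, so decrease w2
```

At the minimum $(3, -2)$ both partial derivatives are zero, which is the stopping condition optimizers look for.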
The Gradient: Direction of Steepest Increase
The gradient is a vector containing all partial derivatives. It points in the direction of steepest increase of a function.
Gradient Vector
For function $f(x_1, x_2, ..., x_n)$:
$$\nabla f = \begin{pmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{pmatrix}$$

This is the gradient vector in n-dimensional space.
Key Properties
- Direction: Points toward greatest increase
- Magnitude: Tells us how steep the increase is
- Opposite: $-\nabla f$ points toward steepest decrease (used in optimization!)
- At local optima: $\nabla f = 0$ (all partial derivatives are zero)
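These properties can be sketched in a few lines of Python, reusing the cost $J(w_1, w_2) = (w_1 - 3)^2 + (w_2 + 2)^2$ from the previous section: one step against the gradient lowers the loss, and the gradient vanishes at the minimum.

```python
# Gradient of J(w1, w2) = (w1-3)^2 + (w2+2)^2 as a vector, plus a
# check that stepping along -grad J decreases the loss.
def J(w):
    return (w[0] - 3) ** 2 + (w[1] + 2) ** 2

def grad_J(w):
    return [2 * (w[0] - 3), 2 * (w[1] + 2)]

w = [0.0, 0.0]
g = grad_J(w)                                  # [-6.0, 4.0]
step = [wi - 0.1 * gi for wi, gi in zip(w, g)]  # move opposite the gradient
print(J(w), J(step))    # loss drops after one step against the gradient
print(grad_J([3.0, -2.0]))  # [0.0, 0.0] at the minimum
```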
The Chain Rule: Computing Nested Derivatives
Neural networks are compositions of functions. The chain rule tells us how to compute derivatives through these nested functions.
Chain Rule Formula
If $y = f(g(x))$, then:
$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$

where $u = g(x)$.
Neural Network Example
Consider a 2-layer network:
- Layer 1: $z^{(1)} = w^{(1)}x + b^{(1)}$
- Activation: $a^{(1)} = \text{ReLU}(z^{(1)})$
- Layer 2: $\hat{y} = w^{(2)}a^{(1)} + b^{(2)}$
- Loss: $L = (\hat{y} - y)^2$
To update $w^{(1)}$, we need: $\frac{\partial L}{\partial w^{(1)}}$
Using chain rule:
$$\frac{\partial L}{\partial w^{(1)}} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial a^{(1)}} \cdot \frac{\partial a^{(1)}}{\partial z^{(1)}} \cdot \frac{\partial z^{(1)}}{\partial w^{(1)}}$$

This is the essence of backpropagation!
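The full chain can be traced by hand in scalar Python code. The parameter and input values below are arbitrary toy numbers chosen for illustration:

```python
# A scalar sketch of the chain rule for the 2-layer network above:
# dL/dw1 is the product of the local derivatives along the chain.
def relu(z):
    return max(0.0, z)

# Toy parameters and one training pair (x, y) -- arbitrary values.
w1, b1, w2, b2 = 0.5, 0.1, -0.3, 0.2
x, y = 1.0, 1.0

# Forward pass.
z1 = w1 * x + b1        # layer 1 pre-activation
a1 = relu(z1)           # activation
y_hat = w2 * a1 + b2    # prediction
L = (y_hat - y) ** 2    # squared-error loss

# Backward pass: chain rule, factor by factor.
dL_dyhat = 2 * (y_hat - y)
dyhat_da1 = w2
da1_dz1 = 1.0 if z1 > 0 else 0.0  # ReLU derivative
dz1_dw1 = x
dL_dw1 = dL_dyhat * dyhat_da1 * da1_dz1 * dz1_dw1
print(dL_dw1)
```

Automatic differentiation frameworks build and multiply out exactly this kind of chain for every parameter, which is why backpropagation scales to millions of weights.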
Common Derivatives in ML
These are functions whose derivatives you'll encounter repeatedly:
| Function $f(x)$ | Derivative $f'(x)$ | Used In |
|---|---|---|
| $x^n$ | $nx^{n-1}$ | Power rule, polynomials |
| $e^x$ | $e^x$ | Softmax, cross-entropy |
| $\ln(x)$ | $\frac{1}{x}$ | Log-likelihood |
| $\sin(x)$ | $\cos(x)$ | Positional encoders |
| $\sigma(x) = \frac{1}{1+e^{-x}}$ | $\sigma(x)(1-\sigma(x))$ | Sigmoid activation |
| $\text{ReLU}(x) = \max(0,x)$ | $\begin{cases} 0 & x < 0 \\ 1 & x > 0 \end{cases}$ (undefined at $x = 0$; conventionally set to 0) | ReLU activation |
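As one sanity check of the table, the sigmoid identity $\sigma'(x) = \sigma(x)(1-\sigma(x))$ can be verified numerically (the test point $x = 0.7$ is an arbitrary choice):

```python
# Numerically checking the identity sigma'(x) = sigma(x) * (1 - sigma(x)).
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x, h = 0.7, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)  # central difference
analytic = sigmoid(x) * (1 - sigmoid(x))
print(numeric, analytic)  # the two values agree closely
```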
Optimization: Using Calculus to Train
Machine learning is optimization. We use calculus to find parameters that minimize loss:
Gradient Descent Update Rule
$$w := w - \alpha \nabla J(w)$$

Move parameters opposite to the gradient direction, with step size controlled by the learning rate $\alpha$.
Why This Works
- The gradient points toward increasing loss
- We subtract the gradient to move toward decreasing loss
- Step size is controlled by learning rate $\alpha$
- Repeat until convergence (when $\nabla J \approx 0$)
Real Example: In neural networks, we compute the loss over a batch of samples, take its gradient with respect to all weights, and update every weight simultaneously. Backpropagation is the algorithm that computes those gradients efficiently.
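A minimal gradient descent loop, reusing the two-weight cost $J(w_1, w_2) = (w_1 - 3)^2 + (w_2 + 2)^2$ from earlier (the learning rate $\alpha = 0.1$ and iteration count are arbitrary choices):

```python
# Gradient descent on J(w1, w2) = (w1-3)^2 + (w2+2)^2,
# whose minimum sits at (3, -2).
alpha = 0.1          # learning rate
w1, w2 = 0.0, 0.0    # arbitrary starting point
for _ in range(100):
    g1 = 2 * (w1 - 3)   # dJ/dw1
    g2 = 2 * (w2 + 2)   # dJ/dw2
    w1 -= alpha * g1    # w := w - alpha * grad J
    w2 -= alpha * g2
print(round(w1, 4), round(w2, 4))  # converges to (3, -2)
```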
Second Derivatives: Curvature Matters
The second derivative shows how the first derivative changes - it measures curvature.
Second Derivative
$$f''(x) = \frac{d}{dx}\left(\frac{df}{dx}\right)$$

Curvature Interpretation
- $f''(x) > 0$: Concave up (curves like a bowl; a nearby stationary point is a minimum)
- $f''(x) < 0$: Concave down (curves like a dome; a nearby stationary point is a maximum)
- $f''(x) = 0$: Inflection point (curvature changes)
Advanced Optimization
Advanced optimizers use second derivatives:
- Newton's Method: Uses Hessian matrix (all second partial derivatives)
- Adam, RMSprop: Adapt per-parameter step sizes using running averages of squared gradients, a cheap stand-in for curvature
- L-BFGS: Approximates Hessian for efficient optimization
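A one-dimensional sketch of the Newton step on the toy objective $f(w) = (w - 3)^2$ (an illustration of the idea, not any library's implementation): dividing the gradient by the second derivative lets the curvature set the step size, and because the curvature here is constant, one step lands on the minimum.

```python
# Newton's method in 1D: w := w - f'(w) / f''(w).
# For f(w) = (w - 3)^2: f'(w) = 2(w - 3) and f''(w) = 2 everywhere.
w = 10.0               # arbitrary starting point
for _ in range(5):
    grad = 2 * (w - 3)
    hess = 2.0
    w -= grad / hess   # curvature-scaled step
print(w)  # lands exactly on the minimum w = 3 after the first step
```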
Integration (Brief Overview)
While derivatives are central to ML, integration appears in:
- Probability: Computing areas under probability density functions
- Expected values: $E[X] = \int x \cdot p(x) dx$
- Variational inference: Approximate intractable integrals
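The expected-value integral can be approximated with a plain Riemann sum, sketched here for a standard normal density over a truncated range (grid width and range are arbitrary choices):

```python
# Approximating E[X] = integral of x * p(x) dx for a standard normal
# density with a left Riemann sum over [-5, 5).
import math

def p(x):
    # Standard normal probability density function.
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

dx = 0.001
xs = [-5 + i * dx for i in range(10000)]   # grid over [-5, 5)
expected = sum(x * p(x) * dx for x in xs)
print(expected)  # approximately 0, the true mean
```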
Key Takeaways
- Derivatives measure how functions change - fundamental to optimization
- Gradients are vectors of partial derivatives - they guide us toward minima
- Chain rule lets us compute derivatives through nested functions (backprop!)
- Optimization iteratively moves opposite to gradients to minimize loss
- Second derivatives reveal curvature and help advanced optimizers