Calculus for Machine Learning
The Mathematics of Change and Optimization
Why Calculus?
Calculus is the mathematical foundation of machine learning. It enables us to:
- Find optimal parameters by computing gradients
- Measure sensitivity of outputs to input changes
- Optimize cost functions to train neural networks
- Understand convergence of learning algorithms
Without calculus, we couldn't train models to learn from data.
Derivatives: Measuring Change
A derivative tells us how a function changes at a specific point. It's the slope of the function at that point.
Formal Definition
The derivative of function $f$ at point $x$ is:
$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$

This represents the instantaneous rate of change at point $x$.
Simple Example: Quadratic Function
Consider $f(x) = x^2$
The derivative is:
$$f'(x) = 2x$$
At specific points:
- At $x = 0$: $f'(0) = 0$ (flat point, minimum)
- At $x = 1$: $f'(1) = 2$ (increasing with slope 2)
- At $x = 2$: $f'(2) = 4$ (steeper increase)
- At $x = -1$: $f'(-1) = -2$ (decreasing)
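A quick numerical sanity check, sketched in Python: a central-difference quotient with a small $h$ should approach the analytic derivative $f'(x) = 2x$ at each of the points above.

```python
# A central-difference approximation of the derivative: for small h,
# (f(x+h) - f(x-h)) / (2h) is close to the true slope f'(x) = 2x.
def f(x):
    return x ** 2

def numerical_derivative(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

for x in [0.0, 1.0, 2.0, -1.0]:
    print(x, round(numerical_derivative(f, x), 4))  # slopes 0.0, 2.0, 4.0, -2.0
```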
Partial Derivatives: Multiple Variables
In machine learning, we work with multivariate functions - functions with many input variables. A partial derivative shows how the function changes with respect to one variable, keeping others constant.
Notation
For function $f(x, y, z)$, the partial derivatives are:
$$\frac{\partial f}{\partial x}, \quad \frac{\partial f}{\partial y}, \quad \frac{\partial f}{\partial z}$$

Each one shows how $f$ changes along one dimension.
Example: Neural Network Cost Function
A cost function with two weights:
$$J(w_1, w_2) = (w_1 - 3)^2 + (w_2 + 2)^2$$
Partial derivatives:
- $$\frac{\partial J}{\partial w_1} = 2(w_1 - 3)$$
- $$\frac{\partial J}{\partial w_2} = 2(w_2 + 2)$$
These tell us how to move toward the minimum at $(w_1, w_2) = (3, -2)$: decrease $w_1$ when $w_1 > 3$, and decrease $w_2$ when $w_2 > -2$.
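The analytic partial derivatives above can be sketched and evaluated in Python (the test point $w_1 = 5$, $w_2 = 0$ is an arbitrary choice for illustration):

```python
# Partial derivatives of J(w1, w2) = (w1 - 3)^2 + (w2 + 2)^2,
# each treating the other weight as a constant.
def J(w1, w2):
    return (w1 - 3) ** 2 + (w2 + 2) ** 2

def dJ_dw1(w1, w2):
    return 2 * (w1 - 3)

def dJ_dw2(w1, w2):
    return 2 * (w2 + 2)

w1, w2 = 5.0, 0.0
print(dJ_dw1(w1, w2))  # 4.0 -> positive, so decrease w1
print(dJ_dw2(w1, w2))  # 4.0 -> positive, so decrease w2
```

At the minimum $(3, -2)$ both partial derivatives are zero, which is the stopping condition optimizers look for.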
The Gradient: Direction of Steepest Increase
The gradient is a vector containing all partial derivatives. It points in the direction of steepest increase of a function.
Gradient Vector
For function $f(x_1, x_2, ..., x_n)$:
$$\nabla f = \begin{pmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{pmatrix}$$

This is the gradient vector in n-dimensional space.
Key Properties
- Direction: Points toward greatest increase
- Magnitude: Tells us how steep the increase is
- Opposite: $-\nabla f$ points toward steepest decrease (used in optimization!)
- At local optima: $\nabla f = 0$ (all partial derivatives are zero)
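These properties can be sketched in a few lines of Python, reusing the cost $J(w_1, w_2) = (w_1 - 3)^2 + (w_2 + 2)^2$ from the previous section: one step against the gradient lowers the loss, and the gradient vanishes at the minimum.

```python
# Gradient of J(w1, w2) = (w1-3)^2 + (w2+2)^2 as a vector, plus a
# check that stepping along -grad J decreases the loss.
def J(w):
    return (w[0] - 3) ** 2 + (w[1] + 2) ** 2

def grad_J(w):
    return [2 * (w[0] - 3), 2 * (w[1] + 2)]

w = [0.0, 0.0]
g = grad_J(w)                                  # [-6.0, 4.0]
step = [wi - 0.1 * gi for wi, gi in zip(w, g)]  # move opposite the gradient
print(J(w), J(step))    # loss drops after one step against the gradient
print(grad_J([3.0, -2.0]))  # [0.0, 0.0] at the minimum
```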
The Chain Rule: Computing Nested Derivatives
Neural networks are compositions of functions. The chain rule tells us how to compute derivatives through these nested functions.
Chain Rule Formula
If $y = f(g(x))$, then:
$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$

where $u = g(x)$.
Neural Network Example
Consider a 2-layer network:
- Layer 1: $z^{(1)} = w^{(1)}x + b^{(1)}$
- Activation: $a^{(1)} = \text{ReLU}(z^{(1)})$
- Layer 2: $\hat{y} = w^{(2)}a^{(1)} + b^{(2)}$
- Loss: $L = (\hat{y} - y)^2$
To update $w^{(1)}$, we need: $\frac{\partial L}{\partial w^{(1)}}$
Using chain rule:
$$\frac{\partial L}{\partial w^{(1)}} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial a^{(1)}} \cdot \frac{\partial a^{(1)}}{\partial z^{(1)}} \cdot \frac{\partial z^{(1)}}{\partial w^{(1)}}$$

This is the essence of backpropagation!
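The full chain can be traced by hand in scalar Python code. The parameter and input values below are arbitrary toy numbers chosen for illustration:

```python
# A scalar sketch of the chain rule for the 2-layer network above:
# dL/dw1 is the product of the local derivatives along the chain.
def relu(z):
    return max(0.0, z)

# Toy parameters and one training pair (x, y) -- arbitrary values.
w1, b1, w2, b2 = 0.5, 0.1, -0.3, 0.2
x, y = 1.0, 1.0

# Forward pass.
z1 = w1 * x + b1        # layer 1 pre-activation
a1 = relu(z1)           # activation
y_hat = w2 * a1 + b2    # prediction
L = (y_hat - y) ** 2    # squared-error loss

# Backward pass: chain rule, factor by factor.
dL_dyhat = 2 * (y_hat - y)
dyhat_da1 = w2
da1_dz1 = 1.0 if z1 > 0 else 0.0  # ReLU derivative
dz1_dw1 = x
dL_dw1 = dL_dyhat * dyhat_da1 * da1_dz1 * dz1_dw1
print(dL_dw1)
```

Automatic differentiation frameworks build and multiply out exactly this kind of chain for every parameter, which is why backpropagation scales to millions of weights.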
Common Derivatives in ML
These are functions whose derivatives you'll encounter repeatedly:
| Function $f(x)$ | Derivative $f'(x)$ | Used In |
|---|---|---|
| $x^n$ | $nx^{n-1}$ | Power rule, polynomials |
| $e^x$ | $e^x$ | Softmax, cross-entropy |
| $\ln(x)$ | $\frac{1}{x}$ | Log-likelihood |
| $\sin(x)$ | $\cos(x)$ | Positional encoders |
| $\sigma(x) = \frac{1}{1+e^{-x}}$ | $\sigma(x)(1-\sigma(x))$ | Sigmoid activation |
| $\text{ReLU}(x) = \max(0,x)$ | $\begin{cases} 0 & x < 0 \\ 1 & x > 0 \end{cases}$ (undefined at $x = 0$; conventionally set to 0) | ReLU activation |
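As one sanity check of the table, the sigmoid identity $\sigma'(x) = \sigma(x)(1-\sigma(x))$ can be verified numerically (the test point $x = 0.7$ is an arbitrary choice):

```python
# Numerically checking the identity sigma'(x) = sigma(x) * (1 - sigma(x)).
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x, h = 0.7, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)  # central difference
analytic = sigmoid(x) * (1 - sigmoid(x))
print(numeric, analytic)  # the two values agree closely
```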
Optimization: Using Calculus to Train
Machine learning is optimization. We use calculus to find parameters that minimize loss:
Gradient Descent Update Rule
$$w := w - \alpha \nabla J(w)$$

Move parameters opposite to the gradient direction, with step size controlled by the learning rate $\alpha$.
Why This Works
- The gradient points toward increasing loss
- We subtract the gradient to move toward decreasing loss
- Step size is controlled by learning rate $\alpha$
- Repeat until convergence (when $\nabla J \approx 0$)
Real Example: In neural networks, we compute the loss over a batch of samples, take its gradient with respect to all weights, and update every weight simultaneously. Backpropagation is the algorithm that computes those gradients efficiently.
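A minimal gradient descent loop, reusing the two-weight cost $J(w_1, w_2) = (w_1 - 3)^2 + (w_2 + 2)^2$ from earlier (the learning rate $\alpha = 0.1$ and iteration count are arbitrary choices):

```python
# Gradient descent on J(w1, w2) = (w1-3)^2 + (w2+2)^2,
# whose minimum sits at (3, -2).
alpha = 0.1          # learning rate
w1, w2 = 0.0, 0.0    # arbitrary starting point
for _ in range(100):
    g1 = 2 * (w1 - 3)   # dJ/dw1
    g2 = 2 * (w2 + 2)   # dJ/dw2
    w1 -= alpha * g1    # w := w - alpha * grad J
    w2 -= alpha * g2
print(round(w1, 4), round(w2, 4))  # converges to (3, -2)
```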
Second Derivatives: Curvature Matters
The second derivative shows how the first derivative changes - it measures curvature.
Second Derivative
$$f''(x) = \frac{d}{dx}\left(\frac{df}{dx}\right)$$

Curvature Interpretation
- $f''(x) > 0$: Concave up (curves like a bowl; a nearby stationary point is a minimum)
- $f''(x) < 0$: Concave down (curves like a dome; a nearby stationary point is a maximum)
- $f''(x) = 0$: Inflection point (curvature changes)
Advanced Optimization
Advanced optimizers use second derivatives:
- Newton's Method: Uses Hessian matrix (all second partial derivatives)
- Adam, RMSprop: Adapt per-parameter step sizes using running averages of squared gradients, a cheap stand-in for curvature
- L-BFGS: Approximates Hessian for efficient optimization
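A one-dimensional sketch of the Newton step on the toy objective $f(w) = (w - 3)^2$ (an illustration of the idea, not any library's implementation): dividing the gradient by the second derivative lets the curvature set the step size, and because the curvature here is constant, one step lands on the minimum.

```python
# Newton's method in 1D: w := w - f'(w) / f''(w).
# For f(w) = (w - 3)^2: f'(w) = 2(w - 3) and f''(w) = 2 everywhere.
w = 10.0               # arbitrary starting point
for _ in range(5):
    grad = 2 * (w - 3)
    hess = 2.0
    w -= grad / hess   # curvature-scaled step
print(w)  # lands exactly on the minimum w = 3 after the first step
```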
Integration (Brief Overview)
While derivatives are central to ML, integration appears in:
- Probability: Computing areas under probability density functions
- Expected values: $E[X] = \int x \cdot p(x) dx$
- Variational inference: Approximate intractable integrals
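The expected-value integral can be approximated with a plain Riemann sum, sketched here for a standard normal density over a truncated range (grid width and range are arbitrary choices):

```python
# Approximating E[X] = integral of x * p(x) dx for a standard normal
# density with a left Riemann sum over [-5, 5).
import math

def p(x):
    # Standard normal probability density function.
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

dx = 0.001
xs = [-5 + i * dx for i in range(10000)]   # grid over [-5, 5)
expected = sum(x * p(x) * dx for x in xs)
print(expected)  # approximately 0, the true mean
```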
Key Takeaways
- Derivatives measure how functions change - fundamental to optimization
- Gradients are vectors of partial derivatives - they guide us toward minima
- Chain rule lets us compute derivatives through nested functions (backprop!)
- Optimization iteratively moves opposite to gradients to minimize loss
- Second derivatives reveal curvature and help advanced optimizers