Chapter 4: Autograd & Gradients¶
🔢 Learning Objectives
- Understand automatic differentiation
- Compute gradients automatically
- Control gradient computation
- Debug gradient flow
What is Autograd?¶
Autograd is PyTorch's automatic differentiation engine that powers neural network training. It automatically computes gradients (derivatives) of tensor operations, which are essential for optimization algorithms like gradient descent.
Why Autograd is Powerful
Autograd eliminates the need to manually compute derivatives, which is error-prone and time-consuming. It automatically builds a computation graph and computes gradients using the chain rule.
Understanding the Computation Graph
PyTorch builds a dynamic computation graph as you perform operations. When you call .backward(), it traverses this graph in reverse to compute gradients using backpropagation.
Why Autograd Matters¶
- 🎯 Automatic: No manual derivative calculations
- 🔄 Dynamic: Build graphs on-the-fly (define-by-run)
- 📊 Efficient: Optimized backward pass computation
- 🧮 Flexible: Supports complex operations and custom functions
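To make the "define-by-run" point concrete, here is a minimal sketch in which ordinary Python control flow decides which operations get recorded. The function name `branchy` and the input values are illustrative assumptions, not part of the original chapter.

```python
import torch

def branchy(x):
    # Plain Python control flow: the branch is chosen at run time,
    # so each call records a different graph.
    if x.item() > 0:
        return x ** 2   # derivative is 2x
    return -x           # derivative is -1

pos = torch.tensor(3.0, requires_grad=True)
branchy(pos).backward()

neg = torch.tensor(-3.0, requires_grad=True)
branchy(neg).backward()

print(f"grad on positive branch: {pos.grad}")
print(f"grad on negative branch: {neg.grad}")
```

Each call builds its own graph, so the gradient reflects only the branch that actually ran.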
Enabling Gradient Tracking¶
import torch
# Create tensors with gradient tracking
x = torch.tensor([2.0, 3.0], requires_grad=True)
print(f"x: {x}")
print(f"Requires grad: {x.requires_grad}")
print(f"Is leaf: {x.is_leaf}")
print(f"Gradient: {x.grad}") # None initially
# Create tensor without gradient tracking (default)
y = torch.tensor([1.0, 2.0])
print(f"\ny requires grad: {y.requires_grad}")
# Enable gradient tracking on existing tensor
y.requires_grad_(True)
print(f"y requires grad now: {y.requires_grad}")
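One detail worth checking when enabling tracking: only floating-point (and complex) tensors can require gradients. The sketch below, with illustrative variable names, shows the error raised for an integer tensor and one way around it.

```python
import torch

try:
    # Integer dtype: PyTorch refuses to track gradients for it
    torch.tensor([1, 2], requires_grad=True)
    error_message = None
except RuntimeError as err:
    error_message = str(err)

# Requesting a floating-point dtype makes tracking possible
floats = torch.tensor([1, 2], dtype=torch.float32, requires_grad=True)

print(f"Integer tensor error: {error_message}")
print(f"Float tensor requires grad: {floats.requires_grad}")
```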
Computing Gradients¶
Basic Example¶
import torch
# Create tensor with gradient tracking
x = torch.tensor([3.0], requires_grad=True)
print(f"x = {x}")
# Perform operations
y = x ** 2 # y = x²
print(f"y = x² = {y}")
# Compute gradient dy/dx
y.backward() # Computes gradients
# Access gradient
print(f"dy/dx = {x.grad}") # Should be 2x = 2*3 = 6
Mathematical Explanation:
- \(y = x^2\)
- \(\frac{dy}{dx} = 2x\)
- At \(x = 3\): \(\frac{dy}{dx} = 2(3) = 6\)
Multi-Step Computation¶
import torch
x = torch.tensor([2.0], requires_grad=True)
a = torch.tensor([3.0], requires_grad=True)
# Forward pass
y = x ** 2 # y = x²
z = a * y # z = a * y = a * x²
w = z + 5 # w = a * x² + 5
print(f"x = {x}, a = {a}")
print(f"y = x² = {y}")
print(f"z = a*y = {z}")
print(f"w = z+5 = {w}")
# Backward pass
w.backward()
# Check gradients
print(f"\ndw/dx = {x.grad}") # 2*a*x = 2*3*2 = 12
print(f"dw/da = {a.grad}") # x² = 4
Mathematical Explanation:
- \(w = ax^2 + 5\)
- \(\frac{\partial w}{\partial x} = 2ax = 2(3)(2) = 12\)
- \(\frac{\partial w}{\partial a} = x^2 = 2^2 = 4\)
Vector Gradients¶
import torch
# Vector input
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
# Scalar output
y = (x ** 2).sum() # y = x₁² + x₂² + x₃²
print(f"y = {y}")
y.backward()
print(f"dy/dx = {x.grad}") # [2x₁, 2x₂, 2x₃] = [2, 4, 6]
Non-Scalar Backward Pass¶
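For non-scalar outputs, backward() needs a gradient argument with the same shape as the output. It supplies the vector in the vector-Jacobian product, that is, the weight given to each output element: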
import torch
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
# Vector output
y = x ** 2 # [1, 4, 9]
# Need to provide gradient for non-scalar
gradient = torch.tensor([1.0, 1.0, 1.0])
y.backward(gradient)
print(f"Gradient: {x.grad}") # [2, 4, 6]
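The gradient argument does not have to be all ones. As a small sketch (the `weights` values are arbitrary and chosen for illustration), passing a weight vector gives the same result as backpropagating the weighted sum of the outputs:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2

# Weight each output element differently
weights = torch.tensor([1.0, 0.5, 0.0])
y.backward(weights)
print(f"Weighted gradient: {x.grad}")  # weights * 2x

# Same result via an explicit scalar: sum(weights * x²)
x_check = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
(weights * x_check ** 2).sum().backward()
print(f"Scalar route:      {x_check.grad}")
```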
Gradient Accumulation¶
import torch
x = torch.tensor([3.0], requires_grad=True)
# First backward pass
y1 = x ** 2
y1.backward()
print(f"After first backward: x.grad = {x.grad}")
# Second backward pass (gradients accumulate!)
y2 = x ** 3
y2.backward()
print(f"After second backward: x.grad = {x.grad}")
# Zero gradients
x.grad.zero_()
print(f"After zeroing: x.grad = {x.grad}")
# Third backward pass
y3 = x ** 2
y3.backward()
print(f"After third backward: x.grad = {x.grad}")
Important: Gradients accumulate by default! Zero them between iterations unless you are deliberately summing gradients across several backward passes.
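Accumulation is also useful on purpose. The sketch below assumes a toy squared-output loss and random data (both illustrative) and shows that accumulating scaled gradients over two half-batches matches the gradient of the full batch.

```python
import torch

torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)
data = torch.randn(8, 3)

# Gradient of the mean loss over the full batch
(data @ w).pow(2).mean().backward()
full_grad = w.grad.clone()
w.grad.zero_()

# Same gradient, accumulated over two half-batches.
# Dividing by the number of chunks keeps the overall mean consistent.
chunks = data.chunk(2)
for chunk in chunks:
    ((chunk @ w).pow(2).mean() / len(chunks)).backward()
accumulated_grad = w.grad.clone()

print(torch.allclose(full_grad, accumulated_grad))
```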
Controlling Gradient Computation¶
No Gradient Context¶
import torch
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
# Normal operation - gradients tracked
y = x ** 2
print(f"y requires_grad: {y.requires_grad}")
# Disable gradient tracking temporarily
with torch.no_grad():
    z = x ** 2
    print(f"z requires_grad: {z.requires_grad}")
# Gradients tracked again
w = x ** 2
print(f"w requires_grad: {w.requires_grad}")
Use Cases for torch.no_grad():
- Inference/evaluation (no training)
- Validation phase
- Saving memory
- Speeding up computation
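For inference helpers, torch.no_grad() can also be applied as a decorator. This is a minimal sketch; the `predict` function and its arguments are hypothetical stand-ins for a real model call.

```python
import torch

@torch.no_grad()
def predict(weights, inputs):
    # No graph is recorded inside this function
    return inputs @ weights

weights = torch.randn(3, requires_grad=True)
inputs = torch.randn(5, 3)
preds = predict(weights, inputs)

print(f"preds requires_grad: {preds.requires_grad}")
print(f"preds grad_fn: {preds.grad_fn}")
```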
Detach¶
import torch
x = torch.tensor([2.0], requires_grad=True)
# Create computation graph
y = x ** 2
z = y ** 3
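# Detach y from the graph: same values, but no gradient history
y_detached = y.detach()
print(f"y requires_grad: {y.requires_grad}")
print(f"y_detached requires_grad: {y_detached.requires_grad}")
# Anything computed only from y_detached is cut off from x
w = y_detached * 2
print(f"w requires_grad: {w.requires_grad}") # False
# w.backward() # RuntimeError: w does not require grad and has no grad_fn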
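A detached tensor can still take part in a tracked computation, where it behaves like a constant. In this sketch (values chosen for illustration), gradient flows through one factor of the product but not through the detached one:

```python
import torch

x = torch.tensor([2.0], requires_grad=True)
y = x ** 2                      # y = 4

# Full graph: z = y * y, so dz/dx = 2y * 2x = 32
(y * y).backward(retain_graph=True)
full_grad = x.grad.clone()
x.grad.zero_()

# One factor detached: it is treated as the constant 4,
# so dz/dx = 4 * 2x = 16
(y * y.detach()).backward()
partial_grad = x.grad.clone()

print(f"full: {full_grad}, with detach: {partial_grad}")
```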
Set Gradient Enabled¶
import torch
# Enable/disable globally
torch.set_grad_enabled(False)
x = torch.tensor([1.0], requires_grad=True)
y = x ** 2
print(f"Grad enabled False - y.requires_grad: {y.requires_grad}")
torch.set_grad_enabled(True)
z = x ** 2
print(f"Grad enabled True - z.requires_grad: {z.requires_grad}")
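torch.set_grad_enabled() also works as a context manager, which avoids leaving the global switch off by accident. A short sketch, assuming a hypothetical `forward` helper with a `train` flag:

```python
import torch

def forward(x, train):
    # Tracking is on or off only inside this block;
    # the previous global state is restored on exit.
    with torch.set_grad_enabled(train):
        return x ** 2

x = torch.tensor([1.0], requires_grad=True)
train_out = forward(x, train=True)
eval_out = forward(x, train=False)

print(f"train_out.requires_grad: {train_out.requires_grad}")
print(f"eval_out.requires_grad: {eval_out.requires_grad}")
print(f"grad still enabled globally: {torch.is_grad_enabled()}")
```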
Computational Graph¶
Understanding the Graph¶
import torch
x = torch.tensor([2.0], requires_grad=True)
w = torch.tensor([3.0], requires_grad=True)
b = torch.tensor([1.0], requires_grad=True)
# Build graph
y = w * x + b # Linear function
z = y ** 2 # Square
print(f"y: {y}")
print(f"z: {z}")
# Check graph
print(f"\nz.grad_fn: {z.grad_fn}") # PowBackward0
print(f"y.grad_fn: {y.grad_fn}") # AddBackward0
print(f"x.grad_fn: {x.grad_fn}") # None (leaf node)
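To see the whole graph rather than a single node, one option is to follow each node's `next_functions` attribute back to the leaves. The `walk` helper here is an illustrative sketch, and the exact node class names it prints may differ between PyTorch versions.

```python
import torch

x = torch.tensor([2.0], requires_grad=True)
w = torch.tensor([3.0], requires_grad=True)
b = torch.tensor([1.0], requires_grad=True)
z = (w * x + b) ** 2

def walk(node, depth=0, lines=None):
    """Collect the backward graph as indented node names."""
    lines = [] if lines is None else lines
    lines.append("  " * depth + type(node).__name__)
    for parent, _ in node.next_functions:
        if parent is not None:      # None marks inputs that need no gradient
            walk(parent, depth + 1, lines)
    return lines

lines = walk(z.grad_fn)
print("\n".join(lines))
```

Leaf tensors that require gradients appear as AccumulateGrad nodes, one each for w, x, and b.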
Leaf Nodes vs Non-Leaf Nodes¶
import torch
x = torch.tensor([1.0], requires_grad=True) # Leaf
y = x * 2 # Non-leaf
z = y + 3 # Non-leaf
print(f"x is_leaf: {x.is_leaf}") # True
print(f"y is_leaf: {y.is_leaf}") # False
print(f"z is_leaf: {z.is_leaf}") # False
# Only leaf nodes retain gradients by default
z.backward()
print(f"\nx.grad: {x.grad}") # Available
print(f"y.grad: {y.grad}") # None (non-leaf; PyTorch also emits a UserWarning)
print(f"z.grad: {z.grad}") # None (non-leaf; PyTorch also emits a UserWarning)
Retaining Gradients for Non-Leaf Nodes¶
import torch
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2
y.retain_grad() # Keep gradient for non-leaf node
z = y ** 3
z.backward()
print(f"x.grad: {x.grad}") # Leaf - always available
print(f"y.grad: {y.grad}") # Non-leaf - available due to retain_grad()
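When the goal is to inspect gradient flow while debugging rather than to store the gradient, a tensor hook is an alternative to retain_grad(). This sketch uses `register_hook`; the `captured` list is an illustrative name.

```python
import torch

captured = []

x = torch.tensor([2.0], requires_grad=True)
y = x ** 2
# The hook runs during backward() with the gradient flowing into y.
# Returning None (as list.append does) leaves the gradient unchanged.
y.register_hook(lambda grad: captured.append(grad.clone()))
z = y ** 3
z.backward()

print(f"dz/dy seen by hook: {captured[0]}")  # 3y² = 48
print(f"dz/dx: {x.grad}")                    # 48 * 2x = 192
```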
Gradient Descent Example¶
Simple Linear Regression¶
import torch
# Generate synthetic data: y = 2x + 1
torch.manual_seed(42)
x_data = torch.randn(100, 1)
y_true = 2 * x_data + 1 + torch.randn(100, 1) * 0.1
# Initialize parameters
w = torch.randn(1, requires_grad=True)
b = torch.randn(1, requires_grad=True)
# Training loop
learning_rate = 0.01
epochs = 100
losses = []
for epoch in range(epochs):
    # Forward pass
    y_pred = w * x_data + b
    # Compute loss (Mean Squared Error)
    loss = ((y_pred - y_true) ** 2).mean()
    losses.append(loss.item())
    # Backward pass
    loss.backward()
    # Update parameters without recording the update in the graph
    with torch.no_grad():
        w -= learning_rate * w.grad
        b -= learning_rate * b.grad
    # Zero gradients
    w.grad.zero_()
    b.grad.zero_()
    if (epoch + 1) % 20 == 0:
        print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}, w: {w.item():.4f}, b: {b.item():.4f}")
print(f"\nFinal: w = {w.item():.4f}, b = {b.item():.4f}")
print(f"True: w = 2.0000, b = 1.0000")
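The manual update and zeroing steps above are what an optimizer does for you. As a sketch, the same regression can be written with torch.optim.SGD; the zero initialization, learning rate of 0.1, and 200 epochs are assumptions chosen so the fit converges quickly, not values from the original example.

```python
import torch

torch.manual_seed(42)
x_data = torch.randn(100, 1)
y_true = 2 * x_data + 1 + torch.randn(100, 1) * 0.1

w = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.SGD([w, b], lr=0.1)

for epoch in range(200):
    optimizer.zero_grad()                       # replaces w.grad.zero_() / b.grad.zero_()
    loss = ((w * x_data + b - y_true) ** 2).mean()
    loss.backward()
    optimizer.step()                            # replaces the manual no_grad() update

print(f"w = {w.item():.4f}, b = {b.item():.4f}")
```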
Visualizing Gradient Descent¶
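import torch
import numpy as np
import matplotlib.pyplot as plt
# Function: f(x) = x²
x_range = np.linspace(-5, 5, 100)
y_range = x_range ** 2
# Gradient descent
x = torch.tensor([-4.0], requires_grad=True)
learning_rate = 0.1
steps = 20
x_history = [x.item()]
for step in range(steps):
    y = x ** 2
    y.backward()
    with torch.no_grad():
        x -= learning_rate * x.grad
    x.grad.zero_()
    x_history.append(x.item())
print(f"Start: x = -4.0")
print(f"End: x = {x.item():.4f}")
print(f"Optimum: x = 0.0")
# Plot the curve and the points visited by gradient descent
plt.plot(x_range, y_range, label="f(x) = x²")
plt.plot(x_history, [v ** 2 for v in x_history], "o-", label="Gradient descent path")
plt.xlabel("x")
plt.ylabel("f(x)")
plt.legend()
plt.show()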
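For f(x) = x², each update is x ← x − lr · 2x = (1 − 2·lr) · x, so the iterate shrinks by a constant factor per step. The following sketch checks the autograd loop against that closed form; the tolerance is an assumption that allows for float32 rounding.

```python
import torch

learning_rate = 0.1
steps = 20

x = torch.tensor([-4.0], requires_grad=True)
for _ in range(steps):
    (x ** 2).backward()
    with torch.no_grad():
        x -= learning_rate * x.grad
    x.grad.zero_()

# Closed form: x_k = x_0 * (1 - 2 * lr) ** k
closed_form = -4.0 * (1 - 2 * learning_rate) ** steps

print(f"Autograd loop: {x.item():.6f}")
print(f"Closed form:   {closed_form:.6f}")
```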
Higher-Order Gradients¶
Second Derivatives¶
import torch
x = torch.tensor([2.0], requires_grad=True)
# First derivative
y = x ** 3
y.backward(create_graph=True) # Keep graph for second derivative
first_grad = x.grad.clone()
print(f"First derivative (3x²): {first_grad.item()}")
# Second derivative
x.grad.zero_()
first_grad.backward()
second_grad = x.grad
print(f"Second derivative (6x): {second_grad.item()}")
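An alternative that avoids touching x.grad at all is torch.autograd.grad, which returns the gradients instead of accumulating them. This sketch uses x = 3 (an illustrative choice) so that the first and second derivatives have different values.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 3

# First derivative, keeping the graph so it can be differentiated again
(first,) = torch.autograd.grad(y, x, create_graph=True)
# Second derivative: differentiate the first derivative with respect to x
(second,) = torch.autograd.grad(first, x)

print(f"First derivative (3x²): {first.item()}")   # 27.0
print(f"Second derivative (6x): {second.item()}")  # 18.0
```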
Common Pitfalls and Solutions¶
Pitfall 1: In-place Operations¶
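import torch
x = torch.tensor([1.0], requires_grad=True)
# Bad: in-place operation on a leaf tensor that requires grad
# x += 1 # RuntimeError: a leaf Variable that requires grad is being used in an in-place operation
# Good: create a new tensor and keep x as the leaf
x_shifted = x + 1
y = x_shifted ** 2
y.backward()
print(f"Gradient: {x.grad}") # 2(x + 1) = 4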
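In-place operations can also cause trouble on intermediate tensors, and there the error only appears during backward(). A minimal sketch, relying on the fact that exp() keeps its output for the backward pass:

```python
import torch

x = torch.tensor([1.0], requires_grad=True)
y = x.exp()     # exp() saves its output to compute the gradient later
y.add_(1)       # in-place edit overwrites that saved output

try:
    y.sum().backward()
    error_message = None
except RuntimeError as err:
    error_message = str(err)

print(f"Backward failed: {error_message is not None}")
```

Writing `y = y + 1` instead creates a new tensor and leaves the saved output intact.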
Pitfall 2: Forgetting to Zero Gradients¶
import torch
x = torch.tensor([2.0], requires_grad=True)
# Iteration 1
y = x ** 2
y.backward()
print(f"Iteration 1: {x.grad}")
# Iteration 2 (forgot to zero!)
y = x ** 2
y.backward()
print(f"Iteration 2 (accumulated): {x.grad}")
# Correct way
x.grad.zero_()
y = x ** 2
y.backward()
print(f"Iteration 3 (zeroed): {x.grad}")
Pitfall 3: Multiple Backward Passes¶
import torch
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2
# First backward: retain_graph=True keeps the graph for another pass
y.backward(retain_graph=True)
print(f"First backward: {x.grad}")
# Second backward: only works because the first call retained the graph
x.grad.zero_()
y.backward()
print(f"Second backward: {x.grad}")
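For contrast, this sketch shows what happens when the first call does not retain the graph: the buffers saved for the backward pass are freed, and a second call raises a RuntimeError.

```python
import torch

x = torch.tensor([2.0], requires_grad=True)
y = x ** 2

y.backward()            # graph buffers are freed after this call
try:
    y.backward()        # second pass through the same graph
    error_message = None
except RuntimeError as err:
    error_message = str(err)

print(f"Second backward failed: {error_message is not None}")
```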
Autograd Functions¶
Custom Autograd Function¶
import torch
from torch.autograd import Function
class MySquare(Function):
    @staticmethod
    def forward(ctx, input):
        # Save input for backward
        ctx.save_for_backward(input)
        return input ** 2

    @staticmethod
    def backward(ctx, grad_output):
        # Retrieve saved input
        input, = ctx.saved_tensors
        # Gradient: d(x²)/dx = 2x
        return grad_output * 2 * input
# Use custom function
x = torch.tensor([3.0], requires_grad=True)
square = MySquare.apply
y = square(x)
y.backward()
print(f"x: {x}")
print(f"y: {y}")
print(f"dy/dx: {x.grad}") # 2*3 = 6
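A hand-written backward() is easy to get wrong, so it can help to compare it against numerical differentiation. This sketch repeats the MySquare definition to stay self-contained and uses torch.autograd.gradcheck, which expects double-precision inputs; the `eps` and `atol` values are assumptions.

```python
import torch
from torch.autograd import Function, gradcheck

class MySquare(Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return input ** 2

    @staticmethod
    def backward(ctx, grad_output):
        (input,) = ctx.saved_tensors
        return grad_output * 2 * input

# gradcheck compares the analytical gradient with finite differences
x = torch.randn(4, dtype=torch.float64, requires_grad=True)
ok = gradcheck(MySquare.apply, (x,), eps=1e-6, atol=1e-4)
print(f"Gradient check passed: {ok}")
```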
Practice Exercises¶
Exercise 1: Basic Gradients¶
import torch
# Compute gradient of f(x, y) = x²y + y³ at x=2, y=3
x = torch.tensor([2.0], requires_grad=True)
y = torch.tensor([3.0], requires_grad=True)
f = x**2 * y + y**3
f.backward()
print(f"∂f/∂x = {x.grad.item()}") # 2xy = 2*2*3 = 12
print(f"∂f/∂y = {y.grad.item()}") # x² + 3y² = 4 + 27 = 31
Exercise 2: Training Loop Pattern¶
import torch
# Model parameters
w = torch.randn(1, requires_grad=True)
# Training loop structure
for epoch in range(10):
    # 1. Forward pass
    output = w * 2
    loss = output ** 2
    # 2. Backward pass
    loss.backward()
    # 3. Update parameters
    with torch.no_grad():
        w -= 0.01 * w.grad
    # 4. Zero gradients
    w.grad.zero_()
Optimization Tips¶
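import torch
from torch.utils.checkpoint import checkpoint
# Stand-in model and data so the tips run end to end
model = torch.nn.Linear(4, 1)
input_data = torch.randn(8, 4)
# 1. Use torch.no_grad() for inference
with torch.no_grad():
    model_output = model(input_data)
# 2. Detach when you don't need gradients (e.g. logging a scalar loss)
loss = model(input_data).pow(2).mean()
loss_value = loss.detach().item()
# 3. Use gradient checkpointing to trade extra compute for lower memory
checkpointed_output = checkpoint(model, input_data, use_reentrant=False)
# 4. Gradient clipping: call after backward() and before the parameter update
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)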
Next Steps¶
Continue to Chapter 5: Building Neural Networks to learn about:
- nn.Module class
- Layers and activation functions
- Building custom models
- Forward pass
Key Takeaways¶
- ✅ Autograd automatically computes gradients via backward()
- ✅ Use requires_grad=True to track gradients
- ✅ Always zero gradients between iterations
- ✅ Use torch.no_grad() for inference
- ✅ Computational graph is built dynamically
- ✅ Gradients accumulate by default