Understanding Convolution
The Core Operation of Convolutional Neural Networks
1. What is Convolution?
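Convolution is a mathematical operation that slides a small matrix (the kernel, or filter) over a larger matrix (the image). At each position it multiplies the overlapping elements pairwise and sums the products into a single output value. Written in the unflipped form that CNNs conventionally use:
$$ (I * K)(i, j) = \sum_{m} \sum_{n} I(i + m,\, j + n)\, K(m, n) $$
where \( I \) is the input, \( K \) is the kernel, and \( * \) denotes convolution.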
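The slide, multiply, and sum procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function name `conv2d` and the tiny 3×3 input are assumptions made for the example, and it uses no padding and a stride of 1.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the output.

    The kernel is not flipped, which matches the convention used in CNNs
    (strictly speaking this is cross-correlation).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the window whose top-left corner is (i, j) by the
            # kernel, element by element, and sum the products.
            total = sum(
                image[i + m][j + n] * kernel[m][n]
                for m in range(kh)
                for n in range(kw)
            )
            row.append(total)
        output.append(row)
    return output


# Hypothetical toy data, chosen only to keep the arithmetic easy to follow.
image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
kernel = [
    [1, 0],
    [0, -1],
]
result = conv2d(image, kernel)
print(result)  # [[-4, -4], [-4, -4]]
```

Each output value here is the top-left element of a 2×2 window minus its bottom-right element, which is why every entry comes out as -4 for this evenly increasing input.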
2. Why Convolution for Images?
🎯 Local Connectivity
Each output value depends only on a small region of the input (its receptive field), which lets the kernel capture local patterns such as edges.
🔄 Parameter Sharing
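The same kernel weights are reused at every position in the image, so the number of parameters depends on the kernel size rather than the image size. This dramatically reduces the parameter count compared with a fully-connected layer.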
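To put rough numbers on the saving, the sketch below counts weights for a fully-connected layer and a convolutional layer. The sizes (a 28×28 single-channel image, 32 output units, 32 filters of size 3×3) are hypothetical, and biases are left out to keep the comparison simple.

```python
def dense_params(in_h, in_w, out_units):
    """Weights in a fully-connected layer: one per (input pixel, output unit) pair."""
    return in_h * in_w * out_units


def conv_params(kernel_h, kernel_w, num_filters):
    """Weights in a conv layer: one small kernel per filter, reused at every position."""
    return kernel_h * kernel_w * num_filters


# Hypothetical layer sizes, single input channel, biases ignored.
dense = dense_params(28, 28, 32)
conv = conv_params(3, 3, 32)
print(dense, conv)  # 25088 288
```

Doubling the image side would quadruple the dense count but leave the convolutional count unchanged.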
📍 Translation Invariance
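The same kernel detects a feature wherever it appears in the image: shifting the input shifts the response by the same amount (strictly, this is translation equivariance), and pooling turns that into approximate invariance. A cat is a cat anywhere!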
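A small one-dimensional sketch can make this concrete. The helper `correlate1d` and the two signals are assumptions made for illustration: the same pattern appears at two different positions, the kernel's response moves with it, and the maximum response (what a global max-pool would keep) stays the same.

```python
def correlate1d(signal, kernel):
    """Slide `kernel` along `signal` (no padding, stride 1, no kernel flip)."""
    k = len(kernel)
    return [
        sum(signal[i + m] * kernel[m] for m in range(k))
        for i in range(len(signal) - k + 1)
    ]


kernel = [1, -1]  # positive response where the signal steps down

# Hypothetical signals: the same bump, moved two positions to the right.
signal = [0, 1, 1, 0, 0, 0, 0]
shifted = [0, 0, 0, 1, 1, 0, 0]

out = correlate1d(signal, kernel)
out_shifted = correlate1d(shifted, kernel)

print(out)          # [-1, 0, 1, 0, 0, 0]
print(out_shifted)  # [0, 0, -1, 0, 1, 0]
print(max(out), max(out_shifted))  # 1 1
```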
3. Key Concepts
Stride
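The number of pixels the kernel moves at each step. With stride 1 it visits every position; with stride 2 it moves two pixels at a time, skipping every other position and roughly halving the output size in each dimension.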
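As a quick illustration of stride, the sketch below lists where the kernel's leading edge lands along one axis. The helper name `window_starts` is hypothetical, and no padding is assumed.

```python
def window_starts(input_size, kernel_size, stride):
    """Start indices along one axis where the kernel fits entirely inside the input."""
    return list(range(0, input_size - kernel_size + 1, stride))


stride_1 = window_starts(5, 3, 1)
stride_2 = window_starts(5, 3, 2)
print(stride_1)  # [0, 1, 2]
print(stride_2)  # [0, 2]
```

The number of start positions is the output size along that axis, so a 5-wide input with a 3-wide kernel gives 3 outputs at stride 1 and 2 outputs at stride 2.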
Padding
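Adding zeros around the input border to control the output size. "SAME" padding adds enough zeros that, at stride 1, the output has the same size as the input; "VALID" uses no padding, so the output shrinks by the kernel size minus one in each dimension.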
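The effect of padding on output size can be sketched with two small helpers. The names `zero_pad` and `output_size` are assumptions made for this example; the size rule used is the standard one, floor((n + 2p - k) / s) + 1 for input size n, padding p, kernel size k, and stride s.

```python
def zero_pad(image, pad):
    """Surround `image` with `pad` rows and columns of zeros on every side."""
    width = len(image[0]) + 2 * pad
    top = [[0] * width for _ in range(pad)]
    middle = [[0] * pad + list(row) + [0] * pad for row in image]
    bottom = [[0] * width for _ in range(pad)]
    return top + middle + bottom


def output_size(input_size, kernel_size, stride=1, pad=0):
    """Output length along one axis: floor((n + 2p - k) / s) + 1."""
    return (input_size + 2 * pad - kernel_size) // stride + 1


padded = zero_pad([[1, 2], [3, 4]], 1)
for row in padded:
    print(row)
# [0, 0, 0, 0]
# [0, 1, 2, 0]
# [0, 3, 4, 0]
# [0, 0, 0, 0]

valid_size = output_size(5, 3)        # no padding ("VALID")
same_size = output_size(5, 3, pad=1)  # one ring of zeros ("SAME" for a 3x3 kernel)
print(valid_size, same_size)  # 3 5
```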
Common Kernels
Sharpen: $$ K = \begin{bmatrix} 0 & -1 & 0 \\ -1 & 5 & -1 \\ 0 & -1 & 0 \end{bmatrix} $$
Gaussian Blur: $$ K = \frac{1}{16}\begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix} $$
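To see what these two kernels do, the sketch below applies each one to a single 3×3 patch. The patches and the helper `apply_to_patch` are hypothetical. Both kernels sum to 1, so a flat region passes through unchanged, while an isolated bright pixel is exaggerated by the sharpen kernel and pulled toward its neighbours by the blur.

```python
from fractions import Fraction

SHARPEN = [
    [0, -1, 0],
    [-1, 5, -1],
    [0, -1, 0],
]
# Exact fractions avoid floating-point rounding in the 1/16 scaling.
GAUSSIAN_BLUR = [
    [Fraction(v, 16) for v in row]
    for row in [[1, 2, 1], [2, 4, 2], [1, 2, 1]]
]


def apply_to_patch(patch, kernel):
    """Element-wise product of a 3x3 patch and a 3x3 kernel, summed."""
    return sum(patch[m][n] * kernel[m][n] for m in range(3) for n in range(3))


# Hypothetical patches: a uniform region and one bright centre pixel.
flat = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]
spike = [[10, 10, 10], [10, 20, 10], [10, 10, 10]]

flat_sharp = apply_to_patch(flat, SHARPEN)
flat_blur = apply_to_patch(flat, GAUSSIAN_BLUR)
spike_sharp = apply_to_patch(spike, SHARPEN)
spike_blur = apply_to_patch(spike, GAUSSIAN_BLUR)

print(flat_sharp, float(flat_blur))    # 10 10.0
print(spike_sharp, float(spike_blur))  # 60 12.5
```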
4. Interactive Convolution Animation
[Interactive animation: a 3×3 kernel slides across a 5×5 input one position at a time, filling in the output and showing the computation at the current position.]
5. Numerical Example
Input: 5×5 matrix, Kernel: 3×3 edge detector, Stride: 1
Computation at position (0,0):
Sum = (1×1 + 1×0 + 1×(-1)) + (1×1 + 1×0 + 1×(-1)) + (1×1 + 1×0 + 1×(-1))
= (1 + 0 - 1) + (1 + 0 - 1) + (1 + 0 - 1) = 0
Computation at position (0,1):
Sum = (1×1 + 1×0 + 0×(-1)) + (1×1 + 1×0 + 0×(-1)) + (1×1 + 1×0 + 0×(-1))
= (1 + 0 + 0) + (1 + 0 + 0) + (1 + 0 + 0) = 3
The output is 0 where the window covers a uniform region and 3 where it straddles the boundary, so the kernel detects the vertical edge!
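The two sums above can be reproduced in code. The matrices below are assumptions reconstructed from the worked arithmetic: the kernel rows are taken to be [1, 0, -1], and the input is taken to be three columns of ones followed by two columns of zeros. The fifth column and the last two rows are not pinned down by the sums shown and are filled in to match.

```python
def conv2d_valid(image, kernel):
    """Convolution without padding, stride 1, kernel not flipped."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(image[i + m][j + n] * kernel[m][n]
                for m in range(kh) for n in range(kw))
            for j in range(len(image[0]) - kw + 1)
        ]
        for i in range(len(image) - kh + 1)
    ]


# Assumed 5x5 input: bright on the left, dark on the right.
image = [[1, 1, 1, 0, 0] for _ in range(5)]

# Assumed 3x3 vertical edge detector.
kernel = [[1, 0, -1] for _ in range(3)]

output = conv2d_valid(image, kernel)
for row in output:
    print(row)
# [0, 3, 3]
# [0, 3, 3]
# [0, 3, 3]
```

Under these assumptions, `output[0][0]` is 0 and `output[0][1]` is 3, matching the hand computation at positions (0,0) and (0,1).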
6. Building CNNs
Convolutional Neural Networks stack multiple convolution layers:
Input Image → [Conv → ReLU → Pool] × N → Flatten → Dense → Output
- Early layers: Learn low-level features (edges, textures)
- Middle layers: Learn mid-level features (patterns, shapes)
- Deep layers: Learn high-level features (object parts, faces)
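The pipeline above can be traced end to end with a toy forward pass in plain Python. Everything here is a hand-built sketch under stated assumptions: a single Conv → ReLU → Pool stage (N = 1), a hypothetical 6×6 input, a fixed edge-detecting kernel, and hand-picked dense weights. In a real CNN the kernel and dense weights are learned during training.

```python
def conv2d(image, kernel):
    """Convolution without padding, stride 1, kernel not flipped."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(image[i + m][j + n] * kernel[m][n]
                for m in range(kh) for n in range(kw))
            for j in range(len(image[0]) - kw + 1)
        ]
        for i in range(len(image) - kh + 1)
    ]


def relu(fmap):
    """Replace negative activations with zero."""
    return [[max(0, v) for v in row] for row in fmap]


def max_pool_2x2(fmap):
    """Keep the largest value in each non-overlapping 2x2 block."""
    return [
        [max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
         for j in range(0, len(fmap[0]) - 1, 2)]
        for i in range(0, len(fmap) - 1, 2)
    ]


def flatten(fmap):
    return [v for row in fmap for v in row]


def dense(inputs, weights, bias):
    """Single output unit: weighted sum of the inputs plus a bias."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias


# Hypothetical input: a bright vertical stripe on a dark background.
image = [[0, 0, 1, 1, 0, 0] for _ in range(6)]
kernel = [[1, 0, -1] for _ in range(3)]  # fixed, not learned

features = conv2d(image, kernel)   # 4x4, rows of [-3, -3, 3, 3]
activated = relu(features)         # 4x4, rows of [0, 0, 3, 3]
pooled = max_pool_2x2(activated)   # 2x2, [[0, 3], [0, 3]]
vector = flatten(pooled)           # [0, 3, 0, 3]
score = dense(vector, [0.5, 0.5, 0.5, 0.5], bias=-1.0)  # hand-picked weights
print(vector, score)  # [0, 3, 0, 3] 2.0
```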