Understanding Convolution

The Core Operation of Convolutional Neural Networks

1. What is Convolution?

Convolution is a mathematical operation that slides a small matrix (kernel/filter) over a larger matrix (image), computing element-wise products and summing them up at each position.

$$ (I * K)[i, j] = \sum_m \sum_n I[i+m, j+n] \cdot K[m, n] $$

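where \( I \) is the input, \( K \) is the kernel, and \( * \) denotes convolution. Strictly speaking, this formula (with indices \( i+m, j+n \)) is cross-correlation; mathematical convolution flips the kernel first. Deep learning libraries usually implement the unflipped form and still call it convolution, and the difference does not matter when the kernel weights are learned.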
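
The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming stride 1 and no padding; the function name `conv2d` and the small 3×3 input are hypothetical choices made for this example, not part of any library.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            # Element-wise products over the window whose top-left corner is (i, j)
            for m in range(kh):
                for n in range(kw):
                    total += image[i + m][j + n] * kernel[m][n]
            row.append(total)
        output.append(row)
    return output


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]

print(conv2d(image, kernel))  # [[-4, -4], [-4, -4]]
```

Each output value is the top-left pixel of the window minus its bottom-right pixel, which is why every entry is -4 for this input.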

2. Why Convolution for Images?

🎯 Local Connectivity

Each output value depends only on a small region of the input (its receptive field), so the kernel captures local patterns such as edges.

🔄 Parameter Sharing

The same kernel weights are used across the entire image, which dramatically reduces the number of parameters compared with a fully connected layer.
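
To put a number on the savings, the sketch below compares parameter counts for a hypothetical 28×28 grayscale input. The sizes are assumptions chosen for illustration: one 3×3 filter with a bias, versus a fully connected layer that produces an output of the same 26×26 size.

```python
height, width = 28, 28   # assumed input size
kernel_size = 3          # assumed kernel size

# Output size of a stride-1, unpadded convolution
out_h = height - kernel_size + 1
out_w = width - kernel_size + 1

# Convolution: one shared 3x3 kernel plus one bias
conv_params = kernel_size * kernel_size + 1

# Fully connected: every input pixel connects to every output unit, plus biases
dense_params = (height * width) * (out_h * out_w) + (out_h * out_w)

print(conv_params)   # 10
print(dense_params)  # 530660
```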

📍 Translation Invariance

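The same kernel detects a feature wherever it appears in the image. Strictly speaking, convolution is translation-equivariant (shifting the input shifts the output by the same amount); combined with pooling, this gives approximate translation invariance. A cat is a cat anywhere!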
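
This property can be checked directly: if a feature moves two pixels to the right, the kernel's response moves two pixels to the right as well. The sketch below assumes stride 1 and no padding; `conv2d` and `shift_right` are hypothetical helpers written for this example.

```python
def conv2d(image, kernel):
    """Stride-1, unpadded sliding-window sum of products."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[i + m][j + n] * kernel[m][n]
                 for m in range(kh) for n in range(kw))
             for j in range(len(image[0]) - kw + 1)]
            for i in range(len(image) - kh + 1)]


def shift_right(matrix, k):
    """Shift every row k columns to the right, filling with zeros on the left."""
    return [[0] * k + row[:-k] for row in matrix]


# A 2x2 bright blob on a dark background
image = [[0, 0, 0, 0, 0, 0],
         [0, 1, 1, 0, 0, 0],
         [0, 1, 1, 0, 0, 0],
         [0, 0, 0, 0, 0, 0]]
kernel = [[1, 1],
          [1, 1]]

shifted_image = shift_right(image, 2)

response = conv2d(image, kernel)
shifted_response = conv2d(shifted_image, kernel)

print(response[1])          # [2, 4, 2, 0, 0]
print(shifted_response[1])  # [0, 0, 2, 4, 2]
print(shifted_response == shift_right(response, 2))  # True
```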

3. Key Concepts

Stride

The number of pixels the kernel moves at each step. With stride 1 the kernel moves one pixel at a time; with stride 2 it moves two pixels, skipping every other position.

Output size along each dimension (without padding): $$ \text{Output Size} = \left\lfloor \frac{\text{Input Size} - \text{Kernel Size}}{\text{Stride}} \right\rfloor + 1 $$
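
The output size formula translates directly into integer arithmetic. The helper name `output_size` is a hypothetical choice for this sketch, and the formula is applied to one dimension at a time.

```python
def output_size(input_size, kernel_size, stride=1):
    """Number of kernel positions along one dimension (no padding)."""
    return (input_size - kernel_size) // stride + 1


print(output_size(5, 3, stride=1))   # 3
print(output_size(5, 3, stride=2))   # 2
print(output_size(28, 5, stride=2))  # 12
```

Floor division (`//`) plays the role of the floor brackets: kernel positions that would run past the border are simply dropped.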

Padding

Adding zeros around the input border to control the output size. "SAME" padding adds enough zeros to keep the output the same size as the input (at stride 1); "VALID" uses no padding, so the output shrinks.
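
As a rough sketch of how padding changes the output size, the code below pads a small matrix with zeros and uses the standard padded size formula, floor((Input + 2·Pad − Kernel) / Stride) + 1. The helper names `zero_pad` and `padded_output_size` are hypothetical; for an odd kernel size at stride 1, a padding of (Kernel − 1) / 2 on each side keeps the size unchanged.

```python
def zero_pad(image, pad):
    """Surround `image` with `pad` rows and columns of zeros on every side."""
    width = len(image[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]
    for row in image:
        padded.append([0] * pad + list(row) + [0] * pad)
    padded.extend([0] * width for _ in range(pad))
    return padded


def padded_output_size(input_size, kernel_size, stride=1, pad=0):
    """Output size along one dimension with `pad` zeros added on each side."""
    return (input_size + 2 * pad - kernel_size) // stride + 1


print(zero_pad([[1, 2], [3, 4]], 1))
# [[0, 0, 0, 0], [0, 1, 2, 0], [0, 3, 4, 0], [0, 0, 0, 0]]

print(padded_output_size(5, 3, pad=0))  # 3  (no padding: output shrinks)
print(padded_output_size(5, 3, pad=1))  # 5  (one ring of zeros: size preserved)
```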

Common Kernels

Edge Detection (Horizontal): $$ K = \begin{bmatrix} -1 & -1 & -1 \\ 0 & 0 & 0 \\ 1 & 1 & 1 \end{bmatrix} $$

Sharpen: $$ K = \begin{bmatrix} 0 & -1 & 0 \\ -1 & 5 & -1 \\ 0 & -1 & 0 \end{bmatrix} $$

Gaussian Blur: $$ K = \frac{1}{16}\begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix} $$
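
One way to build intuition for these kernels is to apply each to a single 3×3 patch. In the sketch below, the patch values and the helper `apply_at_patch` are assumptions made for illustration. The edge kernel's weights sum to 0, so it gives no response on a flat region, while the sharpen and blur kernels sum to 1 and leave a flat region unchanged.

```python
edge_horizontal = [[-1, -1, -1],
                   [0, 0, 0],
                   [1, 1, 1]]
sharpen = [[0, -1, 0],
           [-1, 5, -1],
           [0, -1, 0]]
gaussian_blur = [[w / 16 for w in row]
                 for row in [[1, 2, 1], [2, 4, 2], [1, 2, 1]]]


def apply_at_patch(patch, kernel):
    """Sum of element-wise products between one 3x3 patch and a 3x3 kernel."""
    return sum(patch[m][n] * kernel[m][n] for m in range(3) for n in range(3))


flat_patch = [[10, 10, 10],
              [10, 10, 10],
              [10, 10, 10]]
step_patch = [[0, 0, 0],      # dark rows above
              [0, 0, 0],
              [10, 10, 10]]   # bright row below: a horizontal edge

print(apply_at_patch(flat_patch, edge_horizontal))  # 0
print(apply_at_patch(step_patch, edge_horizontal))  # 30
print(apply_at_patch(flat_patch, sharpen))          # 10
print(apply_at_patch(flat_patch, gaussian_blur))    # 10.0
```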

4. Interactive Convolution Animation

[Interactive animation: a 3×3 kernel slides across a 5×5 input, with controls for stride and kernel type; the computation at the current position is shown at each step.]

5. Numerical Example

Input: 5×5 matrix, Kernel: 3×3 edge detector, Stride: 1

Input I: $$ I = \begin{bmatrix} 1 & 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 \end{bmatrix} $$

Kernel K (Vertical Edge): $$ K = \begin{bmatrix} 1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \end{bmatrix} $$

Computation at position (0,0):
Sum = (1×1 + 1×0 + 1×(-1)) + (1×1 + 1×0 + 1×(-1)) + (1×1 + 1×0 + 1×(-1))
= (1 + 0 - 1) + (1 + 0 - 1) + (1 + 0 - 1)
= 0

Computation at position (0,1):
Sum = (1×1 + 1×0 + 0×(-1)) + (1×1 + 1×0 + 0×(-1)) + (1×1 + 1×0 + 0×(-1))
= (1 + 0 + 0) + (1 + 0 + 0) + (1 + 0 + 0)
= 3

The output detects the vertical edge at the boundary!
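
The hand computation can be extended to every position with a short script. This is a sketch assuming stride 1 and no padding, using a hypothetical `conv2d` helper; the input and kernel are the ones from the example.

```python
def conv2d(image, kernel):
    """Stride-1, unpadded sliding-window sum of products."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[i + m][j + n] * kernel[m][n]
                 for m in range(kh) for n in range(kw))
             for j in range(len(image[0]) - kw + 1)]
            for i in range(len(image) - kh + 1)]


I = [[1, 1, 1, 0, 0] for _ in range(5)]   # left three columns bright, right two dark
K = [[1, 0, -1] for _ in range(3)]        # vertical edge kernel

output = conv2d(I, K)
for row in output:
    print(row)
# [0, 3, 3]
# [0, 3, 3]
# [0, 3, 3]
```

The first column is 0 because that window covers only the uniform region of ones; the other two columns are 3 because their windows straddle the transition from 1 to 0.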

6. Building CNNs

Convolutional Neural Networks stack multiple convolution layers:

Typical CNN Architecture: $$ \text{Input Image} \to [\text{Conv} \to \text{ReLU} \to \text{Pool}] \times N \to \text{Flatten} \to \text{Dense} \to \text{Output} $$

Early layers: Learn low-level features (edges, textures)
Middle layers: Learn mid-level features (patterns, shapes)
Deep layers: Learn high-level features (object parts, faces)
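
The pipeline above can be traced end to end on a toy input. The sketch below is a single Conv → ReLU → Pool stage followed by Flatten and Dense, written in plain Python. All sizes and weights here are assumptions for illustration (a 6×6 input, a fixed vertical edge kernel, 2×2 max pooling, and hand-picked dense weights); in a real CNN the kernel and dense weights are learned during training.

```python
def conv2d(image, kernel):
    """Stride-1, unpadded sliding-window sum of products."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[i + m][j + n] * kernel[m][n]
                 for m in range(kh) for n in range(kw))
             for j in range(len(image[0]) - kw + 1)]
            for i in range(len(image) - kh + 1)]


def relu(feature_map):
    """Replace negative values with zero."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size block."""
    return [[max(feature_map[i + m][j + n]
                 for m in range(size) for n in range(size))
             for j in range(0, len(feature_map[0]) - size + 1, size)]
            for i in range(0, len(feature_map) - size + 1, size)]


def flatten(feature_map):
    """Concatenate the rows into one vector."""
    return [v for row in feature_map for v in row]


def dense(vector, weights, bias):
    """One fully connected output unit: weighted sum plus bias."""
    return sum(v * w for v, w in zip(vector, weights)) + bias


image = [[1, 1, 1, 0, 0, 0] for _ in range(6)]  # assumed toy input
kernel = [[1, 0, -1] for _ in range(3)]         # fixed here, learned in practice

features = conv2d(image, kernel)   # 4x4, each row [0, 3, 3, 0]
activated = relu(features)         # unchanged: no negative values
pooled = max_pool(activated)       # [[3, 3], [3, 3]]
vector = flatten(pooled)           # [3, 3, 3, 3]
score = dense(vector, weights=[0.25, 0.25, 0.25, 0.25], bias=0.0)

print(pooled)  # [[3, 3], [3, 3]]
print(score)   # 3.0
```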