Pooling Layers
Downsampling in CNNs
Pooling layers reduce the spatial dimensions of feature maps by aggregating values within local windows. They typically follow convolutional layers, making networks more efficient and more robust to small spatial shifts.
Why Pool?
- 📉 Reduce parameters: Smaller feature maps mean fewer weights in later layers (pooling itself has none)
- 🎯 Invariance: Small shifts don't change features
- 🔍 Zoom out: Capture larger patterns
- ⚡ Faster: Fewer values to process
Max Pooling
Takes the maximum value in a local window. Most common pooling operation.
Example: 2×2 Max Pooling with stride 2
Input (4×4):
 1  2 |  5  6
 3  4 |  7  8
------+------
 9 10 | 13 14
11 12 | 15 16
Output (2×2):
4 8
12 16
Each 2×2 window keeps only its maximum value, so the strongest feature response in each region "wins".
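The worked example above can be reproduced in a few lines of NumPy. `max_pool2d` here is a hypothetical helper written for illustration, not a library function:

```python
import numpy as np

def max_pool2d(x, pool=2, stride=2):
    """Max-pool a 2D array (illustrative sketch, not a library API)."""
    h_out = (x.shape[0] - pool) // stride + 1
    w_out = (x.shape[1] - pool) // stride + 1
    out = np.empty((h_out, w_out), dtype=x.dtype)
    for i in range(h_out):
        for j in range(w_out):
            # each output cell keeps the max of one pool×pool window
            out[i, j] = x[i * stride:i * stride + pool,
                          j * stride:j * stride + pool].max()
    return out

x = np.array([[ 1,  2,  5,  6],
              [ 3,  4,  7,  8],
              [ 9, 10, 13, 14],
              [11, 12, 15, 16]])
print(max_pool2d(x))
# [[ 4  8]
#  [12 16]]
```

Frameworks like PyTorch or TensorFlow implement this with vectorized kernels, but the sliding-window logic is the same.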
Average Pooling
Takes the average of values in a window. Smoother but less commonly used.
- ✅ Smoother feature reduction
- ❌ Loses information about peak values
- ❌ Generally performs worse than max pooling in hidden layers (though global average pooling is common just before the classifier)
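Average pooling is the same sliding-window loop with the mean instead of the max. On the 4×4 input from the max-pooling example, each window's average replaces its peak (`avg_pool2d` is again a helper name chosen for this sketch):

```python
import numpy as np

def avg_pool2d(x, pool=2, stride=2):
    """Average-pool a 2D array (illustrative sketch, not a library API)."""
    h_out = (x.shape[0] - pool) // stride + 1
    w_out = (x.shape[1] - pool) // stride + 1
    out = np.empty((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            # mean of the window instead of its max
            out[i, j] = x[i * stride:i * stride + pool,
                          j * stride:j * stride + pool].mean()
    return out

x = np.array([[ 1,  2,  5,  6],
              [ 3,  4,  7,  8],
              [ 9, 10, 13, 14],
              [11, 12, 15, 16]])
print(avg_pool2d(x))
# [[ 2.5  6.5]
#  [10.5 14.5]]
```

Note how the peak values (4, 8, 12, 16) are diluted by their neighbors, which is exactly the information loss the bullets above describe.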
Parameters
- Pool size: Usually 2×2 or 3×3. Larger windows → more downsampling
- Stride: How much to move the window. Stride=2 with 2×2 pool means no overlap
- Padding: Whether to pad edges (typically "valid" = no padding)
Effect on Dimensions
For a feature map of size $H \times W$, with pool size $P$ and stride $S$:
$$H_{out} = \left\lfloor \frac{H - P}{S} \right\rfloor + 1$$ $$W_{out} = \left\lfloor \frac{W - P}{S} \right\rfloor + 1$$
Example: 28×28 input, 2×2 pooling, stride 2 → 14×14 output
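The floor formula translates directly into code; `pooled_size` is a name introduced here for illustration:

```python
def pooled_size(h, w, pool, stride):
    # floor((dim - pool) / stride) + 1, per the formula above
    return ((h - pool) // stride + 1, (w - pool) // stride + 1)

print(pooled_size(28, 28, 2, 2))  # (14, 14), matching the example above
print(pooled_size(28, 28, 3, 2))  # (13, 13): 3×3 pool with stride 2
```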
Typical Architecture
In modern CNNs, pooling alternates with convolution:
- Conv (extract features) → ReLU → Pool (downsample)
- Conv → ReLU → Pool
- Conv → ReLU → Pool
- Flatten → Dense layers → Classification
Pattern: Each pooling layer halves the spatial dimensions, so deeper layers see a larger effective receptive field.
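The halving pattern can be traced numerically with the dimension formula, assuming 'same'-padded convolutions (which preserve spatial size) and 2×2, stride-2 pools; the 32×32 input is an arbitrary choice for illustration:

```python
size, sizes = 32, []  # e.g. a 32×32 input image
for stage in range(3):
    # conv with 'same' padding keeps the size; each 2×2 stride-2 pool
    # then applies floor((size - 2) / 2) + 1, halving it
    size = (size - 2) // 2 + 1
    sizes.append(size)

print(sizes)  # [16, 8, 4]
```

After three conv/pool stages, each surviving value summarizes an 8×8 patch of the original input, which is why deeper filters can detect larger patterns.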
Modern Trends
- 🔄 Stride in conv: Some architectures use strided convolution instead of pooling
- 🚀 Vision Transformers: Replace pooling with patch embeddings and self-attention
- 📊 Skip connections: Allow information from early layers to bypass pooling
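To see why strided convolution can replace pooling, here is a minimal NumPy sketch of a valid cross-correlation with stride (`strided_conv2d` and the averaging kernel are illustrative choices, not from the text): it produces the same 2×2 output grid as a 2×2, stride-2 pool, but its kernel weights are learnable in a real network.

```python
import numpy as np

def strided_conv2d(x, k, stride=2):
    """Valid cross-correlation with stride; a minimal single-channel sketch."""
    kh, kw = k.shape
    h_out = (x.shape[0] - kh) // stride + 1
    w_out = (x.shape[1] - kw) // stride + 1
    out = np.empty((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            # weighted sum over one kh×kw window
            window = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = (window * k).sum()
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
k = np.full((2, 2), 0.25)  # uniform weights, chosen so this mimics avg pooling
y = strided_conv2d(x, k)
print(y.shape)  # (2, 2), the same downsampling as a 2×2, stride-2 pool
```

With these particular uniform weights the result equals average pooling; training would instead learn whatever downsampling filter helps the task.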