Pooling Layers
Downsampling in CNNs
Pooling layers reduce the spatial dimensions of feature maps by aggregating values within local windows. They typically follow convolutional layers, making networks more efficient and more robust to small spatial shifts.
Why Pool?
- 📉 Reduce parameters: Smaller feature maps mean fewer weights in later layers (pooling itself has none)
- 🎯 Invariance: Small shifts don't change features
- 🔍 Zoom out: Capture larger patterns
- ⚡ Faster: Fewer values to process
Max Pooling
Takes the maximum value in a local window. Most common pooling operation.
Example: 2×2 Max Pooling with stride 2
Input (4×4):
 1  2 |  5  6
 3  4 |  7  8
------+------
 9 10 | 13 14
11 12 | 15 16
Output (2×2):
4 8
12 16
Each 2×2 window keeps only its maximum value, so the strongest feature response in each region "wins".
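The worked example above can be reproduced in a few lines of NumPy. `max_pool2d` here is a hypothetical helper written for illustration, not a library function:

```python
import numpy as np

def max_pool2d(x, pool=2, stride=2):
    """Max-pool a 2D array (illustrative sketch, not a library API)."""
    h_out = (x.shape[0] - pool) // stride + 1
    w_out = (x.shape[1] - pool) // stride + 1
    out = np.empty((h_out, w_out), dtype=x.dtype)
    for i in range(h_out):
        for j in range(w_out):
            # each output cell keeps the max of one pool×pool window
            out[i, j] = x[i * stride:i * stride + pool,
                          j * stride:j * stride + pool].max()
    return out

x = np.array([[ 1,  2,  5,  6],
              [ 3,  4,  7,  8],
              [ 9, 10, 13, 14],
              [11, 12, 15, 16]])
print(max_pool2d(x))
# [[ 4  8]
#  [12 16]]
```

Frameworks like PyTorch or TensorFlow implement this with vectorized kernels, but the sliding-window logic is the same.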
Average Pooling
Takes the average of values in a window. Smoother but less commonly used.
- ✅ Smoother feature reduction
- ❌ Loses information about peak values
- ❌ Generally performs worse than max pooling in hidden layers (though global average pooling is common just before the classifier)
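Average pooling is the same sliding-window loop with the mean instead of the max. On the 4×4 input from the max-pooling example, each window's average replaces its peak (`avg_pool2d` is again a helper name chosen for this sketch):

```python
import numpy as np

def avg_pool2d(x, pool=2, stride=2):
    """Average-pool a 2D array (illustrative sketch, not a library API)."""
    h_out = (x.shape[0] - pool) // stride + 1
    w_out = (x.shape[1] - pool) // stride + 1
    out = np.empty((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            # mean of the window instead of its max
            out[i, j] = x[i * stride:i * stride + pool,
                          j * stride:j * stride + pool].mean()
    return out

x = np.array([[ 1,  2,  5,  6],
              [ 3,  4,  7,  8],
              [ 9, 10, 13, 14],
              [11, 12, 15, 16]])
print(avg_pool2d(x))
# [[ 2.5  6.5]
#  [10.5 14.5]]
```

Note how the peak values (4, 8, 12, 16) are diluted by their neighbors, which is exactly the information loss the bullets above describe.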
Parameters
- Pool size: Usually 2×2 or 3×3. Larger windows → more downsampling
- Stride: How much to move the window. Stride=2 with 2×2 pool means no overlap
- Padding: Whether to pad edges (typically "valid" = no padding)
Effect on Dimensions
For a feature map of size $H \times W$, with pool size $P$ and stride $S$:
$$H_{out} = \left\lfloor \frac{H - P}{S} \right\rfloor + 1$$ $$W_{out} = \left\lfloor \frac{W - P}{S} \right\rfloor + 1$$
Example: 28×28 input, 2×2 pooling, stride 2 → 14×14 output
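The floor formula translates directly into code; `pooled_size` is a name introduced here for illustration:

```python
def pooled_size(h, w, pool, stride):
    # floor((dim - pool) / stride) + 1, per the formula above
    return ((h - pool) // stride + 1, (w - pool) // stride + 1)

print(pooled_size(28, 28, 2, 2))  # (14, 14), matching the example above
print(pooled_size(28, 28, 3, 2))  # (13, 13): 3×3 pool with stride 2
```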
Typical Architecture
In modern CNNs, pooling alternates with convolution:
- Conv (extract features) → ReLU → Pool (downsample)
- Conv → ReLU → Pool
- Conv → ReLU → Pool
- Flatten → Dense layers → Classification
Pattern: Each pooling layer halves the spatial dimensions, so deeper layers see a larger effective receptive field.
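The halving pattern can be traced numerically with the dimension formula, assuming 'same'-padded convolutions (which preserve spatial size) and 2×2, stride-2 pools; the 32×32 input is an arbitrary choice for illustration:

```python
size, sizes = 32, []  # e.g. a 32×32 input image
for stage in range(3):
    # conv with 'same' padding keeps the size; each 2×2 stride-2 pool
    # then applies floor((size - 2) / 2) + 1, halving it
    size = (size - 2) // 2 + 1
    sizes.append(size)

print(sizes)  # [16, 8, 4]
```

After three conv/pool stages, each surviving value summarizes an 8×8 patch of the original input, which is why deeper filters can detect larger patterns.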
Modern Trends
- 🔄 Stride in conv: Some architectures use strided convolution instead of pooling
- 🚀 Vision Transformers: Replace pooling with patch embeddings and self-attention
- 📊 Skip connections: Allow information from early layers to bypass pooling
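To see why strided convolution can replace pooling, here is a minimal NumPy sketch of a valid cross-correlation with stride (`strided_conv2d` and the averaging kernel are illustrative choices, not from the text): it produces the same 2×2 output grid as a 2×2, stride-2 pool, but its kernel weights are learnable in a real network.

```python
import numpy as np

def strided_conv2d(x, k, stride=2):
    """Valid cross-correlation with stride; a minimal single-channel sketch."""
    kh, kw = k.shape
    h_out = (x.shape[0] - kh) // stride + 1
    w_out = (x.shape[1] - kw) // stride + 1
    out = np.empty((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            # weighted sum over one kh×kw window
            window = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = (window * k).sum()
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
k = np.full((2, 2), 0.25)  # uniform weights, chosen so this mimics avg pooling
y = strided_conv2d(x, k)
print(y.shape)  # (2, 2), the same downsampling as a 2×2, stride-2 pool
```

With these particular uniform weights the result equals average pooling; training would instead learn whatever downsampling filter helps the task.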