Activation Functions
The Nonlinearity that Powers Neural Networks
Why We Need Activation Functions
Without activation functions, a neural network would just be a stack of linear (strictly, affine) transformations, and any such stack collapses into a single affine map. Activation functions introduce nonlinearity, allowing networks to learn complex patterns.
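As a quick check of this claim, here is a minimal plain-Python sketch (toy scalar weights, not from the original text) showing that two stacked affine layers without a nonlinearity are exactly one affine map:

```python
# Sketch: composing two affine layers with no nonlinearity in between
# collapses into a single affine map. Weights are arbitrary toy values.
W1, b1 = 2.0, 1.0   # first layer:  h = W1*x + b1
W2, b2 = -3.0, 0.5  # second layer: y = W2*h + b2

def two_layer(x):
    h = W1 * x + b1
    return W2 * h + b2

# The same map written as one affine function: y = (W2*W1)*x + (W2*b1 + b2)
def collapsed(x):
    return (W2 * W1) * x + (W2 * b1 + b2)

for x in (-1.0, 0.0, 3.5):
    assert abs(two_layer(x) - collapsed(x)) < 1e-12
```

The same algebra holds for matrices, which is why depth alone adds no expressive power without a nonlinearity between layers.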
1. Sigmoid: The Classic S-Curve
Pros: Smooth gradient, output interpretable as a probability
Cons: Vanishing gradients (when |x| is large, gradient → 0), outputs not zero-centered
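A minimal plain-Python sketch of the sigmoid and its derivative (function names and sample inputs are illustrative) makes the vanishing-gradient problem concrete:

```python
import math

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^(-x)); output lies in (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # derivative: sigma(x) * (1 - sigma(x)); maximal (0.25) at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(0.0))        # 0.5
print(sigmoid_grad(0.0))   # 0.25
print(sigmoid_grad(10.0))  # ~4.5e-5: gradient vanishes for large |x|
```

Note also that the gradient never exceeds 0.25, so even at its best the sigmoid shrinks backpropagated signals.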
2. Tanh: Zero-Centered Sigmoid
Pros: Zero-centered outputs (better for optimization)
Cons: Still suffers from vanishing gradients
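The same sketch for tanh (illustrative names and inputs) shows the zero-centered output and the persisting vanishing gradient:

```python
import math

def tanh_grad(x):
    # derivative: 1 - tanh(x)^2; like the sigmoid's, it vanishes for large |x|
    t = math.tanh(x)
    return 1.0 - t * t

print(math.tanh(0.0))  # 0.0 -> outputs are centered on zero
print(tanh_grad(0.0))  # 1.0
print(tanh_grad(5.0))  # ~1.8e-4: vanishing gradient persists
```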
3. ReLU: The Modern Standard
Pros: No vanishing gradient for positive values, computationally efficient, sparse activation
Cons: "Dying ReLU" problem (neurons can get stuck at 0)
4. Leaky ReLU: Fixing Dying Neurons
Definition: $$ \text{LeakyReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{otherwise} \end{cases} $$
Derivative: $$ \text{LeakyReLU}'(x) = \begin{cases} 1 & \text{if } x > 0 \\ \alpha & \text{otherwise} \end{cases} $$
Pros: Allows small gradient when x < 0, prevents dead neurons
Cons: Adds hyperparameter \( \alpha \)
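A plain-Python sketch, using the common default α = 0.01 (the text leaves α unspecified):

```python
def leaky_relu(x, alpha=0.01):
    # alpha is the negative-slope hyperparameter; 0.01 is a common default
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    # matches the derivative above: 1 for x > 0, alpha otherwise
    return 1.0 if x > 0 else alpha

print(leaky_relu(-5.0))       # -0.05
print(leaky_relu_grad(-5.0))  # 0.01: small but nonzero, so the neuron keeps learning
```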
5. ELU: Smooth Negative Values
Pros: Smooth everywhere, negative values push mean activation toward zero
Cons: Slightly more computation due to exponential
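A sketch of ELU with the usual α = 1 (an assumption; the text does not fix α), showing the smooth saturation on the negative side:

```python
import math

def elu(x, alpha=1.0):
    # ELU(x) = x for x > 0, alpha * (e^x - 1) otherwise;
    # negative inputs saturate smoothly toward -alpha instead of clipping to 0
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

print(elu(2.0))    # 2.0
print(elu(-1.0))   # ~-0.632
print(elu(-10.0))  # ~-0.99995: saturates near -alpha
```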
6. GELU: Used in Transformers
Approximation: $$ \text{GELU}(x) \approx 0.5x\left(1 + \tanh\left[\sqrt{2/\pi}(x + 0.044715x^3)\right]\right) $$
Use Case: BERT, GPT, and modern transformers
Pros: Smooth, non-monotonic, weights inputs by their magnitude
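Transcribing the tanh approximation above into plain Python (names illustrative):

```python
import math

def gelu(x):
    # tanh approximation from the text:
    # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

print(gelu(0.0))   # 0.0
print(gelu(1.0))   # ~0.841
print(gelu(-1.0))  # ~-0.159: small negative inputs pass through attenuated,
                   # giving the non-monotonic dip below zero
```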
7. Swish/SiLU: Self-Gated
Origin: found via automated search by Google Brain researchers (Ramachandran et al., 2017)
Pros: Smooth, non-monotonic, often outperforms ReLU
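Swish/SiLU is defined as x · σ(x), i.e. the input gates itself through a sigmoid; a plain-Python sketch (names illustrative):

```python
import math

def silu(x):
    # Swish/SiLU(x) = x * sigmoid(x): the input gates itself
    return x / (1.0 + math.exp(-x))

print(silu(0.0))   # 0.0
print(silu(5.0))   # ~4.97: approaches the identity for large positive x
print(silu(-1.0))  # ~-0.269: the small dip below zero makes it non-monotonic
```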
8. Interactive Comparison
[Interactive plot in the original page: shapes and derivatives of the activation functions above.]
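As a static stand-in for the interactive comparison, this sketch evaluates each activation on a few sample inputs and prints a small table (all definitions as given in the sections above; α values are common defaults):

```python
import math

# Evaluate each activation on a handful of inputs for side-by-side comparison.
acts = {
    "sigmoid": lambda x: 1 / (1 + math.exp(-x)),
    "tanh": math.tanh,
    "relu": lambda x: max(0.0, x),
    "leaky_relu": lambda x: x if x > 0 else 0.01 * x,   # alpha = 0.01
    "elu": lambda x: x if x > 0 else math.exp(x) - 1,   # alpha = 1
    "gelu": lambda x: 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi)
                                               * (x + 0.044715 * x ** 3))),
    "silu": lambda x: x / (1 + math.exp(-x)),
}

xs = [-2.0, -0.5, 0.0, 0.5, 2.0]
print(f"{'x':>6}" + "".join(f"{name:>12}" for name in acts))
for x in xs:
    print(f"{x:>6.1f}" + "".join(f"{f(x):>12.4f}" for f in acts.values()))
```

Reading across a row shows how differently the functions treat the same negative input; reading down a column shows each function's shape.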
9. Choosing the Right Activation
Hidden Layers
Use ReLU as the default. Try Leaky ReLU or ELU if you run into dying neurons; use GELU or Swish in transformer-style architectures.
Output Layer (Classification)
Use sigmoid for binary classification and softmax for multiclass; both produce outputs interpretable as probabilities.
Output Layer (Regression)
Linear (no activation) for unbounded outputs, ReLU for positive values only.
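To ground the classification recipe above, here is a plain-Python softmax sketch (the binary case is just the sigmoid from section 1; values are illustrative):

```python
import math

def softmax(logits):
    # subtract the max logit for numerical stability before exponentiating,
    # then normalize so the outputs sum to 1 and can be read as probabilities
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # largest logit -> largest probability
print(sum(probs))  # 1.0 (up to floating-point error)
```

For regression, by contrast, the final layer's raw linear output is used directly, with no such squashing.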