Activation Functions

The Nonlinearity that Powers Neural Networks

Why We Need Activation Functions

Without activation functions, a neural network would just be a series of linear transformations, which collapse into a single linear function. Activation functions introduce nonlinearity, allowing networks to learn complex patterns.

💡 Key Insight: \( f(g(x)) \) where both \( f \) and \( g \) are linear is still linear! We need nonlinear activations to approximate nonlinear functions.
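The collapse is easy to verify numerically. A minimal NumPy sketch (layer sizes, seed, and variable names are arbitrary choices for illustration):

```python
import numpy as np

# Two "layers" with no activation between them: y = W2 @ (W1 @ x + b1) + b2
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

# Fold both layers into a single linear map: y = W @ x + b
W = W2 @ W1
b = W2 @ b1 + b2

x = rng.normal(size=3)
two_layer = W2 @ (W1 @ x + b1) + b2
one_layer = W @ x + b
print(np.allclose(two_layer, one_layer))  # the stack is just one linear layer
```

No matter how many linear layers you stack, the result is equivalent to one; only a nonlinearity between them adds expressive power.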

1. Sigmoid: The Classic S-Curve

Function: $$ \sigma(x) = \frac{1}{1 + e^{-x}} $$ Derivative: $$ \sigma'(x) = \sigma(x)(1 - \sigma(x)) $$ Range: (0, 1)

Pros: Smooth gradient, interpretable as probability
Cons: Vanishing gradients (when |x| is large, gradient → 0), outputs not zero-centered
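A small sketch of the sigmoid and its derivative (function names are ours), showing the vanishing-gradient behavior numerically:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient peaks at 0.25 (at x = 0) and collapses for large |x|
print(sigmoid_grad(0.0))   # 0.25
print(sigmoid_grad(10.0))  # ~4.5e-05: almost no signal to propagate
```

Since the derivative never exceeds 0.25, stacking many sigmoid layers multiplies small factors together, which is why deep sigmoid networks were hard to train.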

2. Tanh: Zero-Centered Sigmoid

Function: $$ \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = \frac{2}{1 + e^{-2x}} - 1 $$ Derivative: $$ \tanh'(x) = 1 - \tanh^2(x) $$ Range: (-1, 1)

Pros: Zero-centered outputs (better for optimization)
Cons: Still suffers from vanishing gradients
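The identity above relating tanh to the sigmoid can be checked directly; a short sketch (the `sigmoid` helper is ours):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3, 3, 7)
# tanh is a shifted, rescaled sigmoid: tanh(x) = 2*sigmoid(2x) - 1
print(np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1))  # True

# Its derivative 1 - tanh(x)^2 still vanishes for large |x|
print(1 - np.tanh(10.0) ** 2)  # ~8e-09
```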

3. ReLU: The Modern Standard

Function: $$ \text{ReLU}(x) = \max(0, x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases} $$ Derivative: $$ \text{ReLU}'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases} $$ Range: [0, ∞)

Pros: No vanishing gradient for positive values, computationally efficient, sparse activation
Cons: "Dying ReLU" problem (neurons can get stuck at 0)
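ReLU and its gradient are a one-liner each in NumPy; a sketch (we take the gradient at exactly x = 0 to be 0, a common convention):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # subgradient convention: 0 at x = 0
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # zero for all non-positive inputs
print(relu_grad(x))  # gradient is exactly 0 or 1

# A neuron whose pre-activation is always negative gets zero gradient
# on every example, so it can never recover: the "dying ReLU" problem.
```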

4. Leaky ReLU: Fixing Dying Neurons

Function: $$ \text{LeakyReLU}(x) = \max(\alpha x, x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{otherwise} \end{cases} $$ Typically: \( \alpha = 0.01 \)
Derivative: $$ \text{LeakyReLU}'(x) = \begin{cases} 1 & \text{if } x > 0 \\ \alpha & \text{otherwise} \end{cases} $$

Pros: Allows small gradient when x < 0, prevents dead neurons
Cons: Adds hyperparameter \( \alpha \)
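A sketch of Leaky ReLU with the typical \( \alpha = 0.01 \) (the constant name and helpers are ours):

```python
import numpy as np

ALPHA = 0.01  # typical leak slope

def leaky_relu(x, alpha=ALPHA):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=ALPHA):
    return np.where(x > 0, 1.0, alpha)

x = np.array([-100.0, -1.0, 2.0])
print(leaky_relu(x))       # negative inputs are scaled by alpha, not clipped
print(leaky_relu_grad(x))  # gradient is alpha (not 0) on the negative side
```

The nonzero slope on the negative side is the entire fix: even a neuron with all-negative pre-activations still receives a small gradient and can move back into the active region.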

5. ELU: Smooth Negative Values

Function: $$ \text{ELU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha(e^x - 1) & \text{otherwise} \end{cases} $$ Derivative: $$ \text{ELU}'(x) = \begin{cases} 1 & \text{if } x > 0 \\ \alpha e^x & \text{otherwise} \end{cases} $$

Pros: Smooth everywhere, negative values push mean activation toward zero
Cons: Slightly more computation due to exponential
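A sketch of ELU with the common choice \( \alpha = 1 \) (an assumption; the helpers are ours):

```python
import numpy as np

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def elu_grad(x, alpha=1.0):
    return np.where(x > 0, 1.0, alpha * np.exp(x))

# Negative inputs saturate smoothly toward -alpha instead of clipping to 0
print(elu(np.array([-5.0, -1.0, 1.0])))
```

With \( \alpha = 1 \) the function and its derivative agree at \( x = 0 \) (both branches give value 0 and slope 1), which is what makes ELU smooth where ReLU has a kink.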

6. GELU: Used in Transformers

Function: $$ \text{GELU}(x) = x \cdot \Phi(x) $$ where \( \Phi(x) \) is the CDF of the standard Gaussian distribution.
Approximation: $$ \text{GELU}(x) \approx 0.5x\left(1 + \tanh\left[\sqrt{2/\pi}(x + 0.044715x^3)\right]\right) $$

Use Case: BERT, GPT, and modern transformers
Pros: Smooth, non-monotonic, weights inputs by their magnitude
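The exact form and the tanh approximation above can be compared directly; a sketch using `math.erf` for the Gaussian CDF (the function names are ours):

```python
import numpy as np
from math import erf  # no SciPy needed for the exact normal CDF

def gelu_exact(x):
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))), the standard normal CDF
    phi = 0.5 * (1.0 + np.vectorize(erf)(x / np.sqrt(2.0)))
    return x * phi

def gelu_tanh(x):
    # The tanh approximation used in many transformer implementations
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-4, 4, 9)
print(np.max(np.abs(gelu_exact(x) - gelu_tanh(x))))  # small approximation error
```

Because \( \Phi(x) \) weights the input by how "typical" it is under a standard Gaussian, GELU smoothly interpolates between passing an input through and zeroing it out.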

7. Swish/SiLU: Self-Gated

Function: $$ \text{Swish}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}} $$ Derivative: $$ \text{Swish}'(x) = \text{Swish}(x) + \sigma(x)(1 - \text{Swish}(x)) $$

Origin: found via automated search by researchers at Google Brain; the same function was independently proposed earlier under the name SiLU
Pros: Smooth, non-monotonic, often outperforms ReLU
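The derivative formula above is easy to verify against a numerical derivative; a sketch (helpers are ours):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    return x * sigmoid(x)

def swish_grad(x):
    # Swish'(x) = Swish(x) + sigma(x) * (1 - Swish(x))
    s = sigmoid(x)
    f = swish(x)
    return f + s * (1.0 - f)

# Check the closed form against a central-difference approximation
x = np.array([-2.0, 0.0, 3.0])
h = 1e-6
numeric = (swish(x + h) - swish(x - h)) / (2 * h)
print(np.allclose(swish_grad(x), numeric, atol=1e-5))  # True
```

Note the self-gating: the input \( x \) is multiplied by its own sigmoid, so the function decides how much of the input to pass based on the input itself.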

8. Comparing the Activations

The functions above trade off smoothness, gradient behavior, and computational cost; evaluating them on the same inputs makes the differences concrete.
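A side-by-side sketch evaluating each function on the same grid (the 0.01 leak, \( \alpha = 1 \) for ELU, and the grid itself are arbitrary choices):

```python
import numpy as np

# Each activation as a small callable, evaluated on a common grid.
acts = {
    "sigmoid":    lambda x: 1 / (1 + np.exp(-x)),
    "tanh":       np.tanh,
    "relu":       lambda x: np.maximum(0.0, x),
    "leaky_relu": lambda x: np.where(x > 0, x, 0.01 * x),
    "elu":        lambda x: np.where(x > 0, x, np.exp(x) - 1),
    "swish":      lambda x: x / (1 + np.exp(-x)),
}

xs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
for name, f in acts.items():
    print(f"{name:>10}: {np.round(f(xs), 3)}")
```

The printout makes the families visible at a glance: sigmoid squashes into (0, 1), tanh into (-1, 1), and the ReLU variants differ only in how they treat negative inputs.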

9. Choosing the Right Activation

Hidden Layers

Use ReLU as default. Try Leaky ReLU or ELU if you experience dying neurons. GELU/Swish for transformers.

Output Layer (Classification)

Sigmoid for binary, Softmax for multiclass. These give probabilities.
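As a sketch of the multiclass case, here is a numerically stable softmax (the logits are made-up scores, not from any real model):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # subtract the max so exp() cannot overflow
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])  # raw scores from a linear output layer
probs = softmax(logits)
print(probs, probs.sum())           # nonnegative probabilities summing to 1
```

Subtracting the maximum logit leaves the result unchanged (it cancels in the ratio) but keeps the exponentials in a safe range, a standard trick in practice.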

Output Layer (Regression)

Linear (no activation) for unbounded outputs, ReLU for positive values only.