Beginner · Deep Learning

Explore activation functions - the non-linear transformations that give neural networks their power to learn complex patterns.

Tags: neural-networks, relu, gelu, non-linearity

Activation Functions

Activation functions introduce non-linearity into neural networks. Without them, any deep network would collapse to a single linear transformation.

Why Non-linearity?

A stack of linear layers is still linear:

W₃(W₂(W₁x)) = (W₃W₂W₁)x = Wx

Activation functions break this, enabling networks to learn complex patterns.
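
A quick numpy sketch makes the collapse concrete (a minimal illustration, not a real network):

```python
import numpy as np

# Three weight matrices standing in for three linear "layers" with no
# activation function between them.
rng = np.random.default_rng(0)
W1, W2, W3 = (rng.standard_normal((4, 4)) for _ in range(3))
x = rng.standard_normal(4)

stacked = W3 @ (W2 @ (W1 @ x))    # applying the layers one by one
collapsed = (W3 @ W2 @ W1) @ x    # one precomputed linear map W

print(np.allclose(stacked, collapsed))  # True: the depth bought nothing
```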

Classic Activation Functions

Sigmoid

σ(x) = 1 / (1 + e^(-x))

Range: (0, 1)

Pros:

  • Smooth, differentiable
  • Output interpretable as probability

Cons:

  • Vanishing gradients: σ'(x) ≤ 0.25
  • Outputs not zero-centered
  • Expensive (exponential)

Use: Output layer for binary classification.
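
The vanishing-gradient bound follows from the derivative σ'(x) = σ(x)(1 − σ(x)), which peaks at 0.25. A minimal numpy check:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # σ'(x) = σ(x)(1 − σ(x))

xs = np.linspace(-10.0, 10.0, 1001)
print(sigmoid_grad(xs).max())  # 0.25, attained at x = 0
```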

Tanh

tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Range: (-1, 1)

Pros:

  • Zero-centered outputs
  • Stronger gradients than sigmoid

Cons:

  • Still has vanishing gradients
  • Expensive

Use: RNNs (historically), rarely in modern networks.
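
The "stronger gradients" claim can be checked directly: tanh'(x) = 1 − tanh²(x) reaches 1.0 at the origin (versus sigmoid's 0.25), but still collapses for large |x|:

```python
import numpy as np

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2   # tanh'(x) = 1 − tanh²(x)

print(tanh_grad(0.0))   # 1.0 — four times sigmoid's peak of 0.25
print(tanh_grad(5.0))   # ≈ 0.00018 — still vanishes in the saturated region
```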

Modern Activation Functions

ReLU (Rectified Linear Unit)

ReLU(x) = max(0, x)

Range: [0, ∞)

Pros:

  • Simple and fast
  • No vanishing gradient for x > 0
  • Sparsity (neurons can be "off")
  • Enabled training of very deep networks

Cons:

  • "Dying ReLU": neurons whose pre-activations stay negative output zero, get zero gradient, and stop updating
  • Not zero-centered
  • Unbounded (can explode)

Use: Default choice for hidden layers.
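
A minimal numpy sketch of ReLU and its gradient, showing how a "dead" neuron stops learning (illustrative values only):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)   # 1 for x > 0, else 0

# If a neuron's pre-activations stay negative, both its output and its
# gradient are zero, so its weights never receive an update.
pre_acts = np.array([-3.0, -1.5, -0.2])
print(relu(pre_acts))       # [0. 0. 0.]
print(relu_grad(pre_acts))  # [0. 0. 0.] — no learning signal
```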

Leaky ReLU

LeakyReLU(x) = x if x > 0 else αx

Typically α = 0.01

Pros:

  • Prevents dying ReLU
  • Almost as simple as ReLU

Cons:

  • α is another hyperparameter
  • Marginal improvement in practice

PReLU (Parametric ReLU)

PReLU(x) = x if x > 0 else αx

Same as Leaky ReLU but α is learned.
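
Both variants share the same form; only the treatment of α differs. A hedged numpy sketch (α fixed here, as in Leaky ReLU; PReLU would make it a trainable parameter):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # alpha is the small negative-side slope; learning alpha per channel
    # instead of fixing it turns this into PReLU.
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, 0.0, 3.0])
print(leaky_relu(x))   # negative inputs scaled by alpha instead of zeroed
```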

ELU (Exponential Linear Unit)

ELU(x) = x if x > 0 else α(e^x - 1)

Pros:

  • Smooth at x = 0
  • Negative values push mean toward zero
  • Self-normalizing properties

Cons:

  • More expensive than ReLU
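
A small numpy sketch of ELU with α = 1 (illustrative): the negative branch saturates smoothly at −α instead of clipping to zero.

```python
import numpy as np

def elu(x, alpha=1.0):
    # x for x > 0; alpha * (e^x − 1) for x ≤ 0, bounded below by -alpha
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

print(elu(np.array([-10.0, -1.0, 2.0])))  # ≈ [-0.99995, -0.632, 2.0]
```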

SELU (Scaled ELU)

SELU(x) = λ × ELU(x, α)

With fixed constants λ ≈ 1.0507 and α ≈ 1.6733, chosen so that activations keep zero mean and unit variance across layers.

Use: Self-normalizing networks (no batch norm needed).
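
A numpy sketch using the published SELU constants: standard-normal inputs come out with approximately zero mean and unit variance, which is the fixed point the self-normalizing property relies on (empirical check, not a proof):

```python
import numpy as np

# Published SELU constants (Klambauer et al., 2017)
LAMBDA, ALPHA = 1.0507009873554805, 1.6732632423543772

def selu(x):
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

z = np.random.default_rng(0).standard_normal(1_000_000)
out = selu(z)
print(out.mean(), out.std())  # both stay close to 0 and 1
```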

GELU (Gaussian Error Linear Unit)

GELU(x) = x × Φ(x)

Where Φ is the Gaussian CDF.

Pros:

  • Smooth
  • Works exceptionally well in transformers
  • Combines properties of ReLU and dropout

Use: Transformers, modern architectures (BERT, GPT).
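
A numpy sketch of exact GELU alongside the tanh approximation used in many implementations (e.g. the original BERT code); the two agree closely over typical input ranges:

```python
import math
import numpy as np

def gelu_exact(x):
    # Exact GELU: x · Φ(x), with Φ(x) = 0.5 (1 + erf(x / √2))
    phi = np.vectorize(lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0))))
    return x * phi(x)

def gelu_tanh(x):
    # Common tanh approximation
    return 0.5 * x * (1.0 + np.tanh(math.sqrt(2.0 / math.pi)
                                    * (x + 0.044715 * x ** 3)))

x = np.linspace(-4.0, 4.0, 9)
print(np.max(np.abs(gelu_exact(x) - gelu_tanh(x))))  # small approximation error
```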

Swish / SiLU

Swish(x) = x × σ(βx)

Often with β = 1.

Pros:

  • Smooth, bounded below
  • Sometimes outperforms ReLU

Use: EfficientNet, modern CNNs.
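
With β = 1, Swish is just x · σ(x) (SiLU). A quick numpy check of "bounded below": the function dips to a global minimum of about −0.28 near x ≈ −1.28 and never goes lower:

```python
import numpy as np

def silu(x):
    # Swish with beta = 1: x · sigmoid(x)
    return x / (1.0 + np.exp(-x))

xs = np.linspace(-10.0, 10.0, 100_001)
print(silu(xs).min())  # ≈ -0.278, the global minimum
```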

Comparison Table

Function     Range        Zero-centered   Vanishing gradient   Dead neurons
Sigmoid      (0, 1)       No              Yes                  No
Tanh         (-1, 1)      Yes             Yes                  No
ReLU         [0, ∞)       No              No*                  Yes
Leaky ReLU   (-∞, ∞)      No              No                   No
ELU          (-α, ∞)      ~Yes            No                   No
GELU         (≈-0.17, ∞)  No              No                   No

* No vanishing gradient for x > 0; the gradient is exactly 0 for x ≤ 0.

Output Layer Activations

Different tasks need different output activations:

Task                         Activation       Loss
Binary classification        Sigmoid          Binary cross-entropy
Multi-class classification   Softmax          Categorical cross-entropy
Regression                   None (linear)    MSE
Bounded regression           Sigmoid / Tanh   MSE
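
A minimal sketch of softmax for the multi-class case, with the standard max-subtraction trick for numerical stability (illustrative logits):

```python
import numpy as np

def softmax(logits):
    # Subtracting the max logit leaves the result unchanged but keeps
    # np.exp from overflowing on large inputs.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs, probs.sum())  # three probabilities summing to 1
```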

Choosing Activation Functions

Hidden Layers (2024 recommendations)

  1. Default: ReLU
  2. Transformers: GELU
  3. If ReLU dies: Leaky ReLU or ELU
  4. Cutting edge: Swish, Mish

Don't

  • Use sigmoid/tanh in hidden layers (vanishing gradients)
  • Mix too many different activations
  • Overthink it (ReLU works great)

The Gradient Flow Perspective

Backprop multiplies the upstream gradient by each layer's local derivative:

gradient = upstream_gradient × local_derivative

Activation   Derivative at typical inputs
Sigmoid      0.1 – 0.25 (shrinks the gradient)
ReLU         0 or 1 (preserves or kills it)
GELU         ~0 to ~1.1 (smooth)

ReLU's derivative of exactly 1 for positive inputs is a key reason very deep networks became trainable.
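
The effect compounds with depth. Even at sigmoid's best-case input (x = 0, derivative 0.25), ten layers multiply the gradient down to under 10⁻⁶, while ReLU's derivative of 1 on active units leaves it untouched:

```python
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

depth = 10
# Best case for sigmoid: every layer contributes its maximum derivative 0.25.
print(sigmoid_grad(0.0) ** depth)   # 0.25^10 = 2^-20 ≈ 9.5e-07
# ReLU on active (positive-input) units: the gradient passes through intact.
print(1.0 ** depth)                 # 1.0
```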

Key Takeaways

  1. Activations add non-linearity - essential for deep learning
  2. ReLU is the default for hidden layers
  3. GELU is standard in transformers
  4. Sigmoid/softmax for output layers (classification)
  5. Vanishing gradients killed sigmoid/tanh for hidden layers
  6. Modern activations (GELU, Swish) are smooth variants of ReLU