Cross-Entropy Loss
Cross-entropy loss is the go-to loss function for classification problems. It measures how far a model's predicted probability distribution is from the true label distribution.
The Formula
Binary Cross-Entropy
For binary classification (two classes):
L = -[y × log(p) + (1-y) × log(1-p)]
Where:
- y is the true label (0 or 1)
- p is the predicted probability of class 1
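As a quick numerical check, the formula translates directly into code (plain Python; the label/probability values below are illustrative):

```python
import math

def binary_cross_entropy(y, p):
    # y: true label (0 or 1); p: predicted probability of class 1
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Confident correct prediction -> small loss (~0.105)
print(binary_cross_entropy(1, 0.9))
# Confident wrong prediction -> large loss (~2.303)
print(binary_cross_entropy(1, 0.1))
```

Note how the loss grows without bound as a confident prediction turns out wrong.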
Categorical Cross-Entropy
For multi-class classification:
L = -Σ yᵢ × log(pᵢ)
Where:
- yᵢ is 1 for the true class, 0 otherwise (one-hot encoded)
- pᵢ is the predicted probability for class i
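Because yᵢ is one-hot, the sum collapses to a single term: the negative log-probability of the true class. A direct translation (the three-class probabilities are made up for illustration):

```python
import math

def categorical_cross_entropy(y_onehot, p):
    # Only the true class (y_i = 1) contributes to the sum
    return -sum(y * math.log(pi) for y, pi in zip(y_onehot, p))

# True class is index 2; the model assigns it probability 0.7
print(categorical_cross_entropy([0, 0, 1], [0.1, 0.2, 0.7]))  # = -log(0.7)
```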
Why Cross-Entropy?
1. Information Theory Foundation
Cross-entropy comes from information theory. It measures the average number of bits needed to encode data from distribution p using a code optimized for distribution q.
H(p, q) = -Σ p(x) × log(q(x))
Minimizing cross-entropy = making q match p.
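A small numeric check of that claim, using two illustrative distributions: cross-entropy H(p, q) is minimized exactly when q equals p, at which point it reduces to the entropy of p.

```python
import math

def cross_entropy(p, q):
    # H(p, q) = -sum p(x) * log q(x)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

print(cross_entropy(p, q))  # strictly larger than the entropy of p
print(cross_entropy(p, p))  # the minimum: H(p, p) = H(p)
```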
2. Maximum Likelihood Connection
Minimizing cross-entropy is equivalent to maximizing likelihood. If we assume the model outputs parameterize a categorical distribution, the negative log-likelihood is exactly cross-entropy.
3. Better Gradients Than MSE
For classification with sigmoid/softmax outputs:
- Cross-entropy: Gradient doesn't vanish when predictions are wrong
- MSE: Gradient can be tiny even for confident wrong predictions
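The contrast is easy to see by differentiating both losses through a sigmoid. For a single logit z with p = sigmoid(z), the cross-entropy gradient with respect to z is p − y, while the MSE gradient is (p − y)·p·(1 − p) (up to a constant factor), which collapses when the sigmoid saturates. A small check with a confidently wrong prediction:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

y, z = 1.0, -6.0       # true label 1, but the logit says class 0
p = sigmoid(z)         # ~0.0025: confident and wrong

grad_ce = p - y                    # d(CE)/dz through sigmoid
grad_mse = (p - y) * p * (1 - p)   # d(MSE)/dz through sigmoid, up to a factor of 2

print(abs(grad_ce))    # near 1: strong learning signal
print(abs(grad_mse))   # near 0: the gradient has vanished
```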
The Softmax-CrossEntropy Pairing
In practice, softmax and cross-entropy are computed together for numerical stability:
# Unstable:
probs = softmax(logits)
loss = cross_entropy(probs, labels)
# Stable (what libraries do):
loss = cross_entropy_with_logits(logits, labels)
The stable version avoids:
- Computing exp() of large numbers
- Taking log() of tiny probabilities
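As a sketch of what the fused op does under the hood (a NumPy illustration, not any particular library's implementation): compute log-softmax directly from shifted logits, so exp() is never applied to a large number and log() is never applied to a tiny one.

```python
import numpy as np

def cross_entropy_with_logits(logits, label):
    # Stable log-softmax: shifting by max(logits) bounds exp()'s argument at 0
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[label]

logits = np.array([1000.0, 2.0, -1.0])  # naive softmax would overflow here
print(cross_entropy_with_logits(logits, 0))  # ~0: model is certain and correct
```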
Variants
Weighted Cross-Entropy
For imbalanced classes, weight the loss:
L = -Σ wᵢ × yᵢ × log(pᵢ)
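A minimal sketch, with made-up weights: each class's contribution is simply scaled by wᵢ, so mistakes on rare classes cost more.

```python
import math

def weighted_cross_entropy(y_onehot, p, w):
    # w_i up-weights the loss contribution of its class
    return -sum(wi * y * math.log(pi) for wi, y, pi in zip(w, y_onehot, p))

# Class 2 is rare, so its errors count 5x
print(weighted_cross_entropy([0, 0, 1], [0.1, 0.2, 0.7], [1.0, 1.0, 5.0]))
```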
Focal Loss
Down-weights easy examples so training focuses on hard ones:
L = -(1-pₜ)^γ × log(pₜ)
Where pₜ is the predicted probability of the true class and γ ≥ 0 (typically 2) controls how strongly easy examples are suppressed (γ = 0 recovers plain cross-entropy).
Used in object detection (RetinaNet).
Label Smoothing
Instead of hard 0/1 labels, use soft labels:
y_smooth = (1-ε)×y + ε/K
Where ε is the smoothing factor (e.g. 0.1) and K is the number of classes. Prevents overconfidence and improves calibration.
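A minimal sketch of the transformation (ε = 0.1, K = 3 here are illustrative): the true class gives up a little probability mass, spread evenly over all classes, and the result still sums to 1.

```python
def smooth_labels(y_onehot, eps=0.1):
    K = len(y_onehot)
    return [(1 - eps) * y + eps / K for y in y_onehot]

print(smooth_labels([0, 0, 1]))  # [0.0333..., 0.0333..., 0.9333...]
```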
Relationship to Other Concepts
Entropy
Entropy measures uncertainty in a distribution:
H(p) = -Σ p(x) × log(p(x))
KL Divergence
KL divergence measures how one distribution differs from another:
D_KL(p || q) = H(p, q) - H(p)
Minimizing cross-entropy = minimizing KL divergence (since H(p) is constant).
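The identity D_KL(p || q) = H(p, q) − H(p) can be verified numerically (illustrative distributions):

```python
import math

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p, q = [0.5, 0.3, 0.2], [0.4, 0.4, 0.2]
print(kl(p, q))                          # direct KL divergence
print(cross_entropy(p, q) - entropy(p))  # same value via the identity
```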
Common Pitfalls
1. Log of Zero
log(0) = -∞, so clip predictions away from both 0 and 1 (the (1-y)×log(1-p) term blows up at p = 1):
p = clip(p, 1e-7, 1 - 1e-7)
loss = -(y * log(p) + (1 - y) * log(1 - p))
2. Numerical Overflow
exp() of large numbers overflows. Use log-sum-exp trick:
log_softmax = logits - logsumexp(logits)
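A stabilized logsumexp shifts by the maximum before exponentiating, so the trick works even for logits that would overflow a naive sum of exp() (pure Python sketch with deliberately huge logits):

```python
import math

def logsumexp(xs):
    # log(sum(exp(x))) = m + log(sum(exp(x - m))) for any shift m;
    # choosing m = max(xs) keeps every exp() argument <= 0
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

logits = [1000.0, 1001.0, 1002.0]  # naive exp(1000) overflows a float
log_softmax = [x - logsumexp(logits) for x in logits]
print(log_softmax)  # finite, and its exps sum to 1
```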
3. Wrong Activation
- Binary: Use sigmoid + binary cross-entropy
- Multi-class: Use softmax + categorical cross-entropy
- Multi-label: Use sigmoid + binary cross-entropy per class
Practical Tips
- Always use with logits: Let the library handle numerical stability
- Check your labels: One-hot vs. integer labels require different functions
- Monitor calibration: Low loss doesn't guarantee well-calibrated probabilities
- Consider label smoothing: Especially with noisy labels
Key Takeaways
- Cross-entropy measures how well predictions match true distributions
- It's equivalent to negative log-likelihood
- Softmax + cross-entropy should be computed together for stability
- Use weighted variants for imbalanced data
- Watch out for numerical issues with extreme probabilities