Cross-Entropy Loss
Cross-entropy loss is the go-to loss function for classification problems. It measures how far a model's predicted probability distribution is from the true label distribution.
The Formula
Binary Cross-Entropy
For binary classification (two classes):
L = -[y × log(p) + (1-y) × log(1-p)]
Where:
- y is the true label (0 or 1)
- p is the predicted probability of class 1
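As a quick numerical check, the formula translates directly into code (plain Python; the label/probability values below are illustrative):

```python
import math

def binary_cross_entropy(y, p):
    # y: true label (0 or 1); p: predicted probability of class 1
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Confident correct prediction -> small loss (~0.105)
print(binary_cross_entropy(1, 0.9))
# Confident wrong prediction -> large loss (~2.303)
print(binary_cross_entropy(1, 0.1))
```

Note how the loss grows without bound as a confident prediction turns out wrong.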
Categorical Cross-Entropy
For multi-class classification:
L = -Σ yᵢ × log(pᵢ)
Where:
- yᵢ is 1 for the true class, 0 otherwise (one-hot encoded)
- pᵢ is the predicted probability for class i
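Because yᵢ is one-hot, the sum collapses to a single term: the negative log-probability of the true class. A direct translation (the three-class probabilities are made up for illustration):

```python
import math

def categorical_cross_entropy(y_onehot, p):
    # Only the true class (y_i = 1) contributes to the sum
    return -sum(y * math.log(pi) for y, pi in zip(y_onehot, p))

# True class is index 2; the model assigns it probability 0.7
print(categorical_cross_entropy([0, 0, 1], [0.1, 0.2, 0.7]))  # = -log(0.7)
```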
Why Cross-Entropy?
1. Information Theory Foundation
Cross-entropy comes from information theory. It measures the average number of bits needed to encode data from distribution p using a code optimized for distribution q.
H(p, q) = -Σ p(x) × log(q(x))
Minimizing cross-entropy = making q match p.
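A small numeric check of that claim, using two illustrative distributions: cross-entropy H(p, q) is minimized exactly when q equals p, at which point it reduces to the entropy of p.

```python
import math

def cross_entropy(p, q):
    # H(p, q) = -sum p(x) * log q(x)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

print(cross_entropy(p, q))  # strictly larger than the entropy of p
print(cross_entropy(p, p))  # the minimum: H(p, p) = H(p)
```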
2. Maximum Likelihood Connection
Minimizing cross-entropy is equivalent to maximizing likelihood. If we assume the model outputs parameterize a categorical distribution, the negative log-likelihood is exactly cross-entropy.
3. Better Gradients Than MSE
For classification with sigmoid/softmax outputs:
- Cross-entropy: Gradient doesn't vanish when predictions are wrong
- MSE: Gradient can be tiny even for confident wrong predictions
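The contrast is easy to see by differentiating both losses through a sigmoid. For a single logit z with p = sigmoid(z), the cross-entropy gradient with respect to z is p − y, while the MSE gradient is (p − y)·p·(1 − p) (up to a constant factor), which collapses when the sigmoid saturates. A small check with a confidently wrong prediction:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

y, z = 1.0, -6.0       # true label 1, but the logit says class 0
p = sigmoid(z)         # ~0.0025: confident and wrong

grad_ce = p - y                    # d(CE)/dz through sigmoid
grad_mse = (p - y) * p * (1 - p)   # d(MSE)/dz through sigmoid, up to a factor of 2

print(abs(grad_ce))    # near 1: strong learning signal
print(abs(grad_mse))   # near 0: the gradient has vanished
```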
The Softmax-CrossEntropy Pairing
In practice, softmax and cross-entropy are computed together for numerical stability:
# Unstable:
probs = softmax(logits)
loss = cross_entropy(probs, labels)
# Stable (what libraries do):
loss = cross_entropy_with_logits(logits, labels)
The stable version avoids:
- Computing exp() of large numbers
- Taking log() of tiny probabilities
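As a sketch of what the fused op does under the hood (a NumPy illustration, not any particular library's implementation): compute log-softmax directly from shifted logits, so exp() is never applied to a large number and log() is never applied to a tiny one.

```python
import numpy as np

def cross_entropy_with_logits(logits, label):
    # Stable log-softmax: shifting by max(logits) bounds exp()'s argument at 0
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[label]

logits = np.array([1000.0, 2.0, -1.0])  # naive softmax would overflow here
print(cross_entropy_with_logits(logits, 0))  # ~0: model is certain and correct
```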
Variants
Weighted Cross-Entropy
For imbalanced classes, weight the loss:
L = -Σ wᵢ × yᵢ × log(pᵢ)
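A minimal sketch, with made-up weights: each class's contribution is simply scaled by wᵢ, so mistakes on rare classes cost more.

```python
import math

def weighted_cross_entropy(y_onehot, p, w):
    # w_i up-weights the loss contribution of its class
    return -sum(wi * y * math.log(pi) for wi, y, pi in zip(w, y_onehot, p))

# Class 2 is rare, so its errors count 5x
print(weighted_cross_entropy([0, 0, 1], [0.1, 0.2, 0.7], [1.0, 1.0, 5.0]))
```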
Focal Loss
Down-weights easy examples so training focuses on hard ones:
L = -(1-pₜ)^γ × log(pₜ)
Where pₜ is the predicted probability of the true class and γ ≥ 0 (typically 2) controls how strongly easy examples are suppressed (γ = 0 recovers plain cross-entropy).
Used in object detection (RetinaNet).
Label Smoothing
Instead of hard 0/1 labels, use soft labels:
y_smooth = (1-ε)×y + ε/K
Where ε is the smoothing factor (e.g. 0.1) and K is the number of classes. Prevents overconfidence and improves calibration.
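A minimal sketch of the transformation (ε = 0.1, K = 3 here are illustrative): the true class gives up a little probability mass, spread evenly over all classes, and the result still sums to 1.

```python
def smooth_labels(y_onehot, eps=0.1):
    K = len(y_onehot)
    return [(1 - eps) * y + eps / K for y in y_onehot]

print(smooth_labels([0, 0, 1]))  # [0.0333..., 0.0333..., 0.9333...]
```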
Relationship to Other Concepts
Entropy
Entropy measures uncertainty in a distribution:
H(p) = -Σ p(x) × log(p(x))
KL Divergence
KL divergence measures how one distribution differs from another:
D_KL(p || q) = H(p, q) - H(p)
Minimizing cross-entropy = minimizing KL divergence (since H(p) is constant).
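The identity D_KL(p || q) = H(p, q) − H(p) can be verified numerically (illustrative distributions):

```python
import math

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p, q = [0.5, 0.3, 0.2], [0.4, 0.4, 0.2]
print(kl(p, q))                          # direct KL divergence
print(cross_entropy(p, q) - entropy(p))  # same value via the identity
```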
Common Pitfalls
1. Log of Zero
log(0) = -∞, so clip predictions away from both 0 and 1 (the (1-y)×log(1-p) term blows up at p = 1):
p = clip(p, 1e-7, 1 - 1e-7)
loss = -(y * log(p) + (1 - y) * log(1 - p))
2. Numerical Overflow
exp() of large numbers overflows. Use log-sum-exp trick:
log_softmax = logits - logsumexp(logits)
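A stabilized logsumexp shifts by the maximum before exponentiating, so the trick works even for logits that would overflow a naive sum of exp() (pure Python sketch with deliberately huge logits):

```python
import math

def logsumexp(xs):
    # log(sum(exp(x))) = m + log(sum(exp(x - m))) for any shift m;
    # choosing m = max(xs) keeps every exp() argument <= 0
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

logits = [1000.0, 1001.0, 1002.0]  # naive exp(1000) overflows a float
log_softmax = [x - logsumexp(logits) for x in logits]
print(log_softmax)  # finite, and its exps sum to 1
```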
3. Wrong Activation
- Binary: Use sigmoid + binary cross-entropy
- Multi-class: Use softmax + categorical cross-entropy
- Multi-label: Use sigmoid + binary cross-entropy per class
Practical Tips
- Always use with logits: Let the library handle numerical stability
- Check your labels: One-hot vs. integer labels require different functions
- Monitor calibration: Low loss doesn't guarantee well-calibrated probabilities
- Consider label smoothing: Especially with noisy labels
Key Takeaways
- Cross-entropy measures how well predictions match true distributions
- It's equivalent to negative log-likelihood
- Softmax + cross-entropy should be computed together for stability
- Use weighted variants for imbalanced data
- Watch out for numerical issues with extreme probabilities