Intermediate · Foundations

Understand cross-entropy loss - the most common loss function for classification tasks, measuring the difference between predicted and actual probability distributions.

loss-functions · classification · information-theory · deep-learning

Cross-Entropy Loss

Cross-entropy loss is the go-to loss function for classification problems. It measures how well predicted probability distributions match the true labels.

The Formula

Binary Cross-Entropy

For binary classification (two classes):

L = -[y × log(p) + (1-y) × log(1-p)]

Where:

  • y is the true label (0 or 1)
  • p is the predicted probability of class 1
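As a quick numeric check, the binary formula can be sketched in a few lines of numpy (the function name and the 1e-7 clip are illustrative choices, not a specific library's API):

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-7):
    """Binary cross-entropy for a single prediction (illustrative sketch)."""
    p = np.clip(p, eps, 1 - eps)  # guard against log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# A confident correct prediction gives a small loss...
print(binary_cross_entropy(1, 0.9))  # ≈ 0.105
# ...while a confident wrong prediction is punished heavily.
print(binary_cross_entropy(1, 0.1))  # ≈ 2.303
```

Note how the loss grows without bound as the predicted probability of the true class approaches zero.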

Categorical Cross-Entropy

For multi-class classification:

L = -Σ yᵢ × log(pᵢ)

Where:

  • yᵢ is 1 for the true class, 0 otherwise (one-hot encoded)
  • pᵢ is the predicted probability for class i
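Because the one-hot vector zeroes out every other term, the sum reduces to the negative log-probability of the true class. A minimal numpy sketch (names are illustrative):

```python
import numpy as np

def categorical_cross_entropy(y_onehot, p, eps=1e-7):
    """Only the true class's log-probability contributes to the sum."""
    p = np.clip(p, eps, 1.0)  # guard against log(0)
    return -np.sum(y_onehot * np.log(p))

y = np.array([0, 1, 0])        # true class is index 1
p = np.array([0.2, 0.7, 0.1])  # predicted distribution
print(categorical_cross_entropy(y, p))  # -log(0.7) ≈ 0.357
```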

Why Cross-Entropy?

1. Information Theory Foundation

Cross-entropy comes from information theory. It measures the average number of bits needed to encode data from distribution p using a code optimized for distribution q.

H(p, q) = -Σ p(x) × log(q(x))

Minimizing cross-entropy = making q match p.
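A small numeric check of this definition (assuming natural log, so costs are in nats):

```python
import numpy as np

def cross_entropy_dist(p, q):
    """H(p, q): the coding cost of data from p under a code built for q."""
    return -np.sum(p * np.log(q))

p = np.array([0.5, 0.5])
print(cross_entropy_dist(p, p))                     # ln 2 ≈ 0.693 (the entropy of p)
print(cross_entropy_dist(p, np.array([0.9, 0.1])))  # ≈ 1.204: a mismatched code costs more
```

H(p, q) is minimized exactly when q = p, which is why driving it down pulls the model's distribution toward the true one.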

2. Maximum Likelihood Connection

Minimizing cross-entropy is equivalent to maximizing likelihood. If we assume the model outputs parameterize a categorical distribution, the negative log-likelihood is exactly cross-entropy.

3. Better Gradients Than MSE

For classification with sigmoid/softmax outputs:

  • Cross-entropy: Gradient doesn't vanish when predictions are wrong
  • MSE: Gradient can be tiny even for confident wrong predictions
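This is easy to see by differentiating both losses through a sigmoid. With p = σ(z), binary cross-entropy gives dL/dz = p − y, while squared error gives dL/dz = 2(p − y)·p·(1 − p); the extra p(1 − p) factor crushes the MSE gradient whenever the sigmoid saturates. A sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, y = -6.0, 1.0  # confident, wrong prediction (p ≈ 0.0025)
p = sigmoid(z)

grad_ce  = p - y                      # d(BCE)/dz with a sigmoid output
grad_mse = 2 * (p - y) * p * (1 - p)  # d((p-y)^2)/dz with a sigmoid output

print(grad_ce)   # ≈ -0.998: still a strong learning signal
print(grad_mse)  # ≈ -0.005: nearly vanished despite the large error
```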

The Softmax-CrossEntropy Pairing

In practice, softmax and cross-entropy are computed together for numerical stability:

# Unstable:
probs = softmax(logits)
loss = cross_entropy(probs, labels)

# Stable (what libraries do):
loss = cross_entropy_with_logits(logits, labels)

The stable version avoids:

  • Computing exp() of large numbers
  • Taking log() of tiny probabilities
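A minimal numpy version of the stable path (an illustrative sketch of what fused implementations do internally, not any library's actual code):

```python
import numpy as np

def cross_entropy_with_logits(logits, label):
    """Stable cross-entropy from raw logits via the log-sum-exp trick."""
    m = np.max(logits)  # shifting by the max keeps exp() in a safe range
    log_probs = logits - m - np.log(np.sum(np.exp(logits - m)))
    return -log_probs[label]

logits = np.array([1000.0, 0.0, -1000.0])    # naive softmax would overflow exp(1000)
print(cross_entropy_with_logits(logits, 0))  # ≈ 0.0, computed without overflow
```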

Variants

Weighted Cross-Entropy

For imbalanced classes, weight the loss:

L = -Σ wᵢ × yᵢ × log(pᵢ)
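A sketch with assumed 5:1 weights for an imbalanced binary problem (all names and the weight values are illustrative):

```python
import numpy as np

def weighted_cross_entropy(y_onehot, p, w, eps=1e-7):
    """Per-class weights upweight mistakes on the rare class."""
    return -np.sum(w * y_onehot * np.log(np.clip(p, eps, 1.0)))

y = np.array([1, 0])      # true class 0 is the rare class
p = np.array([0.6, 0.4])
w = np.array([5.0, 1.0])  # assumed 5:1 imbalance between the classes
print(weighted_cross_entropy(y, p, w))  # 5 × -log(0.6) ≈ 2.554
```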

Focal Loss

Down-weights easy examples, focuses on hard ones:

L = -(1-pₜ)^γ × log(pₜ)

Where pₜ is the predicted probability of the true class and γ ≥ 0 controls the down-weighting (γ = 0 recovers standard cross-entropy).

Used in object detection (RetinaNet).
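A binary-case sketch, with γ = 2 (a common default) and p_t the predicted probability of the true class:

```python
import numpy as np

def focal_loss(y, p, gamma=2.0, eps=1e-7):
    """Binary focal loss: (1 - p_t)^gamma down-weights easy examples."""
    p = np.clip(p, eps, 1 - eps)
    p_t = p if y == 1 else 1 - p  # probability assigned to the true class
    return -((1 - p_t) ** gamma) * np.log(p_t)

# Easy example (p_t = 0.9): loss shrinks by (1 - 0.9)^2 = 100x vs plain CE.
print(focal_loss(1, 0.9))  # ≈ 0.00105
# Hard example (p_t = 0.1) keeps most of its loss.
print(focal_loss(1, 0.1))  # ≈ 1.865
```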

Label Smoothing

Instead of hard 0/1 labels, use soft labels:

y_smooth = (1-ε)×y + ε/K

Where ε is the smoothing strength (e.g. 0.1) and K is the number of classes.

Prevents overconfidence, improves calibration.
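A one-line sketch of the blending step (ε = 0.1 is a common default; names are illustrative):

```python
import numpy as np

def smooth_labels(y_onehot, eps=0.1):
    """Blend a one-hot target with the uniform distribution over K classes."""
    K = y_onehot.shape[-1]
    return (1 - eps) * y_onehot + eps / K

y = np.array([0.0, 1.0, 0.0])
print(smooth_labels(y))  # ≈ [0.0333, 0.9333, 0.0333]
```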

Relationship to Other Concepts

Entropy

Entropy measures uncertainty in a distribution:

H(p) = -Σ p(x) × log(p(x))
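A quick numeric check (natural log, using the convention 0·log 0 = 0):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats; highest for a uniform distribution."""
    p = p[p > 0]  # drop zero-probability terms (0·log 0 = 0)
    return -np.sum(p * np.log(p))

print(entropy(np.array([0.5, 0.5])))  # ln 2 ≈ 0.693 (maximal uncertainty)
print(entropy(np.array([1.0, 0.0])))  # 0.0 (no uncertainty at all)
```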

KL Divergence

KL divergence measures how one distribution differs from another:

D_KL(p || q) = H(p, q) - H(p)

Minimizing cross-entropy = minimizing KL divergence (since H(p) is constant).
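The decomposition is easy to verify numerically (a sketch; natural log throughout):

```python
import numpy as np

p = np.array([0.7, 0.3])
q = np.array([0.5, 0.5])

H_pq = -np.sum(p * np.log(q))     # cross-entropy H(p, q)
H_p  = -np.sum(p * np.log(p))     # entropy H(p)
kl   = np.sum(p * np.log(p / q))  # D_KL(p || q), computed directly

# H(p, q) = H(p) + D_KL(p || q) holds term for term.
assert np.isclose(H_pq, H_p + kl)
print(H_pq, H_p, kl)
```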

Common Pitfalls

1. Log of Zero

log(0) = -∞. Always clip predictions:

loss = -y * log(max(p, 1e-7))

2. Numerical Overflow

exp() of large numbers overflows. Use log-sum-exp trick:

log_softmax = logits - logsumexp(logits)
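A small demonstration of why the shift matters (the max-subtraction is the standard trick; variable names are illustrative):

```python
import numpy as np

logits = np.array([1000.0, 999.0])

# Naive softmax overflows: exp(1000) is inf in float64, so the result is NaN.
with np.errstate(over='ignore', invalid='ignore'):
    naive = np.exp(logits) / np.sum(np.exp(logits))
print(naive)  # [nan nan]

# Subtracting the max first keeps every exp() argument <= 0, so nothing overflows.
m = np.max(logits)
log_softmax = logits - (m + np.log(np.sum(np.exp(logits - m))))
print(np.exp(log_softmax))  # ≈ [0.731, 0.269]
```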

3. Wrong Activation

  • Binary: Use sigmoid + binary cross-entropy
  • Multi-class: Use softmax + categorical cross-entropy
  • Multi-label: Use sigmoid + binary cross-entropy per class

Practical Tips

  1. Always use with logits: Let the library handle numerical stability
  2. Check your labels: One-hot vs. integer labels require different functions
  3. Monitor calibration: Low loss doesn't guarantee well-calibrated probabilities
  4. Consider label smoothing: Especially with noisy labels

Key Takeaways

  1. Cross-entropy measures how well predictions match true distributions
  2. It's equivalent to negative log-likelihood
  3. Softmax + cross-entropy should be computed together for stability
  4. Use weighted variants for imbalanced data
  5. Watch out for numerical issues with extreme probabilities

Practice Questions

Test your understanding with these related interview questions: