Evaluation & Metrics (Intermediate)

Understand model calibration: ensuring that predicted probabilities reflect true likelihoods, which is essential for decision-making under uncertainty.

Tags: calibration, probability, evaluation, uncertainty

Model Calibration

Calibration measures whether a model's predicted probabilities match actual outcome frequencies. When a well-calibrated model predicts a 70% probability, the predicted event occurs about 70% of the time.

Why Calibration Matters

The Problem

Most models output confidence scores, not true probabilities:

Model predicts: P(rain) = 0.9
Actual frequency when P(rain) = 0.9: only 60% rain
→ Model is overconfident!

When It Matters

  • Medical diagnosis: "90% chance of disease" should mean 90%
  • Risk assessment: Probability directly affects decisions
  • Ensemble methods: Combining probabilities from multiple models
  • Threshold selection: Choosing classification thresholds

Measuring Calibration

Reliability Diagram (Calibration Curve)

Bin predictions by confidence, plot actual frequency:

Actual
   1 |            /
     |          /    ← perfect calibration (diagonal)
 0.5 |        /  ...
     |      /...     ← your model (below the diagonal = overconfident)
   0 |___/..________
     0    0.5    1
        Predicted
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt

# Mean actual frequency vs. mean predicted probability, per bin
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)
plt.plot(prob_pred, prob_true, marker='o')
plt.plot([0, 1], [0, 1], 'k--')  # Perfect calibration
plt.show()

Expected Calibration Error (ECE)

Weighted average of calibration error per bin:

ECE = Σᵢ (nᵢ/N) × |accuracy(binᵢ) − confidence(binᵢ)|

Lower is better. Perfect calibration = 0.
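
As a concrete reference, here is a minimal NumPy sketch of the formula above for binary problems (y_true and y_prob are hypothetical arrays of 0/1 labels and predicted P(y=1)):

import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    # Assign each prediction to one of n_bins equal-width bins
    bin_ids = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            conf = y_prob[mask].mean()   # average confidence in the bin
            freq = y_true[mask].mean()   # actual positive frequency in the bin
            ece += mask.mean() * abs(freq - conf)   # weighted by nᵢ/N
    return ece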

Brier Score

Mean squared error of probability predictions:

Brier = (1/N) × Σ (pᵢ - yᵢ)²

Lower is better. Combines calibration + discrimination.
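
The formula maps directly to one line of NumPy, and sklearn's brier_score_loss computes the same quantity (y_true and y_prob are hypothetical arrays as above):

import numpy as np
from sklearn.metrics import brier_score_loss

brier = np.mean((y_prob - y_true) ** 2)   # the formula above
assert np.isclose(brier, brier_score_loss(y_true, y_prob))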

Calibration Problems by Model

Neural Networks

Typically overconfident, especially:

  • With high model capacity (many parameters)
  • When trained for too long

Random Forests

Typically underconfident at extremes:

  • Predictions bunch around 0.5
  • Rarely predict very high/low probabilities

Boosting (XGBoost, etc.)

Varies, often overconfident.

Logistic Regression

Often well-calibrated out of the box, since it directly optimizes log loss (assuming the model fits the data well).

SVMs

Output decision scores (signed distances from the separating hyperplane), not probabilities at all; a method like Platt scaling is needed to convert scores into probabilities.

Calibration Methods

Platt Scaling

Fit a logistic regression on the model's outputs:

from sklearn.calibration import CalibratedClassifierCV

# method='sigmoid' is Platt scaling; cv=5 fits the base model and the
# calibrator on separate folds, so calibration isn't learned from training fits
calibrated = CalibratedClassifierCV(model, method='sigmoid', cv=5)
calibrated.fit(X_train, y_train)

Learns:

P(y=1|s) = 1 / (1 + exp(As + B))

Where s is the original model output.

Best for: Sigmoidal distortion, smaller datasets
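
To see what is actually learned, here is a minimal manual sketch (s_val, s_test are hypothetical held-out score arrays, y_val the matching labels; note sklearn's LogisticRegression is L2-regularized by default, so a large C approximates plain Platt scaling):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Learns sigmoid(w*s + b), i.e. A = -w, B = -b in the formula above
platt = LogisticRegression(C=1e6)   # large C ≈ unregularized fit
platt.fit(s_val.reshape(-1, 1), y_val)
p_test = platt.predict_proba(s_test.reshape(-1, 1))[:, 1]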

Isotonic Regression

Non-parametric, monotonic calibration:

calibrated = CalibratedClassifierCV(model, method='isotonic', cv=5)
calibrated.fit(X_train, y_train)

Best for: Non-sigmoidal distortion, larger datasets

Warning: Can overfit with small data.
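
You can also fit sklearn's IsotonicRegression directly on held-out probabilities (p_val, y_val, p_test are hypothetical arrays):

from sklearn.isotonic import IsotonicRegression

# Monotonic, piecewise-constant map from raw to calibrated probabilities
iso = IsotonicRegression(out_of_bounds='clip')   # clip test values outside the fitted range
iso.fit(p_val, y_val)
p_test_cal = iso.predict(p_test)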

Temperature Scaling

For neural networks, divide logits by temperature T:

import torch
import torch.nn as nn

class TemperatureScaling(nn.Module):
    def __init__(self):
        super().__init__()
        # T > 1 softens the output distribution; T < 1 sharpens it
        self.temperature = nn.Parameter(torch.ones(1))

    def forward(self, logits):
        return logits / self.temperature

Optimize T on validation set to minimize NLL.
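
For example, a minimal sketch using LBFGS (val_logits and val_labels are hypothetical tensors from a held-out set):

import torch
import torch.nn.functional as F

temperature = torch.ones(1, requires_grad=True)
optimizer = torch.optim.LBFGS([temperature], lr=0.01, max_iter=50)

def closure():
    optimizer.zero_grad()
    # NLL of temperature-scaled logits on the validation set
    loss = F.cross_entropy(val_logits / temperature, val_labels)
    loss.backward()
    return loss

optimizer.step(closure)   # fits the single scalar T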

Best for: Neural networks, simple and effective.

Histogram Binning

Assign calibrated probability per bin:

If prediction in [0.8, 0.9], output 0.72 (actual frequency in that bin)

Simple but loses information.
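
A minimal sketch, assuming hypothetical NumPy arrays p_val/y_val (to fit the bins) and p_test (to calibrate):

import numpy as np

bins = np.linspace(0, 1, 11)                # 10 equal-width bins
val_ids = np.digitize(p_val, bins[1:-1])    # bin index per validation prediction
# Calibrated output per bin = observed positive frequency in that bin
bin_freq = np.array([y_val[val_ids == b].mean() if (val_ids == b).any()
                     else (bins[b] + bins[b + 1]) / 2   # empty bin: use midpoint
                     for b in range(10)])
p_test_cal = bin_freq[np.digitize(p_test, bins[1:-1])]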

Calibration vs Discrimination

A model can have:

  • Good discrimination, bad calibration: Ranks correctly but probabilities wrong
  • Good calibration, bad discrimination: Probabilities correct but poor ranking

Perfect discrimination + miscalibration:
  Predictions: [0.9, 0.9, 0.9, 0.1, 0.1, 0.1]
  Actuals:     [1,   1,   1,   0,   0,   0]
  → AUC = 1.0, but all "positive" predictions are 0.9 (should be 1.0)

Perfect calibration + poor discrimination:
  Predictions: [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
  Actuals:     [1,   0,   1,   0,   1,   0]
  → Perfectly calibrated, but AUC = 0.5 (random)

Ideal: Good at both!
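
You can verify the toy numbers above directly with sklearn metrics:

import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

y = np.array([1, 1, 1, 0, 0, 0])
p_disc = np.array([0.9, 0.9, 0.9, 0.1, 0.1, 0.1])   # perfect ranking, probabilities slightly off
p_flat = np.full(6, 0.5)                            # calibrated overall, no ranking

print(roc_auc_score(y, p_disc), brier_score_loss(y, p_disc))   # 1.0, 0.01
print(roc_auc_score(y, p_flat), brier_score_loss(y, p_flat))   # 0.5, 0.25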

Calibration Best Practices

Do

  • Always check calibration for probability-sensitive tasks
  • Use separate validation set for calibration (not training data!)
  • Use cross-validation for calibration when data is limited
  • Consider temperature scaling for neural networks

Don't

  • Use isotonic regression with small datasets (overfits)
  • Calibrate on training data (overfitting!)
  • Ignore calibration when probabilities matter

Code Example

from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.metrics import brier_score_loss
import matplotlib.pyplot as plt

# Calibrate model (base_model: any classifier with predict_proba, assumed defined;
# cv=5 refits a clone of base_model per fold, so fit the original separately too)
base_model.fit(X_train, y_train)
calibrated_model = CalibratedClassifierCV(base_model, method='isotonic', cv=5)
calibrated_model.fit(X_train, y_train)

# Get predictions
y_prob_uncal = base_model.predict_proba(X_test)[:, 1]
y_prob_cal = calibrated_model.predict_proba(X_test)[:, 1]

# Evaluate calibration
print(f"Brier (uncalibrated): {brier_score_loss(y_test, y_prob_uncal):.4f}")
print(f"Brier (calibrated): {brier_score_loss(y_test, y_prob_cal):.4f}")

# Plot reliability diagram
fig, ax = plt.subplots()
for name, y_prob in [('Uncalibrated', y_prob_uncal), ('Calibrated', y_prob_cal)]:
    prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=10)
    ax.plot(prob_pred, prob_true, marker='o', label=name)
ax.plot([0, 1], [0, 1], 'k--', label='Perfect')
ax.set_xlabel('Predicted probability')
ax.set_ylabel('Actual frequency')
ax.legend()
plt.show()

Key Takeaways

  1. Calibration = predicted probabilities match actual frequencies
  2. Most models are NOT well-calibrated out of the box
  3. Neural nets tend to be overconfident
  4. Use reliability diagrams and ECE/Brier score to measure
  5. Platt scaling and temperature scaling are common fixes
  6. Calibrate on held-out data, never training data
  7. Essential when probabilities drive decisions