Model Calibration
Calibration measures whether a model's predicted probabilities match actual outcome frequencies. When a well-calibrated model predicts a 70% probability, the event actually occurs about 70% of the time.
Why Calibration Matters
The Problem
Most models output confidence scores, not true probabilities:
Model predicts: P(rain) = 0.9
Actual frequency when P(rain) = 0.9: only 60% rain
→ Model is overconfident!
When It Matters
- Medical diagnosis: "90% chance of disease" should mean 90%
- Risk assessment: Probability directly affects decisions
- Ensemble methods: Combining probabilities from multiple models
- Threshold selection: Choosing classification thresholds
Measuring Calibration
Reliability Diagram (Calibration Curve)
Bin predictions by confidence, plot actual frequency:
(Reliability diagram: predicted probability on the x-axis, actual frequency on the y-axis. The diagonal y = x is perfect calibration; where the model's curve deviates from it, the model is over- or underconfident.)
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)
plt.plot(prob_pred, prob_true, marker='o')   # model's calibration curve
plt.plot([0, 1], [0, 1], 'k--')              # perfect calibration reference
Expected Calibration Error (ECE)
Weighted average of calibration error per bin:
ECE = Σ (nᵢ/N) × |accuracy(bin_i) - confidence(bin_i)|
Lower is better. Perfect calibration = 0.
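A minimal NumPy sketch of the binary-case ECE with equal-width bins (the function name and the y_true/y_prob arrays are illustrative placeholders, not part of any library):

import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE for binary problems, with y_prob = predicted P(y=1)."""
    # Assign each prediction to one of n_bins equal-width bins over [0, 1]
    bin_ids = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            accuracy = y_true[mask].mean()     # observed frequency in the bin
            confidence = y_prob[mask].mean()   # mean predicted probability in the bin
            ece += mask.mean() * abs(accuracy - confidence)   # weight by n_b / N
    return ece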
Brier Score
Mean squared error of probability predictions:
Brier = (1/N) × Σ (pᵢ - yᵢ)²
Lower is better. Combines calibration + discrimination.
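As a quick check on toy arrays, the formula matches scikit-learn's brier_score_loss:

import numpy as np
from sklearn.metrics import brier_score_loss

y_true = np.array([0, 1, 1, 0])
y_prob = np.array([0.1, 0.8, 0.6, 0.4])
print(np.mean((y_prob - y_true) ** 2))    # 0.0925, direct formula
print(brier_score_loss(y_true, y_prob))   # 0.0925, same value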
Calibration Problems by Model
Neural Networks
Typically overconfident, especially:
- With many parameters / high model capacity
- When trained for too long
Random Forests
Typically underconfident at extremes:
- Predictions bunch around 0.5
- Rarely predict very high/low probabilities
Boosting (XGBoost, etc.)
Varies, often overconfident.
Logistic Regression
Often well-calibrated out of the box (when the model fits the data well).
SVMs
Output margin scores, not probabilities at all (need Platt scaling or a similar calibration step to obtain probabilities), as in the sketch below.
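For example, a sketch with standard scikit-learn: LinearSVC has no predict_proba, so wrapping it in CalibratedClassifierCV (the sigmoid option is the Platt scaling described in the next section) produces probabilities. X_train, y_train, and X_test are placeholders.

from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

svm = LinearSVC()  # exposes decision_function (margins), no predict_proba
calibrated_svm = CalibratedClassifierCV(svm, method='sigmoid', cv=5)
calibrated_svm.fit(X_train, y_train)
probs = calibrated_svm.predict_proba(X_test)[:, 1]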
Calibration Methods
Platt Scaling
Fit a logistic regression on the model's outputs:
from sklearn.calibration import CalibratedClassifierCV
calibrated = CalibratedClassifierCV(model, method='sigmoid', cv=5)
calibrated.fit(X_train, y_train)
Learns:
P(y=1|s) = 1 / (1 + exp(As + B))
Where s is the original model output.
Best for: Sigmoidal distortion, smaller datasets
Isotonic Regression
Non-parametric, monotonic calibration:
calibrated = CalibratedClassifierCV(model, method='isotonic', cv=5)
calibrated.fit(X_train, y_train)
Best for: Non-sigmoidal distortion, larger datasets
Warning: Can overfit with small data.
Temperature Scaling
For neural networks, divide logits by temperature T:
import torch
import torch.nn as nn

class TemperatureScaling(nn.Module):
    def __init__(self):
        super().__init__()
        self.temperature = nn.Parameter(torch.ones(1))

    def forward(self, logits):
        return logits / self.temperature
Optimize T on validation set to minimize NLL.
Best for: Neural networks, simple and effective.
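A sketch of that optimization step, reusing the TemperatureScaling module above; val_logits and val_labels are placeholder names for held-out validation tensors:

# Fit T on held-out validation data by minimizing NLL (cross-entropy on scaled logits)
scaler = TemperatureScaling()
optimizer = torch.optim.LBFGS([scaler.temperature], lr=0.01, max_iter=50)
criterion = nn.CrossEntropyLoss()

def closure():
    optimizer.zero_grad()
    loss = criterion(scaler(val_logits), val_labels)
    loss.backward()
    return loss

optimizer.step(closure)
# At inference time, apply softmax(logits / T) to get calibrated probabilities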
Histogram Binning
Assign calibrated probability per bin:
If prediction in [0.8, 0.9], output 0.72 (actual frequency in that bin)
Simple but loses information.
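A minimal sketch with equal-width bins; cal_prob and cal_true are placeholder names for a held-out calibration set:

import numpy as np

def histogram_binning(cal_prob, cal_true, n_bins=10):
    """Map each bin of raw probabilities to its observed positive frequency."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ids = np.clip(np.digitize(cal_prob, edges[1:-1]), 0, n_bins - 1)
    # Per-bin observed frequency; fall back to the bin midpoint if a bin is empty
    freq = np.array([cal_true[ids == b].mean() if np.any(ids == b)
                     else (edges[b] + edges[b + 1]) / 2 for b in range(n_bins)])
    def calibrate(p):
        return freq[np.clip(np.digitize(p, edges[1:-1]), 0, n_bins - 1)]
    return calibrate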
Calibration vs Discrimination
A model can have:
- Good discrimination, bad calibration: Ranks correctly but probabilities wrong
- Good calibration, bad discrimination: Probabilities correct but poor ranking
Perfect discrimination + miscalibration:
Predictions: [0.9, 0.9, 0.9, 0.1, 0.1, 0.1]
Actuals: [1, 1, 1, 0, 0, 0]
→ AUC = 1.0, but all "positive" predictions are 0.9 (should be 1.0)
Perfect calibration + poor discrimination:
Predictions: [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
Actuals: [1, 0, 1, 0, 1, 0]
→ Perfectly calibrated, but AUC = 0.5 (random)
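Both toy cases can be checked directly (a sketch using standard scikit-learn metrics):

import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

# Perfect discrimination, miscalibrated
y_true = np.array([1, 1, 1, 0, 0, 0])
y_prob = np.array([0.9, 0.9, 0.9, 0.1, 0.1, 0.1])
print(roc_auc_score(y_true, y_prob))      # 1.0
print(brier_score_loss(y_true, y_prob))   # 0.01, nonzero because 0.9/0.1 should be 1/0

# Perfectly calibrated, no discrimination
y_true = np.array([1, 0, 1, 0, 1, 0])
y_prob = np.full(6, 0.5)
print(roc_auc_score(y_true, y_prob))      # 0.5
print(brier_score_loss(y_true, y_prob))   # 0.25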
Ideal: Good at both!
Calibration Best Practices
Do
- Always check calibration for probability-sensitive tasks
- Use separate validation set for calibration (not training data!)
- Use cross-validation for calibration when data is limited
- Consider temperature scaling for neural networks
Don't
- Use isotonic regression with small datasets (overfits)
- Calibrate on training data (overfitting!)
- Ignore calibration when probabilities matter
Code Example
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.metrics import brier_score_loss
import matplotlib.pyplot as plt

# Calibrate model (base_model is an already-fitted classifier with predict_proba)
calibrated_model = CalibratedClassifierCV(base_model, method='isotonic', cv=5)
calibrated_model.fit(X_train, y_train)

# Get predictions
y_prob_uncal = base_model.predict_proba(X_test)[:, 1]
y_prob_cal = calibrated_model.predict_proba(X_test)[:, 1]

# Evaluate calibration
print(f"Brier (uncalibrated): {brier_score_loss(y_test, y_prob_uncal):.4f}")
print(f"Brier (calibrated): {brier_score_loss(y_test, y_prob_cal):.4f}")

# Plot reliability diagram
fig, ax = plt.subplots()
for name, y_prob in [('Uncalibrated', y_prob_uncal), ('Calibrated', y_prob_cal)]:
    prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=10)
    ax.plot(prob_pred, prob_true, marker='o', label=name)
ax.plot([0, 1], [0, 1], 'k--', label='Perfect')
ax.legend()
Key Takeaways
- Calibration = predicted probabilities match actual frequencies
- Most models are NOT well-calibrated out of the box
- Neural nets tend to be overconfident
- Use reliability diagrams and ECE/Brier score to measure
- Platt scaling and temperature scaling are common fixes
- Calibrate on held-out data, never training data
- Essential when probabilities drive decisions