Model Explainability & Interpretability
Explainability answers: "Why did the model make this prediction?" It's crucial for trust, debugging, regulatory compliance, and scientific understanding.
Interpretability vs Explainability
Interpretable models: Inherently understandable
- Linear regression, decision trees, rule lists
- Can directly inspect how features affect output
Explainable AI (XAI): Methods to explain black-box models
- SHAP, LIME, attention visualization
- Post-hoc explanations of complex models
Global vs Local Explanations
Global: How does the model work overall?
- Feature importance
- Partial dependence plots
- Model-level summaries
Local: Why this specific prediction?
- LIME, SHAP values for one instance
- Counterfactual explanations
Inherently Interpretable Models
Linear Models
y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ
Interpretation:
- βᵢ = change in y per unit change in xᵢ (holding others constant)
- Sign indicates direction of effect
- Magnitude indicates importance (if features scaled)
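As a concrete illustration, here is a minimal sketch with scikit-learn on synthetic data (the true coefficients of 3 and -2 are made up for the example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data generated as y = 3*x1 - 2*x2 + noise
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)

# beta_i = change in y per unit change in x_i, holding the other fixed
print(model.intercept_)  # ~0
print(model.coef_)       # ~[3, -2]
```

Because the features here are already on the same scale, coefficient magnitudes can be compared directly.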
Decision Trees
[Age > 30?]
├── Yes → [Income > 50K?]
│          ├── Yes → [Approve]
│          └── No  → [Deny]
└── No  → [Deny]
Explanation: "Denied because age ≤ 30"
Rule Lists
IF age > 30 AND income > 50K THEN approve
ELSE IF credit_score > 700 THEN approve
ELSE deny
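A rule list maps directly to code, which is part of its appeal; a sketch using the thresholds from the rules above:

```python
def approve_loan(age, income, credit_score):
    """Apply the rule list top-down; the first matching rule wins."""
    if age > 30 and income > 50_000:
        return "approve"
    elif credit_score > 700:
        return "approve"
    else:
        return "deny"

print(approve_loan(age=45, income=80_000, credit_score=650))  # approve
print(approve_loan(age=25, income=80_000, credit_score=650))  # deny
```

The explanation for any decision is just the first rule that fired.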
Feature Importance
Permutation Importance
Shuffle one feature, measure accuracy drop:
from sklearn.inspection import permutation_importance
result = permutation_importance(model, X_test, y_test, n_repeats=10)
importances = result.importances_mean
Pros: model-agnostic, works with any fitted model.
Cons: slow (many re-scorings); correlated features split importance between them.
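The mechanism is simple enough to sketch from scratch on synthetic data (the scorer and the 10 repeats are choices, not fixed by the method):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = 5 * X[:, 0] + X[:, 1]                # feature 2 carries no signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)

base_score = model.score(X_te, y_te)
importances = []
for j in range(X.shape[1]):
    drops = []
    for _ in range(10):                  # n_repeats
        Xp = X_te.copy()
        rng.shuffle(Xp[:, j])            # break the feature-target link
        drops.append(base_score - model.score(Xp, y_te))
    importances.append(float(np.mean(drops)))

print(importances)  # feature 0 largest, feature 2 near zero
```

Measuring the drop on a held-out set, as here, avoids rewarding features the model only used to overfit.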
Tree-Based Importance
# Mean decrease in impurity
importances = model.feature_importances_
# Or use permutation for more reliable results
Warning: MDI (mean decrease in impurity) importance is biased toward high-cardinality and continuous features.
SHAP (SHapley Additive exPlanations)
Based on cooperative game theory: Shapley values fairly distribute the prediction among the features:
import shap
explainer = shap.TreeExplainer(model) # Or KernelExplainer for any model
shap_values = explainer.shap_values(X)
# Summary plot
shap.summary_plot(shap_values, X)
# Single prediction explanation
shap.force_plot(explainer.expected_value, shap_values[0], X.iloc[0])
Interpreting SHAP Values
Base value (average prediction): 0.50
Feature contributions:
  income = $80K     → +0.30
  age = 45          → +0.15
  debt_ratio = 0.4  → -0.10
                      ------
Final prediction:     0.85
Sum of SHAP values = prediction - base value
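That additivity property can be checked by computing Shapley values exactly for a tiny model. A brute-force sketch over all feature coalitions (exponential in the number of features, so only feasible for a handful; replacing "absent" features with the background mean is one common convention):

```python
import itertools
import math
import numpy as np

def exact_shap(predict, x, background):
    """Brute-force Shapley values for one instance x.
    'Absent' features are replaced by the background mean."""
    n = len(x)
    base = background.mean(axis=0)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for S in itertools.combinations(others, size):
                # Shapley weight for a coalition of this size
                w = (math.factorial(size) * math.factorial(n - size - 1)
                     / math.factorial(n))
                with_i = base.copy()
                with_i[list(S) + [i]] = x[list(S) + [i]]
                without = base.copy()
                without[list(S)] = x[list(S)]
                phi[i] += w * (predict(with_i) - predict(without))
    return phi

# For a linear model, phi_i works out to beta_i * (x_i - mean_i)
beta = np.array([2.0, -1.0, 0.5])
predict = lambda v: float(v @ beta)

background = np.random.default_rng(0).normal(size=(100, 3))
x = np.array([1.0, 2.0, 3.0])

phi = exact_shap(predict, x, background)
base_value = predict(background.mean(axis=0))

# Additivity: sum of SHAP values = prediction - base value
print(phi.sum(), predict(x) - base_value)
```

TreeExplainer and KernelExplainer compute or approximate the same quantity far more efficiently.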
SHAP Plots
Summary plot: Feature importance + direction of effects
Dependence plot: How one feature affects predictions
Force plot: Single prediction breakdown
Waterfall plot: Step-by-step contribution
LIME (Local Interpretable Model-agnostic Explanations)
Approximate complex model locally with simple model:
import lime.lime_tabular
explainer = lime.lime_tabular.LimeTabularExplainer(
    X_train.values,
    feature_names=list(X_train.columns),
    class_names=['Negative', 'Positive'],
    mode='classification'
)
# explain_instance expects a 1-D array, not a pandas row
exp = explainer.explain_instance(X_test.iloc[0].values, model.predict_proba)
exp.show_in_notebook()
How LIME Works
- Generate perturbed samples around the instance
- Get model predictions for perturbations
- Fit weighted linear model locally
- Linear coefficients = feature importance for that instance
Pros: model-agnostic, intuitive explanations.
Cons: explanations can be unstable and depend on the perturbation method.
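The four steps above can be sketched from scratch in a few lines (a toy nonlinear function stands in for the black box; the kernel width and perturbation scale are free choices):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# "Black box": nonlinear in feature 0, linear in feature 1
black_box = lambda X: np.sin(3 * X[:, 0]) + 0.5 * X[:, 1]

x = np.array([0.2, 1.0])                 # instance to explain

# 1. Generate perturbed samples around the instance
Z = x + rng.normal(scale=0.1, size=(500, 2))
# 2. Get model predictions for the perturbations
yz = black_box(Z)
# 3. Weight samples by proximity (Gaussian kernel)
w = np.exp(-np.sum((Z - x) ** 2, axis=1) / 0.1 ** 2)
# 4. Fit a weighted linear surrogate locally
surrogate = Ridge(alpha=1e-3).fit(Z, yz, sample_weight=w)

# Local slopes approximate the gradient at x: ~[3*cos(0.6), 0.5]
print(surrogate.coef_)
```

The surrogate's coefficients are the local feature importances; a different perturbation scale can give different answers, which is exactly the instability caveat above.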
Partial Dependence Plots
Show average effect of a feature on predictions:
from sklearn.inspection import PartialDependenceDisplay
PartialDependenceDisplay.from_estimator(model, X, features=['age', 'income'])
Shows: the marginal effect of a feature, averaged over the other features.
Limitation: assumes feature independence; can mislead when features are correlated.
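The averaging a PDP performs is easy to make explicit: clamp the feature of interest at each grid value for every row, predict, and average (a sketch on synthetic data where the true effect of feature 0 is quadratic):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(400, 2))
y = X[:, 0] ** 2 + X[:, 1]               # quadratic in feature 0

model = GradientBoostingRegressor(random_state=0).fit(X, y)

grid = np.linspace(-2, 2, 21)
pdp = []
for v in grid:
    Xv = X.copy()
    Xv[:, 0] = v                          # clamp feature 0 for every row
    pdp.append(model.predict(Xv).mean())  # average over the other features
pdp = np.array(pdp)

print(pdp.argmin())  # minimum near the middle: the U-shape of x0**2
```

Clamping every row to the same value is also where the independence assumption bites: it can create feature combinations the model never saw.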
Individual Conditional Expectation (ICE)
Like PDP but shows line for each instance:
PartialDependenceDisplay.from_estimator(
    model, X, features=['age'], kind='both'  # Shows both ICE and PDP
)
Reveals heterogeneous effects across instances.
Counterfactual Explanations
"What would need to change for a different prediction?"
Prediction: Loan denied
Counterfactual:
- If income increased from $40K to $52K → Approved
- OR if debt_ratio decreased from 0.45 to 0.30 → Approved
import dice_ml
dice_exp = dice_ml.Dice(data, model)
counterfactuals = dice_exp.generate_counterfactuals(
    instance, total_CFs=3, desired_class="opposite"
)
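The underlying idea can be sketched with a simple search: scan nearby feature values until the model's decision flips (a toy single-feature approval model; real libraries like DiCE optimize over many features at once):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy approval model: approve when income (in $1000s) is high enough
income_k = rng.uniform(20, 100, size=(500, 1))
approved = (income_k[:, 0] > 50).astype(int)
model = LogisticRegression().fit(income_k, approved)

def counterfactual_income(model, x_k, step=1, max_k=200):
    """Smallest income raise (in $1000s) that flips a denial to approval."""
    v = x_k
    while v <= max_k:
        if model.predict([[v]])[0] == 1:
            return v
        v += step
    return None

cf = counterfactual_income(model, 40)    # applicant denied at $40K
print(cf)  # first income at which the prediction flips to 'approve'
```

Good counterfactuals also need to be plausible and actionable, which is what dedicated libraries add on top of this search.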
Attention Visualization (Deep Learning)
For transformer models, visualize attention weights:
# Which tokens the model focuses on (Hugging Face-style API, assuming a
# transformer model that can return attention weights)
outputs = model(**inputs, output_attentions=True)

# Visualize
from bertviz import head_view
head_view(outputs.attentions, tokens)
Warning: Attention ≠ explanation. High attention doesn't always mean importance for prediction.
Explainability Trade-offs
| Model Type | Accuracy | Interpretability |
|---|---|---|
| Linear/Logistic | Lower | High |
| Decision Tree | Lower | High |
| Random Forest | Higher | Medium (importance) |
| XGBoost | Higher | Medium (SHAP) |
| Neural Network | Highest | Low (needs XAI) |
Best Practices
1. Start Interpretable
# Try a simple model first
baseline = LogisticRegression().fit(X_train, y_train)
if baseline.score(X_test, y_test) > threshold:
    deploy(baseline)  # Interpretable by default!
2. Combine Methods
- Global: Feature importance + PDP
- Local: SHAP + counterfactuals
3. Validate Explanations
- Do explanations match domain knowledge?
- Are they consistent across similar instances?
- Do they reveal actual model behavior?
4. Document Limitations
- Explanations are approximations
- May not capture all model behavior
- Can be misleading if misused
Key Takeaways
- Interpretability builds trust and aids debugging
- Simple models (linear, trees) are inherently interpretable
- SHAP provides theoretically grounded feature attributions
- LIME gives local linear approximations
- Use multiple methods for robust understanding
- Always validate explanations against domain knowledge
- Trade-off exists between accuracy and interpretability