Model Explainability & Interpretability
Explainability answers: "Why did the model make this prediction?" It's crucial for trust, debugging, regulatory compliance, and scientific understanding.
Interpretability vs Explainability
Interpretable models: Inherently understandable
- Linear regression, decision trees, rule lists
- Can directly inspect how features affect output
Explainable AI (XAI): Methods to explain black-box models
- SHAP, LIME, attention visualization
- Post-hoc explanations of complex models
Global vs Local Explanations
Global: How does the model work overall?
- Feature importance
- Partial dependence plots
- Model-level summaries
Local: Why this specific prediction?
- LIME, SHAP values for one instance
- Counterfactual explanations
Inherently Interpretable Models
Linear Models
y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ
Interpretation:
- βᵢ = change in y per unit change in xᵢ (holding others constant)
- Sign indicates direction of effect
- Magnitude indicates importance (if features scaled)
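As a concrete illustration, here is a minimal sketch with scikit-learn on synthetic data (the true coefficients of 3 and -2 are made up for the example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data generated as y = 3*x1 - 2*x2 + noise
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)

# beta_i = change in y per unit change in x_i, holding the other fixed
print(model.intercept_)  # ~0
print(model.coef_)       # ~[3, -2]
```

Because the features here are already on the same scale, coefficient magnitudes can be compared directly.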
Decision Trees
[Age > 30?]
├── Yes → [Income > 50K?]
│          ├── Yes → [Approve]
│          └── No  → [Deny]
└── No  → [Deny]
Explanation: "Denied because age ≤ 30"
Rule Lists
IF age > 30 AND income > 50K THEN approve
ELSE IF credit_score > 700 THEN approve
ELSE deny
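A rule list maps directly to code, which is part of its appeal; a sketch using the thresholds from the rules above:

```python
def approve_loan(age, income, credit_score):
    """Apply the rule list top-down; the first matching rule wins."""
    if age > 30 and income > 50_000:
        return "approve"
    elif credit_score > 700:
        return "approve"
    else:
        return "deny"

print(approve_loan(age=45, income=80_000, credit_score=650))  # approve
print(approve_loan(age=25, income=80_000, credit_score=650))  # deny
```

The explanation for any decision is just the first rule that fired.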
Feature Importance
Permutation Importance
Shuffle one feature, measure accuracy drop:
from sklearn.inspection import permutation_importance
result = permutation_importance(model, X_test, y_test, n_repeats=10)
importances = result.importances_mean
Pros: model-agnostic, works with any fitted model.
Cons: slow (many re-scorings); correlated features split importance between them.
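The mechanism is simple enough to sketch from scratch on synthetic data (the scorer and the 10 repeats are choices, not fixed by the method):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = 5 * X[:, 0] + X[:, 1]                # feature 2 carries no signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)

base_score = model.score(X_te, y_te)
importances = []
for j in range(X.shape[1]):
    drops = []
    for _ in range(10):                  # n_repeats
        Xp = X_te.copy()
        rng.shuffle(Xp[:, j])            # break the feature-target link
        drops.append(base_score - model.score(Xp, y_te))
    importances.append(float(np.mean(drops)))

print(importances)  # feature 0 largest, feature 2 near zero
```

Measuring the drop on a held-out set, as here, avoids rewarding features the model only used to overfit.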
Tree-Based Importance
# Mean decrease in impurity
importances = model.feature_importances_
# Or use permutation for more reliable results
Warning: MDI (mean decrease in impurity) importance is biased toward high-cardinality and continuous features.
SHAP (SHapley Additive exPlanations)
Based on cooperative game theory: Shapley values fairly distribute the prediction among the features:
import shap
explainer = shap.TreeExplainer(model) # Or KernelExplainer for any model
shap_values = explainer.shap_values(X)
# Summary plot
shap.summary_plot(shap_values, X)
# Single prediction explanation
shap.force_plot(explainer.expected_value, shap_values[0], X.iloc[0])
Interpreting SHAP Values
Base value (average prediction): 0.50
Feature contributions:
  income = $80K     → +0.30
  age = 45          → +0.15
  debt_ratio = 0.4  → -0.10
                      ------
Final prediction:     0.85
Sum of SHAP values = prediction - base value
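That additivity property can be checked by computing Shapley values exactly for a tiny model. A brute-force sketch over all feature coalitions (exponential in the number of features, so only feasible for a handful; replacing "absent" features with the background mean is one common convention):

```python
import itertools
import math
import numpy as np

def exact_shap(predict, x, background):
    """Brute-force Shapley values for one instance x.
    'Absent' features are replaced by the background mean."""
    n = len(x)
    base = background.mean(axis=0)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for S in itertools.combinations(others, size):
                # Shapley weight for a coalition of this size
                w = (math.factorial(size) * math.factorial(n - size - 1)
                     / math.factorial(n))
                with_i = base.copy()
                with_i[list(S) + [i]] = x[list(S) + [i]]
                without = base.copy()
                without[list(S)] = x[list(S)]
                phi[i] += w * (predict(with_i) - predict(without))
    return phi

# For a linear model, phi_i works out to beta_i * (x_i - mean_i)
beta = np.array([2.0, -1.0, 0.5])
predict = lambda v: float(v @ beta)

background = np.random.default_rng(0).normal(size=(100, 3))
x = np.array([1.0, 2.0, 3.0])

phi = exact_shap(predict, x, background)
base_value = predict(background.mean(axis=0))

# Additivity: sum of SHAP values = prediction - base value
print(phi.sum(), predict(x) - base_value)
```

TreeExplainer and KernelExplainer compute or approximate the same quantity far more efficiently.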
SHAP Plots
Summary plot: Feature importance + direction of effects
Dependence plot: How one feature affects predictions
Force plot: Single prediction breakdown
Waterfall plot: Step-by-step contribution
LIME (Local Interpretable Model-agnostic Explanations)
Approximate complex model locally with simple model:
import lime.lime_tabular
explainer = lime.lime_tabular.LimeTabularExplainer(
    X_train.values,
    feature_names=list(X_train.columns),
    class_names=['Negative', 'Positive'],
    mode='classification'
)
# explain_instance expects a 1-D array, not a pandas row
exp = explainer.explain_instance(X_test.iloc[0].values, model.predict_proba)
exp.show_in_notebook()
How LIME Works
- Generate perturbed samples around the instance
- Get model predictions for perturbations
- Fit weighted linear model locally
- Linear coefficients = feature importance for that instance
Pros: model-agnostic, intuitive explanations.
Cons: explanations can be unstable and depend on the perturbation method.
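The four steps above can be sketched from scratch in a few lines (a toy nonlinear function stands in for the black box; the kernel width and perturbation scale are free choices):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# "Black box": nonlinear in feature 0, linear in feature 1
black_box = lambda X: np.sin(3 * X[:, 0]) + 0.5 * X[:, 1]

x = np.array([0.2, 1.0])                 # instance to explain

# 1. Generate perturbed samples around the instance
Z = x + rng.normal(scale=0.1, size=(500, 2))
# 2. Get model predictions for the perturbations
yz = black_box(Z)
# 3. Weight samples by proximity (Gaussian kernel)
w = np.exp(-np.sum((Z - x) ** 2, axis=1) / 0.1 ** 2)
# 4. Fit a weighted linear surrogate locally
surrogate = Ridge(alpha=1e-3).fit(Z, yz, sample_weight=w)

# Local slopes approximate the gradient at x: ~[3*cos(0.6), 0.5]
print(surrogate.coef_)
```

The surrogate's coefficients are the local feature importances; a different perturbation scale can give different answers, which is exactly the instability caveat above.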
Partial Dependence Plots
Show average effect of a feature on predictions:
from sklearn.inspection import PartialDependenceDisplay
PartialDependenceDisplay.from_estimator(model, X, features=['age', 'income'])
Shows: the marginal effect of a feature, averaged over the other features.
Limitation: assumes feature independence; can mislead when features are correlated.
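The averaging a PDP performs is easy to make explicit: clamp the feature of interest at each grid value for every row, predict, and average (a sketch on synthetic data where the true effect of feature 0 is quadratic):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(400, 2))
y = X[:, 0] ** 2 + X[:, 1]               # quadratic in feature 0

model = GradientBoostingRegressor(random_state=0).fit(X, y)

grid = np.linspace(-2, 2, 21)
pdp = []
for v in grid:
    Xv = X.copy()
    Xv[:, 0] = v                          # clamp feature 0 for every row
    pdp.append(model.predict(Xv).mean())  # average over the other features
pdp = np.array(pdp)

print(pdp.argmin())  # minimum near the middle: the U-shape of x0**2
```

Clamping every row to the same value is also where the independence assumption bites: it can create feature combinations the model never saw.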
Individual Conditional Expectation (ICE)
Like PDP but shows line for each instance:
PartialDependenceDisplay.from_estimator(
    model, X, features=['age'], kind='both'  # Shows both ICE and PDP
)
Reveals heterogeneous effects across instances.
Counterfactual Explanations
"What would need to change for a different prediction?"
Prediction: Loan denied
Counterfactual:
- If income increased from $40K to $52K → Approved
- OR if debt_ratio decreased from 0.45 to 0.30 → Approved
import dice_ml
dice_exp = dice_ml.Dice(data, model)
counterfactuals = dice_exp.generate_counterfactuals(
    instance, total_CFs=3, desired_class="opposite"
)
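The underlying idea can be sketched with a simple search: scan nearby feature values until the model's decision flips (a toy single-feature approval model; real libraries like DiCE optimize over many features at once):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy approval model: approve when income (in $1000s) is high enough
income_k = rng.uniform(20, 100, size=(500, 1))
approved = (income_k[:, 0] > 50).astype(int)
model = LogisticRegression().fit(income_k, approved)

def counterfactual_income(model, x_k, step=1, max_k=200):
    """Smallest income raise (in $1000s) that flips a denial to approval."""
    v = x_k
    while v <= max_k:
        if model.predict([[v]])[0] == 1:
            return v
        v += step
    return None

cf = counterfactual_income(model, 40)    # applicant denied at $40K
print(cf)  # first income at which the prediction flips to 'approve'
```

Good counterfactuals also need to be plausible and actionable, which is what dedicated libraries add on top of this search.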
Attention Visualization (Deep Learning)
For transformer models, visualize attention weights:
# Which tokens the model focuses on (Hugging Face-style API, assuming a
# transformer model that can return attention weights)
outputs = model(**inputs, output_attentions=True)

# Visualize
from bertviz import head_view
head_view(outputs.attentions, tokens)
Warning: Attention ≠ explanation. High attention doesn't always mean importance for prediction.
Explainability Trade-offs
| Model Type | Accuracy | Interpretability |
|---|---|---|
| Linear/Logistic | Lower | High |
| Decision Tree | Lower | High |
| Random Forest | Higher | Medium (importance) |
| XGBoost | Higher | Medium (SHAP) |
| Neural Network | Highest | Low (needs XAI) |
Best Practices
1. Start Interpretable
# Try a simple model first
baseline = LogisticRegression().fit(X_train, y_train)
if baseline.score(X_test, y_test) > threshold:
    deploy(baseline)  # Interpretable by default!
2. Combine Methods
- Global: Feature importance + PDP
- Local: SHAP + counterfactuals
3. Validate Explanations
- Do explanations match domain knowledge?
- Are they consistent across similar instances?
- Do they reveal actual model behavior?
4. Document Limitations
- Explanations are approximations
- May not capture all model behavior
- Can be misleading if misused
Key Takeaways
- Interpretability builds trust and aids debugging
- Simple models (linear, trees) are inherently interpretable
- SHAP provides theoretically grounded feature attributions
- LIME gives local linear approximations
- Use multiple methods for robust understanding
- Always validate explanations against domain knowledge
- Trade-off exists between accuracy and interpretability