Ensemble Methods
Ensemble methods combine multiple models to produce better predictions. The idea is simple: a committee of models is often smarter than any individual model.
Why Ensembles Work
Wisdom of the Crowd
If models make independent, zero-mean errors, averaging reduces the error:
Single model error (standard deviation): ε
Average of N independent models: ε/√N
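The ε/√N claim is easy to check by simulation. A minimal sketch (the names `true_value`, `n_models`, and the noise level are illustrative choices):

```python
import random
import statistics

random.seed(0)
true_value = 10.0
epsilon = 2.0     # per-model error standard deviation
n_models = 25
n_trials = 2000

single_errors, ensemble_errors = [], []
for _ in range(n_trials):
    # each model's prediction = truth + independent Gaussian noise
    preds = [true_value + random.gauss(0, epsilon) for _ in range(n_models)]
    single_errors.append(preds[0] - true_value)                   # one model alone
    ensemble_errors.append(statistics.mean(preds) - true_value)   # the committee

print(statistics.stdev(single_errors))    # ≈ epsilon
print(statistics.stdev(ensemble_errors))  # ≈ epsilon / sqrt(n_models)
```

With 25 models the ensemble's error spread lands near 2.0/√25 = 0.4, a 5× reduction. Real models rarely make fully independent errors, so the practical gain is smaller.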
Bias-Variance Decomposition
- Bagging: Reduces variance (averaging stabilizes)
- Boosting: Reduces bias (sequential correction)
- Stacking: Can reduce both
Bagging (Bootstrap Aggregating)
How It Works
- Create N bootstrap samples (sample with replacement)
- Train one model on each sample
- Aggregate predictions (vote or average)
Data → [Bootstrap 1] → Model 1 →  \
     → [Bootstrap 2] → Model 2 →   → Aggregate → Prediction
     → [Bootstrap 3] → Model 3 →  /
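The three steps can be written out by hand. A sketch using scikit-learn decision trees on a synthetic dataset (the ensemble size and dataset are arbitrary choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, random_state=0)

# 1. Create N bootstrap samples; 2. train one model on each
models = []
for _ in range(10):
    idx = rng.integers(0, len(X), size=len(X))  # sample with replacement
    models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# 3. Aggregate predictions by majority vote (labels are 0/1)
votes = np.stack([m.predict(X) for m in models])      # (n_models, n_samples)
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print((bagged_pred == y).mean())  # training accuracy of the bagged ensemble
```

In practice `sklearn.ensemble.BaggingClassifier` wraps exactly this loop.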
Random Forest
Bagging + random feature selection:
- Each tree sees random subset of features
- Reduces correlation between trees
- Even better variance reduction
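In scikit-learn the per-split feature subsampling is controlled by `max_features`. A brief sketch on a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# max_features="sqrt": each split considers a random subset of ~sqrt(20) ≈ 4
# features, which decorrelates the trees and improves variance reduction
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            random_state=0)
rf.fit(X, y)
print(rf.score(X, y))
```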
When to Use
- High-variance models (deep trees)
- Want to reduce overfitting
- Can parallelize training
Boosting
How It Works
Train models sequentially, each focusing on previous errors:
Model 1: Train on all data
            ↓ identify errors
Model 2: Focus on Model 1's mistakes
            ↓ identify remaining errors
Model 3: Focus on remaining mistakes
            ↓
Final:   Weighted combination of all models
AdaBoost
- Reweight samples based on errors
- Misclassified samples get higher weight
- Each model votes with weight based on accuracy
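scikit-learn's `AdaBoostClassifier` implements this reweighting loop with depth-1 trees (decision stumps) as the default weak learner. A sketch on a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Each stump is fit to the reweighted samples; estimator_weights_ holds the
# per-model voting weights derived from each model's accuracy
ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(ada.score(X, y))
print(ada.estimator_weights_[:3])
```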
Gradient Boosting
- Fit new model to residuals (errors)
- Add to ensemble with learning rate
- More flexible than AdaBoost
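The residual-fitting loop can be written out by hand for regression. A minimal sketch (the learning rate, tree depth, and toy dataset are arbitrary choices):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
pred = np.full_like(y, y.mean())  # start from a constant prediction
for _ in range(100):
    residuals = y - pred                      # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)                    # fit new model to the residuals
    pred += learning_rate * tree.predict(X)   # add it, damped by learning rate

print(np.mean((y - pred) ** 2))  # MSE shrinks as models are added
```

Each round moves the ensemble a small step toward the remaining error, which is why the learning rate trades training speed for robustness.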
XGBoost, LightGBM, CatBoost
Optimized gradient boosting implementations:
- Regularization
- Efficient computation
- Handling missing values
- State-of-the-art for tabular data
When to Use
- Want maximum predictive power
- Tabular/structured data
- Can accept longer training time
Stacking
How It Works
Use model predictions as features for a meta-model:
Level 0:  [Model A]   [Model B]   [Model C]
              ↓           ↓           ↓
            pred_A      pred_B      pred_C
                 \        |        /
Level 1:      [Meta-Model (Blender)]
                       ↓
               Final Prediction
Key Points
- Use cross-validation predictions to avoid leakage
- Meta-model learns optimal combination
- Can stack multiple levels
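The leakage-free recipe looks roughly like this; scikit-learn's `StackingClassifier` performs the cross-validated step internally, but the sketch below makes it explicit (base models and meta-model are arbitrary choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=300, random_state=0)

# Level 0: out-of-fold predictions, so the meta-model never sees a base
# model's prediction on a sample that model was trained on (no leakage)
base_models = [RandomForestClassifier(random_state=0),
               GradientBoostingClassifier(random_state=0)]
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Level 1: the meta-model learns how to weight the base predictions
meta_model = LogisticRegression().fit(meta_features, y)
print(meta_model.score(meta_features, y))
```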
When to Use
- Diverse base models available
- Competition/maximum performance needed
- Have enough data for validation
Voting
Hard Voting
Majority vote for classification:
Model A: Class 1
Model B: Class 0
Model C: Class 1
Result: Class 1 (2 vs 1)
Soft Voting
Average the predicted class probabilities (ordered [P(Class 0), P(Class 1)]):
Model A: [0.7, 0.3]
Model B: [0.4, 0.6]
Model C: [0.8, 0.2]
Average: [0.63, 0.37] → Class 0
Soft voting usually works better because it uses each model's confidence, not just its vote.
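The averaging above is plain element-wise arithmetic:

```python
probs = [
    [0.7, 0.3],  # Model A: [P(Class 0), P(Class 1)]
    [0.4, 0.6],  # Model B
    [0.8, 0.2],  # Model C
]

# average each column across models, then pick the most probable class
avg = [sum(col) / len(probs) for col in zip(*probs)]
predicted_class = avg.index(max(avg))
print([round(p, 2) for p in avg], "-> Class", predicted_class)
# prints [0.63, 0.37] -> Class 0
```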
Comparison
| Method | Reduces | Training | Diversity |
|---|---|---|---|
| Bagging | Variance | Parallel | Bootstrap samples |
| Boosting | Bias | Sequential | Error focus |
| Stacking | Both | Two-stage | Different algorithms |
| Voting | Variance | Independent | Different algorithms |
Practical Tips
Diversity Matters
Ensemble gains come from diversity:
- Different algorithms (trees, linear, neural)
- Different hyperparameters
- Different feature subsets
- Different training samples
Diminishing Returns
1 model → 5 models: Big improvement
5 models → 10 models: Moderate improvement
10 models → 100 models: Small improvement
Computation vs Accuracy
- Simple average of 3-5 models often sufficient
- Beyond 10 models rarely worth the cost
- Production constraints may limit ensemble size
Code Example
from sklearn.ensemble import (
    RandomForestClassifier,
    GradientBoostingClassifier,
    VotingClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rf = RandomForestClassifier()
gb = GradientBoostingClassifier()

# Voting: soft voting averages predicted probabilities,
# so SVC needs probability=True
voting = VotingClassifier([
    ('rf', rf),
    ('gb', gb),
    ('svm', SVC(probability=True)),
], voting='soft')

# Stacking: a logistic-regression meta-model combines the base predictions
stacking = StackingClassifier(
    estimators=[('rf', rf), ('gb', gb)],
    final_estimator=LogisticRegression(),
)
Key Takeaways
- Ensembles combine models for better predictions
- Bagging reduces variance (Random Forest)
- Boosting reduces bias (XGBoost, LightGBM)
- Stacking learns optimal combination
- Diversity among base models is crucial
- Often the winning approach in competitions