Ensemble Methods
Ensemble methods combine multiple models to produce better predictions. The idea is simple: a committee of models is often smarter than any individual model.
Why Ensembles Work
Wisdom of the Crowd
If models make independent, zero-mean errors, averaging reduces the error:
Single model error (standard deviation): ε
Average of N independent models: ε/√N
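The ε/√N claim is easy to check by simulation. A minimal sketch (the names `true_value`, `n_models`, and the noise level are illustrative choices):

```python
import random
import statistics

random.seed(0)
true_value = 10.0
epsilon = 2.0     # per-model error standard deviation
n_models = 25
n_trials = 2000

single_errors, ensemble_errors = [], []
for _ in range(n_trials):
    # each model's prediction = truth + independent Gaussian noise
    preds = [true_value + random.gauss(0, epsilon) for _ in range(n_models)]
    single_errors.append(preds[0] - true_value)                   # one model alone
    ensemble_errors.append(statistics.mean(preds) - true_value)   # the committee

print(statistics.stdev(single_errors))    # ≈ epsilon
print(statistics.stdev(ensemble_errors))  # ≈ epsilon / sqrt(n_models)
```

With 25 models the ensemble's error spread lands near 2.0/√25 = 0.4, a 5× reduction. Real models rarely make fully independent errors, so the practical gain is smaller.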
Bias-Variance Decomposition
- Bagging: Reduces variance (averaging stabilizes)
- Boosting: Reduces bias (sequential correction)
- Stacking: Can reduce both
Bagging (Bootstrap Aggregating)
How It Works
- Create N bootstrap samples (sample with replacement)
- Train one model on each sample
- Aggregate predictions (vote or average)
Data → [Bootstrap 1] → Model 1 →  \
     → [Bootstrap 2] → Model 2 →   → Aggregate → Prediction
     → [Bootstrap 3] → Model 3 →  /
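The three steps can be written out by hand. A sketch using scikit-learn decision trees on a synthetic dataset (the ensemble size and dataset are arbitrary choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, random_state=0)

# 1. Create N bootstrap samples; 2. train one model on each
models = []
for _ in range(10):
    idx = rng.integers(0, len(X), size=len(X))  # sample with replacement
    models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# 3. Aggregate predictions by majority vote (labels are 0/1)
votes = np.stack([m.predict(X) for m in models])      # (n_models, n_samples)
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print((bagged_pred == y).mean())  # training accuracy of the bagged ensemble
```

In practice `sklearn.ensemble.BaggingClassifier` wraps exactly this loop.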
Random Forest
Bagging + random feature selection:
- Each tree sees random subset of features
- Reduces correlation between trees
- Even better variance reduction
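In scikit-learn the per-split feature subsampling is controlled by `max_features`. A brief sketch on a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# max_features="sqrt": each split considers a random subset of ~sqrt(20) ≈ 4
# features, which decorrelates the trees and improves variance reduction
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            random_state=0)
rf.fit(X, y)
print(rf.score(X, y))
```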
When to Use
- High-variance models (deep trees)
- Want to reduce overfitting
- Can parallelize training
Boosting
How It Works
Train models sequentially, each focusing on previous errors:
Model 1: Train on all data
            ↓ identify errors
Model 2: Focus on Model 1's mistakes
            ↓ identify remaining errors
Model 3: Focus on remaining mistakes
            ↓
Final:   Weighted combination of all models
AdaBoost
- Reweight samples based on errors
- Misclassified samples get higher weight
- Each model votes with weight based on accuracy
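scikit-learn's `AdaBoostClassifier` implements this reweighting loop with depth-1 trees (decision stumps) as the default weak learner. A sketch on a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Each stump is fit to the reweighted samples; estimator_weights_ holds the
# per-model voting weights derived from each model's accuracy
ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(ada.score(X, y))
print(ada.estimator_weights_[:3])
```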
Gradient Boosting
- Fit new model to residuals (errors)
- Add to ensemble with learning rate
- More flexible than AdaBoost
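The residual-fitting loop can be written out by hand for regression. A minimal sketch (the learning rate, tree depth, and toy dataset are arbitrary choices):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
pred = np.full_like(y, y.mean())  # start from a constant prediction
for _ in range(100):
    residuals = y - pred                      # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)                    # fit new model to the residuals
    pred += learning_rate * tree.predict(X)   # add it, damped by learning rate

print(np.mean((y - pred) ** 2))  # MSE shrinks as models are added
```

Each round moves the ensemble a small step toward the remaining error, which is why the learning rate trades training speed for robustness.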
XGBoost, LightGBM, CatBoost
Optimized gradient boosting implementations:
- Regularization
- Efficient computation
- Handling missing values
- State-of-the-art for tabular data
When to Use
- Want maximum predictive power
- Tabular/structured data
- Can accept longer training time
Stacking
How It Works
Use model predictions as features for a meta-model:
Level 0:  [Model A]   [Model B]   [Model C]
              ↓           ↓           ↓
            pred_A      pred_B      pred_C
                 \        |        /
Level 1:      [Meta-Model (Blender)]
                       ↓
               Final Prediction
Key Points
- Use cross-validation predictions to avoid leakage
- Meta-model learns optimal combination
- Can stack multiple levels
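The leakage-free recipe looks roughly like this; scikit-learn's `StackingClassifier` performs the cross-validated step internally, but the sketch below makes it explicit (base models and meta-model are arbitrary choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=300, random_state=0)

# Level 0: out-of-fold predictions, so the meta-model never sees a base
# model's prediction on a sample that model was trained on (no leakage)
base_models = [RandomForestClassifier(random_state=0),
               GradientBoostingClassifier(random_state=0)]
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Level 1: the meta-model learns how to weight the base predictions
meta_model = LogisticRegression().fit(meta_features, y)
print(meta_model.score(meta_features, y))
```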
When to Use
- Diverse base models available
- Competition/maximum performance needed
- Have enough data for validation
Voting
Hard Voting
Majority vote for classification:
Model A: Class 1
Model B: Class 0
Model C: Class 1
Result: Class 1 (2 vs 1)
Soft Voting
Average the predicted class probabilities (ordered [P(Class 0), P(Class 1)]):
Model A: [0.7, 0.3]
Model B: [0.4, 0.6]
Model C: [0.8, 0.2]
Average: [0.63, 0.37] → Class 0
Soft voting usually works better because it uses each model's confidence, not just its vote.
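The averaging above is plain element-wise arithmetic:

```python
probs = [
    [0.7, 0.3],  # Model A: [P(Class 0), P(Class 1)]
    [0.4, 0.6],  # Model B
    [0.8, 0.2],  # Model C
]

# average each column across models, then pick the most probable class
avg = [sum(col) / len(probs) for col in zip(*probs)]
predicted_class = avg.index(max(avg))
print([round(p, 2) for p in avg], "-> Class", predicted_class)
# prints [0.63, 0.37] -> Class 0
```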
Comparison
| Method | Reduces | Training | Diversity |
|---|---|---|---|
| Bagging | Variance | Parallel | Bootstrap samples |
| Boosting | Bias | Sequential | Error focus |
| Stacking | Both | Two-stage | Different algorithms |
| Voting | Variance | Independent | Different algorithms |
Practical Tips
Diversity Matters
Ensemble gains come from diversity:
- Different algorithms (trees, linear, neural)
- Different hyperparameters
- Different feature subsets
- Different training samples
Diminishing Returns
1 model → 5 models: Big improvement
5 models → 10 models: Moderate improvement
10 models → 100 models: Small improvement
Computation vs Accuracy
- Simple average of 3-5 models often sufficient
- Beyond 10 models rarely worth the cost
- Production constraints may limit ensemble size
Code Example
from sklearn.ensemble import (
    RandomForestClassifier,
    GradientBoostingClassifier,
    VotingClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rf = RandomForestClassifier()
gb = GradientBoostingClassifier()

# Voting: soft voting averages predicted probabilities,
# so SVC needs probability=True
voting = VotingClassifier([
    ('rf', rf),
    ('gb', gb),
    ('svm', SVC(probability=True)),
], voting='soft')

# Stacking: a logistic-regression meta-model combines the base predictions
stacking = StackingClassifier(
    estimators=[('rf', rf), ('gb', gb)],
    final_estimator=LogisticRegression(),
)
Key Takeaways
- Ensembles combine models for better predictions
- Bagging reduces variance (Random Forest)
- Boosting reduces bias (XGBoost, LightGBM)
- Stacking learns optimal combination
- Diversity among base models is crucial
- Often the winning approach in competitions