# Random Forests
Random Forest is an ensemble learning method that builds multiple decision trees and combines their predictions. It's one of the most successful and widely used algorithms in machine learning.
## The Core Idea
The wisdom of crowds: aggregate predictions from many diverse trees to get a better answer than any single tree.
```
Tree 1   Tree 2   Tree 3   ...   Tree N
  ↓        ↓        ↓              ↓
Pred 1   Pred 2   Pred 3         Pred N
    \       |        |          /
     ↘      ↓        ↓        ↙
     [Aggregate Predictions]
               ↓
       Final Prediction
```
- Classification: Majority vote
- Regression: Average of predictions
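The two aggregation rules can be sketched in a few lines of plain Python (the tree outputs below are made-up numbers for illustration):

```python
from collections import Counter

# Hypothetical predictions from N = 5 trees for a single sample.
tree_votes = ["cat", "dog", "cat", "cat", "dog"]
majority = Counter(tree_votes).most_common(1)[0][0]  # classification: majority vote

tree_preds = [3.1, 2.9, 3.4, 3.0, 3.2]
average = sum(tree_preds) / len(tree_preds)          # regression: mean prediction

print(majority, average)  # → cat 3.12
```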
## Why It Works: The Two Randomizations

### 1. Bootstrap Sampling (Bagging)
Each tree is trained on a different random sample:
- Sample n points with replacement from training data
- About 63% unique points per tree (rest are duplicates)
- Creates diversity among trees
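The ~63% figure (1 − 1/e ≈ 0.632) is easy to verify empirically; this small simulation draws one bootstrap sample and counts its unique points:

```python
import random

random.seed(0)
n = 10_000
# Bootstrap sample: draw n indices with replacement.
sample = [random.randrange(n) for _ in range(n)]
unique_fraction = len(set(sample)) / n
print(f"{unique_fraction:.3f}")  # ≈ 0.632, i.e. 1 - 1/e
```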
### 2. Feature Randomization
At each split, only consider a random subset of features:
- Typically √p features for classification
- Typically p/3 features for regression
- Decorrelates trees even more
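A minimal sketch of the per-split feature subset, using the two defaults above (p = 100 is an arbitrary example feature count):

```python
import math
import random

p = 100                                 # hypothetical total number of features
features = list(range(p))

m_clf = max(1, int(math.sqrt(p)))       # √p  → 10 candidates (classification)
m_reg = max(1, p // 3)                  # p/3 → 33 candidates (regression)

# At each split, a fresh random subset of features is considered.
candidates = random.sample(features, m_clf)
print(m_clf, m_reg, len(candidates))    # → 10 33 10
```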
## The Variance Reduction
If trees were independent with variance σ², averaging N trees gives variance σ²/N.
But trees are correlated (trained on same data). The actual formula:
Var(average) = ρσ² + (1-ρ)σ²/N
Where ρ is the correlation between trees. Feature randomization reduces ρ!
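Plugging numbers into the formula shows why ρ matters so much: with many trees, the σ²/N term vanishes and the correlation term ρσ² dominates. (σ² = 1 and N = 500 below are arbitrary illustrative values.)

```python
def ensemble_variance(rho: float, sigma2: float, n_trees: int) -> float:
    """Var(average) = ρσ² + (1-ρ)σ²/N for N equally correlated trees."""
    return rho * sigma2 + (1 - rho) * sigma2 / n_trees

# With σ² = 1 and N = 500, the correlation term dominates:
for rho in (0.0, 0.3, 0.7):
    print(f"rho={rho}: Var(average) = {ensemble_variance(rho, 1.0, 500):.4f}")
```

With ρ = 0 the variance is a tiny 0.002, but at ρ = 0.7 it stays near 0.7 no matter how many trees you add, which is exactly why feature randomization targets ρ.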
## Out-of-Bag (OOB) Evaluation
A clever trick: each tree doesn't see ~37% of the data. Use this for free validation:
- For each training point, find trees that didn't train on it
- Get their predictions
- OOB score ≈ test set performance
No need for separate validation set!
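In scikit-learn (assumed here as the implementation), OOB evaluation is a single flag; `oob_score_` then scores each training point using only the trees that never saw it:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# oob_score=True evaluates each point with the trees that never trained on it.
clf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
clf.fit(X, y)
print(f"OOB accuracy: {clf.oob_score_:.3f}")  # close to held-out test accuracy
```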
## Key Hyperparameters
| Parameter | Typical Values | Effect |
|---|---|---|
| n_estimators | 100-1000 | More trees = better (diminishing returns) |
| max_depth | None or deep | Deep trees for low bias |
| min_samples_leaf | 1-5 | Higher = more regularization |
| max_features | sqrt(p) or p/3 | Lower = less correlation, more bias |
| bootstrap | True | False trains each tree on the full dataset (no bagging) |
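As a starting point, the table's settings map directly onto scikit-learn's constructor (the specific values below are illustrative, not prescriptive):

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative starting configuration matching the hyperparameter table.
rf = RandomForestClassifier(
    n_estimators=300,      # more trees: better, with diminishing returns
    max_depth=None,        # grow deep trees for low bias
    min_samples_leaf=1,    # raise for more regularization
    max_features="sqrt",   # √p features considered per split
    bootstrap=True,        # bagging on
    n_jobs=-1,             # trees are independent → train in parallel
    random_state=0,
)
print(rf.n_estimators, rf.max_features)
```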
## Advantages
- Excellent performance: Often among top methods
- Hard to overfit: Adding trees doesn't hurt
- Handles high dimensions: Feature sampling helps
- No feature scaling needed: Tree-based
- Gives feature importance: Based on splits
- Built-in validation: OOB score
- Parallelizable: Trees are independent
## Disadvantages
- Less interpretable than single tree: Many trees to examine
- Slow for very large data: Many trees to build/predict
- Memory intensive: Stores all trees
- Doesn't extrapolate: Can't predict beyond training range
- Biased toward high-cardinality features: More split points
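The extrapolation limitation is easy to demonstrate: a forest's prediction is an average of training targets, so it cannot exceed the range seen in training. A minimal sketch with scikit-learn (the toy linear data is illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.linspace(0, 10, 200).reshape(-1, 1)
y = 2 * X.ravel()                        # simple linear trend, targets in [0, 20]

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
pred = rf.predict([[20.0]])[0]
# A linear model would predict 40; the forest is capped near the training max (~20).
print(pred)
```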
## Feature Importance
Two common methods:
### Mean Decrease in Impurity (MDI)
- Sum of impurity decreases from splits using each feature
- Fast but biased toward high-cardinality features
### Permutation Importance
- Shuffle feature values, measure accuracy drop
- More reliable but slower
- Works on validation data
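A sketch using scikit-learn's `permutation_importance` on synthetic data where only one of two features is informative (run on the training set for brevity here; validation data is preferable, as noted above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)            # only feature 0 carries signal

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
# Shuffling the informative feature hurts accuracy far more than the noise one.
print(result.importances_mean)
```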
## When to Use Random Forests
Good for:
- Tabular data
- When interpretability matters somewhat
- Quick baseline that often works well
- When you want feature importance
Consider alternatives:
- Very large datasets: Gradient boosting may be faster
- Unstructured data (images, text): Neural networks
- Need maximum performance: XGBoost/LightGBM often better
## Random Forest vs. Gradient Boosting
| Aspect | Random Forest | Gradient Boosting |
|---|---|---|
| Tree building | Parallel | Sequential |
| Reduces | Variance | Bias (then variance) |
| Tuning | Easy | More sensitive |
| Risk of overfitting | Low | Higher |
| Training speed | Fast | Can be slow |
| Prediction | Moderate | Fast |
## Key Takeaways
- Random Forest = many trees + bootstrap + feature sampling
- Reduces variance through averaging diverse trees
- OOB score gives free validation
- Robust, hard to overfit, easy to tune
- Often an excellent first choice for tabular data