# Random Forests
Random Forest is an ensemble learning method that builds multiple decision trees and combines their predictions. It's one of the most successful and widely used algorithms in machine learning.
## The Core Idea
The wisdom of crowds: aggregate predictions from many diverse trees to get a better answer than any single tree.
```
Tree 1   Tree 2   Tree 3   ...   Tree N
  ↓        ↓        ↓              ↓
Pred 1   Pred 2   Pred 3         Pred N
    \       |        |          /
     ↘      ↓        ↓        ↙
     [Aggregate Predictions]
               ↓
       Final Prediction
```
- Classification: Majority vote
- Regression: Average of predictions
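The two aggregation rules can be sketched in a few lines of plain Python (the tree outputs below are made-up numbers for illustration):

```python
from collections import Counter

# Hypothetical predictions from N = 5 trees for a single sample.
tree_votes = ["cat", "dog", "cat", "cat", "dog"]
majority = Counter(tree_votes).most_common(1)[0][0]  # classification: majority vote

tree_preds = [3.1, 2.9, 3.4, 3.0, 3.2]
average = sum(tree_preds) / len(tree_preds)          # regression: mean prediction

print(majority, average)  # → cat 3.12
```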
## Why It Works: The Two Randomizations

### 1. Bootstrap Sampling (Bagging)
Each tree is trained on a different random sample:
- Sample n points with replacement from training data
- About 63% unique points per tree (rest are duplicates)
- Creates diversity among trees
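The ~63% figure (1 − 1/e ≈ 0.632) is easy to verify empirically; this small simulation draws one bootstrap sample and counts its unique points:

```python
import random

random.seed(0)
n = 10_000
# Bootstrap sample: draw n indices with replacement.
sample = [random.randrange(n) for _ in range(n)]
unique_fraction = len(set(sample)) / n
print(f"{unique_fraction:.3f}")  # ≈ 0.632, i.e. 1 - 1/e
```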
### 2. Feature Randomization
At each split, only consider a random subset of features:
- Typically √p features for classification
- Typically p/3 features for regression
- Decorrelates trees even more
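A minimal sketch of the per-split feature subset, using the two defaults above (p = 100 is an arbitrary example feature count):

```python
import math
import random

p = 100                                 # hypothetical total number of features
features = list(range(p))

m_clf = max(1, int(math.sqrt(p)))       # √p  → 10 candidates (classification)
m_reg = max(1, p // 3)                  # p/3 → 33 candidates (regression)

# At each split, a fresh random subset of features is considered.
candidates = random.sample(features, m_clf)
print(m_clf, m_reg, len(candidates))    # → 10 33 10
```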
## The Variance Reduction
If trees were independent with variance σ², averaging N trees gives variance σ²/N.
But trees are correlated (trained on same data). The actual formula:
Var(average) = ρσ² + (1-ρ)σ²/N
Where ρ is the correlation between trees. Feature randomization reduces ρ!
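Plugging numbers into the formula shows why ρ matters so much: with many trees, the σ²/N term vanishes and the correlation term ρσ² dominates. (σ² = 1 and N = 500 below are arbitrary illustrative values.)

```python
def ensemble_variance(rho: float, sigma2: float, n_trees: int) -> float:
    """Var(average) = ρσ² + (1-ρ)σ²/N for N equally correlated trees."""
    return rho * sigma2 + (1 - rho) * sigma2 / n_trees

# With σ² = 1 and N = 500, the correlation term dominates:
for rho in (0.0, 0.3, 0.7):
    print(f"rho={rho}: Var(average) = {ensemble_variance(rho, 1.0, 500):.4f}")
```

With ρ = 0 the variance is a tiny 0.002, but at ρ = 0.7 it stays near 0.7 no matter how many trees you add, which is exactly why feature randomization targets ρ.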
## Out-of-Bag (OOB) Evaluation
A clever trick: each tree doesn't see ~37% of the data. Use this for free validation:
- For each training point, find trees that didn't train on it
- Get their predictions
- OOB score ≈ test set performance
No need for separate validation set!
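In scikit-learn (assumed here as the implementation), OOB evaluation is a single flag; `oob_score_` then scores each training point using only the trees that never saw it:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# oob_score=True evaluates each point with the trees that never trained on it.
clf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
clf.fit(X, y)
print(f"OOB accuracy: {clf.oob_score_:.3f}")  # close to held-out test accuracy
```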
## Key Hyperparameters
| Parameter | Typical Values | Effect |
|---|---|---|
| n_estimators | 100-1000 | More trees = better (diminishing returns) |
| max_depth | None or deep | Deep trees for low bias |
| min_samples_leaf | 1-5 | Higher = more regularization |
| max_features | sqrt(p) or p/3 | Lower = less correlation, more bias |
| bootstrap | True | False trains each tree on the full dataset (no bagging) |
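As a starting point, the table's settings map directly onto scikit-learn's constructor (the specific values below are illustrative, not prescriptive):

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative starting configuration matching the hyperparameter table.
rf = RandomForestClassifier(
    n_estimators=300,      # more trees: better, with diminishing returns
    max_depth=None,        # grow deep trees for low bias
    min_samples_leaf=1,    # raise for more regularization
    max_features="sqrt",   # √p features considered per split
    bootstrap=True,        # bagging on
    n_jobs=-1,             # trees are independent → train in parallel
    random_state=0,
)
print(rf.n_estimators, rf.max_features)
```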
## Advantages
- Excellent performance: Often among top methods
- Hard to overfit: Adding trees doesn't hurt
- Handles high dimensions: Feature sampling helps
- No feature scaling needed: Tree-based
- Gives feature importance: Based on splits
- Built-in validation: OOB score
- Parallelizable: Trees are independent
## Disadvantages
- Less interpretable than single tree: Many trees to examine
- Slow for very large data: Many trees to build/predict
- Memory intensive: Stores all trees
- Doesn't extrapolate: Can't predict beyond training range
- Biased toward high-cardinality features: More split points
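The extrapolation limitation is easy to demonstrate: a forest's prediction is an average of training targets, so it cannot exceed the range seen in training. A minimal sketch with scikit-learn (the toy linear data is illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.linspace(0, 10, 200).reshape(-1, 1)
y = 2 * X.ravel()                        # simple linear trend, targets in [0, 20]

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
pred = rf.predict([[20.0]])[0]
# A linear model would predict 40; the forest is capped near the training max (~20).
print(pred)
```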
## Feature Importance
Two common methods:
### Mean Decrease in Impurity (MDI)
- Sum of impurity decreases from splits using each feature
- Fast but biased toward high-cardinality features
### Permutation Importance
- Shuffle feature values, measure accuracy drop
- More reliable but slower
- Works on validation data
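A sketch using scikit-learn's `permutation_importance` on synthetic data where only one of two features is informative (run on the training set for brevity here; validation data is preferable, as noted above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)            # only feature 0 carries signal

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
# Shuffling the informative feature hurts accuracy far more than the noise one.
print(result.importances_mean)
```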
## When to Use Random Forests
Good for:
- Tabular data
- When interpretability matters somewhat
- Quick baseline that often works well
- When you want feature importance
Consider alternatives:
- Very large datasets: Gradient boosting may be faster
- Unstructured data (images, text): Neural networks
- Need maximum performance: XGBoost/LightGBM often better
## Random Forest vs. Gradient Boosting
| Aspect | Random Forest | Gradient Boosting |
|---|---|---|
| Tree building | Parallel | Sequential |
| Reduces | Variance | Bias (then variance) |
| Tuning | Easy | More sensitive |
| Risk of overfitting | Low | Higher |
| Training speed | Fast | Can be slow |
| Prediction | Moderate | Fast |
## Key Takeaways
- Random Forest = many trees + bootstrap + feature sampling
- Reduces variance through averaging diverse trees
- OOB score gives free validation
- Robust, hard to overfit, easy to tune
- Often an excellent first choice for tabular data