Beginner · Classical Machine Learning

Understand Random Forests - powerful ensemble models that combine many decision trees to achieve better predictions and reduce overfitting.

Tags: ensemble, classification, regression, tree-based, bagging

Random Forests

Random Forest is an ensemble learning method that builds multiple decision trees and combines their predictions. It's one of the most successful and widely used algorithms in machine learning.

The Core Idea

The wisdom of crowds: aggregate predictions from many diverse trees to get a better answer than any single tree.

    Tree 1      Tree 2      Tree 3    ...    Tree N
       ↓           ↓           ↓               ↓
    Pred 1      Pred 2      Pred 3          Pred N
       \          |           |              /
        \         |           |             /
         ↘        ↓           ↓           ↙
              [Aggregate Predictions]
                      ↓
              Final Prediction
  • Classification: Majority vote
  • Regression: Average of predictions
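The aggregation step above can be sketched directly in NumPy (the per-tree predictions here are made up for illustration):

```python
import numpy as np

# Hypothetical predictions from 5 trees on 3 samples (binary classification)
tree_preds = np.array([
    [0, 1, 1],
    [0, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [0, 1, 1],
])

# Classification: majority vote down each column (one column per sample)
votes = np.apply_along_axis(np.bincount, 0, tree_preds, minlength=2)
final_class = votes.argmax(axis=0)      # [0, 1, 1]

# Regression: average the trees' numeric predictions per sample
tree_values = np.array([[2.0, 3.0], [2.5, 3.5], [1.5, 4.0]])
final_value = tree_values.mean(axis=0)  # [2.0, 3.5]
```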

Why It Works: The Two Randomizations

1. Bootstrap Sampling (Bagging)

Each tree is trained on a different random sample:

  • Sample n points with replacement from training data
  • About 63% unique points per tree (rest are duplicates)
  • Creates diversity among trees
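The ~63% figure comes from 1 − (1 − 1/n)ⁿ → 1 − 1/e ≈ 0.632; a quick simulation (with an arbitrary n) confirms it:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # arbitrary training-set size for the simulation

# One bootstrap sample: n draws with replacement
idx = rng.integers(0, n, size=n)
unique_frac = np.unique(idx).size / n

# Theoretical limit: 1 - 1/e ~ 0.632
print(round(unique_frac, 3))
```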

2. Feature Randomization

At each split, only consider a random subset of features:

  • Typically √p features for classification
  • Typically p/3 features for regression
  • Decorrelates trees even more
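A minimal sketch of the per-split feature draw, assuming p = 16 features (the numbers are illustrative):

```python
import numpy as np

p = 16                          # total number of features (assumed)
rng = np.random.default_rng(42)

# Classification rule of thumb: consider sqrt(p) features at each split
k = int(np.sqrt(p))             # 4 of the 16 features
candidates = rng.choice(p, size=k, replace=False)
# The best split is then chosen only among these k candidate features
```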

The Variance Reduction

If trees were independent with variance σ², averaging N trees gives variance σ²/N.

But real trees are correlated, since they are trained on overlapping samples of the same data. The actual variance of the average is:

Var(average) = ρσ² + (1-ρ)σ²/N

Where ρ is the correlation between trees. Feature randomization reduces ρ!
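Plugging illustrative numbers into this formula shows the correlation floor: adding trees drives the second term to zero, but the first term ρσ² remains.

```python
def ensemble_variance(sigma2, rho, n_trees):
    """Variance of the average of n_trees correlated trees:
    rho*sigma^2 + (1 - rho)*sigma^2 / n_trees."""
    return rho * sigma2 + (1 - rho) * sigma2 / n_trees

# With sigma^2 = 1 and 100 trees:
print(ensemble_variance(1.0, 0.0, 100))  # independent trees: 0.01
print(ensemble_variance(1.0, 0.5, 100))  # correlated trees: stuck near rho
```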

Out-of-Bag (OOB) Evaluation

A clever trick: each tree doesn't see ~37% of the data. Use this for free validation:

  1. For each training point, find trees that didn't train on it
  2. Get their predictions
  3. OOB score ≈ test set performance

No need for separate validation set!
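In scikit-learn this is a single flag (the dataset below is synthetic, for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# oob_score=True scores each point using only the trees that never saw it
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
print(rf.oob_score_)  # usually close to held-out test accuracy
```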

Key Hyperparameters

  Parameter          Typical Values    Effect
  n_estimators       100–1000          More trees = better (diminishing returns)
  max_depth          None or deep      Deep trees for low bias
  min_samples_leaf   1–5               Higher = more regularization
  max_features       sqrt(p) or p/3    Lower = less correlation, more bias
  bootstrap          True              False trains each tree on the full data (less diversity)
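One way these settings look in scikit-learn; the specific values below are a reasonable starting point, not the only sensible choice:

```python
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=500,     # more trees: better, with diminishing returns
    max_depth=None,       # grow trees deep for low bias
    min_samples_leaf=2,   # slightly more regularization than the default 1
    max_features=1 / 3,   # fraction of features tried per split (regression rule of thumb)
    bootstrap=True,       # train each tree on a bootstrap sample
    n_jobs=-1,            # trees are independent, so build them in parallel
    random_state=0,
)
```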

Advantages

  1. Excellent performance: Often among top methods
  2. Hard to overfit: Adding trees doesn't hurt
  3. Handles high dimensions: Feature sampling helps
  4. No feature scaling needed: Tree-based
  5. Gives feature importance: Based on splits
  6. Built-in validation: OOB score
  7. Parallelizable: Trees are independent

Disadvantages

  1. Less interpretable than single tree: Many trees to examine
  2. Slow for very large data: Many trees to build/predict
  3. Memory intensive: Stores all trees
  4. Doesn't extrapolate: Can't predict beyond training range
  5. Biased toward high-cardinality features: More split points
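The extrapolation limitation (point 4) is easy to see on a synthetic 1-D regression with y = 2x, trained only on x ≤ 10:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2 * X.ravel()                      # true relationship: y = 2x

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Trees predict leaf averages, so beyond the training range the
# prediction flattens near the largest training target (~20), not 40
print(rf.predict([[20.0]]))
```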

Feature Importance

Two common methods:

Mean Decrease in Impurity (MDI)

  • Sum of impurity decreases from splits using each feature
  • Fast but biased toward high-cardinality features

Permutation Importance

  • Shuffle feature values, measure accuracy drop
  • More reliable but slower
  • Works on validation data
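Both methods are available in scikit-learn (synthetic data below; the exact importance values will vary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=8, n_informative=3,
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# MDI: fast, derived from the training-time splits (sums to 1)
mdi = rf.feature_importances_

# Permutation: shuffle each feature on validation data, measure score drop
perm = permutation_importance(rf, X_val, y_val, n_repeats=10, random_state=0)
print(perm.importances_mean)
```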

When to Use Random Forests

Good for:

  • Tabular data
  • When interpretability matters somewhat
  • Quick baseline that often works well
  • When you want feature importance

Consider alternatives:

  • Very large datasets: Gradient boosting may be faster
  • Structured data (images, text): Neural networks
  • Need maximum performance: XGBoost/LightGBM often better

Random Forest vs. Gradient Boosting

  Aspect               Random Forest    Gradient Boosting
  Tree building        Parallel         Sequential
  Reduces              Variance         Bias (then variance)
  Tuning               Easy             More sensitive
  Risk of overfitting  Low              Higher
  Training speed       Fast             Can be slow
  Prediction           Moderate         Fast

Key Takeaways

  1. Random Forest = many trees + bootstrap + feature sampling
  2. Reduces variance through averaging diverse trees
  3. OOB score gives free validation
  4. Robust, hard to overfit, easy to tune
  5. Often an excellent first choice for tabular data
