Beginner · Classical Machine Learning

Learn how decision trees work - interpretable models that make predictions by learning simple decision rules from data.

Tags: classification, regression, interpretability, tree-based

Decision Trees

Decision trees are one of the most interpretable machine learning models. They make predictions by learning a series of if-then-else decision rules from the data.

[Figure: Decision Tree Boundaries]

How They Work

A decision tree splits data recursively based on feature values:

                 [Is age > 30?]
                 /            \
              Yes              No
               |                |
        [Income > 50k?]    [Student?]
         /        \         /      \
       Yes        No      Yes      No
        |          |        |        |
    [Buy: Yes] [Buy: No] [Buy: Yes] [Buy: No]

Each internal node tests a feature, each branch represents an outcome, and each leaf node makes a prediction.
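The tree in the diagram above can be written directly as nested if/else rules; a minimal sketch (the feature names and thresholds come from the diagram, and the function name is just for illustration):

```python
def predict_buy(age, income, is_student):
    """Traverse the example tree: each internal test routes the
    sample down one branch until a leaf returns the prediction."""
    if age > 30:
        # Left subtree: test income
        return "Yes" if income > 50_000 else "No"
    else:
        # Right subtree: test student status
        return "Yes" if is_student else "No"

print(predict_buy(age=45, income=60_000, is_student=False))  # Yes
print(predict_buy(age=25, income=10_000, is_student=False))  # No
```

Prediction is just this traversal, which is why inference is fast: the number of tests is bounded by the tree's depth, not by the size of the training set.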

Building a Tree

The Recursive Algorithm

  1. Select the best feature to split on
  2. Create child nodes for each outcome
  3. Recursively repeat for each child
  4. Stop when a stopping criterion is met
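The four steps above can be sketched in plain Python, using Gini impurity as the split criterion. This is a teaching sketch, not a production implementation: it tries every feature/threshold pair exhaustively and stops on purity or a depth limit.

```python
def gini(labels):
    # Gini = 1 - sum(p_i^2) over the class proportions in a node.
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(X, y):
    # Step 1: greedily pick the feature/threshold pair with the
    # lowest size-weighted child impurity.
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [y[i] for i, row in enumerate(X) if row[f] <= t]
            right = [y[i] for i, row in enumerate(X) if row[f] > t]
            if not left or not right:
                continue  # split puts everything on one side
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build_tree(X, y, depth=0, max_depth=3):
    # Step 4: stop when the node is pure, the depth limit is hit,
    # or no valid split exists; the leaf predicts the majority class.
    split = best_split(X, y)
    if gini(y) == 0.0 or depth == max_depth or split is None:
        return max(set(y), key=y.count)
    _, f, t = split
    # Steps 2-3: partition the data and recurse into each child.
    left = [(row, lab) for row, lab in zip(X, y) if row[f] <= t]
    right = [(row, lab) for row, lab in zip(X, y) if row[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [l for _, l in left], depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [l for _, l in right], depth + 1, max_depth),
    }

X = [[1], [2], [8], [9]]
y = [0, 0, 1, 1]
print(build_tree(X, y))  # one split on feature 0 at threshold 2
```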

Choosing the Best Split

The key question: which feature should we split on?

For Classification: Gini Impurity

Gini = 1 - Σ pᵢ²

Measures how "mixed" the classes are in a node: 0 means the node is pure. Lower is better.

For Classification: Entropy

Entropy = -Σ pᵢ × log₂(pᵢ)

Information gain = parent entropy - weighted average child entropy.
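Both formulas above can be computed in a few lines; a small sketch of entropy and information gain (exact function names are my own):

```python
from math import log2

def entropy(labels):
    # Entropy = -sum(p_i * log2(p_i)) over the class proportions.
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(parent, left, right):
    # Parent entropy minus the size-weighted child entropies.
    n = len(parent)
    return (entropy(parent)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

parent = [0, 0, 1, 1]  # maximally mixed: entropy = 1 bit
print(information_gain(parent, [0, 0], [1, 1]))  # 1.0: the split removes all uncertainty
```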

For Regression: Variance Reduction

Choose splits that minimize variance within each group.
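The regression criterion is the direct analogue of information gain, with variance in place of entropy; a minimal sketch (function names are my own):

```python
def variance(values):
    # Mean squared deviation from the mean (population variance).
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_reduction(parent, left, right):
    # Drop in size-weighted variance achieved by the split;
    # the best split maximizes this reduction.
    n = len(parent)
    return (variance(parent)
            - (len(left) / n) * variance(left)
            - (len(right) / n) * variance(right))

targets = [10.0, 12.0, 30.0, 32.0]
print(variance_reduction(targets, [10.0, 12.0], [30.0, 32.0]))  # 100.0
```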

Stopping Criteria

Without stopping rules, trees grow until each leaf is pure (overfitting!).

Common criteria:

  • Maximum depth reached
  • Minimum samples per leaf
  • Minimum information gain
  • Maximum number of leaves

Advantages

  1. Interpretable: Easy to explain and visualize
  2. No feature scaling needed: Works with raw values
  3. Handles mixed types: Numerical and categorical
  4. Fast prediction: Just traverse the tree
  5. Captures non-linear relationships: Through multiple splits

Disadvantages

  1. High variance: Small data changes → different trees
  2. Prone to overfitting: Deep trees memorize data
  3. Greedy: Locally optimal splits, not globally optimal
  4. Axis-aligned: Can only split parallel to feature axes
  5. Unstable: Sensitive to training data

Important Hyperparameters

Parameter          | Effect
-------------------|------------------------------------------------
max_depth          | Deeper = more complex, more overfitting
min_samples_split  | Higher = more regularization
min_samples_leaf   | Higher = smoother predictions
max_features       | Lower = more randomness, less overfitting
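These parameter names match scikit-learn's API; a hedged sketch, assuming scikit-learn is installed, of setting them on a tiny toy dataset:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data: feature value below/above 2.5 determines the class.
X = [[0], [1], [2], [3], [4], [5]]
y = [0, 0, 0, 1, 1, 1]

clf = DecisionTreeClassifier(
    max_depth=3,          # cap tree complexity
    min_samples_split=2,  # minimum samples to split an internal node
    min_samples_leaf=1,   # minimum samples in each leaf
    random_state=0,       # reproducible tie-breaking
)
clf.fit(X, y)
print(clf.predict([[1], [4]]))  # one side of the split each
```

In practice these values are tuned jointly (e.g. with cross-validation) since they all pull on the same bias-variance tradeoff.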

Pruning

Two approaches to prevent overfitting:

Pre-pruning (Early Stopping)

Stop growing before the tree becomes too complex.

  • Set max_depth, min_samples_leaf, etc.

Post-pruning

Grow full tree, then remove branches that don't help.

  • Cost-complexity pruning uses a complexity parameter α
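A hedged sketch of post-pruning in scikit-learn (assumed installed), where cost-complexity pruning is exposed via `ccp_alpha` and the candidate α values come from `cost_complexity_pruning_path`:

```python
from sklearn.tree import DecisionTreeClassifier

# Slightly noisy toy labels so the full tree overfits.
X = [[0], [1], [2], [3], [4], [5], [6], [7]]
y = [0, 1, 0, 1, 1, 1, 1, 1]

full = DecisionTreeClassifier(random_state=0).fit(X, y)
path = full.cost_complexity_pruning_path(X, y)  # candidate alphas

# Refit with a large alpha from the path: bigger alpha = heavier pruning.
pruned = DecisionTreeClassifier(
    random_state=0, ccp_alpha=path.ccp_alphas[-2]
).fit(X, y)

print(full.get_depth(), pruned.get_depth())  # pruned tree is no deeper
```

The usual workflow is to cross-validate over the alphas in the path and keep the value with the best held-out score.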

Feature Importance

Trees provide a natural measure of feature importance:

  • Sum of information gain from all splits using that feature
  • Weighted by number of samples reaching those splits
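scikit-learn exposes this impurity-based measure as `feature_importances_` (normalized to sum to 1); a hedged sketch, assuming scikit-learn is installed:

```python
from sklearn.tree import DecisionTreeClassifier

# Feature 0 separates the classes; feature 1 is pure noise.
X = [[0, 5], [1, 3], [2, 8], [10, 4], [11, 9], [12, 1]]
y = [0, 0, 0, 1, 1, 1]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.feature_importances_)  # weight concentrated on feature 0
```

Note that impurity-based importance can be biased toward high-cardinality features; permutation importance is a common cross-check.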

Trees in Practice

Single decision trees are rarely used alone because of high variance. Instead:

  1. Random Forests: Ensemble of decorrelated trees
  2. Gradient Boosting: Sequential ensemble correcting errors
  3. XGBoost/LightGBM: Optimized gradient boosting

These ensemble methods give up some of a single tree's interpretability but dramatically improve predictive performance; tree-based measures such as feature importance remain available.

Key Takeaways

  1. Trees make predictions through learned decision rules
  2. Splits are chosen to maximize information gain (or minimize impurity)
  3. Easy to interpret but prone to overfitting
  4. Hyperparameters control the bias-variance tradeoff
  5. Ensembles (Random Forest, XGBoost) address the variance problem
