Intermediate · Classical Machine Learning

Master XGBoost - the gradient boosting library that dominates machine learning competitions with its speed, performance, and flexibility.

ensemble · boosting · tree-based · classification · regression

XGBoost

XGBoost (eXtreme Gradient Boosting) is one of the most successful machine learning algorithms, dominating Kaggle competitions and widely used in industry. It's an optimized implementation of gradient boosting.

What is Gradient Boosting?

Gradient boosting builds an ensemble of trees sequentially, where each new tree corrects the errors of the previous trees:

Model = Tree₁ + Tree₂ + Tree₃ + ... + Treeₙ

Tree₁: Fit to original targets
Tree₂: Fit to residuals of Tree₁
Tree₃: Fit to residuals of Tree₁ + Tree₂
...

Each new tree is fit to the negative gradient of the loss function; for squared error, the negative gradient is simply the residual.
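The sequence above can be sketched for squared-error regression, where fitting the negative gradient means fitting residuals. This is a minimal illustration using scikit-learn trees on synthetic data, not XGBoost's actual implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
n_trees = 50
trees = []

# Start from the mean prediction; each tree then fits the current
# residuals, which for squared loss equal the negative gradient.
pred = np.full_like(y, y.mean())
for _ in range(n_trees):
    residuals = y - pred
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)
    pred += learning_rate * tree.predict(X)
    trees.append(tree)

mse = np.mean((y - pred) ** 2)
```

Each iteration shrinks the residuals a little; the learning rate controls how much of each tree's correction is applied.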

XGBoost Innovations

1. Regularized Objective

Objective = Loss + Ω(trees)

Ω = γT + ½λΣwⱼ²

Where:

  • T: Number of leaves (penalizes complexity)
  • wⱼ: Leaf weights (L2 regularization)
  • γ, λ: Regularization parameters
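As a quick numeric check of the penalty, with hypothetical leaf weights for a three-leaf tree:

```python
# Omega = gamma * T + 0.5 * lambda * sum(w_j ** 2)
gamma, lam = 0.1, 1.0
leaf_weights = [0.4, -0.3, 0.2]   # hypothetical leaf values
T = len(leaf_weights)             # number of leaves

omega = gamma * T + 0.5 * lam * sum(w * w for w in leaf_weights)
```

Adding a leaf costs γ, and large leaf weights cost λ, so the objective prefers smaller, smoother trees.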

2. Second-Order Approximation

Uses both gradient (g) and Hessian (h) for better optimization:

Optimal weight = -Σgᵢ / (Σhᵢ + λ)

Faster convergence than first-order methods.
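For squared error, g = pred − y and h = 1, so the weight formula reduces to a regularized (shrunken) mean residual. A tiny worked example with made-up values:

```python
import numpy as np

# Three samples landing in the same leaf
y = np.array([3.0, 5.0, 4.0])
pred = np.zeros(3)      # current ensemble prediction
g = pred - y            # gradients of 0.5 * (pred - y)^2
h = np.ones(3)          # Hessians (constant 1 for squared error)
lam = 1.0               # L2 regularization

# Optimal leaf weight = -sum(g) / (sum(h) + lambda)
w_star = -g.sum() / (h.sum() + lam)
```

Without regularization (λ = 0) the weight would be the plain mean residual, 4.0; λ = 1 shrinks it to 3.0.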

3. Efficient Split Finding

  • Weighted quantile sketch: Handles weighted data efficiently
  • Sparsity-aware: Native handling of missing values
  • Cache-aware: Optimized for CPU cache

4. Parallelization

  • Feature-parallel split finding
  • Tree-level parallelism
  • GPU support

Key Hyperparameters

Tree Structure

| Parameter | Default | Description |
| --- | --- | --- |
| max_depth | 6 | Maximum tree depth |
| min_child_weight | 1 | Minimum sum of instance weights in a child |
| gamma | 0 | Minimum loss reduction required for a split |

Regularization

| Parameter | Default | Description |
| --- | --- | --- |
| lambda (reg_lambda) | 1 | L2 regularization on leaf weights |
| alpha (reg_alpha) | 0 | L1 regularization on leaf weights |
| subsample | 1 | Fraction of samples per tree |
| colsample_bytree | 1 | Fraction of features per tree |

Learning

| Parameter | Default | Description |
| --- | --- | --- |
| learning_rate (eta) | 0.3 | Step size shrinkage |
| n_estimators | 100 | Number of trees |

Tuning Strategy

Step 1: Fix learning rate, find n_estimators

xgb.cv(..., early_stopping_rounds=50)

Step 2: Tune tree parameters

max_depth: [3, 5, 7, 9]
min_child_weight: [1, 3, 5]

Step 3: Tune regularization

gamma: [0, 0.1, 0.2]
subsample: [0.6, 0.8, 1.0]
colsample_bytree: [0.6, 0.8, 1.0]

Step 4: Lower learning rate, increase trees

learning_rate=0.01, n_estimators=1000

Handling Missing Values

XGBoost learns a default direction for missing values at each split:

For each split:
  - Try sending missing to left
  - Try sending missing to right
  - Choose direction with best gain

No imputation needed!

Feature Importance

Weight (Frequency)

Number of times feature is used in splits.

Gain

Average gain from splits using the feature.

Cover

Average number of samples affected by splits.

xgb.plot_importance(model, importance_type='gain')

XGBoost vs LightGBM vs CatBoost

| Aspect | XGBoost | LightGBM | CatBoost |
| --- | --- | --- | --- |
| Speed | Fast | Faster | Moderate |
| Memory | Moderate | Low | Higher |
| Categoricals | Manual encoding | Native | Best native |
| GPU | Yes | Yes | Yes |
| Accuracy | Excellent | Excellent | Excellent |

Common Patterns

Classification

import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=5,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    early_stopping_rounds=10,  # constructor argument in recent XGBoost versions
    random_state=42
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)])

Regression

model = xgb.XGBRegressor(
    objective='reg:squarederror',
    n_estimators=100,
    max_depth=5
)

With Cross-Validation

dtrain = xgb.DMatrix(X_train, label=y_train)
params = {'max_depth': 5, 'eta': 0.1, 'objective': 'binary:logistic'}
cv_results = xgb.cv(params, dtrain, num_boost_round=100, nfold=5, early_stopping_rounds=10)

When to Use XGBoost

Great for:

  • Tabular data
  • Structured/relational data
  • Kaggle competitions
  • When you need feature importance
  • Medium-sized datasets

Consider alternatives:

  • Images/text: Deep learning
  • Very large data: LightGBM may be faster
  • Many categoricals: CatBoost
  • Need interpretability: Single decision tree

Key Takeaways

  1. XGBoost is gradient boosting with regularization and optimizations
  2. Uses second-order gradients for better optimization
  3. Handles missing values natively
  4. Key params: max_depth, learning_rate, n_estimators, subsample
  5. Lower learning rate + more trees = better (with early stopping)
  6. Often the best choice for tabular data competitions

Practice Questions

Test your understanding with these related interview questions: