XGBoost
XGBoost (eXtreme Gradient Boosting) is one of the most successful machine learning algorithms, dominating Kaggle competitions and widely used in industry. It's an optimized implementation of gradient boosting.
What is Gradient Boosting?
Build an ensemble of trees sequentially, where each tree corrects the errors of previous trees:
Model = Tree₁ + Tree₂ + Tree₃ + ... + Treeₙ
Tree₁: Fit to original targets
Tree₂: Fit to residuals of Tree₁
Tree₃: Fit to residuals of Tree₁ + Tree₂
...
Each tree is fit to the negative gradient of the loss function; for squared-error loss, the negative gradient is simply the residual.
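The sequential residual-fitting loop above can be sketched in plain Python, using one-split decision stumps as stand-ins for full trees. This is a toy illustration of the idea, not XGBoost itself; the data and hyperparameters are made up.

```python
# Minimal gradient-boosting sketch for squared-error loss.
# Each "tree" is a decision stump (one threshold split) fit to residuals.
def fit_stump(x, r):
    """Find the threshold on x that best fits residuals r (least squares)."""
    best = None
    for t in sorted(set(x)):
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((ri - lm) ** 2 for ri in left)
               + sum((ri - rm) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def boost(x, y, n_trees=10, lr=0.5):
    pred = [sum(y) / len(y)] * len(y)              # start from the mean
    for _ in range(n_trees):
        r = [yi - pi for yi, pi in zip(y, pred)]   # residuals = negative gradient
        stump = fit_stump(x, r)
        pred = [pi + lr * stump(xi) for xi, pi in zip(x, pred)]
    return pred

# Toy 1-D dataset with two obvious clusters of targets
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 3.1, 3.0, 3.2]
pred = boost(x, y)
mse = sum((pi - yi) ** 2 for pi, yi in zip(pred, y)) / len(y)
```

Each round fits a stump to the current residuals and adds a shrunken copy of it to the ensemble, so the training error shrinks round by round.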
XGBoost Innovations
1. Regularized Objective
Objective = Loss + Ω(trees)
Ω = γT + ½λΣwⱼ²
Where:
- T: Number of leaves (penalizes complexity)
- wⱼ: Leaf weights (L2 regularization)
- γ, λ: Regularization parameters
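To make the penalty concrete, here is a toy evaluation of Ω for a single tree. The values of γ, λ, and the leaf weights are all made up for illustration:

```python
# Toy evaluation of the penalty Ω = γT + ½λΣw² for one tree (made-up numbers).
gamma, lam = 1.0, 1.0
leaf_weights = [0.5, -0.3, 0.8]   # w_j for a hypothetical 3-leaf tree
T = len(leaf_weights)             # number of leaves
omega = gamma * T + 0.5 * lam * sum(w * w for w in leaf_weights)
```

Larger trees (bigger T) and more extreme leaf weights both raise Ω, which is exactly what the objective penalizes.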
2. Second-Order Approximation
Uses both the gradient (gᵢ) and the Hessian (hᵢ) of the loss for each training instance, which yields a closed-form optimal weight for each leaf j:
Optimal weight: w*ⱼ = -Σgᵢ / (Σhᵢ + λ)  (sums over instances in leaf j)
This Newton-style second-order step converges faster than first-order gradient descent.
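For squared-error loss, gᵢ = predᵢ - yᵢ and hᵢ = 1 for every instance, so the formula can be checked by hand. The data and λ below are made up for illustration:

```python
# Toy check of the optimal leaf-weight formula for squared-error loss,
# where g_i = pred_i - y_i and h_i = 1 for every instance in the leaf.
y = [3.0, 5.0, 7.0]
pred = [4.0, 4.0, 4.0]                    # current ensemble prediction
g = [p - yi for p, yi in zip(pred, y)]    # gradients: [1.0, -1.0, -3.0]
h = [1.0] * len(y)                        # Hessians: all 1 for squared error
lam = 1.0
w = -sum(g) / (sum(h) + lam)              # optimal leaf weight
```

With λ = 0 this reduces to the mean residual; the λ in the denominator shrinks the leaf weight toward zero, which is the L2 regularization at work.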
3. Efficient Split Finding
- Weighted quantile sketch: Handles weighted data efficiently
- Sparsity-aware: Native handling of missing values
- Cache-aware: Optimized for CPU cache
4. Parallelization
- Feature-parallel split finding
- Tree-level parallelism
- GPU support
Key Hyperparameters
Tree Structure
| Parameter | Default | Description |
|---|---|---|
| max_depth | 6 | Maximum tree depth |
| min_child_weight | 1 | Minimum sum of instance weight in child |
| gamma | 0 | Minimum loss reduction for split |
Regularization
| Parameter | Default | Description |
|---|---|---|
| lambda (reg_lambda) | 1 | L2 regularization on weights |
| alpha (reg_alpha) | 0 | L1 regularization on weights |
| subsample | 1 | Fraction of samples per tree |
| colsample_bytree | 1 | Fraction of features per tree |
Learning
| Parameter | Default | Description |
|---|---|---|
| learning_rate (eta) | 0.3 | Step size shrinkage |
| n_estimators | 100 | Number of trees |
Tuning Strategy
Step 1: Fix learning rate, find n_estimators
xgb.cv(..., early_stopping_rounds=50)
Step 2: Tune tree parameters
max_depth: [3, 5, 7, 9]
min_child_weight: [1, 3, 5]
Step 3: Tune regularization
gamma: [0, 0.1, 0.2]
subsample: [0.6, 0.8, 1.0]
colsample_bytree: [0.6, 0.8, 1.0]
Step 4: Lower learning rate, increase trees
learning_rate=0.01, n_estimators=1000
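The staged grid search of Step 2 can be sketched as follows. Here `evaluate` is a hypothetical stand-in for a cross-validated score so the grid logic itself is runnable; real code would call xgb.cv or sklearn's GridSearchCV at that point:

```python
# Grid-search skeleton for Step 2 (tree parameters).
# evaluate() is a placeholder objective; in practice it would run xgb.cv
# and return the best validation metric for the given parameters.
from itertools import product

def evaluate(params):
    # Stand-in score: pretend the optimum is max_depth=5, min_child_weight=3.
    return abs(params["max_depth"] - 5) + abs(params["min_child_weight"] - 3)

grid = {"max_depth": [3, 5, 7, 9], "min_child_weight": [1, 3, 5]}
best = min(
    (dict(zip(grid, vals)) for vals in product(*grid.values())),
    key=evaluate,
)
```

The same loop is reused for Step 3 with the regularization grid; only the parameter dictionary changes between stages.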
Handling Missing Values
XGBoost learns optimal direction for missing values:
For each split:
- Try sending missing to left
- Try sending missing to right
- Choose direction with best gain
No imputation needed!
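The default-direction search described above can be sketched in plain Python. The toy data, the squared-error "gain", and the helper names are illustrative, not XGBoost internals:

```python
# For one candidate split, score sending missing values left vs right
# and keep the direction with the better gain (here, SSE reduction).
def sse(vals):
    if not vals:
        return 0.0
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals)

def split_gain(x, y, t, missing_goes_left):
    left, right = [], []
    for xi, yi in zip(x, y):
        if xi is None:                      # missing value
            (left if missing_goes_left else right).append(yi)
        elif xi <= t:
            left.append(yi)
        else:
            right.append(yi)
    return sse(y) - sse(left) - sse(right)

x = [1.0, 2.0, None, 8.0, 9.0]
y = [1.0, 1.1, 1.0, 5.0, 5.2]   # the missing row's target matches the left group
gl = split_gain(x, y, t=5.0, missing_goes_left=True)
gr = split_gain(x, y, t=5.0, missing_goes_left=False)
direction = "left" if gl >= gr else "right"
```

Because the missing row's target looks like the left group, sending it left gives the larger gain, so "left" is learned as the default direction for this split.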
Feature Importance
Weight (Frequency)
Number of times feature is used in splits.
Gain
Average gain from splits using the feature.
Cover
Average number of samples affected by splits.
```python
xgb.plot_importance(model, importance_type='gain')
```
XGBoost vs LightGBM vs CatBoost
| Aspect | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| Speed | Fast | Faster | Moderate |
| Memory | Moderate | Low | Higher |
| Categoricals | Manual encoding | Native | Best native |
| GPU | Yes | Yes | Yes |
| Accuracy | Excellent | Excellent | Excellent |
Common Patterns
Classification
```python
import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=5,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    early_stopping_rounds=10,  # XGBoost >= 2.0: set on the estimator, not fit()
    random_state=42,
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)])
Regression
```python
model = xgb.XGBRegressor(
    objective='reg:squarederror',
    n_estimators=100,
    max_depth=5,
)
```
With Cross-Validation
```python
dtrain = xgb.DMatrix(X_train, label=y_train)
params = {'max_depth': 5, 'eta': 0.1, 'objective': 'binary:logistic'}
cv_results = xgb.cv(params, dtrain, num_boost_round=100, nfold=5,
                    early_stopping_rounds=10)
```
When to Use XGBoost
Great for:
- Tabular data
- Structured/relational data
- Kaggle competitions
- When you need feature importance
- Medium-sized datasets
Consider alternatives:
- Images/text: Deep learning
- Very large data: LightGBM may be faster
- Many categoricals: CatBoost
- Need interpretability: Single decision tree
Key Takeaways
- XGBoost is gradient boosting with regularization and optimizations
- Uses second-order gradients for better optimization
- Handles missing values natively
- Key params: max_depth, learning_rate, n_estimators, subsample
- Lower learning rate + more trees = better (with early stopping)
- Often the best choice for tabular data competitions