XGBoost
XGBoost (eXtreme Gradient Boosting) is one of the most successful machine learning algorithms, dominating Kaggle competitions and widely used in industry. It's an optimized implementation of gradient boosting.
What is Gradient Boosting?
Build an ensemble of trees sequentially, where each tree corrects the errors of previous trees:
Model = Tree₁ + Tree₂ + Tree₃ + ... + Treeₙ
Tree₁: Fit to original targets
Tree₂: Fit to residuals of Tree₁
Tree₃: Fit to residuals of Tree₁ + Tree₂
...
Each tree is fit to the negative gradient of the loss function; for squared-error loss, the negative gradient is simply the residual.
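The sequential residual-fitting loop above can be sketched in plain Python, using one-split decision stumps as stand-ins for full trees. This is a toy illustration of the idea, not XGBoost itself; the data and hyperparameters are made up.

```python
# Minimal gradient-boosting sketch for squared-error loss.
# Each "tree" is a decision stump (one threshold split) fit to residuals.
def fit_stump(x, r):
    """Find the threshold on x that best fits residuals r (least squares)."""
    best = None
    for t in sorted(set(x)):
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((ri - lm) ** 2 for ri in left)
               + sum((ri - rm) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def boost(x, y, n_trees=10, lr=0.5):
    pred = [sum(y) / len(y)] * len(y)              # start from the mean
    for _ in range(n_trees):
        r = [yi - pi for yi, pi in zip(y, pred)]   # residuals = negative gradient
        stump = fit_stump(x, r)
        pred = [pi + lr * stump(xi) for xi, pi in zip(x, pred)]
    return pred

# Toy 1-D dataset with two obvious clusters of targets
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 3.1, 3.0, 3.2]
pred = boost(x, y)
mse = sum((pi - yi) ** 2 for pi, yi in zip(pred, y)) / len(y)
```

Each round fits a stump to the current residuals and adds a shrunken copy of it to the ensemble, so the training error shrinks round by round.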
XGBoost Innovations
1. Regularized Objective
Objective = Loss + Ω(trees)
Ω = γT + ½λΣwⱼ²
Where:
- T: Number of leaves (penalizes complexity)
- wⱼ: Leaf weights (L2 regularization)
- γ, λ: Regularization parameters
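To make the penalty concrete, here is a toy evaluation of Ω for a single tree. The values of γ, λ, and the leaf weights are all made up for illustration:

```python
# Toy evaluation of the penalty Ω = γT + ½λΣw² for one tree (made-up numbers).
gamma, lam = 1.0, 1.0
leaf_weights = [0.5, -0.3, 0.8]   # w_j for a hypothetical 3-leaf tree
T = len(leaf_weights)             # number of leaves
omega = gamma * T + 0.5 * lam * sum(w * w for w in leaf_weights)
```

Larger trees (bigger T) and more extreme leaf weights both raise Ω, which is exactly what the objective penalizes.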
2. Second-Order Approximation
Uses both the gradient (gᵢ) and the Hessian (hᵢ) of the loss for each training instance, which yields a closed-form optimal weight for each leaf j:
Optimal weight: w*ⱼ = -Σgᵢ / (Σhᵢ + λ)  (sums over instances in leaf j)
This Newton-style second-order step converges faster than first-order gradient descent.
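For squared-error loss, gᵢ = predᵢ - yᵢ and hᵢ = 1 for every instance, so the formula can be checked by hand. The data and λ below are made up for illustration:

```python
# Toy check of the optimal leaf-weight formula for squared-error loss,
# where g_i = pred_i - y_i and h_i = 1 for every instance in the leaf.
y = [3.0, 5.0, 7.0]
pred = [4.0, 4.0, 4.0]                    # current ensemble prediction
g = [p - yi for p, yi in zip(pred, y)]    # gradients: [1.0, -1.0, -3.0]
h = [1.0] * len(y)                        # Hessians: all 1 for squared error
lam = 1.0
w = -sum(g) / (sum(h) + lam)              # optimal leaf weight
```

With λ = 0 this reduces to the mean residual; the λ in the denominator shrinks the leaf weight toward zero, which is the L2 regularization at work.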
3. Efficient Split Finding
- Weighted quantile sketch: Handles weighted data efficiently
- Sparsity-aware: Native handling of missing values
- Cache-aware: Optimized for CPU cache
4. Parallelization
- Feature-parallel split finding
- Tree-level parallelism
- GPU support
Key Hyperparameters
Tree Structure
| Parameter | Default | Description |
|---|---|---|
| max_depth | 6 | Maximum tree depth |
| min_child_weight | 1 | Minimum sum of instance weight in child |
| gamma | 0 | Minimum loss reduction for split |
Regularization
| Parameter | Default | Description |
|---|---|---|
| lambda (reg_lambda) | 1 | L2 regularization on weights |
| alpha (reg_alpha) | 0 | L1 regularization on weights |
| subsample | 1 | Fraction of samples per tree |
| colsample_bytree | 1 | Fraction of features per tree |
Learning
| Parameter | Default | Description |
|---|---|---|
| learning_rate (eta) | 0.3 | Step size shrinkage |
| n_estimators | 100 | Number of trees |
Tuning Strategy
Step 1: Fix learning rate, find n_estimators
xgb.cv(..., early_stopping_rounds=50)
Step 2: Tune tree parameters
max_depth: [3, 5, 7, 9]
min_child_weight: [1, 3, 5]
Step 3: Tune regularization
gamma: [0, 0.1, 0.2]
subsample: [0.6, 0.8, 1.0]
colsample_bytree: [0.6, 0.8, 1.0]
Step 4: Lower learning rate, increase trees
learning_rate=0.01, n_estimators=1000
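The staged grid search of Step 2 can be sketched as follows. Here `evaluate` is a hypothetical stand-in for a cross-validated score so the grid logic itself is runnable; real code would call xgb.cv or sklearn's GridSearchCV at that point:

```python
# Grid-search skeleton for Step 2 (tree parameters).
# evaluate() is a placeholder objective; in practice it would run xgb.cv
# and return the best validation metric for the given parameters.
from itertools import product

def evaluate(params):
    # Stand-in score: pretend the optimum is max_depth=5, min_child_weight=3.
    return abs(params["max_depth"] - 5) + abs(params["min_child_weight"] - 3)

grid = {"max_depth": [3, 5, 7, 9], "min_child_weight": [1, 3, 5]}
best = min(
    (dict(zip(grid, vals)) for vals in product(*grid.values())),
    key=evaluate,
)
```

The same loop is reused for Step 3 with the regularization grid; only the parameter dictionary changes between stages.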
Handling Missing Values
XGBoost learns optimal direction for missing values:
For each split:
- Try sending missing to left
- Try sending missing to right
- Choose direction with best gain
No imputation needed!
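The default-direction search described above can be sketched in plain Python. The toy data, the squared-error "gain", and the helper names are illustrative, not XGBoost internals:

```python
# For one candidate split, score sending missing values left vs right
# and keep the direction with the better gain (here, SSE reduction).
def sse(vals):
    if not vals:
        return 0.0
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals)

def split_gain(x, y, t, missing_goes_left):
    left, right = [], []
    for xi, yi in zip(x, y):
        if xi is None:                      # missing value
            (left if missing_goes_left else right).append(yi)
        elif xi <= t:
            left.append(yi)
        else:
            right.append(yi)
    return sse(y) - sse(left) - sse(right)

x = [1.0, 2.0, None, 8.0, 9.0]
y = [1.0, 1.1, 1.0, 5.0, 5.2]   # the missing row's target matches the left group
gl = split_gain(x, y, t=5.0, missing_goes_left=True)
gr = split_gain(x, y, t=5.0, missing_goes_left=False)
direction = "left" if gl >= gr else "right"
```

Because the missing row's target looks like the left group, sending it left gives the larger gain, so "left" is learned as the default direction for this split.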
Feature Importance
Weight (Frequency)
Number of times feature is used in splits.
Gain
Average gain from splits using the feature.
Cover
Average number of samples affected by splits.
```python
xgb.plot_importance(model, importance_type='gain')
```
XGBoost vs LightGBM vs CatBoost
| Aspect | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| Speed | Fast | Faster | Moderate |
| Memory | Moderate | Low | Higher |
| Categoricals | Manual encoding | Native | Best native |
| GPU | Yes | Yes | Yes |
| Accuracy | Excellent | Excellent | Excellent |
Common Patterns
Classification
```python
import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=5,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    early_stopping_rounds=10,  # XGBoost >= 2.0: set on the estimator, not fit()
    random_state=42,
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)])
Regression
```python
model = xgb.XGBRegressor(
    objective='reg:squarederror',
    n_estimators=100,
    max_depth=5,
)
```
With Cross-Validation
```python
dtrain = xgb.DMatrix(X_train, label=y_train)
params = {'max_depth': 5, 'eta': 0.1, 'objective': 'binary:logistic'}
cv_results = xgb.cv(params, dtrain, num_boost_round=100, nfold=5,
                    early_stopping_rounds=10)
```
When to Use XGBoost
Great for:
- Tabular data
- Structured/relational data
- Kaggle competitions
- When you need feature importance
- Medium-sized datasets
Consider alternatives:
- Images/text: Deep learning
- Very large data: LightGBM may be faster
- Many categoricals: CatBoost
- Need interpretability: Single decision tree
Key Takeaways
- XGBoost is gradient boosting with regularization and optimizations
- Uses second-order gradients for better optimization
- Handles missing values natively
- Key params: max_depth, learning_rate, n_estimators, subsample
- Lower learning rate + more trees = better (with early stopping)
- Often the best choice for tabular data competitions