Hyperparameter Tuning
Hyperparameters are settings that control the learning process (learning rate, regularization strength, tree depth). Unlike model parameters, which are learned from the data, they are set before training. Finding good hyperparameters is crucial for model performance.
Parameters vs Hyperparameters
| Parameters | Hyperparameters |
|---|---|
| Learned during training | Set before training |
| Weights, biases | Learning rate, regularization |
| Optimized by gradient descent | Optimized by search |
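To make the distinction concrete, a short sketch using scikit-learn's `LogisticRegression` on synthetic data: `C` (inverse regularization strength) is a hyperparameter fixed before training, while `coef_` holds the parameters learned during `fit()`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)

# C is a hyperparameter: chosen before fit(), never changed by training
model = LogisticRegression(C=0.5)
model.fit(X, y)

print(model.C)            # still 0.5 after training
print(model.coef_.shape)  # learned parameters: one weight per feature
```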
Search Strategies
Grid Search
Try all combinations of specified values:
```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'n_estimators': [100, 200, 300]
}

grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X, y)
print(grid_search.best_params_)
print(grid_search.best_score_)
```
Pros:
- Exhaustive, won't miss good combinations
- Simple to implement and parallelize
Cons:
- Exponential cost: 4×3×3 = 36 combinations
- Doesn't scale to many hyperparameters
- Uniform grid may miss optimal values
Random Search
Sample random combinations:
```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint

param_dist = {
    'max_depth': randint(3, 15),
    'min_samples_split': randint(2, 20),
    # uniform(loc, scale) samples from [loc, loc + scale], here [0.01, 0.31]
    'learning_rate': uniform(0.01, 0.3),
    'n_estimators': randint(100, 1000)
}

random_search = RandomizedSearchCV(
    model, param_dist, n_iter=100, cv=5, random_state=42
)
random_search.fit(X, y)
```
Pros:
- More efficient than grid search
- Explores continuous ranges
- Can focus budget on important hyperparameters
Cons:
- May miss optimal by chance
- No learning from previous trials
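The efficiency argument (due to Bergstra and Bengio) is easy to see: with a budget of nine trials, a 3×3 grid tests only three distinct values of each hyperparameter, while nine random samples test nine. If only one hyperparameter really matters, random search explores it three times more finely for the same cost. A toy illustration in plain Python:

```python
import random
random.seed(0)

# Grid: 9 trials, but only 3 distinct values per axis
grid = [(x, y) for x in (0.1, 0.5, 0.9) for y in (0.1, 0.5, 0.9)]

# Random: 9 trials, 9 distinct values per axis
rand = [(random.random(), random.random()) for _ in range(9)]

distinct_grid_x = len({x for x, _ in grid})
distinct_rand_x = len({x for x, _ in rand})
print(distinct_grid_x, distinct_rand_x)  # 3 vs 9
```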
Bayesian Optimization
Build a probabilistic model of the objective function, use it to choose next points:
```python
from skopt import BayesSearchCV
from skopt.space import Real, Integer

search_space = {
    'max_depth': Integer(3, 15),
    'learning_rate': Real(0.01, 0.3, prior='log-uniform'),
    'n_estimators': Integer(100, 1000)
}

bayes_search = BayesSearchCV(model, search_space, n_iter=50, cv=5)
bayes_search.fit(X, y)
```
Pros:
- Learns from previous trials
- More sample-efficient
- Good for expensive evaluations
Cons:
- More complex to implement
- Overhead may not be worth it for cheap evaluations
- Can get stuck in local optima
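A minimal sketch of the loop itself, using scikit-learn's `GaussianProcessRegressor` as the surrogate and an upper-confidence-bound acquisition rule (a simple stand-in for the acquisition functions skopt uses; the 1-D objective and the fixed kernel length scale here are invented for illustration):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def objective(x):                      # pretend this is an expensive CV score
    return -(x - 0.3) ** 2

candidates = np.linspace(0, 1, 101).reshape(-1, 1)
X_obs = [[0.0], [1.0]]                 # two initial evaluations
y_obs = [objective(0.0), objective(1.0)]

for _ in range(10):
    # Surrogate: fit a GP to the evaluations made so far
    gp = GaussianProcessRegressor(kernel=RBF(0.2), optimizer=None)
    gp.fit(X_obs, y_obs)
    mu, sigma = gp.predict(candidates, return_std=True)
    # Acquisition: favor points with high predicted mean or high uncertainty
    ucb = mu + 1.96 * sigma
    x_next = candidates[np.argmax(ucb)][0]
    X_obs.append([x_next])
    y_obs.append(objective(x_next))

best_x = X_obs[int(np.argmax(y_obs))][0]
print(best_x)  # should land near the optimum at 0.3
```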
Successive Halving / Hyperband
Start with many configurations, progressively eliminate poor performers:
```python
from sklearn.experimental import enable_halving_search_cv  # noqa: enables the import below
from sklearn.model_selection import HalvingRandomSearchCV

halving_search = HalvingRandomSearchCV(
    model, param_dist, n_candidates=100, factor=3, cv=5
)
halving_search.fit(X, y)
```
How it works (illustrative schedule with 81 starting configurations and factor=3):
- Round 1: 81 configs × 1 epoch each
- Round 2: 27 configs × 3 epochs each (keep top 1/3)
- Round 3: 9 configs × 9 epochs each (keep top 1/3)
- Round 4: 3 configs × 27 epochs each (keep top 1/3)
- Round 5: 1 config × 81 epochs (final)
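The appeal is the budget arithmetic: training all 81 configurations for the full 81 epochs would cost 81 × 81 = 6561 epoch-units, while this schedule spends 81 per round for a total of 405. A quick check of that accounting in plain Python:

```python
factor = 3
n_configs, epochs = 81, 1
total = 0

# Each round trains the survivors for factor-times more epochs
while n_configs >= 1:
    print(f"{n_configs} configs x {epochs} epochs = {n_configs * epochs}")
    total += n_configs * epochs
    n_configs //= factor
    epochs *= factor

print(total)  # 405 epoch-units, vs 81 * 81 = 6561 for full training of all configs
```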
Pros:
- Very efficient for deep learning
- Early stopping of bad configs
Optuna
Modern framework combining Bayesian optimization with pruning:
```python
import optuna
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 15),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000)
    }
    model = XGBClassifier(**params)
    return cross_val_score(model, X, y, cv=5).mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
print(study.best_params)
```
Features:
- Pruning (early stopping of bad trials)
- Visualization tools
- Parallelization
- Multiple samplers (TPE, CMA-ES, Grid, Random)
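Pruning is what distinguishes this style of search from plain Bayesian optimization: a trial reports intermediate scores (e.g. per epoch) and is stopped early if it underperforms earlier trials at the same step. A toy version of the median-pruning idea in plain Python (Optuna's real `MedianPruner` is more careful; the learning curves here are invented):

```python
# Invented learning curves: score per epoch for five candidate configs
curves = {
    'a': [0.60, 0.70, 0.80, 0.85],
    'b': [0.40, 0.45, 0.50, 0.55],
    'c': [0.65, 0.75, 0.82, 0.88],
    'd': [0.30, 0.35, 0.40, 0.42],
    'e': [0.66, 0.72, 0.81, 0.86],
}

def median(xs):
    s = sorted(xs)
    return s[(len(s) - 1) // 2]  # lower median

history = {}   # step -> scores that surviving trials achieved at that step
pruned = []
for name, curve in curves.items():
    for step, score in enumerate(curve):
        # Stop the trial if it is below the median of earlier trials here
        if step in history and score < median(history[step]):
            pruned.append(name)
            break
        history.setdefault(step, []).append(score)

print(pruned)  # weak configs stopped early instead of running to completion
```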
Important Hyperparameters by Model
Random Forest
| Hyperparameter | Typical Range | Impact |
|---|---|---|
| n_estimators | 100-1000 | More = better, diminishing returns |
| max_depth | 5-30 or None | Controls overfitting |
| min_samples_split | 2-20 | Higher = regularization |
| max_features | 'sqrt', 'log2', 0.3-0.8 | Lower = more regularization |
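One way to turn the ranges above into a scikit-learn search space (the distribution choices are one reasonable reading of the table, not canonical):

```python
from scipy.stats import randint

rf_param_dist = {
    'n_estimators': randint(100, 1001),         # 100-1000
    'max_depth': [5, 10, 20, 30, None],         # None = grow until leaves are pure
    'min_samples_split': randint(2, 21),        # 2-20
    'max_features': ['sqrt', 'log2', 0.3, 0.5, 0.8],
}

# RandomizedSearchCV accepts distributions and lists interchangeably
sample = rf_param_dist['n_estimators'].rvs(random_state=0)
print(sample)  # a value in [100, 1000]
```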
XGBoost
| Hyperparameter | Typical Range | Impact |
|---|---|---|
| learning_rate | 0.01-0.3 | Lower = more trees needed |
| max_depth | 3-10 | Deeper = more complex |
| n_estimators | 100-1000 | With early stopping |
| subsample | 0.5-1.0 | Regularization |
| colsample_bytree | 0.5-1.0 | Regularization |
| reg_lambda | 0-10 | L2 regularization |
Neural Networks
| Hyperparameter | Typical Range | Impact |
|---|---|---|
| learning_rate | 1e-5 - 1e-2 | Critical! |
| batch_size | 16-512 | Affects convergence |
| hidden_layers | 1-5 | Depth |
| hidden_units | 32-1024 | Width |
| dropout | 0.1-0.5 | Regularization |
| weight_decay | 1e-5 - 1e-2 | L2 regularization |
Best Practices
1. Start Simple
```python
# Quick random search to find a good region
RandomizedSearchCV(model, param_dist, n_iter=20, cv=3)

# Then refine with a narrower distribution and more iterations
RandomizedSearchCV(model, narrower_dist, n_iter=50, cv=5)
```
2. Use Log Scale for Learning Rates
```python
# Wrong: linear spacing wastes most trials at the high end of the range
'learning_rate': [0.01, 0.02, 0.03, ...]

# Right: roughly constant ratio between consecutive values
'learning_rate': [0.001, 0.003, 0.01, 0.03, 0.1]

# Or sample log-uniformly
'learning_rate': Real(1e-4, 1e-1, prior='log-uniform')
```
3. Early Stopping for Iterative Models
```python
model = XGBClassifier(
    n_estimators=1000,
    early_stopping_rounds=50
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)])
```
4. Separate Validation Set
```python
# Hold out a test set first; the search must never see it
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Cross-validation inside the search provides the validation splits
grid_search.fit(X_train, y_train)  # Not all of X!

# Final evaluation on the held-out test set
best_model = grid_search.best_estimator_
score = best_model.score(X_test, y_test)
```
5. Track Everything
```python
import mlflow

with mlflow.start_run():
    mlflow.log_params(params)
    model.fit(X_train, y_train)
    mlflow.log_metric('accuracy', accuracy)
    mlflow.sklearn.log_model(model, 'model')
```
Key Takeaways
- Random search often beats grid search (more efficient)
- Bayesian optimization is best for expensive evaluations
- Use log scale for learning rates and regularization
- Early stopping reduces search space for iterative models
- Optuna is a great modern choice
- Always validate on held-out data, not training data
- Track experiments systematically