Hyperparameter Tuning
Hyperparameters are settings that control the learning process (learning rate, regularization strength, tree depth). Unlike model parameters, which are learned from the data, they are set before training. Finding good hyperparameters is crucial for model performance.
Parameters vs Hyperparameters
| Parameters | Hyperparameters |
|---|---|
| Learned during training | Set before training |
| Weights, biases | Learning rate, regularization |
| Optimized by gradient descent | Optimized by search |
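To make the distinction concrete, a short sketch using scikit-learn's `LogisticRegression` on synthetic data: `C` (inverse regularization strength) is a hyperparameter fixed before training, while `coef_` holds the parameters learned during `fit()`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)

# C is a hyperparameter: chosen before fit(), never changed by training
model = LogisticRegression(C=0.5)
model.fit(X, y)

print(model.C)            # still 0.5 after training
print(model.coef_.shape)  # learned parameters: one weight per feature
```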
Search Strategies
Grid Search
Try all combinations of specified values:
```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'n_estimators': [100, 200, 300]
}

grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X, y)
print(grid_search.best_params_)
print(grid_search.best_score_)
```
Pros:
- Exhaustive, won't miss good combinations
- Simple to implement and parallelize
Cons:
- Exponential cost: 4×3×3 = 36 combinations
- Doesn't scale to many hyperparameters
- Uniform grid may miss optimal values
Random Search
Sample random combinations:
```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint

param_dist = {
    'max_depth': randint(3, 15),
    'min_samples_split': randint(2, 20),
    # uniform(loc, scale) samples from [loc, loc + scale], here [0.01, 0.31]
    'learning_rate': uniform(0.01, 0.3),
    'n_estimators': randint(100, 1000)
}

random_search = RandomizedSearchCV(
    model, param_dist, n_iter=100, cv=5, random_state=42
)
random_search.fit(X, y)
```
Pros:
- More efficient than grid search
- Explores continuous ranges
- Can focus budget on important hyperparameters
Cons:
- May miss optimal by chance
- No learning from previous trials
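The efficiency argument (due to Bergstra and Bengio) is easy to see: with a budget of nine trials, a 3×3 grid tests only three distinct values of each hyperparameter, while nine random samples test nine. If only one hyperparameter really matters, random search explores it three times more finely for the same cost. A toy illustration in plain Python:

```python
import random
random.seed(0)

# Grid: 9 trials, but only 3 distinct values per axis
grid = [(x, y) for x in (0.1, 0.5, 0.9) for y in (0.1, 0.5, 0.9)]

# Random: 9 trials, 9 distinct values per axis
rand = [(random.random(), random.random()) for _ in range(9)]

distinct_grid_x = len({x for x, _ in grid})
distinct_rand_x = len({x for x, _ in rand})
print(distinct_grid_x, distinct_rand_x)  # 3 vs 9
```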
Bayesian Optimization
Build a probabilistic model of the objective function, use it to choose next points:
```python
from skopt import BayesSearchCV
from skopt.space import Real, Integer

search_space = {
    'max_depth': Integer(3, 15),
    'learning_rate': Real(0.01, 0.3, prior='log-uniform'),
    'n_estimators': Integer(100, 1000)
}

bayes_search = BayesSearchCV(model, search_space, n_iter=50, cv=5)
bayes_search.fit(X, y)
```
Pros:
- Learns from previous trials
- More sample-efficient
- Good for expensive evaluations
Cons:
- More complex to implement
- Overhead may not be worth it for cheap evaluations
- Can get stuck in local optima
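A minimal sketch of the loop itself, using scikit-learn's `GaussianProcessRegressor` as the surrogate and an upper-confidence-bound acquisition rule (a simple stand-in for the acquisition functions skopt uses; the 1-D objective and the fixed kernel length scale here are invented for illustration):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def objective(x):                      # pretend this is an expensive CV score
    return -(x - 0.3) ** 2

candidates = np.linspace(0, 1, 101).reshape(-1, 1)
X_obs = [[0.0], [1.0]]                 # two initial evaluations
y_obs = [objective(0.0), objective(1.0)]

for _ in range(10):
    # Surrogate: fit a GP to the evaluations made so far
    gp = GaussianProcessRegressor(kernel=RBF(0.2), optimizer=None)
    gp.fit(X_obs, y_obs)
    mu, sigma = gp.predict(candidates, return_std=True)
    # Acquisition: favor points with high predicted mean or high uncertainty
    ucb = mu + 1.96 * sigma
    x_next = candidates[np.argmax(ucb)][0]
    X_obs.append([x_next])
    y_obs.append(objective(x_next))

best_x = X_obs[int(np.argmax(y_obs))][0]
print(best_x)  # should land near the optimum at 0.3
```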
Successive Halving / Hyperband
Start with many configurations, progressively eliminate poor performers:
```python
from sklearn.experimental import enable_halving_search_cv  # noqa: enables the import below
from sklearn.model_selection import HalvingRandomSearchCV

halving_search = HalvingRandomSearchCV(
    model, param_dist, n_candidates=100, factor=3, cv=5
)
halving_search.fit(X, y)
```
How it works (illustrative schedule with 81 starting configurations and factor=3):
- Round 1: 81 configs × 1 epoch each
- Round 2: 27 configs × 3 epochs each (keep top 1/3)
- Round 3: 9 configs × 9 epochs each (keep top 1/3)
- Round 4: 3 configs × 27 epochs each (keep top 1/3)
- Round 5: 1 config × 81 epochs (final)
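The appeal is the budget arithmetic: training all 81 configurations for the full 81 epochs would cost 81 × 81 = 6561 epoch-units, while this schedule spends 81 per round for a total of 405. A quick check of that accounting in plain Python:

```python
factor = 3
n_configs, epochs = 81, 1
total = 0

# Each round trains the survivors for factor-times more epochs
while n_configs >= 1:
    print(f"{n_configs} configs x {epochs} epochs = {n_configs * epochs}")
    total += n_configs * epochs
    n_configs //= factor
    epochs *= factor

print(total)  # 405 epoch-units, vs 81 * 81 = 6561 for full training of all configs
```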
Pros:
- Very efficient for deep learning
- Early stopping of bad configs
Optuna
Modern framework combining Bayesian optimization with pruning:
```python
import optuna
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 15),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000)
    }
    model = XGBClassifier(**params)
    return cross_val_score(model, X, y, cv=5).mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
print(study.best_params)
```
Features:
- Pruning (early stopping of bad trials)
- Visualization tools
- Parallelization
- Multiple samplers (TPE, CMA-ES, Grid, Random)
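Pruning is what distinguishes this style of search from plain Bayesian optimization: a trial reports intermediate scores (e.g. per epoch) and is stopped early if it underperforms earlier trials at the same step. A toy version of the median-pruning idea in plain Python (Optuna's real `MedianPruner` is more careful; the learning curves here are invented):

```python
# Invented learning curves: score per epoch for five candidate configs
curves = {
    'a': [0.60, 0.70, 0.80, 0.85],
    'b': [0.40, 0.45, 0.50, 0.55],
    'c': [0.65, 0.75, 0.82, 0.88],
    'd': [0.30, 0.35, 0.40, 0.42],
    'e': [0.66, 0.72, 0.81, 0.86],
}

def median(xs):
    s = sorted(xs)
    return s[(len(s) - 1) // 2]  # lower median

history = {}   # step -> scores that surviving trials achieved at that step
pruned = []
for name, curve in curves.items():
    for step, score in enumerate(curve):
        # Stop the trial if it is below the median of earlier trials here
        if step in history and score < median(history[step]):
            pruned.append(name)
            break
        history.setdefault(step, []).append(score)

print(pruned)  # weak configs stopped early instead of running to completion
```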
Important Hyperparameters by Model
Random Forest
| Hyperparameter | Typical Range | Impact |
|---|---|---|
| n_estimators | 100-1000 | More = better, diminishing returns |
| max_depth | 5-30 or None | Controls overfitting |
| min_samples_split | 2-20 | Higher = regularization |
| max_features | 'sqrt', 'log2', 0.3-0.8 | Lower = more regularization |
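One way to turn the ranges above into a scikit-learn search space (the distribution choices are one reasonable reading of the table, not canonical):

```python
from scipy.stats import randint

rf_param_dist = {
    'n_estimators': randint(100, 1001),         # 100-1000
    'max_depth': [5, 10, 20, 30, None],         # None = grow until leaves are pure
    'min_samples_split': randint(2, 21),        # 2-20
    'max_features': ['sqrt', 'log2', 0.3, 0.5, 0.8],
}

# RandomizedSearchCV accepts distributions and lists interchangeably
sample = rf_param_dist['n_estimators'].rvs(random_state=0)
print(sample)  # a value in [100, 1000]
```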
XGBoost
| Hyperparameter | Typical Range | Impact |
|---|---|---|
| learning_rate | 0.01-0.3 | Lower = more trees needed |
| max_depth | 3-10 | Deeper = more complex |
| n_estimators | 100-1000 | With early stopping |
| subsample | 0.5-1.0 | Regularization |
| colsample_bytree | 0.5-1.0 | Regularization |
| reg_lambda | 0-10 | L2 regularization |
Neural Networks
| Hyperparameter | Typical Range | Impact |
|---|---|---|
| learning_rate | 1e-5 - 1e-2 | Critical! |
| batch_size | 16-512 | Affects convergence |
| hidden_layers | 1-5 | Depth |
| hidden_units | 32-1024 | Width |
| dropout | 0.1-0.5 | Regularization |
| weight_decay | 1e-5 - 1e-2 | L2 regularization |
Best Practices
1. Start Simple
```python
# Quick random search to find a good region
RandomizedSearchCV(model, param_dist, n_iter=20, cv=3)

# Then refine with a narrower distribution and more iterations
RandomizedSearchCV(model, narrower_dist, n_iter=50, cv=5)
```
2. Use Log Scale for Learning Rates
```python
# Wrong: linear spacing wastes most trials at the high end of the range
'learning_rate': [0.01, 0.02, 0.03, ...]

# Right: roughly constant ratio between consecutive values
'learning_rate': [0.001, 0.003, 0.01, 0.03, 0.1]

# Or sample log-uniformly
'learning_rate': Real(1e-4, 1e-1, prior='log-uniform')
```
3. Early Stopping for Iterative Models
```python
model = XGBClassifier(
    n_estimators=1000,
    early_stopping_rounds=50
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)])
```
4. Separate Validation Set
```python
# Hold out a test set first; the search must never see it
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Cross-validation inside the search provides the validation splits
grid_search.fit(X_train, y_train)  # Not all of X!

# Final evaluation on the held-out test set
best_model = grid_search.best_estimator_
score = best_model.score(X_test, y_test)
```
5. Track Everything
```python
import mlflow

with mlflow.start_run():
    mlflow.log_params(params)
    model.fit(X_train, y_train)
    mlflow.log_metric('accuracy', accuracy)
    mlflow.sklearn.log_model(model, 'model')
```
Key Takeaways
- Random search often beats grid search (more efficient)
- Bayesian optimization is best for expensive evaluations
- Use log scale for learning rates and regularization
- Early stopping reduces search space for iterative models
- Optuna is a great modern choice
- Always validate on held-out data, not training data
- Track experiments systematically