Classical Machine Learning (Intermediate)

Learn hyperparameter tuning strategies - from grid search to Bayesian optimization - to find the best model configuration.

Tags: hyperparameters, model-selection, grid-search, bayesian-optimization

Hyperparameter Tuning

Hyperparameters are settings that control the learning process (learning rate, regularization, tree depth). Unlike model parameters, they're set before training. Finding good hyperparameters is crucial for model performance.

Parameters vs Hyperparameters

Parameters                      Hyperparameters
Learned during training         Set before training
Weights, biases                 Learning rate, regularization
Optimized by gradient descent   Optimized by search
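The distinction is easiest to see in code. Below is a toy, stdlib-only sketch: gradient descent learns a single weight (the parameter), while the learning rate (a hyperparameter) is fixed before training starts and decides whether training converges at all.

```python
# Toy example: fit y = w * x with gradient descent.
# w is a parameter (learned from data); the learning rate is a
# hyperparameter (fixed before training starts).

def train(xs, ys, learning_rate, epochs=100):
    w = 0.0  # parameter: initialized arbitrarily, then learned
    for _ in range(epochs):
        # Gradient of mean squared error with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * grad  # the hyperparameter scales every update
    return w

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # true relationship: y = 2x

w_good = train(xs, ys, learning_rate=0.1)  # converges to ~2.0
w_bad = train(xs, ys, learning_rate=0.3)   # too large: updates oscillate and diverge
```

Nothing about the data tells you that 0.1 works and 0.3 blows up; that is exactly why hyperparameters have to be searched rather than learned.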

Search Strategies

Grid Search

Try all combinations of specified values:

from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'n_estimators': [100, 200, 300]
}

grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X, y)

print(grid_search.best_params_)
print(grid_search.best_score_)

Pros:

  • Exhaustive, won't miss good combinations
  • Simple to implement and parallelize

Cons:

  • Cost grows exponentially with the number of hyperparameters: the grid above already has 4×3×3 = 36 combinations (180 fits with cv=5)
  • Doesn't scale to many hyperparameters
  • Uniform grid may miss optimal values
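The combinatorial blow-up is easy to verify without sklearn; `itertools.product` is a stdlib stand-in for what `GridSearchCV` enumerates internally:

```python
from itertools import product

# The grid from the example above
param_grid = {
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'n_estimators': [100, 200, 300],
}

combinations = list(product(*param_grid.values()))
print(len(combinations))      # 4 * 3 * 3 = 36 combinations
print(len(combinations) * 5)  # 180 model fits with cv=5
```

Adding a fourth hyperparameter with five values would multiply this to 180 combinations and 900 fits, which is why grid search stops scaling quickly.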

Random Search

Sample random combinations:

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint

param_dist = {
    'max_depth': randint(3, 15),
    'min_samples_split': randint(2, 20),
    'learning_rate': uniform(0.01, 0.3),  # uniform(loc, scale): samples from [0.01, 0.31]
    'n_estimators': randint(100, 1000)
}

random_search = RandomizedSearchCV(
    model, param_dist, n_iter=100, cv=5, random_state=42
)
random_search.fit(X, y)

Pros:

  • More efficient than grid search
  • Explores continuous ranges
  • Can focus budget on important hyperparameters

Cons:

  • May miss optimal by chance
  • No learning from previous trials
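The efficiency argument (due to Bergstra and Bengio) is that when only one hyperparameter really matters, a grid wastes its budget: with 9 trials, a 3×3 grid probes each hyperparameter at just 3 distinct values, while random search probes 9. A small sketch (the value ranges are arbitrary illustrations):

```python
import random

rng = random.Random(42)

# Budget of 9 trials either way
grid_lr = [0.01, 0.1, 1.0]
grid_depth = [3, 6, 9]
grid_trials = [(lr, d) for lr in grid_lr for d in grid_depth]  # 3 x 3 grid

random_trials = [(rng.uniform(0.01, 1.0), rng.randint(3, 9)) for _ in range(9)]

# If learning_rate is the hyperparameter that actually matters, the grid
# explored it at only 3 points; random search explored it at 9.
distinct_grid = len({lr for lr, _ in grid_trials})      # 3
distinct_random = len({lr for lr, _ in random_trials})  # 9
```

Same budget, three times the coverage along the dimension that matters.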

Bayesian Optimization

Build a probabilistic surrogate model of the objective function, and use it to decide which configurations to evaluate next:

from skopt import BayesSearchCV
from skopt.space import Real, Integer

search_space = {
    'max_depth': Integer(3, 15),
    'learning_rate': Real(0.01, 0.3, prior='log-uniform'),
    'n_estimators': Integer(100, 1000)
}

bayes_search = BayesSearchCV(model, search_space, n_iter=50, cv=5)
bayes_search.fit(X, y)

Pros:

  • Learns from previous trials
  • More sample-efficient
  • Good for expensive evaluations

Cons:

  • More complex to implement
  • Overhead may not be worth it for cheap evaluations
  • Can get stuck in local optima
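To make the propose-evaluate loop concrete, here is a toy, stdlib-only sketch that minimizes a 1-D function. The surrogate is a crude inverse-distance-weighted average standing in for a real Gaussian process posterior, and the acquisition is a lower confidence bound: prefer points with a low predicted value (exploit) or far from anything evaluated so far (explore).

```python
import random

def objective(x):
    # The expensive black-box function we want to minimize (true minimum at x = 2)
    return (x - 2.0) ** 2

def surrogate(x, observed):
    # Crude stand-in for a GP posterior: inverse-distance-weighted mean of
    # observed scores, plus an uncertainty that grows with distance to the
    # nearest evaluated point.
    weights = [1.0 / (abs(x - xi) + 1e-9) for xi, _ in observed]
    mean = sum(w * yi for w, (_, yi) in zip(weights, observed)) / sum(weights)
    uncertainty = min(abs(x - xi) for xi, _ in observed)
    return mean, uncertainty

def optimize(n_iter=30, kappa=1.0, seed=0):
    rng = random.Random(seed)
    # Start from two random evaluations
    observed = [(x, objective(x)) for x in (rng.uniform(0, 5), rng.uniform(0, 5))]
    for _ in range(n_iter):
        candidates = [rng.uniform(0, 5) for _ in range(500)]
        # Lower-confidence-bound acquisition: low predicted value or high uncertainty
        def acquisition(x):
            mean, unc = surrogate(x, observed)
            return mean - kappa * unc
        x_next = min(candidates, key=acquisition)
        observed.append((x_next, objective(x_next)))
    return min(observed, key=lambda p: p[1])

best_x, best_y = optimize()
```

This captures the key difference from random search: each new evaluation is informed by all previous ones. Real libraries replace both the surrogate and the acquisition function with principled versions (Gaussian processes, expected improvement).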

Successive Halving / Hyperband

Start with many configurations, progressively eliminate poor performers:

from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingRandomSearchCV

halving_search = HalvingRandomSearchCV(
    model, param_dist, n_candidates=100, factor=3, cv=5
)
halving_search.fit(X, y)

How it works (with factor=3, using training epochs as the resource; scikit-learn's default resource is the number of training samples):

Round 1: 81 configs × 1 epoch each
Round 2: 27 configs × 3 epochs each (keep top 1/3)
Round 3: 9 configs × 9 epochs each (keep top 1/3)
Round 4: 3 configs × 27 epochs each (keep top 1/3)
Round 5: 1 config × 81 epochs (final)
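The payoff of this schedule is easy to tally: every round costs the same 81 epoch-equivalents, so five rounds cost 405 epochs versus 6561 for training every configuration to completion, roughly a 16x saving:

```python
# Cost of the successive-halving schedule above, in epoch-equivalents
rounds = [(81, 1), (27, 3), (9, 9), (3, 27), (1, 81)]

halving_cost = sum(configs * epochs for configs, epochs in rounds)
full_cost = 81 * 81  # all 81 configs trained for the full 81 epochs

print(halving_cost)  # 405
print(full_cost)     # 6561
```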

Pros:

  • Very efficient for deep learning
  • Early stopping of bad configs

Optuna

Modern framework combining Bayesian optimization with pruning:

import optuna
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 15),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000)
    }
    
    model = XGBClassifier(**params)
    score = cross_val_score(model, X, y, cv=5).mean()
    return score

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

print(study.best_params)

Features:

  • Pruning (early stopping of bad trials)
  • Visualization tools
  • Parallelization
  • Multiple samplers (TPE, CMA-ES, Grid, Random)
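In Optuna, a trial reports intermediate scores with `trial.report(value, step)` and checks `trial.should_prune()`. The core idea behind the default `MedianPruner` can be sketched in a few lines of plain Python (simplified; assumes higher scores are better):

```python
import statistics

def should_prune(step, value, finished_trials):
    # Prune if this trial's score at `step` falls below the median of the
    # scores that earlier trials reported at the same step.
    # (Simplified sketch of Optuna's MedianPruner logic.)
    past = [curve[step] for curve in finished_trials if step in curve]
    if not past:
        return False
    return value < statistics.median(past)

# Learning curves (step -> validation score) of two finished trials
finished = [
    {0: 0.60, 1: 0.70, 2: 0.75},
    {0: 0.55, 1: 0.65, 2: 0.72},
]

print(should_prune(1, 0.50, finished))  # True: 0.50 < median(0.70, 0.65)
print(should_prune(1, 0.69, finished))  # False
```

Pruning is what lets a 100-trial study cost far less than 100 full training runs.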

Important Hyperparameters by Model

Random Forest

Hyperparameter      Typical Range             Impact
n_estimators        100-1000                  More = better, diminishing returns
max_depth           5-30 or None              Controls overfitting
min_samples_split   2-20                      Higher = more regularization
max_features        'sqrt', 'log2', 0.3-0.8   Lower = more regularization

XGBoost

Hyperparameter     Typical Range   Impact
learning_rate      0.01-0.3        Lower = more trees needed
max_depth          3-10            Deeper = more complex
n_estimators       100-1000        Use with early stopping
subsample          0.5-1.0         Regularization
colsample_bytree   0.5-1.0         Regularization
reg_lambda         0-10            L2 regularization

Neural Networks

Hyperparameter   Typical Range   Impact
learning_rate    1e-5 to 1e-2    Critical!
batch_size       16-512          Affects convergence
hidden_layers    1-5             Depth
hidden_units     32-1024         Width
dropout          0.1-0.5         Regularization
weight_decay     1e-5 to 1e-2    L2 regularization

Best Practices

1. Start Simple

# Quick random search to find good region
RandomizedSearchCV(model, param_dist, n_iter=20, cv=3)

# Then refine with more iterations
RandomizedSearchCV(model, narrower_dist, n_iter=50, cv=5)

2. Use Log Scale for Learning Rates

# Wrong
'learning_rate': [0.01, 0.02, 0.03, ...]

# Right
'learning_rate': [0.001, 0.003, 0.01, 0.03, 0.1]
# Or
'learning_rate': Real(1e-4, 1e-1, prior='log-uniform')
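A quick way to see why: with a linear grid, almost all candidates land in the top decade of the range, while a log grid spends the budget evenly across decades. A small check (the candidate lists are illustrative):

```python
# Seven learning-rate candidates between 1e-4 and 1e-1, chosen two ways
linear = [1e-4 + i * (1e-1 - 1e-4) / 6 for i in range(7)]
log = [1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1]  # roughly log-spaced

def count_below(grid, threshold=1e-2):
    return sum(1 for x in grid if x < threshold)

# Small learning rates often matter most, but linear spacing barely samples them:
print(count_below(linear))  # 1 candidate below 0.01
print(count_below(log))     # 4 candidates below 0.01
```

The same reasoning applies to regularization strengths: what matters is usually the order of magnitude, so sample the exponent uniformly.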

3. Early Stopping for Iterative Models

model = XGBClassifier(
    n_estimators=1000,
    early_stopping_rounds=50  # constructor argument in xgboost >= 1.6
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)])

4. Separate Validation Set

# Hold out a test set first, then carve a validation set from the rest
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.2)

# Cross-validation for search
grid_search.fit(X_train, y_train)  # Not all of X!

# Final evaluation on the held-out test set
model.fit(X_trainval, y_trainval)
score = model.score(X_test, y_test)

5. Track Everything

import mlflow

with mlflow.start_run():
    mlflow.log_params(params)
    model.fit(X_train, y_train)
    mlflow.log_metric('accuracy', accuracy)
    mlflow.sklearn.log_model(model, 'model')

Key Takeaways

  1. Random search often beats grid search (more efficient)
  2. Bayesian optimization is best for expensive evaluations
  3. Use log scale for learning rates and regularization
  4. Early stopping reduces search space for iterative models
  5. Optuna is a great modern choice
  6. Always validate on held-out data, not training data
  7. Track experiments systematically