Hyperparameter tuning

Parameters vs hyperparameters

A model's parameters are learned from data — the weights of a neural network, the coefficients of a regression, the split thresholds of a tree. Hyperparameters are set before training and control how the learning process works. They cannot be learned from the training data in the usual sense because they govern the training procedure itself.

Examples: learning rate, number of trees, max tree depth, regularization strength, dropout rate, number of layers, batch size, kernel type.
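The distinction can be made concrete with a short sketch (scikit-learn assumed here; any library shows the same split):

```python
# C is a hyperparameter: chosen before fitting, it controls regularization.
# coef_ holds parameters: learned from the data during fitting.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression(C=1.0)   # hyperparameter: set by us
model.fit(X, y)
print(model.coef_.shape)            # parameters: learned from (X, y)
```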

Hyperparameter tuning is the process of searching for the hyperparameter values that produce the best-performing model on held-out data.

Why it matters

The same algorithm with different hyperparameters can produce dramatically different results. A neural network with a learning rate of 0.1 may diverge; with 0.001 it may converge to an excellent model. A gradient boosting model with max_depth=10 may overfit badly; with max_depth=4 it may generalize well. Tuning is not a detail — it can be the difference between a useful model and a useless one.
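The depth example can be demonstrated in a few lines. This is an illustrative sketch (scikit-learn and a synthetic dataset assumed), using a decision tree because the effect is easiest to see there:

```python
# Same algorithm, different hyperparameters: an unbounded tree memorizes
# the (noisy) training data, while a shallow tree cannot.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (4, None):  # shallow vs. unbounded depth
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(depth, tree.score(X_tr, y_tr), tree.score(X_te, y_te))
```

With 10% label noise (`flip_y=0.1`), the unbounded tree reaches perfect training accuracy but pays for it on the test set.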

The tuning workflow

  1. Define a search space: for each hyperparameter, specify the range or set of values to search. Use logarithmic spacing for scale parameters (learning rate, regularization strength) and linear spacing for parameters like depth or number of trees.

  2. Choose a search strategy: grid search, random search, or Bayesian optimization (covered in the next lesson).

  3. Evaluate with cross-validation: for each candidate configuration, train the model and evaluate on held-out validation data. Never use the test set during tuning.

  4. Select the best configuration: pick the hyperparameter setting with the best validation performance.

  5. Retrain on the full training set: using the selected hyperparameters, retrain on all available training data before final evaluation.
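The five steps above can be written out explicitly. A minimal sketch, assuming scikit-learn for the data, model, and cross-validation utilities:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=600, random_state=0)
# Hold out a test set that tuning never sees.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

alphas = np.logspace(-4, 3, 8)                       # step 1: log-spaced search space
scores = {}
for alpha in alphas:                                 # step 2: exhaustive (grid) search
    model = LogisticRegression(C=1 / alpha, max_iter=1000)
    scores[alpha] = cross_val_score(model, X_train, y_train, cv=5).mean()  # step 3

best_alpha = max(scores, key=scores.get)             # step 4: best validation score
final = LogisticRegression(C=1 / best_alpha, max_iter=1000)
final.fit(X_train, y_train)                          # step 5: retrain on all training data
print(best_alpha, final.score(X_test, y_test))       # final evaluation, done once
```

In practice a helper such as `GridSearchCV` wraps steps 2–5, but the loop makes the workflow explicit.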

Common hyperparameters by model type

Linear models:

  • Regularization strength λ (or C = 1/λ in scikit-learn): most important. Search on a log scale: {10⁻⁴, 10⁻³, …, 10³}.
  • Penalty type (L1 vs L2).
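Both knobs can be searched together. A sketch, assuming scikit-learn's `GridSearchCV` and the `liblinear` solver (which supports both L1 and L2 penalties):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=400, random_state=0)
grid = {
    "C": np.logspace(-4, 3, 8),   # {1e-4, ..., 1e3}, log-spaced
    "penalty": ["l1", "l2"],      # penalty type
}
search = GridSearchCV(LogisticRegression(solver="liblinear"), grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```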

Gradient boosting:

  • Learning rate: try {0.01, 0.05, 0.1, 0.3}, paired with early stopping to choose the number of trees.
  • Max depth: try {3, 4, 5, 6}.
  • Subsampling fractions (of rows and of features).

Neural networks:

  • Learning rate: most critical. Use a learning rate finder.
  • Architecture (number of layers, neurons per layer).
  • Dropout rate.
  • Batch size.
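A hedged sketch of searching these knobs with scikit-learn's `MLPClassifier` and a randomized search (`MLPClassifier` has no dropout, so that knob is omitted here; a deep learning framework would expose it):

```python
# Random search: learning rate sampled log-uniformly (most critical),
# architecture and batch size from small discrete sets.
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, random_state=0)
space = {
    "learning_rate_init": loguniform(1e-4, 1e-1),    # log-scale
    "hidden_layer_sizes": [(32,), (64,), (64, 32)],  # layers / width
    "batch_size": [32, 64, 128],
}
search = RandomizedSearchCV(
    MLPClassifier(max_iter=200, random_state=0),
    space, n_iter=5, cv=3, random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```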

Tree-based:

  • Max depth, min samples per leaf.
  • For Random Forest: number of trees, max features.
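A minimal sketch of tuning the listed Random Forest knobs with a small grid (scikit-learn assumed; the value sets are illustrative, not prescriptive):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=400, random_state=0)
grid = {
    "n_estimators": [100, 300],
    "max_features": ["sqrt", 0.5],
    "min_samples_leaf": [1, 5],
    "max_depth": [None, 8],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```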

Tuning order

Start with the most impactful hyperparameters. For most models:

  1. Regularization strength (biggest impact on generalization).
  2. Learning rate (biggest impact on convergence).
  3. Model capacity (depth, width, number of trees).
  4. Secondary parameters (subsampling, momentum) — tune these last.

Overfitting to the validation set

Repeated tuning against the same validation set can cause the selected hyperparameters to overfit to it — the model appears to perform well on validation because you tried many configurations and selected the best. This is analogous to overfitting training data but at the hyperparameter level.

Mitigation strategies: use kk-fold cross-validation instead of a single split, limit the number of configurations tried, and keep a true held-out test set that is never used during tuning.
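Nested cross-validation combines these ideas: an inner loop picks hyperparameters, while the outer loop estimates performance on folds the tuning never touched. A sketch, assuming scikit-learn:

```python
# The outer cross_val_score sees only the inner search's final model,
# so the reported score is not inflated by hyperparameter selection.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=400, random_state=0)
inner = GridSearchCV(LogisticRegression(max_iter=1000),
                     {"C": np.logspace(-3, 3, 7)}, cv=3)   # inner: tuning
outer_scores = cross_val_score(inner, X, y, cv=5)          # outer: evaluation
print(outer_scores.mean())
```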