Dropout


The overfitting problem in neural networks

Neural networks with many parameters can memorize training data. A model with millions of weights and thousands of training examples has ample capacity to fit every training point — and every noise quirk — perfectly. The result is excellent training accuracy and poor generalization.

Standard remedies like L2 weight decay help but are often insufficient for deep networks. Dropout, introduced by Srivastava et al. in 2014, is a simple, powerful, and now-standard regularization technique specifically designed for neural networks.

What dropout does

During each training step, each neuron in a hidden layer is independently set to zero with probability p (the dropout rate). Neurons that survive are scaled up by 1/(1 - p) to keep the expected activation magnitude the same.

A different random subset is dropped at every training step. Across training, each neuron learns to be useful on its own — it cannot rely on any specific other neuron always being present.

During evaluation (test time), dropout is turned off. All neurons are active. No scaling is needed because the training-time scaling already corrected for the missing neurons. The test network computes a single deterministic prediction.
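As a concrete sketch of the mechanism above (plain NumPy; the function name and shapes are illustrative), inverted dropout fits in a few lines:

```python
import numpy as np

def dropout(x, p, training, rng):
    """Inverted dropout: during training, zero each unit with probability p
    and scale survivors by 1/(1 - p) so the expected activation is unchanged.
    At evaluation time the input passes through untouched."""
    if not training or p == 0.0:
        return x                       # test time: all units active, no scaling
    mask = rng.random(x.shape) >= p    # True where the unit survives
    return x * mask / (1.0 - p)        # scale up the survivors

rng = np.random.default_rng(0)
x = np.ones(10_000)
y = dropout(x, p=0.5, training=True, rng=rng)
# Roughly half of y is zero, survivors become 2.0, and y.mean() stays near 1.0.
```

Because the 1/(1 - p) correction happens during training, evaluation needs no adjustment, exactly as described above.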

Why it works: the ensemble interpretation

At each training step, dropout creates a different "thinned" network by removing a random subset of neurons. With n neurons in a layer and dropout rate 0.5, there are 2^n possible thinned networks — an astronomical ensemble.

At test time, using all neurons without dropping is approximately equivalent to averaging the predictions of this exponentially large ensemble. The full network's weights encode a kind of geometric mean of all the thinned networks seen during training.
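For a single linear layer this averaging claim can be checked numerically (a NumPy sketch with made-up weights; for deep nonlinear networks the equivalence is only approximate):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 8))   # one linear layer: 8 inputs -> 3 outputs
x = rng.normal(size=8)
p = 0.5

# Monte Carlo average over many random "thinned" networks.
n_samples = 20_000
total = np.zeros(3)
for _ in range(n_samples):
    mask = rng.random(8) >= p             # drop each input unit w.p. p
    total += W @ (x * mask / (1 - p))     # inverted-dropout forward pass
ensemble_mean = total / n_samples

# Test-time network: all units active, no extra scaling.
full = W @ x
# ensemble_mean is close to full: one deterministic pass approximates
# the average over the 2^8 possible thinned networks.
```

The exact match here relies on linearity; with nonlinear activations the full network computes roughly a geometric mean of the thinned networks' predictions, which is why the text says "approximately equivalent".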

This ensemble effect is the primary reason dropout works: it forces many redundant representations to be learned rather than any single fragile representation.

The co-adaptation problem

Without dropout, neurons can develop co-adaptations — neurons become dependent on specific other neurons, effectively learning a single complex detector that requires all its parts to be present. This produces brittle, overfit representations.

Dropout prevents co-adaptation by randomly removing neurons, forcing each neuron to develop features useful in many contexts rather than features that only work alongside specific partners.

Choosing the dropout rate p

  • p = 0: no dropout.
  • p = 0.5: the original recommendation for fully connected layers. Half the neurons are dropped at each step.
  • p = 0.1–0.2: common for convolutional layers, where spatial structure means neighboring neurons encode similar information (more redundancy built in).

Higher dropout rates provide more regularization but can underfit if too aggressive. Tune p as a hyperparameter; 0.3–0.5 is a common range for fully connected layers.

Where to apply dropout

Apply dropout to hidden layers, typically after the activation function. Do not apply dropout to:

  • The input layer (rarely helpful; can discard important features).
  • The output layer (should produce deterministic predictions).
  • Batch normalization layers (the two interact poorly when combined carelessly).
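In PyTorch-style code, this placement looks like the following (a hypothetical model; the layer sizes are arbitrary). Note that nn.Dropout handles the 1/(1 - p) scaling automatically, and model.train() / model.eval() toggle dropout on and off:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),   # no dropout on the raw inputs
    nn.ReLU(),
    nn.Dropout(p=0.5),     # after the activation of a hidden layer
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(128, 10),    # no dropout on the output layer
)

model.eval()               # disables dropout: predictions are deterministic
x = torch.randn(1, 784)
with torch.no_grad():
    assert torch.equal(model(x), model(x))  # same input -> same output

model.train()              # re-enables dropout for the next training step
```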

Dropout and other regularization

Dropout is often combined with L2 weight decay. The two address different failure modes: weight decay prevents individually large weights; dropout prevents co-adaptation. Together they are more effective than either alone.
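A sketch of combining the two in PyTorch (the optimizer choice and hyperparameter values are illustrative, not a recommendation):

```python
import torch
import torch.nn as nn

# Dropout addresses co-adaptation inside the model...
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(64, 2),
)

# ...while L2 weight decay penalizes individually large weights.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
```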

Early stopping should also be used alongside dropout — even with dropout, training long enough will eventually overfit.