Train/test split


Why you cannot evaluate on training data

A model trained on a dataset can always be made to fit that data better by adding complexity — eventually memorizing every example perfectly. Training error is therefore a hopelessly optimistic measure of how the model will perform on new data. To get an honest estimate of real-world performance, you need to evaluate on data the model has never seen.

The train/test split is the foundational technique: divide the dataset into two non-overlapping subsets before training, use one to fit the model and the other to evaluate it.

The split

A typical split is 80% training, 20% test, though the right ratio depends on dataset size:

  • Small datasets (~1,000 examples): consider 70/30 or even 60/40 to give the test set enough examples for a reliable estimate.
  • Large datasets (~100,000+): 90/10 or even 95/5 is fine — 5,000 test examples are more than enough for a stable evaluation.

The split should be random for i.i.d. (independent and identically distributed) data. Use a fixed random_state (or equivalent seed) so the split is reproducible across runs.
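As a minimal sketch, here is an 80/20 random split with scikit-learn's train_test_split (the toy arrays are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # 50 toy examples, 2 features each
y = np.arange(50) % 2              # toy binary labels

# 80% train / 20% test; random_state fixes the shuffle for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))  # 40 10
```

Re-running with the same random_state always produces the same partition, which matters when you want experiments to be comparable.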

The test set is sacred

The test set has one job: provide a final, unbiased estimate of performance after all decisions are made. Violating this is one of the most common mistakes in applied ML.

The moment you use the test set to make any decision — choosing a model, tuning a hyperparameter, selecting features — it is no longer an unbiased estimate. You have effectively trained on it, even if indirectly. The test set must be set aside and not looked at until you have completely finished building your model.

The need for a validation set

With only a train/test split, how do you tune hyperparameters or compare models? You cannot use the test set. The solution is a three-way split:

  • Training set: fit model parameters.
  • Validation set: tune hyperparameters, compare models, make all decisions.
  • Test set: final evaluation only.

A typical split is 70% train / 15% validation / 15% test, adjusted for dataset size.
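One common way to build the three-way split is to call train_test_split twice: first carve off the test set, then split the remainder into train and validation. A sketch with toy data (integer sizes are used here to keep the counts exact):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)  # 100 toy examples
y = np.arange(100)                  # toy targets

# Step 1: hold out 15 examples as the final test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=15, random_state=0
)
# Step 2: split the remaining 85 into 70 train / 15 validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=15, random_state=0
)
```

The test set from step 1 is never touched again until final evaluation; all tuning decisions use only X_train and X_val.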

The validation set serves the same purpose for model selection that the test set serves for final evaluation — it is held out from training, but it is acceptable to use it multiple times during development. Its estimates become slightly optimistic over time as you repeatedly optimize against it, which is why the test set must remain separate.

Stratified splits

For classification, a purely random split may by chance put most of one class in the test set and few in training — a real risk when classes are imbalanced. Stratified splitting ensures the class proportions in each split match those of the full dataset. Always use stratified splits for classification; scikit-learn's train_test_split accepts stratify=y for this.
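The effect is easy to see on an imbalanced toy dataset: with stratify=y, the 90/10 class ratio is preserved exactly in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 90 examples of class 0, 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(100, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The 20-example test set gets exactly 2 minority-class examples (10%),
# mirroring the full dataset's class proportions.
print((y_test == 1).sum())  # 2
```

Without stratify, the minority count in the test set would vary from split to split and could even be zero.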

When the split is not straightforward

Random splitting assumes examples are independent. This assumption breaks down in:

  • Time series: examples close in time are correlated. Use a temporal split — put earlier data in training and later data in test. Never shuffle.
  • Grouped data: if multiple rows belong to the same entity (patient, user, session), all rows from one entity must be in the same split. Otherwise the model appears to generalize when it is really memorizing entity-specific patterns.
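Both cases can be handled with a few lines. A sketch (the group labels and sizes are invented for illustration): a temporal split is just an index slice on time-ordered data, and scikit-learn's GroupShuffleSplit keeps all rows of an entity on the same side of the split.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.arange(20).reshape(10, 2)  # 10 toy rows, assumed time-ordered

# Temporal split: no shuffling — earlier rows train, later rows test.
cut = int(len(X) * 0.8)
X_train_ts, X_test_ts = X[:cut], X[cut:]

# Grouped split: rows 0-9 belong to 4 entities (e.g. patients).
groups = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3, 3])
gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(gss.split(X, groups=groups))

# No entity appears in both splits.
assert set(groups[train_idx]).isdisjoint(set(groups[test_idx]))
```

The group-level assertion at the end is exactly the property a plain random split would violate: rows from one patient leaking into both splits inflates the apparent test score.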

These cases are covered in detail in the cross-validation and time-series CV lessons.