Imbalanced data handling
The problem
Class imbalance occurs when one class greatly outnumbers another in the training data. Fraud detection (0.1% fraud), medical diagnosis (rare disease), spam filtering — in all of these, the interesting class is the minority.
The danger: a model that always predicts the majority class achieves high accuracy while being completely useless. If 99% of transactions are legitimate, predicting "not fraud" for everything gives 99% accuracy and catches zero fraud. Accuracy is the wrong metric and the wrong training signal.
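The accuracy paradox is easy to demonstrate on toy data. This sketch builds a hypothetical 99:1 dataset and scores a classifier that always predicts the majority class:

```python
# Hypothetical 99:1 dataset: a classifier that always predicts
# "not fraud" scores 99% accuracy yet catches zero fraud.
labels = [0] * 990 + [1] * 10          # 1% positive (fraud) class
predictions = [0] * len(labels)        # majority-class baseline

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
caught = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))

print(f"accuracy: {accuracy:.2%}")     # 99.00%
print(f"frauds caught: {caught}")      # 0
```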
Why standard training fails
Most loss functions treat every example equally. When the negative class outnumbers the positive 99:1, the model sees 99 times more gradient signal from negatives. It learns that the safe default is to predict negative — and the training loss rewards it for this. The minority class is effectively ignored.
Evaluation metrics that actually matter
Before fixing the imbalance, fix the evaluation metric.
Accuracy is misleading. Use instead:
- Precision: of all positive predictions, what fraction is correct? Penalizes false alarms.
- Recall (sensitivity): of all actual positives, what fraction did the model catch? Penalizes missed cases.
- F1 score: harmonic mean of precision and recall. A single balanced metric.
- ROC-AUC: measures ranking quality across all thresholds — how well the model separates classes. Threshold-independent.
- PR-AUC (Average Precision): area under the precision-recall curve. More informative than ROC-AUC when the positive class is very rare, because it is not inflated by the large number of true negatives.
For severely imbalanced problems, PR-AUC is often the most honest metric.
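As a concrete illustration with scikit-learn's metric functions, consider a toy evaluation with 5 positives among 100 examples, where the model catches 3 of them and raises 2 false alarms (the data here is made up for the example):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# 5 positives among 100 examples; the model catches 3 and raises 2 false alarms.
y_true = [1] * 5 + [0] * 95
y_pred = [1, 1, 1, 0, 0] + [1, 1] + [0] * 93

precision = precision_score(y_true, y_pred)  # 3 / (3 + 2) = 0.6
recall = recall_score(y_true, y_pred)        # 3 / 5       = 0.6
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two = 0.6
```

Accuracy on the same predictions would be 96%, which says almost nothing about how the model handles the minority class.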
Remedy 1: adjust the classification threshold
The default threshold of 0.5 is not optimal under imbalance. Lowering it to 0.2 or 0.1 means the model predicts positive more readily — increasing recall at the cost of precision.
This is the simplest fix and should always be tried first. Use the ROC or precision-recall curve on a validation set to choose the threshold that best suits your cost requirements.
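A minimal sketch of threshold selection, using made-up validation scores and sweeping a few candidate thresholds for the one that maximizes F1 (in practice you would sweep the thresholds returned by a precision-recall curve):

```python
# Toy threshold sweep on hypothetical validation scores: pick the
# operating point that maximizes F1 instead of defaulting to 0.5.
scores = [0.05, 0.10, 0.15, 0.22, 0.30, 0.45, 0.55, 0.08, 0.12, 0.35]
labels = [0,    0,    0,    1,    0,    1,    1,    0,    0,    1]

best_f1, best_threshold = 0.0, 0.5
for t in [0.1, 0.2, 0.3, 0.4, 0.5]:
    preds = [1 if s >= t else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    if tp == 0:
        continue
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    if f1 > best_f1:
        best_f1, best_threshold = f1, t

print(best_threshold)  # 0.2 for this toy data, well below the 0.5 default
```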
Remedy 2: class weights
Most algorithms accept a class_weight parameter that increases the penalty for misclassifying minority-class examples. Setting class_weight='balanced' in scikit-learn automatically weights classes inversely proportional to their frequency.
The weight assigned to class j is n / (k * n_j), where n is the total number of examples, k is the number of classes, and n_j is the number of examples in class j. The model then sees the minority class as proportionally more important during training.
This is computationally free and works well. Try it before any resampling.
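The 'balanced' heuristic can be reproduced by hand in a few lines; the weighted totals of the two classes come out equal, which is the point. A sketch on a hypothetical 9:1 dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Reproduce scikit-learn's 'balanced' heuristic: weight_j = n / (k * n_j)
y = np.array([0] * 90 + [1] * 10)   # hypothetical 9:1 imbalance
n, k = len(y), 2
counts = np.bincount(y)             # [90, 10]
weights = n / (k * counts)          # [~0.556, 5.0]

# 90 * 0.556 == 10 * 5.0: both classes contribute equal total weight.

# Equivalent one-liner in most scikit-learn estimators:
clf = LogisticRegression(class_weight="balanced")
# then fit with clf.fit(X, y) as usual
```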
Remedy 3: oversampling the minority class
Random oversampling duplicates minority-class examples at random until the classes are balanced. Simple, but the duplicated examples add no new information — the model just sees the same minority examples more often.
SMOTE (Synthetic Minority Over-sampling Technique) creates new synthetic minority examples by interpolating between existing ones. For a minority example, it randomly selects one of its nearest neighbors and creates a new point along the line between them. This generates diverse, plausible minority examples rather than duplicates.
SMOTE should be applied only to the training set — never to validation or test. Apply it after the train-test split to avoid leakage.
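The interpolation idea behind SMOTE fits in a few lines of NumPy. This is a simplified sketch, not a replacement for the battle-tested implementation in the imbalanced-learn library; the function name and toy data are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_sketch(X_min, n_new, k=3):
    """SMOTE-style oversampling: for a random minority point, pick one of
    its k nearest minority neighbors and interpolate between the two."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)   # distances to all points
        neighbors = np.argsort(d)[1:k + 1]             # skip the point itself
        j = rng.choice(neighbors)
        lam = rng.random()                             # position along the segment
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_minority = rng.normal(loc=5.0, scale=1.0, size=(20, 2))  # toy minority class
X_new = smote_sketch(X_minority, n_new=80)                 # 20 real + 80 synthetic
```

Every synthetic point lies on a segment between two real minority points, so the new examples stay inside the region the minority class already occupies.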
Remedy 4: undersampling the majority class
Random undersampling removes majority-class examples at random until the classes are balanced. Fast and reduces training time, but discards potentially useful information.
Most useful when the majority class is very large and training speed is a concern. Often combined with oversampling: undersample the majority and oversample the minority to reach a moderate imbalance rather than forcing perfect balance.
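A minimal sketch of undersampling to a moderate ratio rather than perfect balance, on made-up data (10,000 negatives, 200 positives, target 5:1):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical dataset: 10,000 negatives vs 200 positives.
X = rng.normal(size=(10_200, 3))
y = np.array([0] * 10_000 + [1] * 200)

pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)

# Keep all positives; sample negatives down to a 5:1 ratio, not 1:1.
keep_neg = rng.choice(neg_idx, size=5 * len(pos_idx), replace=False)
keep = np.concatenate([pos_idx, keep_neg])
X_res, y_res = X[keep], y[keep]    # 200 positives, 1,000 negatives
```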
Remedy 5: algorithm-level approaches
Gradient boosting with appropriate scale_pos_weight (XGBoost) or class_weight (LightGBM) handles imbalance directly in the loss without resampling.
Anomaly detection framing: for extreme imbalances (0.01% positive), reframe the problem entirely — treat the minority class as anomalous and use unsupervised or semi-supervised anomaly detection rather than classification.
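For XGBoost, the documented rule of thumb is scale_pos_weight = (count of negatives) / (count of positives). A sketch on hypothetical 99:1 training labels (the xgboost calls are commented out and assume the library is installed):

```python
import numpy as np

y_train = np.array([0] * 9900 + [1] * 100)   # hypothetical 99:1 training labels

neg, pos = (y_train == 0).sum(), (y_train == 1).sum()
scale_pos_weight = neg / pos                 # 99.0

# params = {"objective": "binary:logistic",
#           "scale_pos_weight": scale_pos_weight}
# model = xgboost.train(params, dtrain)
```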
What to try and in what order
- Fix the evaluation metric first. Switch from accuracy to F1, ROC-AUC, or PR-AUC.
- Adjust the threshold. Plot the precision-recall curve and pick the operating point.
- Set class weights. Free, effective, no data manipulation required.
- Try SMOTE if class weights are insufficient.
- Consider undersampling if the dataset is very large and training is slow.
- Do not force perfect balance. A 1:1 ratio is not always better than 1:5 or 1:10. Tune the balance ratio as a hyperparameter.
A common mistake: resampling before splitting
Applying SMOTE or oversampling to the full dataset before splitting into train and test sets introduces leakage — synthetic examples derived from test-set points appear in training. Always split first, then resample only the training set.
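The correct order can be sketched as follows, here with plain random oversampling (duplication) standing in for SMOTE and made-up data; note the test set's class counts are untouched:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.array([0] * 950 + [1] * 50)   # hypothetical 19:1 dataset

# 1) Split first; stratify preserves the class ratio in both halves.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 2) Resample the TRAINING set only (random oversampling by duplication).
pos = np.flatnonzero(y_tr == 1)
extra = rng.choice(pos, size=int((y_tr == 0).sum()) - len(pos), replace=True)
X_tr_bal = np.concatenate([X_tr, X_tr[extra]])
y_tr_bal = np.concatenate([y_tr, y_tr[extra]])

# The test set still reflects the real-world class distribution.
```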