Anomaly detection


What is an anomaly?

An anomaly (also called an outlier) is a data point that differs significantly from the rest of the data. Anomaly detection is the task of identifying such points.

The challenge is that anomalies are rare and often unknown in advance — you typically cannot train a supervised model because you have very few (or no) labeled anomalies to learn from. Most anomaly detection methods are therefore unsupervised: they learn what "normal" looks like and flag anything that does not fit.

Applications: fraud detection, network intrusion, equipment failure prediction, medical diagnosis, quality control.

Approaches

Statistical methods

The simplest approach: model the normal data with a distribution and flag points in the low-probability tails.

For univariate data, the z-score measures how many standard deviations a point is from the mean. Points with |z| > 3 are often flagged as outliers. This works well when the data is approximately Gaussian.
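As a minimal sketch of z-score flagging on synthetic data (the injected value 120 is far outside the normal range):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50.0, scale=5.0, size=1000)
data = np.append(data, 120.0)  # inject one obvious outlier

# z-score: distance from the mean in units of standard deviation
z = (data - data.mean()) / data.std()
outliers = data[np.abs(z) > 3]
```

The injected point ends up in `outliers`; a handful of legitimate tail samples may be flagged too, which is the expected ~0.3% false positive rate under a Gaussian.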

For multivariate data, the Mahalanobis distance generalizes the z-score — it accounts for the correlations between features so that a point that is extreme in a correlated combination of features is also flagged, even if it is not extreme on any single feature individually.
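A small illustration of that point, using the empirical mean and covariance on synthetic correlated data: the point (2, −2) is unremarkable on each feature alone but violates the positive correlation, so its Mahalanobis distance is much larger than that of (2, 2).

```python
import numpy as np

rng = np.random.default_rng(0)
# Two strongly positively correlated features
cov = np.array([[1.0, 0.9], [0.9, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], cov, size=500)

mean = X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))

def mahalanobis(x):
    # sqrt((x - mean)^T  Sigma^-1  (x - mean))
    d = x - mean
    return np.sqrt(d @ inv_cov @ d)

d_against = mahalanobis(np.array([2.0, -2.0]))  # violates the correlation
d_along = mahalanobis(np.array([2.0, 2.0]))     # consistent with it
```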

Isolation Forest

Isolation Forest is the most widely used anomaly detection algorithm for tabular data. The key insight: anomalies are easy to isolate. Build many random decision trees, each splitting features at random values. Normal points, embedded in dense regions, require many splits to isolate. Anomalies, sitting in sparse regions, are isolated in just a few splits.

The anomaly score is the average depth at which a point is isolated across all trees — shallower isolation = more anomalous. This is fast, scales well to high dimensions, and requires no distributional assumptions.
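A sketch with scikit-learn's `IsolationForest`, where a handful of points planted far from the dense cluster get the `-1` (anomaly) label:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_normal = rng.normal(0.0, 1.0, size=(500, 2))
X_anomalies = rng.uniform(6.0, 8.0, size=(5, 2))  # far from the cluster
X = np.vstack([X_normal, X_anomalies])

# contamination = expected fraction of anomalies in the data
clf = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = clf.predict(X)  # -1 = anomaly, 1 = normal
```

The `contamination` parameter is the threshold in disguise: it fixes what fraction of points get flagged, which connects directly to the threshold discussion below.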

KNN-based detection

Flag points whose distance to their k-th nearest neighbor is large. A point with few close neighbors is isolated — likely an anomaly. Simple and interpretable, but slow (O(m^2) in the number of points m) and degrades in high dimensions.
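A sketch using scikit-learn's `NearestNeighbors` (note the `k + 1` query: each point's nearest neighbor is itself at distance zero):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(300, 2))
X = np.vstack([X, [[8.0, 8.0]]])  # one isolated point

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1 to skip self-match
dists, _ = nn.kneighbors(X)
scores = dists[:, -1]  # anomaly score: distance to the k-th neighbor
```

The isolated point at (8, 8) receives by far the largest score.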

Autoencoder-based detection

For complex, high-dimensional data (images, time series), train an autoencoder — a neural network that learns to compress data into a low-dimensional representation and then reconstruct it. Normal examples, which the network has seen many of, are reconstructed accurately. Anomalies, being unlike anything in training, produce large reconstruction error. Flag high reconstruction error as anomalous.
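As a toy sketch of the reconstruction-error idea — using scikit-learn's `MLPRegressor` trained to reproduce its own input through a 2-unit bottleneck as a stand-in autoencoder (a real deployment would use a deep learning framework):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Normal data lies near a 1-D manifold inside 4-D space
t = rng.uniform(-1.0, 1.0, size=(500, 1))
X = np.hstack([t, 2 * t, -t, 0.5 * t]) + rng.normal(0, 0.05, size=(500, 4))

# Tiny "autoencoder": an MLP that maps X back to X through a bottleneck
ae = MLPRegressor(hidden_layer_sizes=(2,), max_iter=3000, random_state=0)
ae.fit(X, X)

def reconstruction_error(points):
    return np.mean((ae.predict(points) - points) ** 2, axis=1)

# This point violates the structure the network learned
x_anom = np.array([[1.0, -2.0, 1.0, -0.5]])
```

The anomalous point's reconstruction error is far above that of typical training points, so a threshold on the error separates it cleanly.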

The threshold problem

Anomaly detection produces a score, not a binary label. To flag specific points, you apply a threshold. Setting the threshold requires knowing (or estimating) what false positive rate is acceptable.

If labeled anomalies are available for validation (even a few), use them to calibrate the threshold. Otherwise, set the threshold by the expected anomaly rate: "I expect 1% of transactions to be fraudulent — flag the top 1% by anomaly score."
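The quantile-based rule is one line of NumPy (scores here are a synthetic stand-in for any detector's output):

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.exponential(scale=1.0, size=10_000)  # stand-in anomaly scores

expected_anomaly_rate = 0.01  # "I expect 1% to be fraudulent"
threshold = np.quantile(scores, 1 - expected_anomaly_rate)
flagged = scores > threshold  # top 1% by anomaly score
```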

Supervised anomaly detection

When labeled anomaly examples exist but are rare, the problem becomes imbalanced classification rather than unsupervised detection. Use techniques from the imbalanced data lesson: class weights, SMOTE, PR-AUC as the evaluation metric. Supervised models with careful class weighting often outperform unsupervised methods when any labeled anomalies are available.
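A minimal sketch of that supervised route, using `class_weight="balanced"` (which reweights the rare class inversely to its frequency) and PR-AUC via scikit-learn's `average_precision_score`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
n_normal, n_anom = 2000, 20  # heavily imbalanced: 1% anomalies
X = np.vstack([rng.normal(0.0, 1.0, size=(n_normal, 2)),
               rng.normal(3.0, 1.0, size=(n_anom, 2))])
y = np.array([0] * n_normal + [1] * n_anom)

# class_weight="balanced" compensates for the 100:1 imbalance
clf = LogisticRegression(class_weight="balanced").fit(X, y)
scores = clf.predict_proba(X)[:, 1]
pr_auc = average_precision_score(y, scores)
```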

Practical guidance

Situation → recommended approach:

No labeled anomalies, tabular data → Isolation Forest
No labeled anomalies, image/time-series → autoencoder reconstruction error
Small data, interpretability needed → KNN distance or z-score
Some labeled anomalies available → imbalanced classification
Need probability of anomaly → Gaussian mixture model