Decision trees
The idea
A decision tree makes predictions by asking a sequence of yes/no questions about the input features, following a path through a branching structure until it reaches a final answer. Each internal node tests one feature; each branch represents an outcome of that test; each leaf holds a prediction.
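This root-to-leaf walk can be sketched in a few lines. The tree below is a made-up two-split example represented as nested dicts (the feature indices, thresholds, and leaf labels are invented for illustration):

```python
# Hypothetical fitted tree: internal nodes ask "is feature <= threshold?",
# leaves store the final answer.
tree = {
    "feature": 0, "threshold": 2.5,
    "left": {"leaf": "cat"},
    "right": {
        "feature": 1, "threshold": 1.0,
        "left": {"leaf": "dog"},
        "right": {"leaf": "cat"},
    },
}

def predict(node, x):
    """Follow the yes/no questions from the root until a leaf is reached."""
    while "leaf" not in node:
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]
```

Reading it out loud works exactly as described: "is feature 0 at most 2.5? If not, is feature 1 at most 1.0?" and so on until a leaf answers.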
The appeal is immediacy: you can read a shallow tree out loud and a non-technical person can follow it. Few other models offer this level of transparency.
How a tree is built
Trees are grown using a greedy, recursive splitting algorithm (CART is the standard). At each node:
- Try every possible split — every feature, every threshold.
- Pick the split that most reduces impurity in the resulting child nodes.
- Recurse on each child until a stopping condition is met.
The stopping conditions are hyperparameters: maximum depth, minimum examples per leaf, minimum impurity decrease. Without them, the tree grows until every leaf is pure — perfect training accuracy, extreme overfitting.
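The greedy step itself is short enough to write out. Here is a from-scratch sketch for binary classification using Gini impurity (the function names and the exhaustive threshold search are my own choices, not a reference implementation):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(X, y):
    """Try every feature and every threshold; return the split that most
    reduces weighted impurity (the greedy CART step) as (gain, feature, threshold)."""
    n = len(y)
    parent_impurity = gini(y)
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [y[i] for i in range(n) if X[i][j] <= t]
            right = [y[i] for i in range(n) if X[i][j] > t]
            if not left or not right:
                continue  # degenerate split: one side empty
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
            gain = parent_impurity - weighted
            if best is None or gain > best[0]:
                best = (gain, j, t)
    return best
```

For example, with one feature, X = [[1], [2], [3], [4]] and y = [0, 0, 1, 1], the search picks the threshold 2, which separates the classes perfectly (gain 0.5, from parent impurity 0.5 down to 0).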
Predictions
Classification: each leaf stores the majority class of the training examples that reached it. Prediction for a new example is that majority class.
Regression: each leaf stores the mean of the training targets that reached it. The prediction surface is piecewise constant — step functions.
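Both rules reduce to a one-line summary statistic per leaf. A sketch, with made-up examples standing in for "the training examples that reached this leaf":

```python
from collections import Counter
from statistics import mean

# Hypothetical training data that ended up in one particular leaf:
leaf_labels = ["spam", "spam", "ham"]   # classification leaf
leaf_targets = [3.0, 2.0, 4.0]          # regression leaf

# Classification: the leaf stores the majority class.
majority_class = Counter(leaf_labels).most_common(1)[0][0]   # "spam"

# Regression: the leaf stores the mean target.
leaf_mean = mean(leaf_targets)   # 3.0

# Every future example routed to this leaf receives the same stored answer,
# which is why a regression tree's prediction surface is piecewise constant.
```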
Strengths
- Requires no feature scaling.
- Handles mixed feature types (numerical and categorical) naturally.
- Highly interpretable for shallow trees.
- Captures non-linear relationships and feature interactions without manual engineering.
Weaknesses
High variance. This is the central problem with individual trees. The greedy splitting algorithm is sensitive to small changes in training data — a different sample can produce a completely different tree. This instability is what motivates ensemble methods.
Poor extrapolation. Regression trees predict the mean of the training targets in each leaf. For inputs outside the training range, they still return the nearest leaf's mean; the output can never go beyond the range of the training targets.
Axis-aligned boundaries only. Every split is of the form x_j <= t: one feature compared against one threshold, so every decision boundary is perpendicular to a feature axis. A diagonal or curved decision boundary requires many splits to approximate.
Controlling overfitting
The main levers:
- max_depth: most impactful. Depth 3–5 often works well as a standalone tree.
- min_samples_leaf: prevents very small, noisy leaves.
- min_impurity_decrease: only split if the gain exceeds a threshold.
- Post-pruning (cost-complexity pruning): grow the full tree, then cut back branches that do not improve validation performance.
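In scikit-learn these levers map directly onto constructor arguments of DecisionTreeClassifier. A sketch on a synthetic dataset (the specific values here are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = DecisionTreeClassifier(
    max_depth=4,                 # the most impactful lever
    min_samples_leaf=5,          # no leaf may hold fewer than 5 examples
    min_impurity_decrease=1e-3,  # split only if impurity drops by at least this
    ccp_alpha=0.0,               # > 0 enables cost-complexity post-pruning
    random_state=0,
).fit(X, y)

# The fitted tree respects the constraints, e.g. clf.get_depth() <= 4.
```

Cost-complexity pruning is the odd one out: rather than stopping growth early, a positive ccp_alpha grows the full tree and then removes branches whose impurity improvement does not justify their complexity.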
In practice, standalone trees are rarely the final model. Their real value is as building blocks for Random Forests and gradient boosting.