Feature Scaling and Normalization

Why scale features?

When features have very different ranges, gradient descent can behave poorly. Imagine predicting house prices with two features:

  • x_1: size in square feet, ranging from 500 to 5000.
  • x_2: number of bedrooms, ranging from 1 to 8.

The cost surface in (\theta_1, \theta_2) space becomes a very elongated ellipse. Gradient descent steps that are appropriately sized for \theta_2 are far too large for \theta_1, causing the algorithm to zigzag inefficiently and requiring a very small learning rate to avoid divergence.

Feature scaling transforms all features to a similar range, making the cost surface more circular and allowing gradient descent to converge much faster.

Note: OLS via the Normal Equation does not require feature scaling, because it solves for the parameters in closed form rather than iteratively. Scaling is primarily important for gradient-based methods.
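The effect is easy to reproduce with plain batch gradient descent. The data below is made up (four houses, ranges matching the example above), and the learning rates are illustrative: a step size that is harmless on the bedroom axis overshoots wildly on the square-footage axis, while the same code converges on standardized features.

```python
import numpy as np

# Hypothetical data: size in sq ft (hundreds to thousands) vs. bedroom count.
X = np.array([[800., 3.], [1500., 2.], [2400., 5.], [4200., 4.]])
y = np.array([150., 250., 360., 580.])   # made-up prices in $1000s

def gradient_descent(X, y, lr, steps):
    """Batch gradient descent on mean squared error for linear regression."""
    Xb = np.c_[np.ones(len(X)), X]       # prepend an intercept column
    theta = np.zeros(Xb.shape[1])
    for _ in range(steps):
        grad = Xb.T @ (Xb @ theta - y) / len(y)
        theta = theta - lr * grad
    return theta

# Unscaled: the iteration blows up even at a small learning rate.
with np.errstate(over="ignore", invalid="ignore"):
    theta_raw = gradient_descent(X, y, lr=1e-3, steps=100)
print(np.isfinite(theta_raw).all())      # False: diverged to inf/nan

# Standardized: the same algorithm converges with an ordinary step size.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
theta_std = gradient_descent(X_std, y, lr=0.1, steps=500)
print(np.isfinite(theta_std).all())      # True
```

After scaling, the converged parameters match the least-squares solution on the scaled features to high precision.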

Method 1: Min-max normalization

Rescales each feature to the range [0, 1]:

x_j' = \frac{x_j - \min(x_j)}{\max(x_j) - \min(x_j)}

After scaling, the minimum value becomes 0 and the maximum becomes 1. Simple and interpretable, but sensitive to outliers — a single extreme value compresses all others into a narrow band.
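Both the formula and the outlier sensitivity can be sketched in a few lines (feature values are made up):

```python
import numpy as np

x = np.array([500., 1200., 3000., 5000.])          # hypothetical sizes in sq ft
x_minmax = (x - x.min()) / (x.max() - x.min())
print(x_minmax)                                    # min maps to 0, max to 1

# A single extreme value compresses the other points toward 0:
x_outlier = np.array([500., 1200., 3000., 50000.])
print((x_outlier - x_outlier.min()) / (x_outlier.max() - x_outlier.min()))
```

With the 50000 outlier present, the three ordinary values all land below about 0.05 of the range.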

Method 2: Standardization (Z-score normalization)

Rescales each feature to have mean 0 and standard deviation 1:

x_j' = \frac{x_j - \mu_j}{\sigma_j}

where \mu_j is the mean and \sigma_j is the standard deviation of feature j across the training set.

After standardization, most values fall in the range [-3, 3]. This is the most common choice in practice because it handles outliers better and works well with regularization.
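A quick check of the formula (same made-up sizes as before; note that numpy's `std` defaults to the population formula used here):

```python
import numpy as np

x = np.array([500., 1200., 3000., 5000.])   # hypothetical sizes in sq ft
x_std = (x - x.mean()) / x.std()            # subtract mean, divide by std
print(x_std)                                # mean ≈ 0, std ≈ 1 after scaling
```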

Worked example

Suppose x_1 (size) has values: 1000, 1500, 2000, 2500.

\mu_1 = 1750, \qquad \sigma_1 = \sqrt{\frac{(-750)^2+(-250)^2+(250)^2+(750)^2}{4}} = 559.0

Standardized values:

x_1' \in \left\{ \frac{1000-1750}{559}, \frac{1500-1750}{559}, \frac{2000-1750}{559}, \frac{2500-1750}{559} \right\} = \{-1.34,\; -0.45,\; 0.45,\; 1.34\}
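The arithmetic above can be verified directly:

```python
import numpy as np

x1 = np.array([1000., 1500., 2000., 2500.])
mu, sigma = x1.mean(), x1.std()        # np.std defaults to the population formula
print(mu, round(sigma, 1))             # 1750.0 559.0
print(((x1 - mu) / sigma).round(2))    # [-1.34 -0.45  0.45  1.34]
```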

The golden rule: fit on training data only

Always compute \mu_j and \sigma_j (or min/max) using the training set alone. Then apply the same transformation to the validation and test sets using the training statistics.

If you compute statistics on the full dataset (including test data), you leak information about the test set into training — a form of data leakage that gives over-optimistic evaluation results.
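A minimal sketch of the fit-on-train / transform-everything pattern, with made-up values (scikit-learn's StandardScaler exposes the same split as its fit and transform methods):

```python
import numpy as np

X_train = np.array([[1000.], [1500.], [2000.], [2500.]])
X_test = np.array([[3000.]])                     # unseen data

# "Fit": compute statistics from the training set only.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# "Transform": apply the *training* statistics to every split.
X_train_scaled = (X_train - mu) / sigma
X_test_scaled = (X_test - mu) / sigma
print(X_test_scaled.ravel())   # ≈ [2.24], using train stats, not test stats
```

The test value is allowed to land outside the training range; what matters is that its scaling never feeds information back into training.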

How scaling affects the parameters

After scaling, the learned parameters \theta_j correspond to the scaled features, not the original ones. If you need to interpret or deploy the model in original units, transform the parameters back:

\theta_j^{\text{original}} = \frac{\theta_j^{\text{scaled}}}{\sigma_j}

\theta_0^{\text{original}} = \theta_0^{\text{scaled}} - \sum_{j=1}^{n} \theta_j^{\text{scaled}} \frac{\mu_j}{\sigma_j}

Most ML libraries handle this automatically if you use their built-in scalers.
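The two formulas can be verified numerically on synthetic data (the intercept 50 and coefficients 0.1 and 20 are arbitrary choices for this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform([500, 1], [5000, 8], size=(50, 2))   # size, bedrooms
y = 50 + 0.1 * X[:, 0] + 20 * X[:, 1]                # noiseless linear target

mu, sigma = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - mu) / sigma

# Fit ordinary least squares on the scaled features.
Xb = np.c_[np.ones(len(X_scaled)), X_scaled]
theta = np.linalg.lstsq(Xb, y, rcond=None)[0]        # [theta_0, theta_1, theta_2]

# Map the parameters back to original units with the formulas above.
theta_orig = theta[1:] / sigma
theta0_orig = theta[0] - np.sum(theta[1:] * mu / sigma)
print(theta0_orig, theta_orig)   # recovers intercept 50 and coefficients [0.1, 20]
```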

Practical checklist

  1. Fit the scaler on the training set.
  2. Transform training, validation, and test sets with training statistics.
  3. Train the model on scaled features.
  4. When making predictions on new data, apply the same scaling before passing data to the model.
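The checklist above can be condensed into a small fit/transform helper (a sketch with made-up data, not a substitute for a library scaler):

```python
import numpy as np

class Standardizer:
    """Minimal standardization scaler following the checklist (illustrative)."""
    def fit(self, X):
        self.mu_ = X.mean(axis=0)      # per-feature training mean
        self.sigma_ = X.std(axis=0)    # per-feature training std
        return self

    def transform(self, X):
        return (X - self.mu_) / self.sigma_

X_train = np.array([[1000., 2.], [1500., 3.], [2000., 4.], [2500., 5.]])
X_new = np.array([[1800., 3.]])                  # new data at prediction time

scaler = Standardizer().fit(X_train)             # 1. fit on the training set
X_train_scaled = scaler.transform(X_train)       # 2-3. train on scaled features
X_new_scaled = scaler.transform(X_new)           # 4. same scaling for new data
```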

Summary

| Method                    | Output range     | Best for                       |
|---------------------------|------------------|--------------------------------|
| Min-max normalization     | [0, 1]           | Bounded features, no outliers  |
| Standardization (Z-score) | \approx [-3, 3]  | General use, outliers present  |
| No scaling                | Original         | OLS only                       |