Feature Scaling and Normalization

Why scale features?

When features have very different ranges, gradient descent can behave poorly. Imagine predicting house prices with two features:

  • x_1: size in square feet, ranging from 500 to 5000.
  • x_2: number of bedrooms, ranging from 1 to 8.

The cost surface in (\theta_1, \theta_2) space becomes a very elongated ellipse. Gradient descent steps that are appropriately sized for \theta_2 are far too large for \theta_1, causing the algorithm to zigzag inefficiently and requiring a very small learning rate to avoid divergence.

Feature scaling transforms all features to a similar range, making the cost surface more circular and allowing gradient descent to converge much faster.

Note: OLS via the Normal Equation does not require feature scaling, because it solves for the parameters in closed form rather than iteratively. Scaling is primarily important for gradient-based methods.
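The effect is easy to reproduce with plain batch gradient descent. The data below is made up (four houses, ranges matching the example above), and the learning rates are illustrative: a step size that is harmless on the bedroom axis overshoots wildly on the square-footage axis, while the same code converges on standardized features.

```python
import numpy as np

# Hypothetical data: size in sq ft (hundreds to thousands) vs. bedroom count.
X = np.array([[800., 3.], [1500., 2.], [2400., 5.], [4200., 4.]])
y = np.array([150., 250., 360., 580.])   # made-up prices in $1000s

def gradient_descent(X, y, lr, steps):
    """Batch gradient descent on mean squared error for linear regression."""
    Xb = np.c_[np.ones(len(X)), X]       # prepend an intercept column
    theta = np.zeros(Xb.shape[1])
    for _ in range(steps):
        grad = Xb.T @ (Xb @ theta - y) / len(y)
        theta = theta - lr * grad
    return theta

# Unscaled: the iteration blows up even at a small learning rate.
with np.errstate(over="ignore", invalid="ignore"):
    theta_raw = gradient_descent(X, y, lr=1e-3, steps=100)
print(np.isfinite(theta_raw).all())      # False: diverged to inf/nan

# Standardized: the same algorithm converges with an ordinary step size.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
theta_std = gradient_descent(X_std, y, lr=0.1, steps=500)
print(np.isfinite(theta_std).all())      # True
```

After scaling, the converged parameters match the least-squares solution on the scaled features to high precision.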

Method 1: Min-max normalization

Rescales each feature to the range [0, 1]:

x_j' = \frac{x_j - \min(x_j)}{\max(x_j) - \min(x_j)}

After scaling, the minimum value becomes 0 and the maximum becomes 1. Simple and interpretable, but sensitive to outliers — a single extreme value compresses all others into a narrow band.
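Both the formula and the outlier sensitivity can be sketched in a few lines (feature values are made up):

```python
import numpy as np

x = np.array([500., 1200., 3000., 5000.])          # hypothetical sizes in sq ft
x_minmax = (x - x.min()) / (x.max() - x.min())
print(x_minmax)                                    # min maps to 0, max to 1

# A single extreme value compresses the other points toward 0:
x_outlier = np.array([500., 1200., 3000., 50000.])
print((x_outlier - x_outlier.min()) / (x_outlier.max() - x_outlier.min()))
```

With the 50000 outlier present, the three ordinary values all land below about 0.05 of the range.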

Method 2: Standardization (Z-score normalization)

Rescales each feature to have mean 0 and standard deviation 1:

x_j' = \frac{x_j - \mu_j}{\sigma_j}

where \mu_j is the mean and \sigma_j is the standard deviation of feature j across the training set.

After standardization, most values fall in the range [-3, 3]. This is the most common choice in practice because it handles outliers better and works well with regularization.
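A quick check of the formula (same made-up sizes as before; note that numpy's `std` defaults to the population formula used here):

```python
import numpy as np

x = np.array([500., 1200., 3000., 5000.])   # hypothetical sizes in sq ft
x_std = (x - x.mean()) / x.std()            # subtract mean, divide by std
print(x_std)                                # mean ≈ 0, std ≈ 1 after scaling
```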

Worked example

Suppose x_1 (size) has values: 1000, 1500, 2000, 2500.

\mu_1 = 1750, \qquad \sigma_1 = \sqrt{\frac{(-750)^2+(-250)^2+(250)^2+(750)^2}{4}} = 559.0

Standardized values:

x_1' \in \left\{ \frac{1000-1750}{559}, \frac{1500-1750}{559}, \frac{2000-1750}{559}, \frac{2500-1750}{559} \right\} = \{-1.34,\; -0.45,\; 0.45,\; 1.34\}
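The arithmetic above can be verified directly:

```python
import numpy as np

x1 = np.array([1000., 1500., 2000., 2500.])
mu, sigma = x1.mean(), x1.std()        # np.std defaults to the population formula
print(mu, round(sigma, 1))             # 1750.0 559.0
print(((x1 - mu) / sigma).round(2))    # [-1.34 -0.45  0.45  1.34]
```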

The golden rule: fit on training data only

Always compute \mu_j and \sigma_j (or min/max) using the training set alone. Then apply the same transformation to the validation and test sets using the training statistics.

If you compute statistics on the full dataset (including test data), you leak information about the test set into training — a form of data leakage that gives over-optimistic evaluation results.
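A minimal sketch of the fit-on-train / transform-everything pattern, with made-up values (scikit-learn's StandardScaler exposes the same split as its fit and transform methods):

```python
import numpy as np

X_train = np.array([[1000.], [1500.], [2000.], [2500.]])
X_test = np.array([[3000.]])                     # unseen data

# "Fit": compute statistics from the training set only.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# "Transform": apply the *training* statistics to every split.
X_train_scaled = (X_train - mu) / sigma
X_test_scaled = (X_test - mu) / sigma
print(X_test_scaled.ravel())   # ≈ [2.24], using train stats, not test stats
```

The test value is allowed to land outside the training range; what matters is that its scaling never feeds information back into training.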

How scaling affects the parameters

After scaling, the learned parameters \theta_j correspond to the scaled features, not the original ones. If you need to interpret or deploy the model in original units, transform the parameters back:

\theta_j^{\text{original}} = \frac{\theta_j^{\text{scaled}}}{\sigma_j}

\theta_0^{\text{original}} = \theta_0^{\text{scaled}} - \sum_{j=1}^{n} \theta_j^{\text{scaled}} \frac{\mu_j}{\sigma_j}

Most ML libraries handle this automatically if you use their built-in scalers.
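The two formulas can be verified numerically on synthetic data (the intercept 50 and coefficients 0.1 and 20 are arbitrary choices for this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform([500, 1], [5000, 8], size=(50, 2))   # size, bedrooms
y = 50 + 0.1 * X[:, 0] + 20 * X[:, 1]                # noiseless linear target

mu, sigma = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - mu) / sigma

# Fit ordinary least squares on the scaled features.
Xb = np.c_[np.ones(len(X_scaled)), X_scaled]
theta = np.linalg.lstsq(Xb, y, rcond=None)[0]        # [theta_0, theta_1, theta_2]

# Map the parameters back to original units with the formulas above.
theta_orig = theta[1:] / sigma
theta0_orig = theta[0] - np.sum(theta[1:] * mu / sigma)
print(theta0_orig, theta_orig)   # recovers intercept 50 and coefficients [0.1, 20]
```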

Practical checklist

  1. Fit the scaler on the training set.
  2. Transform training, validation, and test sets with training statistics.
  3. Train the model on scaled features.
  4. When making predictions on new data, apply the same scaling before passing data to the model.
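The checklist above can be condensed into a small fit/transform helper (a sketch with made-up data, not a substitute for a library scaler):

```python
import numpy as np

class Standardizer:
    """Minimal standardization scaler following the checklist (illustrative)."""
    def fit(self, X):
        self.mu_ = X.mean(axis=0)      # per-feature training mean
        self.sigma_ = X.std(axis=0)    # per-feature training std
        return self

    def transform(self, X):
        return (X - self.mu_) / self.sigma_

X_train = np.array([[1000., 2.], [1500., 3.], [2000., 4.], [2500., 5.]])
X_new = np.array([[1800., 3.]])                  # new data at prediction time

scaler = Standardizer().fit(X_train)             # 1. fit on the training set
X_train_scaled = scaler.transform(X_train)       # 2-3. train on scaled features
X_new_scaled = scaler.transform(X_new)           # 4. same scaling for new data
```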

Summary

| Method                    | Output range     | Best for                       |
|---------------------------|------------------|--------------------------------|
| Min-max normalization     | [0, 1]           | Bounded features, no outliers  |
| Standardization (Z-score) | \approx [-3, 3]  | General use, outliers present  |
| No scaling                | Original         | OLS only                       |