# Feature Scaling and Normalization

## Why scale features?
When features have very different ranges, gradient descent can behave poorly. Imagine predicting house prices with two features:
- $x_1$: size in square feet, ranging from 500 to 5000.
- $x_2$: number of bedrooms, ranging from 1 to 8.

The cost surface in $(\theta_1, \theta_2)$ space becomes a very elongated ellipse. Gradient descent steps that are appropriately sized for the $\theta_2$ (bedrooms) direction are far too large for the $\theta_1$ (size) direction, causing the algorithm to zigzag inefficiently and requiring a very small learning rate to avoid divergence.
Feature scaling transforms all features to a similar range, making the cost surface more circular and allowing gradient descent to converge much faster.
Note: OLS (the Normal Equation) does not require feature scaling, because it solves the equations algebraically. Scaling is primarily important for gradient-based methods.
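The zigzag effect can be demonstrated numerically. Below is a minimal NumPy sketch, assuming synthetic house data and illustrative learning rates (the data, learning rates, and function name are my own choices, not from any library):

```python
import numpy as np

# Hypothetical data: size in square feet (500-5000) and bedrooms (1-8).
rng = np.random.default_rng(0)
size = rng.uniform(500, 5000, 100)
beds = rng.integers(1, 9, 100).astype(float)
X = np.column_stack([size, beds])
y = 100.0 * size + 5000.0 * beds + rng.normal(0, 1000, 100)

def gradient_descent(X, y, lr, steps=500):
    """Plain batch gradient descent on mean squared error; returns final MSE."""
    Xb = np.column_stack([np.ones(len(X)), X])
    theta = np.zeros(Xb.shape[1])
    for _ in range(steps):
        grad = Xb.T @ (Xb @ theta - y) / len(y)
        theta -= lr * grad
    return np.mean((Xb @ theta - y) ** 2)

# Unscaled: a learning rate small enough to keep the size direction stable
# makes progress in the bedrooms and intercept directions extremely slow.
cost_raw = gradient_descent(X, y, lr=1e-8)

# Standardized: the same number of steps gets much closer to the optimum.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
cost_scaled = gradient_descent(Xs, y, lr=0.1)
print(cost_scaled < cost_raw)
```

With scaling, a much larger learning rate is stable, so the same step budget reaches a far lower cost.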
## Method 1: Min-max normalization
Rescales each feature to the range $[0, 1]$:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
After scaling, the minimum value becomes 0 and the maximum becomes 1. Simple and interpretable, but sensitive to outliers — a single extreme value compresses all others into a narrow band.
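A minimal NumPy sketch of min-max normalization (the helper name and sample values are illustrative):

```python
import numpy as np

def min_max_scale(X_train, X_new):
    """Min-max scale X_new using the training set's min and max."""
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    return (X_new - lo) / (hi - lo)

# Illustrative sizes in square feet: 500 maps to 0, 5000 maps to 1.
sizes = np.array([[500.0], [2000.0], [5000.0]])
scaled = min_max_scale(sizes, sizes)
print(scaled.ravel())  # approximately [0.0, 0.333, 1.0]
```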
## Method 2: Standardization (Z-score normalization)
Rescales each feature to have mean 0 and standard deviation 1:

$$x' = \frac{x - \mu}{\sigma}$$

where $\mu$ is the mean and $\sigma$ is the standard deviation of the feature across the training set.
After standardization, most values fall roughly in the range $[-3, 3]$. This is the most common choice in practice because it handles outliers better and works well with regularization.
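A matching NumPy sketch of standardization (again with an illustrative helper name and synthetic data):

```python
import numpy as np

def standardize(X_train, X_new):
    """Z-score X_new using the training set's mean and standard deviation."""
    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
    return (X_new - mu) / sigma

rng = np.random.default_rng(0)
X = rng.uniform(500, 5000, size=(100, 1))
Z = standardize(X, X)
print(Z.mean(), Z.std())  # on the training data itself: 0.0 and 1.0
```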
## Worked example
Suppose $x_1$ (size) has values: 1000, 1500, 2000, 2500.

The mean is $\mu = 1750$ and the (population) standard deviation is $\sigma \approx 559$. Standardized values:

$$z = \frac{x - 1750}{559} \approx (-1.34,\ -0.45,\ 0.45,\ 1.34)$$
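The standardized values for this example can be checked with a few lines of NumPy (using the population standard deviation):

```python
import numpy as np

sizes = np.array([1000.0, 1500.0, 2000.0, 2500.0])
mu = sizes.mean()      # 1750.0
sigma = sizes.std()    # population std, approximately 559.02
z = (sizes - mu) / sigma
print(np.round(z, 2))  # approximately [-1.34, -0.45, 0.45, 1.34]
```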
## The golden rule: fit on training data only
Always compute $\mu$ and $\sigma$ (or min/max) using the training set alone. Then apply the same transformation to the validation and test sets using the training statistics.
If you compute statistics on the full dataset (including test data), you leak information about the test set into training — a form of data leakage that gives over-optimistic evaluation results.
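With scikit-learn, the fit/transform split makes this rule mechanical; a sketch with synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(500, 5000, size=(100, 1))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # statistics computed here...
X_test_s = scaler.transform(X_test)        # ...and only reused here
```

Calling `fit_transform` on the test set instead of `transform` would recompute the statistics from test data, which is exactly the leakage described above.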
## How scaling affects the parameters
After scaling, the learned parameters correspond to the scaled features, not the original ones. If you need to interpret or deploy the model in original units, transform the parameters back. For standardization:

$$\theta_j = \frac{\theta'_j}{\sigma_j}, \qquad \theta_0 = \theta'_0 - \sum_j \frac{\theta'_j \mu_j}{\sigma_j}$$

where $\theta'$ are the parameters learned on the standardized features.
Most ML libraries handle this automatically if you use their built-in scalers.
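A quick numerical check of the back-transformation for standardized features, using ordinary least squares on synthetic data (the data and helper name are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(500, 5000, size=(50, 1))
y = 3.0 + 0.05 * X[:, 0] + rng.normal(0, 1.0, 50)

mu, sigma = X.mean(axis=0), X.std(axis=0)
Xs = (X - mu) / sigma

def fit_ols(X, y):
    """Least-squares fit with an explicit intercept column."""
    Xb = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(Xb, y, rcond=None)[0]

theta_scaled = fit_ols(Xs, y)  # parameters for standardized features
theta_raw = fit_ols(X, y)      # parameters for original features

# Back-transform: theta_j = theta'_j / sigma_j,
#                 theta_0 = theta'_0 - sum_j theta'_j * mu_j / sigma_j
slope = theta_scaled[1:] / sigma
intercept = theta_scaled[0] - np.sum(theta_scaled[1:] * mu / sigma)
recovered = np.concatenate([[intercept], slope])
print(np.allclose(recovered, theta_raw))  # True
```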
## Practical checklist
- Fit the scaler on the training set.
- Transform training, validation, and test sets with training statistics.
- Train the model on scaled features.
- When making predictions on new data, apply the same scaling before passing data to the model.
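The checklist above maps directly onto a scikit-learn pipeline, which stores the training statistics and reapplies them automatically at prediction time; a sketch with synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X_train = rng.uniform(500, 5000, size=(80, 1))
y_train = 3.0 + 0.05 * X_train[:, 0] + rng.normal(0, 1.0, 80)

# fit() computes the scaler's statistics from the training data;
# predict() reuses those same statistics on any new data.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)
preds = model.predict(np.array([[1200.0], [3600.0]]))
```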
## Summary
| Method | Output range | Best for |
|---|---|---|
| Min-max normalization | $[0, 1]$ | Bounded features, no outliers |
| Standardization (Z-score) | Mean 0, std 1 (unbounded) | General use, outliers present |
| No scaling | Original | OLS only |