PCA
The problem PCA solves
High-dimensional data is hard to work with — it is slow to train on, hard to visualize, and often noisy. Many features in a real dataset measure overlapping things. A survey with 50 questions about political views likely has far fewer than 50 truly independent dimensions — the questions cluster into a handful of underlying themes.
Principal Component Analysis (PCA) finds the directions in which the data varies the most and projects the data onto those directions, producing a compact representation that retains as much information as possible.
The core idea: directions of maximum variance
Variance is information. A feature that barely changes across examples tells you almost nothing to distinguish them. PCA finds the directions — called principal components — along which the data spreads the most, and uses those as the new coordinate axes.
The first principal component (PC1) is the direction of maximum variance. The second (PC2) is the direction of maximum remaining variance that is perpendicular to PC1. Each subsequent component is perpendicular to all previous ones and captures the most remaining variance.
The procedure
- Center the data. Subtract the mean of each feature. PCA requires zero-mean data.
- Standardize (usually). Divide by the standard deviation of each feature so no feature dominates due to scale.
- Compute the covariance matrix C = XᵀX / (n − 1) of the centered data matrix X.
- Find the eigenvectors and eigenvalues of C. Sort by eigenvalue, largest first.
- Choose the top k eigenvectors as the new axes.
- Project the data onto those axes: Z = XW, where W contains the top k eigenvectors as columns.
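The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation (the function name `pca` and the toy data are my own choices):

```python
import numpy as np

def pca(X, k):
    """Project X (n samples x d features) onto its top-k principal components."""
    # Steps 1-2: center each feature, then standardize it.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 3: covariance matrix of the standardized data.
    C = (X.T @ X) / (X.shape[0] - 1)
    # Step 4: eigendecomposition; eigh returns eigenvalues in ascending order,
    # so reverse to get the largest-variance directions first.
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]
    # Step 5: keep the top-k eigenvectors as columns of W.
    W = eigvecs[:, order[:k]]
    # Step 6: project, Z = XW.
    return X @ W

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # 100 examples, 5 features
Z = pca(X, 2)
print(Z.shape)                  # (100, 2)
```

Each row of `Z` is one training example expressed in the new two-dimensional coordinate system.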
In practice, PCA is computed via Singular Value Decomposition (SVD) directly on the data matrix rather than explicitly forming the covariance matrix — it is numerically more stable and efficient.
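The equivalence between the two routes is easy to check numerically: the right singular vectors of the centered data are the principal components, and the singular values s relate to the covariance eigenvalues via λᵢ = sᵢ² / (n − 1). A quick sanity check on random data (the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
Xc = X - X.mean(axis=0)                  # centered data

# SVD route: no covariance matrix is ever formed.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
eig_from_svd = s**2 / (Xc.shape[0] - 1)  # lambda_i = s_i^2 / (n - 1)

# Covariance route, for comparison.
C = (Xc.T @ Xc) / (Xc.shape[0] - 1)
eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]

print(np.allclose(eig_from_svd, eigvals))  # True
```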
What you get
After PCA, each training example is represented by k numbers (its scores along the top k principal components) instead of the original d features. The new features are:
- Uncorrelated — the principal components are orthogonal by construction.
- Ordered by variance — PC1 explains more variance than PC2, which explains more than PC3, and so on.
- Linear combinations of the original features — each component is a weighted sum of the original variables.
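Both of the first two properties can be verified directly on the projected data — the covariance matrix of the scores should be (numerically) diagonal, with its diagonal entries in descending order. A small check, assuming synthetic correlated data:

```python
import numpy as np

rng = np.random.default_rng(1)
# Mix independent features to create correlated ones.
X = rng.normal(size=(300, 3)) @ rng.normal(size=(3, 3))
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt.T                         # scores along each principal component

cov_Z = np.cov(Z, rowvar=False)
off_diag = cov_Z - np.diag(np.diag(cov_Z))
print(np.allclose(off_diag, 0))       # uncorrelated: True
variances = np.diag(cov_Z)
print(np.all(np.diff(variances) <= 1e-9))  # PC1 >= PC2 >= PC3: True
```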
A simple example
Imagine data points scattered in 2D where height and weight are both measured. These two features are strongly correlated — taller people tend to weigh more. PCA would find that most variance lies along a diagonal axis (a blend of height and weight, roughly "overall body size"), with little variance perpendicular to it. Projecting onto just that one axis captures most of the information in both original features.
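This scenario is easy to simulate. With correlated height and weight (the specific means, slopes, and noise levels below are made up for illustration), the first component's share of the total variance comes out large:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
height = rng.normal(170, 10, n)                    # cm
weight = 0.9 * height - 85 + rng.normal(0, 5, n)   # kg, correlated with height
X = np.column_stack([height, weight])
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)    # fraction of variance per component
print(explained)                   # PC1 captures the bulk of the variance
```

PC1 here is the diagonal "overall body size" direction; the small remainder along PC2 is roughly the deviation from that trend.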
What PCA does not do
PCA is unsupervised — it ignores labels. It finds directions of maximum variance, not directions that are most predictive of the target. The most variable directions and the most predictive directions can be completely different. If the signal you care about lies in a low-variance dimension, PCA will discard it.
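A contrived example makes the failure mode concrete. Suppose one feature is high-variance noise and another is a low-variance feature that fully determines the label (a synthetic setup of my own construction, with standardization deliberately skipped so the scale difference survives):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
noise = rng.normal(0, 10, n)      # high variance, unrelated to the label
signal = rng.normal(0, 0.1, n)    # low variance, determines the label
y = (signal > 0).astype(int)      # label depends only on `signal`
X = np.column_stack([noise, signal])
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]
print(np.abs(pc1))   # ~[1, 0]: PC1 is almost entirely the noise feature
```

Keeping only PC1 would throw away nearly all of the predictive signal, which is why PCA should be validated against the downstream task rather than applied blindly.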