K-means
The idea
K-means partitions a dataset into groups by finding cluster centers (centroids) such that every point is assigned to its nearest centroid, and the centroids minimize the total distance from points to their assigned center.
It is the most widely used clustering algorithm — fast, simple, and effective when clusters are roughly spherical and similarly sized.
The algorithm
K-means alternates between two steps until convergence:
Step 1 — Assign: assign each data point to the nearest centroid.
Step 2 — Update: recompute each centroid as the mean of all points assigned to it.
Repeat until assignments stop changing. The algorithm is guaranteed to converge — but not necessarily to the global optimum.
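The two alternating steps can be sketched in a few lines of NumPy. This is an illustration only (plain random initialization, no handling of empty clusters), not a replacement for a library implementation:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means sketch: alternate assign and update steps."""
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen data points (random init, not K-means++).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign: index of the nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: each centroid becomes the mean of its assigned points
        # (assumes no cluster ends up empty, which real code must handle).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: converged
        centroids = new_centroids
    return centroids, labels
```

On two well-separated blobs this converges in a handful of iterations; the loop structure is the whole algorithm.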
Initialization matters
The final result depends heavily on the starting centroids. Poor initialization can trap the algorithm in a bad local minimum.
Random initialization: pick random data points as starting centroids. Fast but unreliable — high variance across runs.
K-means++: the standard approach. The first centroid is chosen randomly; each subsequent centroid is chosen from the remaining data points with probability proportional to its squared distance from the nearest already-chosen centroid. This spreads initial centroids apart, dramatically reducing the chance of poor solutions.
In scikit-learn, KMeans uses K-means++ by default (init='k-means++'). Always use it.
Even with K-means++, run the algorithm multiple times with different initializations (n_init=10) and keep the best result (lowest inertia).
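In scikit-learn both recommendations are single keyword arguments; the data below is synthetic, made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs.
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

# K-means++ initialization, 10 restarts; scikit-learn keeps the run
# with the lowest inertia automatically.
km = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0).fit(X)
print(km.inertia_)          # objective value of the best run
print(km.cluster_centers_)  # final centroids of the best run
```

`random_state` pins the initialization for reproducibility; drop it in production if you want independent restarts across runs.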
Inertia
The objective K-means minimizes is called inertia — the sum of squared distances from each point to its assigned centroid:

$$\text{inertia} = \sum_{i=1}^{n} \lVert x_i - \mu_{c_i} \rVert^2$$

where $\mu_{c_i}$ is the centroid of the cluster that point $x_i$ is assigned to.

Lower inertia means tighter, more compact clusters. Inertia always decreases as $k$ increases — a model with $k = n$ (every point its own cluster) has zero inertia but is useless. Choosing $k$ is covered in the next lesson.
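Both properties are easy to check numerically. The snippet below (synthetic data, made up for illustration) recomputes scikit-learn's reported `inertia_` from the formula and shows the monotone decrease in $k$:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # synthetic data for illustration

# The reported inertia matches the sum-of-squared-distances formula.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
manual = ((X - km.cluster_centers_[km.labels_]) ** 2).sum()
print(km.inertia_, manual)  # equal up to floating-point error

# Inertia can only go down as k grows.
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in (1, 2, 4, 8)]
print(inertias)
```

This is why raw inertia cannot pick $k$ by itself — it rewards ever-finer partitions.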
Practical requirements
- Feature scaling is essential. K-means uses Euclidean distance. Features with large numerical ranges dominate the distance calculation. Standardize all features before clustering.
- Assumes spherical, similarly sized clusters. K-means partitions space using Voronoi boundaries (equidistant from adjacent centroids). It cannot handle elongated, curved, or wildly different-sized clusters well.
- Sensitive to outliers. Outliers pull centroids toward them, distorting the cluster shape.
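The scaling requirement is one line in scikit-learn when you put a `StandardScaler` in front of `KMeans`. The data here is synthetic, constructed so one feature's range would dominate the distance without scaling:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Feature 0 lives in [0, 1]; feature 1 is in the thousands and would
# dominate Euclidean distance if fed to K-means unscaled.
X = np.column_stack([rng.random(100), rng.normal(5000, 1000, 100)])

# Standardize every feature to mean 0, variance 1, then cluster.
pipe = make_pipeline(StandardScaler(),
                     KMeans(n_clusters=3, n_init=10, random_state=0))
labels = pipe.fit_predict(X)
```

The pipeline also guarantees that any later `predict` call scales new points with the statistics learned at fit time.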
Complexity
Each iteration costs $O(n \cdot k \cdot d)$ — linear in the number of points $n$, clusters $k$, and features $d$. K-means scales well to large datasets, which is a key practical advantage over hierarchical methods.