Multiclass Classification: One-vs-Rest and Softmax

Beyond binary classification

So far, logistic regression has been framed as a binary classifier: $y \in \{0, 1\}$. Many real problems have more than two classes — handwritten digit recognition ($y \in \{0, \ldots, 9\}$), news categorization (sports, politics, technology, …), or medical diagnosis with multiple conditions.

Two main strategies extend logistic regression to $K > 2$ classes: One-vs-Rest and softmax regression.

Strategy 1: One-vs-Rest (OvR)

One-vs-Rest (also called One-vs-All, OvA) trains $K$ separate binary classifiers, one per class. Classifier $k$ is trained to distinguish class $k$ from all other classes combined:

  • Positive examples: all training examples with label $k$.
  • Negative examples: all training examples with any other label.

At prediction time, all $K$ classifiers are run and the class whose classifier outputs the highest probability wins:

$$\hat{y} = \arg\max_k \hat{p}_k(\mathbf{x})$$
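The two training bullets and the argmax rule above can be sketched end-to-end in NumPy. This is a minimal illustration, not taken from the text: the helper names (`fit_binary`, `fit_ovr`, `predict_ovr`) and the toy three-cluster dataset are invented for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_binary(X, y, lr=0.1, n_iter=2000):
    """Plain batch gradient descent for binary logistic regression."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ theta) - y) / len(y)
        theta -= lr * grad
    return theta

def fit_ovr(X, y, n_classes):
    """Train one binary classifier per class: class k vs. all the rest."""
    return np.stack([fit_binary(X, (y == k).astype(float))
                     for k in range(n_classes)])

def predict_ovr(thetas, X):
    """Run all K classifiers and pick the highest-scoring class."""
    probs = sigmoid(X @ thetas.T)   # shape (m, K); note: not calibrated
    return probs.argmax(axis=1)

# Toy data: three well-separated 1-D clusters, plus a bias column.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(c, 0.3, 30) for c in (-3, 0, 3)])
X = np.column_stack([np.ones_like(x), x])
y = np.repeat([0, 1, 2], 30)

thetas = fit_ovr(X, y, n_classes=3)
print((predict_ovr(thetas, X) == y).mean())  # training accuracy, close to 1.0 on this easy data
```

Note that each of the three calls to `fit_binary` is independent, which is exactly why OvR parallelizes so easily.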

Advantages of OvR

  • Simple — reuses the binary logistic regression you already know.
  • Parallelizable — each classifier trains independently.
  • Works with any binary classifier, not just logistic regression.

Disadvantages of OvR

  • The $K$ classifiers are trained on imbalanced data (one class vs. all others).
  • The output probabilities from different classifiers are not calibrated to sum to 1.
  • With very large $K$, training $K$ classifiers is expensive.

Strategy 2: Softmax regression (Multinomial logistic regression)

Softmax regression generalizes logistic regression directly to $K$ classes in a single unified model. It learns $K$ separate parameter vectors $\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_K$, one per class, and computes a score for each:

$$z_k = \boldsymbol{\theta}_k^T \mathbf{x}, \qquad k = 1, \ldots, K$$

The softmax function converts these $K$ scores into a valid probability distribution (non-negative, summing to 1):

$$\hat{p}_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$

The predicted class is the one with the highest probability:

$$\hat{y} = \arg\max_k \hat{p}_k$$
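As a concrete illustration of the formula above, here is softmax in NumPy, written in the standard numerically stable way: subtracting the maximum score before exponentiating leaves the result unchanged (it cancels between numerator and denominator) but prevents overflow for large scores.

```python
import numpy as np

def softmax(z):
    """Convert a vector of scores into probabilities that sum to 1."""
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, probs.sum())  # non-negative entries that sum to 1
print(probs.argmax())      # predicted class: 0 (the highest score)
```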

Connection to the sigmoid

When $K = 2$, softmax reduces exactly to the binary sigmoid. Setting $\boldsymbol{\theta}_2 = \mathbf{0}$ (fixing one class as the reference):

$$\hat{p}_1 = \frac{e^{z_1}}{e^{z_1} + e^{0}} = \frac{1}{1 + e^{-z_1}} = \sigma(z_1)$$
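The reduction is easy to verify numerically. The snippet below is an illustrative check with an arbitrary score $z_1 = 1.7$: a two-class softmax with the second score fixed at 0 matches the sigmoid.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z1 = 1.7
p = softmax(np.array([z1, 0.0]))      # second class fixed as the reference
print(np.isclose(p[0], sigmoid(z1)))  # True
```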

Cost function for softmax

The log-loss generalizes to the categorical cross-entropy:

$$J = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \mathbf{1}[y^{(i)} = k] \log \hat{p}_k^{(i)}$$

where $\mathbf{1}[y^{(i)} = k]$ is 1 if example $i$ belongs to class $k$ and 0 otherwise. Like binary cross-entropy, this cost is convex and can be minimized with gradient descent.
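The double sum looks heavier than it is: for each example, only the true class's term survives, so the cost is just the average negative log-probability assigned to the correct label. A small NumPy sketch (the function name `cross_entropy` is illustrative):

```python
import numpy as np

def cross_entropy(P, y):
    """Categorical cross-entropy.

    P: (m, K) array of predicted probabilities.
    y: (m,) array of integer class labels.
    """
    m = len(y)
    # Fancy indexing picks out p-hat of the true class for each example.
    return -np.log(P[np.arange(m), y]).mean()

P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
y = np.array([0, 1])
print(cross_entropy(P, y))  # -(log 0.7 + log 0.8) / 2, about 0.290
```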

Advantages of softmax

  • Produces a proper probability distribution across all classes.
  • Trained as a single model — more coherent than $K$ independent classifiers.
  • Directly models the relationship between classes.

Choosing between OvR and softmax

|                     | One-vs-Rest                    | Softmax                          |
|---------------------|--------------------------------|----------------------------------|
| Number of models    | $K$                            | 1                                |
| Output              | $K$ independent probabilities  | Proper probability distribution  |
| Training            | Parallelizable                 | Single joint optimization        |
| Class relationships | Ignored                        | Modelled jointly                 |
| Best for            | Many classes, simple setup     | Mutually exclusive classes       |

In scikit-learn, older versions of LogisticRegression used OvR by default for multiclass problems, with multi_class='multinomial' selecting softmax. Since version 0.22 the default multi_class='auto' chooses the softmax (multinomial) objective whenever the solver supports it, and the multi_class parameter is deprecated in recent releases (1.5+), with OvR available via the OneVsRestClassifier wrapper.
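A quick check of this behaviour on the iris dataset (assuming a reasonably recent scikit-learn; exact defaults vary across versions):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# With the default lbfgs solver, recent scikit-learn fits the
# softmax (multinomial) objective for a 3-class target.
clf = LogisticRegression(max_iter=1000).fit(X, y)

proba = clf.predict_proba(X)
print(proba.shape)     # (150, 3): one probability per class per example
print(proba[0].sum())  # each row is a proper distribution summing to 1
```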

A note on mutual exclusivity

Softmax assumes the classes are mutually exclusive — each example belongs to exactly one class. If an example can belong to multiple classes simultaneously (e.g. a news article tagged as both "politics" and "economy"), use multiple independent binary classifiers instead.
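In that multi-label setting, each tag gets its own sigmoid and is thresholded independently instead of competing in an argmax. A toy sketch with invented scores for three tags:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative scores from three independent binary classifiers for the
# tags (politics, economy, sports) -- the numbers are made up.
scores = np.array([2.3, 0.8, -1.5])
probs = sigmoid(scores)
tags = probs > 0.5   # threshold each tag on its own
print(tags)          # [ True  True False]: several tags can fire at once
```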