Multiclass Classification: One-vs-Rest and Softmax

Beyond binary classification

So far, logistic regression has been framed as a binary classifier: $y \in \{0, 1\}$. Many real problems have more than two classes — handwritten digit recognition ($y \in \{0, \ldots, 9\}$), news categorization (sports, politics, technology, …), or medical diagnosis with multiple conditions.

Two main strategies extend logistic regression to $K > 2$ classes: One-vs-Rest and softmax regression.

Strategy 1: One-vs-Rest (OvR)

One-vs-Rest (also called One-vs-All, OvA) trains $K$ separate binary classifiers, one per class. Classifier $k$ is trained to distinguish class $k$ from all other classes combined:

  • Positive examples: all training examples with label $k$.
  • Negative examples: all training examples with any other label.

At prediction time, all $K$ classifiers are run and the class whose classifier outputs the highest probability wins:

$$\hat{y} = \arg\max_k \hat{p}_k(\mathbf{x})$$
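The two training bullets and the argmax rule above can be sketched end-to-end in NumPy. This is a minimal illustration, not taken from the text: the helper names (`fit_binary`, `fit_ovr`, `predict_ovr`) and the toy three-cluster dataset are invented for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_binary(X, y, lr=0.1, n_iter=2000):
    """Plain batch gradient descent for binary logistic regression."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ theta) - y) / len(y)
        theta -= lr * grad
    return theta

def fit_ovr(X, y, n_classes):
    """Train one binary classifier per class: class k vs. all the rest."""
    return np.stack([fit_binary(X, (y == k).astype(float))
                     for k in range(n_classes)])

def predict_ovr(thetas, X):
    """Run all K classifiers and pick the highest-scoring class."""
    probs = sigmoid(X @ thetas.T)   # shape (m, K); note: not calibrated
    return probs.argmax(axis=1)

# Toy data: three well-separated 1-D clusters, plus a bias column.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(c, 0.3, 30) for c in (-3, 0, 3)])
X = np.column_stack([np.ones_like(x), x])
y = np.repeat([0, 1, 2], 30)

thetas = fit_ovr(X, y, n_classes=3)
print((predict_ovr(thetas, X) == y).mean())  # training accuracy, close to 1.0 on this easy data
```

Note that each of the three calls to `fit_binary` is independent, which is exactly why OvR parallelizes so easily.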

Advantages of OvR

  • Simple — reuses the binary logistic regression you already know.
  • Parallelizable — each classifier trains independently.
  • Works with any binary classifier, not just logistic regression.

Disadvantages of OvR

  • The $K$ classifiers are trained on imbalanced data (one class vs. all others).
  • The output probabilities from different classifiers are not calibrated to sum to 1.
  • With very large $K$, training $K$ classifiers is expensive.

Strategy 2: Softmax regression (Multinomial logistic regression)

Softmax regression generalizes logistic regression directly to $K$ classes in a single unified model. It learns $K$ separate parameter vectors $\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_K$, one per class, and computes a score for each:

$$z_k = \boldsymbol{\theta}_k^T \mathbf{x}, \qquad k = 1, \ldots, K$$

The softmax function converts these $K$ scores into a valid probability distribution (non-negative, summing to 1):

$$\hat{p}_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$

The predicted class is the one with the highest probability:

$$\hat{y} = \arg\max_k \hat{p}_k$$
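As a concrete illustration of the formula above, here is softmax in NumPy, written in the standard numerically stable way: subtracting the maximum score before exponentiating leaves the result unchanged (it cancels between numerator and denominator) but prevents overflow for large scores.

```python
import numpy as np

def softmax(z):
    """Convert a vector of scores into probabilities that sum to 1."""
    z = z - z.max(axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, probs.sum())  # non-negative entries that sum to 1
print(probs.argmax())      # predicted class: 0 (the highest score)
```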

Connection to the sigmoid

When $K = 2$, softmax reduces exactly to the binary sigmoid. Setting $\boldsymbol{\theta}_2 = \mathbf{0}$ (fixing one class as the reference):

$$\hat{p}_1 = \frac{e^{z_1}}{e^{z_1} + e^{0}} = \frac{1}{1 + e^{-z_1}} = \sigma(z_1)$$
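The reduction is easy to verify numerically. The snippet below is an illustrative check with an arbitrary score $z_1 = 1.7$: a two-class softmax with the second score fixed at 0 matches the sigmoid.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z1 = 1.7
p = softmax(np.array([z1, 0.0]))      # second class fixed as the reference
print(np.isclose(p[0], sigmoid(z1)))  # True
```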

Cost function for softmax

The log-loss generalizes to the categorical cross-entropy:

$$J = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \mathbf{1}[y^{(i)} = k] \log \hat{p}_k^{(i)}$$

where $\mathbf{1}[y^{(i)} = k]$ is 1 if example $i$ belongs to class $k$ and 0 otherwise. Like binary cross-entropy, this cost is convex and can be minimized with gradient descent.
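The double sum looks heavier than it is: for each example, only the true class's term survives, so the cost is just the average negative log-probability assigned to the correct label. A small NumPy sketch (the function name `cross_entropy` is illustrative):

```python
import numpy as np

def cross_entropy(P, y):
    """Categorical cross-entropy.

    P: (m, K) array of predicted probabilities.
    y: (m,) array of integer class labels.
    """
    m = len(y)
    # Fancy indexing picks out p-hat of the true class for each example.
    return -np.log(P[np.arange(m), y]).mean()

P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
y = np.array([0, 1])
print(cross_entropy(P, y))  # -(log 0.7 + log 0.8) / 2, about 0.290
```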

Advantages of softmax

  • Produces a proper probability distribution across all classes.
  • Trained as a single model — more coherent than $K$ independent classifiers.
  • Directly models the relationship between classes.

Choosing between OvR and softmax

|                     | One-vs-Rest                    | Softmax                          |
|---------------------|--------------------------------|----------------------------------|
| Number of models    | $K$                            | 1                                |
| Output              | $K$ independent probabilities  | Proper probability distribution  |
| Training            | Parallelizable                 | Single joint optimization        |
| Class relationships | Ignored                        | Modelled jointly                 |
| Best for            | Many classes, simple setup     | Mutually exclusive classes       |

In scikit-learn, older versions of LogisticRegression used OvR by default for multiclass problems, with multi_class='multinomial' selecting softmax. Since version 0.22 the default multi_class='auto' chooses the softmax (multinomial) objective whenever the solver supports it, and the multi_class parameter is deprecated in recent releases (1.5+), with OvR available via the OneVsRestClassifier wrapper.
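A quick check of this behaviour on the iris dataset (assuming a reasonably recent scikit-learn; exact defaults vary across versions):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# With the default lbfgs solver, recent scikit-learn fits the
# softmax (multinomial) objective for a 3-class target.
clf = LogisticRegression(max_iter=1000).fit(X, y)

proba = clf.predict_proba(X)
print(proba.shape)     # (150, 3): one probability per class per example
print(proba[0].sum())  # each row is a proper distribution summing to 1
```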

A note on mutual exclusivity

Softmax assumes the classes are mutually exclusive — each example belongs to exactly one class. If an example can belong to multiple classes simultaneously (e.g. a news article tagged as both "politics" and "economy"), use multiple independent binary classifiers instead.
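In that multi-label setting, each tag gets its own sigmoid and is thresholded independently instead of competing in an argmax. A toy sketch with invented scores for three tags:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative scores from three independent binary classifiers for the
# tags (politics, economy, sports) -- the numbers are made up.
scores = np.array([2.3, 0.8, -1.5])
probs = sigmoid(scores)
tags = probs > 0.5   # threshold each tag on its own
print(tags)          # [ True  True False]: several tags can fire at once
```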