Multiclass Classification: One-vs-Rest and Softmax
Beyond binary classification
So far, logistic regression has been framed as a binary classifier: $h_\theta(x) = \sigma(\theta^\top x)$, predicting $y \in \{0, 1\}$. Many real problems have more than two classes — handwritten digit recognition ($K = 10$ classes), news categorization (sports, politics, technology, …), or medical diagnosis with multiple conditions.
Two main strategies extend logistic regression to $K$ classes: One-vs-Rest and Softmax regression.
Strategy 1: One-vs-Rest (OvR)
One-vs-Rest (also called One-vs-All, OvA) trains $K$ separate binary classifiers, one per class. Classifier $k$ is trained to distinguish class $k$ from all other classes combined:
- Positive examples: all training examples with label $y = k$.
- Negative examples: all training examples with any other label ($y \neq k$).
At prediction time, all $K$ classifiers are run and the class whose classifier outputs the highest probability wins:
$$\hat{y} = \arg\max_{k} \; h_\theta^{(k)}(x)$$
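The decision rule above can be sketched in a few lines of NumPy. The weight matrix here is hypothetical — in practice each row would come from training one binary logistic regression:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ovr_predict(Theta, x):
    """Run all K binary classifiers and pick the highest-scoring class.

    Theta: (K, d) array -- one weight vector per class (assumed pre-trained).
    x:     (d,) feature vector.
    """
    scores = sigmoid(Theta @ x)          # K independent probabilities
    return int(np.argmax(scores)), scores

# Illustrative weights for K = 3 classes, d = 2 features (hypothetical values)
Theta = np.array([[ 2.0, -1.0],
                  [-1.0,  2.0],
                  [ 0.5,  0.5]])
x = np.array([0.2, 1.5])
label, scores = ovr_predict(Theta, x)    # label is 1 for these weights
```

Note that `scores` are three independent sigmoid outputs; nothing forces them to sum to 1, which is one of the OvR drawbacks discussed below.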
Advantages of OvR
- Simple — reuses the binary logistic regression you already know.
- Parallelizable — each classifier trains independently.
- Works with any binary classifier, not just logistic regression.
Disadvantages of OvR
- The classifiers are trained on imbalanced data (one class vs. all others).
- The output probabilities from the $K$ different classifiers are not calibrated to sum to 1.
- With very large $K$, training $K$ classifiers is expensive.
Strategy 2: Softmax regression (Multinomial logistic regression)
Softmax regression generalizes logistic regression directly to $K$ classes in a single unified model. It learns $K$ separate parameter vectors $\theta^{(1)}, \dots, \theta^{(K)}$, one per class, and computes a score for each:
$$z_k = \theta^{(k)\top} x$$
The softmax function converts these scores into a valid probability distribution (non-negative, summing to 1):
$$P(y = k \mid x) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$
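A minimal NumPy sketch of the softmax function. Subtracting the maximum score before exponentiating is a standard numerical-stability trick; it cancels in the ratio, so the output is unchanged:

```python
import numpy as np

def softmax(z):
    """Map raw scores z to a probability distribution.

    Shifting by max(z) prevents overflow in exp() without changing
    the result, since the common factor cancels in the ratio.
    """
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))   # non-negative, sums to 1
```

Without the shift, scores like `[1000.0, 1000.0]` would overflow `exp` and yield NaNs; with it, the same call returns `[0.5, 0.5]`.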
The predicted class is the one with the highest probability:
$$\hat{y} = \arg\max_{k} \; P(y = k \mid x)$$
Connection to the sigmoid
When $K = 2$, softmax reduces exactly to the binary sigmoid. Setting $\theta^{(2)} = 0$ (fixing one class as the reference) gives $z_2 = 0$, so:
$$P(y = 1 \mid x) = \frac{e^{z_1}}{e^{z_1} + e^{0}} = \frac{1}{1 + e^{-z_1}} = \sigma(z_1)$$
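Because softmax is unchanged by adding a constant to every score, the two-class case equals the sigmoid of the score *difference* — and fixing the reference score at 0 recovers the plain sigmoid. A quick numeric check (NumPy assumed):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z1, z2 = 1.3, -0.4
p_softmax = softmax(np.array([z1, z2]))[0]   # P(class 1) from 2-class softmax
p_sigmoid = sigmoid(z1 - z2)                 # sigmoid of the score difference
# The two probabilities agree for any choice of z1 and z2;
# with z2 = 0 this is exactly sigmoid(z1).
```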
Cost function for softmax
The log-loss generalizes to the categorical cross-entropy:
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} y_k^{(i)} \log P(y^{(i)} = k \mid x^{(i)})$$
where $y_k^{(i)}$ is 1 if example $i$ belongs to class $k$ and 0 otherwise. Like the binary cross-entropy, this cost is convex and is optimized with gradient descent.
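The double sum looks heavier than it is: with one-hot labels, only the log-probability assigned to each example's true class survives. A sketch with made-up predictions:

```python
import numpy as np

def cross_entropy(Y, P):
    """Average categorical cross-entropy.

    Y: (m, K) one-hot true labels; P: (m, K) predicted probabilities.
    The one-hot mask keeps only log P(true class) for each example.
    """
    m = Y.shape[0]
    return -np.sum(Y * np.log(P)) / m

# Two examples, three classes (hypothetical predicted probabilities)
Y = np.array([[1, 0, 0],
              [0, 1, 0]])
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
loss = cross_entropy(Y, P)   # -(log 0.7 + log 0.8) / 2
```

The loss shrinks as the model puts more probability mass on the true classes.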
Advantages of softmax
- Produces a proper probability distribution across all classes.
- Trained as a single model — more coherent than independent classifiers.
- Directly models the relationship between classes.
Choosing between OvR and softmax
| | One-vs-Rest | Softmax |
|---|---|---|
| Number of models | $K$ | 1 |
| Output | $K$ independent probabilities | Proper probability distribution |
| Training | Parallelizable | Single joint optimization |
| Class relationships | Ignored | Modelled jointly |
| Best for | Many classes, simple setup | Mutually exclusive classes |
In scikit-learn, older versions of `LogisticRegression` used OvR by default for multiclass problems, with `multi_class='multinomial'` selecting softmax; recent versions default to softmax (multinomial) whenever the solver supports it.
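A short end-to-end example, assuming scikit-learn is installed. With the default `lbfgs` solver, recent scikit-learn fits the softmax (multinomial) model, so `predict_proba` returns a proper distribution per row:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)       # 3 mutually exclusive classes
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Default lbfgs solver -> softmax (multinomial) in recent scikit-learn;
# older versions needed multi_class='multinomial' to get the same behaviour.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)        # each row sums to 1
acc = clf.score(X_test, y_test)
```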
A note on mutual exclusivity
Softmax assumes the classes are mutually exclusive — each example belongs to exactly one class. If an example can belong to multiple classes simultaneously (e.g. a news article tagged as both "politics" and "economy"), use multiple independent binary classifiers instead.
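The multi-label setup can be sketched by thresholding independent sigmoid outputs rather than taking an argmax; the weights below are hypothetical stand-ins for trained per-tag classifiers:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multilabel_predict(Theta, x, threshold=0.5):
    """Independent binary classifiers for non-exclusive tags.

    Each row of Theta is one tag's weight vector (assumed pre-trained).
    Every tag whose sigmoid score clears the threshold is assigned,
    so an example can receive zero, one, or several tags.
    """
    scores = sigmoid(Theta @ x)
    return [k for k, s in enumerate(scores) if s >= threshold]

# Hypothetical weights for 3 tags, 2 features
Theta = np.array([[ 1.5,  0.5],
                  [ 1.0,  1.0],
                  [-2.0, -1.0]])
x = np.array([1.0, 0.5])
tags = multilabel_predict(Theta, x)   # this example gets tags 0 and 1
```

Unlike softmax, nothing couples the tags: assigning "politics" does not lower the probability of "economy".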