# The Log-Loss Cost Function

## Why MSE fails for classification
It is natural to ask: why not reuse the Mean Squared Error (MSE) cost function from linear regression?

When the sigmoid prediction $\hat{y} = \sigma(w \cdot x + b)$ is substituted into the MSE formula, the resulting cost surface is non-convex: it has many local minima, so gradient descent would not reliably find the global optimum. A different cost function is needed, one that is convex when combined with the sigmoid.
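The non-convexity can be spot-checked numerically: for a convex function, the value at the midpoint of any interval never exceeds the average of the values at the endpoints. The sketch below shows MSE composed with the sigmoid violating that inequality, using a single toy example ($x = 1$, $y = 1$) of my own choosing:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# MSE cost for a single training example (x=1, y=1), as a function of w.
# A toy setup chosen to make the non-convexity easy to check.
def mse_cost(w):
    return (sigmoid(w) - 1.0) ** 2

# Convexity requires J(midpoint) <= average of J at the endpoints.
# Pick three evenly spaced points and test that inequality.
w1, w2, w3 = -10.0, -7.5, -5.0
mid = mse_cost(w2)
avg = (mse_cost(w1) + mse_cost(w3)) / 2

print(mid > avg)  # True: the midpoint lies ABOVE the chord, so not convex
```

The cost is nearly flat for very negative $w$ and then drops steeply, which is exactly the shape a convex function cannot have.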
## The log-loss (binary cross-entropy)
The standard cost function for logistic regression is log-loss, also called binary cross-entropy:

$$J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left(\hat{y}^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right]$$

where $m$ is the number of training examples and $\hat{y}^{(i)} = \sigma(w \cdot x^{(i)} + b)$.
This looks complex, but consider each term separately:
**When the true label is $y = 1$:**

The cost for that example is $-\log(\hat{y})$. If the model predicts $\hat{y} = 1$ (perfectly correct), the cost is $-\log(1) = 0$. If the model predicts $\hat{y} \to 0$ (confidently wrong), the cost grows without bound: $-\log(\hat{y}) \to \infty$.
**When the true label is $y = 0$:**

The cost for that example is $-\log(1 - \hat{y})$. If the model predicts $\hat{y} = 0$ (perfectly correct), the cost is $-\log(1) = 0$. If the model predicts $\hat{y} \to 1$ (confidently wrong), the cost grows without bound: $-\log(1 - \hat{y}) \to \infty$.
In both cases: correct confident predictions are rewarded with zero cost; confidently wrong predictions are punished with very large cost.
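The two cases fold into the single expression in the formula above. A minimal sketch in plain Python (the `log_loss` name and the `eps` clipping parameter are my own choices, not from this lesson):

```python
import math

def log_loss(y_true, y_pred, eps=1e-15):
    """Mean binary cross-entropy. eps clips predictions away from
    exactly 0 or 1 so the logarithms stay finite."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Correct, confident predictions cost almost nothing...
print(log_loss([1, 0], [0.999, 0.001]))   # close to 0
# ...while confidently wrong ones are punished heavily.
print(log_loss([1, 0], [0.001, 0.999]))   # large
```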
## Intuition from information theory
Log-loss has a deep motivation from information theory. It equals the negative log-likelihood of the training data under the model — minimizing log-loss is equivalent to maximum likelihood estimation (MLE) of the parameters.
Intuitively: among all parameter settings consistent with the data, MLE picks the one that assigns the highest probability to the observed labels.
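This equivalence can be verified on a toy dataset: the mean log-loss is the negative log-likelihood divided by $m$, so exponentiating $-m \times \text{loss}$ recovers the likelihood exactly. A sketch (the `p_observed` helper and the data values are my own illustration):

```python
import math

# Probability the model assigns to the OBSERVED label of one example.
def p_observed(y, p):
    return p if y == 1 else 1.0 - p

y_true = [1, 0, 1]
y_pred = [0.90, 0.20, 0.60]

# Likelihood of the data = product of per-example probabilities.
likelihood = math.prod(p_observed(y, p) for y, p in zip(y_true, y_pred))

# Mean log-loss over the same data.
m = len(y_true)
loss = -sum(math.log(p_observed(y, p)) for y, p in zip(y_true, y_pred)) / m

# exp(-m * loss) recovers the likelihood, so minimizing log-loss
# maximizes the likelihood.
print(abs(math.exp(-m * loss) - likelihood) < 1e-12)  # True
```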
## A concrete example
Suppose three examples with the following true labels and predicted probabilities:
| Example $i$ | True label $y^{(i)}$ | Predicted $\hat{y}^{(i)}$ | Per-example loss |
|---|---|---|---|
| 1 | 1 | 0.90 | $-\ln(0.90) \approx 0.105$ |
| 2 | 0 | 0.20 | $-\ln(1 - 0.20) \approx 0.223$ |
| 3 | 1 | 0.60 | $-\ln(0.60) \approx 0.511$ |

The total cost is the average: $J \approx (0.105 + 0.223 + 0.511)/3 \approx 0.280$.
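The table can be reproduced with a few lines of Python (natural logarithms assumed, as is conventional):

```python
import math

y_true = [1, 0, 1]
y_pred = [0.90, 0.20, 0.60]

# Per-example loss: -ln(y_hat) when y = 1, -ln(1 - y_hat) when y = 0.
losses = [-math.log(p) if y == 1 else -math.log(1 - p)
          for y, p in zip(y_true, y_pred)]

for i, loss in enumerate(losses, start=1):
    print(f"example {i}: {loss:.3f}")
# example 1: 0.105
# example 2: 0.223
# example 3: 0.511

print(f"mean cost: {sum(losses) / len(losses):.3f}")  # mean cost: 0.280
```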
## Convexity of log-loss
When the predicted probability is $\hat{y} = \sigma(w \cdot x + b)$, the log-loss cost function $J(w, b)$ is convex in the parameters $w$ and $b$. This means:
- There are no local minima to get stuck in.
- Gradient descent, run with a suitable learning rate, converges to the global minimum.
This is the key reason log-loss is preferred over MSE for logistic regression.
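Convexity can be spot-checked the same way non-convexity was for MSE: along any line through parameter space, the cost at a midpoint should not exceed the average of the costs at the endpoints. A sketch for a single toy example ($x = 1$, $y = 1$) of my own choosing:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Log-loss for a single example (x=1, y=1) as a function of w.
# Algebraically this is log(1 + exp(-w)), the softplus, which is convex.
def log_loss_cost(w):
    return -math.log(sigmoid(w))

# For a convex function, the midpoint value never exceeds the chord.
w1, w2, w3 = -10.0, -7.5, -5.0
mid = log_loss_cost(w2)
avg = (log_loss_cost(w1) + log_loss_cost(w3)) / 2

print(mid <= avg)  # True: midpoint on or below the chord
```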
## The gradient
Taking the partial derivative of $J$ with respect to each weight $w_j$ yields a remarkably clean result:

$$\frac{\partial J}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) x_j^{(i)}$$

This is identical in form to the gradient of MSE in linear regression; the only change is that the sigmoid output $\hat{y}^{(i)} = \sigma(w \cdot x^{(i)} + b)$ replaces the linear prediction. This means the gradient descent update rule carries over almost unchanged (details in the next lesson).
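The gradient formula can be validated against numerical differentiation on a toy dataset (the data and the parameter values $w$, $b$ below are arbitrary illustrations of mine):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Tiny 1-feature dataset; w and b are arbitrary test values.
xs = [0.5, -1.0, 2.0]
ys = [1, 0, 1]
w, b = 0.3, -0.1

def cost(w, b):
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)

# Analytic gradient: (1/m) sum((y_hat - y) * x) for w, (1/m) sum(y_hat - y) for b.
m = len(xs)
dw = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / m
db = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / m

# Numerical check with central differences.
h = 1e-6
dw_num = (cost(w + h, b) - cost(w - h, b)) / (2 * h)
db_num = (cost(w, b + h) - cost(w, b - h)) / (2 * h)

print(abs(dw - dw_num) < 1e-6, abs(db - db_num) < 1e-6)  # True True
```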
## Log-loss vs. accuracy
Accuracy (the fraction of correctly classified examples) is the most intuitive metric, but it is a poor training objective because it is not differentiable: its gradient is zero almost everywhere, since small changes in the parameters rarely push a prediction across the 0.5 threshold.
Log-loss is smooth and differentiable everywhere, making it suitable for gradient-based optimization. In practice: optimize log-loss during training, evaluate accuracy (and other metrics) during evaluation.
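A small experiment illustrates the contrast: nudging a weight slightly leaves accuracy untouched (no prediction crosses the 0.5 threshold), while log-loss responds smoothly. A sketch with made-up data:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

xs = [0.5, -1.0, 2.0, -0.3]
ys = [1, 0, 1, 0]

def predictions(w, b):
    return [sigmoid(w * x + b) for x in xs]

def accuracy(probs):
    return sum((p >= 0.5) == (y == 1) for p, y in zip(probs, ys)) / len(ys)

def log_loss(probs):
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(probs, ys)) / len(ys)

# Nudge the weight slightly: accuracy does not move, log-loss does.
p0, p1 = predictions(1.0, 0.0), predictions(1.001, 0.0)
print(accuracy(p0) == accuracy(p1))   # True: accuracy is flat here
print(log_loss(p0) != log_loss(p1))   # True: log-loss changed
```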