The Log-Loss Cost Function

Why MSE fails for classification

It is natural to ask: why not use the same Mean Squared Error cost function from linear regression?

$$J = \frac{1}{2m}\sum_{i=1}^{m}\left(\hat{p}^{(i)} - y^{(i)}\right)^2$$

When $\hat{p} = \sigma(z)$ is substituted in, the resulting cost surface is non-convex: it has many local minima. Gradient descent would not reliably find the global optimum. A different cost function is needed, one that is convex when combined with the sigmoid.

The log-loss (binary cross-entropy)

The standard cost function for logistic regression is log-loss, also called binary cross-entropy:

$$J(\boldsymbol{\theta}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left(\hat{p}^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - \hat{p}^{(i)}\right) \right]$$

This looks complex, but because $y^{(i)}$ is either 0 or 1, exactly one of the two terms is active for each example. Consider each case separately:

When the true label is $y = 1$:

The cost for that example is $-\log(\hat{p})$. If the model predicts $\hat{p} = 1$ (perfectly correct), the cost is $-\log(1) = 0$. If the model predicts $\hat{p} \to 0$ (confidently wrong), the cost $-\log(\hat{p}) \to +\infty$.

When the true label is $y = 0$:

The cost for that example is $-\log(1 - \hat{p})$. If the model predicts $\hat{p} = 0$ (perfectly correct), the cost is $-\log(1) = 0$. If the model predicts $\hat{p} \to 1$ (confidently wrong), the cost $-\log(1 - \hat{p}) \to +\infty$.

In both cases: correct confident predictions are rewarded with zero cost; confidently wrong predictions are punished with very large cost.
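Both cases can be captured in a few lines of Python. This is a minimal sketch (the function name `per_example_loss` is ours, not from the text):

```python
import math

def per_example_loss(y, p_hat):
    """Log-loss for one example: -log(p_hat) if y == 1, else -log(1 - p_hat)."""
    return -math.log(p_hat) if y == 1 else -math.log(1 - p_hat)

print(per_example_loss(1, 0.99))  # correct and confident: near-zero cost
print(per_example_loss(1, 0.01))  # confidently wrong: very large cost
```

Note that the loss is never exactly infinite in practice, since the sigmoid never outputs exactly 0 or 1; implementations often also clip probabilities away from the endpoints for numerical safety.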

Intuition from information theory

Log-loss has a deep motivation from information theory. It equals the negative log-likelihood of the training data under the model — minimizing log-loss is equivalent to maximum likelihood estimation (MLE) of the parameters.

Intuitively: among all parameter settings consistent with the data, MLE picks the one that assigns the highest probability to the observed labels.
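This equivalence is easy to check numerically. The sketch below, using made-up labels and probabilities, computes the likelihood of the observed labels directly and confirms that log-loss equals the negative mean log-likelihood:

```python
import math

# Hypothetical labels and predicted probabilities
y = [1, 0]
p_hat = [0.8, 0.3]

# Probability the model assigns to the observed labels (the likelihood)
likelihood = 1.0
for yi, pi in zip(y, p_hat):
    likelihood *= pi if yi == 1 else (1 - pi)

# Log-loss computed from its definition
log_loss = -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                for yi, pi in zip(y, p_hat)) / len(y)

print(likelihood)                          # 0.8 * 0.7 = 0.56
print(log_loss, -math.log(likelihood) / 2)  # equal
```

Because the log of a product is a sum of logs, maximizing the likelihood and minimizing log-loss pick out the same parameters.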

A concrete example

Suppose $m = 3$ examples with true labels and predicted probabilities:

| $i$ | $y^{(i)}$ | $\hat{p}^{(i)}$ | Per-example loss |
|-----|-----------|-----------------|------------------|
| 1   | 1         | 0.90            | $-\log(0.90) = 0.105$ |
| 2   | 0         | 0.20            | $-\log(1 - 0.20) = -\log(0.80) = 0.223$ |
| 3   | 1         | 0.60            | $-\log(0.60) = 0.511$ |

$$J = \frac{1}{3}(0.105 + 0.223 + 0.511) = \frac{0.839}{3} \approx 0.280$$
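The arithmetic above can be checked directly:

```python
import math

y = [1, 0, 1]
p_hat = [0.90, 0.20, 0.60]

# Per-example losses, then the average
losses = [-math.log(p) if yi == 1 else -math.log(1 - p)
          for yi, p in zip(y, p_hat)]
J = sum(losses) / len(y)

print([round(l, 3) for l in losses])  # [0.105, 0.223, 0.511]
print(round(J, 3))                    # 0.28
```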

Convexity of log-loss

When the predicted probability is $\hat{p} = \sigma(\boldsymbol{\theta}^T \mathbf{x})$, the log-loss cost function is convex in $\boldsymbol{\theta}$. This means:

  • There are no local minima to get stuck in.
  • With a suitable learning rate, gradient descent converges to the global minimum.

This is the key reason log-loss is preferred over MSE for logistic regression.

The gradient

Taking the partial derivative of $J$ with respect to $\theta_j$ yields a remarkably clean result:

$$\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{p}^{(i)} - y^{(i)}\right) x_j^{(i)}$$

This is identical in form to the gradient of MSE in linear regression; only $\hat{p}^{(i)}$ replaces $\hat{y}^{(i)}$. This means the gradient descent update rule carries over almost unchanged (details in the next lesson).
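As a sanity check, the gradient formula can be compared against a finite-difference approximation of the cost. All numbers below are illustrative; the first feature column is a constant 1 for the bias term:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    return sigmoid(sum(t * xj for t, xj in zip(theta, x)))

def log_loss(theta, X, y):
    m = len(X)
    return -sum(yi * math.log(predict(theta, xi))
                + (1 - yi) * math.log(1 - predict(theta, xi))
                for xi, yi in zip(X, y)) / m

def gradient(theta, X, y):
    # (1/m) * sum_i (p_hat_i - y_i) * x_ij for each parameter j
    m = len(X)
    return [sum((predict(theta, xi) - yi) * xi[j]
                for xi, yi in zip(X, y)) / m
            for j in range(len(theta))]

# Tiny made-up dataset: bias column plus one feature
X = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]]
y = [1, 0, 1]
theta = [0.1, 0.3]

analytic = gradient(theta, X, y)

# Central finite differences should match each partial derivative
eps = 1e-6
for j in range(len(theta)):
    t_plus = list(theta);  t_plus[j] += eps
    t_minus = list(theta); t_minus[j] -= eps
    numeric = (log_loss(t_plus, X, y) - log_loss(t_minus, X, y)) / (2 * eps)
    assert abs(numeric - analytic[j]) < 1e-6
```

If the gradient had been derived incorrectly, the finite-difference check would fail, which makes this a useful habit whenever implementing a cost function by hand.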

Log-loss vs. accuracy

Accuracy (the fraction of correctly classified examples) is the most intuitive metric, but it is a poor training objective because it is piecewise constant: small changes in parameters almost never flip a prediction across the threshold, so its gradient is zero nearly everywhere and gives gradient descent nothing to follow.

Log-loss is smooth and differentiable everywhere, making it suitable for gradient-based optimization. In practice: optimize log-loss during training, evaluate accuracy (and other metrics) during evaluation.
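A quick sketch with hypothetical predictions shows the difference: two models that classify every example correctly have identical accuracy, yet log-loss still distinguishes the confident model from the hesitant one:

```python
import math

y = [1, 0]

p_a = [0.55, 0.45]  # barely right on both examples
p_b = [0.95, 0.05]  # confidently right on both examples

def accuracy(y, p):
    # Threshold at 0.5, then count correct predictions
    return sum(int(pi >= 0.5) == yi for yi, pi in zip(y, p)) / len(y)

def log_loss(y, p):
    return -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                for yi, pi in zip(y, p)) / len(y)

print(accuracy(y, p_a), accuracy(y, p_b))  # same accuracy: 1.0 and 1.0
print(log_loss(y, p_a), log_loss(y, p_b))  # log-loss is lower for p_b
```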