The Log-Loss Cost Function

Why MSE fails for classification

It is natural to ask: why not use the same Mean Squared Error cost function from linear regression?

$$J = \frac{1}{2m}\sum_{i=1}^{m}\left(\hat{p}^{(i)} - y^{(i)}\right)^2$$

When $\hat{p} = \sigma(z)$ is substituted in, the resulting cost surface is non-convex: it has many local minima. Gradient descent would not reliably find the global optimum. A different cost function is needed, one that is convex when combined with the sigmoid.

The log-loss (binary cross-entropy)

The standard cost function for logistic regression is log-loss, also called binary cross-entropy:

$$J(\boldsymbol{\theta}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left(\hat{p}^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - \hat{p}^{(i)}\right) \right]$$

This looks complex, but because $y^{(i)}$ is either 0 or 1, exactly one of the two terms is active for each example. Consider each case separately:

When the true label is $y = 1$:

The cost for that example is $-\log(\hat{p})$. If the model predicts $\hat{p} = 1$ (perfectly correct), the cost is $-\log(1) = 0$. If the model predicts $\hat{p} \to 0$ (confidently wrong), the cost $-\log(\hat{p}) \to +\infty$.

When the true label is $y = 0$:

The cost for that example is $-\log(1 - \hat{p})$. If the model predicts $\hat{p} = 0$ (perfectly correct), the cost is $-\log(1) = 0$. If the model predicts $\hat{p} \to 1$ (confidently wrong), the cost $-\log(1 - \hat{p}) \to +\infty$.

In both cases: correct confident predictions are rewarded with zero cost; confidently wrong predictions are punished with very large cost.
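Both cases can be captured in a few lines of Python. This is a minimal sketch (the function name `per_example_loss` is ours, not from the text):

```python
import math

def per_example_loss(y, p_hat):
    """Log-loss for one example: -log(p_hat) if y == 1, else -log(1 - p_hat)."""
    return -math.log(p_hat) if y == 1 else -math.log(1 - p_hat)

print(per_example_loss(1, 0.99))  # correct and confident: near-zero cost
print(per_example_loss(1, 0.01))  # confidently wrong: very large cost
```

Note that the loss is never exactly infinite in practice, since the sigmoid never outputs exactly 0 or 1; implementations often also clip probabilities away from the endpoints for numerical safety.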

Intuition from information theory

Log-loss has a deep motivation from information theory. It equals the negative log-likelihood of the training data under the model — minimizing log-loss is equivalent to maximum likelihood estimation (MLE) of the parameters.

Intuitively: among all parameter settings consistent with the data, MLE picks the one that assigns the highest probability to the observed labels.
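This equivalence is easy to check numerically. The sketch below, using made-up labels and probabilities, computes the likelihood of the observed labels directly and confirms that log-loss equals the negative mean log-likelihood:

```python
import math

# Hypothetical labels and predicted probabilities
y = [1, 0]
p_hat = [0.8, 0.3]

# Probability the model assigns to the observed labels (the likelihood)
likelihood = 1.0
for yi, pi in zip(y, p_hat):
    likelihood *= pi if yi == 1 else (1 - pi)

# Log-loss computed from its definition
log_loss = -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                for yi, pi in zip(y, p_hat)) / len(y)

print(likelihood)                          # 0.8 * 0.7 = 0.56
print(log_loss, -math.log(likelihood) / 2)  # equal
```

Because the log of a product is a sum of logs, maximizing the likelihood and minimizing log-loss pick out the same parameters.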

A concrete example

Suppose $m = 3$ examples with true labels and predicted probabilities:

| $i$ | $y^{(i)}$ | $\hat{p}^{(i)}$ | Per-example loss |
|-----|-----------|-----------------|------------------|
| 1   | 1         | 0.90            | $-\log(0.90) = 0.105$ |
| 2   | 0         | 0.20            | $-\log(1 - 0.20) = -\log(0.80) = 0.223$ |
| 3   | 1         | 0.60            | $-\log(0.60) = 0.511$ |

$$J = \frac{1}{3}(0.105 + 0.223 + 0.511) = \frac{0.839}{3} \approx 0.280$$
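The arithmetic above can be checked directly:

```python
import math

y = [1, 0, 1]
p_hat = [0.90, 0.20, 0.60]

# Per-example losses, then the average
losses = [-math.log(p) if yi == 1 else -math.log(1 - p)
          for yi, p in zip(y, p_hat)]
J = sum(losses) / len(y)

print([round(l, 3) for l in losses])  # [0.105, 0.223, 0.511]
print(round(J, 3))                    # 0.28
```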

Convexity of log-loss

When the predicted probability is $\hat{p} = \sigma(\boldsymbol{\theta}^T \mathbf{x})$, the log-loss cost function is convex in $\boldsymbol{\theta}$. This means:

  • There are no local minima to get stuck in.
  • With a suitable learning rate, gradient descent converges to the global minimum.

This is the key reason log-loss is preferred over MSE for logistic regression.

The gradient

Taking the partial derivative of $J$ with respect to $\theta_j$ yields a remarkably clean result:

$$\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{p}^{(i)} - y^{(i)}\right) x_j^{(i)}$$

This is identical in form to the gradient of MSE in linear regression; only $\hat{p}^{(i)}$ replaces $\hat{y}^{(i)}$. This means the gradient descent update rule carries over almost unchanged (details in the next lesson).
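As a sanity check, the gradient formula can be compared against a finite-difference approximation of the cost. All numbers below are illustrative; the first feature column is a constant 1 for the bias term:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    return sigmoid(sum(t * xj for t, xj in zip(theta, x)))

def log_loss(theta, X, y):
    m = len(X)
    return -sum(yi * math.log(predict(theta, xi))
                + (1 - yi) * math.log(1 - predict(theta, xi))
                for xi, yi in zip(X, y)) / m

def gradient(theta, X, y):
    # (1/m) * sum_i (p_hat_i - y_i) * x_ij for each parameter j
    m = len(X)
    return [sum((predict(theta, xi) - yi) * xi[j]
                for xi, yi in zip(X, y)) / m
            for j in range(len(theta))]

# Tiny made-up dataset: bias column plus one feature
X = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]]
y = [1, 0, 1]
theta = [0.1, 0.3]

analytic = gradient(theta, X, y)

# Central finite differences should match each partial derivative
eps = 1e-6
for j in range(len(theta)):
    t_plus = list(theta);  t_plus[j] += eps
    t_minus = list(theta); t_minus[j] -= eps
    numeric = (log_loss(t_plus, X, y) - log_loss(t_minus, X, y)) / (2 * eps)
    assert abs(numeric - analytic[j]) < 1e-6
```

If the gradient had been derived incorrectly, the finite-difference check would fail, which makes this a useful habit whenever implementing a cost function by hand.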

Log-loss vs. accuracy

Accuracy (the fraction of correctly classified examples) is the most intuitive metric, but it is a poor training objective because it is piecewise constant: small changes in parameters almost never flip a prediction across the threshold, so its gradient is zero nearly everywhere and gives gradient descent nothing to follow.

Log-loss is smooth and differentiable everywhere, making it suitable for gradient-based optimization. In practice: optimize log-loss during training, evaluate accuracy (and other metrics) during evaluation.
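A quick sketch with hypothetical predictions shows the difference: two models that classify every example correctly have identical accuracy, yet log-loss still distinguishes the confident model from the hesitant one:

```python
import math

y = [1, 0]

p_a = [0.55, 0.45]  # barely right on both examples
p_b = [0.95, 0.05]  # confidently right on both examples

def accuracy(y, p):
    # Threshold at 0.5, then count correct predictions
    return sum(int(pi >= 0.5) == yi for yi, pi in zip(y, p)) / len(y)

def log_loss(y, p):
    return -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                for yi, pi in zip(y, p)) / len(y)

print(accuracy(y, p_a), accuracy(y, p_b))  # same accuracy: 1.0 and 1.0
print(log_loss(y, p_a), log_loss(y, p_b))  # log-loss is lower for p_b
```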