The Sigmoid Function

The need for a squashing function

Logistic regression starts with the same linear combination as linear regression:

z = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n

The value $z$ ranges from $-\infty$ to $+\infty$. To interpret it as a probability, we need to map it into the interval $(0, 1)$. The sigmoid function does exactly this.

Definition

The sigmoid function (also called the logistic function) is:

\sigma(z) = \frac{1}{1 + e^{-z}}

where $e \approx 2.718$ is Euler's number. The output $\sigma(z)$ is always strictly between 0 and 1.
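As a quick sketch, the definition translates directly into code (the function name `sigmoid` is just illustrative):

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0.0))  # 0.5
```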

Key properties

| Input $z$ | Output $\sigma(z)$ | Interpretation |
| --- | --- | --- |
| $z \to -\infty$ | $\sigma(z) \to 0$ | Strongly predicts class 0 |
| $z = 0$ | $\sigma(z) = 0.5$ | Maximum uncertainty |
| $z \to +\infty$ | $\sigma(z) \to 1$ | Strongly predicts class 1 |

The function is symmetric around the point $(0, 0.5)$: flipping the sign of $z$ gives $\sigma(-z) = 1 - \sigma(z)$.
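A minimal numerical check of this symmetry identity, using a small `sigmoid` helper defined for the sketch:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Symmetry around (0, 0.5): sigma(-z) == 1 - sigma(z) for any z
for z in [-3.0, -0.5, 0.0, 1.7, 4.2]:
    assert abs(sigmoid(-z) - (1.0 - sigmoid(z))) < 1e-12
```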

The shape of the sigmoid

The sigmoid has an S-shaped (sigmoidal) curve:

  • For large negative $z$, the output is close to 0 and nearly flat.
  • Near $z = 0$, the function rises steeply — small changes in $z$ produce large changes in probability.
  • For large positive $z$, the output saturates close to 1 and flattens again.

This saturation behavior is important: extreme inputs produce very confident predictions (near 0 or near 1), while inputs near the decision boundary produce uncertain predictions (near 0.5).

The derivative of the sigmoid

The sigmoid has an elegant derivative that is used in gradient descent:

\sigma'(z) = \sigma(z)\,(1 - \sigma(z))

This can be derived from the quotient rule. The derivative is maximized at $z = 0$ (where $\sigma'(0) = 0.25$) and approaches 0 at both extremes — this is the vanishing gradient behavior that makes training slow when inputs are very large or very small.
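The identity is easy to sanity-check against a central finite difference (helper names here are illustrative, not from the lesson):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative via the identity sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Compare against a central finite difference at a few points
h = 1e-6
for z in [-4.0, 0.0, 2.5]:
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
    assert abs(numeric - sigmoid_prime(z)) < 1e-6

print(sigmoid_prime(0.0))  # 0.25, the maximum
```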

From log-odds to probability

The logit (log-odds) of a probability $p$ is defined as:

\text{logit}(p) = \log\frac{p}{1 - p}

The ratio $\frac{p}{1-p}$ is called the odds. If $p = 0.75$, the odds are $3{:}1$ — the event is three times as likely to occur as not.

The sigmoid is exactly the inverse of the logit:

\sigma(z) = \frac{1}{1 + e^{-z}} \iff z = \log\frac{\sigma(z)}{1 - \sigma(z)}

This means logistic regression is modelling the log-odds of the positive class as a linear function of the features:

\log\frac{\hat{p}}{1 - \hat{p}} = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n
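A short sketch of the inverse relationship, with `sigmoid` and `logit` defined inline for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

# logit inverts sigmoid: logit(sigmoid(z)) recovers z
for z in [-2.0, 0.0, 1.3]:
    assert abs(logit(sigmoid(z)) - z) < 1e-9

print(logit(0.75))  # log(3), since odds of p = 0.75 are 3:1
```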

A concrete example

Suppose a trained model gives $z = 2.1$ for a particular input. The predicted probability is:

\hat{p} = \frac{1}{1 + e^{-2.1}} = \frac{1}{1 + 0.122} \approx 0.891

The model is about 89% confident this example belongs to the positive class.
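Reproducing the arithmetic in code:

```python
import math

# Worked example from the text: a trained model outputs z = 2.1
z = 2.1
p_hat = 1.0 / (1.0 + math.exp(-z))
print(round(p_hat, 3))  # 0.891
```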

Why the sigmoid specifically?

The sigmoid is not the only function that maps $\mathbb{R} \to (0,1)$. Alternatives include the probit function (based on the normal CDF). The sigmoid is preferred because:

  • It leads to a convex log-loss cost function with a clean gradient (covered in the Cost Function lesson).
  • It is the canonical link function for the Bernoulli distribution in the framework of Generalized Linear Models.
  • It is computationally simple and numerically stable for most inputs.
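On "numerically stable for most inputs": in floating-point code the naive formula overflows in `exp` for large negative $z$. A common sketch of a stable variant (an implementation detail not covered above, shown here under that assumption) branches on the sign so the exponent is never positive:

```python
import math

def stable_sigmoid(z: float) -> float:
    """Sigmoid that avoids overflow in exp() for large-magnitude z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For z < 0, exp(z) underflows toward 0 instead of overflowing
    ez = math.exp(z)
    return ez / (1.0 + ez)

print(stable_sigmoid(-1000.0))  # 0.0, where the naive form raises OverflowError
```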