The Sigmoid Function

The need for a squashing function

Logistic regression starts with the same linear combination as linear regression:

z = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n

The value $z$ ranges from $-\infty$ to $+\infty$. To interpret it as a probability, we need to map it into the interval $(0, 1)$. The sigmoid function does exactly this.

Definition

The sigmoid function (also called the logistic function) is:

\sigma(z) = \frac{1}{1 + e^{-z}}

where $e \approx 2.718$ is Euler's number. The output $\sigma(z)$ is always strictly between 0 and 1.
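As a quick sketch, the definition translates directly into code (the function name `sigmoid` is just illustrative):

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0.0))  # 0.5
```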

Key properties

| Input $z$ | Output $\sigma(z)$ | Interpretation |
| --- | --- | --- |
| $z \to -\infty$ | $\sigma(z) \to 0$ | Strongly predicts class 0 |
| $z = 0$ | $\sigma(z) = 0.5$ | Maximum uncertainty |
| $z \to +\infty$ | $\sigma(z) \to 1$ | Strongly predicts class 1 |

The function is symmetric around the point $(0, 0.5)$: flipping the sign of $z$ gives $\sigma(-z) = 1 - \sigma(z)$.
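A minimal numerical check of this symmetry identity, using a small `sigmoid` helper defined for the sketch:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Symmetry around (0, 0.5): sigma(-z) == 1 - sigma(z) for any z
for z in [-3.0, -0.5, 0.0, 1.7, 4.2]:
    assert abs(sigmoid(-z) - (1.0 - sigmoid(z))) < 1e-12
```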

The shape of the sigmoid

The sigmoid has an S-shaped (sigmoidal) curve:

  • For large negative $z$, the output is close to 0 and nearly flat.
  • Near $z = 0$, the function rises steeply — small changes in $z$ produce large changes in probability.
  • For large positive $z$, the output saturates close to 1 and flattens again.

This saturation behavior is important: extreme inputs produce very confident predictions (near 0 or near 1), while inputs near the decision boundary produce uncertain predictions (near 0.5).

The derivative of the sigmoid

The sigmoid has an elegant derivative that is used in gradient descent:

\sigma'(z) = \sigma(z)\,(1 - \sigma(z))

This can be derived from the quotient rule. The derivative is maximized at $z = 0$ (where $\sigma'(0) = 0.25$) and approaches 0 at both extremes — this is the vanishing gradient behavior that makes training slow when inputs are very large or very small.
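The identity is easy to sanity-check against a central finite difference (helper names here are illustrative, not from the lesson):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative via the identity sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Compare against a central finite difference at a few points
h = 1e-6
for z in [-4.0, 0.0, 2.5]:
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
    assert abs(numeric - sigmoid_prime(z)) < 1e-6

print(sigmoid_prime(0.0))  # 0.25, the maximum
```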

From log-odds to probability

The logit (log-odds) of a probability $p$ is defined as:

\text{logit}(p) = \log\frac{p}{1 - p}

The ratio $\frac{p}{1-p}$ is called the odds. If $p = 0.75$, the odds are $3{:}1$ — the event is three times as likely to occur as not.

The sigmoid is exactly the inverse of the logit:

\sigma(z) = \frac{1}{1 + e^{-z}} \iff z = \log\frac{\sigma(z)}{1 - \sigma(z)}

This means logistic regression is modelling the log-odds of the positive class as a linear function of the features:

\log\frac{\hat{p}}{1 - \hat{p}} = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n
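A short sketch of the inverse relationship, with `sigmoid` and `logit` defined inline for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

# logit inverts sigmoid: logit(sigmoid(z)) recovers z
for z in [-2.0, 0.0, 1.3]:
    assert abs(logit(sigmoid(z)) - z) < 1e-9

print(logit(0.75))  # log(3), since odds of p = 0.75 are 3:1
```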

A concrete example

Suppose a trained model gives $z = 2.1$ for a particular input. The predicted probability is:

\hat{p} = \frac{1}{1 + e^{-2.1}} = \frac{1}{1 + 0.122} \approx 0.891

The model is about 89% confident this example belongs to the positive class.
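Reproducing the arithmetic in code:

```python
import math

# Worked example from the text: a trained model outputs z = 2.1
z = 2.1
p_hat = 1.0 / (1.0 + math.exp(-z))
print(round(p_hat, 3))  # 0.891
```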

Why the sigmoid specifically?

The sigmoid is not the only function that maps $\mathbb{R} \to (0,1)$. Alternatives include the probit function (based on the normal CDF). The sigmoid is preferred because:

  • It leads to a convex log-loss cost function with a clean gradient (covered in the Cost Function lesson).
  • It is the canonical link function for the Bernoulli distribution in the framework of Generalized Linear Models.
  • It is computationally simple and numerically stable for most inputs.
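On "numerically stable for most inputs": in floating-point code the naive formula overflows in `exp` for large negative $z$. A common sketch of a stable variant (an implementation detail not covered above, shown here under that assumption) branches on the sign so the exponent is never positive:

```python
import math

def stable_sigmoid(z: float) -> float:
    """Sigmoid that avoids overflow in exp() for large-magnitude z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For z < 0, exp(z) underflows toward 0 instead of overflowing
    ez = math.exp(z)
    return ez / (1.0 + ez)

print(stable_sigmoid(-1000.0))  # 0.0, where the naive form raises OverflowError
```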