# The Sigmoid Function

## The need for a squashing function

Logistic regression starts with the same linear combination as linear regression:

$$z = w^T x + b$$

The value of $z$ ranges from $-\infty$ to $+\infty$. To interpret it as a probability, we need to map it into the interval $(0, 1)$. The sigmoid function does exactly this.
## Definition

The sigmoid function (also called the logistic function) is:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

where $e$ is Euler's number. The output is always strictly between 0 and 1.
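The definition translates directly into code. This is a minimal sketch using only the standard library; the function name `sigmoid` is our own choice:

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))   # 0.5 exactly
print(sigmoid(4))   # close to 1
print(sigmoid(-4))  # close to 0
```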
## Key properties

| Input | Output | Interpretation |
|---|---|---|
| $z \to -\infty$ | $\sigma(z) \to 0$ | Strongly predicts class 0 |
| $z = 0$ | $\sigma(z) = 0.5$ | Maximum uncertainty |
| $z \to +\infty$ | $\sigma(z) \to 1$ | Strongly predicts class 1 |

The function is symmetric around the point $(0, 0.5)$: flipping the sign of the input gives $\sigma(-z) = 1 - \sigma(z)$.
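These properties are easy to verify numerically. A quick sanity check (assuming the `sigmoid` implementation from the definition above):

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# The three rows of the table above.
assert sigmoid(0) == 0.5        # maximum uncertainty
assert sigmoid(20) > 0.999      # effectively class 1
assert sigmoid(-20) < 0.001     # effectively class 0

# Symmetry: sigma(-z) == 1 - sigma(z) for any z.
for z in (-3.0, -0.7, 0.0, 1.5, 6.0):
    assert abs(sigmoid(-z) - (1 - sigmoid(z))) < 1e-12

print("all checks passed")
```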
## The shape of the sigmoid

The sigmoid has an S-shaped (sigmoidal) curve:

- For large negative $z$, the output is close to 0 and nearly flat.
- Near $z = 0$, the function rises steeply: small changes in $z$ produce large changes in probability.
- For large positive $z$, the output saturates close to 1 and flattens again.
This saturation behavior is important: extreme inputs produce very confident predictions (near 0 or near 1), while inputs near the decision boundary produce uncertain predictions (near 0.5).
## The derivative of the sigmoid

The sigmoid has an elegant derivative that is used in gradient descent:

$$\sigma'(z) = \sigma(z)\,\bigl(1 - \sigma(z)\bigr)$$

This can be derived from the quotient rule. The derivative is maximized at $z = 0$ (where $\sigma'(0) = 0.25$) and approaches 0 at both extremes. This is the vanishing gradient behavior that makes training slow when inputs are very large or very small.
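One way to gain confidence in the closed form is to compare it against a central finite difference. A small sketch (the helper names are our own):

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z: float) -> float:
    """Closed-form derivative: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1 - s)

h = 1e-6
for z in (-4.0, -1.0, 0.0, 2.0, 5.0):
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
    assert abs(numeric - sigmoid_grad(z)) < 1e-8

# The gradient peaks at z = 0 and vanishes at the extremes.
assert sigmoid_grad(0) == 0.25
assert sigmoid_grad(10) < 1e-4
print("derivative checks passed")
```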
## From log-odds to probability

The logit (log-odds) of a probability $p$ is defined as:

$$\text{logit}(p) = \ln\frac{p}{1 - p}$$

The ratio $\frac{p}{1 - p}$ is called the odds. If $p = 0.75$, the odds are $0.75 / 0.25 = 3$: the event is three times as likely to occur as not.

The sigmoid is exactly the inverse of the logit:

$$\sigma(\text{logit}(p)) = p \qquad \text{logit}(\sigma(z)) = z$$

This means logistic regression is modelling the log-odds of the positive class as a linear function of the features:

$$\ln\frac{p}{1 - p} = w^T x + b$$
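The inverse relationship can be checked directly. A short sketch, reusing the odds example above (function names are our own):

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def logit(p: float) -> float:
    """Log-odds of probability p (requires 0 < p < 1)."""
    return math.log(p / (1 - p))

# p = 0.75 gives odds of 3, so log-odds of ln(3).
assert abs(logit(0.75) - math.log(3)) < 1e-12

# The two functions invert each other.
assert abs(sigmoid(logit(0.75)) - 0.75) < 1e-12
assert abs(logit(sigmoid(2.0)) - 2.0) < 1e-12
print("logit/sigmoid inverse checks passed")
```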
## A concrete example

Suppose a trained model gives $z = 2.1$ for a particular input. The predicted probability is:

$$\sigma(2.1) = \frac{1}{1 + e^{-2.1}} \approx 0.891$$

The model is about 89% confident this example belongs to the positive class.
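The same calculation in code:

```python
import math

z = 2.1
p = 1.0 / (1.0 + math.exp(-z))
print(round(p, 3))  # 0.891
```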
## Why the sigmoid specifically?

The sigmoid is not the only function that maps $\mathbb{R} \to (0, 1)$. Alternatives include the probit function (based on the normal CDF). The sigmoid is preferred because:
- It leads to a convex log-loss cost function with a clean gradient (covered in the Cost Function lesson).
- It is the canonical link function for the Bernoulli distribution in the framework of Generalized Linear Models.
- It is computationally simple and numerically stable for most inputs.
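The caveat "for most inputs" matters in practice: in double precision, `exp(-z)` overflows once $z$ is very large and negative (roughly $z < -709$). A common stability trick, sketched here under the assumption of IEEE 754 doubles, is to branch on the sign of $z$ so the exponential argument is never positive:

```python
import math

def stable_sigmoid(z: float) -> float:
    """Sigmoid that never exponentiates a large positive number."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For z < 0, rewrite as e^z / (1 + e^z); e^z merely underflows to 0.
    ez = math.exp(z)
    return ez / (1.0 + ez)

print(stable_sigmoid(-1000))  # 0.0 (graceful underflow, not an OverflowError)
print(stable_sigmoid(1000))   # 1.0
```

Both branches are algebraically the same function; the rewrite only changes which exponential is computed, trading a possible overflow for a harmless underflow.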