Derivatives

The rate of change

A derivative measures how much a function's output changes in response to a small change in its input. It is the instantaneous rate of change — the slope of the function at a specific point.

For a function $f(x)$, the derivative at a point $x$ is:

$$f'(x) = \frac{df}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

You nudge the input by a tiny amount $h$, measure how much the output changes, and divide by $h$. As $h$ shrinks to zero, this ratio settles at the derivative.
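The limit definition can be sketched numerically by fixing a small $h$ instead of taking the limit. The helper name and the example function $f(x) = x^2$ are illustrative choices, not part of the lesson:

```python
# Numerical sketch of the limit definition: approximate f'(x) with a small h.

def numerical_derivative(f, x, h=1e-6):
    """Forward-difference approximation of f'(x) = (f(x + h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h

f = lambda x: x ** 2  # example function; its true derivative is 2x

# At x = 3 the true derivative is 6; the approximation should be very close.
print(numerical_derivative(f, 3.0))  # close to 6
```

Shrinking $h$ further initially improves the approximation, mirroring how the ratio settles at the derivative as $h \to 0$.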

Geometric interpretation: the derivative is the slope of the tangent line to the curve $y = f(x)$ at the point $x$.

  • $f'(x) > 0$: the function is increasing at $x$; moving right increases $f$.
  • $f'(x) < 0$: the function is decreasing at $x$.
  • $f'(x) = 0$: the function is flat at $x$ (a potential minimum, maximum, or saddle point).
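The three cases above can be checked concretely with $f(x) = x^2$, whose derivative is $f'(x) = 2x$ (an example function chosen for illustration):

```python
# Sign of the derivative of f(x) = x**2, i.e. f'(x) = 2x, at a few points.

def f_prime(x):
    return 2 * x

print(f_prime(-1.0))  # -2.0 → f is decreasing at x = -1
print(f_prime(0.0))   #  0.0 → f is flat at x = 0 (here, a minimum)
print(f_prime(2.0))   #  4.0 → f is increasing at x = 2
```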

Common derivatives

| Function $f(x)$ | Derivative $f'(x)$ |
| --- | --- |
| $c$ (constant) | $0$ |
| $x^n$ | $nx^{n-1}$ |
| $e^x$ | $e^x$ |
| $\ln x$ | $1/x$ |
| $\sin x$ | $\cos x$ |
| $\sigma(x) = \frac{1}{1+e^{-x}}$ | $\sigma(x)(1 - \sigma(x))$ |

The sigmoid derivative is particularly important in ML — it appears in backpropagation whenever sigmoid activations are used.
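The identity $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ can be verified against a finite-difference approximation; the helper names below are mine, not from the lesson:

```python
import math

# Check sigma'(x) = sigma(x) * (1 - sigma(x)) against a numerical derivative.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # the closed-form derivative from the table

x, h = 0.5, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x)) / h
print(abs(numeric - sigmoid_grad(x)) < 1e-5)  # True: the identity holds
```

This reuse of $\sigma(x)$ itself is why the sigmoid derivative is cheap to compute during backpropagation: the forward-pass output is all you need.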

Key rules

Sum rule: $\frac{d}{dx}[f(x) + g(x)] = f'(x) + g'(x)$

Product rule: $\frac{d}{dx}[f(x)g(x)] = f'(x)g(x) + f(x)g'(x)$

Chain rule: $\frac{d}{dx}[f(g(x))] = f'(g(x)) \cdot g'(x)$ (covered fully in its own lesson).
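As a sanity check, the product rule can be confirmed numerically. The example pair $f(x) = x^2$, $g(x) = \sin x$ is an illustrative choice:

```python
import math

# Numerically sanity-check the product rule on f(x) = x**2, g(x) = sin(x).

def product(x):
    return x ** 2 * math.sin(x)

def product_rule(x):
    # f'(x)g(x) + f(x)g'(x), with f'(x) = 2x and g'(x) = cos(x)
    return 2 * x * math.sin(x) + x ** 2 * math.cos(x)

x, h = 1.2, 1e-6
numeric = (product(x + h) - product(x)) / h
print(abs(numeric - product_rule(x)) < 1e-4)  # True
```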

Finding minima

A critical property: at a local minimum or maximum, the derivative is zero ($f'(x) = 0$). The function is momentarily flat. This is the mathematical foundation for optimization: to find where a function is minimized, find where its derivative is zero.

For a function of one variable, you check the sign of the second derivative $f''(x)$ to distinguish minima from maxima:

  • $f''(x) > 0$: local minimum (the function is concave up, like a bowl).
  • $f''(x) < 0$: local maximum (concave down, like a hill).
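The second-derivative test can be sketched numerically with a central-difference approximation of $f''(x)$. The bowl $f(x) = x^2$ and the hill $g(x) = -x^2$, both flat at $x = 0$, are example functions chosen for illustration:

```python
# Second-derivative test at a flat point, using a central-difference estimate.

def second_derivative(f, x, h=1e-4):
    """Central-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

bowl = lambda x: x ** 2    # f'(0) = 0 and f''(0) = 2 > 0
hill = lambda x: -x ** 2   # f'(0) = 0 and f''(0) = -2 < 0

print(second_derivative(bowl, 0.0) > 0)  # True → local minimum
print(second_derivative(hill, 0.0) < 0)  # True → local maximum
```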

In ML, we do not solve $f'(x) = 0$ analytically because the functions are too complex. Instead, we use gradient descent: repeatedly step in the direction opposite the derivative (downhill) until the derivative is near zero.
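A minimal gradient-descent sketch makes this concrete. The objective $f(x) = (x - 3)^2$, the learning rate, and the step count are all illustrative choices:

```python
# Minimal gradient descent on f(x) = (x - 3)**2, whose minimum is at x = 3.

def grad(x):
    return 2 * (x - 3)  # derivative of (x - 3)**2

x = 0.0     # starting point
lr = 0.1    # learning rate (step size)
for _ in range(100):
    x -= lr * grad(x)  # step opposite the derivative (downhill)

print(round(x, 4))  # 3.0 — converged to the minimizer
```

Each update shrinks the distance to the minimum by a constant factor here; for real loss functions the same rule is applied to the gradient, the vector of partial derivatives.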