Perceptron


The origin

The perceptron, proposed by Frank Rosenblatt in 1958, is the simplest possible model of a neuron and the historical ancestor of modern neural networks. Understanding it builds the foundation for everything that follows.

What a perceptron does

A perceptron takes n binary or continuous inputs x_1, x_2, \ldots, x_n, computes a weighted sum, and outputs a binary decision:

\hat{y} = \begin{cases} 1 & \text{if } \sum_{j=1}^n w_j x_j + b > 0 \\ 0 & \text{otherwise} \end{cases}

The parameters are the weights w_j (one per input) and the bias b. The weights control how much each input influences the decision; the bias shifts the threshold.

This is essentially a linear classifier — it draws a hyperplane through the input space and assigns each side a class label.
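The decision rule above fits in a few lines of Python. This is a minimal sketch; the function name and the weight and bias values are invented for illustration:

```python
def perceptron_predict(x, w, b):
    """Output 1 if the weighted sum of the inputs plus bias is positive, else 0."""
    weighted_sum = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if weighted_sum > 0 else 0

# Illustrative parameters: two inputs with hand-picked weights.
w = [0.5, -0.5]
b = 0.1
print(perceptron_predict([1, 0], w, b))  # weighted sum 0.5 + 0.1 = 0.6 > 0, so 1
print(perceptron_predict([0, 1], w, b))  # weighted sum -0.5 + 0.1 = -0.4, so 0
```

The sign of the weighted sum tells you which side of the hyperplane w \cdot x + b = 0 the input lies on.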

The perceptron learning rule

The perceptron has a simple learning algorithm:

  1. Initialize all weights to zero (or small random values).
  2. For each training example, make a prediction.
  3. If the prediction is correct, do nothing.
  4. If the prediction is wrong, update weights:
    • If predicted 0, true is 1: add the input to the weights (w_j \leftarrow w_j + \alpha x_j).
    • If predicted 1, true is 0: subtract the input (w_j \leftarrow w_j - \alpha x_j).
  5. Repeat until all examples are classified correctly (or a maximum number of iterations).
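The steps above can be sketched in Python. The function name and signature are illustrative, not from any library; note that both update cases collapse into a single expression, since y - \hat{y} is +1 or -1 exactly when the prediction is wrong:

```python
def perceptron_train(data, n_features, alpha=1.0, max_epochs=100):
    """Train by the perceptron rule: update the weights only on mistakes.

    data: list of (inputs, label) pairs with binary labels 0/1.
    """
    w = [0.0] * n_features  # step 1: initialize weights to zero
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            # step 2: predict with the hard threshold
            y_hat = 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0
            if y_hat != y:
                # steps 3-4: y - y_hat is +1 (add the input) or -1 (subtract it)
                delta = alpha * (y - y_hat)
                w = [wj + delta * xj for wj, xj in zip(w, x)]
                b += delta
                mistakes += 1
        if mistakes == 0:  # step 5: stop once every example is classified correctly
            break
    return w, b

# AND is linearly separable, so the rule converges to a separating hyperplane.
and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = perceptron_train(and_data, n_features=2)
```

Treating the bias as a weight on a constant input of 1, as done here with the separate `b += delta` update, is the standard trick for handling it uniformly.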

The perceptron convergence theorem guarantees that if the training data is linearly separable, this algorithm will find a separating hyperplane in a finite number of steps.

The fundamental limitation

The perceptron can only learn linearly separable problems. The most famous example of its failure is XOR: the two classes (input pairs that produce 0 and pairs that produce 1) cannot be separated by any straight line. Minsky and Papert proved this limitation formally in 1969, which temporarily dampened enthusiasm for neural networks.
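The XOR impossibility follows directly from the decision rule. Suppose some weights w_1, w_2 and bias b computed XOR. The four truth-table rows would require:

b \le 0, \quad w_2 + b > 0, \quad w_1 + b > 0, \quad w_1 + w_2 + b \le 0

Adding the two strict inequalities gives w_1 + w_2 + 2b > 0, while adding the other two gives w_1 + w_2 + 2b \le 0. No choice of weights satisfies both, so no single perceptron computes XOR.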

The solution — combining multiple perceptrons in layers — took another decade to develop fully.

The perceptron vs. logistic regression

A perceptron is a hard threshold classifier: it outputs 0 or 1. Logistic regression uses the sigmoid function instead of a hard threshold, producing a probability and allowing gradient-based learning. Modern neural networks use logistic regression-style units (soft threshold, continuous output) rather than true perceptrons.
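The contrast is easy to see side by side. A brief sketch (the function names are ours, not from a library):

```python
import math

def hard_threshold(z):
    """Perceptron activation: a step at zero, outputting a hard 0 or 1."""
    return 1 if z > 0 else 0

def sigmoid(z):
    """Logistic activation: a smooth, differentiable 'soft' threshold."""
    return 1.0 / (1.0 + math.exp(-z))

# Near the decision boundary, the sigmoid expresses uncertainty
# while the step function commits fully to one class.
print(hard_threshold(0.1))  # 1
print(sigmoid(0.1))         # about 0.525: barely above 50/50
```

The smoothness is what matters for learning: the sigmoid has a well-defined derivative everywhere, so a loss built on it can be minimized by gradient descent, whereas the step function's derivative is zero almost everywhere.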

The perceptron's learning rule is also different from gradient descent: it only updates when the prediction is wrong, rather than computing a gradient from a smooth loss. This makes it less flexible and harder to generalize.

Why it still matters

The perceptron establishes the core structure that persists through all of deep learning: a unit takes weighted inputs, applies a function, and produces an output. That unit is the building block of every neural network. The key innovation of deep learning is stacking many such units in layers and learning the weights with gradient descent — but the unit itself has not fundamentally changed since 1958.