CNN basics
The problem with fully connected networks on images
A 256×256 color image has 256 × 256 × 3 = 196,608 input values. A fully connected hidden layer with 1,000 neurons would therefore require nearly 200 million weights for the first layer alone — impractical to train and massively prone to overfitting.
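The arithmetic behind that claim is worth making explicit:

```python
# Weight count for a fully connected first layer on a 256x256 RGB image.
inputs = 256 * 256 * 3      # 196,608 input values (one per pixel per channel)
neurons = 1_000
weights = inputs * neurons  # every neuron connects to every input
print(weights)              # 196,608,000 -- nearly 200 million weights
```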
More fundamentally, fully connected networks throw away the structure of images. A cat is a cat whether it appears in the top-left or bottom-right of the image, whether it is slightly rotated or slightly larger. Fully connected networks must relearn the concept of "cat" for every position independently. There is no sharing of knowledge across locations.
Convolutional Neural Networks (CNNs) exploit the structure of images — and any spatially organized data — to learn far more efficiently.
The key ideas: locality and weight sharing
Locality: features in images are local. An edge, a corner, or a texture is defined by a small region of neighboring pixels, not by pixels across the entire image. A neuron only needs to look at a small local patch to detect such a feature.
Weight sharing (translation equivariance): the same feature detector — the same edge detector, the same curve detector — is useful everywhere in the image. CNNs use the same set of weights (a filter or kernel) at every position, scanning across the image. This dramatically reduces the number of parameters and means the network automatically learns features that work regardless of where in the image they appear.
Convolution
A filter is a small weight matrix, typically 3×3 or 5×5. It is slid across the input image, computing a dot product between the filter weights and the local patch of pixels at each position. The result is a 2D feature map — each value represents how strongly that filter's pattern was detected at that location.
A single filter detects one type of pattern. A convolutional layer applies many filters simultaneously, producing a stack of feature maps — one per filter. The network learns what filters to use during training.
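This sliding dot product can be sketched in a few lines of NumPy (a deliberately naive loop, stride 1 and no padding; real frameworks use optimized implementations). The vertical-edge filter below is a hypothetical example of the kind of pattern detector a network might learn:

```python
import numpy as np

def conv2d(image, filters):
    """Slide each filter over the image (stride 1, no padding),
    taking a dot product with the local patch at every position.

    image:   (H, W) array
    filters: (n, k, k) array of n square filters
    returns: (n, H-k+1, W-k+1) stack of feature maps, one per filter
    """
    n, k, _ = filters.shape
    H, W = image.shape
    out = np.zeros((n, H - k + 1, W - k + 1))
    for f in range(n):
        for i in range(H - k + 1):
            for j in range(W - k + 1):
                patch = image[i:i + k, j:j + k]
                out[f, i, j] = np.sum(patch * filters[f])
    return out

# A vertical-edge detector: responds where values change left-to-right.
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)
edge = np.array([[[-1, 1],
                  [-1, 1]]], dtype=float)
maps = conv2d(image, edge)
print(maps[0])  # each row reads [0. 2. 0.]: strong response only at the edge
```

Note that the same filter weights are reused at every position — this is the weight sharing described above, and it is why the edge is detected wherever it appears.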
Pooling
After a convolutional layer, pooling reduces the spatial dimensions of the feature maps. Max pooling takes the maximum activation in each small region (typically 2×2 with stride 2), halving the height and width.
Pooling provides:
- Spatial invariance: if a feature is detected at any position within the pooling region, the output is the same — small translations do not change the result.
- Computational efficiency: smaller feature maps reduce computation in later layers.
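A minimal sketch of 2×2 max pooling, again in plain NumPy (non-overlapping blocks, stride equal to the block size):

```python
import numpy as np

def max_pool(fmap, size=2):
    """Max pooling: keep the strongest activation in each
    non-overlapping size x size block, shrinking height and width."""
    H, W = fmap.shape
    out = np.zeros((H // size, W // size))
    for i in range(H // size):
        for j in range(W // size):
            block = fmap[i*size:(i+1)*size, j*size:(j+1)*size]
            out[i, j] = block.max()
    return out

fmap = np.array([[1, 3, 0, 2],
                 [4, 2, 1, 0],
                 [0, 1, 5, 6],
                 [2, 3, 7, 8]], dtype=float)
pooled = max_pool(fmap)
print(pooled)  # [[4. 2.]
               #  [3. 8.]]
```

Shifting a strong activation by one pixel within a block leaves the pooled output unchanged — the spatial invariance mentioned above.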
The CNN architecture
A typical CNN stacks:
- Convolutional layer → detects local features.
- Activation (ReLU) → introduces non-linearity.
- Pooling → reduces spatial size, increases invariance.
- Repeat — each successive block detects more abstract features.
- Flatten → converts the final feature maps into a vector.
- Fully connected layers → combine all detected features to make the final prediction.
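To see how the stack shrinks the spatial dimensions step by step, here is a shape trace through a hypothetical small CNN (the input size, filter sizes, and filter counts below are illustrative choices, not from the text):

```python
# Shape trace through a small CNN on a 32x32 grayscale input.
def conv_out(size, k):  # conv with a k x k filter, stride 1, no padding
    return size - k + 1

def pool_out(size):     # 2x2 max pooling with stride 2
    return size // 2

h = w = 32
h, w = conv_out(h, 5), conv_out(w, 5)  # conv 5x5, 8 filters  -> 28x28x8
h, w = pool_out(h), pool_out(w)        # pool                 -> 14x14x8
h, w = conv_out(h, 5), conv_out(w, 5)  # conv 5x5, 16 filters -> 10x10x16
h, w = pool_out(h), pool_out(w)        # pool                 -> 5x5x16
flat = h * w * 16                      # flatten -> vector of 400 values
print(flat)                            # 400, fed to the fully connected layers
```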
The early layers detect low-level features (edges, colors, textures). Middle layers detect intermediate patterns (corners, shapes). Later layers detect high-level concepts (eyes, faces, objects). This hierarchy is learned automatically from data.
Why CNNs dominate visual tasks
CNNs have three structural advantages over fully connected networks for images:
- Parameter efficiency: weight sharing means many fewer parameters for the same capacity.
- Spatial structure: filters explicitly exploit local spatial relationships.
- Hierarchical representations: stacked convolutions build from simple to complex features naturally.
CNNs extended naturally from images to any data with a spatial or temporal grid structure — 1D CNNs for audio and text, 3D CNNs for video. They powered the deep learning revolution in computer vision from 2012 onward.