SVM (margin intuition, kernel trick)
The margin intuition
A Support Vector Machine (SVM) is a classifier that finds the decision boundary with the largest possible gap — the margin — between the two classes.
Many boundaries can separate two linearly separable classes. Logistic regression picks one based on maximum likelihood. The SVM picks the one that is furthest from the nearest points of each class. The idea: a wider margin means the boundary is less likely to be disrupted by small changes in the data or new examples near the boundary.
The boundary is a hyperplane ($w^\top x + b = 0$). The margin is the perpendicular distance from the hyperplane to the nearest training examples on each side. Those nearest examples — the ones that would change the boundary if moved — are the support vectors. Every other training point is irrelevant; the solution depends only on the support vectors.
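A minimal sketch of this, assuming scikit-learn: fit a linear SVM on two well-separated blobs and inspect which training points the solution actually depends on.

```python
# Fit a linear SVM and list the support vectors (scikit-learn assumed;
# the data here is synthetic and purely illustrative).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs, 20 points each.
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.support_)  # indices of the support vectors
print(len(clf.support_), "of", len(X), "points define the boundary")
```

Only the handful of points listed in `clf.support_` determine the fitted boundary; deleting any of the others and refitting would leave it unchanged.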
The optimization problem
Maximizing the margin is equivalent to minimizing $\tfrac{1}{2}\|w\|^2$ subject to all training examples satisfying $y_i(w^\top x_i + b) \ge 1$. This is a convex quadratic program with a unique global solution — no local minima.
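A quick numerical check of the geometry, assuming scikit-learn: at the optimum the support vectors sit at functional margin 1, so the margin width equals $2/\|w\|$. The sketch approximates a hard margin with a large C and compares the formula against the measured distances.

```python
# Verify that the fitted margin width equals 2 / ||w||.
# Toy points; a very large C approximates the hard-margin problem.
import numpy as np
from sklearn.svm import SVC

X = np.array([[-2.0, 0.0], [-1.5, 1.0], [2.0, 0.0], [1.5, -1.0]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

margin_from_w = 2 / np.linalg.norm(w)
# Perpendicular distance of each point to the hyperplane w·x + b = 0.
dist = np.abs(X @ w + b) / np.linalg.norm(w)
print(margin_from_w, 2 * dist.min())  # the two should agree
```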
Soft margins and the C parameter
Real data is rarely cleanly separable. The soft-margin SVM allows some examples to violate the margin, penalizing violations by a parameter $C$:
- Large $C$: violations are costly — the model tries hard to classify everything correctly, resulting in a narrower margin. High variance, low bias.
- Small $C$: violations are tolerated — the model prioritizes a wide margin. Higher bias, lower variance.
$C$ is the primary hyperparameter to tune, always via cross-validation.
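Tuning C by cross-validation can be sketched with scikit-learn's GridSearchCV (the grid values here are illustrative, not a recommendation):

```python
# Select C by 5-fold cross-validation over a logarithmic grid.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
    cv=5,  # 5-fold cross-validation
)
search.fit(X, y)
print(search.best_params_["C"], search.best_score_)
```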
The kernel trick
A linear SVM can only learn a straight boundary. For non-linearly separable data, the kernel trick allows the SVM to operate in a higher-dimensional feature space — where a linear boundary might separate the classes — without explicitly computing the transformation.
The SVM's predictions depend only on dot products between training examples: $x_i^\top x_j$. A kernel function $K(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$ computes the dot product in a transformed space without ever building that space explicitly.
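A small concrete instance of the trick: for the kernel $K(x, z) = (x^\top z)^2$ on 2-D inputs, the implicit map is $\phi(x) = (x_1^2,\, x_2^2,\, \sqrt{2}\,x_1 x_2)$, and the kernel value matches the explicit dot product exactly.

```python
# The kernel (x·z)^2 equals a dot product in a 3-D feature space
# without that space ever being constructed.
import numpy as np

def phi(x):
    # Explicit degree-2 feature map for 2-D input.
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

kernel = (x @ z) ** 2        # never builds the feature space
explicit = phi(x) @ phi(z)   # builds it
print(kernel, explicit)      # → 16.0 for both
```

The saving grows with dimension: for the RBF kernel the feature space is infinite-dimensional, so the explicit route is not even possible.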
Common kernels:
- Linear: $K(x_i, x_j) = x_i^\top x_j$ — no transformation. Use for high-dimensional data (text).
- RBF (Gaussian): $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$ — corresponds to an infinite-dimensional space. The most versatile kernel; controlled by $\gamma$ (width of influence per support vector).
- Polynomial: $K(x_i, x_j) = (x_i^\top x_j + r)^d$ — captures polynomial feature interactions up to degree $d$.
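The kernel choice matters in practice. A sketch, assuming scikit-learn: concentric circles are not linearly separable, so a linear kernel fails where the RBF kernel succeeds.

```python
# Linear vs RBF kernel on data with a circular class boundary.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma=1.0).fit(X, y)

# Training accuracy: the linear boundary cannot do much better than
# chance here, while the RBF boundary wraps around the inner class.
print(linear.score(X, y), rbf.score(X, y))
```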
Practical considerations
- Feature scaling is mandatory for RBF and polynomial kernels. Distance-based kernels are dominated by features with large numerical ranges.
- SVMs scale poorly to large datasets — training is $O(n^2)$ to $O(n^3)$ in the number of examples. For more than ~100k examples, use LinearSVC or switch to logistic regression.
- SVMs do not natively produce probabilities. Platt scaling (fitting a logistic regression on the output scores) adds calibration at extra cost.
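The scaling and calibration points above can be combined in one sketch, assuming scikit-learn: a Pipeline makes the mandatory feature scaling part of the model, and `probability=True` enables Platt scaling (fit via an internal cross-validation, hence the extra training cost).

```python
# StandardScaler + SVC in a Pipeline, with Platt-scaled probabilities.
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

model = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
model.fit(X, y)

proba = model.predict_proba(X[:5])  # calibrated class probabilities
print(proba)                        # rows sum to 1
```

Putting the scaler inside the pipeline also prevents a common leak: the scaler is fit only on the training folds during cross-validation.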
SVMs are most useful for small-to-medium datasets with non-linear structure, and for high-dimensional sparse data (text) with a linear kernel.