⚡ A Support Vector Machine (SVM) finds the optimal decision boundary between classes — specifically the hyperplane that maximises the gap (margin) between class regions. It dominated classification before deep learning and remains competitive for high-dimensional small-dataset problems. The kernel trick extends it to non-linear boundaries without expensive transformations.
Category: Machine Learning · Difficulty: Intermediate · Last updated: 15 May 2026 · 4 min read
SVM — What It Is, How Maximum Margin Classification Works & Where It Still Beats Neural Networks
What is SVM?
Imagine two groups of data points plotted on a graph — blue dots and red dots. Many straight lines could separate them. Which one should you choose? SVMs answer this with mathematical elegance: choose the line that maximises the margin — the gap between the line and the nearest points of each class.
A wider margin leaves more room for error. Points far from the boundary are easy to classify, while points close to a narrow-margin boundary can be flipped to the wrong side by noise or small changes in the input. Maximising the margin therefore gives the separator that is most robust to such perturbations, which tends to generalise better to new data.
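To make the margin idea concrete, here is a minimal sketch that measures the margin of two candidate lines on a toy dataset. The points and both lines are made up for illustration; the helper `margin` is a hypothetical name, not part of any library.

```python
import math

# Toy 2D data: (x1, x2, label) with labels -1 (red) and +1 (blue).
points = [
    (1.0, 1.0, -1), (2.0, 1.0, -1), (1.0, 2.0, -1),
    (4.0, 4.0, +1), (5.0, 4.0, +1), (4.0, 5.0, +1),
]

def margin(w, b, data):
    """Smallest signed distance from any point to the line w.x + b = 0.

    A negative result would mean at least one point is on the wrong side.
    """
    norm = math.hypot(w[0], w[1])
    return min(y * (w[0] * x1 + w[1] * x2 + b) / norm for x1, x2, y in data)

# Two hypothetical candidate lines that both separate the classes.
line_a = ((1.0, 1.0), -5.5)  # x1 + x2 = 5.5, midway between the clusters
line_b = ((1.0, 0.0), -2.5)  # x1 = 2.5, passes close to the red cluster

print("margin of line A:", margin(*line_a, points))
print("margin of line B:", margin(*line_b, points))
```

Both lines classify every point correctly, but line A keeps the nearest point much further away, so it is the one a maximum-margin classifier would prefer.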
Introduced in their modern soft-margin form by Cortes and Vapnik in 1995, SVMs were the dominant classification method for roughly a decade, used in text classification, image recognition, bioinformatics, and financial prediction. Deep learning eventually outperformed SVMs on large datasets, but SVMs remain a strong choice when data is scarce, dimensions are high, or interpretability matters.
How SVM works
- Represent each training example as a vector in feature space.
- Find the hyperplane (line in 2D, plane in 3D, hyperplane in higher dimensions) that separates the two classes.
- Specifically, find the hyperplane that maximises the margin — the distance between the hyperplane and the nearest data point of each class.
- The points on the margin edge are support vectors — only these define the hyperplane.
- For non-linearly separable data: apply a kernel function that implicitly maps data to a higher-dimensional space where linear separation is possible.
- Soft margin SVM allows some misclassifications (controlled by regularisation parameter C) — trading perfect training accuracy for better generalisation.
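The steps above can be sketched in plain Python. This is one simple way to train a linear soft-margin SVM, namely subgradient descent on the hinge loss; production libraries typically use dedicated solvers instead. The toy data, the learning rate, the epoch count, and the 1.15 tolerance for spotting support vectors are all assumptions chosen for illustration.

```python
# Toy data: (features, label) with labels in {-1, +1}.
data = [
    ((1.0, 1.0), -1), ((2.0, 1.0), -1), ((1.0, 2.0), -1),
    ((4.0, 4.0), +1), ((5.0, 4.0), +1), ((4.0, 5.0), +1),
]

def decision(w, b, x):
    """Signed score w.x + b; its sign is the predicted class."""
    return w[0] * x[0] + w[1] * x[1] + b

def train_linear_svm(data, C=1.0, lr=0.001, epochs=20000):
    """Minimise 0.5*||w||^2 + C * sum(max(0, 1 - y*(w.x + b)))
    with full-batch subgradient descent."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        # The regulariser contributes w itself to the gradient.
        grad_w, grad_b = [w[0], w[1]], 0.0
        for x, y in data:
            # Only points inside the margin (or misclassified) incur hinge loss.
            if y * decision(w, b, x) < 1.0:
                grad_w[0] -= C * y * x[0]
                grad_w[1] -= C * y * x[1]
                grad_b -= C * y
        w = [w[0] - lr * grad_w[0], w[1] - lr * grad_w[1]]
        b -= lr * grad_b
    return w, b

w, b = train_linear_svm(data)
predictions = [1 if decision(w, b, x) >= 0 else -1 for x, _ in data]

# Points whose functional margin is close to 1 sit on the margin edge.
support_vectors = [x for x, y in data if y * decision(w, b, x) <= 1.15]

print("w =", w, "b =", b)
print("predictions:", predictions)
print("approximate support vectors:", support_vectors)
```

A larger C penalises margin violations more heavily and pushes the model towards fitting the training data exactly; a smaller C tolerates more violations in exchange for a wider margin.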
Real-world examples
Not theory — what real teams actually shipped using this technique.
- Text classification with TF-IDF — SVMs on TF-IDF features were the state of the art for spam filtering, news categorisation, and sentiment analysis for much of the 2000s. The high dimensionality of text (vocabulary size) is exactly where SVMs excel.
- Bioinformatics — gene expression classification (which of these cancer types does this gene expression profile indicate?) typically involves thousands of features and hundreds of samples — the high-dimension, small-sample regime where SVMs often outperform neural networks.
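- Face detection — early face and pedestrian detectors used SVMs before CNNs became practical, for example on raw pixel windows (Osuna et al., 1997) and on HOG features (Dalal & Triggs, 2005). The Haar-feature detectors that shipped in many digital cameras used boosted cascades (Viola & Jones) rather than SVMs.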
Common pitfalls
- Kernel choice — performance is sensitive to which kernel and kernel parameters you choose. Poor choices produce poor boundaries. Grid search over kernel parameters is standard but computationally expensive.
- Scaling — SVMs require feature normalisation. Features on different scales produce distorted distance calculations that degrade margin quality significantly.
- Slow on large datasets — SVM training scales roughly O(n²) to O(n³) with training set size. For millions of examples, training is impractically slow. Linear SVMs (no kernel) are faster and often sufficient for text.
- Probability estimates — SVMs do not natively produce probabilities, only class labels and uncalibrated decision scores (signed distances from the boundary). Platt scaling fits a sigmoid to those scores to produce probabilities, but it is a post-hoc approximation.
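The scaling pitfall above is easy to reproduce. The sketch below evaluates an RBF kernel on hypothetical samples with one small-scale and one large-scale feature, before and after standardisation. The sample values, the `gamma` setting, and the helper names are illustrative assumptions.

```python
import math
import statistics

def rbf(x, z, gamma=0.5):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

# Hypothetical samples: (height in metres, income in dollars).
samples = [(1.60, 30000.0), (1.90, 30500.0), (1.61, 90000.0), (1.75, 60000.0)]

def standardise(rows):
    """Rescale each column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]

scaled = standardise(samples)

# Unscaled: income differences dominate the distance, so the kernel
# value collapses towards zero for every pair of distinct samples.
print("unscaled:", rbf(samples[0], samples[1]), rbf(samples[0], samples[2]))

# Scaled: both features contribute, and similarities become informative.
print("scaled:  ", rbf(scaled[0], scaled[1]), rbf(scaled[0], scaled[2]))
```

With unscaled inputs every pair looks equally dissimilar, which leaves the kernel with almost no information to work with. In practice the scaling statistics should be computed on the training set only and then reused for new data.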
Frequently asked questions
QUESTION 1 What is an SVM in simple terms?
ANSWER 1 An algorithm that finds the widest possible gap (margin) between two groups — the decision boundary that is furthest from the nearest points of each class.
QUESTION 2 What is the kernel trick?
ANSWER 2 Computing similarities as if data were in a higher dimension — enabling non-linear boundaries without expensive explicit transformation.
QUESTION 3 What are support vectors?
ANSWER 3 The training examples closest to the decision boundary — the only ones that define it. Remove any other point and the boundary stays the same.
QUESTION 4 When should you use an SVM today?
ANSWER 4 High-dimensional, small-sample problems — text classification, bioinformatics — where SVMs often outperform neural networks and train far faster.
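The kernel trick from the answers above can be checked numerically. This minimal sketch uses a degree-2 polynomial kernel as one example (other kernels, such as RBF, work on the same principle) and shows that evaluating it in the original 2D space gives the same number as an explicit mapping to 3D followed by a dot product. The function names and sample vectors are illustrative.

```python
import math

def poly_kernel(x, z):
    """Degree-2 homogeneous polynomial kernel (x . z)^2, computed in 2D."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """Explicit feature map into 3D whose dot product equals poly_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

x, z = (1.0, 2.0), (3.0, -1.0)

kernel_value = poly_kernel(x, z)      # stays in the original 2D space
explicit_value = dot(phi(x), phi(z))  # maps to 3D first, then multiplies

print(kernel_value, explicit_value)
```

The two values agree up to floating-point rounding. For higher degrees or the RBF kernel the explicit feature space becomes very large or infinite-dimensional, while the kernel evaluation stays cheap, which is why the trick matters.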
Sources & further reading
- Cortes & Vapnik (1995). Support-Vector Networks. Machine Learning — the original SVM paper.
- Schölkopf & Smola (2002). Learning with Kernels. MIT Press — comprehensive kernel methods reference.
- Hastie, Tibshirani & Friedman (2009). The Elements of Statistical Learning. Chapter 12: Support Vector Machines. Free at web.stanford.edu/~hastie/ElemStatLearn/
- Scikit-learn SVM documentation: scikit-learn.org/stable/modules/svm.html — practical guide with examples.