⚡ Knowledge distillation trains a small student model to mimic a large teacher model: not just copying final answers, but learning from the teacher’s probability distributions across all possible outputs. The result is a compact, fast model that retains most of the large model’s performance. DistilBERT, for example, is 40% smaller and 60% faster than BERT while retaining about 97% of its language-understanding performance on the GLUE benchmark.
Category: MLOps · Difficulty: Intermediate · Last updated: 15 May 2026 · 4 min read
Knowledge Distillation — How Small Models Learn to Be Surprisingly Smart by Watching Large Ones
What is Knowledge Distillation?
A master craftsperson trains an apprentice not by handing them a list of rules but by working alongside them — demonstrating technique, showing what almost-right looks like, explaining why one approach is better than another. The apprentice absorbs far more from watching the master’s process than from reading a manual of correct answers.
Knowledge distillation works the same way. A large, capable teacher model has learned rich internal representations from extensive training. A small student model learns not from the raw training labels but from the teacher’s outputs — specifically the soft probability distributions the teacher produces over all possible classes.
These soft distributions carry information that hard labels hide. When a teacher says “cat: 0.85, fox: 0.12, dog: 0.03,” it reveals that cats and foxes share features that distinguish them from dogs. The student learns this relational structure — this “dark knowledge” as Geoffrey Hinton called it — and builds a small but surprisingly capable model.
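To make the point concrete, here is a minimal Python sketch using the cat/fox/dog numbers above. The two students (`student_a` and `student_b`) are hypothetical: both assign 0.85 to cat, but only the first ranks fox above dog as the teacher does. The hard label cannot tell them apart, while the teacher’s soft distribution can.

```python
import math

def cross_entropy(target, predicted):
    """H(target, predicted) = -sum_i target_i * log(predicted_i)."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

# Class order: cat, fox, dog
hard_label = [1.0, 0.0, 0.0]        # one-hot label: "this is a cat"
teacher_soft = [0.85, 0.12, 0.03]   # teacher distribution from the text

# Two hypothetical students that agree on "cat" but split the rest differently.
student_a = [0.85, 0.12, 0.03]      # fox > dog, like the teacher
student_b = [0.85, 0.03, 0.12]      # dog > fox, the reverse ordering

hard_a = cross_entropy(hard_label, student_a)
hard_b = cross_entropy(hard_label, student_b)
soft_a = cross_entropy(teacher_soft, student_a)
soft_b = cross_entropy(teacher_soft, student_b)

print(f"hard-label loss: A={hard_a:.4f}  B={hard_b:.4f}")
print(f"soft-label loss: A={soft_a:.4f}  B={soft_b:.4f}")
```

The hard-label losses are identical because a one-hot target only looks at the probability assigned to cat. The soft-label loss is lower for student A, so training against the teacher’s distribution pushes the student toward the teacher’s view of which classes resemble each other.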
How Knowledge Distillation works
- Train a large, high-accuracy teacher model on the full training dataset.
- Use the teacher to generate soft predictions (probability distributions) over all classes for every training example.
- Train the student model on a combined loss: standard cross-entropy loss with hard labels + distillation loss measuring how closely the student’s soft predictions match the teacher’s soft predictions.
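- A temperature parameter T (greater than 1) softens the distributions: dividing the logits by T before the softmax makes low-probability classes more visible to the student. The same T is applied to both teacher and student logits when computing the distillation loss.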
- The student, often several times smaller than the teacher, learns to approximate the teacher’s behaviour across the full output distribution.
- The trained student model is deployed — smaller, faster, cheaper, and close to the teacher in accuracy.
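The steps above can be sketched in plain Python for a single training example. This is an illustrative sketch rather than a reference implementation: the logits, the default `temperature=2.0`, and the weighting `alpha=0.5` are assumed values, and multiplying the soft term by T squared follows the convention in Hinton et al.’s distillation paper. In a real training loop you would compute the same quantities on batches with a framework such as PyTorch or JAX.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by the temperature (numerically stable)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, label_index,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of hard-label cross-entropy and softened KL divergence.

    alpha and temperature are assumed hyperparameters, not values from the text.
    """
    # Hard-label term: ordinary cross-entropy at temperature 1.
    student_probs = softmax(student_logits)
    hard_loss = -math.log(student_probs[label_index])

    # Soft term: both models are softened with the same temperature.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    # T^2 keeps the soft-term gradients on a scale comparable to the hard term.
    soft_loss = kl_divergence(teacher_soft, student_soft) * temperature ** 2

    return alpha * hard_loss + (1 - alpha) * soft_loss

# Assumed example logits for the classes (cat, fox, dog).
teacher_logits = [4.0, 2.0, 0.5]
student_logits = [2.5, 0.5, 1.0]

loss = distillation_loss(student_logits, teacher_logits, label_index=0)
loss_perfect = distillation_loss(teacher_logits, teacher_logits, label_index=0)
print(f"student loss: {loss:.4f}")
print(f"loss if student matched teacher exactly: {loss_perfect:.4f}")
```

When the student’s logits equal the teacher’s, the soft term vanishes and only the hard-label term remains, which is why the second loss is smaller.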
Real-world examples
Not theory — what real teams actually shipped using this technique.
- DistilBERT (Hugging Face, 2019): a 6-layer student distilled from 12-layer BERT. 40% smaller, 60% faster at inference, and about 97% of BERT’s GLUE benchmark score. A widely used choice for production NLP at scale.
- Apple distils large models into small ones for on-device ML on iPhone — the Siri voice model, autocorrect, and photo recognition all use distilled models small enough to run on the Neural Engine without cloud inference.
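- Google released PaLM 2 in four sizes for different deployment contexts (Gecko, Otter, Bison, and Unicorn, from smallest to largest), each tuned for a different tradeoff of capability versus compute cost. Unicorn is the largest of the four, so only the smaller sizes are candidates for distillation from a larger model.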
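The "40% smaller" figure for DistilBERT can be checked with simple arithmetic. The parameter counts below (about 66 million for DistilBERT and 110 million for BERT-base) are the approximate figures reported in the DistilBERT paper, rounded to the nearest million, so treat the result as approximate too.

```python
# Approximate parameter counts as reported for the two models.
BERT_BASE_PARAMS = 110_000_000
DISTILBERT_PARAMS = 66_000_000

# Fraction of parameters removed, and how many times smaller the student is.
size_reduction = 1 - DISTILBERT_PARAMS / BERT_BASE_PARAMS
compression_ratio = BERT_BASE_PARAMS / DISTILBERT_PARAMS

print(f"size reduction: {size_reduction:.0%}")
print(f"compression ratio: {compression_ratio:.2f}x")
```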
Common pitfalls
- Teacher quality ceiling: the student learns from the teacher’s outputs, so a poorly trained teacher produces poor distillation signals, and a student rarely ends up much better than its teacher. Invest in the teacher first.
- Task specificity: distillation works best when teacher and student are trained on the same task and data distribution. Distilling a general teacher for a specialised student task may not capture the relevant knowledge.
- Hyperparameter sensitivity: temperature T, the balance between hard-label loss and distillation loss, and the student architecture all significantly affect distillation quality. Expect to tune all three.
- Not a substitute for architecture search: distillation compresses an existing model’s knowledge but does not optimise the student architecture. Combining distillation with neural architecture search can give better results than either alone.
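Temperature is a reasonable place to start when tuning. As a rough illustration, the sketch below sweeps a few temperatures over one set of assumed teacher logits (chosen to give roughly the cat/fox/dog split from the earlier example at T = 1) and reports the entropy of the softened distribution. The candidate values in `TEMPERATURES` are arbitrary examples, not recommendations.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by the temperature (numerically stable)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

teacher_logits = [4.0, 2.0, 0.5]          # assumed logits for (cat, fox, dog)
TEMPERATURES = [1.0, 2.0, 5.0, 10.0]      # arbitrary sweep values

entropies = []
for t in TEMPERATURES:
    probs = softmax(teacher_logits, t)
    h = entropy(probs)
    entropies.append(h)
    rounded = [round(p, 3) for p in probs]
    print(f"T={t:>4}: probs={rounded} entropy={h:.3f} nats")
```

Entropy climbs toward its maximum of log(3) as T grows. Too low a temperature hides the small probabilities the student is supposed to learn from, while too high a temperature flattens the distribution until it says little about which classes resemble each other.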
Frequently asked questions
QUESTION 1 What is knowledge distillation in simple terms?
ANSWER 1 Training a small student model to mimic a large teacher — learning from the teacher’s probability distributions, not just the final labels. Like an apprentice watching a master work.
QUESTION 2 Why are soft predictions better than hard labels?
ANSWER 2 They reveal similarity structure — “cat 85%, fox 12%” tells the student cats and foxes share features. Hard labels hide this “dark knowledge.”
QUESTION 3 What is DistilBERT?
ANSWER 3 A distilled version of BERT: 40% smaller, 60% faster, and about 97% of BERT’s GLUE benchmark score. A standard choice when BERT quality is needed but deployment constraints prevent the full model.
QUESTION 4 When should you use knowledge distillation?
ANSWER 4 For edge and mobile deployment, strict latency requirements, high-scale serving cost reduction, and transferring large model capabilities into small task-specific models.