Knowledge distillation trains a small student model to mimic a large teacher model. The student does not just copy final answers; it learns from the teacher’s probability distributions across all possible outputs. The result is a compact, fast model that retains most of the large model’s performance. DistilBERT, for example, is 40% smaller and 60% faster than BERT while retaining 97% of its performance on the GLUE benchmark.

Category: MLOps · Difficulty: Intermediate · Last updated: 15 May 2026 · 4 min read


Knowledge Distillation — How Small Models Learn to Be Surprisingly Smart by Watching Large Ones

What is Knowledge Distillation?

A master craftsperson trains an apprentice not by handing them a list of rules but by working alongside them — demonstrating technique, showing what almost-right looks like, explaining why one approach is better than another. The apprentice absorbs far more from watching the master’s process than from reading a manual of correct answers.

Knowledge distillation works the same way. A large, capable teacher model has learned rich internal representations from extensive training. A small student model learns not only from the raw training labels but also from the teacher’s outputs, specifically the soft probability distributions the teacher produces over all possible classes.

These soft distributions carry information that hard labels hide. When a teacher says “cat: 0.85, fox: 0.12, dog: 0.03,” it reveals that cats and foxes share features that distinguish them from dogs. The student learns this relational structure — this “dark knowledge” as Geoffrey Hinton called it — and builds a small but surprisingly capable model.
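One rough way to see how much extra signal a soft distribution carries is to compare its entropy with that of a one-hot hard label. The sketch below reuses the cat/fox/dog numbers from the paragraph above; the `entropy` helper and the variable names are illustrative assumptions, not part of any library.

```python
import math

def entropy(dist):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return sum(p * math.log2(1 / p) for p in dist.values() if p > 0)

# Hard label: only says "this is a cat".
hard_label = {"cat": 1.0, "fox": 0.0, "dog": 0.0}

# Soft target: the teacher's distribution from the example above.
soft_target = {"cat": 0.85, "fox": 0.12, "dog": 0.03}

hard_bits = entropy(hard_label)
soft_bits = entropy(soft_target)

print(f"hard label entropy:  {hard_bits:.3f} bits")
print(f"soft target entropy: {soft_bits:.3f} bits")
```

The hard label has zero entropy, so it says nothing about how the wrong classes relate to each other. The soft target has non-zero entropy, and the way that probability mass is split between fox and dog is the similarity signal the student trains on.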

How Knowledge Distillation works

  1. Train a large, high-accuracy teacher model on the full training dataset.
  2. Use the teacher to generate soft predictions (probability distributions) over all classes for every training example.
  3. Train the student model on a combined loss: standard cross-entropy loss with hard labels + distillation loss measuring how closely the student’s soft predictions match the teacher’s soft predictions.
  4. A temperature parameter T softens the distributions used in the distillation loss. Dividing the logits of both teacher and student by T before softmax makes low-probability classes more visible to the student. At inference time the student runs with T = 1.
  5. The student, which can be anywhere from modestly smaller to many times smaller than the teacher, learns to approximate the teacher’s behaviour across the full output distribution.
  6. The trained student model is deployed — smaller, faster, cheaper, and close to the teacher in accuracy.
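The combined loss from steps 3 and 4 can be sketched in plain Python for a single training example. This is a minimal illustration, not a training loop: the logits, the default `temperature=2.0`, and the weighting `alpha=0.5` are assumed values, and the function names are hypothetical. The T² factor on the soft term follows the scaling suggested in Hinton et al.'s original formulation, which keeps the two terms on a comparable gradient scale.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (numerically stable)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """alpha * hard-label cross-entropy + (1 - alpha) * T^2 * KL(teacher || student)."""
    # Hard-label term: ordinary cross-entropy at temperature 1.
    hard_loss = -math.log(softmax(student_logits)[label])
    # Soft term: both distributions are softened with the same temperature.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    soft_loss = kl_divergence(teacher_soft, student_soft) * temperature ** 2
    return alpha * hard_loss + (1 - alpha) * soft_loss

# Hypothetical logits for three classes (cat, fox, dog); true label is cat.
teacher_logits = [4.0, 2.0, 0.5]
student_logits = [2.5, 0.5, 1.0]

loss = distillation_loss(student_logits, teacher_logits, label=0)
print(f"combined loss: {loss:.4f}")
```

In a real setup the same computation would run over batches inside a deep learning framework, with the teacher's logits either precomputed or produced on the fly with gradients disabled.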

Real-world examples

Not theory — what real teams actually shipped using this technique.

  • DistilBERT (Hugging Face, 2019): a 6-layer student distilled from 12-layer BERT. 40% smaller, 60% faster at inference, and 97% of BERT’s GLUE benchmark score. A common default for production NLP at scale.
  • Apple is reported to distil large models into small ones for on-device ML on iPhone. Features such as the Siri voice model, autocorrect, and photo recognition need models small enough to run on the Neural Engine without cloud inference.
  • Google offers PaLM 2 in four sizes (Gecko, Otter, Bison, and Unicorn, from smallest to largest), with the smaller variants reportedly distilled from larger ones. Each size is tuned for a different tradeoff of capability versus compute cost.

Common pitfalls

  • Teacher quality ceiling: in practice the student rarely ends up much better than the teacher. A poorly trained teacher produces poor distillation signals. Invest in the teacher first.
  • Task specificity: distillation works best when teacher and student are trained on the same task and data distribution. Distilling a general teacher for a specialised student task may not capture the relevant knowledge.
  • Hyperparameter sensitivity: temperature T, the balance between hard label loss and distillation loss, and the student architecture all significantly affect distillation quality. Tuning is required.
  • Not a substitute for architecture search: distillation compresses an existing model’s knowledge but does not optimise the student architecture. Combining distillation with neural architecture search can improve results further.
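To get a feel for the temperature sensitivity mentioned in the pitfalls above, the sketch below softens one set of hypothetical teacher logits at several values of T. The logits and the temperatures tried are assumptions chosen for illustration; suitable values depend on the task and are usually found by tuning.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (numerically stable)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for three classes (cat, fox, dog).
teacher_logits = [4.0, 2.0, 0.5]

rows = []
for temperature in (1.0, 2.0, 5.0, 10.0):
    probs = softmax(teacher_logits, temperature)
    rows.append((temperature, probs))
    print(f"T={temperature:>4}: " + ", ".join(f"{p:.3f}" for p in probs))
```

As T grows, the top class loses probability mass and the least likely class gains it. Too low a temperature hides the similarity structure; too high a temperature flattens the distribution until it carries little signal, which is why T is tuned together with the loss weighting.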

Frequently asked questions

QUESTION 1 What is knowledge distillation in simple terms?

ANSWER 1 Training a small student model to mimic a large teacher — learning from the teacher’s probability distributions, not just the final labels. Like an apprentice watching a master work.

QUESTION 2 Why are soft predictions better than hard labels?

ANSWER 2 They reveal similarity structure — “cat 85%, fox 12%” tells the student cats and foxes share features. Hard labels hide this “dark knowledge.”

QUESTION 3 What is DistilBERT?

ANSWER 3 A distilled version of BERT: 40% smaller, 60% faster, and retaining 97% of BERT’s GLUE benchmark score. A common choice when BERT quality is needed but deployment constraints rule out the full model.

QUESTION 4 When should you use knowledge distillation?

ANSWER 4 For edge and mobile deployment, strict latency requirements, high-scale serving cost reduction, and transferring large model capabilities into small task-specific models.

