Q: Why are soft predictions better than hard labels for training?

Hard labels say 'cat': one class, no nuance. Soft predictions say 'cat 85%, fox 12%, dog 3%', revealing which classes the teacher model considers similar. The 12% the teacher assigns to 'fox' tells the student that cats and foxes share features, which is richer information than a one-hot label carries. Geoffrey Hinton called this 'dark knowledge': information encoded in the relative probabilities of the wrong answers.

Q: What is DistilBERT?

DistilBERT is a distilled version of BERT created by Hugging Face. It retains 97% of BERT's performance on the GLUE language-understanding benchmark while being 40% smaller and 60% faster, achieved by training a 6-layer student on the outputs of the 12-layer BERT teacher. It is a common choice when BERT-level accuracy is needed but latency and memory constraints prevent deploying the full model.

Knowledge Distillation

Q: What is knowledge distillation in simple terms?

Knowledge distillation is teaching a small model by having it watch a large model work — not just copying the final answers, but learning from the large model's uncertainty and reasoning. The large model says 'I think this is a cat (85%), maybe a fox (12%), unlikely a dog (3%)'. The small model learns from those probabilities, not just the label 'cat'. This richer signal trains a small model that punches well above its size.

Q: When should you use knowledge distillation?

When you need to deploy a model on resource-constrained devices (mobile, edge, IoT) that cannot run the full large model. When inference latency requirements are strict. When serving costs at scale are a concern. And when you want to transfer specialised capabilities of a large model into a small model fine-tuned on a narrow task.

⚡ Knowledge distillation trains a small student model to mimic a large teacher model, learning not just the final answers but the teacher’s probability distributions across all possible outputs. The result is a compact, fast model that retains most of the large model’s performance. DistilBERT is 40% smaller and 60% faster than BERT while retaining 97% of its GLUE benchmark score.

Category: MLOps · Difficulty: Intermediate · Last updated: 15 May 2026 · 4 min read

Knowledge Distillation — How Small Models Learn to Be Surprisingly Smart by Watching Large Ones

What is Knowledge Distillation?

A master craftsperson trains an apprentice not by handing them a list of rules but by working alongside them — demonstrating technique, showing what almost-right looks like, explaining why one approach is better than another. The apprentice absorbs far more from watching the master’s process than from reading a manual of correct answers.

Knowledge distillation works the same way. A large, capable teacher model has learned rich internal representations from extensive training. A small student model learns not only from the raw training labels but also from the teacher’s outputs, specifically the soft probability distributions the teacher produces over all possible classes.

These soft distributions carry information that hard labels hide. When a teacher says “cat: 0.85, fox: 0.12, dog: 0.03,” it reveals that cats and foxes share features that distinguish them from dogs. The student learns this relational structure — this “dark knowledge” as Geoffrey Hinton called it — and builds a small but surprisingly capable model.
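A small numeric sketch can make the "dark knowledge" point concrete. It reuses the cat/fox/dog probabilities from the text; the two students are hypothetical, invented only to show that a one-hot label cannot tell them apart while the teacher's soft target can.

```python
import math

def cross_entropy(target, predicted):
    """Cross-entropy H(target, predicted) in nats; terms with zero target weight are skipped."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

# Class order: cat, fox, dog
hard_label = [1.0, 0.0, 0.0]        # one-hot label: "cat"
teacher_soft = [0.85, 0.12, 0.03]   # teacher's soft prediction from the text

# Two hypothetical students, equally confident about "cat"
student_a = [0.85, 0.12, 0.03]      # runner-up is fox, like the teacher
student_b = [0.85, 0.03, 0.12]      # runner-up is dog, unlike the teacher

hard_a = cross_entropy(hard_label, student_a)
hard_b = cross_entropy(hard_label, student_b)
soft_a = cross_entropy(teacher_soft, student_a)
soft_b = cross_entropy(teacher_soft, student_b)

print(f"hard-label loss:  A={hard_a:.3f}  B={hard_b:.3f}")
print(f"soft-target loss: A={soft_a:.3f}  B={soft_b:.3f}")
```

The hard-label loss only looks at the probability each student gives to "cat", so it scores both students the same. The soft-target loss is lower for student A, which is the only signal here that pushes a student toward "fox is more cat-like than dog".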

How Knowledge Distillation works

Train a large, high-accuracy teacher model on the full training dataset.
Use the teacher to generate soft predictions (probability distributions) over all classes for every training example.
Train the student model on a combined loss: standard cross-entropy loss with hard labels + distillation loss measuring how closely the student’s soft predictions match the teacher’s soft predictions.
A temperature parameter T softens the distributions: dividing logits by T (with T > 1) before softmax makes low-probability classes more visible to the student. The same T is applied to both teacher and student logits when computing the distillation loss.
The student, often several times smaller than the teacher, learns to approximate the teacher’s behaviour across the full output distribution.
The trained student model is deployed — smaller, faster, cheaper, and close to the teacher in accuracy.
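The combined loss described in the steps above can be sketched for a single training example as follows. This is a minimal illustration in plain Python, not a training loop: the weighting parameter `alpha`, the default values, and the example logits are assumptions, and the multiplication by T squared is a common convention from the original distillation paper and is not something this article specifies.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (numerically stabilised)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q is non-zero wherever p is non-zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, true_class,
                      temperature=2.0, alpha=0.5):
    # Hard-label term: ordinary cross-entropy at temperature 1.
    hard_loss = -math.log(softmax(student_logits)[true_class])
    # Soft term: how far the student's softened distribution is from the teacher's.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    soft_loss = kl_divergence(teacher_soft, student_soft) * temperature ** 2
    # alpha balances the two terms (assumed convention: alpha weights the hard term).
    return alpha * hard_loss + (1 - alpha) * soft_loss

teacher_logits = [4.0, 2.0, 0.5]   # hypothetical logits for cat, fox, dog
matched = distillation_loss([4.0, 2.0, 0.5], teacher_logits, true_class=0)
mismatched = distillation_loss([4.0, 0.5, 2.0], teacher_logits, true_class=0)
print(f"matched student:    {matched:.4f}")
print(f"mismatched student: {mismatched:.4f}")
```

Both example students give "cat" the same probability, so their hard-label terms agree; the mismatched student is penalised only through the distillation term, because it ranks fox and dog the opposite way from the teacher.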

Real-world examples

Not theory — what real teams actually shipped using this technique.

DistilBERT (Hugging Face, 2019) is a 6-layer student distilled from 12-layer BERT: 40% smaller, 60% faster at inference, and 97% of BERT’s GLUE benchmark score. A widely used choice for production NLP at scale.
Apple distils large models into small ones for on-device ML on iPhone — the Siri voice model, autocorrect, and photo recognition all use distilled models small enough to run on the Neural Engine without cloud inference.
Google’s PaLM 2 family spans four sizes, Gecko, Otter, Bison, and Unicorn (smallest to largest); the smaller variants trade capability for lower compute cost in different deployment contexts.
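The DistilBERT percentages can be checked with a little arithmetic. The parameter counts and GLUE scores below are the figures reported in the DistilBERT paper, not numbers given in this article, so treat them as assumptions brought in for illustration.

```python
# Figures as reported in the DistilBERT paper (assumed here, not stated in this article).
bert_params_millions = 110      # BERT-base
distil_params_millions = 66     # DistilBERT
bert_glue_score = 79.5          # GLUE macro-score, BERT-base
distil_glue_score = 77.0        # GLUE macro-score, DistilBERT

size_reduction = 1 - distil_params_millions / bert_params_millions
glue_retained = distil_glue_score / bert_glue_score

print(f"parameter reduction: {size_reduction:.0%}")
print(f"GLUE score retained: {glue_retained:.1%}")
```

Under these figures, "40% smaller" refers to parameter count and "97%" to the ratio of GLUE scores, which is why the two numbers are not directly comparable.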

Common pitfalls

Teacher quality ceiling: the student is usually bounded by the teacher’s quality. A poorly trained teacher produces poor distillation signals. Invest in the teacher first.
Task specificity: distillation works best when teacher and student are trained on the same task and data distribution. Distilling a general teacher for a specialised student task may not capture the relevant knowledge.
Hyperparameter sensitivity: temperature T, the balance between hard label loss and distillation loss, and the student architecture all significantly affect distillation quality. Tuning is required.
Not a substitute for architecture search: distillation compresses an existing model’s knowledge but does not optimise the student architecture. Combining distillation with neural architecture search can improve results further.
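To see why temperature in particular needs tuning, the sketch below sweeps T over one set of teacher logits and reports how the softened distribution changes. The logits and the list of temperatures are hypothetical values chosen for illustration.

```python
import math

def softmax(logits, temperature):
    """Softmax over logits divided by a temperature (numerically stabilised)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

teacher_logits = [5.0, 3.0, 1.0]          # hypothetical teacher logits
temperatures = [1.0, 2.0, 5.0, 20.0]      # hypothetical sweep values

sweep = [(t, softmax(teacher_logits, t)) for t in temperatures]
entropies = [entropy(probs) for _, probs in sweep]

for (t, probs), h in zip(sweep, entropies):
    rounded = [round(p, 3) for p in probs]
    print(f"T={t:>4}: probs={rounded}  entropy={h:.3f} nats")
```

At low T the teacher's output is close to a hard label, so little extra information reaches the student. At very high T the distribution approaches uniform and the teacher's ranking of classes is washed out. Useful values sit somewhere in between, and where exactly depends on the task.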

Frequently asked questions

Q: What is knowledge distillation in simple terms?

Training a small student model to mimic a large teacher, learning from the teacher’s probability distributions and not just the final labels. Like an apprentice watching a master work.

Q: Why are soft predictions better than hard labels?

They reveal similarity structure: “cat 85%, fox 12%” tells the student cats and foxes share features. Hard labels hide this “dark knowledge.”

Q: What is DistilBERT?

A distilled BERT: 40% smaller, 60% faster, 97% of BERT’s GLUE benchmark score. A common choice when BERT quality is needed but deployment constraints prevent the full model.

Q: When should you use knowledge distillation?

For edge and mobile deployment, strict latency requirements, high-scale serving cost reduction, and transferring large model capabilities into small task-specific models.
