Knowledge Distillation
A training technique that transfers the knowledge of a large teacher model into a smaller student model for lighter deployment
What is Knowledge Distillation?
Knowledge distillation is a training technique that transfers the know-how of a large, accurate teacher model into a smaller, faster student model, preserving most of its performance while cutting size and cost. Introduced by Hinton et al. (2015), it has become a core tool in the LLM era for shipping models at the 7B, 3B, and 1B scale.
In short, the teacher teaches the student not only what the answer is, but also how confident to be across all candidate answers.
How Does It Work?
Distillation combines two signals during training:
- Hard label: the ground-truth answer (e.g., "cat")
- Soft label (the key idea): the full probability distribution the teacher emits (e.g., cat 0.8, tiger 0.1, dog 0.05, fox 0.03 ...)
The student learns to mimic the teacher's soft distribution. Those soft labels encode implicit knowledge like "cats and tigers are related", giving the student a far richer learning signal than hard labels alone.
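Below is a minimal sketch of the classic logit-distillation loss from Hinton et al. (2015), written in PyTorch. The temperature softens both distributions so the teacher's smaller probabilities (tiger, dog, fox) still carry signal, and `alpha` balances the soft-label term against the ordinary hard-label loss. The tiny random logits and the specific values of `temperature` and `alpha` are placeholders, not a prescribed recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend the hard-label loss with the soft-label (teacher) loss."""
    # Hard label: ordinary cross-entropy against the ground-truth class.
    hard_loss = F.cross_entropy(student_logits, labels)

    # Soft label: KL divergence between temperature-softened distributions.
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_loss = F.kl_div(log_student, soft_targets,
                         reduction="batchmean") * temperature ** 2

    return alpha * soft_loss + (1 - alpha) * hard_loss

# Toy usage: random logits stand in for a real teacher and student.
teacher_logits = torch.randn(8, 10)                     # e.g. 10 classes: cat, tiger, dog, ...
student_logits = torch.randn(8, 10, requires_grad=True)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```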
In LLMs, sequence-level distillation — training the student to reproduce the teacher's responses — is the dominant approach, and it shows up in training recipes for Llama, Gemma, and other compact open-weight models.
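As a rough sketch of how sequence-level distillation is set up in practice: the teacher generates responses to a prompt set, and the resulting (prompt, response) pairs become ordinary supervised fine-tuning data for the student. The model name and prompts below are placeholders, and the student fine-tuning step is only outlined.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name = "large-teacher-model"  # placeholder: any strong instruct model
tokenizer = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name, torch_dtype=torch.bfloat16)

prompts = [
    "Explain photosynthesis in one sentence.",
    "Write a haiku about autumn.",
]

# 1. The teacher produces sequence-level supervision: one response per prompt.
distill_pairs = []
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = teacher.generate(**inputs, max_new_tokens=128, do_sample=False)
    response = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                                skip_special_tokens=True)
    distill_pairs.append({"prompt": prompt, "response": response})

# 2. The student is then fine-tuned on distill_pairs with standard
#    next-token cross-entropy, exactly like supervised instruction tuning.
```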
Why Does It Matter?
Knowledge distillation shapes the economics of AI in production. Serving a frontier model with hundreds of billions of parameters directly is cost-prohibitive, but distilling it into a 7B–13B student can preserve 80–95% of its performance at 1/10 to 1/100 of the cost. On-device AI, edge applications, and low-latency agents are rarely practical without distillation, which makes it an indispensable part of today's AI deployment stack.