As large models advance, demand is growing for knowledge distillation to produce smaller, more portable models (students) that match the performance of larger, higher-performing models (teachers). However, when a teacher’s capacity far exceeds the student’s, distillation often degrades, a problem known as capacity mismatch. This mismatch caps the student model’s performance and has become a bottleneck in large-model distillation. Existing mitigation techniques remain ad hoc, and no study has yet systematically explained the root causes of capacity mismatch or proposed targeted methods that enable larger teachers to yield better distillation results.
To address these problems, a research team led by De-Chuan Zhan published new research on 15 June 2026 in Frontiers of Computer Science, which is co-published by Higher Education Press and Springer Nature.
The team identified two key characteristics of knowledge distillation. First, as the teacher model grows, the variance of its output probabilities over non-target classes first rises and then falls. This variance, which reflects sample–class relationships, is essential to knowledge transfer and correlates with distillation performance; when the teacher becomes too large, the variance shrinks and distillation degrades. Second, the ranking of class output magnitudes remains unchanged as teacher capacity changes, which shows that simple, order-preserving temperature scaling can adjust class variance without disrupting the teacher’s inherent knowledge. These insights deepen the understanding of dark knowledge, reveal the origin of capacity mismatch, and guide the design of more effective distillation methods.
Based on these insights, the team proposes the Instance-Specific Asymmetric Temperature Scaling (ISATS) method. For each example, ISATS applies different distillation temperatures to the correct class and to the incorrect classes, and chooses the incorrect-class temperature to maximize the variance of the incorrect-class outputs, thereby enriching dark knowledge in two ways. Experiments on numerous datasets show that the method outperforms previous capacity-mismatch mitigation techniques and ensures that larger teacher models teach more effectively.
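The idea of asymmetric, per-example temperature scaling can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the grid of candidate temperatures, the fixed correct-class temperature, and the choice to measure variance on the non-target entries of the full softmax output are all assumptions made for illustration.

```python
import numpy as np

def asymmetric_softmax(logits, target, t_correct, t_wrong):
    """Softmax with one temperature for the target class and another for the rest."""
    temps = np.full(logits.shape, t_wrong, dtype=float)
    temps[target] = t_correct
    scaled = logits / temps
    scaled = scaled - scaled.max()  # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

def wrong_class_variance(probs, target):
    """Variance of the probabilities assigned to the non-target classes."""
    return float(np.delete(probs, target).var())

def pick_wrong_temperature(logits, target, t_correct=4.0,
                           candidates=(1.0, 2.0, 4.0, 8.0)):
    """Grid search (an assumed strategy) for the incorrect-class temperature
    that maximizes the non-target variance of one example."""
    return max(
        candidates,
        key=lambda t: wrong_class_variance(
            asymmetric_softmax(logits, target, t_correct, t), target
        ),
    )

# Hypothetical teacher logits for a single example whose correct class is 0.
teacher_logits = np.array([8.0, 2.0, 1.0, -1.0])
t_wrong = pick_wrong_temperature(teacher_logits, target=0)
soft_targets = asymmetric_softmax(teacher_logits, 0, 4.0, t_wrong)
```

Because all incorrect classes share a single positive temperature, their relative ranking is preserved, which matches the order-preserving property described above. The resulting `soft_targets` would then stand in for the teacher's softened outputs in a standard distillation loss.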
DOI
10.1007/s11704-025-41434-w