Rethinking Soft Labels for Knowledge Distillation: A Bias-Variance Tradeoff Perspective

by Helong Zhou, et al.

Knowledge distillation is an effective approach for leveraging a well-trained network, or an ensemble of them, called the teacher, to guide the training of a student network. The teacher's outputs are used as soft labels to supervise the student's training. Recent studies <cit.> revealed an intriguing property of soft labels: softening the labels acts as a good regularizer for the student network. From the perspective of statistical learning, regularization aims to reduce variance; however, how bias and variance change when training with soft labels is not clear. In this paper, we investigate the bias-variance tradeoff brought by distillation with soft labels. Specifically, we observe that during training the bias-variance tradeoff varies from sample to sample. Further, under the same distillation temperature setting, we observe that distillation performance is negatively associated with the number of certain samples, which we call regularization samples because they increase bias and decrease variance. Nevertheless, we empirically find that completely filtering out regularization samples also degrades distillation performance. These findings inspired us to propose weighted soft labels, which help the network adaptively handle the sample-wise bias-variance tradeoff. Experiments on standard evaluation benchmarks validate the effectiveness of our method. Our code is available at <>.
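
To make the idea of per-sample weighting of soft labels concrete, here is a minimal pure-Python sketch. The abstract does not give the weighting formula, so `sample_weight` below is an illustrative assumption, not a confirmed statement of the authors' method: it down-weights the soft-label term when the student already fits a sample about as well as the teacher does. The function names, the default temperature of 4.0, and the `alpha` mixing coefficient are hypothetical as well. The soft-label term itself is the standard temperature-scaled KL divergence between teacher and student distributions.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_norm for s in scaled]


def cross_entropy(logits, label):
    """Hard-label cross-entropy for a single sample."""
    return -log_softmax(logits)[label]


def soft_label_loss(student_logits, teacher_logits, temperature):
    """KL(teacher || student) on temperature-softened outputs, scaled by T^2."""
    log_p_t = log_softmax(teacher_logits, temperature)
    log_p_s = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_t, log_p_s))
    return kl * temperature ** 2


def sample_weight(student_logits, teacher_logits, label, eps=1e-12):
    """ASSUMED weighting (not taken from the abstract): lies in [0, 1) and
    grows as the student's hard-label loss exceeds the teacher's."""
    ce_student = cross_entropy(student_logits, label)
    ce_teacher = cross_entropy(teacher_logits, label)
    return 1.0 - math.exp(-ce_student / (ce_teacher + eps))


def weighted_distillation_loss(batch, temperature=4.0, alpha=1.0):
    """Mean over (student_logits, teacher_logits, label) triples of
    hard-label loss + alpha * weight * soft-label loss."""
    total = 0.0
    for student_logits, teacher_logits, label in batch:
        hard = cross_entropy(student_logits, label)
        soft = soft_label_loss(student_logits, teacher_logits, temperature)
        weight = sample_weight(student_logits, teacher_logits, label)
        total += hard + alpha * weight * soft
    return total / len(batch)


# Toy batch: two samples, three classes (values are made up for illustration).
batch = [
    ([2.0, 0.5, -1.0], [3.0, 0.2, -2.0], 0),
    ([0.1, 0.3, 0.2], [0.0, 2.5, 0.5], 1),
]
loss = weighted_distillation_loss(batch)
```

In an actual training loop the weight would normally be treated as a constant with respect to the gradient, and the same computation would be expressed with a tensor library; this sketch only illustrates the arithmetic.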


Spot-adaptive Knowledge Distillation

Knowledge distillation (KD) has become a well established paradigm for c...

Mitigating Class Boundary Label Uncertainty to Reduce Both Model Bias and Variance

The study of model bias and variance with respect to decision boundaries...

Why distillation helps: a statistical perspective

Knowledge distillation is a technique for improving the performance of a...

Self-Distillation from the Last Mini-Batch for Consistency Regularization

Knowledge distillation (KD) shows a bright promise as a powerful regular...

Simon Says: Evaluating and Mitigating Bias in Pruned Neural Networks with Knowledge Distillation

In recent years the ubiquitous deployment of AI has posed great concerns...

Efficient One Pass Self-distillation with Zipf's Label Smoothing

Self-distillation exploits non-uniform soft supervision from itself duri...

On the Bias-Variance Tradeoff: Textbooks Need an Update

The main goal of this thesis is to point out that the bias-variance trad...

Code Repositories

some methods in DL using paddle