DOT: A Distillation-Oriented Trainer

07/17/2023
by Borui Zhao, et al.

Knowledge distillation transfers knowledge from a large model to a small one via task and distillation losses. In this paper, we observe a trade-off between the task and distillation losses, i.e., introducing the distillation loss limits the convergence of the task loss. We believe this trade-off results from insufficient optimization of the distillation loss. The reasoning is as follows: the teacher has a lower task loss than the student, and a lower distillation loss drives the student to be more similar to the teacher, so a better-converged task loss could be obtained. To break the trade-off, we propose the Distillation-Oriented Trainer (DOT). DOT considers the gradients of the task and distillation losses separately, then applies a larger momentum to the distillation loss to accelerate its optimization. We empirically show that DOT breaks the trade-off, i.e., both losses are sufficiently optimized. Extensive experiments validate the superiority of DOT. Notably, DOT achieves a +2.59% accuracy improvement for the ResNet50-MobileNetV1 pair. Conclusively, DOT greatly benefits the student's optimization properties in terms of loss convergence and model generalization. Code will be made publicly available.
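
The mechanism described in the abstract, splitting the gradient into a task part and a distillation part and giving the distillation part a larger momentum, can be illustrated with a small PyTorch-style sketch. The class name DOTSGD, the hyperparameters mu and delta, and the exact update order below are illustrative assumptions, not the authors' reference implementation; the sketch only shows two separate momentum buffers updated with momenta mu - delta (task loss) and mu + delta (distillation loss).

# Minimal sketch of a DOT-style update (assumed details): one momentum
# buffer per parameter for each loss, with a smaller momentum (mu - delta)
# for the task-loss gradient and a larger one (mu + delta) for the
# distillation-loss gradient. Names (DOTSGD, mu, delta) are illustrative.
import torch

class DOTSGD:
    def __init__(self, params, lr=0.1, mu=0.9, delta=0.075):
        self.params = list(params)
        self.lr, self.mu, self.delta = lr, mu, delta
        self.buf_task = [torch.zeros_like(p) for p in self.params]
        self.buf_kd = [torch.zeros_like(p) for p in self.params]

    @torch.no_grad()
    def step(self, task_grads, kd_grads):
        # task_grads / kd_grads: gradients of the task loss and of the
        # distillation loss w.r.t. self.params, computed separately
        # (e.g. with torch.autograd.grad on each loss).
        for p, b_t, b_k, g_t, g_k in zip(self.params, self.buf_task,
                                         self.buf_kd, task_grads, kd_grads):
            b_t.mul_(self.mu - self.delta).add_(g_t)  # slower momentum: task loss
            b_k.mul_(self.mu + self.delta).add_(g_k)  # faster momentum: distillation loss
            p.add_(b_t + b_k, alpha=-self.lr)

In use, one would obtain the two gradient lists with separate calls such as torch.autograd.grad(task_loss, params, retain_graph=True) and torch.autograd.grad(kd_loss, params) before calling step. The point is only that the distillation gradient accumulates under the larger momentum, which is the mechanism the abstract credits for breaking the trade-off.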


