Partial to Whole Knowledge Distillation: Progressive Distilling Decomposed Knowledge Boosts Student Better

by Xuanyang Zhang, et al.

Research on knowledge distillation has carefully designed various types of knowledge to shrink the performance gap between a compact student and a large-scale teacher. These existing distillation approaches focus only on improving knowledge quality and ignore the significant influence of knowledge quantity on the distillation procedure. In contrast to conventional distillation approaches, which extract knowledge from a fixed teacher computation graph, this paper takes the perspective of knowledge quantity to further improve the efficacy of knowledge distillation. We introduce a new concept of knowledge decomposition and put forward the Partial to Whole Knowledge Distillation (PWKD) paradigm. Specifically, we reconstruct the teacher into weight-sharing sub-networks with the same depth but increasing channel width, and train the sub-networks jointly to obtain decomposed knowledge (sub-networks with more channels represent more knowledge). The student then extracts partial to whole knowledge from the pre-trained teacher over multiple training stages, where a cyclic learning rate is leveraged to accelerate convergence. In general, PWKD can be regarded as a plugin that is compatible with existing offline knowledge distillation approaches. To verify the effectiveness of PWKD, we conduct experiments on two benchmark datasets, CIFAR-100 and ImageNet, and comprehensive evaluation results reveal that PWKD consistently improves existing knowledge distillation approaches without bells and whistles.
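The staged procedure described in the abstract can be sketched as a training schedule. This is a minimal illustration, not the authors' implementation: the function name `pwkd_schedule`, the width fractions, the equal-length stages, and the choice of a cosine learning rate that restarts at each stage are all assumptions, since the abstract only states that the teacher sub-networks grow in channel width and that a cyclic learning rate is used.

```python
import math


def pwkd_schedule(total_epochs, width_fractions, base_lr=0.1, min_lr=0.0):
    """Return a list of (epoch, teacher_width_fraction, lr) tuples.

    Hypothetical sketch: training is split into equal-length stages, one per
    teacher sub-network, ordered from the narrowest (partial knowledge) to the
    full-width teacher (whole knowledge). The learning rate follows a cosine
    decay that restarts at the start of every stage (assumed form of the
    "cyclic learning rate").
    """
    num_stages = len(width_fractions)
    if num_stages == 0 or total_epochs < num_stages:
        raise ValueError("need at least one epoch per stage")

    stage_len = total_epochs // num_stages
    plan = []
    for epoch in range(total_epochs):
        # The last stage absorbs any leftover epochs.
        stage = min(epoch // stage_len, num_stages - 1)
        start = stage * stage_len
        end = total_epochs if stage == num_stages - 1 else start + stage_len
        progress = (epoch - start) / (end - start)
        # Cosine decay from base_lr towards min_lr within the current stage.
        lr = min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
        plan.append((epoch, width_fractions[stage], lr))
    return plan


if __name__ == "__main__":
    # Assumed example: four teacher sub-networks at 25%, 50%, 75%, 100% width.
    for epoch, width, lr in pwkd_schedule(12, [0.25, 0.5, 0.75, 1.0]):
        print(f"epoch {epoch:2d}  teacher width {width:.2f}  lr {lr:.4f}")
```

In a real training loop, each tuple would select which teacher sub-network provides the distillation targets for that epoch and which learning rate the student optimizer uses; the distillation loss itself would come from whichever existing offline method PWKD is plugged into.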


Highlight Every Step: Knowledge Distillation via Collaborative Teaching

High storage and computational costs obstruct deep neural networks to be...

Embedded Knowledge Distillation in Depth-level Dynamic Neural Network

In real applications, different computation-resource devices need differ...

A Closer Look at Deep Learning Heuristics: Learning rate restarts, Warmup and Distillation

The convergence rate and final performance of common deep learning model...

Efficient Evaluation-Time Uncertainty Estimation by Improved Distillation

In this work we aim to obtain computationally-efficient uncertainty esti...

Knowledge Distillation Performs Partial Variance Reduction

Knowledge distillation is a popular approach for enhancing the performan...

Leveraging Different Learning Styles for Improved Knowledge Distillation

Learning style refers to a type of training mechanism adopted by an indi...

Marginal Utility Diminishes: Exploring the Minimum Knowledge for BERT Knowledge Distillation

Recently, knowledge distillation (KD) has shown great success in BERT co...
