KDExplainer: A Task-oriented Attention Model for Explaining Knowledge Distillation

05/10/2021
by Mengqi Xue, et al.

Knowledge distillation (KD) has recently emerged as an efficacious scheme for learning compact deep neural networks (DNNs). Despite the promising results achieved, the rationale behind the behavior of KD has remained largely understudied. In this paper, we introduce a novel task-oriented attention model, termed KDExplainer, to shed light on the working mechanism underlying vanilla KD. At the heart of KDExplainer is a Hierarchical Mixture of Experts (HME), in which multi-class classification is reformulated as a multi-task binary classification problem. By distilling knowledge from a free-form pre-trained DNN into KDExplainer, we observe that KD implicitly modulates the knowledge conflicts between different subtasks and, in reality, has much more to offer than label smoothing. Based on these findings, we further introduce a portable tool, dubbed the virtual attention module (VAM), that can be seamlessly integrated with various DNNs to enhance their performance under KD. Experimental results demonstrate that, at negligible additional cost, student models equipped with VAM consistently outperform their non-VAM counterparts across different benchmarks. Furthermore, when combined with other KD methods, VAM remains effective at improving results, even though it is only motivated by vanilla KD. The code is available at https://github.com/zju-vipa/KDExplainer.
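The abstract's two ingredients lend themselves to a short illustration. The sketch below is not the authors' implementation; it is a minimal, hypothetical PyTorch example assuming a simple one-vs-rest reading of the "multi-task binary" reformulation: a student with one binary head per class is trained with the standard KD objective (temperature-softened KL against a pre-trained teacher, blended with a hard-label term). Names such as `BinarySubtaskStudent` and `kd_loss` are illustrative and do not appear in the paper's code.

```python
# Hypothetical sketch (not the authors' implementation): it illustrates, under
# simplifying assumptions, (1) recasting C-way classification as C binary
# "one-vs-rest" subtasks and (2) distilling a pre-trained teacher's softened
# predictions into that student with the vanilla KD objective.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BinarySubtaskStudent(nn.Module):
    """Toy student: a shared trunk feeding C independent binary heads."""

    def __init__(self, in_dim: int, num_classes: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        # One scalar logit per class: "does the input belong to class c?"
        self.heads = nn.ModuleList(nn.Linear(hidden, 1) for _ in range(num_classes))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.trunk(x)
        # Concatenate per-subtask logits back into a (B, C) tensor.
        return torch.cat([head(h) for head in self.heads], dim=1)


def kd_loss(student_logits, teacher_logits, targets, T: float = 4.0, alpha: float = 0.9):
    """Vanilla KD: temperature-softened KL term blended with a hard-label BCE term."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard labels supervise each binary subtask with one-vs-rest targets.
    hard = F.binary_cross_entropy_with_logits(
        student_logits, F.one_hot(targets, student_logits.size(1)).float()
    )
    return alpha * soft + (1.0 - alpha) * hard


# Usage with random data: a frozen stand-in "teacher" distills into the student.
if __name__ == "__main__":
    B, D, C = 32, 64, 10
    teacher = nn.Linear(D, C).eval()  # placeholder for a pre-trained DNN
    student = BinarySubtaskStudent(D, C)
    x, y = torch.randn(B, D), torch.randint(0, C, (B,))
    with torch.no_grad():
        t_logits = teacher(x)
    loss = kd_loss(student(x), t_logits, y)
    loss.backward()
    print(f"KD loss: {loss.item():.4f}")
```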

