Improved knowledge distillation by utilizing backward pass knowledge in neural networks

01/27/2023
by Aref Jafari, et al.

Knowledge distillation (KD) is one of the prominent techniques for model compression. In this method, the knowledge of a large network (teacher) is distilled into a model (student) that usually has significantly fewer parameters. KD tries to better match the output of the student model to that of the teacher model based on knowledge extracted from the forward pass of the teacher network. Although conventional KD is effective for matching the two networks over the given data points, there is no guarantee that the models will also match in regions for which we do not have enough training samples. In this work, we address that problem by generating new auxiliary training samples, based on knowledge extracted from the backward pass of the teacher, in the areas where the student diverges greatly from the teacher. We compute the difference between the teacher and the student and generate new data samples that maximize this divergence by perturbing existing samples in the direction of the gradient of the teacher-student difference. Augmenting the training set with these auxiliary samples improves the performance of KD significantly and leads to a closer match between the student and the teacher. Applying this approach is not trivial when data samples come from a discrete domain, as in natural language processing (NLP) and language understanding applications; however, we show how the technique can be used successfully in such settings. We evaluated our method on various tasks in the computer vision and NLP domains and obtained promising results.
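To make the mechanism concrete, below is a minimal PyTorch-style sketch of the auxiliary-sample generation step for continuous inputs (e.g., images). The function name generate_auxiliary_samples, the choice of KL divergence, and the num_steps/step_size hyperparameters are illustrative assumptions rather than the authors' implementation; for discrete NLP inputs, the perturbation would presumably have to act on a continuous representation such as embeddings, which is the non-trivial extension the abstract alludes to.

```python
import torch
import torch.nn.functional as F


def generate_auxiliary_samples(teacher, student, x, num_steps=5, step_size=0.01):
    """Perturb a batch of inputs x toward regions where the student's output
    diverges most from the teacher's, by ascending the gradient of a
    divergence measure with respect to the input (a sketch, not the paper's code)."""
    teacher.eval()
    student.eval()
    x_aux = x.clone().detach().requires_grad_(True)
    for _ in range(num_steps):
        t_logits = teacher(x_aux)  # gradient flows through the teacher back to x_aux
        s_logits = student(x_aux)  # and through the student back to x_aux
        # Divergence between the two output distributions (KL used here for
        # illustration; the exact measure in the paper may differ).
        divergence = F.kl_div(
            F.log_softmax(s_logits, dim=-1),
            F.softmax(t_logits, dim=-1),
            reduction="batchmean",
        )
        (grad,) = torch.autograd.grad(divergence, x_aux)
        # Gradient-ascent step on the divergence: move the samples toward
        # areas of maximal teacher-student disagreement.
        x_aux = (x_aux + step_size * grad).detach().requires_grad_(True)
    return x_aux.detach()


# Hypothetical usage inside a KD loop: label the auxiliary samples with the
# teacher's soft outputs and add them to the distillation training set.
# x_aux = generate_auxiliary_samples(teacher, student, x_batch)
# soft_targets = F.softmax(teacher(x_aux), dim=-1).detach()
```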


