Knowledge Distillation from Internal Representations

10/08/2019
by Gustavo Aguilar, et al.

Knowledge distillation is typically conducted by training a small model (the student) to mimic a large and cumbersome model (the teacher). The idea is to compress the knowledge from the teacher by using its output probabilities as soft-labels to optimize the student. However, when the teacher is considerably large, there is no guarantee that the internal knowledge of the teacher will be transferred into the student; even if the student closely matches the soft-labels, its internal representations may be considerably different. This internal mismatch can undermine the generalization capabilities originally intended to be transferred from the teacher to the student. In this paper, we propose to distill the internal representations of a large model such as BERT into a simplified version of it. We formulate two ways to distill such representations and various algorithms to conduct the distillation. We experiment with datasets from the GLUE benchmark and consistently show that adding knowledge distillation from internal representations is a more powerful method than only using soft-label distillation.
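The abstract describes combining standard soft-label distillation with an additional loss that matches the student's internal representations to the teacher's. Below is a minimal PyTorch sketch of that general idea; the cosine-based layer matching, the layer pairing, and the weights `alpha` and `beta` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def soft_label_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-scaled teacher and student distributions."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # Scale by t^2 so the gradient magnitude stays comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t ** 2)

def internal_representation_loss(student_hidden, teacher_hidden):
    """Match paired student/teacher hidden states.

    Both arguments are lists of tensors of shape (batch, seq_len, hidden_size).
    Which teacher layers are paired with which student layers is a design choice
    (e.g., every other teacher layer for a half-depth student).
    """
    loss = 0.0
    for s, t in zip(student_hidden, teacher_hidden):
        # Cosine-similarity-based matching over the hidden dimension; MSE is a common alternative.
        loss = loss + (1.0 - F.cosine_similarity(s, t, dim=-1)).mean()
    return loss / len(student_hidden)

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      labels, alpha=0.5, beta=1.0):
    """Hard-label cross-entropy + soft-label distillation + internal-representation matching."""
    ce = F.cross_entropy(student_logits, labels)
    kd = soft_label_loss(student_logits, teacher_logits)
    internal = internal_representation_loss(student_hidden, teacher_hidden)
    return (1 - alpha) * ce + alpha * kd + beta * internal
```

In this sketch, setting `beta=0` recovers plain soft-label distillation, which is the baseline the paper compares against.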

Related research

- Does Knowledge Distillation Really Work? (06/10/2021)
- Multi-head Knowledge Distillation for Model Compression (12/05/2020)
- Hybrid Paradigm-based Brain-Computer Interface for Robotic Arm Control (12/14/2022)
- On student-teacher deviations in distillation: does it pay to disobey? (01/30/2023)
- Improving Generalization and Robustness with Noisy Collaboration in Knowledge Distillation (10/11/2019)
- Modeling Teacher-Student Techniques in Deep Neural Networks for Knowledge Distillation (12/31/2019)
- Efficient Sub-structured Knowledge Distillation (03/09/2022)
