Pea-KD: Parameter-efficient and Accurate Knowledge Distillation

09/30/2020
by Ikhyun Cho, et al.

How can we efficiently compress a model while maintaining its performance? Knowledge Distillation (KD) is one of the most widely used methods for model compression. In essence, KD trains a smaller student model to mimic a larger teacher model, aiming to retain as much of the teacher's performance as possible. However, existing KD methods suffer from the following limitations. First, because the student model is small in absolute size, it inherently lacks model complexity. Second, the absence of an initial guide for the student makes it difficult for the student to imitate the teacher to its fullest. Conventional KD methods yield low performance due to these limitations. In this paper, we propose Parameter-efficient and accurate Knowledge Distillation (Pea-KD), a novel approach to KD. Pea-KD consists of two main parts: Shuffled Parameter Sharing (SPS) and Pretraining with Teacher's Predictions (PTP). SPS is a new parameter-sharing method that gives the student model greater model complexity, and PTP is a KD-specialized initialization method that serves as a good initial guide for the student. Combined, they alleviate the limitations above and yield a significant increase in the student model's performance. Experiments conducted on different datasets and tasks show that the proposed approach improves the student model's performance by 4.4% on GLUE tasks, outperforming existing KD baselines by significant margins.
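For context, the sketch below shows the standard soft-label distillation objective that approaches like Pea-KD build on: a KL term between temperature-softened teacher and student distributions plus the usual cross-entropy on the ground-truth labels. This is a generic Hinton-style KD loss, not the exact objective used in the paper; the function name, temperature, and weighting below are illustrative assumptions.

import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.9):
    """Generic soft-label knowledge distillation loss (illustrative sketch).

    Mixes a KL-divergence term on temperature-softened teacher/student
    distributions with cross-entropy on the hard labels. The exact loss,
    temperature, and weighting used in Pea-KD may differ.
    """
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale by T^2 so gradients keep a comparable magnitude across temperatures.
    distill = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * distill + (1 - alpha) * hard

# Toy usage with hypothetical tensors (a real setup would use student/teacher model outputs).
student_logits = torch.randn(8, 2, requires_grad=True)
teacher_logits = torch.randn(8, 2)
labels = torch.randint(0, 2, (8,))
loss = kd_loss(student_logits, teacher_logits, labels)
loss.backward()  # in a real training loop this would update the student's parameters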


Related research

Fixing the Teacher-Student Knowledge Discrepancy in Distillation (03/31/2021)
Training a small student network with the guidance of a larger teacher n...

PURSUhInT: In Search of Informative Hint Points Based on Layer Clustering for Knowledge Distillation (02/26/2021)
We propose a novel knowledge distillation methodology for compressing de...

On the Impact of Knowledge Distillation for Model Interpretability (05/25/2023)
Several recent studies have elucidated why knowledge distillation (KD) i...

On the Efficiency of Subclass Knowledge Distillation in Classification Tasks (09/12/2021)
This work introduces a novel knowledge distillation framework for classi...

Heterogeneous Knowledge Distillation using Information Flow Modeling (05/02/2020)
Knowledge Distillation (KD) methods are capable of transferring the know...

Towards a General Model of Knowledge for Facial Analysis by Multi-Source Transfer Learning (11/08/2019)
This paper proposes a step toward obtaining general models of knowledge ...

Multi-stage Progressive Compression of Conformer Transducer for On-device Speech Recognition (10/01/2022)
The smaller memory bandwidth in smart devices prompts development of sma...
