Pro-KD: Progressive Distillation by Following the Footsteps of the Teacher

10/16/2021
by Mehdi Rezagholizadeh, et al.

With the ever-growing scale of neural models, knowledge distillation (KD) has attracted increasing attention as a prominent tool for neural model compression. However, there are counter-intuitive observations in the literature that reveal some challenging limitations of KD. A case in point is that the best-performing checkpoint of the teacher is not necessarily the best teacher for training the student in KD. This raises an important question: how can we find the best teacher checkpoint for distillation? Searching through the teacher's checkpoints would be a tedious and computationally expensive process, which we refer to as the checkpoint-search problem. Another observation is that larger teachers are not necessarily better teachers in KD, which is referred to as the capacity-gap problem. To address these challenging problems, in this work we introduce our progressive knowledge distillation (Pro-KD) technique, which defines a smoother training path for the student by following the training footprints of the teacher instead of solely relying on distillation from a single, mature, fully trained teacher. We demonstrate that our technique is quite effective in mitigating both the capacity-gap and checkpoint-search problems. We evaluate our technique with a comprehensive set of experiments on different tasks, including image classification (CIFAR-10 and CIFAR-100), the natural language understanding tasks of the GLUE benchmark, and question answering (SQuAD 1.1 and 2.0) with BERT-based models, and consistently obtain superior results over state-of-the-art techniques.
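The core idea, distilling from a sequence of progressively more mature teacher checkpoints rather than only the final one, can be sketched in a few lines of PyTorch. The snippet below is a minimal illustration under assumptions, not the authors' implementation: names such as make_teacher, teacher_ckpts, and the fixed-temperature kd_loss weighting are placeholders introduced for the example.

import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.5):
    # Standard KD objective: KL divergence against the teacher's
    # temperature-scaled distribution plus hard-label cross-entropy.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

def progressive_distillation(student, make_teacher, teacher_ckpts, loader,
                             epochs_per_stage=1, lr=1e-4, device="cpu"):
    # Train the student against successive teacher checkpoints (early to late),
    # so the distillation target matures gradually instead of starting from the
    # fully trained teacher.
    student.to(device).train()
    optimizer = torch.optim.AdamW(student.parameters(), lr=lr)
    for ckpt_path in teacher_ckpts:  # checkpoint paths, ordered early -> late
        teacher = make_teacher()     # hypothetical factory for the teacher architecture
        teacher.load_state_dict(torch.load(ckpt_path, map_location=device))
        teacher.to(device).eval()
        for _ in range(epochs_per_stage):
            for inputs, labels in loader:
                inputs, labels = inputs.to(device), labels.to(device)
                with torch.no_grad():
                    t_logits = teacher(inputs)
                s_logits = student(inputs)
                loss = kd_loss(s_logits, t_logits, labels)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
    return student

The intuition this sketch captures is that the student never has to match a teacher far beyond its current capacity in a single step: the teacher's targets start out soft and imperfect and sharpen stage by stage, which is how the paper motivates mitigating the capacity-gap problem without searching over teacher checkpoints.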


