Pro-KD: Progressive Distillation by Following the Footsteps of the Teacher

by   Mehdi Rezagholizadeh, et al.

With ever growing scale of neural models, knowledge distillation (KD) attracts more attention as a prominent tool for neural model compression. However, there are counter intuitive observations in the literature showing some challenging limitations of KD. A case in point is that the best performing checkpoint of the teacher might not necessarily be the best teacher for training the student in KD. Therefore, one important question would be how to find the best checkpoint of the teacher for distillation? Searching through the checkpoints of the teacher would be a very tedious and computationally expensive process, which we refer to as the checkpoint-search problem. Moreover, another observation is that larger teachers might not necessarily be better teachers in KD which is referred to as the capacity-gap problem. To address these challenging problems, in this work, we introduce our progressive knowledge distillation (Pro-KD) technique which defines a smoother training path for the student by following the training footprints of the teacher instead of solely relying on distilling from a single mature fully-trained teacher. We demonstrate that our technique is quite effective in mitigating the capacity-gap problem and the checkpoint search problem. We evaluate our technique using a comprehensive set of experiments on different tasks such as image classification (CIFAR-10 and CIFAR-100), natural language understanding tasks of the GLUE benchmark, and question answering (SQuAD 1.1 and 2.0) using BERT-based models and consistently got superior results over state-of-the-art techniques.



There are no comments yet.


page 1

page 2

page 3

page 4


Annealing Knowledge Distillation

Significant memory and computational requirements of large deep neural n...

Model Compression with Two-stage Multi-teacher Knowledge Distillation for Web Question Answering System

Deep pre-training and fine-tuning models (such as BERT and OpenAI GPT) h...

Differentiable Feature Aggregation Search for Knowledge Distillation

Knowledge distillation has become increasingly important in model compre...

Causal Distillation for Language Models

Distillation efforts have led to language models that are more compact a...

Model Compression with Multi-Task Knowledge Distillation for Web-scale Question Answering System

Deep pre-training and fine-tuning models (like BERT, OpenAI GPT) have de...

Distilling Linguistic Context for Language Model Compression

A computationally expensive and memory intensive neural network lies beh...

Born Again Neural Rankers

We introduce Born Again neural Rankers (BAR) in the Learning to Rank (LT...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.