Sparse Distillation: Speeding Up Text Classification by Using Bigger Student Models

10/16/2021
by   Qinyuan Ye, et al.

Distilling state-of-the-art transformer models into lightweight student models is an effective way to reduce computation cost at inference time. However, the improved inference speed may still be unsatisfactory for certain time-sensitive applications. In this paper, we aim to further push the limit of inference speed by exploring a new area in the design space of the student model. More specifically, we consider distilling a transformer-based text classifier into a billion-parameter, sparsely-activated student model with an embedding-averaging architecture. Our experiments show that the student models retain 97% of the teacher's performance on a collection of six text classification tasks. Meanwhile, the student model achieves up to 600x speed-up on both GPUs and CPUs, compared to the teacher models. Further investigation shows that our pipeline is also effective in privacy-preserving and domain generalization settings.
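To make the student architecture concrete, below is a minimal PyTorch sketch of a sparsely-activated, embedding-averaging classifier trained against teacher soft labels. The vocabulary size, embedding width, temperature, and the KL-divergence distillation objective are illustrative assumptions for this sketch, not the paper's exact configuration.

```python
# A minimal sketch of a sparsely-activated, embedding-averaging student.
# Vocabulary size, embedding width, and the KD loss below are assumptions
# made for illustration, not the paper's exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingAveragingStudent(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int, num_classes: int):
        super().__init__()
        # EmbeddingBag with mode="mean" averages the embeddings of the
        # n-gram IDs present in each input. Only those rows of the
        # (potentially billion-parameter) table are touched per example,
        # so the model is sparsely activated; sparse=True also keeps the
        # weight gradients sparse during training.
        self.embed = nn.EmbeddingBag(vocab_size, embed_dim,
                                     mode="mean", sparse=True)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, ngram_ids, offsets):
        return self.classifier(self.embed(ngram_ids, offsets))

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soft-label KD objective: match the teacher's tempered distribution.
    t = temperature
    return F.kl_div(F.log_softmax(student_logits / t, dim=-1),
                    F.softmax(teacher_logits / t, dim=-1),
                    reduction="batchmean") * (t * t)

# Toy usage: two examples given as flattened n-gram IDs plus start offsets.
model = EmbeddingAveragingStudent(vocab_size=1_000_000,
                                  embed_dim=1000, num_classes=2)
ngram_ids = torch.tensor([3, 17, 942, 5, 88])  # example 1: 3 ids; example 2: 2 ids
offsets = torch.tensor([0, 3])                 # start index of each example
logits = model(ngram_ids, offsets)
```

Because each input only activates the embedding rows of the n-grams it contains, inference cost depends on the input length rather than the size of the parameter table, which is what lets a very large student run much faster than a compact transformer.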


