LightPAFF: A Two-Stage Distillation Framework for Pre-training and Fine-tuning

04/27/2020
by Kaitao Song, et al.

While pre-training and fine-tuning approaches such as BERT <cit.> and GPT-2 <cit.> have achieved great success in language understanding and generation tasks, the pre-trained models are usually too big for online deployment in terms of both memory cost and inference speed, which hinders their practical online usage. In this paper, we propose LightPAFF, a Lightweight Pre-training And Fine-tuning Framework that leverages two-stage knowledge distillation to transfer knowledge from a big teacher model to a lightweight student model in both the pre-training and fine-tuning stages. In this way, the lightweight model can achieve accuracy similar to the big teacher model, but with far fewer parameters and thus faster online inference. LightPAFF can support different pre-training methods (such as BERT, GPT-2 and MASS <cit.>) and be applied to many downstream tasks. Experiments on three language understanding tasks, three language modeling tasks and three sequence-to-sequence generation tasks demonstrate that, while achieving accuracy similar to the big BERT, GPT-2 and MASS models, LightPAFF reduces the model size by nearly 5x and improves online inference speed by 5x-7x.
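The core recipe can be viewed as a standard teacher-student distillation loss applied twice: once on the pre-training corpus (distilling the big pre-trained teacher) and once on the downstream task data (distilling the fine-tuned teacher). The sketch below is a minimal illustration of that two-stage idea in PyTorch; the loss weighting, temperature, and training-loop details are illustrative assumptions, not the paper's exact formulation or hyperparameters.

```python
# Minimal sketch of two-stage teacher-student distillation (illustrative only;
# lambda, temperature, and loop structure are assumptions, not LightPAFF's exact setup).
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, lam=0.5, temperature=1.0):
    """Weighted sum of the task loss on ground-truth labels and the
    KL divergence between student and teacher soft predictions."""
    task_loss = F.cross_entropy(student_logits, labels)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd_loss = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    return (1.0 - lam) * task_loss + lam * kd_loss


def distill_stage(student, teacher, dataloader, optimizer, lam=0.5):
    """One distillation stage: run once with the pre-trained teacher on the
    pre-training corpus, then again with the fine-tuned teacher on task data."""
    teacher.eval()
    student.train()
    for inputs, labels in dataloader:
        with torch.no_grad():
            teacher_logits = teacher(inputs)  # teacher provides soft targets
        student_logits = student(inputs)
        loss = distillation_loss(student_logits, teacher_logits, labels, lam=lam)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

In such a setup, the student distilled in the first stage serves as the initialization for the second stage, mirroring the usual pre-train-then-fine-tune pipeline of the big model.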
