Distilling Knowledge from Pre-trained Language Models via Text Smoothing

05/08/2020
by Xing Wu, et al.

This paper studies compressing pre-trained language models, like BERT (Devlin et al., 2019), via teacher-student knowledge distillation. Previous works usually force the student model to strictly mimic the smoothed labels predicted by the teacher BERT. As an alternative, we propose a new method for BERT distillation, i.e., asking the teacher to generate smoothed word ids, rather than labels, for teaching the student model in knowledge distillation. We call this kind of method Text Smoothing. Practically, we use the softmax prediction of the Masked Language Model (MLM) in BERT to generate word distributions for given texts and smooth those input texts using the predicted soft word ids. We assume that both smoothed labels and smoothed texts can implicitly augment the input corpus, while text smoothing is intuitively more efficient since it can generate more instances in one forward step of the neural network. Experimental results on GLUE and SQuAD demonstrate that our solution achieves competitive results compared with existing BERT distillation methods.
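The sketch below illustrates the core idea from the abstract: a teacher BERT's MLM softmax turns each input position into a word distribution ("soft word ids"), and a student can consume those distributions as a probability-weighted mix of its word embeddings. This is a minimal illustration under assumptions, not the authors' implementation; the temperature value and the embedding mix-in are assumed details, and the Hugging Face model names are only used as convenient stand-ins.

```python
# Minimal sketch of "text smoothing": get soft word ids from the teacher's
# MLM head and feed a smoothed version of the text to a student.
# Assumptions: temperature = 1.0 and a 768-dim student embedding table.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
teacher = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

text = "The movie was surprisingly good."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = teacher(**inputs).logits              # (1, seq_len, vocab_size)

temperature = 1.0                                  # assumed hyper-parameter
soft_word_ids = torch.softmax(logits / temperature, dim=-1)

# The student embeds the smoothed text as a probability-weighted sum over its
# word-embedding table; one teacher forward pass thus yields many implicit
# augmentations of the same sentence.
student_embeddings = torch.nn.Embedding(tokenizer.vocab_size, 768)
smoothed_inputs = soft_word_ids @ student_embeddings.weight  # (1, seq_len, 768)
print(smoothed_inputs.shape)
```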
