Distilling Knowledge from Pre-trained Language Models via Text Smoothing

by Xing Wu, et al.
Baidu, Inc.

This paper studies compressing pre-trained language models, like BERT (Devlin et al., 2019), via teacher-student knowledge distillation. Previous works usually force the student model to strictly mimic the smoothed labels predicted by the teacher BERT. As an alternative, we propose a new method for BERT distillation: asking the teacher to generate smoothed word ids, rather than labels, for teaching the student model. We call this method Text Smoothing. In practice, we use the softmax prediction of the Masked Language Model (MLM) in BERT to generate word distributions for given texts, and smooth those input texts using the predicted soft word ids. We assume that both smoothed labels and smoothed texts implicitly augment the input corpus, while text smoothing is intuitively more efficient since it can generate more instances in one forward step of the neural network. Experimental results on GLUE and SQuAD demonstrate that our solution achieves competitive results compared with existing BERT distillation methods.




1 Introduction

While language model pre-training, such as BERT (Devlin et al., 2019) and its variants (Yang et al., 2019; Liu et al., 2019; Lan et al., 2019; Raffel et al., 2019), has significantly improved the performance of many natural language processing tasks, those pre-trained models are usually too large to be deployed in resource-limited applications. To address this problem, many researchers have recently investigated using Knowledge Distillation (KD) algorithms (Hinton et al., 2015) to transfer the knowledge of a large pre-trained language model (teacher) into a small neural network (student) (Sanh et al., 2019; Sun et al., 2019; Jiao et al., 2019), in order to reduce the model size for online deployment. The left part of Figure 1 shows the learning algorithm of previous KD methods for BERT. Typically, the student model learns from smoothed labels: given an input text of some downstream task, both BERT and the student model make predictions over the input text, and the student model is then asked to strictly fit the soft labels predicted by BERT (for example, the probability distribution over different labels for text classification tasks).
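For reference, the smoothed-label objective described above can be sketched as a cross-entropy between the teacher's and the student's predicted label distributions. This is a minimal NumPy sketch under stated assumptions, not the exact loss of any cited baseline: the function names, the toy logits and the `temperature` parameter (borrowed from the general KD formulation of Hinton et al., 2015) are illustrative choices.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    shifted = logits - np.max(logits, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def soft_label_loss(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy between the teacher's soft labels and the student's predictions."""
    teacher_probs = softmax(teacher_logits / temperature)
    student_log_probs = np.log(softmax(student_logits / temperature))
    return float(-np.mean(np.sum(teacher_probs * student_log_probs, axis=-1)))

# Toy batch: 2 texts, 3 task labels (values are made up for illustration).
teacher_logits = np.array([[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]])
student_logits = np.array([[1.0, 0.2, -0.5], [0.0, 0.9, 0.4]])
loss = soft_label_loss(student_logits, teacher_logits, temperature=2.0)
```

The loss is smallest when the student reproduces the teacher's distribution exactly, which is what "strictly fitting the soft labels" amounts to.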

Figure 1: Distilling BERT using smoothed labels (left) and smoothed texts (right).

As an alternative, we investigate training a student model using smoothed texts, rather than labels, in knowledge distillation of BERT. The right part of Figure 1 illustrates our proposed method, which consists of two steps:

  • Text Smoothing: Given the pre-trained teacher (BERT) as well as the downstream task corpus, we first feed the one-hot encoding of each text in the corpus into the teacher BERT, and fetch the softmax prediction of the Masked Language Model (MLM) in BERT. The original one-hot encodings and the predicted word distributions are then mixed together to smooth the raw texts.

  • Student Learning: The student model is tuned using the smoothed corpus of the downstream task, akin to typical supervised training. During the inference stage, the student model only takes the one-hot encoding as input to make predictions.

We assume that both label smoothing and text smoothing implicitly generate more instances for training the student model, while text smoothing could be more efficient, since it can intuitively generate more instances than label smoothing in each forward step of the neural network (because the number of word candidates is much larger than the number of task labels). Our work is inspired by the recent success of using soft word ids to improve performance in Neural Machine Translation (Wang et al., 2019; Gao et al., 2019), and we extend this idea to knowledge distillation of BERT.
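The efficiency intuition can be made slightly more concrete with a rough count. This is only an illustrative reading, assuming that words are drawn independently at each position of the smoothed text; the symbols $n$, $V$ and $C$ are introduced here for the illustration.

```latex
\underbrace{|V|^{\,n}}_{\text{word sequences covered by one smoothed text}}
\quad \gg \quad
\underbrace{|C|}_{\text{labels covered by one smoothed label}}
```

Here $n$ is the length of the text in tokens, $|V|$ is the vocabulary size (30,522 word pieces for the uncased English BERT) and $|C|$ is the number of task labels (for example, 2 for SST-2).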

Empirical results on GLUE (Wang et al., 2018) and SQuAD (Rajpurkar et al., 2016, 2018) show that a student model trained with our proposed text smoothing algorithm achieves competitive results compared with existing BERT KD methods, while the text smoothing method is significantly faster than previous distillation algorithms. The major contribution of our work is a new knowledge distillation method for BERT, which uses BERT to generate smoothed texts, rather than labels, for teaching the student model.

2 Related Work

Compressing large pre-trained models like BERT has attracted increasing research interest in recent years. Existing studies can be roughly divided into distillation (Sanh et al., 2019; Sun et al., 2019; Jiao et al., 2019), pruning (McCarley, 2019) and quantization (Zafrir et al., 2019), where distillation shows the most promising results. In this paper, we focus on the knowledge distillation of BERT. Most previous BERT distillation methods make the student model mimic the smoothed labels predicted by the teacher BERT. Sanh et al. (2019) use the large-scale pre-training corpus to generate a massive amount of weak labels for teaching the student model. Sun et al. (2019) investigate forcing the student model to fit multiple feedforward layers of the teacher BERT. Jiao et al. (2019) propose a more systematic approach to construct an equally effective student of BERT: they use the feedforward, multi-head attention distribution and embedding layers of the teacher model to supervise the student model, conduct KD in both the pre-training and fine-tuning stages, and leverage Data Augmentation (DA) (Wu et al., 2019) algorithms to increase the size of the fine-tuning corpus by 20 times.

3 Our Method

Let $T$ be the teacher BERT and $S$ be the student model to be tuned. The dataset of some downstream task is given as $D = \{(t_i, p_i, s_i, l_i)\}_{i=1}^{N}$, where $N$ is the number of instances, $t_i$ is the one-hot encoding of a text (a single sentence or a sentence pair), $p_i$ is the positional encoding of $t_i$, $s_i$ is the segment encoding of $t_i$, and $l_i$ is the label of this instance. We train the student model of BERT in two steps:

Text Smoothing: We feed the one-hot encoding $t_i$, the positional encoding $p_i$ and the segment encoding $s_i$ into BERT, and fetch the output of the last layer of the transformer encoder in BERT, which is denoted as:

$$\overrightarrow{t_i} = \mathrm{BERT}(t_i, p_i, s_i)$$

$\overrightarrow{t_i}$ is a 2D dense tensor of shape [sequence_len, embedding_size]. We then multiply $\overrightarrow{t_i}$ with the word embedding matrix $W$ in BERT, of shape [vocab_size, embedding_size], to get the MLM prediction, which is defined as:

$$\mathrm{MLM}(\overrightarrow{t_i}) = \mathrm{softmax}\left(\overrightarrow{t_i}\, W^{\top}\right)$$

where each row of $\mathrm{MLM}(\overrightarrow{t_i})$ is a probability distribution over the word vocabulary, representing the choice of words at that position of the input text as learned by BERT. Note that we do not apply the mask corruption of BERT when fetching the MLM prediction in text smoothing, and the teacher BERT is not fine-tuned on the downstream task corpus either. Our experimental results show that the mask corruption may harm the performance of knowledge distillation. The input text $t_i$ is finally smoothed using a simple linear interpolation:

$$\widetilde{t_i} = \lambda \cdot t_i + (1 - \lambda) \cdot \mathrm{MLM}(\overrightarrow{t_i})$$

where $\lambda$ controls the smoothing degree.
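The text smoothing step can be sketched in a few lines of NumPy. This is a minimal sketch rather than the authors' implementation: random arrays stand in for the teacher's last-layer output and word embedding matrix, the function and variable names are illustrative, and placing the interpolation weight `lam` on the one-hot term is an assumption about the direction of the interpolation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def smooth_text(one_hot, hidden, word_embeddings, lam=0.5):
    """Mix one-hot word ids with the teacher's per-position word distribution.

    one_hot:         [seq_len, vocab_size]  one-hot encoding of the raw text
    hidden:          [seq_len, emb_size]    last-layer output of the teacher
    word_embeddings: [vocab_size, emb_size] teacher word embedding matrix
    """
    mlm_probs = softmax(hidden @ word_embeddings.T)  # [seq_len, vocab_size]
    return lam * one_hot + (1.0 - lam) * mlm_probs

rng = np.random.default_rng(0)
seq_len, vocab_size, emb_size = 4, 10, 8
token_ids = np.array([1, 5, 2, 7])
one_hot = np.eye(vocab_size)[token_ids]
hidden = rng.normal(size=(seq_len, emb_size))              # stand-in for BERT output
word_embeddings = rng.normal(size=(vocab_size, emb_size))  # stand-in for BERT embeddings
smoothed = smooth_text(one_hot, hidden, word_embeddings, lam=0.5)
```

Each row of `smoothed` remains a valid probability distribution, and with `lam=0.5` the original word keeps at least half of the probability mass at its position.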

| Method (# layers) | MNLI-m | MNLI-mm | QQP | SST-2 | QNLI | MRPC | RTE | CoLA | STS-B | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| BERT (12) | 84.6 | 83.4 | 71.2 | 93.5 | 90.5 | 88.9 | 66.4 | 52.1 | 85.8 | 79.6 |
| BERT (3) | 74.8 | 74.3 | 65.8 | 86.4 | 84.3 | 80.5 | 55.2 | 16.8 | 67.5 | 68.3 |
| BERT-PKD (3) | 76.7 | 76.3 | 68.1 | 87.5 | 84.7 | 80.7 | 58.2 | - | - | - |
| TextSmooth (3) | 77.2 | 76.1 | 66.7 | 88.4 | 85.2 | 82.3 | 60.7 | 23.8 | 77.7 | 70.9 |
| BERT (4) | 75.4 | 74.9 | 66.5 | 87.6 | 84.8 | 83.2 | 62.6 | 19.5 | 77.1 | 70.2 |
| DistilBERT (4) | 78.9 | 78.0 | 68.5 | 91.4 | 85.2 | 82.4 | 54.1 | 32.8 | 76.1 | 71.9 |
| BERT-PKD (4) | 79.9 | 79.3 | 70.2 | 89.4 | 85.1 | 82.6 | 62.3 | 24.8 | 79.8 | 72.6 |
| TinyBERT (4) | 82.5 | 81.8 | 71.3 | 92.6 | 87.7 | 86.4 | 62.9 | 43.3 | 79.9 | 76.5 |
| TinyBERT w/o DA (4) | 80.5 | 81.0 | - | - | - | 82.4 | - | 29.8 | - | - |
| TextSmooth (4) | 79.9 | 79.2 | 69.6 | 90.6 | 86.1 | 85.0 | 63.3 | 33.3 | 79.8 | 74.1 |
| BERT (6) | 80.4 | 79.7 | 69.2 | 90.7 | 86.7 | 85.9 | 63.6 | 30.6 | 81.9 | 74.3 |
| BERT-PKD (6) | 81.5 | 81.0 | 70.7 | 92.0 | 89.0 | 85.0 | 65.5 | - | - | - |
| TinyBERT (6) | 84.6 | 83.2 | 71.6 | 93.1 | 90.4 | 87.3 | 70.0 | 51.1 | 83.7 | 79.4 |
| TinyBERT w/o DA (6) | - | - | - | - | - | - | - | - | - | - |
| TextSmooth (6) | 81.9 | 80.9 | 70.3 | 92.8 | 88.0 | 86.4 | 65.7 | 42.7 | 82.8 | 76.8 |

Table 1: Experimental results on the GLUE benchmark test set. TextSmooth is our method.

| Method (# layers) | SQuAD 1.1 EM | SQuAD 1.1 F1 | SQuAD 2.0 EM | SQuAD 2.0 F1 |
|---|---|---|---|---|
| BERT (12) | 80.7 | 88.4 | 73.1 | 76.4 |
| BERT (4) | 67.8 | 77.5 | 60.0 | 63.9 |
| BERT-PKD (4) | 70.1 | 79.5 | 60.8 | 64.6 |
| DistilBERT (4) | 71.8 | 81.2 | 60.6 | 64.1 |
| TinyBERT (4) | 72.7 | 82.1 | 65.3 | 68.8 |
| TextSmooth (4) | 70.9 | 80.5 | 62.9 | 66.2 |
| BERT (6) | 76.5 | 84.7 | 65.7 | 69.0 |
| BERT-PKD (6) | 77.1 | 85.3 | 66.3 | 69.8 |
| DistilBERT (6) | 78.1 | 86.2 | 66.0 | 69.5 |
| TinyBERT (6) | 79.7 | 87.5 | 69.9 | 73.4 |
| TextSmooth (6) | 76.7 | 84.8 | 66.7 | 69.5 |

Table 2: Experimental results on the SQuAD dev set. The result of TinyBERT w/o DA is not reported, so we skip those missing scores.

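Student Learning: Given those smoothed texts, we construct the corpus for teaching the student model, denoted as $\widetilde{D}$, in which the one-hot encoding of each text is replaced by its smoothed counterpart while the positional encoding, segment encoding and label are kept unchanged. The student model is then tuned on $\widetilde{D}$ with ordinary supervised learning. In the inference phase, the student model takes the one-hot encoding of the text as input and makes predictions, the same as in most existing teacher-student KD methods.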
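The text does not spell out how the student consumes a smoothed input. One natural reading, assumed here, is that the input row for each position is multiplied with the student's word embedding table: a one-hot row then reduces to an ordinary embedding lookup, while a smoothed row yields a probability-weighted mix of word embeddings. All names and the toy sizes below are illustrative.

```python
import numpy as np

vocab_size, emb_size = 6, 3
rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, emb_size))  # hypothetical student embedding table

token_ids = np.array([2, 4])
one_hot = np.eye(vocab_size)[token_ids]

# Inference: a one-hot row times the table is an ordinary embedding lookup.
hard_input = one_hot @ embedding

# Training: a smoothed row gives a probability-weighted mix of word embeddings.
uniform = np.full((len(token_ids), vocab_size), 1.0 / vocab_size)  # stand-in for the MLM output
smoothed = 0.5 * one_hot + 0.5 * uniform
soft_input = smoothed @ embedding
```

Because both cases are the same matrix product, the student architecture does not need to change between training on smoothed texts and inference on one-hot texts.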

4 Experiments

4.1 Dataset and Metrics

Following previous work, we test the performance of our proposed KD method on two large-scale datasets: the GLUE benchmark (Wang et al., 2018) and SQuAD (1.1 and 2.0) (Rajpurkar et al., 2016, 2018), covering eleven different tasks in natural language processing. The evaluation metrics in this paper are also the same as in previous work.

4.2 Baselines

As in previous work, we take the pre-trained BERT (12 layers, English, uncased, no whole word masking) as our teacher model, and we use a smaller BERT (with fewer transformer encoder layers) as the student model to be tuned. The baselines include DistilBERT (Sanh et al., 2019), BERT-PKD (Sun et al., 2019) and TinyBERT (Jiao et al., 2019). These methods are all based on mimicking soft labels, while TinyBERT is more complicated and uses Data Augmentation (DA) to boost the performance of the student model. To make a fair comparison, we also take TinyBERT w/o DA as another baseline, to better analyze the difference between label smoothing and text smoothing in BERT distillation. Additionally, we directly fine-tune the student model without any smoothing (denoted as BERT) as an ablation experiment, to validate the benefit of the knowledge distillation algorithms.

4.3 Experiment Setting

We test three different student models: 3-layer BERT, 4-layer BERT, and 6-layer BERT, following previous work. All other hyper-parameters are the same as those of the teacher BERT, and the smoothing coefficient $\lambda$, which controls the smoothing degree, is set to 0.5 in our experiments. As in previous work, the student model is initialized by reusing the weights of the first n layers of the transformer encoder, as well as the embedding layers, of the teacher BERT. We directly copy the experimental results of our baselines where reported; otherwise, we use ‘-’ to represent missing scores, or skip the setting entirely if the baseline methods do not consider it in their experiments.
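The student initialization can be sketched as a filter over the teacher's parameters. This is a minimal sketch with hypothetical parameter names (`embeddings.*`, `encoder.layer.<index>.*`) and toy weights; it also assumes that n equals the number of student layers, which the text implies but does not state.

```python
import numpy as np

def init_student(teacher_params, num_student_layers):
    """Copy the embedding weights and the first encoder layers of the teacher."""
    student = {}
    for name, value in teacher_params.items():
        if name.startswith("embeddings."):
            student[name] = value.copy()
        elif name.startswith("encoder.layer."):
            layer_index = int(name.split(".")[2])
            if layer_index < num_student_layers:
                student[name] = value.copy()
    return student

# Toy 12-layer "teacher" with one weight matrix per layer (illustrative only).
teacher_params = {"embeddings.word": np.zeros((10, 4))}
for i in range(12):
    teacher_params[f"encoder.layer.{i}.weight"] = np.full((4, 4), float(i))

student_params = init_student(teacher_params, num_student_layers=3)
```

With `num_student_layers=3`, the student receives the embedding table and layers 0 to 2 of the teacher; any task-specific output layer would still need its own initialization.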

Figure 2: Five unique fake sentences generated for the raw input text “The film has some of the best special effects ever” using text smoothing.

4.4 Experimental Results

Table 1 and Table 2 show the experimental results of our proposed distillation method as well as all baseline and ablation methods. As demonstrated, the text smoothing method outperforms existing smoothed-label KD methods, including DistilBERT, BERT-PKD and TinyBERT w/o DA, for different student models and on most natural language processing tasks, especially for small student models. The performance of TinyBERT is significantly improved when using data augmentation, which also implies the importance of smoothing the texts, instead of the labels. The performance of the ablation model BERT is significantly worse than both the label smoothing based methods and our text smoothing based method, which confirms the effectiveness of knowledge distillation for BERT. Besides performance, our proposed text smoothing distillation method is also faster than the other methods. It needs only about 92 seconds per epoch on GLUE on average, while BERT-PKD needs more than 233 seconds on the same devices and deep learning framework, which makes our method roughly 2.5 times faster. Moreover, our proposed method can also transfer the teacher knowledge to heterogeneous students, since it does not rely on mimicking the intermediate layers of the teacher BERT, which enlarges the search space for the optimal student model.

5 Analysis

To better understand how text smoothing works and where its limitations lie, we visualize the smoothed representation of an example from the SST-2 corpus by sampling some fake sentences from its smoothed text representation. Figure 2 shows the visualization results. The top sentence (“The film has some of the best special effects ever”) is a raw sentence in SST-2, and the five sentences below it are all generated by randomly sampling a word at each position of the raw input sentence, using its smoothed word distribution. The probability of generating a whole sentence is calculated as the product of the probabilities of generating each word.
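The sampling procedure described above can be sketched as follows. This is a minimal sketch with a made-up five-word vocabulary and a hand-written smoothed distribution; the function name and all probability values are illustrative and are not taken from Figure 2.

```python
import numpy as np

def sample_sentence(smoothed, vocab, rng):
    """Sample one word per position and return the words with the sentence probability."""
    ids = [rng.choice(len(vocab), p=row) for row in smoothed]
    # Sentence probability is the product of the per-position word probabilities.
    prob = float(np.prod([smoothed[pos, idx] for pos, idx in enumerate(ids)]))
    return [vocab[i] for i in ids], prob

vocab = ["the", "film", "movie", "has", "contains"]
# One row per position; each row is a distribution over the toy vocabulary.
smoothed = np.array([
    [1.0, 0.0, 0.0, 0.0, 0.0],  # "the"
    [0.0, 0.7, 0.3, 0.0, 0.0],  # "film" or "movie"
    [0.0, 0.0, 0.0, 0.8, 0.2],  # "has" or "contains"
])

rng = np.random.default_rng(0)
words, prob = sample_sentence(smoothed, vocab, rng)
```

In this toy case the most probable sample is the raw text itself ("the film has", with probability 0.7 × 0.8 = 0.56), because the one-hot term of the interpolation keeps most of the mass on the original word at each position.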

As we can observe, all those sampled sentences are semantically related to the original input text, and the higher the generation probability, the more semantically similar the generated sentence is to the raw text. The sampled sentences are surprisingly diverse in expression (such as has-contains, effects-foley, some-several in Figure 2). Interestingly, the most probable generated sentence is usually the raw input itself, as shown in Figure 2. The reason may be the MLM objective in BERT pre-training, which facilitates the reconstruction and paraphrasing of the input sentence. However, many of the generated sentences contain small grammatical errors (like “the of” in the third sentence), and some of them suffer from semantic shifts (like “productions” in the last sentence). Such errors may harm the performance on tasks that need to consider fine-grained semantic information or be aware of the syntactic structure of the input sentences. Studying the impact of those errors and how to eliminate them is an interesting topic for future research.

6 Conclusion

In this paper, we propose a new knowledge distillation method for compressing BERT. Instead of making the student model mimic the soft labels predicted by the teacher BERT, we leverage the Masked Language Model (MLM) in BERT to generate word distributions for each text in the downstream task corpus, and use those predicted word distributions to smooth the raw input text. The student model then learns from those smoothed texts, rather than smoothed labels. Experimental results on GLUE and SQuAD show that our text smoothing method achieves competitive results compared with most existing BERT KD methods. In the future, we would like to take a deeper look at the principle of text smoothing and try to combine text smoothing with label smoothing for better knowledge distillation of pre-trained language models.