1 Introduction
While language model pre-training, such as BERT (Devlin et al., 2019) and its variants (Yang et al., 2019; Liu et al., 2019; Lan et al., 2019; Raffel et al., 2019), has significantly improved the performance of many natural language processing tasks, these pre-trained models are usually too large to be deployed in resource-limited applications. To address this problem, many researchers have recently investigated Knowledge Distillation (KD) algorithms (Hinton et al., 2015) that transfer the knowledge of a large pre-trained language model (the teacher) into a small neural network (the student) (Sanh et al., 2019; Sun et al., 2019; Jiao et al., 2019), reducing the model size for online deployment. The left part of Figure 1 shows the learning procedure of previous KD methods for BERT. Typically, the student model learns from smoothed labels: given an input text from some downstream task, both BERT and the student model make predictions over the input, and the student is trained to fit the soft labels predicted by BERT (for example, the probability distribution over labels in text classification tasks).

As an alternative, we investigate training a student model with smoothed texts, rather than smoothed labels, in knowledge distillation of BERT. The right part of Figure 1 illustrates our proposed method, which comprises two steps:
- Text Smoothing: Given the pre-trained teacher (BERT) and the downstream task corpus, we feed the one-hot encoding of each text in the corpus into the teacher BERT and fetch the softmax predictions of its Masked Language Model (MLM) head; the original one-hot encodings and the predicted word distributions are then mixed together to smooth the raw texts.
- Student Learning: The student model is tuned on the smoothed corpus of the downstream task, akin to standard supervised training. At inference time, the student model takes only the one-hot encoding as input to make predictions.
We assume that both label smoothing and text smoothing implicitly generate more instances for training the student model, but text smoothing can be more efficient because each forward pass of the network implicitly generates many more instances than label smoothing does (the number of candidate words is much larger than the number of task labels). Our work is inspired by the recent success of using soft word ids to improve performance in Neural Machine Translation (Wang et al., 2019; Gao et al., 2019), and we extend this idea to knowledge distillation of BERT. Empirical results on GLUE (Wang et al., 2018) and SQuAD (Rajpurkar et al., 2016, 2018) show that a student model trained with our proposed text smoothing algorithm achieves competitive results compared with existing BERT KD methods, while being significantly faster than previous distillation algorithms. The major contribution of our work is a new knowledge distillation method for BERT that uses BERT to generate smoothed texts, rather than labels, for teaching the student model.
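To make this intuition concrete, the following back-of-the-envelope comparison (our illustration, assuming a BERT-base vocabulary of roughly 30K words) contrasts the two smoothing signals for an input of length $L$ on a task with $C$ labels:

```latex
% Label smoothing: one teacher forward pass yields a single soft distribution
% over the C task labels (C is only 2 or 3 for most GLUE tasks).
q(y \mid x) \in \Delta^{C-1}

% Text smoothing: one teacher forward pass yields a word distribution at each
% of the L positions, implicitly covering up to V^L candidate input texts,
% where V (the vocabulary size, about 30K for BERT-base) is far larger than C.
\tilde{x} = (\tilde{x}_1, \dots, \tilde{x}_L), \qquad \tilde{x}_t \in \Delta^{V-1}, \qquad V \gg C
```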
2 Related Work
Compressing large pre-trained models like BERT has attracted increasing research interest in recent years. Existing studies can be roughly divided into distillation (Sanh et al., 2019; Sun et al., 2019; Jiao et al., 2019), pruning (McCarley, 2019), and quantization (Zafrir et al., 2019), among which distillation shows the most promising results. In this paper, we focus on knowledge distillation of BERT. Most previous BERT distillation methods make the student model mimic the smoothed labels predicted by the teacher BERT. Sanh et al. (2019) use the large-scale pre-training corpus to generate a massive amount of weak labels for teaching the student model. Sun et al. (2019) force the student model to fit multiple intermediate layers of the teacher BERT to improve the student's performance. Jiao et al. (2019) propose a more systematic approach to constructing an equally effective student of BERT: they use the feedforward, multi-head attention distribution, and embedding layers of the teacher to supervise the student, apply KD in both the pre-training and fine-tuning stages, and leverage Data Augmentation (DA) (Wu et al., 2019) to enlarge the fine-tuning corpus by 20 times.
3 Our Method
Let $T$ denote the teacher BERT and $S$ the student model to be tuned. The dataset of a downstream task is denoted as $D=\{(x_i, p_i, s_i, y_i)\}_{i=1}^{N}$, where $N$ is the number of instances, $x_i$ is the one-hot encoding of a text (a single sentence or a sentence pair), $p_i$ is the positional encoding of $x_i$, $s_i$ is the segment encoding of $x_i$, and $y_i$ is the label of the instance. We train the student model of BERT in two steps:
Text Smoothing: We feed the one-hot encoding $x_i$, the positional encoding $p_i$, and the segment encoding $s_i$ into BERT, and fetch the output of the last layer of its transformer encoder, denoted as:

$$e_i = \mathrm{Embedding}(x_i, p_i, s_i) \qquad (1)$$

$$h_i = \mathrm{TransformerEncoder}(e_i) \qquad (2)$$

where $h_i$ is a 2D dense matrix of shape [sequence_len, embedding_size]. We then multiply $h_i$ with the word embedding matrix $W$ of BERT to obtain the MLM prediction, defined as:

$$\hat{x}_i = \mathrm{softmax}(h_i W^{\top}) \qquad (3)$$
where each row of $\hat{x}_i$ is a probability distribution over the word vocabulary, representing BERT's belief about the choice of word at that position of the input text. It is worth noting that we do not apply mask corruption when fetching the MLM prediction in text smoothing, and the teacher BERT is not fine-tuned on the downstream task corpus; our experimental results show that mask corruption may harm the distillation performance. The input text $x_i$ is finally smoothed using a simple linear interpolation:

$$\tilde{x}_i = \lambda\, x_i + (1 - \lambda)\, \hat{x}_i \qquad (4)$$

where $\lambda$ controls the degree of smoothing.
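To make the equations above concrete, here is a minimal sketch of the text-smoothing step, assuming the HuggingFace transformers library and a bert-base-uncased teacher; the helper name smooth_text and the argument smooth_lambda are ours, not from the paper.

```python
import torch
import torch.nn.functional as F
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
teacher = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

def smooth_text(sentence, smooth_lambda=0.5):
    """Return a [seq_len, vocab_size] mixture of one-hot and MLM distributions."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        # No [MASK] corruption: raw token ids are fed in, and the teacher is the
        # pre-trained BERT, not a model fine-tuned on the task corpus.
        logits = teacher(**enc).logits[0]          # [seq_len, vocab_size]
    mlm_dist = F.softmax(logits, dim=-1)           # Eq. (3)
    one_hot = F.one_hot(enc["input_ids"][0], num_classes=mlm_dist.size(-1)).float()
    return smooth_lambda * one_hot + (1.0 - smooth_lambda) * mlm_dist  # Eq. (4)
```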
Table 1: Experimental results on the GLUE benchmark.

| Method (# layers) | MNLI-m | MNLI-mm | QQP | SST-2 | QNLI | MRPC | RTE | CoLA | STS-B | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
BERT (12) | 84.6 | 83.4 | 71.2 | 93.5 | 90.5 | 88.9 | 66.4 | 52.1 | 85.8 | 79.6 |
BERT (3) | 74.8 | 74.3 | 65.8 | 86.4 | 84.3 | 80.5 | 55.2 | 16.8 | 67.5 | 68.3 |
BERT-PKD (3) | 76.7 | 76.3 | 68.1 | 87.5 | 84.7 | 80.7 | 58.2 | - | - | - |
TextSmooth (3) | 77.2 | 76.1 | 66.7 | 88.4 | 85.2 | 82.3 | 60.7 | 23.8 | 77.7 | 70.9 |
BERT (4) | 75.4 | 74.9 | 66.5 | 87.6 | 84.8 | 83.2 | 62.6 | 19.5 | 77.1 | 70.2 |
DistilBERT (4) | 78.9 | 78.0 | 68.5 | 91.4 | 85.2 | 82.4 | 54.1 | 32.8 | 76.1 | 71.9
BERT-PKD (4) | 79.9 | 79.3 | 70.2 | 89.4 | 85.1 | 82.6 | 62.3 | 24.8 | 79.8 | 72.6 |
TinyBERT (4) | 82.5 | 81.8 | 71.3 | 92.6 | 87.7 | 86.4 | 62.9 | 43.3 | 79.9 | 76.5 |
TinyBERT w/o DA (4) | 80.5 | 81.0 | - | - | - | 82.4 | - | 29.8 | - | - |
TextSmooth (4) | 79.9 | 79.2 | 69.6 | 90.6 | 86.1 | 85.0 | 63.3 | 33.3 | 79.8 | 74.1 |
BERT (6) | 80.4 | 79.7 | 69.2 | 90.7 | 86.7 | 85.9 | 63.6 | 30.6 | 81.9 | 74.3 |
BERT-PKD (6) | 81.5 | 81.0 | 70.7 | 92.0 | 89.0 | 85.0 | 65.5 | - | - | - |
TinyBERT (6) | 84.6 | 83.2 | 71.6 | 93.1 | 90.4 | 87.3 | 70.0 | 51.1 | 83.7 | 79.4 |
TinyBERT w/o DA (6) | - | - | - | - | - | - | - | - | - | - |
TextSmooth (6) | 81.9 | 80.9 | 70.3 | 92.8 | 88.0 | 86.4 | 65.7 | 42.7 | 82.8 | 76.8 |
Table 2: Experimental results on SQuAD 1.1 and SQuAD 2.0.

| Method (# layers) | SQuAD 1.1 EM | SQuAD 1.1 F1 | SQuAD 2.0 EM | SQuAD 2.0 F1 |
| --- | --- | --- | --- | --- |
BERT (12) | 80.7 | 88.4 | 73.1 | 76.4 |
BERT (4) | 67.8 | 77.5 | 60.0 | 63.9 |
BERT-PKD (4) | 70.1 | 79.5 | 60.8 | 64.6 |
DistilBERT (4) | 71.8 | 81.2 | 60.6 | 64.1 |
TinyBERT (4) | 72.7 | 82.1 | 65.3 | 68.8 |
TextSmooth (4) | 70.9 | 80.5 | 62.9 | 66.2 |
BERT (6) | 76.5 | 84.7 | 65.7 | 69.0 |
BERT-PKD (6) | 77.1 | 85.3 | 66.3 | 69.8 |
DistilBERT (6) | 78.1 | 86.2 | 66.0 | 69.5 |
TinyBERT (6) | 79.7 | 87.5 | 69.9 | 73.4 |
TextSmooth (6) | 76.7 | 84.8 | 66.7 | 69.5 |
Student Learning: Given the smoothed texts, we construct the corpus for teaching the student model, denoted as $\tilde{D}=\{(\tilde{x}_i, p_i, s_i, y_i)\}_{i=1}^{N}$. The student model is then tuned on $\tilde{D}$ in a standard supervised manner. In the inference phase, the student model takes the one-hot encoding of the text as input and makes predictions, the same as most existing teacher-student KD methods.
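The paper does not prescribe the exact mechanism by which the student consumes a smoothed text; one natural realization, following the soft-word-id idea from the NMT work cited above, is to replace the embedding lookup with a weighted sum over the student's word-embedding matrix, as in the sketch below (the module name SoftInputEmbedding is ours).

```python
import torch
import torch.nn as nn

class SoftInputEmbedding(nn.Module):
    """Soft lookup for a [batch, seq_len, vocab_size] word distribution."""

    def __init__(self, word_embedding: nn.Embedding):
        super().__init__()
        self.word_embedding = word_embedding  # shared with the student BERT

    def forward(self, word_dist: torch.Tensor) -> torch.Tensor:
        # Expected embedding under the smoothed distribution; a one-hot input
        # reduces to the ordinary lookup the student uses at inference time.
        return word_dist @ self.word_embedding.weight  # [batch, seq_len, hidden]
```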
4 Experiments
4.1 Dataset and Metrics
Following previous work, we test the performance of our proposed KD method on two large-scale benchmarks: GLUE (Wang et al., 2018) and SQuAD (1.1 and 2.0) (Rajpurkar et al., 2016, 2018), covering eleven different natural language processing tasks. The evaluation metrics in this paper are also the same as in previous work.
4.2 Baselines
Following previous work, we take the pre-trained BERT (12 layers, English, uncased, no whole word masking) as our teacher model and use a smaller BERT (with fewer transformer encoder layers) as the student model to be tuned. The baselines include DistilBERT (Sanh et al., 2019), BERT-PKD (Sun et al., 2019), and TinyBERT (Jiao et al., 2019). These methods are all based on mimicking soft labels, while TinyBERT is more complicated and uses Data Augmentation (DA) to boost the performance of the student model. To make a fair comparison, we also take TinyBERT w/o DA as another baseline, to better analyze the difference between label smoothing and text smoothing in BERT distillation. Additionally, as an ablation, we study directly fine-tuning the student model without any smoothing (denoted as BERT), to validate the performance of all knowledge distillation algorithms.
4.3 Experiment Setting
We test three different student models, following previous work: 3-layer, 4-layer, and 6-layer BERT. All other hyper-parameters are the same as those of the teacher BERT, and $\lambda$ is set to 0.5 in our experiments. As in previous work, the student model is initialized by reusing the weights of the first $n$ layers of the transformer encoder as well as the embedding layers of the teacher BERT. We directly copy the experimental results of our baselines if reported; otherwise, we use '-' to represent missing scores or skip a setting entirely if the baseline methods did not consider it in their experiments.
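As an illustration of the student initialization described above, the following sketch (assuming the HuggingFace transformers library; build_student is our helper name) copies the teacher's embedding layers and its first n encoder layers into a shallower student.

```python
import copy
from transformers import BertModel

def build_student(teacher: BertModel, num_layers: int = 4) -> BertModel:
    """Initialize a shallower student from the teacher's first `num_layers` layers."""
    config = copy.deepcopy(teacher.config)
    config.num_hidden_layers = num_layers
    student = BertModel(config)
    # Reuse the teacher's embedding layers.
    student.embeddings.load_state_dict(teacher.embeddings.state_dict())
    # Reuse the first `num_layers` transformer encoder layers.
    for i in range(num_layers):
        student.encoder.layer[i].load_state_dict(teacher.encoder.layer[i].state_dict())
    return student

# teacher = BertModel.from_pretrained("bert-base-uncased")
# student = build_student(teacher, num_layers=4)
```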

4.4 Experimental Results
Table 1 and Table 2 show the experimental results of our proposed distillation method as well as all baseline and ablation methods. As demonstrated, the text smoothing method consistently outperforms existing soft-label-based KD methods, including DistilBERT, BERT-PKD, and TinyBERT w/o DA, across different student model sizes and most natural language processing tasks, especially for small student models. The performance of TinyBERT is significantly improved when using data augmentation, which also suggests the importance of smoothing the texts rather than the labels. The ablation model BERT performs significantly worse than both the label-smoothing-based methods and our text-smoothing-based method, which confirms the effectiveness of knowledge distillation for BERT. Beyond accuracy, our proposed text smoothing distillation method is also faster than the other methods: it takes only about 92 seconds per epoch on GLUE on average, while BERT-PKD needs more than 233 seconds on the same devices and deep learning framework. Moreover, our method can transfer the teacher's knowledge to heterogeneous students, since it does not rely on mimicking the intermediate layers of the teacher BERT, which enlarges the search space for the optimal student model.
5 Analysis
To better understand how text smoothing works and where its limitations lie, we visualize the smoothed representation of an example from the SST-2 corpus by sampling fake sentences from its smoothed text representation. Figure 2 shows the visualization results: the top sentence ("The film has some of the best special effects ever") is a raw sentence from SST-2, and the five sentences below are generated by randomly sampling a word at each position of the raw input sentence from its smoothed word distribution. The probability of generating a whole sentence is computed as the product of the probabilities of generating each word.
As we can observe, all the sampled sentences are semantically related to the original input text, and the higher the generation probability, the more semantically similar the generated sentence is to the raw text. The sampled sentences are surprisingly diverse in expression (for example, has-contains, effects-foley, and some-several in Figure 2). Interestingly, the most probable generated sentence is usually the raw input itself, as shown in Figure 2. The reason may be the MLM objective in BERT pre-training, which facilitates reconstructing and paraphrasing the input sentence. However, many of the generated sentences contain small grammatical errors (such as "the of" in the third sentence), and some suffer from semantic shifts (such as "productions" in the last sentence). Such errors may harm performance on tasks that require fine-grained semantic information or awareness of the syntactic structure of the input. Studying the impact of these errors and how to eliminate them would be an interesting topic for future research.
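The sampling and scoring procedure behind this visualization can be sketched as follows, assuming the smooth_text helper sketched in Section 3 (the function name sample_sentences is ours).

```python
import torch

def sample_sentences(smoothed, tokenizer, num_samples=5):
    """smoothed: [seq_len, vocab_size] mixture of one-hot and MLM distributions."""
    samples = []
    for _ in range(num_samples):
        # Draw one word per position from its smoothed distribution.
        token_ids = torch.multinomial(smoothed, num_samples=1).squeeze(-1)
        # Sentence probability = product of each sampled word's probability.
        word_probs = smoothed.gather(1, token_ids.unsqueeze(-1)).squeeze(-1)
        samples.append((tokenizer.decode(token_ids), word_probs.prod().item()))
    return sorted(samples, key=lambda s: s[1], reverse=True)
```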
6 Conclusion
In this paper, we propose a new knowledge distillation method for compressing BERT. Instead of making the student model mimic the soft labels predicted by the teacher BERT, we leverage the Masked Language Model (MLM) in BERT to generate word distributions for each text in the downstream task corpus and use those predicted word distributions to smooth the raw input texts. The student model then learns from these smoothed texts rather than smoothed labels. Experimental results on GLUE and SQuAD show that our text smoothing method achieves competitive results compared with most existing BERT KD methods. In the future, we would like to investigate the underlying principles of text smoothing and explore combining text smoothing with label smoothing for better knowledge distillation of pre-trained language models.
References
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Gao et al. (2019) Fei Gao, Jinhua Zhu, Lijun Wu, Yingce Xia, Tao Qin, Xueqi Cheng, Wengang Zhou, and Tie-Yan Liu. 2019. Soft contextual data augmentation for neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5539–5544, Florence, Italy. Association for Computational Linguistics.
- Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- Jiao et al. (2019) Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2019. TinyBERT: Distilling BERT for natural language understanding. arXiv preprint arXiv:1909.10351.
- Lan et al. (2019) Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
- Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- McCarley (2019) JS McCarley. 2019. Pruning a BERT-based question answering model. arXiv preprint arXiv:1910.06360.
- Raffel et al. (2019) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
- Rajpurkar et al. (2018) Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, Melbourne, Australia. Association for Computational Linguistics.
- Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. CoRR, abs/1606.05250.
- Sanh et al. (2019) Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
- Sun et al. (2019) Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. 2019. Patient knowledge distillation for BERT model compression. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4314–4323, Hong Kong, China. Association for Computational Linguistics.
- Wang et al. (2018) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. CoRR, abs/1804.07461.
- Wang et al. (2019) Yiren Wang, Yingce Xia, Fei Tian, Fei Gao, Tao Qin, Cheng Xiang Zhai, and Tie-Yan Liu. 2019. Neural machine translation with soft prototype. In Advances in Neural Information Processing Systems 32, pages 6313–6322. Curran Associates, Inc.
- Wu et al. (2019) Xing Wu, Shangwen Lv, Liangjun Zang, Jizhong Han, and Songlin Hu. 2019. Conditional BERT contextual augmentation. In International Conference on Computational Science, pages 84–95. Springer.
- Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems 32, pages 5754–5764. Curran Associates, Inc.
- Zafrir et al. (2019) Ofir Zafrir, Guy Boudoukh, Peter Izsak, and Moshe Wasserblat. 2019. Q8BERT: Quantized 8Bit BERT. arXiv preprint arXiv:1910.06188.