Recently, the NLP community has witnessed the effectiveness of large pre-trained language models, such as BERT [8], ELECTRA [5], XLNet [38], and OpenGPT-2 [6]. These large language models have significantly improved the performance on many NLP downstream tasks, e.g., the GLUE benchmark [31]. However, pre-trained language models usually contain hundreds of millions of parameters and suffer from high computation cost and latency in real-world applications [19, 36, 39]. Hence, to make large pre-trained language models applicable to a broader range of applications [10, 40, 14, 15], it is necessary to reduce the computation overhead and accelerate both their fine-tuning and inference.
In the literature, several effective techniques, such as quantization [11, 13] and knowledge distillation [27, 25, 16], have been explored to tackle the computation overhead of large pre-trained language models. In this paper, we focus on knowledge distillation [12], which transfers the knowledge embedded in a large teacher model to a small student model, because its effectiveness has been demonstrated in several pieces of work, e.g., DistilBERT [25] and TinyBERT [16]. However, these methods usually discard the teacher’s representations after obtaining a learned student model, which may yield sub-optimal performance.
Different from existing work, we propose RefBERT, which utilizes the teacher’s representations on reference samples to compress BERT into a small student model while maintaining the model performance. Here, the reference sample of an input sample is the most similar sample in the dataset (but not the input itself) under a certain similarity criterion, e.g., containing the most common keywords or following the most similar structure. RefBERT makes two modifications to the original Transformer layers. (1) The key and the value in the first Transformer layer: in the multi-head attention networks of the student’s first Transformer layer, the key utilizes both the student’s embedding of the input sample and the teacher’s embedding of the reference sample, while the value utilizes both the student’s embedding of the input sample and the teacher’s last-layer representation of the reference sample. By simply concatenating them, we can effectively absorb their information through the self-attention mechanism. (2) Shifting the normalized attention score: we subtract a constant from the attention score (normalized by the softmax function) to amplify its effect. The subtraction places negative attention on non-informative tokens and discards the impact of unrelated parts in the next layer. We then follow the setup of TinyBERT to learn the student’s parameters by distilling the embedding layer, the hidden states, the attention weights, and the prediction layer. More importantly, we present a theoretical analysis that justifies the selection of the mean-squared-error (MSE) loss function and reveals that including any related reference sample during compression increases the information absorption of RefBERT.
We highlight the contribution of our work as follows:
We propose a novel knowledge distillation method, namely RefBERT, for Transformer-based architectures to transfer the linguistic knowledge encoded in the teacher BERT to a small student model through the reference mechanism. Hence, the student model can utilize the teacher’s information during inference.
We modify the query, key, and value of the multi-head attention networks at the first layer of the student model to utilize the teacher’s embeddings and its representations in the last Transformer layer. A constant is subtracted from the attention score to discard the effect of non-informative tokens. More importantly, a theoretical analysis is provided to justify the selection of the MSE loss and the inclusion of reference samples for information absorption.
We conduct experimental evaluations and show that RefBERT beats TinyBERT without data augmentation by over 8.1% and attains more than 94% of the teacher’s performance on the GLUE benchmark. Meanwhile, RefBERT is 7.4x smaller and 9.5x faster on inference than BERT-base.
II Related Work
We review two main streams of related work in the following:
Pre-training. In natural language processing, researchers have trained models on huge amounts of unlabeled text to learn precise word representations. The models range from word embeddings [22, 23] to contextual word representations [24, 20] and the recently developed powerful pre-trained language models, such as BERT [8], ELECTRA [5], XLNet [38], and OpenGPT-2 [6]. After that, researchers usually apply the fine-tuning mechanism to update the large pre-trained representations for downstream tasks with a small number of task-specific parameters [8, 21, 17, 18]. However, the high demand for computational resources has hindered their applicability to a broader range of scenarios, especially resource-limited applications.
Distillation with unsupervised pre-training. Early attempts try to leverage the unsupervised pre-training representations for behavior distillation, e.g., label distillation [3, 27] and task-specific knowledge distillation [28]. The distillation procedure is also extended from single-task learning to multi-task learning, i.e., distilling knowledge from multiple teacher models into a light-weight student model [4, 37]. Knowledge distillation [12] has been frequently applied to compress a larger teacher model into a compact student model. For example, DistilBERT [25] attempts to compress the intermediate layers of BERT into a smaller student model. TinyBERT [16] further explores more distillation losses while utilizing data augmentation to attain good performance on downstream NLP tasks. InfoBERT [32] tries to improve the model robustness based on an information-theoretic guarantee. However, these methods discard the teacher’s direct information after obtaining the student model. The student may therefore lack sufficient information and yield sub-optimal performance during inference.
III Notations and Problem Statement
We first define some notations. Bold capital letters indicate matrices. Bold lowercase letters indicate vectors or sequences. Letters in calligraphic or blackboard bold fonts indicate sets, where $\mathbb{R}^n$ denotes an $n$-dimensional real space. With a slight abuse of notation, we use lowercase letters to denote indices or density functions, and uppercase letters to denote probability measures or random variables. $\top$ denotes the transpose operator, vectors are joined by a concatenation operator, and $\mathrm{MSE}(\cdot)$ defines the mean squared error loss function.
The task of knowledge distillation from a teacher model is to build a compact student model that fits real-world constraints, e.g., memory and latency budgets, in downstream tasks. In this paper, we fix the model architecture to the Transformer because it is a well-known architecture attaining superior performance in a wide range of NLP tasks [30, 8].
To simplify the exposition, we define the vanilla Transformer layer for the encoder following [30]. Its multi-head attention (MH) is computed with $h$ attention heads, where $d$ is the teacher’s hidden size.
Hence, for an input with $n$ sub-words, the teacher’s representation at each layer is computed by stacking Transformer layers on top of the embedding of the teacher model, which is usually computed by the summation of the token embedding, the position embedding, and the segment embedding, if any. In the compression, we will utilize the teacher’s information, namely its embedding and its output in the last Transformer layer, to derive the corresponding student model.
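As a point of reference, the vanilla encoder layer and the multi-head attention of [30] are commonly written as sketched below. The notation is an assumption made for illustration: $\mathrm{LN}$ (layer normalization), $\mathrm{FFN}$ (feed-forward network), the projection matrices $\mathbf{W}_i^Q$, $\mathbf{W}_i^K$, $\mathbf{W}_i^V$, $\mathbf{W}^O$, the superscript $T$ for the teacher, and the number of teacher layers $N$ are illustrative symbols.

```latex
\begin{align*}
\mathrm{TRF}(\mathbf{H}) &= \mathrm{LN}\big(\mathbf{H}' + \mathrm{FFN}(\mathbf{H}')\big), \qquad
\mathbf{H}' = \mathrm{LN}\big(\mathbf{H} + \mathrm{MH}(\mathbf{H}, \mathbf{H}, \mathbf{H})\big), \\
\mathrm{MH}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) &= [\mathrm{head}_1; \ldots; \mathrm{head}_h]\,\mathbf{W}^O, \\
\mathrm{head}_i &= \mathrm{softmax}\!\left(\frac{(\mathbf{Q}\mathbf{W}_i^Q)(\mathbf{K}\mathbf{W}_i^K)^{\top}}{\sqrt{d/h}}\right)\mathbf{V}\mathbf{W}_i^V, \\
\mathbf{H}_0^T &= \mathbf{E}^T, \qquad \mathbf{H}_i^T = \mathrm{TRF}(\mathbf{H}_{i-1}^T), \quad i = 1, \ldots, N.
\end{align*}
```

In this form the teacher passes the same hidden states as query, key, and value, and the embedding $\mathbf{E}^T$ and the last-layer output $\mathbf{H}_N^T$ are the two teacher quantities that the compression reuses.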
For fast experimentation, we use a single teacher, without making a statement about the best architectural choice, and adopt the pre-trained BERT-base [8] as the teacher model. Other models, e.g., XLNet [38], can be easily deployed and tested within our framework. To lessen the variation of the model, we also apply the same number of self-attention heads to both the teacher and the student models.
We are given the following data:
Unlabeled language model data: a collection of texts for representation learning. These corpora are usually collected from Wikipedia or other open domains and contain billions of words without strong domain information.
Labeled data: a set of training examples, each consisting of an input in a fixed-dimensional space and its response. For most NLP downstream tasks, labeled data require considerable manual effort from experts and are thus restricted in size.
In our work, we apply these data to train the student model from the teacher model as stated in the next section.
In this section, we present our proposed RefBERT and provide the theoretical justification to support our proposal.
IV-A Model Architecture
Following the setup of TinyBERT [16], we break the distillation into two stages: the general distillation (GD) stage and the task distillation (TD) stage. In the following, we first outline the GD stage. We first construct a new reference dataset that pairs each sample in the unlabeled data with a reference example.
The reference example is different from the input sample, but is the most similar one based on a certain criterion, e.g., containing the most common keywords or following the most similar structure.
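The pairing step can be illustrated with a small sketch. It assumes the "most common keywords" criterion is approximated by Jaccard overlap of whitespace tokens; `tokenize`, `jaccard`, and `pick_reference` are hypothetical helper names, not part of the original method.

```python
def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for a real tokenizer."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pick_reference(index, corpus):
    """Index of the most similar sample that is not identical to corpus[index]."""
    query = tokenize(corpus[index])
    best, best_score = None, -1.0
    for j, candidate in enumerate(corpus):
        if j == index or candidate == corpus[index]:
            continue  # the reference must differ from the input itself
        score = jaccard(query, tokenize(candidate))
        if score > best_score:
            best, best_score = j, score
    return best


corpus = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "stock prices fell sharply today",
    "the cat sat on the mat",  # exact duplicate of sample 0, never chosen for it
]
reference_pairs = [(i, pick_reference(i, corpus)) for i in range(len(corpus))]
```

Excluding exact duplicates reflects the requirement that the reference be similar to, but not the same as, the input sample.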
Let the student model have its own hidden size and number of Transformer layers. Usually, both are smaller than those of the teacher because we want to learn a smaller student model while preserving the teacher’s performance.
As illustrated in Fig. 1, given an input sample and its reference sample, we obtain the embedding and the last-layer output of the reference sample from the teacher model. We modify the first Transformer layer in the student model to absorb the reference information while keeping the remaining layers the same as in the teacher model. More specifically, the first Transformer layer of the student model differs from the teacher’s in its multi-head attention.
To guarantee compatibility, the number of attention heads in the student is the same as that in the teacher model in Eq. (2).
We highlight several points about the difference between the student model and the teacher model in the above computation:
At the student’s first Transformer layer, the query, key, and value are mapped to a projected space and are thus slightly different from the original query, key, and value, which are used without mapping in the teacher model; see Eq. (1). This yields a different computation of the self-attention score in Eq. (12) of the student model, in contrast to that in Eq. (5) of the teacher model.
Different from the teacher model, the key and the value include the information of both the input sample and the reference sample via concatenation. In the key, the concatenated component is related to the teacher’s embedding of the reference sample, while in the value, the component is related to the teacher’s output in the last Transformer layer on the reference sample. With this setting and the self-attention mechanism, we can absorb the matched information between the input sample and the reference sample, while utilizing the teacher’s last-layer output on the reference sample for the final output.
In Eq. (13), we subtract a constant from the normalized attention score. This is a key setup borrowed from [33] to make some of the attention weights negative so that gradients have different signs at back-propagation. Note that the original attention score ranges from 0 to 1, where a higher value implies a high correlation between the corresponding tokens and a lower value implies little correlation. By subtracting a constant from the score, we can place negative attention on non-informative tokens and discard the impact of unrelated parts in the next layer.
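The points above can be sketched as a single-head attention in plain Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the learned projections are omitted (treated as identity), all inputs are assumed to already share one dimension, the concatenation is assumed to run along the sequence axis, and `reference_attention` and `shift` are hypothetical names.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def reference_attention(student_emb, ref_emb, ref_last, shift):
    """Attention whose keys and values are extended with reference rows."""
    d = len(student_emb[0])
    keys = student_emb + ref_emb      # student rows, then teacher embedding of the reference
    values = student_emb + ref_last   # student rows, then teacher last-layer output of the reference
    scores = matmul(student_emb, transpose(keys))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    shifted = [[w - shift for w in row] for row in weights]  # entries may become negative
    return shifted, matmul(shifted, values)


student_emb = [[1.0, 0.0], [0.0, 1.0]]  # two input tokens
ref_emb = [[1.0, 1.0]]                  # one reference token (teacher embedding)
ref_last = [[0.5, -0.5]]                # same token (teacher last-layer output)
weights, output = reference_attention(student_emb, ref_emb, ref_last, shift=0.1)
```

Each row of `weights` attends over the two input tokens plus the reference token, and after the shift a token whose softmax weight is below the constant receives negative attention.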
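Hence, for a pair of an input sample and its reference sample, RefBERT computes the output of its first layer with the modified Transformer layer and the outputs of the remaining layers with the vanilla Transformer layer.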
IV-B Model Training
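Following TinyBERT [16], we let $g(\cdot)$ be a mapping function indicating that the $m$-th layer of the student model learns the information from the $g(m)$-th layer of the teacher model. Hence, $g(0) = 0$ in the embedding layer and $g(M) = N$ in the last Transformer layer, where $M$ and $N$ are the numbers of Transformer layers in the student and the teacher, respectively.
In the training, we follow the setup of TinyBERT and apply the mean squared error (MSE) to distill the information. That is, both the embedding layer and the prediction layer are distilled, while in the Transformer layers, the attentions and the hidden states are distilled. Hence, we derive the same objective function:
$$\mathcal{L} = \sum_{m=0}^{M+1} \lambda_m \mathcal{L}_{\mathrm{layer}}(S_m, T_{g(m)}), \tag{16}$$
$$\mathcal{L}_{\mathrm{layer}}(S_m, T_{g(m)}) = \begin{cases} \mathcal{L}_{\mathrm{embd}}(S_0, T_0), & m = 0, \\ \mathcal{L}_{\mathrm{hidn}}(S_m, T_{g(m)}) + \mathcal{L}_{\mathrm{attn}}(S_m, T_{g(m)}), & M \ge m > 0, \\ \mathcal{L}_{\mathrm{pred}}(S_{M+1}, T_{N+1}), & m = M + 1, \end{cases} \tag{17}$$
where $S_m$ and $T_{g(m)}$ denote the $m$-th layer of the student and the corresponding layer of the teacher, and $\lambda_m$ is the weight of the loss at the $m$-th layer.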
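In Eq. (17), the distillation losses consist of:
Embedding-layer distillation. MSE is adopted to penalize the difference between the embeddings in the student and the teacher:
$$\mathcal{L}_{\mathrm{embd}} = \mathrm{MSE}(\mathbf{E}^S \mathbf{W}_e, \mathbf{E}^T),$$
where $\mathbf{E}^S$ and $\mathbf{E}^T$ are the embeddings of the student and the teacher, and $\mathbf{W}_e$ is a weight matrix to linearly map the embedding in the student to the same space as that in the teacher.
Hidden state distillation. MSE is adopted to penalize the difference between the hidden states in the student and the teacher:
$$\mathcal{L}_{\mathrm{hidn}} = \mathrm{MSE}(\mathbf{H}^S \mathbf{W}_h, \mathbf{H}^T),$$
where $\mathbf{W}_h$ plays the same role as $\mathbf{W}_e$ to linearly map the hidden state in the student to the same space as that in the teacher.
Attention distillation. MSE is adopted to penalize the difference between the attention weights in the student and the teacher:
$$\mathcal{L}_{\mathrm{attn}} = \frac{1}{h} \sum_{i=1}^{h} \mathrm{MSE}(\mathbf{A}_i^S, \mathbf{A}_i^T),$$
where $h$ is the number of attention heads, $\mathbf{A}_i$ is the attention matrix of the $i$-th head of the student or the teacher, and $l$ is the length of the input text, which determines the size of $\mathbf{A}_i$. It is noted that the attention score is unnormalized and computed as in Eq. (5) or Eq. (12) because a faster convergence is verified in TinyBERT.
Prediction-layer distillation. The soft cross-entropy loss is adopted on the logits of the student and the teacher so that the student also fits the predictions of the teacher:
$$\mathcal{L}_{\mathrm{pred}} = -\mathrm{softmax}(\mathbf{z}^T) \cdot \mathrm{log\_softmax}(\mathbf{z}^S / t),$$
where $\mathbf{z}^T$ and $\mathbf{z}^S$ are the logits of the teacher and the student computed from fully-connected networks, respectively. $\mathrm{log\_softmax}(\cdot)$ denotes the log likelihood. $t$ is a scalar representing the temperature value and is usually set to 1.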
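The four loss terms can be sketched in plain Python as follows. This is a minimal illustration rather than the training code: matrices are lists of rows, the projection `w` stands for the linear map from the student space to the teacher space, and the temperature is applied to the student logits only, which is an assumption about the exact form of the prediction loss. All function names are hypothetical.

```python
import math


def mse(a, b):
    """Mean squared error between two equally shaped matrices (lists of rows)."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def project(h, w):
    """Linearly map student states h (n x d_s) with w (d_s x d_t)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*w)] for row in h]


def hidden_loss(student_h, teacher_h, w):
    """MSE between projected student states and teacher states.

    The same form serves for the embedding-layer loss.
    """
    return mse(project(student_h, w), teacher_h)


def attention_loss(student_heads, teacher_heads):
    """MSE between attention matrices, averaged over heads."""
    per_head = [mse(s, t) for s, t in zip(student_heads, teacher_heads)]
    return sum(per_head) / len(per_head)


def log_softmax(z):
    """Numerically stable log-softmax of a list of logits."""
    m = max(z)
    log_total = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - log_total for v in z]


def soft_cross_entropy(teacher_logits, student_logits, t=1.0):
    """Soft cross-entropy: -softmax(z_teacher) . log_softmax(z_student / t)."""
    p = [math.exp(v) for v in log_softmax(teacher_logits)]
    log_q = log_softmax([z / t for z in student_logits])
    return -sum(pi * lq for pi, lq in zip(p, log_q))


# Toy check: a 2-dim student state mapped into a 3-dim teacher space.
w = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
student_h = [[1.0, 2.0]]
teacher_h = [[1.0, 2.0, 3.0]]
loss_hidden = hidden_loss(student_h, teacher_h, w)
loss_pred = soft_cross_entropy([0.0, 0.0], [0.0, 0.0])
```

With the projection above, the mapped student state equals the teacher state, so the hidden-state loss vanishes; with identical uniform logits, the prediction loss reduces to the entropy of a uniform distribution over two classes.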
V Theoretical Analysis
We provide some theoretical insight into the distillation procedure by examining the mutual information between the student model and the teacher model. From an information-theoretic perspective, the goal of compressing a large model like BERT is to retain as much information as possible, that is, to maximize the mutual information between the student model and the teacher model.
Let $(X, Y)$ be a pair of random variables with values over the space $\mathcal{X} \times \mathcal{Y}$. The mutual information is defined by
$$I(X; Y) = H(Y) - H(Y|X),$$
where $H(\cdot)$ denotes the entropy and $H(\cdot|\cdot)$ denotes the conditional entropy.
Justification of the Loss Function. Usually, we can apply $Y$ to represent the output from the teacher and $X$ to represent the output of the student. Since $H(Y)$ is fixed, maximizing $I(X; Y)$ is equivalent to minimizing $H(Y|X)$. The conditional entropy quantifies the amount of additional information needed to describe $Y$ given $X$. Since it is difficult to directly compute $H(Y|X)$, we derive an upper bound for the conditional entropy during compression.
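Theorem 1 (Upper Bound of Conditional Entropy).
Let $(X, Y)$ be a pair of random variables with values over the space $\mathcal{X} \times \mathcal{Y}$. Then the conditional entropy $H(Y|X)$ is upper bounded by the expected squared error between $Y$ and its prediction from $X$, up to a positive scaling factor, plus a constant.
Proof. We introduce a variational distribution $q(y|x)$. That is, the conditional probability of $Y$ given $X$ follows a Gaussian distribution with a diagonal covariance matrix. Substituting $q$ for the true conditional distribution in the conditional entropy, the normalization term of the Gaussian is a constant, and the resulting inequality holds because the Kullback-Leibler divergence is non-negative. ∎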
Theorem 1 justifies that MSE is an effective surrogate to guarantee information absorption.
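One way to spell out this argument is sketched below. It is a reconstruction under explicit assumptions rather than the original derivation: the variational distribution is taken to be an isotropic Gaussian $q(y|x) = \mathcal{N}(y; \mu(x), \sigma^2 \mathbf{I}_k)$ with a fixed variance $\sigma^2$, $\mu(x)$ is the student's prediction of the teacher output, $k$ is the output dimension, and $c = \frac{k}{2}\log(2\pi\sigma^2)$.

```latex
\begin{align*}
H(Y|X) &= -\mathbb{E}_{p(x,y)}\big[\log p(y|x)\big] \\
&= -\mathbb{E}_{p(x,y)}\big[\log q(y|x)\big] - \mathbb{E}_{p(x)}\big[\mathrm{KL}\big(p(y|x)\,\|\,q(y|x)\big)\big] \\
&= \mathbb{E}_{p(x,y)}\!\left[\frac{\|y - \mu(x)\|_2^2}{2\sigma^2} + \frac{k}{2}\log(2\pi\sigma^2)\right] - \mathbb{E}_{p(x)}\big[\mathrm{KL}\big(p(y|x)\,\|\,q(y|x)\big)\big] \\
&= \frac{1}{2\sigma^2}\,\mathbb{E}_{p(x,y)}\big[\|y - \mu(x)\|_2^2\big] + c - \mathbb{E}_{p(x)}\big[\mathrm{KL}\big(p(y|x)\,\|\,q(y|x)\big)\big] \\
&\le \frac{1}{2\sigma^2}\,\mathbb{E}_{p(x,y)}\big[\|y - \mu(x)\|_2^2\big] + c .
\end{align*}
```

Under these assumptions, minimizing the expected squared error between the student's prediction and the teacher's output tightens an upper bound on $H(Y|X)$, which is the sense in which MSE acts as a surrogate.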
Justification of the Usage of the Reference Sample. To see the information flow between the layers, we take a closer look at the mutual information between the student and the teacher at each layer. First, we present the following two theoretical results.
Theorem 2 (Decrease in Mutual Information).
For any mapping function $f$, which maps a random variable $X$ to another random variable $f(X)$ through some learning process, we have the following inequality:
$$I(X; Y) \ge I(f(X); Y).$$
This is the data processing inequality [7, 26]: in deep learning, the mutual information between the learned representation and the ground truth signal decreases from layer to layer.
Theorem 2 indicates that the difference between the teacher and the student is that the teacher discards the useless information and preserves the most useful information in learning the final representation. In contrast, the student might discard some useful information in the learning procedure and end up with a less precise representation.
Theorem 3 (Increase of Mutual Information by Reference Samples).
For random variables $X$, $Y$, and $Z$, where $Y$ and $Z$ follow the same distribution, we have the following inequality:
$$I(X; Y, Z) \ge I(X; Y).$$
Note that the equality holds when $Z$ is independent of $(X, Y)$.
By deeming $X$ as the teacher’s representation, $Z$ as the reference sample’s representation, and $Y$ as the student’s representation, with Theorem 3, we conclude that by including any related reference sample during the compression procedure, we can increase the information absorption. This supports the effectiveness of our proposed RefBERT.
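The inequality can be checked numerically on a toy discrete example. The sketch below assumes binary variables in which the student view `y` and the reference view `z` are independent noisy copies of a teacher signal `x` with the same flip probability, so they share one distribution; the joint distribution and all names are hypothetical and serve only as an illustration.

```python
import math
from collections import defaultdict


def mutual_information(joint):
    """I(A; B) in nats for a joint distribution given as {(a, b): probability}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(p * math.log(p / (pa[a] * pb[b])) for (a, b), p in joint.items() if p > 0)


def noisy_copy(x, v, flip):
    """Probability that a noisy copy of x takes the value v."""
    return 1 - flip if v == x else flip


# Joint distribution p(x, y, z): x is uniform, y and z are noisy copies of x.
joint_xyz = {}
for x in (0, 1):
    for y in (0, 1):
        for z in (0, 1):
            joint_xyz[(x, y, z)] = 0.5 * noisy_copy(x, y, 0.2) * noisy_copy(x, z, 0.2)

# I(X; Y): marginalize the reference view out.
joint_xy = defaultdict(float)
for (x, y, z), p in joint_xyz.items():
    joint_xy[(x, y)] += p

# I(X; Y, Z): treat the pair (y, z) as a single variable.
joint_x_yz = {(x, (y, z)): p for (x, y, z), p in joint_xyz.items()}

mi_student_only = mutual_information(joint_xy)
mi_with_reference = mutual_information(joint_x_yz)
```

In this toy setting `mi_with_reference` is strictly larger than `mi_student_only`, because the reference view carries information about `x` that the student view alone does not.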
|Corpus||Train||Test||Task||Metrics||Domain|
|Similarity and Paraphrase Tasks|
|STS-B||7k||1.4k||sentence similarity||Pearson/Spearman corr.||misc.|
|QQP||364k||391k||paraphrase||acc./F1||social QA questions|
|Inference Tasks|
|MNLI||393k||20k||NLI||matched acc./mismatched acc.||misc.|
In this section, we evaluate RefBERT on the General Language Understanding Evaluation (GLUE) benchmark and show the effectiveness of our RefBERT on utilizing the reference samples.
VI-A Model Settings
The code is written in PyTorch. To provide a fair comparison, we apply the same setting as TinyBERT. That is, the number of layers is 4, the hidden size is 312, the feed-forward/filter size is 1200, and the head number is 12. This yields a total of 14.8M parameters, where the additional parameters come from the projection matrices of the query, key, and value in Eq. (9)-Eq. (11). BERT-base is adopted as the teacher model and consists of 109M parameters by the default setting: the number of layers is 12, the hidden size is 768, the feed-forward/filter size is 3072, and the head number is 12. The same as TinyBERT, we adopt $g(m) = 3m$ as the layer mapping function to learn from every 3 layers of BERT-base. The loss weight at each layer in Eq. (16) is set to 1 due to good performance.
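The layer mapping for a 4-layer student and a 12-layer teacher can be written out explicitly. The sketch below assumes the uniform "every 3 layers" strategy described above; `layer_mapping` is a hypothetical helper name.

```python
def layer_mapping(m, teacher_layers=12, student_layers=4):
    """Map student layer index m to a teacher layer index (uniform strategy)."""
    step = teacher_layers // student_layers  # 3 for a 12-layer teacher and a 4-layer student
    return step * m


# Index 0 is the embedding layer; indices 1..4 are the student's Transformer layers.
mapping = {m: layer_mapping(m) for m in range(0, 5)}
# mapping == {0: 0, 1: 3, 2: 6, 3: 9, 4: 12}
```

The embedding layer maps to the embedding layer and the student's last Transformer layer maps to the teacher's last Transformer layer, as required by the mapping function.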
General distillation. We use the English Wikipedia (2,500M words) as the dataset and follow the same pre-processing as in [8]. Each input sentence in the dataset is paired with a reference sentence retrieved by BM25 in Elasticsearch (https://elasticsearch-py.readthedocs.io/en/v7.10.1/). RefBERT is then trained on 8 16GB V100 GPUs for approximately 100 hours.
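To make the retrieval step concrete, the following is a self-contained sketch of BM25 scoring in plain Python. It does not call Elasticsearch; the values `k1=1.2` and `b=0.75` and the smoothed inverse document frequency are common choices assumed here, and `bm25_scores` and `retrieve_reference` are hypothetical names.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every tokenized document in docs against the tokenized query."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))  # document frequencies
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def retrieve_reference(i, sentences):
    """Index of the highest-scoring sentence other than sentence i itself."""
    docs = [s.lower().split() for s in sentences]
    scores = bm25_scores(docs[i], docs)
    candidates = [j for j in range(len(docs)) if j != i and docs[j] != docs[i]]
    return max(candidates, key=lambda j: scores[j])


sentences = [
    "bert is a language model",
    "tinybert is a small language model",
    "the weather is sunny today",
]
reference_of_first = retrieve_reference(0, sentences)
```

A search engine performs the same ranking at scale with an inverted index; the sketch only shows how a reference sentence is chosen once the scores are available.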
Task-specific Distillation. As in general distillation, we select the reference samples by BM25 in Elasticsearch from the English Wikipedia. We then fine-tune RefBERT on the downstream tasks with the teacher’s representations on the reference samples. The learning rate is tuned from 1e-5 to 1e-4 with a step of 1e-5 to seek the best performance on the development sets of the corresponding tasks in GLUE.
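The learning-rate sweep can be enumerated as below. This is a small sketch of the search grid only; `select_best` and the scoring callback are hypothetical placeholders for a real fine-tuning and evaluation run on a development set.

```python
def learning_rate_grid(start=1e-5, stop=1e-4, step=1e-5):
    """Candidate learning rates from start to stop (inclusive) in increments of step."""
    count = round((stop - start) / step) + 1
    return [round(start + k * step, 10) for k in range(count)]


def select_best(candidates, dev_score):
    """Pick the candidate with the highest development-set score."""
    return max(candidates, key=dev_score)


grid = learning_rate_grid()  # ten candidates: 1e-5, 2e-5, ..., 1e-4
```

In practice `dev_score` would fine-tune the student with the given learning rate and return the task metric on the development set.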
CoLA. The Corpus of Linguistic Acceptability is a task to predict whether an English sentence is grammatically correct and is evaluated by the Matthews correlation coefficient [34].
MNLI. Multi-Genre Natural Language Inference is a large-scale, crowd-sourced entailment classification task and is evaluated by the matched and mismatched accuracy [35]. Given a pair of (premise, hypothesis), the task is to predict whether the hypothesis is an entailment, contradiction, or neutral with respect to the premise.
MRPC. Microsoft Research Paraphrase Corpus is a paraphrase identification dataset evaluated by the F1 score [9]. The task is to identify whether two sentences are paraphrases of each other.
QNLI. Question Natural Language Inference is a version of the Stanford Question Answering Dataset which has been converted to a binary sentence pair classification task and is evaluated by accuracy. Given a pair of (question, context), the task is to determine whether the context contains the answer to the question.
QQP. Quora Question Pairs is a collection of question pairs from the website Quora. The task is to determine whether two questions are semantically equivalent and is evaluated by the F1 score.
RTE. Recognizing Textual Entailment is a binary entailment task with a small training dataset and is evaluated by the accuracy [1].
SST-2. The Stanford Sentiment Treebank is a binary single-sentence classification task, where the goal is to predict the sentiment of movie reviews, and is evaluated by the accuracy [23].
STS-B. The Semantic Textual Similarity Benchmark is a collection of sentence pairs drawn from news headlines and many other domains. The task aims to evaluate how similar two pieces of text are by a score from 1 to 5 and is evaluated by the Pearson correlation coefficient [2].
We follow the standard splitting to conduct the experiments and submit the prediction results to the GLUE benchmark system (https://gluebenchmark.com/) to evaluate the performance on the test sets.
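|Model||Layers||Hidden size||Feed-forward size||Model size (compression ratio)||Inference time (speed-up)|
|BERT-base||12||768||3072||109M (1.0)||190 (1.0)|
|DistilBERT||4||768||3072||52.2M (2.1)||64.1 (3.0)|
|TinyBERT||4||312||1200||14.5M (7.5)||20.1 (9.5)|
|RefBERT||4||312||1200||14.8M (7.4)||20.1 (9.5)|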
Table II reports the comparison results between our RefBERT and the baselines: BERT-base, DistilBERT, TinyBERT with Data Augmentation (TinyBERT-DA), and the vanilla TinyBERT. The results of the baselines are copied from those reported in TinyBERT [16] for reference. The results show that RefBERT attains competitive performance on all tasks in the GLUE benchmark:
RefBERT outperforms DistilBERT on 7 of the 9 tasks and obtains a prominent improvement of 4.4% on average.
RefBERT significantly outperforms the vanilla TinyBERT on all compared tasks and attains a large improvement of 8.1% on average over the four compared tasks: CoLA, MNLI-m, MNLI-mm, and MRPC.
RefBERT can even obtain better performance on CoLA, MRPC, and STS-2 compared to TinyBERT-DA and yields 98.2% performance of TinyBERT-DA.
Overall, the average performance of RefBERT is at least 94% of that of BERT-base. RefBERT attains relatively lower performance on QQP. We conjecture that the tokens in the reference samples for QQP may be too similar to those in the evaluated samples and thus confuse the prediction.
In terms of the model size and inference time reported in Table III, we observe that RefBERT is slightly larger than TinyBERT, by around 300K parameters. However, the difference in inference time is negligible. Compared with the teacher BERT-base, RefBERT is 7.4x smaller and 9.5x faster while achieving competitive performance. The results show that RefBERT is a promising alternative to the recently developed BERT distillation models.
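The compression and speed-up ratios follow directly from the model sizes and inference times listed in Table III. The short sketch below recomputes them; it assumes the first table row is the BERT-base teacher, and the dictionary keys are labels chosen for illustration.

```python
# (parameters in millions, inference time) as listed in Table III
models = {
    "BERT-base": (109.0, 190.0),
    "DistilBERT": (52.2, 64.1),
    "TinyBERT": (14.5, 20.1),
    "RefBERT": (14.8, 20.1),
}

base_params, base_time = models["BERT-base"]
ratios = {
    name: (round(base_params / params, 1), round(base_time / time, 1))
    for name, (params, time) in models.items()
}

# Extra parameters of RefBERT over TinyBERT, in thousands.
extra_params_k = round((models["RefBERT"][0] - models["TinyBERT"][0]) * 1000)
```

For RefBERT this gives 109 / 14.8 ≈ 7.4 for the size ratio and 190 / 20.1 ≈ 9.5 for the speed-up, and the gap to TinyBERT is about 300K parameters.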
In this paper, we propose a new knowledge distillation method, namely RefBERT, to distill BERT by utilizing the teacher’s representations on the reference samples. By including the references’ word embeddings and the teacher’s final-layer representations in the corresponding key and value, while discarding the self-attention effect of irrelevant components in the first layer, we can make RefBERT absorb the teacher’s knowledge on the reference samples and strengthen the information interaction effectively. More importantly, we provide a theoretical justification for selecting the mean-squared-error loss function and show that including reference samples can indeed increase the mutual information of distillation. Our experimental evaluation shows that RefBERT beats the vanilla TinyBERT by over 8.1% and achieves more than 94% of the performance of BERT-base on the GLUE benchmark. Meanwhile, RefBERT is 7.4x smaller and 9.5x faster on inference than BERT-base.
Several research problems are worthy of further exploration. First, we would like to explore more ways to reduce the model size of RefBERT while maintaining the same performance. Second, it would be promising to investigate more effective mechanisms to transfer the knowledge from wider and deeper teachers, e.g., BERT-large, to a smaller student via the reference mechanism. Third, other speed-up methods, e.g., quantization, pruning, and even hardware acceleration, can be attempted to resolve the computation overhead on the large pre-trained language models.
[1] (2009) The fifth PASCAL recognizing textual entailment challenge. In TAC.
[2] (2017) SemEval-2017 task 1: semantic textual similarity - multilingual and cross-lingual focused evaluation. CoRR abs/1708.00055.
[3] (2019) Transformer to CNN: label-scarce distillation for efficient text classification. CoRR abs/1909.03508.
[4] (2019) BAM! born-again multi-task networks for natural language understanding. In ACL, pp. 5931–5937.
[5] (2020) ELECTRA: pre-training text encoders as discriminators rather than generators. In ICLR.
[6] (2020) OpenGPT-2: open language models and implications of generated text. XRDS 27 (1), pp. 26–30.
[7] (2006) Elements of information theory (Wiley series in telecommunications and signal processing). Wiley-Interscience, USA.
[8] (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, J. Burstein, C. Doran, and T. Solorio (Eds.), pp. 4171–4186.
[9] (2005) Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing, IWP@IJCNLP.
[10] (2021) Advances and challenges in conversational recommender systems: a survey. arXiv preprint arXiv:2101.09459.
[11] Deep compression: compressing deep neural network with pruning, trained quantization and huffman coding. In ICLR.
[12] (2015) Distilling the knowledge in a neural network. CoRR abs/1503.02531.
[13] (2019) Normalization helps training of quantized LSTM. In NeurIPS, pp. 7344–7354.
[14] (2017) A deep learning approach for predicting the quality of online health expert question-answering services. J. Biomed. Informatics 71, pp. 241–253.
[15] HiGRU: hierarchical gated recurrent units for utterance-level emotion recognition. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio (Eds.), pp. 397–406.
[16] (2019) TinyBERT: distilling BERT for natural language understanding. CoRR abs/1909.10351.
[17] ALBERT: A lite BERT for self-supervised learning of language representations. CoRR abs/1909.11942.
[18] Have we solved the hard problem? it’s not easy! contextual lexical contrast as a means to probe neural coherence. Proceedings of the AAAI Conference on Artificial Intelligence.
[19] (2017) SWIM: a simple word interaction model for implicit discourse relation recognition. In IJCAI, pp. 4026–4032.
[20] (2020) PiRhDy: learning pitch-, rhythm-, and dynamics-aware embeddings for symbolic music. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 574–582.
[21] (2019) RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692.
[22] (2013) Distributed representations of words and phrases and their compositionality. In NIPS, pp. 3111–3119.
[23] (2014) Glove: global vectors for word representation. In EMNLP, pp. 1532–1543.
[24] (2018) Deep contextualized word representations. In NAACL-HLT, pp. 2227–2237.
[25] (2019) DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR abs/1910.01108.
[26] (2017) Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810.
[27] (2019) Patient knowledge distillation for BERT model compression. In EMNLP-IJCNLP, pp. 4322–4331.
[28] (2019) Distilling task-specific knowledge from BERT into simple neural networks. CoRR abs/1903.12136.
[29] (2019) Well-read students learn better: the impact of student initialization on knowledge distillation. CoRR abs/1908.08962.
[30] (2017) Attention is all you need. In NIPS, pp. 5998–6008.
[31] (2019) GLUE: A multi-task benchmark and analysis platform for natural language understanding. In ICLR.
[32] (2020) InfoBERT: improving robustness of language models from an information theoretic perspective. CoRR abs/2010.02329.
[33] Neural topic model with attention for supervised learning. In AISTATS, S. Chiappa and R. Calandra (Eds.), Proceedings of Machine Learning Research, Vol. 108, pp. 1147–1156.
[34] (2019) Neural network acceptability judgments. Trans. Assoc. Comput. Linguistics 7, pp. 625–641.
[35] (2018) A broad-coverage challenge corpus for sentence understanding through inference. In NAACL-HLT, pp. 1112–1122.
[36] (2021) Emotion dynamics modeling via BERT. In IJCNN.
[37] (2020) Model compression with two-stage multi-teacher knowledge distillation for web question answering system. In WSDM, pp. 690–698.
[38] (2019) XLNet: generalized autoregressive pretraining for language understanding. In NeurIPS, pp. 5754–5764.
[39] (2021) Automatic intent-slot induction for dialogue systems. In WWW.
[40] (2021) Retrieving and reading: a comprehensive survey on open-domain question answering. arXiv preprint arXiv:2101.00774.