I Introduction
Recently, the NLP community has witnessed the effectiveness of utilizing large pretrained language models, such as BERT [8], ELECTRA [5], XLNet [38], and openGPT [6]. These large language models have significantly improved the performance of many NLP downstream tasks, e.g., the GLUE benchmark [31]. However, pretrained language models usually contain hundreds of millions of parameters and suffer from high computation cost and latency in real-world applications [19, 36, 39]. Hence, in order to make large pretrained language models applicable to broader applications [10, 40, 14, 15], it is necessary to reduce the computation overhead to accelerate the fine-tuning and inference of these models.
In the literature, several effective techniques, such as quantization [11, 13] and knowledge distillation [27, 25, 16], have been explored to tackle the computation overhead of large pretrained language models. In this paper, we focus on knowledge distillation [12], which transfers the knowledge embedded in a large teacher model to a small student model, because it has proven effective in several pieces of work, e.g., DistilBERT [25] and TinyBERT [16]. However, these methods usually discard the teacher's representations after obtaining a learned student model, which may yield suboptimal performance.
Different from existing work, we propose RefBERT, which utilizes the teacher's representations on reference samples to compress BERT into a small student model while maintaining the model performance. Here, a reference sample of an input sample refers to the most similar sample (but not the same one) in the dataset, evaluated by a certain similarity criterion, e.g., containing the most common keywords or following the most similar structure. Our RefBERT delivers two modifications to the original Transformer layers: (1) The key and the value in the first Transformer layer: the key in the multi-head attention networks of the first Transformer layer of the student model incorporates both the student's embedding of the input sample and the teacher's embedding of the reference sample, while the corresponding value incorporates both the student's embedding of the input sample and the teacher's last-layer representation of the reference sample. By simply concatenating them, we can effectively absorb their information through the self-attention mechanism. (2) Shifting the normalized attention score: we subtract a constant from the attention score (normalized by the softmax function) to amplify the effect of the normalized attention score. By subtracting a constant from the score, we can place negative attention on non-informative tokens and discard the impact of unrelated parts in the next layer. We then follow the setup of TinyBERT to learn the student's parameters by distilling the embedding layer, hidden states, attention weights, and the prediction layer. More importantly, we present a theoretical analysis to justify the selection of the mean-square-error (MSE) loss function while revealing that, by including any related reference sample during the compression procedure, our RefBERT indeed increases the information absorption.
We highlight the contribution of our work as follows:

We propose a novel knowledge distillation method, namely RefBERT, for the Transformer-based architecture to transfer the linguistic knowledge encoded in the teacher BERT to a small student model through the reference mechanism. Hence, the teacher's information can be utilized by the student model during inference.

We modify the query, key, and value of the multi-head attention networks at the first layer of the student model to incorporate the teacher's embeddings and the representations in the last Transformer layer. The attention score is subtracted by a constant to discard the effect of non-informative tokens. More importantly, a theoretical analysis is provided to justify the selection of the MSE loss and the inclusion of reference samples in information absorption.

We conduct experimental evaluations and show that our RefBERT can beat TinyBERT without data augmentation by over 8.1% and attains more than 94% of the teacher's performance on the GLUE benchmark. Meanwhile, RefBERT is 7.4x smaller and 9.5x faster on inference than BERT-base.
II Related Work
We review two main streams of related work in the following:
Pretraining. In natural language processing, researchers have trained models on huge unlabeled text corpora to learn precise word representations. The models range from word embeddings [22, 23] to contextual word representations [24, 20] and recently-developed powerful pretrained language models, such as BERT [8], ELECTRA [5], XLNet [38], and openGPT [6]. After that, researchers usually apply the fine-tuning mechanism to update the large pretrained representations for downstream tasks with a small number of task-specific parameters [8, 21, 17, 18]. However, the high demand for computational resources has hindered their applicability to a broader range of applications, especially resource-limited ones.
Distillation with unsupervised pretraining. Early attempts leverage unsupervised pretraining representations for behavior distillation [24], e.g., label distillation [3, 27] and task-specific knowledge distillation [28]. The distillation procedure has also been extended from single-task learning to multi-task learning, i.e., distilling knowledge from multiple teacher models to a lightweight student model [4, 37]. Knowledge distillation [12] has been frequently applied to compress a larger teacher model into a compact student model [29]. For example, DistilBERT [25] attempts to compress the intermediate layers of BERT into a smaller student model. TinyBERT [16] further explores more distillation losses while facilitating data augmentation to attain good performance on downstream NLP tasks. InfoBERT [32] tries to improve the model robustness based on an information-theoretic guarantee. However, these methods discard the teacher's direct information after obtaining the student model. The student may thus lack sufficient information and yield suboptimal performance during inference.
III Notations and Problem Statement
We first define some notations. Bold capital letters, e.g., X, indicate matrices. Bold small letters, e.g., x, indicate vectors or sequences. Letters in calligraphic or blackboard bold fonts, e.g., 𝒳, ℝ, indicate sets, where ℝ^d denotes the d-dimensional real space. With a little abuse of notation, we use lowercase letters to denote indices or density functions (e.g., p, q), while uppercase letters denote probability measures (e.g., P, Q) or random variables. ⊤ denotes the transpose operator. [·; ·] denotes the concatenation of vectors. MSE(·, ·) defines the mean squared error loss function.

The task of knowledge distillation from a teacher model is to build a compact student model that fits real-world constraints, e.g., memory and latency budgets, in downstream tasks. In this paper, we fix the model architecture to the Transformer because it is a well-known architecture attaining superior performance in a wide range of NLP tasks [30, 8].
To relieve the burden of the model expression, we define the vanilla Transformer layer for the encoder (TL(·)) as follows:

(1) H^l = TL(H^{l-1}) = LN(A^l + FFN(A^l)), with A^l = LN(H^{l-1} + MH(H^{l-1}, H^{l-1}, H^{l-1})),
where the function FFN(·) consists of two linear transformations with an activation, e.g., GeLU or ReLU [30]. The multi-head attention (MH) is computed by

(2) MH(Q, K, V) = Concat(head_1, ..., head_A) W^O,
(3) head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V),
(4) Attention(Q, K, V) = softmax(S) V,
(5) S = Q K^⊤ / √(d/A),
where A is the number of attention heads and d is the teacher's hidden size.
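To make the computation in Eqs. (2)-(5) concrete, the following NumPy sketch implements multi-head attention with identity input projections (the per-head matrices W_i^Q, W_i^K, W_i^V and the output matrix W^O are omitted for brevity; all dimensions are illustrative assumptions, not the teacher's configuration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(Q, K, V, num_heads):
    """Multi-head attention of Eqs. (2)-(5) with identity projections.

    Q, K, V: (seq_len, d) arrays; d must be divisible by num_heads.
    Each head attends over its own d/num_heads-dimensional slice, and the
    head outputs are concatenated as in Eq. (2).
    """
    d = Q.shape[-1]
    d_head = d // num_heads
    heads = []
    for i in range(num_heads):
        s = slice(i * d_head, (i + 1) * d_head)
        # unnormalized score S = Q K^T / sqrt(d/A), cf. Eq. (5)
        score = Q[:, s] @ K[:, s].T / np.sqrt(d_head)
        heads.append(softmax(score) @ V[:, s])  # Attention(...) of Eq. (4)
    return np.concatenate(heads, axis=-1)

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 8))                 # 6 subwords, hidden size 8
out = multi_head_attention(H, H, H, num_heads=2)
print(out.shape)  # (6, 8)
```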
Hence, for an input x with n subwords, i.e., x = (x_1, ..., x_n), we have H^0 = E^T ∈ ℝ^{n×d}, where the teacher's embedding E^T is computed by

(6) E^T = EMB^T(x),

where EMB^T(·) is the embedding function of the teacher model, which is usually computed as the summation of the token embedding, position embedding, and segment embedding (if there is one). In the compression, we will utilize the teacher's information, E^T and H^N (the output of the last of the teacher's N Transformer layers), to derive the corresponding student model.
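The embedding computation can be illustrated with a minimal sketch; all sizes below are toy assumptions rather than the teacher's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, n_segments, d = 100, 16, 2, 8   # toy sizes
token_emb = rng.normal(size=(vocab_size, d))
position_emb = rng.normal(size=(max_len, d))
segment_emb = rng.normal(size=(n_segments, d))

def embed(token_ids, segment_ids):
    """EMB(x): sum of token, position, and segment embeddings, cf. Eq. (6)."""
    n = len(token_ids)
    return token_emb[token_ids] + position_emb[:n] + segment_emb[segment_ids]

H0 = embed([5, 17, 3], [0, 0, 1])   # a 3-subword input
print(H0.shape)  # (3, 8)
```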
For fast experimentation, we use a single teacher, without making a statement about the best architectural choice, and adopt the pretrained BERT-base [8] as the teacher model. Other models, e.g., XLNet [38], can be easily deployed and tested within our framework. To lessen the variation of the model, we also apply the same number of self-attention heads to both the teacher and the student models.
We are given the following data:

Unlabeled language model data (D_LM): a collection of texts for representation learning. These corpora are usually collected from Wikipedia or other open domains and contain billions of words without strong domain information.

Labeled data (D): a set of training examples {(x_i, y_i)}, where x_i is an input and y_i is the corresponding response. For most NLP downstream tasks, labeled data require a lot of experts' manual effort and are thus restricted in size.
In our work, we apply both D_LM and D to train the student model from the teacher model, as stated in the next section.
IV Methodology
In this section, we present our proposed RefBERT and provide the theoretical justification to support our proposal.
IV-A Model Architecture
Following the setup of TinyBERT [16], we break the distillation into two stages: the general distillation (GD) stage and the task distillation (TD) stage. In the following, we first outline the GD stage. We first construct a new reference dataset D_R to pair a reference example with each sample in D_LM:

(7) D_R = {(x, x_r) | x ∈ D_LM}.

Hence, |D_R| = |D_LM|, and the reference example x_r in D_R is different from x in D_LM, but is the most similar one based on a certain criterion, e.g., containing the most common keywords or following the most similar structure.
Let the hidden size in the student model be d' and the number of Transformer layers be M. Usually, we have d' < d and M < N because we want to learn a smaller student model while preserving the teacher's performance.
As illustrated in Fig. 1, given (x, x_r), we have E_r^T and H_r^N obtained from the teacher model on the reference sample x_r. We modify the first Transformer layer in the student model to absorb the reference information while keeping the remaining layers the same as in the teacher model. More specifically, the first Transformer layer of the student model is computed by
(8) H_1^S = LN(A_1^S + FFN(A_1^S)), with A_1^S = LN(E^S + MH'(Q_1, K_1, V_1)),

where

(9) Q_1 = E^S W^Q,
(10) K_1 = [E^S; E_r^T W^K],
(11) V_1 = [E^S; H_r^N W^V],
(12) S_1 = Q_1 K_1^⊤ / √(d'/A),
(13) Ã_1 = softmax(S_1) − δ,
(14) MH'(Q_1, K_1, V_1) = Ã_1 V_1.

Here, E^S is the student's embedding of x, [·; ·] concatenates along the token dimension, W^Q ∈ ℝ^{d'×d'} and W^K, W^V ∈ ℝ^{d×d'} are projection matrices, δ is a shifting constant, and Eqs. (12)-(14) are applied per head as in Eq. (2).
To guarantee compatibility, the number of attention heads in the student is the same as that in the teacher model in Eq. (2).
Remark 1.
We highlight several differences between the student model and the teacher model in the above computation:

At the student's first Transformer layer, the query, key, and value, Q_1, K_1, and V_1, are mapped to the projected space and are slightly different from the original query, key, and value without mapping in the teacher model; see Eq. (1). This yields a different computation of the self-attention score in Eq. (12) of the student model, in contrast to that in Eq. (5) of the teacher model.

Different from the teacher model, K_1 and V_1 include the information of both x and x_r via concatenation. In K_1, the concatenated component is related to the embedding, E_r^T, while in V_1, the component is related to the output of the last Transformer layer, H_r^N. By this setting and the self-attention mechanism, we can absorb the matched information of the input, E^S and E_r^T, while utilizing the output of H_r^N for the final output.

In Eq. (13), we subtract the normalized attention score by a constant δ, where 0 < δ < 1. This is a key setup borrowed from [33] to make some of the attention weights negative so that the corresponding gradients have different signs during backpropagation. It is noted that the original attention score ranges from 0 to 1, where a higher value implies a higher correlation between the corresponding tokens and a lower value implies uncorrelation. By subtracting a constant from the score, we can place negative attention on non-informative tokens and discard the impact of unrelated parts in the next layer.
Hence, for a pair (x, x_r), the output of the l-th layer of our RefBERT is H_l^S, which is computed by:

(15) H_l^S = TL(H_{l-1}^S), l = 2, ..., M.
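The modified first layer can be sketched as follows. This is a single-head illustration based on the description above; the projection shapes, the helper names, and the value of δ are our assumptions, not the exact implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ref_first_layer_attention(E_s, E_t_ref, H_t_ref, W_q, W_k, W_v, delta=0.1):
    """Single-head sketch of RefBERT's modified first-layer attention.

    E_s:     student embeddings of the input x,            shape (n, d')
    E_t_ref: teacher embeddings of the reference x_r,      shape (m, d)
    H_t_ref: teacher last-layer states of the reference,   shape (m, d)
    W_k, W_v project the teacher-side parts from d down to d' so that
    they can be concatenated with the student's embeddings.
    """
    Q = E_s @ W_q
    # key: student embedding of x concatenated with teacher embedding of x_r
    K = np.concatenate([E_s, E_t_ref @ W_k], axis=0)
    # value: student embedding of x concatenated with teacher last-layer output
    V = np.concatenate([E_s, H_t_ref @ W_v], axis=0)
    score = Q @ K.T / np.sqrt(Q.shape[-1])
    # shifted attention: subtracting delta pushes non-informative tokens negative
    attn = softmax(score) - delta
    return attn @ V

rng = np.random.default_rng(1)
n, m, d_t, d_s = 4, 5, 8, 6
out = ref_first_layer_attention(
    rng.normal(size=(n, d_s)),       # student embeddings
    rng.normal(size=(m, d_t)),       # teacher embeddings of the reference
    rng.normal(size=(m, d_t)),       # teacher last-layer states of the reference
    rng.normal(size=(d_s, d_s)),     # W^Q
    rng.normal(size=(d_t, d_s)),     # W^K
    rng.normal(size=(d_t, d_s)),     # W^V
)
print(out.shape)  # (4, 6)
```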
IV-B Model Training
Following TinyBERT [16], we let g(m) be a mapping function indicating that the m-th layer of the student model learns the information from the g(m)-th layer of the teacher model. Hence, g(0) = 0 for the embedding layer and g(M) = N for the last Transformer layer.
In the training, we follow the setup of TinyBERT and apply the mean-squared-error (MSE) loss to distill the information. That is, the embedding layer and the prediction layer are distilled, and in the Transformer layers, the attention weights and hidden states are distilled. Hence, we derive the same objective function:

(16) L = Σ_{m=0}^{M+1} λ_m L_layer(S_m, T_{g(m)}),

where S_m and T_{g(m)} denote the outputs of the m-th student layer and the g(m)-th teacher layer, respectively, λ_m is the weight of the m-th layer's loss, and

(17) L_layer = L_embd if m = 0; L_hidn + L_attn if 1 ≤ m ≤ M; L_pred if m = M + 1.
In Eq. (17), the distillation losses consist of:

Embedding-layer distillation. MSE is adopted to penalize the difference between the embeddings of the student and the teacher:

(18) L_embd = MSE(E^S W_e, E^T),

where W_e is a weight matrix that linearly maps the embedding of the student into the same space as that of the teacher.

Hidden state distillation. MSE is adopted to penalize the difference between the hidden states of the student and the teacher:

(19) L_hidn = MSE(H_m^S W_h, H^{g(m)}),

where W_h plays the same role as W_e to linearly map the hidden state of the student into the same space as that of the teacher.

Attention distillation. MSE is adopted to penalize the difference between the attention weights of the student and the teacher:

(20) L_attn = (1/A) Σ_{i=1}^{A} MSE(S_i^S, S_i^T),

where A is the number of attention heads, S_i ∈ ℝ^{n×n} is the attention score matrix of the i-th head of the student or the teacher, and n is the length of the input text. It is noted that the attention score is unnormalized and computed as in Eq. (5) or Eq. (12), because a faster convergence was verified in TinyBERT.

Prediction-layer distillation. The soft cross-entropy loss is adopted on the logits of the student and the teacher to mimic the teacher's predictions:

L_pred = CE(z^T / t, z^S / t),

where z^T and z^S are the logits of the teacher and the student, respectively, computed from fully-connected networks, CE denotes the soft cross-entropy (the negative log-likelihood of the student's softened predictions under the teacher's softened distribution), and t is a scalar temperature value, usually set to 1.
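Two of the distillation losses can be sketched in NumPy as follows: the embedding-layer MSE and the prediction-layer soft cross-entropy (shapes and weight matrices are chosen for illustration only and are not the paper's configuration):

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def soft_cross_entropy(teacher_logits, student_logits, t=1.0):
    """Soft cross-entropy between softened teacher and student predictions."""
    p_teacher = np.exp(log_softmax(teacher_logits / t))
    return float(-np.mean((p_teacher * log_softmax(student_logits / t)).sum(axis=-1)))

rng = np.random.default_rng(2)
n, d_s, d_t, n_cls = 4, 6, 8, 3
E_s = rng.normal(size=(n, d_s))          # student embeddings
E_t = rng.normal(size=(n, d_t))          # teacher embeddings
W_e = rng.normal(size=(d_s, d_t))        # maps student space to teacher space

loss_embd = mse(E_s @ W_e, E_t)          # embedding-layer distillation
loss_pred = soft_cross_entropy(rng.normal(size=(n, n_cls)),
                               rng.normal(size=(n, n_cls)))
print(loss_embd > 0.0, loss_pred > 0.0)  # True True
```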
V Theoretical Analysis
We provide some theoretical insight into the distillation procedure by examining the mutual information between the student model and the teacher model. From an information-theoretic perspective, the goal of compressing a large model like BERT is to retain as much information as possible, that is, to maximize the mutual information between the student model and the teacher model.
Let (X, Y) be a pair of random variables with values over the space 𝒳 × 𝒴. The mutual information is defined by

(21) I(X; Y) = H(X) − H(X|Y),

where H(·) denotes the entropy and H(·|·) denotes the conditional entropy.
Justification of the Loss Function. Usually, we can apply X to represent the output from the teacher and Y the output of the student. Since H(X) is fixed, maximizing I(X; Y) is equivalent to minimizing H(X|Y). The conditional entropy quantifies the amount of additional information needed to describe X given Y. Since it is difficult to compute H(X|Y) directly, we derive an upper bound for the conditional entropy during compression.
Theorem 1 (Upper Bound of Conditional Entropy).
Let (X, Y) be a pair of random variables with values over the space 𝒳 × 𝒴. We have

(22) H(X|Y) ≤ (1/2) E[‖X − Y‖²] + c,

where c is a constant.
Proof.
We introduce a variational distribution q(x|y) = N(x; y, I). That is, the conditional probability of X given Y follows a Gaussian distribution with a diagonal covariance matrix. Then, we can derive

(23) H(X|Y) = −E_{p(x,y)}[log p(x|y)] = −E_{p(x,y)}[log q(x|y)] − E_{p(y)}[KL(p(x|y) ‖ q(x|y))] ≤ −E_{p(x,y)}[log q(x|y)]
(24) = (1/2) E[‖X − Y‖²] + (d/2) log(2π) = (1/2) E[‖X − Y‖²] + c.

In the above, the last equality holds because (d/2) log(2π) is a constant, while the inequality holds because the KL divergence is non-negative. ∎
Remark 2.
Theorem 1 justifies that MSE is an effective surrogate to guarantee information absorption.
Justification of the Usage of the Reference Sample. To see the information flow between the layers, we zoom in to examine the mutual information between the student and the teacher at each layer. First, we present the following two theoretical results.
Theorem 2 (Decrease in Mutual Information).
For any mapping function g, which maps a random variable Y to another random variable g(Y) through some learning process, we have the following inequality:

(25) I(X; g(Y)) ≤ I(X; Y).

The result can be derived from the Data Processing Inequality (DPI) [7]. A similar argument is also stated in [26]: in deep learning, the mutual information between the learned representation and the ground-truth signal decreases from layer to layer.
Remark 3.
Theorem 2 indicates that the difference between the teacher and the student is that the teacher discards the useless information and retains the most useful information in learning the final representation. In contrast, the student might discard some useful information in the learning procedure and end up with a less precise representation.
Theorem 3 (Increase of Mutual Information by Reference Samples).
For random variables X, Y, and Z, we have the following inequality:

(26) I(X; Y, Z) ≥ I(X; Y).

Note that I(X; Y, Z) = I(X; Y) when X and Z are conditionally independent given Y.
Remark 4.
By deeming X the teacher's representation, Z the reference sample's representation, and Y the student's representation, with Theorem 3 we conclude that including any related reference sample during the compression procedure increases the information absorption. This verifies the effectiveness of our proposed RefBERT.
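Theorem 3 can be checked numerically: the sketch below builds a small discrete joint distribution, computes I(X; Y) and I(X; (Y, Z)) by direct summation, and confirms the inequality (the supports and random weights are arbitrary illustrative choices):

```python
import itertools
import math
import random

random.seed(0)
# a random joint distribution over (x, y, z) with small supports
support = list(itertools.product(range(2), range(3), range(3)))
weights = [random.random() for _ in support]
total = sum(weights)
p_xyz = {s: w / total for s, w in zip(support, weights)}

def mutual_information(joint):
    """I(A; B) for a joint distribution given as a dict {(a, b): prob}."""
    p_a, p_b = {}, {}
    for (a, b), pr in joint.items():
        p_a[a] = p_a.get(a, 0.0) + pr
        p_b[b] = p_b.get(b, 0.0) + pr
    return sum(pr * math.log(pr / (p_a[a] * p_b[b]))
               for (a, b), pr in joint.items() if pr > 0)

# I(X; Y): marginalize z out; I(X; (Y, Z)): treat the pair (y, z) as one variable
p_xy = {}
for (x, y, z), pr in p_xyz.items():
    p_xy[(x, y)] = p_xy.get((x, y), 0.0) + pr
p_x_yz = {(x, (y, z)): pr for (x, y, z), pr in p_xyz.items()}

i_xy = mutual_information(p_xy)
i_x_yz = mutual_information(p_x_yz)
print(i_x_yz >= i_xy - 1e-12)  # True: adding Z never decreases the information about X
```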
VI Experiments
TABLE I: Summary of the GLUE benchmark.

Corpus | Train | Test | Task | Metrics | Domain
Single-Sentence Tasks
CoLA | 8.5k | 1k | acceptability | Matthews corr. | misc.
SST-2 | 67k | 1.8k | sentiment | acc. | movie reviews
Similarity and Paraphrase Tasks
MRPC | 3.7k | 1.7k | paraphrase | acc./F1 | news
STS-B | 7k | 1.4k | sentence similarity | Pearson/Spearman corr. | misc.
QQP | 364k | 391k | paraphrase | acc./F1 | social QA questions
Inference Tasks
MNLI | 393k | 20k | NLI | matched acc./mismatched acc. | misc.
QNLI | 105k | 5.4k | QA/NLI | acc. | Wikipedia
RTE | 2.5k | 3k | NLI | acc. | news, Wikipedia
WNLI | 634 | 146 | coreference/NLI | acc. | fiction books
In this section, we evaluate RefBERT on the General Language Understanding Evaluation (GLUE) benchmark and show the effectiveness of our RefBERT on utilizing the reference samples.
TABLE II: Comparison results on the GLUE benchmark test sets.

Model | CoLA (Mcc.) | MNLI-m (Acc.) | MNLI-mm (Acc.) | MRPC (F1) | QNLI (Acc.) | QQP (F1) | RTE (Acc.) | SST-2 (Acc.) | STS-B (Pear./Spea.) | Avg.
BERT-base (teacher) | 52.1 | 84.6 | 83.4 | 88.9 | 90.5 | 71.2 | 66.4 | 93.5 | 85.8 | 79.6/77.3
DistilBERT | 32.8 | 78.9 | 78.0 | 82.4 | 85.2 | 68.5 | 54.1 | 91.4 | 76.1 | 71.9/68.9
TinyBERT-DA | 43.3 | 82.5 | 81.8 | 86.4 | 87.7 | 71.3 | 62.9 | 92.6 | 79.9 | 76.5/73.5
TinyBERT | 29.8 | 80.5 | 81.0 | 82.4 | – | – | – | – | – | –/68.4
RefBERT | 47.9 | 80.9 | 80.3 | 86.9 | 87.3 | 61.6 | 61.7 | 92.9 | 75.0 | 75.1/74.0
VI-A Model Settings
The code is written in PyTorch. To provide a fair comparison, we apply the same setting as TinyBERT: the number of layers M = 4, the hidden size d' = 312, the feed-forward/filter size 1200, and the head number 12. This yields a total of 14.8M parameters, where the additional parameters come from the projection matrices W^Q, W^K, and W^V in Eq. (9)-Eq. (11). BERT-base is adopted as the teacher model and consists of 109M parameters by the default setting: the number of layers N = 12, the hidden size d = 768, the feed-forward/filter size 3072, and the head number 12. The same as TinyBERT, we adopt g(m) = 3m as the layer mapping function, so that the student learns from every 3 layers of BERT-base. The loss weight λ_m at each layer in Eq. (16) is set to 1 due to good performance.

General distillation. We use the English Wikipedia (2,500M words) as the dataset D_LM and follow the same preprocessing as in [8]. Each input sentence in D_LM is paired with a reference sentence by BM25 in Elastic Search (https://elasticsearch-py.readthedocs.io/en/v7.10.1/). The model of RefBERT is then trained on 8 16GB V100 GPUs for approximately 100 hours.
Task-specific Distillation. The same as in general distillation, we select the reference samples by BM25 in Elastic Search from the English Wikipedia. We then fine-tune RefBERT on the downstream tasks with the teacher's representations on the reference samples. The learning rate is tuned from 1e-5 to 1e-4 with a step of 1e-5 to seek the best performance on the development sets of the corresponding tasks in GLUE.
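The reference-selection step can be sketched without Elastic Search: the function below is a plain-Python BM25 scorer (the parameter values k1 and b and the toy corpus are illustrative; the actual pipeline uses Elastic Search over Wikipedia):

```python
import math
from collections import Counter

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score each document in `corpus` against `query` with BM25,
    the same similarity criterion used to pick reference sentences."""
    docs = [doc.lower().split() for doc in corpus]
    avgdl = sum(len(d) for d in docs) / len(docs)
    df = Counter(term for d in docs for term in set(d))  # document frequencies
    n = len(docs)
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

corpus = ["the cat sat on the mat",
          "a cat chased the mouse",
          "stock prices rose sharply today"]
query = "the cat on a mat"
scores = bm25_scores(query, corpus)
best = max(range(len(corpus)), key=lambda i: scores[i])
print(best)  # 0: the most lexically similar sentence becomes the reference
```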
VI-B Dataset
The GLUE benchmark [31] is a collection of 9 natural language understanding tasks as listed in Table I:

CoLA. The Corpus of Linguistic Acceptability is a task to predict whether an English sentence is grammatically acceptable, evaluated by the Matthews correlation coefficient [34].

MNLI. MultiGenre Natural Language Inference is a largescale, crowdsourced entailment classification task and evaluated by the matched and mismatched accuracy [35]. Given a pair of (premise, hypothesis), the task is to predict whether the hypothesis is an entailment, contradiction, or neutral with respect to the premise.

MRPC. Microsoft Research Paraphrase Corpus is a paraphrase identification dataset evaluated by F1 score [9]. The task is to identify if two sentences are paraphrases of each other.

QNLI.
Question Natural Language Inference is a version of the Stanford Question Answering Dataset which has been converted to a binary sentence pair classification task and is evaluated by accuracy. Given a pair of (question, context), the task is to determine whether the context contains the answer to the question.

QQP. Quora Question Pairs is a collection of question pairs from the website Quora. The task is to determine whether two questions are semantically equivalent and is evaluated by the F1 score.

RTE. Recognizing Textual Entailment is a binary entailment task with a small training dataset and is evaluated by the accuracy [1].

SST2. The Stanford Sentiment Treebank is a binary singlesentence classification task, where the goal is to predict the sentiment of movie reviews and is evaluated by the accuracy [23].

STSB. The Semantic Textual Similarity Benchmark is a collection of sentence pairs drawn from news headlines and many other domains. The task aims to evaluate how similar two pieces of texts are by a score from 1 to 5 and is evaluated by the Pearson correlation coefficient [2].
We follow the standard splitting to conduct the experiments and submit the prediction results to the GLUE benchmark system (https://gluebenchmark.com/) to evaluate the performance on the test sets.
TABLE III: Model size and inference time.

Model | Layers | Hidden Size | Feed-forward Size | Model Size | Inference Time (s)
BERT-base | 12 | 768 | 3072 | 109M (1.0x) | 190 (1.0x)
DistilBERT | 4 | 768 | 3072 | 52.2M (2.1x) | 64.1 (3.0x)
TinyBERT | 4 | 312 | 1200 | 14.5M (7.5x) | 20.1 (9.5x)
RefBERT | 4 | 312 | 1200 | 14.8M (7.4x) | 20.1 (9.5x)
VI-C Results
Table II reports the comparison results between our RefBERT and the baselines, i.e., BERT-base, DistilBERT, TinyBERT with Data Augmentation (TinyBERT-DA), and the vanilla TinyBERT. The results of the baselines are copied from those reported in TinyBERT [16] for reference. The results show that RefBERT attains competitive performance on all tasks in the GLUE benchmark:

RefBERT attains better results than DistilBERT on 7 out of the 9 tasks and obtains a prominent improvement of 4.4% on average.

RefBERT significantly outperforms the vanilla TinyBERT on all compared tasks and attains a large improvement of 8.1% on average over the four compared tasks, CoLA, MNLI-m, MNLI-mm, and MRPC.

RefBERT even obtains better performance than TinyBERT-DA on CoLA, MRPC, and SST-2, and yields 98.2% of the performance of TinyBERT-DA on average.

Overall, the average performance of RefBERT is at least 94% of that of BERT-base. RefBERT attains relatively lower performance on the task of QQP. We conjecture that the tokens in the reference samples for QQP may be too similar to those in the evaluated samples and thus confuse the prediction.
In terms of the inference time reported in Table III, we observe that RefBERT has a slightly larger model size than TinyBERT, by around 300K parameters; however, the difference in inference time is negligible. Compared with the teacher BERT-base, RefBERT is 7.4x smaller and 9.5x faster while achieving competitive performance. The results show that RefBERT is a promising surrogate among the recently-developed BERT distillation models.
VII Conclusion
In this paper, we propose a new knowledge distillation method, namely RefBERT, to distill BERT by utilizing the teacher's representations on the reference samples. By including the references' word embeddings and the teacher's final-layer representations in the corresponding key and value, while shifting the normalized attention score to suppress the self-attention effect of irrelevant components in the first layer, we make RefBERT absorb the teacher's knowledge on the reference samples and strengthen the information interaction effectively. More importantly, we provide theoretical justification for selecting the mean-square-error loss function and prove that including reference samples indeed increases the mutual information of distillation. Our experimental evaluation shows that RefBERT beats the vanilla TinyBERT by over 8.1% and achieves more than 94% of the performance of BERT-base on the GLUE benchmark. Meanwhile, RefBERT is 7.4x smaller and 9.5x faster on inference than BERT-base.
Several research problems are worthy of further exploration. First, we would like to explore more ways to reduce the model size of RefBERT while maintaining the same performance. Second, it would be promising to investigate more effective mechanisms to transfer the knowledge from wider and deeper teachers, e.g., BERT-large, to a smaller student via the reference mechanism. Third, other speedup methods, e.g., quantization, pruning, and even hardware acceleration, can be attempted to resolve the computation overhead of large pretrained language models.
References
 [1] (2009) The fifth PASCAL recognizing textual entailment challenge. In TAC, External Links: Link Cited by: 6th item.
 [2] (2017) SemEval-2017 task 1: semantic textual similarity - multilingual and cross-lingual focused evaluation. CoRR abs/1708.00055. External Links: Link, 1708.00055 Cited by: 8th item.
 [3] (2019) Transformer to CNN: label-scarce distillation for efficient text classification. CoRR abs/1909.03508. External Links: Link, 1909.03508 Cited by: §II.
 [4] (2019) BAM! Born-again multi-task networks for natural language understanding. In ACL, pp. 5931–5937. Cited by: §II.
 [5] (2020) ELECTRA: pretraining text encoders as discriminators rather than generators. In ICLR, Cited by: §I, §II.
 [6] (2020) OpenGPT2: open language models and implications of generated text. XRDS 27 (1), pp. 26–30. External Links: Link Cited by: §I, §II.
 [7] (2006) Elements of information theory (Wiley series in telecommunications and signal processing). Wiley-Interscience, USA. External Links: ISBN 0471241954 Cited by: §V.
 [8] (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, J. Burstein, C. Doran, and T. Solorio (Eds.), pp. 4171–4186. External Links: Link Cited by: §I, §II, §III, §VI-A.
 [9] (2005) Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing, IWP@IJCNLP, External Links: Link Cited by: 3rd item.
 [10] (2021) Advances and challenges in conversational recommender systems: a survey. arXiv preprint arXiv:2101.09459. Cited by: §I.

 [11] (2016) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR, External Links: Link Cited by: §I.
 [12] (2015) Distilling the knowledge in a neural network. CoRR abs/1503.02531. External Links: Link, 1503.02531 Cited by: §I, §II.
 [13] (2019) Normalization helps training of quantized LSTM. In NeurIPS, pp. 7344–7354. External Links: Link Cited by: §I.
 [14] (2017) A deep learning approach for predicting the quality of online health expert question-answering services. J. Biomed. Informatics 71, pp. 241–253. External Links: Link, Document Cited by: §I.

 [15] (2019) HiGRU: hierarchical gated recurrent units for utterance-level emotion recognition. In NAACL-HLT, J. Burstein, C. Doran, and T. Solorio (Eds.), pp. 397–406. External Links: Link, Document Cited by: §I.
 [16] (2019) TinyBERT: distilling BERT for natural language understanding. CoRR abs/1909.10351. Cited by: §I, §II, §IV-A, §IV-B, §VI-C.

 [17] (2019) ALBERT: a lite BERT for self-supervised learning of language representations. CoRR abs/1909.11942. External Links: Link, 1909.11942 Cited by: §II.
 [18] (2021) Have we solved the hard problem? It's not easy! Contextual lexical contrast as a means to probe neural coherence. In Proceedings of the AAAI Conference on Artificial Intelligence, Cited by: §II.
 [19] (2017) SWIM: a simple word interaction model for implicit discourse relation recognition. In IJCAI, pp. 4026–4032. Cited by: §I.
 [20] (2020) PiRhDy: learning pitch, rhythm, and dynamics-aware embeddings for symbolic music. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 574–582. Cited by: §II.
 [21] (2019) RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692. External Links: Link, 1907.11692 Cited by: §II.
 [22] (2013) Distributed representations of words and phrases and their compositionality. In NIPS, pp. 3111–3119. External Links: Link Cited by: §II.
 [23] (2014) Glove: global vectors for word representation. In EMNLP, pp. 1532–1543. External Links: Link, Document Cited by: §II, 7th item.
 [24] (2018) Deep contextualized word representations. In NAACLHLT, pp. 2227–2237. Cited by: §II, §II.
 [25] (2019) DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR abs/1910.01108. Cited by: §I, §II.
 [26] (2017) Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810. Cited by: §V.
 [27] (2019) Patient knowledge distillation for BERT model compression. In EMNLP-IJCNLP, pp. 4322–4331. External Links: Link, Document Cited by: §I, §II.
 [28] (2019) Distilling task-specific knowledge from BERT into simple neural networks. CoRR abs/1903.12136. External Links: Link, 1903.12136 Cited by: §II.
 [29] (2019) Well-read students learn better: the impact of student initialization on knowledge distillation. CoRR abs/1908.08962. Cited by: §II.
 [30] (2017) Attention is all you need. In NIPS, pp. 5998–6008. External Links: Link Cited by: §III, §III.
 [31] (2019) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In ICLR, Cited by: §I, §VI-B.
 [32] (2020) InfoBERT: improving robustness of language models from an information theoretic perspective. CoRR abs/2010.02329. External Links: Link, 2010.02329 Cited by: §II.

 [33] (2020) Neural topic model with attention for supervised learning. In AISTATS, S. Chiappa and R. Calandra (Eds.), Proceedings of Machine Learning Research, Vol. 108, pp. 1147–1156. External Links: Link Cited by: 3rd item.
 [34] (2019) Neural network acceptability judgments. Trans. Assoc. Comput. Linguistics 7, pp. 625–641. External Links: Link Cited by: 1st item.
 [35] (2018) A broad-coverage challenge corpus for sentence understanding through inference. In NAACL-HLT, pp. 1112–1122. External Links: Link, Document Cited by: 2nd item.
 [36] (2021) Emotion dynamics modeling via BERT. In IJCNN, Cited by: §I.
 [37] (2020) Model compression with two-stage multi-teacher knowledge distillation for web question answering system. In WSDM, pp. 690–698. External Links: Link, Document Cited by: §II.
 [38] (2019) XLNet: generalized autoregressive pretraining for language understanding. In NeurIPS, pp. 5754–5764. Cited by: §I, §II, §III.
 [39] (2021) Automatic intent-slot induction for dialogue systems. In WWW, Cited by: §I.
 [40] (2021) Retrieving and reading: a comprehensive survey on opendomain question answering. arXiv preprint arXiv:2101.00774. Cited by: §I.