Transformer-based models have significantly advanced the field of natural language processing by establishing new state-of-the-art results on a wide variety of tasks. Specifically, BERT, GPT, GPT-2, XLM, XLNet, and RoBERTa lead tasks such as text classification, sentiment analysis, semantic role labeling, and question answering, among others. However, most of these models have hundreds of millions of parameters, which significantly slows down both training and inference. In addition, the large number of parameters demands a lot of memory, making such models hard to adopt in production environments where computational resources are strictly limited.
Due to these limitations, many approaches have been proposed to reduce the size of the models while still providing similar performance. One of the most effective techniques is knowledge distillation (KD) in a teacher-student setting, where a cumbersome already-optimized model (i.e., the teacher) produces output probabilities that are used to train a simplified model (i.e., the student). Unlike training with one-hot labels where the classes are mutually exclusive, using a probability distribution provides more information about the similarities of the samples, which is the key part of the teacher-student distillation.
Even though the student requires fewer parameters while still performing similarly to the teacher, recent work shows the difficulty of distilling information from a huge model. Mirzadeh et al. (2019) state that, when the gap between the teacher and the student is large (e.g., shallow vs. deep neural networks), the student struggles to approximate the teacher. They propose to use an intermediate teaching assistant (TA) model to distill the information from the teacher and then use the TA model to distill information towards the student. However, we argue that the abstraction captured by a large teacher is only exposed through the output probabilities, which makes the internal knowledge from the teacher (or the TA model) hard for the student to infer. This can potentially lead the student to very different internal representations, undermining the generalization capabilities initially intended to be transferred from the teacher.
In this paper, we propose to apply KD to internal representations. Our approach allows the student to internally behave as the teacher by effectively transferring its linguistic properties. We perform the distillation at different internal points across the teacher, which allows the student to learn and compress the abstraction in the hidden layers of the large model systematically. By including internal representations, we show that our student outperforms its homologous models trained on ground-truth labels, soft-labels, or both.
Knowledge distillation has become one of the most effective and simple techniques to compress huge models into simpler and faster models. The versatility of this framework has allowed the extension of KD to scenarios where a set of expert models in different tasks distill their knowledge into a unified multi-task learning network, as well as the opposite scenario where an ensemble of multi-task models is distilled into a task-specific network [13, 12]. We extend the knowledge distillation framework with a different formulation by applying the same principle to internal representations.
Using internal representations to guide the training of a student model was initially explored by Romero et al. (2014). They proposed FitNet, a convolutional student network that is thinner and deeper than the teacher while using significantly fewer parameters. In their work, they establish a middle point in both the teacher and the student models to compare internal representations. Since the dimensionality of the teacher and the student differs, they use a convolutional regressor model to map such vectors into the same space, which adds a significant number of parameters to learn. Additionally, they mainly focus on providing a deeper student network than the teacher, exploiting the particular benefits of depth in convolutional networks. Our work differs from theirs in several aspects: 1) using a single point-wise loss on the middle layers has mainly a regularization effect, but it does not guarantee the transfer of the internal knowledge from the teacher; 2) our distillation method is applied across all the student layers, which effectively compresses groups of layers from the teacher into a single layer of the student; 3) we use the internal representations as-is instead of relying on additional parameters to perform the distillation; 4) we do not focus on models deeper than the teacher, as this can slow down inference, and it is not necessarily an advantage for transformer-based models.
Curriculum learning (CL) is another line of research that focuses on teaching complex tasks by building upon simple concepts. Although the goal is similar to ours, CL is conducted in stages, focusing on simple tasks first and progressively moving to more complicated tasks. However, this method requires annotations for the preliminary tasks, and they have to be carefully picked so that the order and relation among the build-up tasks are helpful for the model. Unlike CL, we focus on teaching the internal representations of an optimized complex model, which are assumed to have the preliminary build-up knowledge for the task of interest.
Other model compression techniques include quantization [10, 8, 4] and weight pruning. The first one approximates a large model with a smaller one by reducing the precision of each of the parameters. The second one removes weights in the network that do not have a substantial impact on model performance. These techniques are complementary to the method we propose in this paper, and combining them can potentially lead to a more effective overall compression approach.
In this section, we detail the process of distilling knowledge from internal representations. First, we describe the standard KD framework, which is an essential part of our method. Then, we formalize the objective functions to distill the internal knowledge of transformer-based models. Lastly, we propose various algorithms to conduct the internal distillation process.
Hinton et al. (2015) proposed knowledge distillation (KD) as a framework to compress a large model into a simplified model that achieves similar results. The framework uses a teacher-student setting where the student learns from both the ground-truth labels (if available) and the soft-labels provided by the teacher. The probability mass associated with each class in the soft-labels allows the student to learn more information about the label similarities for a given sample. The formulation of KD considering both soft and hard labels is given as follows:

$$\mathcal{L}_{KD} = -\frac{1}{N}\sum_{i=1}^{N} p(y_i \mid x_i, \theta_T)\,\log \hat{y}_i \;-\; \lambda\,\frac{1}{N}\sum_{i=1}^{N} y_i \log \hat{y}_i \qquad (1)$$

where $\theta_T$ represents the parameters of the teacher, and $p(y_i \mid x_i, \theta_T)$ are its soft-labels; $\hat{y}_i$ is the student prediction given by $p(y_i \mid x_i, \theta_S)$, where $\theta_S$ denotes its parameters, and $\lambda$ is a small scalar that weights down the hard-label loss. Since the soft-labels often present high entropy, the gradient tends to be smaller than the one from the hard-labels. Thus, $\lambda$ balances the terms by reducing the impact of the hard loss.
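As a rough illustration of the KD objective described above, the following sketch computes a soft-label cross-entropy term plus a weighted hard-label term for a small batch. It is a minimal pure-Python sketch under stated assumptions, not the authors' implementation: the function names `cross_entropy` and `kd_loss`, the `eps` smoothing constant, and the default `lam=0.1` are all hypothetical choices made for this example.

```python
import math


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy -sum_c target[c] * log(predicted[c]) for one sample.

    eps is a small constant (an assumption of this sketch) that avoids log(0).
    """
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


def kd_loss(teacher_probs, student_probs, hard_labels, lam=0.1):
    """Batch-averaged soft-label loss plus lam times the hard-label loss.

    teacher_probs: teacher soft-labels, one probability list per sample.
    student_probs: student predictions, one probability list per sample.
    hard_labels:   one-hot ground-truth vectors, one per sample.
    lam:           weight of the hard-label term (0.1 is illustrative only).
    """
    n = len(student_probs)
    soft = sum(cross_entropy(t, s) for t, s in zip(teacher_probs, student_probs)) / n
    hard = sum(cross_entropy(y, s) for y, s in zip(hard_labels, student_probs)) / n
    return soft + lam * hard


# Toy batch with two classes and a single sample.
teacher = [[0.5, 0.5]]
student = [[0.5, 0.5]]
labels = [[1.0, 0.0]]
soft_only = kd_loss(teacher, student, labels, lam=0.0)   # ignores the one-hot loss
soft_and_hard = kd_loss(teacher, student, labels, lam=1.0)
```

Setting `lam=0.0` removes the hard-label term entirely, which corresponds to training on soft-labels only.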
Matching Internal Representations
In order to make the student model behave as the teacher model, the student is optimized with the soft-labels from the teacher's output. In addition, the student also acquires the abstraction hidden in the teacher by matching its internal representations. That is, we want to teach the student how to behave internally by compressing the knowledge of multiple layers from the teacher into a single layer of the student. Figure 1 shows a teacher with twice the number of layers of the student, where the colored boxes denote the layers where the student is taught the internal representation of the teacher. In this case, the student compresses two layers into one while preserving the linguistic behavior across the teacher layers.
We study the internal KD of transformer-based models, specifically the case of BERT and simplified versions of it (i.e., fewer transformer layers). We define the internal KD by using two terms in the loss function. Given a pair of transformer layers to match (see Figure 1), we calculate (1) the Kullback-Leibler (KL) divergence loss across the self-attention probabilities of all the transformer heads, and (2) the cosine similarity loss between the [CLS] activation vectors for the given layers. We use the KL divergence because we are interested in a loss function that considers the probability distribution as a whole, and not point-wise errors.
KL-divergence loss. Consider $A$ as the self-attention matrix that contains row-wise probability distributions per token in a sequence, given by $A = \mathrm{softmax}\left(QK^{\top} / \sqrt{d_k}\right)$. For a given head in a transformer layer, we use the KL-divergence loss as follows:

$$\mathcal{L}_{kl} = \frac{1}{L}\sum_{i=1}^{L} \mathrm{KL}\left(A^{(T)}_i \,\|\, A^{(S)}_i\right) = \frac{1}{L}\sum_{i=1}^{L}\sum_{j=1}^{L} A^{(T)}_{ij} \log\frac{A^{(T)}_{ij}}{A^{(S)}_{ij}}$$

where $L$ is the length of a sequence, and $A^{(T)}_i$ and $A^{(S)}_i$ describe the $i$-th row of the self-attention matrix for the teacher and student, respectively. The motivation for applying this loss function to the self-attention matrices comes from recent research that documents the linguistic patterns captured by the attention probabilities of BERT. Forcing the divergence between the self-attention probability distributions to be as small as possible preserves the linguistic behavior in the student.
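The attention-matching term can be sketched for a single head as the row-wise KL divergence between teacher and student attention matrices, averaged over the rows. This is a minimal pure-Python sketch, assuming each matrix is given as a list of rows that already sum to one; the function name `attention_kl` and the `eps` constant are assumptions of this example, and the direction (teacher relative to student) is chosen to mirror the description above.

```python
import math


def attention_kl(teacher_attn, student_attn, eps=1e-12):
    """Mean over rows of KL(teacher_row || student_row) for one attention head.

    teacher_attn, student_attn: square matrices as lists of rows, where each
    row is a probability distribution over the tokens of the sequence.
    eps guards against division by zero and log(0) (an assumption here).
    """
    total = 0.0
    for t_row, s_row in zip(teacher_attn, student_attn):
        # Entries with zero teacher probability contribute nothing to the sum.
        total += sum(
            t * math.log((t + eps) / (s + eps))
            for t, s in zip(t_row, s_row)
            if t > 0
        )
    return total / len(teacher_attn)


# Toy example for a two-token sequence.
teacher_attn = [[0.5, 0.5], [1.0, 0.0]]
student_attn = [[0.5, 0.5], [0.5, 0.5]]
loss_same = attention_kl(teacher_attn, teacher_attn)
loss_diff = attention_kl(teacher_attn, student_attn)
```

In a full model the per-head values would presumably be aggregated across all heads of the matched layer; how that aggregation is weighted is not specified here.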
Cosine similarity loss. For the second term of our internal distillation loss, we use a cosine similarity loss (other vector-similarity losses could be used as well without impacting generality) as follows:

$$\mathcal{L}_{cos} = 1 - \frac{h_T^{\top} h_S}{\|h_T\|\,\|h_S\|}$$

where $h_T$ and $h_S$ are the hidden vector representations for the [CLS] token for the teacher and student, respectively. We include this term in our internal KD formulation to encourage a similar behavior in the activations going through the network. That is, while the KL-divergence focuses on the self-attention matrix, it is the weighted hidden vectors that finally pass to the upper layers, not the probabilities. Even if we force the self-attention probabilities to be similar, there is no guarantee that the final activation passed to the upper layers is similar. Thus, using this extra term, we can regularize the context representation of the sample to be similar to the one from the teacher. We only use the context vector instead of all the hidden token vectors to avoid over-regularizing the model.
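The cosine term can be sketched as one minus the cosine similarity between the teacher and student [CLS] vectors of a matched layer pair. This is a minimal pure-Python sketch; the function name `cosine_loss` is hypothetical, and the example assumes both vectors have the same dimensionality and are non-zero.

```python
import math


def cosine_loss(h_teacher, h_student):
    """Return 1 - cosine similarity between two equally sized vectors.

    The loss is 0 when the vectors point in the same direction, 1 when they
    are orthogonal, and 2 when they point in opposite directions.
    """
    dot = sum(a * b for a, b in zip(h_teacher, h_student))
    norm_t = math.sqrt(sum(a * a for a in h_teacher))
    norm_s = math.sqrt(sum(b * b for b in h_student))
    return 1.0 - dot / (norm_t * norm_s)


aligned = cosine_loss([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_loss([1.0, 0.0], [0.0, 1.0])
opposite = cosine_loss([1.0, 0.0], [-1.0, 0.0])
```

Because the loss depends only on direction, rescaling the student vector leaves it unchanged, as the `aligned` case shows.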
How to Distill the Internal Knowledge?
Different layers across the teacher capture different linguistic concepts. Recent research shows that BERT builds linguistic properties that become more complex as we move from the bottom to the top of the network. Since the model builds upon bottom representations, in addition to distilling all the internal layers simultaneously, we also consider distilling knowledge progressively by matching internal representations in a bottom-up fashion. More specifically, we consider the following scenarios:
Progressive internal distillation (PID). We distill the knowledge from lower layers first (close to the input) and progressively move to upper layers until the model focuses only on the classification distillation. Only one layer is optimized at a time. In Figure 1, the loss moves through the matched layers one at a time, from the lowest matched layer up to the classification layer.
Stacked internal distillation (SID). We distill the knowledge from lower layers first, but instead of moving from one layer to another exclusively, we keep the losses produced by previous layers, stacking them as we move to the top. Once at the top, we only perform classification (see Algorithm 1). In Figure 1, the loss accumulates the matched layers from the bottom up before switching to the classification layer.
For the last two scenarios, to move to upper layers, the student either reaches a limited number of epochs per layer or a cosine loss threshold, whichever happens first (see line 24 in Algorithm 1). Additionally, these two scenarios can be combined with the classification loss at all times, not only once the model reaches the top layer.
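The difference between the progressive and stacked schedules can be sketched as a function that returns which student layers contribute an internal loss at a given stage, together with the rule for advancing to the next stage. This is a simplified sketch and not Algorithm 1 itself: the names `active_layers` and `should_advance`, the 1-based stage numbering, and the example threshold values are all assumptions made for illustration.

```python
def active_layers(stage, num_layers, mode):
    """Student layers whose internal loss is active at a given stage.

    stage:      1-based index of the layer currently being distilled;
                any stage above num_layers means classification only.
    num_layers: number of transformer layers in the student.
    mode:       "PID" (one layer at a time) or "SID" (stack lower layers).
    """
    if stage > num_layers:
        return []  # only the classification (soft-label) loss remains
    if mode == "PID":
        return [stage]
    if mode == "SID":
        return list(range(1, stage + 1))
    raise ValueError("mode must be 'PID' or 'SID'")


def should_advance(epochs_on_stage, cosine_loss_value, max_epochs, threshold):
    """Move to the next stage when either stopping condition is met first."""
    return epochs_on_stage >= max_epochs or cosine_loss_value <= threshold


pid_stage_2 = active_layers(2, 6, "PID")
sid_stage_3 = active_layers(3, 6, "SID")
top_stage = active_layers(7, 6, "SID")
```

With a six-layer student, PID at stage 2 optimizes only layer 2, while SID at stage 3 keeps the losses of layers 1 to 3 active at once.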
|Experiment||Description||CoLA [8.5k] (MCC)||QQP [364k] (Accuracy / F1)||MRPC [3.7k] (Accuracy / F1)||RTE [2.5k] (Accuracy)|
|Fine-tuning BERTbase and BERT6 without KD|
|Exp1.0||BERTbase||60.16||91.44 / 91.45||83.09 / 82.96||67.51|
|Exp1.1||BERT6||44.56||90.58 / 90.62||76.23 / 73.72||59.93|
|Fine-tuning BERT6 with different KD techniques using BERTbase (Exp1.0) as teacher|
|Exp2.0||BERT6 soft||41.72||90.61 / 90.65||77.21 / 75.74||62.46|
|Exp3.0||BERT6 soft + kl||43.70||91.32 / 91.32||83.58 / 82.46||67.15|
|Exp3.1||BERT6 soft + cos||42.64||91.08 / 91.10||79.66 / 78.35||57.04|
|Exp3.2||BERT6 soft + kl + cos||42.07||91.37 / 91.38||83.09 / 81.39||66.43|
|Exp3.3||BERT6 [PID] kl + cos → soft||45.54||91.22 / 91.24||81.62 / 80.12||64.98|
|Exp3.4||BERT6 [SID] kl + cos → soft||46.09||91.25 / 91.27||82.35 / 81.39||64.62|
|Exp3.5||BERT6 [SID] kl + cos + soft||43.93||91.21 / 91.22||81.37 / 79.16||66.43|
|Exp3.6||BERT6 [SID] kl + cos + soft + hard||42.55||91.20 / 91.21||70.10 / 69.68||67.51|
Experiments and Results
We conduct experiments on four datasets of the GLUE benchmark, which we describe briefly:
CoLA. The Corpus of Linguistic Acceptability is part of the single-sentence tasks, and it requires determining whether an English text is grammatically correct. It uses the Matthews Correlation Coefficient (MCC) to measure the performance.
QQP. The Quora Question Pairs dataset (data.quora.com/First-Quora-Dataset-Release-Question-Pairs) is a semantic similarity dataset, where the task is to determine whether two questions are semantically equivalent or not. It uses accuracy and F1 as metrics.
MRPC. The Microsoft Research Paraphrase Corpus  contains pairs of sentences whose annotations describe whether the sentences are semantically equivalent or not. Similar to QQP, it uses accuracy and F1 as metrics.
RTE. The Recognizing Textual Entailment dataset has a collection of sentence pairs whose annotations describe entailment or not entailment between the sentences (formerly annotated with the labels entailment, contradiction, or neutral). It uses accuracy as a metric.
For the MRPC and QQP datasets, the metrics are accuracy and F1, but we optimize the models on F1 only.
We experiment with BERTbase and simplified versions of it. In the case of BERT with 6 transformer layers, we initialize the parameters using different layers of the original BERTbase model, which has 12 transformer layers. Since our goal is to compress the behavior of a subset of layers into one layer, we initialize a layer of the simplified BERT model with the upper layer of the subset. For example, Figure 1 shows the compression of groups of two layers into one layer; hence, the first layer of the student model is initialized with the parameters of the second layer of the BERTbase model. Note that the initialization does not take the parameters of the fine-tuned teacher. Instead, we use the parameters of the general-purpose BERTbase model.
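The initialization rule above can be sketched as a mapping from each student layer to the teacher layer whose pre-trained weights it copies. This is a minimal sketch, assuming the teacher layers are split into equally sized contiguous groups and layers are numbered from 1; the function name `init_layer_map` is hypothetical.

```python
def init_layer_map(teacher_layers=12, student_layers=6):
    """Map each student layer to the upper teacher layer of its group.

    Assumes teacher_layers is a multiple of student_layers so that the
    teacher can be split into equally sized contiguous groups.
    """
    if teacher_layers % student_layers != 0:
        raise ValueError("teacher_layers must be a multiple of student_layers")
    group_size = teacher_layers // student_layers
    return {s: s * group_size for s in range(1, student_layers + 1)}


bert6_map = init_layer_map(12, 6)
bert3_map = init_layer_map(12, 3)
```

For the 6-layer student, the first student layer maps to the second teacher layer, matching the example given for Figure 1.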
Table 1 shows the results on the development set across four datasets. We define the experiments as follows:
Exp1.0: BERTbase. This is the standard BERTbase model that is fine-tuned on task-specific data without any KD technique. Once optimized, we use this model as a teacher for the KD experiments.
Exp1.1: BERT6. This is a simplified version of BERTbase, where we use 6 transformer layers instead of 12. The layer selection for initialization is described in the previous section. We do not use any KD for this experiment. The KD experiments described below use this architecture as the student model.
Exp2.0: BERT6 soft. The model is trained with soft-labels produced by the fine-tuned BERTbase teacher from experiment 1.0. This scenario corresponds to Equation 1 with $\lambda = 0$ to ignore the one-hot loss.
Exp3.2: BERT6 soft + kl + cos. The model uses all the losses from all layers every epoch. This experiment combines experiments 3.0 and 3.1.
Exp3.3: BERT6 [PID] kl + cos → soft. The model only uses progressive internal distillation until it reaches the classification layer. Once there, only soft-labels are used.
Exp3.4: BERT6 [SID] kl + cos → soft. The model uses stacked internal distillation until it reaches the classification layer. Once there, only soft-labels are used.
Exp3.5: BERT6 [SID] kl + cos + soft. The model uses stacked internal distillation and soft-labels distillation all the time during training.
Exp3.6: BERT6 [SID] kl + cos + soft + hard. Same as Exp3.5, but it includes the hard-labels in Equation 1 with a non-zero $\lambda$.
We optimize our models using Adam with an initial learning rate of 2e-5 and a learning rate scheduler as described by Devlin et al. (2018). We fine-tune BERTbase for 10 epochs, and the simplified BERT models for 50 epochs, both with a batch size of 32 samples and a maximum sequence length of 64 tokens. We evaluate the statistical significance of our models using t-tests as described by Dror et al. (2018). All the internal KD results have shown statistical significance with a p-value less than 1e-3 with respect to the standard KD method across the datasets.
Development and Evaluation Results
|Experiment||CoLA (MCC)||QQP (Acc. / F1)||MRPC (Acc. / F1)||RTE (Acc.)|
|Exp1.0||51.4||71.3 / 89.2||84.9 / 79.9||66.4|
|Exp2.0||38.3||69.1 / 88.0||81.6 / 73.9||59.7|
|Exp3.X||41.4||70.9 / 89.1||83.8 / 77.1||62.2|
As shown in Table 1, we perform extensive experiments for BERT6 as a student, where we evaluate different training techniques with or without knowledge distillation. In general, the first thing to notice is that the distillation techniques outperform BERT6 trained without distillation (Exp1.1). While this is not always the case for standard distillation (Exp1.1 vs. Exp2.0 for CoLA), the internal distillation method proposed in this work consistently outperforms both Exp1.1 and Exp2.0 across all datasets. Nevertheless, the gap between the results substantially depends on the size of the data. Intuitively, this is expected behavior since the more data we provide to the teacher, the more knowledge is exposed, and hence, the student reaches a more accurate approximation of the teacher.
Additionally, our internal distillation results are consistently better than the standard soft-label distillation in the test set, as described in Table 2.
This section provides more insights into our algorithm based on parameter reduction, data size impact, model convergence, self-attention behavior, and error analysis.
Performance vs. Parameters
We analyze the parameter reduction capabilities of our method. Figure 2 shows that BERT6 can easily achieve results similar to those of the original BERTbase model with 12 transformer layers. Note that BERTbase has around 109.4M parameters, which can be broken down into 23.8M parameters related to embeddings and around 85.6M parameters related to transformer layers. The BERT6 student, however, has 43.1M parameters in the transformer layers, which means that the parameter reduction is about 50%, while still performing very similarly to the teacher (91.38 F1 vs. 91.45 F1 for QQP, see Table 1). Also, note that the 0.73-point F1 difference between the student trained only on soft-labels and the student trained with our method is statistically significant.
Moreover, if we keep reducing the number of layers, the performance decays for both student models (see Figure 2). However, the internal distillation method is more resilient and retains a higher performance. Eventually, with one transformer layer to distill internally, the compression rate is too high for the model to gain an additional boost, as we see when comparing BERT1 students trained with the standard and internal distillation methods.
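The figures quoted in this subsection can be checked with a few lines of arithmetic. The sketch below only uses numbers stated in the text and in Table 1 (parameter counts in millions, QQP F1 scores for Exp1.0, Exp2.0, and Exp3.2); the variable names are chosen for this example.

```python
# Parameter counts in millions, as quoted in the text.
embedding_params = 23.8
teacher_transformer_params = 85.6
student_transformer_params = 43.1

# Total size of BERTbase: embeddings plus transformer layers.
teacher_total = embedding_params + teacher_transformer_params

# Fraction of transformer-layer parameters removed in the BERT6 student.
transformer_reduction = 1 - student_transformer_params / teacher_transformer_params

# QQP F1 scores from Table 1.
f1_teacher = 91.45       # Exp1.0, BERTbase
f1_standard_kd = 90.65   # Exp2.0, BERT6 soft
f1_internal_kd = 91.38   # Exp3.2, BERT6 soft + kl + cos

gap_to_teacher = f1_teacher - f1_internal_kd
gain_over_standard_kd = f1_internal_kd - f1_standard_kd

print(f"teacher total: {teacher_total:.1f}M")
print(f"transformer reduction: {transformer_reduction:.1%}")
print(f"gap to teacher: {gap_to_teacher:.2f} F1")
print(f"gain over standard KD: {gain_over_standard_kd:.2f} F1")
```

The reduction applies to the transformer layers only; the embedding parameters are not part of the 50% figure.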
The Impact of Data Size
We also evaluate the impact of the data size. For this analysis, we fix the student architecture to BERT6, and we only modify the size of the training data. We compare the standard and the internal distillation techniques for the QQP dataset, as shown in Figure 3. The internal distillation consistently outperforms the soft-label KD method. The gap between the two methods is small when the data size is large, but it tends to increase in favor of the internal KD method when the data size decreases.
We analyze the convergence behavior during training by comparing the performance of the internal distillation algorithms across epochs. We conduct the experiments on the QQP dataset, as shown in Figure 4. We fix the student architecture to BERT6 and exclusively experiment with different internal KD algorithms. The figure shows three experiments: progressive internal distillation (Exp3.3), stacked internal distillation (Exp3.4), and stacked internal distillation using soft-labels all the time (Exp3.5). Importantly, note that Exp3.3 and Exp3.4 do not update the classification layer until around epoch 40, when all the transformer layers have been optimized. Nevertheless, the internal distillation by itself eventually allows the students to reach higher performance across epochs. In fact, Exp3.3 reaches its highest value when the 6th transformer layer is being optimized while the classification layer remains as it was initialized (see epoch 38 in Figure 4). This serves as strong evidence that the internal knowledge of the model can be taught and compressed without even considering the classification layer.
Inspecting the Attention Behavior
We inspect the internal representations learned by the students from standard and internal KD and compare their behaviors against the ones from the teacher. The goal of this experiment is to get a sense of how much the student can compress from the teacher, and how different such representations are from those of a student trained on soft-labels in a standard KD setting. For this experiment, we use the QQP dataset and BERT6 as a student. The internally-distilled student corresponds to experiment 3.2, and the soft-label student comes from experiment 2.0 (see Table 1). Figure 5 shows the compression effectiveness of the internally distilled student with respect to the teacher. Even though the model is skipping one layer for every two layers of the teacher, the student is still able to replicate the behavior taught by the teacher. While the internal representations from the student with standard KD mainly serve a general purpose (i.e., attending to the separation token while spotting connections with the word college), the representations are not the ones intended to be transferred from the teacher. This means that the original goal of compressing a model does not hold entirely, since its internal behavior is quite different from that of the teacher (see Figure 5 for the KL divergence on each student).
|Method||Teacher Right||Teacher Wrong|
|Standard KD (Exp2.0)||35,401 ✓||1,566 ✗||1,232 ✓||2,231 ✗|
|Internal KD (Exp3.2)||36,191 ✓||776 ✗||750 ✓||2,713 ✗|
|No.||QQP Development Samples||Class||Teacher||Std KD||Int KD|
|1||Q1: if donald trump loses the general election, will he attempt to seize power by force claiming the election was fraudulent?||1||1 (0.9999)||1 (0.9999)||0 (0.4221)|
|Q2: how will donald trump react if and when he loses the election?|
|2||Q1: can depression lead to insanity?||0||0 (0.0429)||0 (1.2e-4)||1 (0.9987)|
|Q2: does stress or depression lead to mental illness?|
|3||Q1: can i make money by uploading videos on youtube (if i have subscribers)?||1||1 (0.9998)||0 (0.0017)||1 (0.8868)|
|Q2: how do youtube channels make money?|
|4||Q1: what are narendra modi’s educational qualifications?||0||0 (0.0203)||1 (0.9999)||0 (0.2158)|
|Q2: why is pmo hiding narendra modi’s educational qualifications?|
In our internal KD method, the generalization capabilities of the teacher are replicated in the student model. This also implies that the student will potentially make the mistakes of the teacher. In fact, when we compare a student only trained on soft-labels (Exp2.0) against a student trained with our method (Exp3.2), we can see in Table 3 that the numbers of the latter align better with the teacher numbers for both wrong and right predictions. For instance, when the teacher is right (36,967 samples), our method is right on 97.9% of the same samples (36,191), whereas the standard distillation reaches a rate of 95.8% (35,401), with more than twice the number of mistakes of our method (1,566 vs. 776). On the other hand, when the teacher is wrong (3,463 samples), the student in our method makes more mistakes and provides fewer correct predictions than the student from standard KD. Nevertheless, the overall score of the student in our method significantly exceeds the score of the student trained in a standard KD setting.
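The rates discussed above follow directly from the counts in Table 3. The sketch below recomputes, for each student, how often it is right when the teacher is right and its overall accuracy on the QQP development set; the dictionary layout and the helper name `summarize` are assumptions of this example, while the counts are copied from the table.

```python
# Counts from Table 3 as (student right, student wrong).
table3 = {
    "standard_kd": {"teacher_right": (35401, 1566), "teacher_wrong": (1232, 2231)},
    "internal_kd": {"teacher_right": (36191, 776), "teacher_wrong": (750, 2713)},
}


def summarize(counts):
    """Return (rate of being right when the teacher is right, overall accuracy)."""
    right_tr, wrong_tr = counts["teacher_right"]
    right_tw, wrong_tw = counts["teacher_wrong"]
    total = right_tr + wrong_tr + right_tw + wrong_tw
    rate_when_teacher_right = right_tr / (right_tr + wrong_tr)
    accuracy = (right_tr + right_tw) / total
    return rate_when_teacher_right, accuracy


std_rate, std_acc = summarize(table3["standard_kd"])
int_rate, int_acc = summarize(table3["internal_kd"])

print(f"standard KD: {std_rate:.2%} when teacher is right, {std_acc:.2%} overall")
print(f"internal KD: {int_rate:.2%} when teacher is right, {int_acc:.2%} overall")
```

The overall accuracies recovered this way agree with the QQP accuracy column of Table 1 for Exp2.0 and Exp3.2.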
We also inspect the samples where the teacher and only one of the students are right. The QQP samples 1 and 2 in Table 4 show wrong predictions by the internally-distilled student (Exp3.2) that are not consistent with the teacher. For sample 1, although the prediction is 0, the probability output (0.4221) is very close to the threshold (0.5). Our intuition is that the internal distillation method had a regularization effect on the student such that, considering that question 2 is much more general than question 1, it does not allow the student to tell confidently whether the questions are similar or not. Also, it is worth noting that the standard KD student is extremely confident about the prediction (0.9999), which may not be ideal since this can be a sign of over-fitting or memorization. For sample 2, although the internally-distilled student is wrong (according to the ground-truth annotation and the teacher), the questions are actually related, which suggests that the student model is capable of disagreeing with the teacher while still generalizing well. Samples 3 and 4 show successful cases for the internally-distilled student, while the standard KD student fails.
We propose a new extension of the KD method that effectively compresses a large model into a smaller one, while still preserving a performance similar to that of the original model. Unlike the standard KD method, where a student only learns from the output probabilities of the teacher, we teach our smaller models by also revealing the internal representations of the teacher. Besides preserving a similar performance, our method effectively compresses the internal behavior of the teacher into the student. This is not guaranteed in the standard KD method, which can potentially affect the generalization capabilities initially intended to be transferred from the teacher. Finally, we validate the effectiveness of our method by consistently outperforming the standard KD technique on four datasets of the GLUE benchmark.
- [1] Learning Deep Architectures for AI. Foundations and Trends® in Machine Learning 2 (1), pp. 1–127.
- [2] (2019) What Does BERT Look At? An Analysis of BERT’s Attention. CoRR abs/1906.04341.
- [3] (2019) BAM! Born-Again Multi-Task Networks for Natural Language Understanding. CoRR abs/1907.04829.
- [4] (2016) Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or −1. arXiv preprint arXiv:1602.02830.
- [5] (2018) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.
- [6] (2005) Automatically Constructing a Corpus of Sentential Paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).
- [7] (2015) Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv preprint arXiv:1510.00149.
- [8] Effective Quantization Methods for Recurrent Neural Networks. CoRR abs/1611.10176.
- [9] (2015) Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531.
- [10] (2017) Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations. The Journal of Machine Learning Research 18 (1), pp. 6869–6898.
- [11] (2019) Cross-lingual Language Model Pretraining. CoRR abs/1901.07291.
- [12] (2019-07) Multi-Task Deep Neural Networks for Natural Language Understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4487–4496.
- [13] (2019) Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding. arXiv preprint arXiv:1904.09482.
- [14] (2019) RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR abs/1907.11692.
- [15] (2018) Improving Language Understanding by Generative Pre-Training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/languageunsupervised/language understanding paper.pdf.
- [16] (2019) Language Models are Unsupervised Multitask Learners. OpenAI Blog 1 (8).
- [17] (2016) SQuAD: 100,000+ Questions for Machine Comprehension of Text. CoRR abs/1606.05250.
- [18] (2014) FitNets: Hints for Thin Deep Nets. arXiv preprint arXiv:1412.6550.
- [19] (2017) Attention Is All You Need. CoRR abs/1706.03762.
- [20] (2018) GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. CoRR abs/1804.07461.
- [21] (2018) Neural Network Acceptability Judgments. CoRR abs/1805.12471.
- [22] (2019) XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv preprint arXiv:1906.08237.
Appendix for “Knowledge Distillation from Internal Representation”
We analyze our method using the QQP dataset in the main content of the paper. For completeness, we also add figures for CoLA, RTE, and MRPC. We exclude QQP from this appendix to avoid redundancy.
Performance vs. Parameters
The Impact of Data Size
The data size analysis does not apply to the CoLA, RTE, or MRPC datasets. This is because those datasets are too small to show meaningful results. Hence, we use another dataset of the GLUE benchmark to confirm the same behavior we described for the QQP dataset. We use the QNLI dataset.
QNLI. This dataset uses the Stanford Question Answering Dataset. The GLUE benchmark formulates the task as a sentence pair classification where the goal is to determine whether a context sentence contains the answer to a given question.
Consistent with the behavior in the QQP dataset, the internal KD is superior to the standard KD, as shown in Figure 9. Specifically, we use the internal KD method as described in Exp3.2 (BERT6 soft + kl + cos). Also, the gap between the two methods increases as we reduce the data size.
We provide the convergence of the internal KD methods on the CoLA dataset in Figure 10. As with QQP, the internal KD achieves the best results while the classification layer is still not optimized (see epochs 40-41 for Exp3.4 in Figure 10). Also, note that between epochs 0 and 40, the MCC scores are 0 or even negative, and after epoch 40, when the 5th and 6th layers are used, the performance immediately goes up for Exp3.3 and Exp3.4. This suggests that the model progressively learns general aspects in the lower layers and task-specific aspects in the upper layers. This knowledge allows the model to reach top performance even without updating the classification layer once, which is strong evidence of the importance of internal distillation.
Inspecting the Attention Behavior
For the attention analysis, we show a single head from one transformer layer in the main content of this paper. In Figure 11, we show more of this behavior where we compare the first heads across all the student transformer layers. The internal KD student has replicated the internal representations of the teacher while compressing by a ratio of two layers into one.