Knowledge Distillation from Internal Representations

10/08/2019 ∙ by Gustavo Aguilar, et al. ∙ Amazon ∙ University of Houston

Knowledge distillation is typically conducted by training a small model (the student) to mimic a large and cumbersome model (the teacher). The idea is to compress the knowledge from the teacher by using its output probabilities as soft-labels to optimize the student. However, when the teacher is considerably large, there is no guarantee that the internal knowledge of the teacher will be transferred into the student; even if the student closely matches the soft-labels, its internal representations may be considerably different. This internal mismatch can undermine the generalization capabilities originally intended to be transferred from the teacher to the student. In this paper, we propose to distill the internal representations of a large model such as BERT into a simplified version of it. We formulate two ways to distill such representations and various algorithms to conduct the distillation. We experiment with datasets from the GLUE benchmark and consistently show that adding knowledge distillation from internal representations is a more powerful method than only using soft-label distillation.





Transformer-based models have significantly advanced the field of natural language processing by establishing new state-of-the-art results in a large variety of tasks. Specifically, BERT [5], GPT [15], GPT-2 [16], XLM [11], XLNet [22], and RoBERTa [14] lead tasks such as text classification, sentiment analysis, semantic role labeling, and question answering, among others. However, most of these models have hundreds of millions of parameters, which significantly slows down both training and inference. Besides, the large number of parameters demands a lot of memory, making such models hard to adopt in production environments where computational resources are strictly limited.

Due to these limitations, many approaches have been proposed to reduce the size of the models while still providing similar performance. One of the most effective techniques is knowledge distillation (KD) in a teacher-student setting [9], where a cumbersome already-optimized model (i.e., the teacher) produces output probabilities that are used to train a simplified model (i.e., the student). Unlike training with one-hot labels where the classes are mutually exclusive, using a probability distribution provides more information about the similarities of the samples, which is the key part of the teacher-student distillation.

Even though the student requires fewer parameters while still performing similarly to the teacher, recent work shows the difficulty of distilling information from a huge model. Mirzadeh et al. (2019) state that, when the gap between the teacher and the student is large (e.g., shallow vs. deep neural networks), the student struggles to approximate the teacher. They propose to use an intermediate teaching assistant (TA) model to distill the information from the teacher and then use the TA model to distill information towards the student. However, we argue that the abstraction captured by a large teacher is only exposed through the output probabilities, which makes the internal knowledge from the teacher (or the TA model) hard for the student to infer. This can potentially lead the student to very different internal representations, undermining the generalization capabilities initially intended to be transferred from the teacher.

In this paper, we propose to apply KD to internal representations. Our approach allows the student to internally behave as the teacher by effectively transferring its linguistic properties. We perform the distillation at different internal points across the teacher, which allows the student to learn and compress the abstraction in the hidden layers of the large model systematically. By including internal representations, we show that our student outperforms its homologous models trained on ground-truth labels, soft-labels, or both.

Related Work

Knowledge distillation has become one of the most effective and simple techniques to compress huge models into simpler and faster models. The versatility of this framework has allowed the extension of KD to scenarios where a set of expert models in different tasks distill their knowledge into a unified multi-task learning network [3], as well as the opposite scenario where an ensemble of multi-task models are distilled into a task-specific network [13, 12]. We extend the knowledge distillation framework with a different formulation by applying the same principle to internal representations.

Using internal representations to guide the training of a student model was initially explored by Romero et al. (2014). They proposed FitNet, a convolutional student network that is thinner and deeper than the teacher while using significantly fewer parameters. In their work, they establish a middle point in both the teacher and the student models to compare internal representations. Since the dimensionality between the teacher and the student differs, they use a convolutional regressor model to map such vectors into the same space, which adds a significant number of parameters to learn. Additionally, they mainly focus on providing a deeper student network than the teacher, exploiting the particular benefits of depth in convolutional networks. Our work differs from theirs in several aspects: 1) using a single point-wise loss on the middle layers has mainly a regularization effect, but it does not guarantee the transfer of the internal knowledge from the teacher; 2) our distillation method is applied across all the student layers, which effectively compresses groups of layers from the teacher into a single layer of the student; 3) we use the internal representations as-is instead of relying on additional parameters to perform the distillation; 4) we do not focus on models deeper than the teacher, as this can slow down inference and is not necessarily an advantage for transformer-based models.

Curriculum learning (CL) [1] is another line of research that focuses on teaching complex tasks by building upon simple concepts. Although the goal is similar to ours, CL is conducted by stages focusing on simple tasks first and progressively moving to more complicated tasks. However, this method requires annotations among the preliminary tasks, and they have to be carefully picked so that the order and relation among the build-up tasks are helpful for the model. Unlike CL, we focus on teaching the internal representations of an optimized complex model, which are assumed to have the preliminary build-up knowledge for the task of interest.

Other model compression techniques include quantization [10, 8, 4] and weights pruning [7]. The first one focuses on approximating a large model into a smaller one by reducing the precision of each of the parameters. The second one focuses on removing weights in the network that do not have a substantial impact on model performance. These techniques are complementary to the method we propose in this paper, which can potentially lead to a more effective overall compression approach.


In this section, we detail the process of distilling knowledge from internal representations. First, we describe the standard KD framework [9], which is an essential part of our method. Then, we formalize the objective functions to distill the internal knowledge of transformer-based models. Lastly, we propose various algorithms to conduct the internal distillation process.

Knowledge Distillation

Hinton et al. (2015) proposed knowledge distillation (KD) as a framework to compress a large model into a simplified model that achieves similar results. The framework uses a teacher-student setting where the student learns from both the ground-truth labels (if available) and the soft-labels provided by the teacher. The probability mass associated with each class in the soft-labels allows the student to learn more information about the label similarities for a given sample. The formulation of KD considering both soft and hard labels is given as follows:

$$\mathcal{L}_{KD} = -\frac{1}{N}\sum_{i=1}^{N} p(y_i \mid x_i, \theta_T)\,\log \hat{y}_i \;-\; \lambda\,\frac{1}{N}\sum_{i=1}^{N} y_i \log \hat{y}_i \qquad (1)$$

where $N$ is the number of samples, $\theta_T$ represents the parameters of the teacher, and $p(y_i \mid x_i, \theta_T)$ are its soft-labels; $\hat{y}_i$ is the student prediction given by $p(y_i \mid x_i, \theta_S)$, where $\theta_S$ denotes its parameters, and $\lambda$ is a small scalar that weights down the hard-label loss. Since the soft-labels often present high entropy, the gradient tends to be smaller than the one from the hard-labels. Thus, $\lambda$ balances the terms by reducing the impact of the hard loss.
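As an illustration of the loss described above, the following is a minimal sketch in plain Python, assuming the soft-label and hard-label terms are both cross-entropies averaged over the batch. The function names (`softmax`, `cross_entropy`, `kd_loss`) and the default `lam=0.1` are assumptions made for this example and do not come from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy -sum_c target[c] * log(predicted[c])."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


def kd_loss(teacher_probs, student_logits, hard_labels, lam=0.1):
    """Soft-label loss plus lam times the hard-label loss, averaged over the batch.

    lam=0.1 is an illustrative default (assumption); lam=0 ignores hard labels.
    """
    n = len(student_logits)
    soft, hard = 0.0, 0.0
    for p_teacher, logits, y in zip(teacher_probs, student_logits, hard_labels):
        y_hat = softmax(logits)
        one_hot = [1.0 if c == y else 0.0 for c in range(len(y_hat))]
        soft += cross_entropy(p_teacher, y_hat)
        hard += cross_entropy(one_hot, y_hat)
    return (soft + lam * hard) / n
```

Setting `lam=0` recovers pure soft-label distillation, while a larger value gives the ground-truth labels more weight.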

Matching Internal Representations

Figure 1: Knowledge distillation from internal representations. We show the internal layers that the teacher (left) distills into the student (right).

In order to make the student model behave as the teacher model, the student is optimized with the soft-labels from the teacher's output. In addition, the student also acquires the abstraction hidden in the teacher by matching its internal representations. That is, we want to teach the student how to internally behave by compressing the knowledge of multiple layers from the teacher into a single layer of the student. Figure 1 shows a teacher with twice the number of layers of the student, where the colored boxes denote the layers where the student is taught the internal representation of the teacher. In this case, the student compresses two layers into one while preserving the linguistic behavior across the teacher layers.

We study the internal KD of transformer-based models, specifically the case of BERT and simplified versions of it (i.e., fewer transformer layers). We define the internal KD by using two terms in the loss function. Given a pair of transformer layers to match (see Figure 1), we calculate (1) the Kullback-Leibler (KL) divergence loss across the self-attention probabilities of all the transformer heads, and (2) the cosine similarity loss between the [CLS] activation vectors for the given layers. (Footnote 1: We are interested in a loss function that considers the probability distribution as a whole, and not point-wise errors.)

KL-divergence loss. Consider $A$ as the self-attention matrix that contains row-wise probability distributions per token in a sequence, given by $A = \mathrm{softmax}(QK^{\top}/\sqrt{d})$ [19], where $Q$ and $K$ are the query and key matrices and $d$ is their dimensionality. For a given head in a transformer layer, we use the KL-divergence loss as follows:

$$\mathcal{L}_{kl} = \frac{1}{L}\sum_{i=1}^{L}\sum_{j=1}^{L} A^{T}_{i,j}\,\log\frac{A^{T}_{i,j}}{A^{S}_{i,j}} \qquad (2)$$

where $L$ is the length of a sequence, and $A^{T}_{i}$ and $A^{S}_{i}$ describe the $i$-th row of the self-attention matrix for the teacher and student, respectively. The motivation for applying this loss function to the self-attention matrices comes from recent research that documents the linguistic patterns captured by the attention probabilities of BERT [2]. Forcing the divergence between the self-attention probability distributions to be as small as possible preserves the linguistic behavior in the student.
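The per-head loss can be sketched as follows, assuming each attention matrix is given as a list of rows that are already probability distributions. The function name `attention_kl` and the `eps` smoothing constant are assumptions introduced for this example.

```python
import math


def attention_kl(teacher_attn, student_attn, eps=1e-12):
    """Mean over rows of KL(teacher_row || student_row) for one attention head.

    Both inputs are square matrices (lists of rows); each row sums to 1.
    eps is a small smoothing constant (assumption) to avoid log(0).
    """
    if len(teacher_attn) != len(student_attn):
        raise ValueError("teacher and student must cover the same sequence")
    total = 0.0
    for t_row, s_row in zip(teacher_attn, student_attn):
        # Terms with zero teacher probability contribute nothing to the KL sum.
        total += sum(
            t * math.log((t + eps) / (s + eps))
            for t, s in zip(t_row, s_row)
            if t > 0.0
        )
    return total / len(teacher_attn)
```

The loss is zero when the student reproduces the teacher's attention rows exactly and grows as the student spreads probability mass differently from the teacher.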

Cosine similarity loss. For the second term of our internal distillation loss, we use cosine similarity (other vector-distance losses could be used as well without impacting generality) as follows:

$$\mathcal{L}_{cos} = 1 - \frac{h_{T} \cdot h_{S}}{\lVert h_{T} \rVert\,\lVert h_{S} \rVert} \qquad (3)$$

where $h_{T}$ and $h_{S}$ are the hidden vector representations for the [CLS] token for the teacher and student, respectively. We include this term in our internal KD formulation to encourage similar behavior in the activations going through the network. That is, while KL-divergence focuses on the self-attention matrix, it is the weighted hidden vectors that finally pass to the upper layers, not the probabilities. Even if we force the self-attention probabilities to be similar, there is no guarantee that the final activation passed to the upper layers is similar. Thus, using this extra term, we can regularize the context representation of the sample to be similar to the one from the teacher. (Footnote 3: We only use the context vector instead of all the hidden token vectors to avoid over-regularizing the model [18].)
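A minimal sketch of this term, assuming the loss is one minus the cosine similarity of the two [CLS] vectors. The function name `cosine_loss` is an assumption made for this example.

```python
import math


def cosine_loss(h_teacher, h_student):
    """Return 1 - cos(h_teacher, h_student) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(h_teacher, h_student))
    norm_t = math.sqrt(sum(a * a for a in h_teacher))
    norm_s = math.sqrt(sum(b * b for b in h_student))
    if norm_t == 0.0 or norm_s == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return 1.0 - dot / (norm_t * norm_s)
```

Under this definition the loss ranges from 0 (vectors point in the same direction) to 2 (opposite directions), and it ignores vector magnitude.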

How to Distill the Internal Knowledge?

Different layers across the teacher capture different linguistic concepts. Recent research shows that BERT builds linguistic properties that become more complex as we move from the bottom to the top of the network [2]. Since the model builds upon bottom representations, in addition to distilling all the internal layers simultaneously, we also consider distilling knowledge progressively matching internal representation in a bottom-up fashion. More specifically, we consider the following scenarios:

  1. Internal distillation of all layers. All the layers of the student are optimized to match the ones from the teacher in every epoch. In Figure 1, the distillation occurs simultaneously on all the circled numbers.

  2. Progressive internal distillation (PID). We distill the knowledge from lower layers first (close to the input) and progressively move to upper layers until the model focuses only on the classification distillation. Only one layer is optimized at a time. In Figure 1, the loss moves from one circled number to the next in a bottom-up fashion.

  3. Stacked internal distillation (SID). We distill the knowledge from lower layers first, but instead of moving from one layer to another exclusively, we keep the loss produced by previous layers, stacking them as we move to the top. Once at the top, we only perform classification (see Algorithm 1). In Figure 1, the loss accumulates the circled numbers bottom-up until only the classification loss remains.

For the last two scenarios, the student moves to the next upper layer when it reaches either a limit on the number of epochs per layer or a cosine loss threshold, whichever happens first (see line 24 in Algorithm 1). Additionally, these two scenarios can be combined with the classification loss at all times, not only once the model reaches the top layer.
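The three scenarios and the rule for moving to the next layer can be sketched as a small scheduling helper. This is only an illustration under stated assumptions: the names `active_layers` and `should_advance`, the strategy labels, and the parameters `max_epochs` and `cos_threshold` are hypothetical and not taken from the paper.

```python
def active_layers(strategy, current_layer, num_layers):
    """Student layers (1-based) whose internal loss is active at this stage.

    An empty list means only the classification loss is being optimized.
    """
    if strategy == "all":
        # Every matched layer contributes in every epoch.
        return list(range(1, num_layers + 1))
    if strategy not in ("pid", "sid"):
        raise ValueError(f"unknown strategy: {strategy}")
    if current_layer > num_layers:
        # Past the top transformer layer: classification only.
        return []
    if strategy == "pid":
        # Progressive: exactly one layer at a time.
        return [current_layer]
    # Stacked: keep the losses of all layers visited so far.
    return list(range(1, current_layer + 1))


def should_advance(epochs_on_layer, mean_cos_loss, max_epochs, cos_threshold):
    """Move up when the epoch budget is spent or the cosine loss is low enough."""
    return epochs_on_layer >= max_epochs or mean_cos_loss <= cos_threshold
```

For a six-layer student at its third stage, `active_layers("pid", 3, 6)` gives `[3]` while `active_layers("sid", 3, 6)` gives `[1, 2, 3]`, which is the difference between the progressive and stacked variants.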

| Experiment | Description | CoLA [8.5k] MCC | QQP [364k] Accuracy / F1 | MRPC [3.7k] Accuracy / F1 | RTE [2.5k] Accuracy |
|---|---|---|---|---|---|
| Fine-tuning BERTbase and BERT6 without KD | | | | | |
| Exp1.0 | BERTbase | 60.16 | 91.44 / 91.45 | 83.09 / 82.96 | 67.51 |
| Exp1.1 | BERT6 | 44.56 | 90.58 / 90.62 | 76.23 / 73.72 | 59.93 |
| Fine-tuning BERT6 with different KD techniques using BERTbase (Exp1.0) as teacher | | | | | |
| Exp2.0 | BERT6 soft | 41.72 | 90.61 / 90.65 | 77.21 / 75.74 | 62.46 |
| Exp3.0 | BERT6 soft + kl | 43.70 | 91.32 / 91.32 | 83.58 / 82.46 | 67.15 |
| Exp3.1 | BERT6 soft + cos | 42.64 | 91.08 / 91.10 | 79.66 / 78.35 | 57.04 |
| Exp3.2 | BERT6 soft + kl + cos | 42.07 | 91.37 / 91.38 | 83.09 / 81.39 | 66.43 |
| Exp3.3 | BERT6 [PID] kl + cos → soft | 45.54 | 91.22 / 91.24 | 81.62 / 80.12 | 64.98 |
| Exp3.4 | BERT6 [SID] kl + cos → soft | 46.09 | 91.25 / 91.27 | 82.35 / 81.39 | 64.62 |
| Exp3.5 | BERT6 [SID] kl + cos + soft | 43.93 | 91.21 / 91.22 | 81.37 / 79.16 | 66.43 |
| Exp3.6 | BERT6 [SID] kl + cos + soft + hard | 42.55 | 91.20 / 91.21 | 70.10 / 69.68 | 67.51 |

Table 1: The development results across four datasets. Experiments 1.0 and 1.1 are trained without any distillation method, whereas experiments 2.0 and 3.X use different combinations of algorithms to distill information. Experiment 2.0 only uses standard knowledge distillation, and it can be considered the baseline.
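Algorithm 1 Stacked Internal Distillation (SID)
1: procedure headLoss(batch, teacher layer, student layer)
2:     for each sample in the batch do
3:         P ← concatHeads(sample, teacher layer)
4:         Q ← concatHeads(sample, student layer)
5:         accumulate the KL-divergence between P and Q (Equation 2)
6:     return the accumulated KL-divergence loss
7: procedure StackIntDistill(batch, current layer)
8:     loss ← 0
9:     for each matched layer pair up to the current layer do
10:         h_T ← getCLS(batch, teacher layer)
11:         h_S ← getCLS(batch, student layer)
12:         loss ← loss + cosine loss between h_T and h_S (Equation 3)
13:         loss ← loss + headLoss(batch, teacher layer, student layer)
14:     return loss
15: current layer ← lowest student layer
16: repeat
17:     for each epoch do
18:         if the current layer is a transformer layer then
19:             ▷ Perform internal distillation
20:             for each batch do
21:                 loss ← StackIntDistill(batch, current layer)
22:                 update the student parameters with loss
23:                 Accumulate the cosine loss for the threshold
24:             if the epoch limit is reached OR the cosine loss is below the threshold then
25:                 move to the next layer
26:         else
27:             ▷ Perform standard distillation
28:             for each batch do
29:                 update the student parameters with the soft-label loss
30: until convergence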

Experiments and Results


We conduct experiments on four datasets of the GLUE benchmark [20], which we describe briefly:

  1. CoLA. The Corpus of Linguistic Acceptability [21] is part of the single sentence tasks, and it requires determining whether an English text is grammatically correct. It uses the Matthews Correlation Coefficient (MCC) to measure the performance.

  2. QQP. The Quora Question Pairs dataset is a semantic similarity dataset, where the task is to determine whether two questions are semantically equivalent or not. It uses accuracy and F1 as metrics.

  3. MRPC. The Microsoft Research Paraphrase Corpus [6] contains pairs of sentences whose annotations describe whether the sentences are semantically equivalent or not. Similar to QQP, it uses accuracy and F1 as metrics.

  4. RTE. The Recognizing Textual Entailment dataset [20] has a collection of sentence pairs whose annotations describe entailment or not entailment between the sentences (formerly annotated with the labels entailment, contradiction, or neutral). It uses accuracy as a metric.

For the MRPC and QQP datasets, the metrics are accuracy and F1, but we optimize the models on F1 only.
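For reference, the two less familiar metrics above can be computed from binary confusion-matrix counts. This is a sketch using the standard definitions of MCC and F1; the function names and the convention of returning 0.0 when a denominator is zero are assumptions for this example.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from binary confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention (assumption) when a row or column is empty
    return (tp * tn - fp * fn) / denom


def f1_score(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    if denom == 0:
        return 0.0
    return 2 * tp / denom
```

MCC ranges from -1 to 1 and stays informative on imbalanced data, which is why it is commonly preferred over accuracy for CoLA.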

Parameter Initialization

We experiment with BERTbase [5] and simplified versions of it. In the case of BERT with 6 transformer layers, we initialize the parameters using different layers of the original BERTbase model, which has 12 transformer layers. Since our goal is to compress the behavior of a subset of layers into one layer, we initialize a layer of the simplified BERT model with the upper layer of the subset. For example, Figure 1 shows the compression of groups of two layers into one layer; hence, the first layer of the student model is initialized with the parameters of the second layer of the BERTbase model. (Footnote 5: Note that the initialization does not take the parameters of the fine-tuned teacher. Instead, we use the parameters of the general-purpose BERTbase model.)
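The initialization rule can be illustrated with a small mapping from student layers to teacher layers. This sketch assumes the teacher depth is an exact multiple of the student depth; the function name `init_layer_map` is hypothetical.

```python
def init_layer_map(num_teacher_layers, num_student_layers):
    """Map each student layer (1-based) to the teacher layer that initializes it.

    Each student layer stands for a group of consecutive teacher layers and is
    initialized from the upper (last) layer of that group.
    """
    if num_teacher_layers % num_student_layers != 0:
        raise ValueError("teacher depth must be a multiple of student depth")
    group_size = num_teacher_layers // num_student_layers
    return {s: s * group_size for s in range(1, num_student_layers + 1)}
```

With 12 teacher layers and 6 student layers, student layer 1 maps to teacher layer 2 and student layer 6 maps to teacher layer 12.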

Experimental Setup

Table 1 shows the results on the development set across four datasets. We define the experiments as follows:

  • Exp1.0: BERTbase. This is the standard BERTbase model that is fine-tuned on task-specific data without any KD technique. Once optimized, we use this model as a teacher for the KD experiments.

  • Exp1.1: BERT6. This is a simplified version of BERTbase, where we use 6 transformer layers instead of 12. The layer selection for initialization is described in the previous section. We do not use any KD for this experiment. The KD experiments described below use this architecture as the student model.

  • Exp2.0: BERT6 soft. The model is trained with soft-labels produced by the fine-tuned BERTbase teacher from experiment 1.0. This scenario corresponds to Equation 1 with λ = 0 to ignore the one-hot loss.

  • Exp3.0: BERT6 soft + kl. The model uses both the soft-label and the KL-divergence losses from Equations 1 and 2. The KL-divergence loss is averaged across all the self-attention matrices from the student (i.e., 12 attention heads per transformer layer across the 6 transformer layers of the student).

  • Exp3.1: BERT6 soft + cos. The model uses both the soft-label and the cosine similarity losses from Equations 1 and 3. The cosine similarity loss is computed from the [CLS] vector from all matching layers.

  • Exp3.2: BERT6 soft + kl + cos. The model uses all the losses from all layers every epoch. This experiment combines experiments 3.0 and 3.1.

  • Exp3.3: BERT6 [PID] kl + cos → soft. The model only uses progressive internal distillation until it reaches the classification layer. Once there, only soft-labels are used.

  • Exp3.4: BERT6 [SID] kl + cos → soft. The model uses stacked internal distillation until it reaches the classification layer. Once there, only soft-labels are used.

  • Exp3.5: BERT6 [SID] kl + cos + soft. The model uses stacked internal distillation and soft-label distillation all the time during training.

  • Exp3.6: BERT6 [SID] kl + cos + soft + hard. Same as Exp3.5, but it includes the hard-labels in Equation 1 with a nonzero λ.

We optimize our models using Adam with an initial learning rate of 2e-5 and a learning rate scheduler as described by Devlin et al. (2018). We fine-tune BERTbase for 10 epochs and the simplified BERT models for 50 epochs, both with a batch size of 32 samples and a maximum sequence length of 64 tokens. We evaluate the statistical significance of our models using t-tests as described by Dror et al. (2018). All the internal KD results have shown statistical significance with a p-value less than 1e-3 with respect to the standard KD method across the datasets.
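The hyperparameters stated above can be collected in a small configuration object. This is only a convenience sketch; the class and field names are assumptions for this example, and the learning rate scheduler is left out because its details are not given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Fine-tuning settings stated in the text (names are illustrative)."""
    epochs: int
    learning_rate: float = 2e-5   # initial learning rate for Adam
    batch_size: int = 32
    max_seq_length: int = 64
    optimizer: str = "adam"


TEACHER_CONFIG = FineTuneConfig(epochs=10)   # BERTbase
STUDENT_CONFIG = FineTuneConfig(epochs=50)   # simplified BERT models
```

Only the number of epochs differs between teacher and student; all other values are shared.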

Development and Evaluation Results

| Experiment | CoLA MCC | QQP Acc. / F1 | MRPC Acc. / F1 | RTE Acc. |
|---|---|---|---|---|
| Exp1.0 | 51.4 | 71.3 / 89.2 | 84.9 / 79.9 | 66.4 |
| Exp2.0 | 38.3 | 69.1 / 88.0 | 81.6 / 73.9 | 59.7 |
| Exp3.X | 41.4 | 70.9 / 89.1 | 83.8 / 77.1 | 62.2 |

Table 2: The test results from the best models according to the development set. We add Exp1.0 (BERTbase) for reference. Exp2.0 uses BERT6 with standard distillation (soft-labels only), and Exp3.X uses the best internal KD technique with BERT6 as student according to the development set.

As shown in Table 1, we perform extensive experiments for BERT6 as a student, where we evaluate different training techniques with or without knowledge distillation. In general, the first thing to notice is that the distillation techniques outperform BERT6 trained without distillation (Exp1.1). While this is not always the case for standard distillation (Exp1.1 vs. Exp2.0 for CoLA), the internal distillation method proposed in this work consistently outperforms both Exp1.1 and Exp2.0 across all datasets. Nevertheless, the gap between the results substantially depends on the size of the data. Intuitively, this is expected behavior since the more data we provide to the teacher, the more knowledge is exposed, and hence, the student reaches a more accurate approximation of the teacher.

Additionally, our internal distillation results are consistently better than the standard soft-label distillation in the test set, as described in Table 2.


This section provides more insights into our algorithm based on parameter reduction, data size impact, model convergence, self-attention behavior, and error analysis.

Performance vs. Parameters

Figure 2: Performance vs. parameters trade-off. The points along the lines denote the number of layers used in BERT, which is reflected by the number of parameters in the x-axis.

We analyze the parameter reduction capabilities of our method. Figure 2 shows that BERT6 can easily achieve results similar to those of the original BERTbase model with 12 transformer layers. Note that BERTbase has around 109.4M parameters, which can be broken down into 23.8M parameters related to embeddings and around 85.6M parameters related to transformer layers. The BERT6 student, however, has 43.1M parameters in the transformer layers, which means that the parameter reduction is about 50%, while still performing very similarly to the teacher (91.38 F1 vs. 91.45 F1 for QQP, see Table 1). Also, note that the 0.73% F1 difference between the student only trained on soft-labels and the student trained with our method is statistically significant.
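The reduction figures can be checked with a few lines of arithmetic. The overall figure below assumes the student keeps the same 23.8M embedding parameters as the teacher, which is an assumption for this example; the variable names are illustrative.

```python
# Parameter counts in millions, as stated in the text.
embedding_params_m = 23.8
teacher_transformer_m = 85.6
student_transformer_m = 43.1

# Reduction restricted to the transformer layers.
transformer_reduction = 1 - student_transformer_m / teacher_transformer_m

# Assumption: the student reuses embeddings of the same size as the teacher.
teacher_total_m = embedding_params_m + teacher_transformer_m
student_total_m = embedding_params_m + student_transformer_m
overall_reduction = 1 - student_total_m / teacher_total_m

print(f"transformer layers: {transformer_reduction:.1%}")
print(f"whole model:        {overall_reduction:.1%}")
```

The transformer-layer reduction comes out just under 50%, in line with the text, while the whole-model reduction is smaller under this assumption because the embeddings are not compressed.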

Moreover, if we keep reducing the number of layers, the performance decays for both student models (see Figure 2). However, the internal distillation method is more resilient and retains higher performance. Eventually, with one transformer layer to distill internally, the compression rate is too high for the model to gain an additional boost when we compare BERT1 students with standard and internal distillation methods.

The Impact of Data Size

Figure 3: The impact of training size for standard vs. internal knowledge distillation.

We also evaluate the impact of the data size. For this analysis, we fix the student architecture to BERT6, and we only modify the size of the training data. We compare the standard and the internal distillation techniques for the QQP dataset, as shown in Figure 3. The internal distillation consistently outperforms the soft-label KD method. The gap between the two methods is small when the data size is large, and it tends to widen in favor of the internal KD method as the data size decreases.

Student Convergence

Figure 4: Comparing algorithm convergences across epochs. The annotations along the lines denote the layers that have been completely optimized. After the L6 point, only the classification layer is trained.

We analyze the convergence behavior during training by comparing the performance of the internal distillation algorithms across epochs. We conduct the experiments on the QQP dataset as described in Figure 4. We fix the student architecture to BERT6 and exclusively experiment with different internal KD algorithms. The figure shows three experiments: progressive internal distillation (Exp3.3), stacked internal distillation (Exp3.4), and stacked internal distillation using soft-labels all the time (Exp3.5). Importantly, note that Exp3.3 and Exp3.4 do not update the classification layer until around epoch 40, when all the transformer layers have been optimized. Nevertheless, the internal distillation by itself eventually allows the students to reach higher performance across epochs. In fact, Exp3.3 reaches its highest value when the 6th transformer layer is being optimized while the classification layer remains as it was initialized (see epoch 38 in Figure 4). This serves as strong evidence that the internal knowledge of the model can be taught and compressed without even considering the classification layer.

Inspecting the Attention Behavior

Figure 5: Attention comparison for head 8 in layer 5, each student with its corresponding head KL-divergence loss. The KL-divergence loss for the given example across all matching layers between the students and the teacher is 2.229 and 0.085 for the standard KD and internal KD students, respectively.

We inspect the internal representations learned by the students from standard and internal KD and compare their behaviors against the ones from the teacher. The goal of this experiment is to get a sense of how much the student can compress from the teacher, and how different such representations are from those of a student trained on soft-labels in a standard KD setting. For this experiment, we use the QQP dataset and BERT6 as a student. The internally-distilled student corresponds to experiment 3.2, and the soft-label student comes from experiment 2.0 (see Table 1). Figure 5 shows the compression effectiveness of the internally distilled student with respect to the teacher. Even though the model is skipping one layer for every two layers of the teacher, the student is still able to replicate the behavior taught by the teacher. While the internal representations from the student with standard KD mainly serve a general purpose (i.e., attending to the separation token while spotting connections with the word college), the representations are not the ones intended to be transferred from the teacher. This means that the original goal of compressing a model does not hold entirely, since its internal behavior is quite different from that of the teacher (see Figure 5 for the KL divergence on each student).

Error Analysis

| Method | Teacher Right (36,967): ✓ | Teacher Right (36,967): ✗ | Teacher Wrong (3,463): ✓ | Teacher Wrong (3,463): ✗ |
|---|---|---|---|---|
| Standard KD (Exp2.0) | 35,401 | 1,566 | 1,232 | 2,231 |
| Internal KD (Exp3.2) | 36,191 | 776 | 750 | 2,713 |

Table 3: Right and wrong predictions on the QQP development dataset. Based on the teacher results, we show the number of right (✓) and wrong (✗) predictions by the students from standard KD (Exp2.0) and internal KD (Exp3.2).
| No. | QQP Development Samples | Class | Teacher | Std KD | Int KD |
|---|---|---|---|---|---|
| 1 | Q1: if donald trump loses the general election, will he attempt to seize power by force claiming the election was fraudulent? | 1 | 1 (0.9999) | 1 (0.9999) | 0 (0.4221) |
| | Q2: how will donald trump react if and when he loses the election? | | | | |
| 2 | Q1: can depression lead to insanity? | 0 | 0 (0.0429) | 0 (1.2e-4) | 1 (0.9987) |
| | Q2: does stress or depression lead to mental illness? | | | | |
| 3 | Q1: can i make money by uploading videos on youtube (if i have subscribers)? | 1 | 1 (0.9998) | 0 (0.0017) | 1 (0.8868) |
| | Q2: how do youtube channels make money? | | | | |
| 4 | Q1: what are narendra modi’s educational qualifications? | 0 | 0 (0.0203) | 1 (0.9999) | 0 (0.2158) |
| | Q2: why is pmo hiding narendra modi’s educational qualifications? | | | | |

Table 4: Samples where the teacher predictions are right and only one of the students is wrong. We show the predicted label along with the probability for such prediction in parentheses. We also provide the ground-truth label in the class column.

In our internal KD method, the generalization capabilities of the teacher are replicated in the student model. This also implies that the student will potentially make the mistakes of the teacher. In fact, when we compare a student only trained on soft-labels (Exp2.0) against a student trained with our method (Exp3.2), we can see in Table 3 that the numbers of the latter align better with the teacher numbers for both wrong and right predictions. For instance, when the teacher is right (36,967), our method is right on 97.9% of the same samples (36,191), whereas the standard distillation provides a rate of 95.7% (35,401), with more than twice the number of mistakes of our method (1,566 vs. 776). On the other hand, when the teacher is wrong (3,463), the student in our method makes more mistakes and provides fewer correct predictions than the student from standard KD. Nevertheless, the overall score of the student in our method significantly exceeds the score from the student trained in a standard KD setting.
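The rates discussed above follow directly from the counts in Table 3. The sketch below recomputes them; the dictionary layout and the function name `summarize` are assumptions made for this example.

```python
# (right, wrong) student predictions from Table 3, split by teacher outcome.
counts = {
    "standard_kd": {"teacher_right": (35401, 1566), "teacher_wrong": (1232, 2231)},
    "internal_kd": {"teacher_right": (36191, 776), "teacher_wrong": (750, 2713)},
}


def summarize(student):
    """Agreement with a correct teacher and overall accuracy for one student."""
    right_ok, right_bad = counts[student]["teacher_right"]
    wrong_ok, wrong_bad = counts[student]["teacher_wrong"]
    total = right_ok + right_bad + wrong_ok + wrong_bad
    return {
        "right_when_teacher_right": right_ok / (right_ok + right_bad),
        "accuracy": (right_ok + wrong_ok) / total,
    }


for name in counts:
    stats = summarize(name)
    print(name, {key: round(value, 4) for key, value in stats.items()})
```

The overall accuracies computed this way are about 90.61% for the standard KD student and 91.37% for the internally distilled student, which agree with the QQP accuracies reported for Exp2.0 and Exp3.2 in Table 1.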

We also inspect the samples where the teacher and only one of the students are right. The QQP samples 1 and 2 in Table 4 show wrong predictions by the internally-distilled student (Exp3.2) that are not consistent with the teacher. For sample 1, although the prediction is 0, the probability output (0.4221) is very close to the threshold (0.5). Our intuition is that the internal distillation method had a regularization effect on the student such that, considering that question 1 is much more specific than question 2, it does not allow the student to confidently tell whether the questions are similar or not. Also, it is worth noting that the standard KD student is extremely confident about the prediction (0.9999), which may not be ideal since this can be a sign of over-fitting or memorization. For sample 2, although the internally-distilled student is wrong (according to the ground-truth annotation and the teacher), the questions are actually related, which suggests that the student model is capable of disagreeing with the teacher while still generalizing well. Samples 3 and 4 show successful cases for the internally-distilled student, while the standard KD student fails.


We propose a new extension of the KD method that effectively compresses a large model into a smaller one, while still preserving a similar performance from the original model. Unlike the standard KD method, where a student only learns from the output probabilities of the teacher, we teach our smaller models by also revealing the internal representations of the teacher. Besides preserving a similar performance, our method effectively compresses the internal behavior of the teacher into the student. This is not guaranteed in the standard KD method, which can potentially affect the generalization capabilities initially intended to be transferred from the teacher. Finally, we validate the effectiveness of our method by consistently outperforming the standard KD technique in four datasets of the GLUE benchmark.


  • [1] Y. Bengio (2009) Learning Deep Architectures for AI. Foundations and Trends® in Machine Learning 2 (1), pp. 1–127.
  • [2] K. Clark, U. Khandelwal, O. Levy, and C. D. Manning (2019) What Does BERT Look At? An Analysis of BERT’s Attention. CoRR abs/1906.04341.
  • [3] K. Clark, M. Luong, U. Khandelwal, C. D. Manning, and Q. V. Le (2019) BAM! Born-Again Multi-Task Networks for Natural Language Understanding. CoRR abs/1907.04829.
  • [4] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio (2016) Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or −1. arXiv preprint arXiv:1602.02830.
  • [5] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.
  • [6] W. B. Dolan and C. Brockett (2005) Automatically Constructing a Corpus of Sentential Paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).
  • [7] S. Han, H. Mao, and W. J. Dally (2015) Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv preprint arXiv:1510.00149.
  • [8] Q. He, H. Wen, S. Zhou, Y. Wu, C. Yao, X. Zhou, and Y. Zou (2016) Effective Quantization Methods for Recurrent Neural Networks. CoRR abs/1611.10176.
  • [9] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531.
  • [10] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio (2017) Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations. The Journal of Machine Learning Research 18 (1), pp. 6869–6898.
  • [11] G. Lample and A. Conneau (2019) Cross-lingual Language Model Pretraining. CoRR abs/1901.07291.
  • [12] X. Liu, P. He, W. Chen, and J. Gao (2019) Multi-Task Deep Neural Networks for Natural Language Understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4487–4496.
  • [13] X. Liu, P. He, W. Chen, and J. Gao (2019) Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding. arXiv preprint arXiv:1904.09482.
  • [14] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR abs/1907.11692.
  • [15] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving Language Understanding by Generative Pre-Training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/languageunsupervised/language understanding paper.pdf.
  • [16] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language Models are Unsupervised Multitask Learners. OpenAI Blog 1 (8).
  • [17] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ Questions for Machine Comprehension of Text. CoRR abs/1606.05250.
  • [18] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio (2014) FitNets: Hints for Thin Deep Nets. arXiv preprint arXiv:1412.6550.
  • [19] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention Is All You Need. CoRR abs/1706.03762.
  • [20] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2018) GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. CoRR abs/1804.07461.
  • [21] A. Warstadt, A. Singh, and S. R. Bowman (2018) Neural Network Acceptability Judgments. CoRR abs/1805.12471.
  • [22] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv preprint arXiv:1906.08237.

Appendix for “Knowledge Distillation from Internal Representations”

We analyze our method using the QQP dataset in the main content of the paper. For completeness, we also add figures for CoLA, RTE, and MRPC. We exclude QQP from this appendix to avoid redundancy.

Performance vs. Parameters

Figure 6: The performance-parameter trade-off for CoLA
Figure 7: The performance-parameter trade-off for RTE
Figure 8: The performance-parameter trade-off for MRPC

In the main content of the paper, we show that our method outperforms the standard knowledge distillation (KD) while both methods use the same number of parameters. The same behavior is shown across all datasets: Figures 6, 7, and 8 show the performance-parameter trade-off for CoLA, RTE, and MRPC, respectively.

The Impact of Data Size

The data size analysis does not apply to the CoLA, RTE, or MRPC datasets. This is because those datasets are too small to show meaningful results. Hence, we use another dataset of the GLUE benchmark to confirm the same behavior we described for the QQP dataset. We use the QNLI dataset.

  • QNLI. This dataset uses the Stanford Question Answering Dataset [17]. The GLUE benchmark formulates the task as a sentence pair classification where the goal is to determine whether a context sentence contains the answer to a given question.

Figure 9: The internal and standard KD comparison across multiple data sizes for QNLI on the development set.

Consistent with the behavior in the QQP dataset, the internal KD is superior to the standard KD, as shown in Figure 9. Specifically, we use the internal KD method as described in Exp3.2 (BERT6 soft + kl + cos). Also, the gap between the two methods increases as we reduce the data size.
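To make the label "soft + kl + cos" concrete, the following is a minimal pure-Python sketch of how the three terms could be combined. It assumes that "soft" is the cross-entropy against the teacher's soft labels, "kl" is a KL divergence between teacher and student attention distributions, and "cos" is a cosine distance between hidden vectors. The function names, the unweighted sum, and the averaging over attention rows are assumptions of this sketch, not the exact training objective used in the experiments.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_loss(teacher_logits, student_logits, temperature=1.0):
    """Cross-entropy of the student against the teacher's soft labels."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions (e.g. one attention row)."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def cosine_loss(u, v):
    """One minus the cosine similarity of two hidden vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def combined_loss(t_logits, s_logits, t_attn, s_attn, t_hidden, s_hidden):
    """Unweighted sum of the three terms (the weighting is an assumption)."""
    soft = soft_label_loss(t_logits, s_logits)
    kl = sum(kl_divergence(p, q) for p, q in zip(t_attn, s_attn)) / len(t_attn)
    cos = cosine_loss(t_hidden, s_hidden)
    return soft + kl + cos
```

When the student reproduces the teacher's attention rows and hidden vector exactly, the "kl" and "cos" terms vanish and only the soft-label term remains, so any internal mismatch strictly adds to the loss.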

Student Convergence

Figure 10: Convergence of internal KD methods on CoLA.

We provide the convergence of the internal KD methods on the CoLA dataset in Figure 10. As with QQP, the internal KD achieves the best results while the classification layer is still not optimized (see epochs 40-41 for experiment 2.4 in Figure 10). Also, note that between epochs 0 and 40 the MCC scores are 0 or even negative, and that after epoch 40, when the 5th and 6th layers are used, the performance immediately goes up for Exp3.3 and Exp3.4. This suggests that the model progressively learns general aspects in the lower layers and task-specific aspects in the upper layers. This knowledge allows the model to reach top performance even without updating the classification layer once, which is strong evidence of the importance of internal distillation.
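The layer-by-layer behavior described above can be pictured with a small scheduling sketch. The function `distillation_phase` and the value `epochs_per_layer=10` are hypothetical: the value is chosen only so that the 5th layer starts at epoch 40, as in the text, and the schedule actually used in the experiments may differ.

```python
def distillation_phase(epoch, num_student_layers=6, epochs_per_layer=10):
    """Return (layer_being_distilled, train_classifier) for a given epoch.

    Layers are distilled bottom-up, one at a time. Only after the last
    student layer has been distilled does the classification layer start
    training. Both default values are assumptions of this sketch.
    """
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    layer = epoch // epochs_per_layer + 1
    if layer <= num_student_layers:
        return layer, False
    return None, True


# Under these assumptions, epoch 40 is the first epoch of layer 5,
# and the classification layer has not been updated yet.
print(distillation_phase(40))
```

Under this schedule the classifier sees no gradient until epoch 60, which mirrors the observation that performance rises as soon as the upper layers are distilled, before the classification layer is updated.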

Inspecting the Attention Behavior

Figure 11: The attention probabilities of head 0 across all the student layers (internal and standard KD) and the corresponding matching teacher layers.

For the attention analysis, we show a single head from one transformer layer in the main content of this paper. In Figure 11, we show more of this behavior where we compare the first heads across all the student transformer layers. The internal KD student has replicated the internal representations of the teacher while compressing by a ratio of two layers into one.
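The two-into-one compression can be expressed as a simple layer mapping. The sketch below assumes a 12-layer teacher and a 6-layer student, and it assumes that each student layer is matched to the upper teacher layer of its pair; the function name `matching_teacher_layer` is hypothetical and layers are numbered from 1.

```python
def matching_teacher_layer(student_layer, teacher_layers=12, student_layers=6):
    """Map a 1-indexed student layer to its matching 1-indexed teacher layer.

    Assumes the teacher depth is a multiple of the student depth and that
    the student layer matches the last teacher layer in its group.
    """
    if teacher_layers % student_layers != 0:
        raise ValueError("teacher depth must be a multiple of student depth")
    if not 1 <= student_layer <= student_layers:
        raise ValueError("student_layer out of range")
    ratio = teacher_layers // student_layers
    return student_layer * ratio


print([matching_teacher_layer(i) for i in range(1, 7)])
```

With these defaults, student layer 1 is compared against teacher layer 2, student layer 2 against teacher layer 4, and so on up to the final layers of both models.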