Distilling Linguistic Context for Language Model Compression

A computationally expensive and memory intensive neural network lies behind the recent success of language representation learning. Knowledge distillation, a major technique for deploying such a vast language model in resource-scarce environments, transfers the knowledge on individual word representations learned without restrictions. In this paper, inspired by the recent observations that language representations are relatively positioned and have more semantic knowledge as a whole, we present a new knowledge distillation objective for language representation learning that transfers the contextual knowledge via two types of relationships across representations: Word Relation and Layer Transforming Relation. Unlike other recent distillation techniques for the language models, our contextual distillation does not have any restrictions on architectural changes between teacher and student. We validate the effectiveness of our method on challenging benchmarks of language understanding tasks, not only in architectures of various sizes, but also in combination with DynaBERT, the recently proposed adaptive size pruning method.




1 Introduction

Since the Transformer, a simple architecture based on the attention mechanism, succeeded in machine translation, Transformer-based models have become the new state of the art, replacing more complex architectures based on recurrent or convolutional networks on various language tasks such as language understanding and question answering bert; albert; roberta; raffel2019exploring; xlnet. However, in exchange for high performance, these models suffer from a major drawback: tremendous computational and memory costs. In particular, such large models cannot be deployed on platforms with limited resources such as mobile and wearable devices, so matching the performance of the latest models with a small network is an urgent and impactful research topic.

As the main method for this purpose, Knowledge Distillation (KD) transfers knowledge from the large and well-performing network (teacher) to a smaller network (student). There have been some efforts that distill Transformer-based models into compact networks distilbert; pd; pkd; mobilebert; tinybert; minilm.

However, they all build on the idea that each word representation is independent, ignoring relationships between words that could be more informative than individual representations.

In this paper, we pay attention to the fact that word representations from language models are highly structured and capture certain types of semantic and syntactic relationships. Word2Vec wor2vec and GloVe glove demonstrated that trained word embeddings encode linguistic patterns as linear relationships between word vectors. More recently, manning found that the distance between words encodes information about the dependency parse tree. Many other studies have also provided evidence that contextual word representations belinkov2017neural; tenney2019bert; tenney2019you and attention matrices vig2019visualizing; clark2019does contain important relations between words. Moreover, identifiability showed the vertical relations in word representations across the Transformer layers through word identifiability. Intuitively, although each word representation carries its own knowledge, the set of word representations as a whole is more semantically meaningful, since words in the embedding space are positioned relative to one another through learning.

Inspired by these observations, we propose a novel distillation objective, termed Contextual Knowledge Distillation (CKD), for language tasks that utilizes the statistics of relationships between word representations. In this paper, we define two types of contextual knowledge: Word Relation (WR) and Layer Transforming Relation (LTR). Specifically, WR is proposed to capture the knowledge of relationships between word representations and LTR defines how each word representation changes as it passes through the network layers.

We validate our method on the General Language Understanding Evaluation (GLUE) benchmark and the Stanford Question Answering Dataset (SQuAD), and show the effectiveness of CKD against the current state-of-the-art distillation methods. For a thorough validation, we conduct experiments in both task-agnostic and task-specific distillation settings. We also show that CKD performs effectively on a variety of network architectures. Moreover, because CKD places no restrictions on the student's architecture, we show that CKD further improves the performance of the adaptive-size pruning method dynabert, which involves architectural changes during training.

To summarize, our contribution is threefold:

  • (1) Inspired by the recent observations that word representations from neural networks are structured, we propose a novel knowledge distillation strategy, Contextual Knowledge Distillation (CKD), that transfers the relationships across word representations.

  • (2) We present two types of complementary contextual knowledge: horizontal Word Relation across representations in a single layer and vertical Layer Transforming Relation across representations for a single word.

  • (3) We validate CKD on the standard language understanding benchmark datasets and show that CKD not only outperforms the state-of-the-art distillation methods but also boosts the performance of an adaptive pruning method.

2 Related Work

Knowledge distillation

Since recently popular deep neural networks are computation- and memory-heavy by design, there has been a long line of research on transferring knowledge for the purpose of compression. hinton2015distilling first proposed a teacher-student framework with an objective that minimizes the KL divergence between teacher and student class probabilities. In the field of natural language processing (NLP), knowledge distillation has been actively studied seq-kd; hu2018attention. In particular, after the emergence of large pre-trained language models such as BERT (bert; roberta; xlnet; raffel2019exploring), many studies have applied knowledge distillation in the pre-training process and/or in fine-tuning for downstream tasks in order to reduce the burden of handling large models. Specifically, tang2019distilling; chia2019transformer proposed distilling BERT into simple recurrent and convolutional networks. distilbert; pd proposed using the teacher’s predictive distribution to train a smaller BERT, and pkd proposed a method to transfer individual word representations. In addition to matching the hidden states, tinybert; mobilebert; minilm also utilized the attention matrices derived from the Transformer. Several works, including liu2020fastbert; dynabert, improved the performance of other compression methods by integrating knowledge distillation objectives into the training procedure. In particular, DynaBERT dynabert proposed a method to train an adaptive-size BERT using hidden state matching distillation. Unlike previous knowledge distillation methods, which transfer the knowledge of individual word representations, we design our objective to distill the contextual knowledge contained among word representations.

Contextual knowledge of word representations

Understanding and utilizing the relationships across words is one of the key ingredients in language modeling. Word embeddings (wor2vec; glove), which capture the context of a word in a document, have traditionally been used. Unlike these traditional methods, which assign a fixed embedding to each word, contextual embedding methods (bert; ELMo) assign different embeddings depending on the surrounding words, and they have become the new standard in recent years thanks to their high performance. sentiment_classification improved the performance of sentiment classification by using word relations, and probe_structural; manning found that the distance between contextual representations contains syntactic information about sentences. Recently, identifiability also showed experimentally that the contextual representations of each token change over the layers. Our research focuses on knowledge distillation using context information between words and between layers, and to the best of our knowledge, we are the first to apply this context information to knowledge distillation.

3 Setup and background

Most recent state-of-the-art language models stack Transformer layers, each of which consists of multi-head attention and a position-wise feed-forward network.

Transformer based networks.

Given an input sentence with $n$ tokens, $X = [x_1, \ldots, x_n]$, most networks (bert; albert; roberta) utilize the embedding layer to map an input sequence of symbol representations $X$ to a sequence of continuous representations $E = [e_1, \ldots, e_n]$. Then, each $l$-th Transformer layer of the identical structure takes the previous representations $R_{l-1}$ and produces the updated representations $R_l = [r_{l,1}, \ldots, r_{l,n}]$ through two sub-layers: Multi-head Attention (MHA) and position-wise Feed Forward Network (FFN). The input at the first layer ($l = 1$) is simply $E$. In the MHA operation, where $H$ separate attention heads operate independently, each input token $r_{l-1,i}$ for each head is projected into a query $q_i \in \mathbb{R}^{d_q}$, key $k_i \in \mathbb{R}^{d_k}$, and value $v_i \in \mathbb{R}^{d_v}$, typically with $d_q = d_k = d_v = d_r / H$, where $d_r$ is the hidden size. Here, the key vectors and value vectors are packed into the matrix forms $K = [k_1, \ldots, k_n]$ and $V = [v_1, \ldots, v_n]$, respectively, and the attention value $a_i$ and output $o_{h,i}$ of each head are calculated as follows:

$$a_i = \mathrm{softmax}\!\left(\frac{K^\top q_i}{\sqrt{d_q}}\right), \qquad o_{h,i} = V a_i.$$

The outputs of all heads are then concatenated and fed through the FFN, producing the single word representation $r_{l,i}$. For clarity, we pack the attention values of all words into a matrix form $A_{l,h} = [a_1, \ldots, a_n] \in \mathbb{R}^{n \times n}$ for attention head $h$.
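As an illustration of the per-head computation described above, the following is a minimal pure-Python sketch of scaled dot-product attention for a single head. The function names (`softmax`, `attention_head`) and the list-of-lists data layout are assumptions made for this example; the query, key, and value projections are assumed to have been applied already.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(queries, keys, values):
    """Scaled dot-product attention for one head.

    queries, keys, values: lists of n vectors (lists of floats).
    Returns (attn, outputs), where attn[i] is the attention distribution
    of token i over all n tokens and outputs[i] is the attention-weighted
    sum of the value vectors.
    """
    d_k = len(keys[0])
    attn, outputs = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qc * kc for qc, kc in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        a = softmax(scores)
        out = [sum(a[j] * values[j][c] for j in range(len(values)))
               for c in range(len(values[0]))]
        attn.append(a)
        outputs.append(out)
    return attn, outputs
```

Stacking the rows of `attn` gives the n × n attention matrix of one head, which is the quantity that attention-matching distillation objectives compare between teacher and student.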

Knowledge distillation for Transformer.

In the general framework of knowledge distillation, a teacher network ($T$) with large capacity is trained in advance, and then a student network ($S$) with a pre-defined architecture that is smaller than the teacher network is trained with the help of the teacher’s knowledge. Specifically, given the teacher parameterized by $\theta_t$, training the student parameterized by $\theta_s$ aims to minimize two objectives: i) the cross-entropy loss $\mathcal{L}_{CE}$ between the output of the student network and the true label $y$, and ii) the difference $\mathcal{L}_D$ of some statistics between the teacher and student models. Overall, our goal is to minimize the following objective function:

$$\mathcal{L}(\theta_s) = \mathcal{L}_{CE}\big(S(X; \theta_s), y\big) + \lambda\, \mathcal{L}_D\big(K^t(X; \theta_t), K^s(X; \theta_s)\big),$$

where $\lambda$ controls the relative importance between the two objectives. Here, $K$ characterizes the knowledge being transferred and can vary depending on the distillation method, and $\mathcal{L}_D$ is a matching loss function such as $\ell_1$, $\ell_2$, or Huber loss.
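The two-term structure of this objective can be sketched as follows. This is a minimal illustration under stated assumptions: the combination is taken to be cross-entropy plus a weighted matching loss, the matching loss is a Huber loss, and the names `cross_entropy`, `huber`, `kd_objective`, and `lam` are hypothetical and not taken from the paper.

```python
import math

def cross_entropy(probs, label):
    """Cross-entropy of a predicted distribution against an integer label."""
    return -math.log(probs[label])

def huber(a, b, delta=1.0):
    """Mean Huber loss between two equal-length lists of statistics."""
    total = 0.0
    for x, y in zip(a, b):
        r = abs(x - y)
        # Quadratic near zero, linear for large residuals.
        total += 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
    return total / len(a)

def kd_objective(student_probs, label, student_stats, teacher_stats,
                 lam=1.0, match=huber):
    """Supervised loss plus lam times a teacher-student matching loss.

    student_stats / teacher_stats stand in for whatever knowledge is
    transferred (hidden states, attention values, relations, ...).
    """
    return (cross_entropy(student_probs, label)
            + lam * match(student_stats, teacher_stats))
```

When the student's statistics equal the teacher's, the matching term vanishes and only the supervised loss remains, which is the sense in which `lam` trades off the two objectives.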

Recent studies on knowledge distillation for Transformer-based BERT can also be understood in this general framework. The distillation methods of previous works are summarized in Appendix A.

Figure 1: Overview of our contextual knowledge distillation. (a) In the teacher-student framework, we define the two contextual knowledge, word relation and layer transforming relation which are the statistics of relation across the words from the same layer (orange) and across the layers for the same word (turquoise), respectively. (b) Given the pair-wise and triple-wise relationships of WR and LTR from teacher and student, we define the objective as matching loss between them.

4 Contextual Knowledge Distillation

We now present our distillation objective that transfers the structural or contextual knowledge which is defined based on the distribution of word representations. Unlike previous methods distilling each word separately, our method transfers the information contained in relationships between words or between layers, and provides a more flexible way of constructing embedding space than directly matching representations. The overall structure of our method is illustrated in Figure 1. Specifically, we design two key concepts of contextual knowledge from language models: Word Relation-based and Layer Transforming Relation-based contextual knowledge, as shown in Figure 1.

4.1 Word Relation (WR)-based Contextual Knowledge Distillation

Inspired by previous studies suggesting that neural networks can successfully capture contextual relationships across words (manning; glove; wor2vec), WR-based CKD aims to distill the contextual knowledge contained in the relationships across words at a certain layer. The “relationship” across a set of words can be defined in a variety of ways. Our work focuses on defining it as the sum of pair-wise and triple-wise relationships. Specifically, for each input $X$ with $n$ words, let $R = [r_1, \ldots, r_n]$ be the word representations at layer $l$ from the language model (teacher or student), as described in Section 3. Then, the objective of WR-based CKD is to minimize the following loss:

$$\mathcal{L}_{\text{CKD-WR}} = \sum_{(i,j) \in \chi^2} w_{ij}\, \mathcal{L}_D\big(\phi(r_i^s, r_j^s), \phi(r_i^t, r_j^t)\big) + \lambda_{\text{WR}} \sum_{(i,j,k) \in \chi^3} w_{ijk}\, \mathcal{L}_D\big(\psi(r_i^s, r_j^s, r_k^s), \psi(r_i^t, r_j^t, r_k^t)\big),$$

where $\chi = \{1, \ldots, n\}$, the superscripts $s$ and $t$ denote student and teacher, and $\mathcal{L}_D$ is the matching loss of Section 3. The functions $\phi$ and $\psi$ define the pair-wise and triple-wise relationships, respectively, and $\lambda_{\text{WR}}$ adjusts the scales of the two losses. Here, we suppress the layer index $l$ for clarity; the distillation loss for the entire network is simply the sum over all layers. Since not all terms in Eq. (4.1) are equally important in defining contextual knowledge, we introduce the weight values $w_{ij}$ and $w_{ijk}$ to control how important each pair-wise and triple-wise term is. How to set these weights is left open as an implementation choice: they can be determined by the locality of words (i.e., $w_{ij} = 1$ if $|i - j| \le \delta$ and 0 otherwise), or by attention information so as to focus only on relationships between related words. In this work, we use the locality of words as the weight values.

While the functions $\phi$ and $\psi$ defining the pair-wise and triple-wise relationships can also take various forms, the simplest choices are to use the distance between two words for the pair-wise $\phi$ and the angle formed by three words for the triple-wise $\psi$.

Pair-wise via distance.

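Given a pair of word representations ($r_i$, $r_j$) from the same layer, the pair-wise function $\phi(r_i, r_j)$ could be defined as the cosine distance, $\cos(r_i, r_j)$, or the $\ell_2$ distance, $\|r_i - r_j\|_2$.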
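A small pure-Python sketch of these pair-wise relations is given below. It assumes the relation is evaluated for all pairs of words in one layer and that teacher and student relations are compared with a mean squared error; the helper names (`pairwise_l2`, `pairwise_cosine`, `relation_mse`) and the toy vectors are illustrative assumptions, not the authors' code.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pairwise_l2(reps):
    """n x n Euclidean distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2<a, b>."""
    sq = [dot(r, r) for r in reps]
    n = len(reps)
    return [[math.sqrt(max(sq[i] + sq[j] - 2.0 * dot(reps[i], reps[j]), 0.0))
             for j in range(n)] for i in range(n)]

def pairwise_cosine(reps):
    """n x n cosine similarities between word representations."""
    norms = [math.sqrt(dot(r, r)) for r in reps]
    n = len(reps)
    return [[dot(reps[i], reps[j]) / (norms[i] * norms[j])
             for j in range(n)] for i in range(n)]

def relation_mse(rel_s, rel_t):
    """Mean squared difference between student and teacher relation matrices."""
    n = len(rel_s)
    return sum((rel_s[i][j] - rel_t[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)

# Teacher uses 3-d vectors and student 2-d vectors, yet both produce
# 3 x 3 relation matrices, so they can be compared directly.
teacher = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
student = [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
loss = relation_mse(pairwise_cosine(student), pairwise_cosine(teacher))
```

Because the relation matrices depend only on the number of words and not on the hidden size, the comparison is well defined even when teacher and student use different embedding dimensions.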

Triple-wise via angle.

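Triple-wise relation captures higher-order structure and provides more flexibility in constructing contextual knowledge. One of the simplest forms for the triple-wise function $\psi$ is the angle, which is calculated as

$$\psi(r_i, r_j, r_k) = \cos \angle (r_i, r_j, r_k) = \left\langle \frac{r_i - r_j}{\|r_i - r_j\|_2}, \frac{r_k - r_j}{\|r_k - r_j\|_2} \right\rangle,$$

where $\langle \cdot, \cdot \rangle$ denotes the dot product between two vectors.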
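The angle relation can be sketched in a few lines of Python. The sketch assumes the angle is measured at the middle representation `r_j`, between the directions toward `r_i` and `r_k`; this vertex convention, the function name `angle_relation`, and the `eps` guard against zero-length differences are assumptions made for illustration.

```python
import math

def angle_relation(r_i, r_j, r_k, eps=1e-12):
    """Cosine of the angle at r_j formed by the points r_i, r_j, r_k."""
    u = [a - b for a, b in zip(r_i, r_j)]  # direction from r_j to r_i
    v = [a - b for a, b in zip(r_k, r_j)]  # direction from r_j to r_k
    dot_uv = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # eps avoids division by zero when two representations coincide.
    return dot_uv / max(norm_u * norm_v, eps)
```

Since only normalized difference vectors enter the computation, the value is unchanged when all three representations are shifted by the same offset or scaled by the same positive factor, which is one reason an angle is a looser target than matching the representations themselves.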

Despite its simple form, efficiently computing the angles in Eq. (4.1) for all possible triples out of $n$ words requires storing all relative representations $(r_i - r_j)$ in an $(n \times n \times d_r)$ tensor, where $d_r$ is the dimension of a word representation. (By contrast, from the identity $\|r_i - r_j\|_2^2 = \|r_i\|_2^2 + \|r_j\|_2^2 - 2\langle r_i, r_j \rangle$, computing the pair-wise distance with the right-hand side requires no additional memory cost.) This incurs an additional memory cost of $O(n^2 d_r)$. In this case, using locality for $w_{ijk}$ in Eq. (4.1), as mentioned above, can be helpful: by considering only the triples within a distance of $\delta$ (the local window size) from $r_j$, the additional memory required for efficient computation is $O(\delta n d_r)$, which is beneficial for $\delta \ll n$. It also reduces the computational complexity of the triple-wise relation from $O(n^3 d_r)$ to $O(\delta^2 n d_r)$. Moreover, we show in the experimental section that measuring angles in a local window also improves performance.
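To make the saving from locality concrete, the sketch below enumerates only the triples whose two outer words lie within a window around the middle word and compares their count with the number of unrestricted triples. The generator name `local_triples`, the choice to require three distinct indices, and the example sizes are assumptions for illustration.

```python
def local_triples(n, delta):
    """Yield (i, j, k) with distinct indices where i and k lie within
    `delta` positions of the vertex word j."""
    for j in range(n):
        lo, hi = max(0, j - delta), min(n - 1, j + delta)
        window = [m for m in range(lo, hi + 1) if m != j]
        for i in window:
            for k in window:
                if i != k:
                    yield (i, j, k)

# Hypothetical sizes: a 128-token input with a local window of 10.
n, delta = 128, 10
num_local = sum(1 for _ in local_triples(n, delta))
num_all = n * (n - 1) * (n - 2)  # all ordered triples of distinct words
```

Each vertex contributes at most `2 * delta * (2 * delta - 1)` triples, so the local count grows linearly in `n` for a fixed window, while the unrestricted count grows cubically.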

4.2 Layer Transforming Relation (LTR) -based Contextual Knowledge Distillation

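The second structural knowledge that we propose to capture is “how each word is transformed as it passes through the layers”. Transformer-based language models are composed of a stack of identical layers and thus generate a set of representations for each word, one per layer, with more abstract concepts higher in the hierarchy. Hence, LTR-based CKD aims to distill the knowledge of how each word develops into a more abstract concept within the hierarchy. Toward this, given a set of representations for a single word $w$ in $L$ layers, $[r_{1,w}^s, \ldots, r_{L,w}^s]$ for the student and $[r_{1,w}^t, \ldots, r_{L,w}^t]$ for the teacher (here we abuse notation: $\{1, \ldots, L\}$ is not necessarily the full set of layers of the student or teacher, but the index set of layers defined by the alignment strategy; we also suppress the word index $w$ below), the objective of LTR-based CKD is to minimize the following loss:

$$\mathcal{L}_{\text{CKD-LTR}} = \sum_{(l,m) \in \rho^2} w_{lm}\, \mathcal{L}_D\big(\phi(r_l^s, r_m^s), \phi(r_l^t, r_m^t)\big) + \lambda_{\text{LTR}} \sum_{(l,m,o) \in \rho^3} w_{lmo}\, \mathcal{L}_D\big(\psi(r_l^s, r_m^s, r_o^s), \psi(r_l^t, r_m^t, r_o^t)\big),$$

where $\rho = \{1, \ldots, L\}$, $\mathcal{L}_D$ is the matching loss, $\phi$ and $\psi$ are the pair-wise and triple-wise relation functions, and $\lambda_{\text{LTR}}$ again adjusts the scales of the two losses. Here, the composition of Eq. (4.2) is the same as that of Eq. (4.1); only the objects whose relationships are captured change, from word representations in one layer to representations of a single word across layers. That is, the relationships among representations of a word in different layers can be defined from distance or angle as in Eq. (4.1): $\phi(r_l, r_m) = \cos(r_l, r_m)$ or $\|r_l - r_m\|_2$, and $\psi(r_l, r_m, r_o) = \left\langle \frac{r_l - r_m}{\|r_l - r_m\|_2}, \frac{r_o - r_m}{\|r_o - r_m\|_2} \right\rangle$.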
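The pair-wise part of this layer-wise objective can be sketched as follows. The sketch assumes that the student and teacher layers have already been aligned into lists of equal length, uses cosine similarity as the relation and a mean squared error as the matching loss, and weights all layer pairs equally; the names `cosine` and `ltr_pair_loss` and the nested-list layout are illustrative assumptions.

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def ltr_pair_loss(student_layers, teacher_layers, word):
    """Match how one word's representation relates to itself across layers.

    student_layers[l][word] and teacher_layers[l][word] hold the
    representation of `word` at the l-th aligned layer.
    """
    s = [layer[word] for layer in student_layers]
    t = [layer[word] for layer in teacher_layers]
    num_layers = len(s)
    total = 0.0
    for l in range(num_layers):
        for m in range(num_layers):
            if l != m:
                # Relation between layers l and m, student versus teacher.
                total += (cosine(s[l], s[m]) - cosine(t[l], t[m])) ** 2
    return total / (num_layers * (num_layers - 1))
```

As with the word-level relations, only layer-to-layer similarities of the same network are compared, so the student and teacher hidden sizes do not need to agree.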

Alignment strategy.

When the numbers of layers of the teacher and student differ, it is important to determine which layer of the student learns from which layer of the teacher. Previous works (pkd; tinybert) resolved this alignment issue via the uniform (i.e., skip) strategy and demonstrated its effectiveness experimentally. For an $L$-layered teacher and an $M$-layered student, the layer matching function $f$ is defined as

$$f(\text{step}_s \times t) = \text{step}_t \times t, \qquad t = 0, \ldots, g,$$

where $g$ is the greatest common divisor of $L$ and $M$, $\text{step}_t = L / g$, and $\text{step}_s = M / g$.
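One reading of the uniform (skip) strategy is sketched below. It assumes that both networks are stepped through at regular intervals determined by the greatest common divisor of their depths and that index 0 denotes the embedding output; the function name `uniform_layer_map` and these conventions are assumptions for illustration.

```python
from math import gcd

def uniform_layer_map(num_teacher_layers, num_student_layers):
    """Map student layer indices to teacher layer indices at uniform steps.

    Index 0 is taken to be the embedding output (an assumption here).
    """
    g = gcd(num_teacher_layers, num_student_layers)
    step_t = num_teacher_layers // g  # stride through the teacher
    step_s = num_student_layers // g  # stride through the student
    return {step_s * t: step_t * t for t in range(g + 1)}
```

For a 12-layer teacher and a 6-layer student, every student layer is paired with every second teacher layer; for depths such as 12 and 8, only every second student layer receives a teacher counterpart.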

Overall training objective.

The distillation objective aims to supervise the student network with the help of teacher’s knowledge. Multiple distillation loss functions can be used during training, either alone or together. We combine the proposed CKD with class probability matching (hinton2015distilling) as an additional term. In that case, our overall distillation objective is as follows:

where is a tunable parameter to balance the loss terms.

| Model | #Params | CoLA (Mcc) | MNLI-(m/-mm) (Acc) | SST-2 (Acc) | QNLI (Acc) | MRPC (F1) | QQP (Acc) | RTE (Acc) | STS-B (Spear) | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| BERT (Teacher) | 110M | 60.4 | 84.8/84.6 | 94.0 | 91.8 | 90.3 | 91.4 | 70.4 | 89.5 | 84.1 |
| Truncated BERT pkd | 67.5M | 41.4 | 81.2/- | 90.8 | 87.9 | 82.7 | 90.4 | 65.5 | - | - |
| BERT pd | 67.5M | | 81.1/81.7 | 91.1 | 87.8 | 87.9 | 90.0 | 63.0 | | 79.7 |
| TinyBERT tinybert | 67.5M | 42.8 | 83.5/ | 91.6 | 90.5 | 88.4 | 90.6 | 72.2 | | 81.3 |
| CKD | 67.5M | 52.7 | 83.5/83.4 | 92.4 | 90.7 | 89.1 | 90.8 | 70.1 | 89.1 | 82.4 |

Table 1: Comparisons for task-agnostic distillation. For the task-agnostic distillation comparison, we do not use task-specific distillation for a fair comparison. The results of TinyBERT and Truncated BERT are ones reported in minilm. Other results are as reported by their authors. We exclude BERT-of-Theseus since the authors do not consider task-agnostic distillation. Results of development set are averaged over 4 runs. “-" indicates the result is not reported in the original papers and the trained model is not released. marks our runs with the officially released model by the authors.

| Model | #Params | CoLA (Mcc) | MNLI-(m/-mm) (Acc) | SST-2 (Acc) | QNLI (Acc) | MRPC (F1) | QQP (Acc) | RTE (Acc) | STS-B (Spear) | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| BERT (Teacher) | 110M | 60.4 | 84.8/84.6 | 94.0 | 91.8 | 90.3 | 91.4 | 70.4 | 89.5 | 84.1 |
| PD pd | 67.5M | - | 82.5/83.4 | 91.1 | 89.4 | 89.4 | 90.7 | 66.7 | - | - |
| PKD pkd | 67.5M | 45.5 | 81.3/- | 91.3 | 88.4 | 85.7 | 88.4 | 66.5 | 86.2 | 79.2 |
| TinyBERT tinybert | 67.5M | 53.8 | 83.1/83.4 | 92.3 | 89.9 | 88.8 | 90.5 | 66.9 | 88.3 | 81.7 |
| BERT-of-Theseus berttheseus | 67.5M | 51.1 | 82.3/- | 91.5 | 89.5 | 89.0 | 89.6 | 68.2 | 88.7 | 81.2 |
| CKD | 67.5M | 55.1 | 83.6/84.1 | 93.0 | 90.5 | 89.6 | 91.2 | 67.3 | 89.0 | 82.4 |

Table 2: Comparisons for task-specific distillation. For a fair comparison, all students are 6/768 BERT models, distilled by BERT (12/768) teachers. Other results except for TinyBERT and PKD are as reported by their authors. Results of development set are averaged over 4 runs. “-" indicates the result is not reported. Average score is computed excluding the MNLI-mm accuracy.
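As a quick arithmetic check of how the Avg column is formed, the sketch below recomputes it for two rows of Table 2 by averaging the eight task scores with MNLI counted once (matched accuracy only), as the caption states. The helper name `glue_average` and the dictionary layout are assumptions for illustration; the scores are copied from the table.

```python
def glue_average(scores):
    """Mean over tasks, rounded to one decimal; MNLI-mm is excluded."""
    return round(sum(scores.values()) / len(scores), 1)

ckd = {"CoLA": 55.1, "MNLI-m": 83.6, "SST-2": 93.0, "QNLI": 90.5,
       "MRPC": 89.6, "QQP": 91.2, "RTE": 67.3, "STS-B": 89.0}
teacher = {"CoLA": 60.4, "MNLI-m": 84.8, "SST-2": 94.0, "QNLI": 91.8,
           "MRPC": 90.3, "QQP": 91.4, "RTE": 70.4, "STS-B": 89.5}

ckd_avg = glue_average(ckd)          # 82.4, matching the CKD row
teacher_avg = glue_average(teacher)  # 84.1, matching the teacher row
```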
| Model | #Params | SQuAD 1.1v (EM) | SQuAD 1.1v (F1) |
|---|---|---|---|
| BERT (Teacher) | 110M | 81.3 | 88.6 |
| PKD pkd | 67.5M | 77.1 | 85.3 |
| PD pd | 67.5M | 80.1 | 87.0 |
| TinyBERT tinybert | 67.5M | 80.4 | 87.2 |
| CKD | 67.5M | 81.8 | 88.7 |

Table 3: Comparison of task-specific distillation on SQuAD dataset. The results of baselines and ours are reported by performing distillation with their objectives on top of the pre-trained 6-layer BERT (6/768).


4.3 Architectural Constraints in Distillation Objectives

State-of-the-art knowledge distillation objectives come with constraints on the design of student networks, since they directly match some parts of the teacher and student networks, such as attention matrices or word representations. For example, DistilBERT distilbert and PKD pkd match each word representation independently using their cosine similarities; hence, the embedding size of the student network must follow that of the given teacher network. Similarly, TinyBERT tinybert and MINI-LM minilm match the attention matrices of the teacher and student head by head. Therefore, the teacher and student networks must have the same number of attention heads (see Appendix A for more details on diverse distillation objectives).

In addition to the advantage of distilling contextual information, our CKD method has the advantage of being able to select the student network’s structure more freely without the restrictions that appear in existing KD methods. This is because CKD matches the pair-wise or triple-wise relationships of words from arbitrary networks (student and teacher), as shown in Eq. (4.1), so it is always possible to match the information of the same dimension without being directly affected by the structure. Thanks to this advantage, in the experimental section, we show that CKD can further improve the performance of recently proposed DynaBERT (dynabert) that involves flexible architectural changes in the training phase.

5 Experiments

We conduct task-agnostic and task-specific distillation experiments to compare CKD in detail with baseline distillation objectives. We then report the performance gains achieved by our method for BERT architectures of various sizes, and for DynaBERT, which can run at adaptive width and depth by pruning attention heads or layers, when our objective is inserted into its training. Finally, we analyze the effect of each component of CKD and the impact of leveraging locality for the weight values in Eq. (4.1).


For task-agnostic distillation which compresses a large pre-trained language model into a small language model on the pre-training stage, we use a document of English Wikipedia. For evaluating the compressed language model on the pre-training stage and task-specific distillation, we use the GLUE benchmark glue which consists of nine diverse sentence-level classification tasks and SQuAD squad.


For task-agnostic distillation, we use the original BERT without fine-tuning as the teacher. Then, we perform the distillation on a student whose model size is pre-defined. We perform task-agnostic distillation using our proposed CKD objective together with class probability matching of masked language modeling for 3 epochs, and keep the other hyperparameters the same as in BERT pre-training bert. For task-specific distillation, we experiment with CKD on top of the pre-trained BERT models of various sizes pd, which were released for research in institutions with fewer computational resources (https://github.com/google-research/bert). For the importance weights of the pair-wise and triple-wise terms, we leverage the locality of words, in that $w_{ij} = 1$ if $|i - j| \le \delta$ and 0 otherwise. For this, we select $\delta$ in the range 10-21. More details, including hyperparameters, are provided in Appendix B. The code to reproduce the experimental results is available at https://github.com/GeondoPark/CKD.

Figure 2: Task specific distillation on various sizes of models. We consider diverse cases by changing (a) the network structures, (b) the number of parameters and (c) the number of FLOPs. All results are averaged over 4 runs on the development set.
Figure 3: Boosting the performance of DynaBERT via training with CKD. Comparison between the original DynaBERT and CKD-augmented DynaBERT according to (a) the number of parameters and (b) the number of FLOPs. The results are averaged over 4 runs on the development set.

5.1 Main Results

To verify the effectiveness of our CKD objective, we compare its performance with previous distillation methods for BERT compression in both task-agnostic and task-specific distillation. Following the standard setup of the baselines, we use BERT (12/768) as the teacher and a 6-layer BERT (6/768) as the student network. (In the notation $(a/b)$, $a$ is the number of layers and $b$ is the hidden size in intermediate layers; the number of attention heads is defined as $b/64$.) Therefore, the student models used in all baselines and in ours have the same number of parameters (67.5M), the same inference FLOPs (10878M), and the same inference time.

Task-agnostic Distillation.

We compare with three baselines: 1) Truncated BERT, which drops the top 6 layers from BERT, as proposed in PKD pkd; 2) BERT, which is trained using the Masked LM objective provided in PD pd; and 3) TinyBERT tinybert, which proposes individual word representation and attention map matching. Since MobileBERT mobilebert uses a specifically designed teacher and student network, each with 24 layers and an inverted bottleneck structure, we do not compare with it. DistilBERT distilbert and MINI-LM minilm use the additional BookCorpus dataset, which is no longer publicly available. Moreover, the authors do not release the code, making the results hard to reproduce. Thus, for a fair comparison, we do not compare with them in the main paper; comparisons with those methods are available in Appendix C. Results of task-agnostic distillation on the GLUE dev sets are presented in Table 1. The results show that CKD surpasses all baselines on average. Compared with TinyBERT, which transfers the knowledge of individual representations, CKD matches or exceeds it on all reported scores except RTE. These results empirically demonstrate that distribution-based knowledge works better than individual representation knowledge.

Table 4: Ablation study about the impact of each component of CKD. ’- *’ denotes eliminating *, a component of CKD.

| Objectives | MNLI-(m/-mm) (Acc) | SST-2 (Acc) | QNLI (Acc) | MRPC (F1) | QQP (Acc) | STS-B (Spear) |
|---|---|---|---|---|---|---|
| CKD | 80.7/80.8 | 91.4 | 88.1 | 88.8 | 90.3 | 87.9 |
| - WR | 80.1/80.6 | 90.6 | 87.5 | 88.5 | 89.7 | 87.5 |
| - LTR | 79.9/80.3 | 91.1 | 87.8 | 88.3 | 90.3 | 87.6 |
| - WR - LTR | 79.2/79.9 | 89.1 | 87.4 | 88.1 | 89.2 | 86.8 |

Figure 4: Effect of local window size.

Task-specific Distillation.

Here, we compare with four baselines that do not perform distillation during pre-training: 1) PD pd, which pre-trains with Masked LM and distills with Logit KD in the task-specific fine-tuning process. 2) PKD pkd, which uses only the bottom 6 layers of BERT and also performs distillation only in task-specific fine-tuning. The GLUE dev-set results of PKD are taken from (berttheseus). 3) TinyBERT tinybert. For TinyBERT, for a fair comparison, we also perform distillation only in task-specific fine-tuning, with its objectives on top of the pre-trained model provided by pd. 4) BERT-of-Theseus berttheseus, which learns a compact student network by replacing the teacher layers in the fine-tuning stage. Results of task-specific distillation on the GLUE dev sets and on SQuAD are presented in Tables 2 and 3, respectively. In brief, CKD also outperforms all baselines on SQuAD and on every GLUE dataset except RTE in task-specific distillation, verifying its effectiveness. These results consistently support that contextual knowledge works better than other distillation knowledge.

5.2 Effect of CKD on various sizes of models

For knowledge distillation aimed at network compression, it is essential to work well in more resource-scarce environments. To this end, we further evaluate our method on architectures of various sizes. For this experiment, we perform task-specific distillation on top of the pre-trained models of various sizes provided by pd. We compare CKD with three baselines: 1) the LogitKD objective used by distilbert; pd. 2) The TinyBERT tinybert objective, which includes individual word representation and attention matrix matching. 3) The MINI-LM minilm objective, which includes attention matrix and value-value relation matching. We implement the baselines ourselves and run them for task-specific distillation. We note that the MINI-LM and TinyBERT objectives are applicable only to (*/768) models, which have the same number of attention heads as the teacher model (12/768). Figure 2 illustrates that CKD consistently exhibits significant performance improvements over LogitKD. In addition, for task-specific distillation, we show that CKD works better than all baselines on (*/768) student models. The results on more datasets are provided in Appendix E.

5.3 Incorporating with DynaBERT

DynaBERT dynabert is a recently proposed adaptive-size pruning method that can run at adaptive width and depth by removing the attention heads or layers. In the training phase, DynaBERT uses distillation objectives which consist of LogitKD and individual word representations matching to improve the performance. Since the CKD objective has no constraints about architecture such as embedding size or the number of attention heads, we validate the objective by replacing it with CKD. The algorithm of DynaBERT and how to insert CKD are provided in Appendix D. To observe just how much distillation alone improves performance, we do not use data augmentation and an additional fine-tuning process. We note that objectives proposed in MINI-LM minilm and TinyBERT tinybert cannot be directly applied due to constraints of the number of attention heads. As illustrated in Figure 3, CKD consistently outperforms the original DynaBERT on dynamic model sizes, supporting the claim that distribution-based knowledge is more helpful than individual word representation knowledge. The results on more datasets are provided in Appendix E.

5.4 Ablation Studies

We provide additional ablation studies to analyze the impact of each component of CKD and of the locality introduced in Eq. (4.1) as the weight of each pair-wise and triple-wise term. For these studies, we fix the student network to a 4-layer BERT (4/512) and report results averaged over 4 runs on the development set.

Impact of each component of CKD.

The proposed CKD transfers word relation-based and layer transforming relation-based contextual knowledge. To isolate the impact of each, we successively remove each piece of our objective. Table 4 summarizes the results: removing either WR or LTR lowers performance on most tasks, and removing both lowers it further, so each component contributes and the gain is largest when they are applied together.

Locality as the importance of relation terms.

We introduced the additional weights ($w_{ij}$, $w_{ijk}$) in Eq. (4.1) for CKD-WR (and similar ones for CKD-LTR) to control the importance of each pair-wise and triple-wise term, and suggested using locality for them as one possible choice. Here, we verify the effect of locality by increasing the local window size ($\delta$) on the SST-2 and QNLI datasets. The result is illustrated in Figure 4. We observe that performance improves as the local window size increases, but degrades beyond some point. Based on this ablation study, we set the window size ($\delta$) between 10 and 21.

6 Conclusion

We proposed a novel distillation strategy that leverages contextual information efficiently based on word relation and layer transforming relation. To our knowledge, we are the first to apply this contextual knowledge which is studied to interpret the language models. Through various experiments, we show not only that CKD outperforms the state-of-the-art distillation methods but also the possibility that our method boosts the performance of other compression methods.


This work was supported by the National Research Foundation of Korea (NRF) grants (2018R1A5A1059921, 2019R1C1C1009192) and Institute of Information & Communications Technology Planning & Evaluation (IITP) grants (No.2017-0-01779, A machine learning and statistical inference framework for explainable artificial intelligence, No.2019-0-01371, Development of brain-inspired AI with human-like intelligence, No.2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)) funded by the Korea government (MSIT).


Appendix A Explanation of previous methods and their constraints

Table 5 presents the details of the knowledge distillation objectives of previous methods and their constraints.

DistilBERT distilbert uses a logit distillation loss (Logit KD), a masked language modeling loss, and a cosine loss between the teacher and student word representations in the learning process. The cosine loss serves to align the directions of the hidden state vectors of the teacher and student. Because this process computes the cosine between the two hidden state vectors, the embedding sizes of the teacher and the student model must be the same.
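The dimension constraint of a cosine alignment loss can be seen in a minimal sketch. The function name `cosine_alignment_loss` and the `1 - cos` form are assumptions for illustration and are not taken from the DistilBERT implementation.

```python
import math


def cosine_alignment_loss(teacher_vec, student_vec):
    """Return 1 - cos(teacher, student) for two hidden state vectors.

    The dot product is only defined for vectors of equal length, which is
    where the constraint on the embedding size comes from.
    """
    if len(teacher_vec) != len(student_vec):
        raise ValueError("teacher and student hidden sizes must match")
    dot = sum(t * s for t, s in zip(teacher_vec, student_vec))
    norm_t = math.sqrt(sum(t * t for t in teacher_vec))
    norm_s = math.sqrt(sum(s * s for s in student_vec))
    return 1.0 - dot / (norm_t * norm_s)


# Vectors pointing in the same direction give a loss close to zero.
aligned = cosine_alignment_loss([1.0, 2.0], [2.0, 4.0])
# Orthogonal vectors give a loss of one.
orthogonal = cosine_alignment_loss([1.0, 0.0], [0.0, 1.0])
```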

PKD pkd transfers teacher knowledge to the student with Logit KD and a patient loss. The patient loss is the mean-square loss between the normalized hidden states of the teacher and student. Computing the mean-square error between hidden states requires the hidden states of the teacher and student to have the same dimension.

TinyBERT tinybert uses an additional loss that matches word representations and attention matrices between the teacher and student. Although it gains flexibility in the embedding size by using an additional parameter, the attention matrices of the teacher and student are matched through a mean-square error loss, so the number of attention heads of the teacher and student must be the same.

MobileBERT mobilebert uses an objective similar to that of TinyBERT (tinybert) for task-agnostic distillation. However, since it matches the hidden states with distance and the attention matrices with divergence between teacher and student, it is restricted in the size of hidden states and the number of attention heads.

MiniLM minilm proposes distilling the self-attention module of the last Transformer layer of the teacher. In the self-attention module, they transfer attention matrices, as in TinyBERT and MobileBERT, together with Value-Value relation matrices. Since they match the attention matrices of the teacher and student in a one-to-one correspondence, the number of attention heads of the teacher and student must be the same.

The methods listed in Table 5 are constrained by their respective knowledge distillation objectives. In contrast, our CKD method, which uses relation statistics between word representations (hidden states), places no constraints on the student architecture.
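Why relation statistics avoid architectural constraints can be illustrated with a small sketch. It assumes cosine similarity as the pair-wise relation and a mean-square error as the matching loss; both are choices made for this illustration, and `pairwise_cosine` and `relation_mse` are hypothetical names rather than the paper's implementation. The point is that an n-token sequence always yields an n x n relation matrix, whatever the hidden size.

```python
import math


def pairwise_cosine(hidden):
    """Return the n x n matrix of cosine similarities between n word vectors."""
    norms = [math.sqrt(sum(x * x for x in vec)) for vec in hidden]
    n = len(hidden)
    return [
        [
            sum(a * b for a, b in zip(hidden[i], hidden[j])) / (norms[i] * norms[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


def relation_mse(teacher_hidden, student_hidden):
    """Mean-square error between teacher and student relation matrices."""
    if len(teacher_hidden) != len(student_hidden):
        raise ValueError("teacher and student must see the same number of tokens")
    rel_t = pairwise_cosine(teacher_hidden)
    rel_s = pairwise_cosine(student_hidden)
    n = len(rel_t)
    return sum(
        (rel_t[i][j] - rel_s[i][j]) ** 2 for i in range(n) for j in range(n)
    ) / (n * n)


# Three tokens: the teacher uses hidden size 4, the student hidden size 2.
teacher = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
student = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
loss = relation_mse(teacher, student)  # close to zero: same relational structure
```

The two models disagree on every individual vector, yet the loss vanishes because the angles between their tokens agree.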

Appendix B Details of experiment setting

This section describes the experimental setting in detail. We implemented our method with the PyTorch framework and Hugging Face's transformers package.


Task-agnostic distillation

We use the original pre-trained BERT with the masked language modeling objective as the teacher and English Wikipedia documents as training data. We set the maximum sequence length to 128 and follow the preprocessing and WordPiece tokenization of bert. We then perform distillation for 3 epochs. For the pre-training stage, we use the CKD objective together with class probability matching of masked language modeling and keep the other hyper-parameters the same as in BERT pre-training bert.
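Class probability matching can be sketched as a divergence between the teacher's and the student's predicted distributions over the vocabulary at a masked position. The snippet below assumes a temperature-softened softmax and a KL divergence. These are common choices in knowledge distillation and are assumptions here, not necessarily the exact form used in this work, and `soft_target_kl` is a hypothetical name.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_kl(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) between softened class probabilities."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


teacher_logits = [2.0, 0.5, -1.0]
matched = soft_target_kl(teacher_logits, teacher_logits, temperature=3.0)
mismatched = soft_target_kl(teacher_logits, [-1.0, 0.5, 2.0], temperature=3.0)
```

Because both models predict over the same vocabulary, this term does not depend on the hidden size or the number of attention heads.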

Task-specific distillation

Our contextual knowledge distillation proceeds in the following order. First, the pre-trained BERT is fine-tuned on the task to serve as the teacher. Then, we prepare a pre-trained small-size architecture to serve as the student. For this, we employ the pre-trained models of various sizes provided by pd. Finally, task-specific distillation with our CKD is performed.

To reduce the hyperparameter search cost, the hyperparameters of Eq. (4.1) and Eq. (4.2) are set to the same value. For the importance weights introduced for pair-wise and triple-wise terms, locality is applied only to the importance weight of the word relation (WR)-based CKD loss. The importance weight of the layer transforming relation (LTR)-based CKD loss is set to 1. For each dataset, we search over the following values and report the best result:

  • Alpha () : 0.7, 0.9

  • Temperature () : 3, 4

  • , : 1, 10, 100, 1000

  • : 1, 10, 100, 1000

Other training configurations, such as batch size, learning rate, and warm-up proportion, follow BERT bert.
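The search over the values listed above amounts to a small grid. The sketch below enumerates it. Since the symbols of the last two bullet items are not legible here, the keys `tied_weight` and `other_weight` are placeholder names introduced only for this illustration.

```python
from itertools import product

# Candidate values copied from the list above; the last two keys are
# placeholder names because the original symbols are not available.
search_space = {
    "alpha": [0.7, 0.9],
    "temperature": [3, 4],
    "tied_weight": [1, 10, 100, 1000],
    "other_weight": [1, 10, 100, 1000],
}


def iter_configs(space):
    """Yield one dict per combination of candidate values."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


configs = list(iter_configs(search_space))
num_configs = len(configs)  # 2 * 2 * 4 * 4 combinations per dataset
```

Each configuration would be trained and evaluated on the development set, and the best score per dataset is the one reported.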

| Model | #Params | CoLA (Mcc) | MNLI-(m/-mm) (Acc) | SST-2 (Acc) | QNLI (Acc) | MRPC (F1) | QQP (Acc) | RTE (Acc) | STS-B (Spear) |
|---|---|---|---|---|---|---|---|---|---|
| BERT (Teacher) | 110M | 60.4 | 84.8/84.6 | 94.0 | 91.8 | 90.3 | 91.4 | 70.4 | 89.5 |
| Truncated BERT pkd | 67.5M | 41.4 | 81.2/- | 90.8 | 87.9 | 82.7 | 90.4 | 65.5 | - |
| BERT pd | 67.5M | | 81.1/81.7 | 91.1 | 87.8 | 87.9 | 90.0 | 63.0 | |
| DistilBERT distilbert | 67.5M | 51.3 | 82.2/- | 91.3 | 89.2 | 87.5 | 88.5 | 59.9 | 86.9 |
| TinyBERT tinybert | 67.5M | 42.8 | 83.5/ | 91.6 | 90.5 | 88.4 | 90.6 | 72.2 | |
| MINI-LM minilm | 67.5M | 49.2 | 84.0/- | 92.0 | 91.0 | 88.4 | 91.0 | 71.5 | - |
| CKD | 67.5M | 52.7 | 83.5/83.4 | 92.4 | 90.7 | 89.1 | 90.8 | 70.1 | 89.1 |

Table 6: Full comparison of task-agnostic distillation between our CKD and the baseline methods. For a fair comparison, we do not use task-specific distillation. The results of TinyBERT are cited as reported by minilm. Other results are as reported by their authors. Results on the development set are averaged over 4 runs. "-" means that the result is not reported and the trained model is not released. marks our runs with the officially released model.

Appendix C Additional comparison on task-agnostic distillation

Section 5.1 of the main paper reports a fair comparison between our method and the baselines on task-agnostic distillation. Several works distilbert; minilm use the additional BookCorpus dataset, which is no longer publicly available. Here, we present the full comparison of CKD and the baselines, including DistilBERT distilbert and MINI-LM minilm. As shown in Table 6, even though we do not use the BookCorpus dataset, we outperform all baselines on four datasets and obtain comparable performance on the remaining datasets.

Appendix D Applying CKD to DynaBERT

In this section, we describe how we apply our CKD objective to DynaBERT (dynabert). Training DynaBERT consists of three stages: 1) rewiring the model according to importance, followed by 2) two stages of adaptive pruning with a distillation objective. Since we omit some details of DynaBERT for clarity, we refer the reader to the paper dynabert for more information.

We summarize the training procedure of DynaBERT with CKD in the algorithm listing for DynaBERT with CKD. To fully exploit the capacity, the more important attention heads and neurons must be shared more across the various sub-networks. Therefore, we follow phase 1 of DynaBERT and rewire the network by calculating the loss and estimating, based on gradients, the importance of each attention head in the Multi-Head Attention (MHA) and each neuron in the Feed-Forward Network (FFN). DynaBERT is then trained by accumulating gradients while varying the width and depth of BERT. In these stages, the original method uses a distillation objective that matches hidden states and logits to improve performance. We apply CKD at these stages by replacing that objective with CKD, as highlighted in blue in the algorithm listing. Since CKD places no restrictions on the student's architecture, it can be easily applied.
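The rewiring step can be sketched as sorting attention heads by importance so that every narrower sub-network keeps a prefix of the most important ones. The importance scores below are made-up numbers, and the helper names `rewire_order` and `heads_for_width` are hypothetical. DynaBERT estimates importance from gradients, which this sketch takes as given.

```python
def rewire_order(importance):
    """Return head indices sorted from most to least important."""
    return sorted(range(len(importance)), key=lambda h: importance[h], reverse=True)


def heads_for_width(order, width_multiplier):
    """Keep the leading fraction of heads for a sub-network of the given width."""
    keep = max(1, int(len(order) * width_multiplier))
    return order[:keep]


# Illustrative importance scores for four attention heads.
importance = [0.2, 0.9, 0.1, 0.5]
order = rewire_order(importance)
half_width = heads_for_width(order, 0.5)
quarter_width = heads_for_width(order, 0.25)
```

Because narrower sub-networks reuse a prefix of the same ordering, the most important heads are shared by all widths, which matches the motivation stated above.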


Appendix E More Results

Due to space limitations in the main paper, we report results on only a subset of the GLUE datasets for the experiments on the effect of model size for CKD and on boosting DynaBERT with CKD. Here, we report results on all GLUE datasets except CoLA for both experiments. We exclude CoLA because the distillation losses do not converge properly for very small models.

Here, we present the results of the two experiments on additional datasets in order: 1) the effect of CKD on various sizes of models, and 2) boosting the performance of DynaBERT when CKD is applied.

Effect of CKD on various sizes of models.

Figure 5 illustrates the performance of task-specific distillation on various sizes of models. Again, we note that the MINI-LM and TinyBERT objectives are applicable only to models (*/768), which have the same number of attention heads as the teacher model (12/768). As shown in Figure 5, our CKD consistently exhibits significant performance improvements over LogitKD for all model sizes. Compared to TinyBERT and MINI-LM, CKD shows higher performance on all datasets for almost all model sizes (*/768).

Incorporating with DynaBERT

Figure 6 shows the performance of the original DynaBERT and of DynaBERT with CKD applied. As illustrated in Figure 6, CKD further improves the original DynaBERT across dynamic widths and depths, verifying its effectiveness. The results also suggest that our method can boost the performance of other compression methods.

Figure 5: The efficiency of various sizes of models for CKD compared to baselines. The performance graph according to (a) network structure (b) the number of parameters (c) the number of FLOPs. The results are averaged over 4 runs on the development set.
Figure 6: Boosting the performance of DynaBERT when CKD is applied. The performance graph for comparison of original DynaBERT and CKD according to (a) the number of parameters and (b) the number of FLOPs. The results are averaged over 4 runs on the development set.