Pea-KD: Parameter-efficient and Accurate Knowledge Distillation

by   Ikhyun Cho, et al.
Seoul National University

How can we efficiently compress a model while maintaining its performance? Knowledge Distillation (KD) is one of the most widely known methods for model compression. In essence, KD trains a smaller student model based on a larger teacher model and tries to retain the teacher model's level of performance as much as possible. However, existing KD methods suffer from the following limitations. First, since the student model is small in absolute size, it inherently lacks model complexity. Second, the absence of an initial guide for the student model makes it difficult for the student to imitate the teacher model to its fullest. Conventional KD methods yield low performance due to these limitations. In this paper, we propose Parameter-efficient and accurate Knowledge Distillation (Pea-KD), a novel approach to KD. Pea-KD consists of two main parts: Shuffled Parameter Sharing (SPS) and Pretraining with Teacher's Predictions (PTP). Using this combination, we alleviate these limitations of KD. SPS is a new parameter sharing method that allows greater model complexity for the student model. PTP is a KD-specialized initialization method, which can act as a good initial guide for the student. When combined, these methods yield a significant increase in the student model's performance. Experiments conducted on different datasets and tasks show that the proposed approach improves the student model's performance by 4.4% on average on four GLUE tasks, outperforming existing KD baselines by significant margins.




1 Introduction

How can we improve the accuracy of knowledge distillation (KD) with a smaller number of parameters? KD uses a well-trained large teacher model to train a smaller student model. The conventional KD method (Hinton et al. (2015)) trains the student model using the teacher model’s predictions as targets. That is, the student model uses not only the true labels (hard distribution) but also the teacher model’s predictions (soft distribution) as targets. Since better KD accuracy is directly linked to better model compression, improving KD accuracy is valuable and crucial.

Naturally, there have been many studies and attempts to improve the accuracy of KD. Sun et al. (2019) introduced Patient KD which utilizes not only the teacher model’s final output but also the intermediate outputs generated from the teacher’s layers. Jiao et al. (2019) applied additional KD in the pretraining step of the student model. However, existing KD methods share the limitation of students having lower model complexity compared to their teacher models, since they are small in size. In addition, there are no proper initial guides for the student model, which is important especially when the student models are small. These limitations lead to insufficient accuracy of student models.

In this paper, we propose Pea-KD (Parameter-efficient and accurate Knowledge Distillation), a novel KD method designed especially for Transformer-based models (Vaswani et al. (2017)), which significantly improves the student model’s accuracy. Pea-KD is composed of two modules, Shuffled Parameter Sharing (SPS) and Pretraining with Teacher’s Predictions (PTP). When combined, these two methods alleviate the aforementioned KD’s limitations and yield higher performance. Pea-KD is based on the following two main ideas.

  1. We apply SPS in order to increase the effective model complexity of the student model while not increasing the number of parameters. SPS has two steps: 1) stacking the layers that share parameters, and 2) shuffling the parameters between shared pairs of layers. Doing so increases the model’s effective complexity which enables the student to better replicate the teacher model (details in Section 3.2).

  2. We design an effective pretraining task called PTP for a student in KD. Through PTP, the student model obtains additional information about the teacher and the task itself, which helps the student acquire the teacher’s knowledge more efficiently during the KD process (details in Section 3.3).

Throughout the paper we use PeaBERT (Parameter-efficient and accurate BERT), which is Pea-KD applied on BERT, as an example to investigate our proposed approach. We summarize our main contributions as follows:

  • Novel framework for KD. We propose SPS and PTP, a novel parameter sharing method and a novel KD-specialized initialization method. These methods serve as a new framework for KD to significantly improve performance.

  • Performance. When tested on four widely used GLUE tasks, PeaBERT improves KD accuracy by up to 14.8% and by 4.4% on average compared to the original BERT model. PeaBERT also outperforms existing state-of-the-art KD baselines by 3.5% on average.

  • Generality. Our proposed method Pea-KD can be applied to any Transformer-based model and any classification task with minimal modification. Our method can thus be generally applied to many ongoing KD studies in Natural Language Processing.

The rest of the paper is organized as follows. Section 2 covers related works. Section 3 describes our proposed method in detail. Section 4 presents experimental results. Then we conclude in Section 5.

2 Related Work

Pretrained Language Models.

The framework of first pretraining language models and then finetuning for downstream tasks has now become the industry standard for Natural Language Processing (NLP) models. Pretrained language models, such as BERT (Devlin et al. (2018)), XLNet (Yang et al. (2019)), RoBERTa (Liu et al. (2019)) and ELMo (Peters et al. (2018)), show how powerful this framework can be. Specifically, BERT is a language model consisting of multiple Transformer layers. By pretraining using Masked Language Modeling (MLM) and Next Sentence Prediction (NSP), BERT has achieved state-of-the-art performance on a wide range of NLP tasks, such as the GLUE benchmark (Wang et al. (2018)) and SQuAD (Rajpurkar et al. (2016)). However, these modern pretrained models are very large and contain millions of parameters, making them nearly impossible to deploy on edge devices with limited resources. In our work, we address this challenging problem by applying a novel KD method on the BERT model. Our approach can be easily applied to other Transformer-based models as well.

Model Compression.

As deep learning algorithms started getting adopted, implemented, and researched in diverse fields, high computation costs and memory shortage have started to become challenging factors. Especially in NLP, pretrained language models typically require a large set of parameters. This results in extensive cost of computation and memory. As such, Model Compression, which is to compress a model while preserving the performance as much as possible, has now become an important task for deep learning. There have already been many attempts to tackle this problem, including quantization (Gong et al. (2014)) and weight pruning (Han et al. (2015)). One promising approach is KD (Hinton et al. (2015)), which we focus on in this paper.

Knowledge Distillation (KD).

As briefly covered in Section 1, KD transfers knowledge from a well-trained and large teacher model to a smaller student model. KD uses the teacher model’s predictions on top of the true labels to train the student model. It is proven through many experiments that the student model learns to imitate the soft distribution of the teacher model’s predictions, and ultimately performs better than learning solely from the original data. There have already been many attempts to compress BERT using KD. Patient Knowledge Distillation (Sun et al. (2019)) extracts knowledge not only from the final prediction of the teacher, but also from the intermediate layers. TinyBERT (Jiao et al. (2019)) uses a two-stage learning framework which applies knowledge distillation in both pretraining and task-specific finetuning. DistilBERT (Sanh et al. (2019)) uses half of the layers of BERT-base model and applies KD during pretraining and finetuning of BERT. Zhao et al. (2019) trains the student model with smaller vocabulary set and lower hidden state dimensions. Unfortunately, these existing methods do not give sufficient accuracy due to the student model’s insufficient complexity and absence of a clear guideline for initialization. In this paper, we propose a new KD approach using parameter sharing and KD-specific initialization to alleviate the above issues. Our method improves the student model’s performance significantly and shows the state-of-the-art performance among distillation methods on BERT.

Parameter Sharing.

The idea of parameter sharing across different layers is a widely used idea for model compression. There have been several attempts to use parameter sharing in transformer architecture and BERT model. However, existing parameter sharing methods exhibit a large tradeoff between model performance and model size. They reduce the model’s size significantly, but as a result, also suffer from a great loss in performance. In this paper, we propose a novel parameter sharing method which uses a shuffling mechanism to reduce this tradeoff, resulting in an improved performance using the same number of parameters.

3 Proposed Methods

We propose PeaBERT, a novel KD method applied on BERT that shows a higher KD accuracy with a smaller number of parameters compared to existing methods. PeaBERT consists of two main modules, SPS and PTP, which together boost the student model’s performance. In the following, we provide an overview of the main challenges in KD and our methods to address them in Section 3.1. We then discuss the precise procedures of SPS and PTP in Sections 3.2 and 3.3. Lastly, we explain our final method, PeaBERT, and the training details in Section 3.4.

3.1 Overview

The BERT-base model contains over 109 million parameters. Its size often makes deployment infeasible or computationally expensive, for example on mobile devices. As a result, industry practitioners commonly use a smaller version of BERT and apply KD. However, the existing KD methods entail the following challenges:

  • Insufficient model complexity of the student model. Since the student model contains fewer parameters than the teacher model, its model complexity is also lower. The smaller and simpler the student model gets, the larger the gap between the student and the teacher grows, making it increasingly difficult for the student to replicate the teacher model’s performance. The limited complexity hinders the student model’s performance. How can we enlarge the student model’s complexity while maintaining the same number of parameters?

  • Absence of proper initial guide for the student model. Most of the existing KD methods do not consider the student model’s initialization to be crucial. In most cases, a truncated version of pretrained BERT-base model is used. There is no widely accepted and vetted guide to selecting the student’s initial state of the KD process. In reality, this hinders the student from efficiently reproducing the teacher’s results. How can we effectively initialize the student model to achieve a better KD accuracy?

We propose the following main ideas to address the challenges:

  • Shuffled Parameter Sharing (SPS): amplifying ‘effective’ model complexity of the student. To address the complexity limitation, we introduce SPS. SPS increases the student’s effective model complexity while using the same number of parameters (see details in Section 3.2). As a result, the SPS-applied student model achieves a better performance without running into the usual computational challenges.

  • Pretraining with Teacher’s Predictions (PTP): a novel pretraining task utilizing teacher’s predictions for student initialization. To address the limitation of the initial guide, we propose PTP, a novel pretraining method for the student by utilizing teacher model’s predictions. Through PTP, the student model pre-learns information about the teacher and the task itself. It helps the student better acquire the teacher’s knowledge during the KD process (see details in Section 3.3).

The following subsections describe the precise procedures of SPS, PTP, and PeaBERT in detail.
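As a rough check on the model size quoted at the start of this overview, the sketch below counts BERT-base parameters from its architecture hyperparameters. The vocabulary size of 30,522 (the uncased English checkpoint), the 512 positions, the 2 segment types, and the inclusion of the pooler are assumptions not stated in the text, and the function names are illustrative only.

```python
# Count BERT-base parameters from architecture hyperparameters.
# Assumed (not stated in the text): uncased vocabulary of 30,522 tokens,
# 512 positions, 2 segment types, and a pooler on top of the encoder.

def linear(n_in: int, n_out: int) -> int:
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(hidden: int) -> int:
    """Scale and shift vectors of a LayerNorm."""
    return 2 * hidden


def encoder_layer_params(hidden: int = 768, ffn: int = 3072) -> int:
    """Parameters of one Transformer encoder layer."""
    attention = 4 * linear(hidden, hidden)  # query, key, value, output
    feed_forward = linear(hidden, ffn) + linear(ffn, hidden)
    return attention + feed_forward + 2 * layer_norm(hidden)


def bert_params(layers: int = 12, hidden: int = 768, vocab: int = 30522,
                positions: int = 512, segments: int = 2) -> int:
    """Embeddings + encoder layers + pooler."""
    embeddings = (vocab + positions + segments) * hidden + layer_norm(hidden)
    pooler = linear(hidden, hidden)
    return embeddings + layers * encoder_layer_params(hidden) + pooler


total = bert_params()
per_layer = encoder_layer_params()
print(f"BERT-base total: {total / 1e6:.1f}M")
print(f"three encoder layers only: {3 * per_layer / 1e6:.1f}M")
```

Under these assumptions the total comes to roughly 109.5M, consistent with "over 109 million". Three encoder layers alone come to about 21.3M, which matches the size listed for the three-layer student in Section 4; this suggests, although the text does not say so, that the reported student sizes exclude the embeddings.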

3.2 Shuffled Parameter Sharing (SPS)

Figure 1: Graphical representation of SPS: (a) the first step of SPS, (b) the first and second steps of SPS combined, and (c) modified SPS for a six-layer student.

SPS improves student’s effective model complexity while not increasing the number of parameters, addressing the complexity limitations of a typical KD. SPS is composed of the following two steps.

Step 1. Paired Parameter Sharing. We start by doubling the number of layers in the student model. We then share the parameters between the bottom half and the upper half of the model, as graphically represented in Figure 1(a). By doing so, the model now has twice the number of layers and thus a higher effective model complexity while maintaining the same number of actual parameters.

Step 2. Shuffling. We shuffle the Query and Key parameters between the shared pairs. That is, for the upper half of layers we use the original Key parameters as Query parameters and the original Query parameters as Key parameters. This allows the parameter-shared pairs to have more degrees of freedom and behave more like individual layers, resulting in an increased model complexity of the student. We call this architecture SPS, which is depicted in Figure 1(b). For the 6-layer student case we slightly modify the architecture as in Figure 1(c) (we apply SPS on the top 3 layers only).

In sum, the SPS model has the same number of parameters as the original student model but has much greater effective model complexity. In Section 4, we validate through experiments that Step 1 (Paired Parameter Sharing) and Step 2 (Shuffling) indeed increase the effective model complexity, directly contributing to performance improvement.
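The two SPS steps can be sketched on a toy model in which each layer is just a dictionary of named parameter objects. This is a minimal illustration, not the authors' implementation: the pairing of lower layer i with upper layer i + n is an assumption (the text defers the exact pairing to Figure 1), the parameter names are hypothetical, and the six-layer variant is not covered.

```python
# Toy sketch of Shuffled Parameter Sharing (SPS).
# A "layer" is a dict of named parameter objects; two layers share
# parameters when they hold references to the same objects.
# Assumption: lower layer i is paired with upper layer i + n.

PARAM_NAMES = ("query", "key", "value", "output", "ffn")


def make_layer(tag: str) -> dict:
    """Stand-in for a Transformer layer: one list object per weight matrix."""
    return {name: [f"{tag}.{name}"] for name in PARAM_NAMES}


def apply_sps(layers: list, shuffle: bool = True) -> list:
    """Build a 2n-layer stack from n layers without creating new parameters."""
    upper = []
    for lower in layers:
        shared = dict(lower)  # Step 1: new layer, same parameter objects
        if shuffle:           # Step 2: swap the roles of query and key
            shared["query"], shared["key"] = lower["key"], lower["query"]
        upper.append(shared)
    return layers + upper


def count_unique_params(stack: list) -> int:
    """Number of distinct parameter objects referenced by the stack."""
    return len({id(p) for layer in stack for p in layer.values()})


student = [make_layer(f"layer{i}") for i in range(3)]
sps_student = apply_sps(student)

print(len(student), "->", len(sps_student), "layers")
print(count_unique_params(student), "->",
      count_unique_params(sps_student), "unique parameter objects")
print("upper query is lower key:", sps_student[3]["query"] is student[0]["key"])
```

One way to see why the swap matters: pre-softmax attention scores are products of queries and keys, so exchanging the two projections transposes the score matrix, and the upper layer attends differently from its partner even though both read the same weights.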

3.3 Pretraining with Teacher’s Predictions (PTP)

There can be several candidates for KD-specialized initialization. We propose a pretraining approach called PTP, and experimentally show that it improves KD accuracy significantly. The intuition here is that by PTP, the student model acquires additional knowledge about both the teacher as well as the downstream task. With this additional information, the student obtains the teacher’s knowledge more efficiently during the actual KD process.

Most of the previous studies on KD do not elaborate on the initialization of the student model. There are some studies that use a pretrained student model as an initial state, but those pretraining tasks are irrelevant to either the teacher model or the downstream task. To the best of our knowledge, our study is the first case that pretrains the student model with a task relevant to the teacher model and its downstream task. PTP consists of the following two steps.

Step 1. Creating artificial data based on the teacher’s predictions (PTP labels).

We first input the training data into the teacher model and collect the teacher model’s predictions. We then define “confidence” as follows: we apply the softmax function to the teacher model’s predictions, and the maximum value of the resulting probabilities is the confidence. Next, with a specific threshold t (a hyperparameter between 0.5 and 1.0), we assign a new label to each training example according to the rules listed in Table 1. We call these new artificial labels PTP labels.

PTP label | Teacher’s prediction correct | confidence > t
confidently correct | True | True
unconfidently correct | True | False
confidently wrong | False | True
unconfidently wrong | False | False
Table 1: Assigning new PTP labels to the training data.

Step 2. Pretrain the student model to predict the PTP labels. Using the artificial pairs (data x, PTP label) we created, we now pretrain the student model to predict the PTP label when x is provided as an input. In other words, the student model is trained to predict the PTP labels given the downstream training dataset. We train the student model until convergence.

Once these two steps are complete, we use this PTP-pretrained student model as the initial state for finetuning on the downstream task.
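The labelling rule of Step 1 can be sketched in a few lines. This is an illustrative sketch only: the function names, the example logits, and the threshold value 0.8 are made up for the demonstration, and treating the comparison with t as a strict inequality is an assumption.

```python
import math

PTP_LABELS = ("confidently correct", "unconfidently correct",
              "confidently wrong", "unconfidently wrong")


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ptp_label(teacher_logits, true_label: int, t: float) -> str:
    """Map one teacher prediction to one of the four PTP labels."""
    probs = softmax(teacher_logits)
    confidence = max(probs)              # highest class probability
    predicted = probs.index(confidence)  # teacher's predicted class
    correct = predicted == true_label
    confident = confidence > t           # assumption: strict inequality
    if correct:
        return PTP_LABELS[0] if confident else PTP_LABELS[1]
    return PTP_LABELS[2] if confident else PTP_LABELS[3]


# Hypothetical binary-task examples: (teacher logits, true label).
examples = [
    ([2.0, -1.0], 0),
    ([0.3, 0.1], 0),
    ([-2.0, 1.5], 0),
    ([0.1, 0.2], 0),
]
for logits, y in examples:
    print(logits, y, "->", ptp_label(logits, y, t=0.8))
```

The student is then pretrained as an ordinary four-class classifier on these labels, with the original training inputs as its inputs.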

3.4 PeaBERT: SPS and PTP combined

3.4.1 Overall details of PeaBERT

Our complete PeaBERT applies SPS and PTP together on BERT for maximum impact on performance. That is, given a student model, we first transform it into an SPS model and apply PTP. Once PTP is complete, we use this model as the initial state of the student model during the KD process. The overall framework of PeaBERT is depicted in Figure 2.

Figure 2: The overall framework of PeaBERT. Left half represents applying SPS and PTP to student model. Right half represents the learning baseline.

3.4.2 Learning details of PeaBERT

For the starting point of the KD process, a well-finetuned teacher model should be used. Throughout the experiments we use a 12-layer BERT-base model as the teacher. The learned parameters are denoted as:

$$\hat{\theta}_t = \arg\min_{\theta_t} \sum_{i} L_{CE}\left(y_i,\ \sigma\left(z_t(x_i; \theta_t)\right)\right)$$

where $\theta_t$ denotes the parameters of the teacher, $\sigma$ denotes the softmax function, $x_i$ denotes the training data, $z_t$ denotes the teacher model’s output predictions, $y_i$ denotes the true labels, and $L_{CE}$ denotes the cross-entropy loss. We then pretrain the student model with PTP labels using the following equation:

$$\theta_s^{PTP} = \arg\min_{\theta_s} \sum_{i} L_{CE}\left(y_i^{PTP},\ \sigma\left(z_s(x_i; \theta_s)\right)\right)$$

where $y_i^{PTP}$ denotes the PTP labels and the subscript $s$ denotes the student model. When PTP is complete, we use $\theta_s^{PTP}$ as the initial state of the KD process. During the KD process, we use a softmax temperature $T$, which controls the softness of the teacher model’s output predictions (Hinton et al. (2015)):

$$\hat{y}_t = \sigma\left(\frac{z_t(x_i; \hat{\theta}_t)}{T}\right)$$

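The loss function is as follows:

$$L = (1-\alpha)\, L_{CE}\left(y_i,\ \sigma\left(z_s(x_i;\theta_s)\right)\right) + \alpha\, L_{KL}\left(\sigma\left(\frac{z_s(x_i;\theta_s)}{T}\right),\ \sigma\left(\frac{z_t(x_i;\hat{\theta}_t)}{T}\right)\right) + \beta \sum_{k \in K} \left\| \frac{z_s^k}{\|z_s^k\|_2} - \frac{z_t^k}{\|z_t^k\|_2} \right\|_2^2$$

where $L_{KL}$ denotes the Kullback-Leibler divergence loss, $K$ denotes the indices of the particular layers we use in Patient KD, and $z^k$ denotes the output logits of the $k$-th layer. $\alpha$ and $\beta$ are hyperparameters.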
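To make the ingredients of the KD objective concrete, the sketch below combines a cross-entropy term on the true label, a temperature-softened KL term between teacher and student, and a Patient-KD-style term on normalized intermediate outputs. It is a sketch under stated assumptions rather than the authors' code: the (1 - alpha) / alpha weighting follows the usual Patient KD convention, the KL direction is taken as teacher-to-student, and all function names and example values are hypothetical.

```python
import math


def softmax(logits, T=1.0):
    """Temperature-scaled, numerically stable softmax."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]


def cross_entropy(label, probs):
    """Cross-entropy of a single example against an integer label."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def patient_term(student_layers, teacher_layers):
    """Sum of squared distances between L2-normalized layer outputs."""
    total = 0.0
    for hs, ht in zip(student_layers, teacher_layers):
        total += sum((a - b) ** 2 for a, b in zip(normalize(hs), normalize(ht)))
    return total


def kd_loss(label, student_logits, teacher_logits,
            student_layers, teacher_layers, alpha, beta, T):
    """Assumed weighting: (1 - alpha) * hard + alpha * soft + beta * patient."""
    hard = cross_entropy(label, softmax(student_logits))
    soft = kl_divergence(softmax(teacher_logits, T), softmax(student_logits, T))
    patient = patient_term(student_layers, teacher_layers)
    return (1 - alpha) * hard + alpha * soft + beta * patient


# Hypothetical single-example inputs.
loss = kd_loss(label=0,
               student_logits=[1.0, 0.2], teacher_logits=[2.5, -0.5],
               student_layers=[[0.1, 0.9]], teacher_layers=[[0.2, 0.8]],
               alpha=0.7, beta=100.0, T=2.0)
print(f"KD loss: {loss:.4f}")
```

When the student reproduces the teacher exactly, the soft and patient terms vanish and only the weighted cross-entropy on the true label remains, which is a quick sanity check for any implementation of this kind of loss.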

4 Experiments

In this section, we will discuss the experiments we performed to assess the effectiveness of our proposed method. Our goal was to answer the following questions.

  • Q1. Overall Performance. How does PeaBERT perform compared to the currently existing KD methods? (Section 4.2)

  • Q2. Effectiveness of SPS. To what extent does the SPS process improve the effective complexity of the student model? (Section 4.3)

  • Q3. Effectiveness of PTP. Is the new PTP-training a good initialization method? Compared to the conventionally used truncated version of the BERT-base model, what is the impact that our PTP initialization method has on model performance? (Section 4.4)

4.1 Experimental Settings and Datasets Used

Datasets. We evaluate our proposed methods on four of the most widely used datasets in the General Language Understanding Evaluation (GLUE) benchmark (Wang et al. (2018)): SST-2, QNLI, RTE, and MRPC. For sentiment classification, we use the Stanford Sentiment Treebank (SST-2) (Socher et al. (2013)). For natural language inference, we use QNLI (Rajpurkar et al. (2016)) and RTE. For paraphrase similarity matching, we use the Microsoft Research Paraphrase Corpus (MRPC) (Dolan and Brockett (2005)). Specifically, SST-2 is a movie review dataset with binary annotations, where the binary label indicates positive and negative reviews. QNLI is a task of predicting whether a question-answer pair is an entailment or not. RTE is based on a series of textual entailment challenges, and MRPC contains pairs of sentences and corresponding labels, which indicate the semantic equivalence relationship between each pair.

Competitors. We use Patient Knowledge Distillation (PKD, Sun et al. (2019)) as our baseline learning method to compare and quantify the effectiveness of our proposed approach. Patient Knowledge Distillation is a variant of the original Knowledge Distillation method (Hinton et al. (2015)), which is one of the most widely used baselines. We conduct our experiments on the BERT model (Devlin et al. (2018)) and compare the results of PeaBERT to the original BERT. In addition, we compare the results with other state-of-the-art BERT-distillation models, including DistilBERT (Sanh et al. (2019)) and TinyBERT (Jiao et al. (2019)).

Training Details. We use the full twelve-layer original BERT model (Devlin et al. (2018)) as the teacher model and further finetune the teacher for each task independently. The student models are created using the same architecture as the original BERT, but the number of layers is reduced to 1, 2, 3, or 6 depending on the experiment. That is, we initialize the student model using the first n layers of parameters from the pretrained original BERT obtained from Google’s official BERT repo. Industry norms are used to create the baseline against which we measure the effectiveness of our proposed method: PKD is used with the following hyperparameter settings. Training batch size is chosen from {32, 64}, learning rate from $\{1, 2, 5\} \times 10^{-5}$, number of epochs from {6, 10}, $\alpha$ from {0.3, 0.7}, and $\beta$ from {100, 500}.
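The baseline initialization described above, keeping only the first n layers of a pretrained checkpoint, can be sketched as a filter over a dictionary of named weights. This is a minimal illustration: the key names such as "encoder.layer.3.query" are hypothetical, real checkpoints follow their own naming scheme, and the string values stand in for weight tensors.

```python
import re

# Hypothetical naming scheme for per-layer weights.
LAYER_KEY = re.compile(r"^encoder\.layer\.(\d+)\.")


def truncate_checkpoint(pretrained_state: dict, n_layers: int) -> dict:
    """Keep non-layer weights plus the weights of layers 0 .. n_layers - 1."""
    student_state = {}
    for name, value in pretrained_state.items():
        match = LAYER_KEY.match(name)
        if match is None or int(match.group(1)) < n_layers:
            student_state[name] = value
    return student_state


# Toy 12-layer checkpoint with placeholder values instead of tensors.
pretrained_state = {"embeddings.word": "E"}
for i in range(12):
    for part in ("query", "key", "value", "output", "ffn"):
        pretrained_state[f"encoder.layer.{i}.{part}"] = f"W{i}.{part}"

student_state = truncate_checkpoint(pretrained_state, n_layers=3)
print(len(pretrained_state), "->", len(student_state), "entries")
```

The embeddings and any other non-layer weights are carried over unchanged; only the encoder layers above the cutoff are dropped.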

4.2 Overall Performance

Method RTE MRPC SST-2 QNLI Avg
BERT_1-PKD 52.8 80.6 83.6 64.0 70.3
PeaBERT_1 53.0 81.0 86.9 78.8 75.0
BERT_2-PKD 53.5 80.4 87.0 80.1 75.2
PeaBERT_2 64.1 82.7 88.2 86.0 80.3
BERT_3-PKD 58.4 81.9 88.4 85.0 78.4
PeaBERT_3 64.5 85.0 90.4 87.0 81.7
Table 2: Overall results of PeaBERT compared to the state-of-the-art KD baseline, PKD. The results are evaluated on the test set of the official GLUE benchmark. The subscript numbers denote the number of independent layers of the student. The F1 metric is used for MRPC.
Method # of parameters RTE MRPC SST-2 QNLI Avg
DistilBERT 42.6M 59.9 87.5 91.3 89.2 82.0
TinyBERT 42.6M 70.4 90.6 93.0 91.1 86.3
PeaBERT 42.6M 73.6 92.9 93.5 90.3 87.6
Table 3: PeaBERT in comparison to other state-of-the-art competitors in dev set. The cited results of the competitors are from the official papers of each method. For accurate comparison, model dimensions are fixed to six layers across all models compared.

We summarize the performance of our proposed method, PeaBERT, against the baseline in Table 2. We also compare the results of PeaBERT against its competitors DistilBERT and TinyBERT in Table 3. We observe the following from the results.

First, we see from Table 2 that PeaBERT consistently yields higher performance in downstream tasks across all three model sizes that were tested. Notably, the proposed method shows average improvements of 4.7% for the 1-layer student, 5.1% for the 2-layer student, and 3.3% for the 3-layer student. These results strongly validate the effectiveness of our proposed PeaBERT across varying downstream tasks and student model sizes.

Second, using the same number of parameters, PeaBERT outperforms the state-of-the-art KD baselines DistilBERT and TinyBERT by 5.6% and 1.3% on average, respectively. We use a 6-layer student model for this comparison. A notable advantage of PeaBERT is that it achieves this improvement using only the downstream dataset: unlike its competitors, PeaBERT does not touch the original pretraining tasks, namely Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). This reduces training time significantly. For example, DistilBERT took approximately 90 hours with eight 16GB V100 GPUs, while PeaBERT took from a minimum of one minute (PeaBERT with RTE) to a maximum of one hour (PeaBERT with QNLI) using just two NVIDIA T4 GPUs.

Finally, another strong suit of PeaBERT is that it can be directly applied to other Transformer-based models with minimal modifications. The SPS method can be directly applied to any Transformer-based model, and the PTP method can be applied to any classification task. For instance, we can easily implement the proposed method on DistilBERT and TinyBERT models, which will likely further improve their performance.

4.3 Effectiveness of SPS

Method # of parameters RTE MRPC SST-2 QNLI Avg
BERT 21.3M 61.4 84.3 89.4 84.8 80.0
SPS-1 21.3M 63.5 85.8 89.6 85.5 81.1
SPS-2 21.3M 68.6 86.8 90.2 86.5 83.0
Table 4: An ablation study to validate each step of SPS. A three-layer student model is chosen as representative. The results are derived using the GLUE dev set.

In this section we perform an ablation study to verify the effectiveness of SPS at increasing the student model’s complexity. We compare three models: BERT, SPS-1, and SPS-2. BERT is the original BERT model, which applies none of the SPS steps. SPS-1 applies only the first step in the SPS process to BERT. SPS-2 applies both the first and second SPS steps to BERT.

The ablation study is conducted on BERT as a representative model and the results are summarized in Table 4. Compared to the original BERT, SPS-1 shows improved accuracy on all the downstream datasets, with an average gain of 1.1%. Comparing SPS-1 with SPS-2, we note that SPS-2 consistently shows better performance, with an average gain of 1.9%. We can therefore conclude that both steps of the SPS process individually increase the effective model complexity of the student model.

4.4 Effectiveness of PTP

Model # of parameters RTE MRPC SST-2 QNLI Avg
PeaBERT-p 21.3M 68.6 86.8 90.2 86.5 82.9
PeaBERT 21.3M 70.8 88.0 91.2 87.1 84.3
Table 5: An ablation study to verify the effectiveness of PTP. The results are derived using GLUE dev set.

In this section, we perform an ablation study to validate the effectiveness of using PTP as an initial guide for the student model. Similarly to how we validate the effectiveness of SPS in Section 4.3, we use BERT as our representative model and compare PeaBERT to its variant PeaBERT-p. PeaBERT-p is the PeaBERT model without PTP, which essentially applies only SPS to BERT. PeaBERT goes one step further and applies both SPS and PTP. The results are reported in Table 5.

As summarized in Table 5, applying PTP increases accuracy across all four of the datasets with an average of 1.4 points, proving the effectiveness of using PTP to increase model performance. We can now validate our second main claim that initializing a student model with KD-specialized initialization prior to applying KD improves performance. As existing KD methods do not place much emphasis on the initialization process, this finding highlights a potentially major, undiscovered path to improving model performance. Further and deeper research related to KD-specialized initialization could be promising.

5 Conclusion

In this paper, we introduced Pea-KD, a new KD method for Transformer-based distillation, and demonstrated its efficacy. Our goal was to address and reduce the limitations of the currently available KD methods. We first introduced SPS, a new parameter sharing approach that uses a shuffling mechanism, which enhances the complexity of the student model while using the same number of parameters. We then introduced PTP, a KD-specific initialization method for the student model. Applying these two methods on BERT, we introduced PeaBERT. Through experiments conducted using multiple datasets and varying model sizes, we showed that our method improves KD accuracy significantly. We showed that PeaBERT works well across different datasets and outperforms the original BERT as well as other state-of-the-art baselines on BERT distillation.

In future work, we would like to delve deeper into the concept of KD-specialized initialization of the student model. Also, since PTP and SPS are independent processes on their own, we anticipate model compression to further improve when combined with other existing model compression techniques, such as weight pruning and quantization.


References

  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • W. B. Dolan and C. Brockett (2005) Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).
  • Y. Gong, L. Liu, M. Yang, and L. Bourdev (2014) Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115.
  • S. Han, J. Pool, J. Tran, and W. Dally (2015) Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pp. 1135–1143.
  • G. E. Hinton, S. Osindero, and Y. W. Teh (2006) A fast learning algorithm for deep belief nets. Neural Computation 18, pp. 1527–1554.
  • G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  • X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu (2019) TinyBERT: distilling BERT for natural language understanding. arXiv preprint arXiv:1909.10351.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
  • V. Sanh, L. Debut, J. Chaumond, and T. Wolf (2019) DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
  • R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642.
  • S. Sun, Y. Cheng, Z. Gan, and J. Liu (2019) Patient knowledge distillation for BERT model compression. arXiv preprint arXiv:1908.09355.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2018) GLUE: a multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, pp. 5753–5763.
  • S. Zhao, R. Gupta, Y. Song, and D. Zhou (2019) Extreme language model compression with optimal subwords and shared projections. arXiv preprint arXiv:1909.11687.