
WaLDORf: Wasteless Language-model Distillation On Reading-comprehension

by   James Yi Tian, et al.

Transformer-based Very Large Language Models (VLLMs) like BERT, XLNet and RoBERTa have recently shown tremendous performance on a large variety of Natural Language Understanding (NLU) tasks. However, due to their size, these VLLMs are extremely resource intensive and cumbersome to deploy at production time. Several recent publications have looked into various ways to distil knowledge from a transformer-based VLLM (most commonly BERT-Base) into a smaller model which can run much faster at inference time. Here, we propose a novel set of techniques which together produce a task-specific hybrid convolutional and transformer model, WaLDORf, that achieves state-of-the-art inference speed while still being more accurate than previous distilled models.



I Introduction

The recent emergence of VLLMs[1, 2, 3, 4, 5, 6, 7] has made a wide variety of previously extremely difficult NLU tasks feasible. From an academic perspective, the leader boards of a broad set of NLU benchmarks, such as the General Language Understanding Evaluation (GLUE[8] or Super-GLUE[9]) and the Stanford Question Answering Dataset (SQuAD[10, 11]), are dominated by fine-tuned versions of such VLLMs, generally based on transformers[12]. From industry, real world problems are now being solved by putting these VLLMs into production. Recently, both Google and Microsoft have announced that they use some (possibly distilled) version of a VLLM in their search algorithms. As such, it has become increasingly apparent that deploying a (fine-tuned) VLLM into production can generate a lot of value. However, the resource requirements of deploying a VLLM into production can be prohibitively expensive. VLLMs, being very large, require both a lot of memory and a lot of computation for inference. Therefore, there has been much recent work on distilling knowledge from a VLLM into a much smaller, faster model[13, 14, 15, 16]. Our work continues that line of research by examining several novel techniques which, together with previously published techniques, produce a leaner, faster but still highly accurate model. In concrete terms, while maintaining Exact Match (EM) and F1 scores on SQuAD v2.0 higher than TinyBERT, 4-layer Patient Knowledge Distillation of BERT (BERT-PKD), and 4-layer DistilBERT, we have achieved 1.24x, 3.1x, and 3.1x speed up over those models respectively (4-layer BERT-PKD and 4-layer DistilBERT have the same model architecture and so their inference speed is the same). Compared with BERT-Base and BERT-Large, we achieved 9.13x and 28.63x speed up respectively.

For this work, we are looking into task specific distillation for the SQuAD v2.0 task. We chose to look more deeply into task specific distillation so that we could explore task specific data augmentation and its effects on model distillation in more detail. We chose specifically the SQuAD v2.0 task because it is relatively difficult and flexible. Recent advances in NLU have proposed to frame a wide variety of NLU tasks into the Question-Answering format[7, 17].

The major contributions (which we will describe in detail in section III) of our work are:

  • With inspiration from convolutional auto encoders, we use convolutions, max pooling, and up sampling to shrink the sequence length of inputs to the transformer encoder blocks and then expand the outputs of those encoder blocks back to the original sequence length.

  • We modify the hidden-state distillation and attention-weight distillation by using average pooling and max pooling respectively to align the sequence dimension of our model with the teacher model.

  • We modify the patient knowledge distillation procedure to build layers up from the bottom one layer at a time to reduce the covariate shift experienced by each encoder block.

  • We fine-tune a pre-trained sequence-to-sequence VLLM to produce high quality questions from given contexts to perform task specific data augmentation.

The rest of our paper is organized as follows: in section II we discuss some related work and models from which we drew inspiration; in section III we describe in detail our methodology, including model architecture, model distillation procedure, data augmentation, and inference speed testing procedure; in section IV we present all of our results, including results from our ablation study; in section V we discuss our findings in detail; and finally in section VI we present a few concluding remarks and directions for future work.

II Related Work

Model compression has been explored in the past in several parallel paths. There is a whole field of study devoted to weight quantization, weight pruning, and weight encoding to reduce the size of large neural networks - often achieving impressive results[18]. Recently, there have been efforts to apply weight quantization and pruning to transformer models as well[19, 20, 21]. For this work, we instead focus on the parallel path(s) of exploration into compressing large transformers using architectural changes or model distillation.

The recently released ALBERT[22] approached model compression from the point of view of making architectural and training objective changes to the transformer model. The authors of ALBERT use the following three main techniques: 1) Factorized embedding parameterization to greatly reduce the number of weights in the embedding layer, 2) Weight sharing among all encoder blocks of the transformer, greatly reducing the number of weights in the encoder stack, and 3) Using an inter-sentence coherence loss instead of a next sentence prediction loss during BERT-like pre-training. With these architectural and training modifications, ALBERT produces impressive results and achieved state-of-the-art accuracy, for their xx-large model, on a wide variety of NLU tasks. ALBERT-base also achieves quite high accuracy while reducing the number of parameters to 12 million (compared with 110 million for BERT-Base and 335 million for BERT-Large). However, because data still must be processed through every layer of ALBERT, even though the weights are shared between all the layers, the inference speed improvement achieved by ALBERT is not so striking. ALBERT-Base achieves 21.1x speed up from an extra large version of BERT built by the authors of ALBERT (BERT-XLarge) which is itself 17.7x slower than BERT-Base. This means that ALBERT-Base achieves only roughly 1.2x speed up from BERT-Base itself. For our purposes, we wanted to look at much faster architectures. Therefore, we only incorporated the factorized embedding parameterization idea from ALBERT. Other possibly faster architectures have of course also been explored for language modeling[23].

Model distillation was first applied to distilling ensemble model knowledge into a single model[24] and then expanded to other domains[25]. In this vein, we point out three recent efforts at model distillation of BERT: DistilBERT[16], BERT-PKD[13], and TinyBERT[14]. DistilBERT took inspiration from model distillation as put forth by Hinton et al. and distilled knowledge from the softmax outputs produced by BERT-Base into a 6 encoder block (also called 6-layer) version of BERT. DistilBERT has half as many encoder blocks as BERT-Base but is otherwise identical to BERT-Base (in terms of hidden dimension size, the number of attention heads, etc.). BERT-PKD goes further and distils the hidden state activations, layer-by-layer, from BERT-Base (and in one experiment, BERT-Large) into 3 and 6 encoder block versions of BERT-Base. The authors of BERT-PKD show that patient knowledge distillation generally outperforms last-layer distillation. Finally, TinyBERT is a 4 encoder block version of BERT-Base that goes even further and includes distillation objectives for the attention scores and the embedding layer. In addition, TinyBERT shrinks the hidden dimension sizes of BERT-Base to produce a much smaller and much faster model than either DistilBERT or BERT-PKD. The accuracy on SQuAD v2.0 attained by TinyBERT also exceeds that attained by DistilBERT and BERT-PKD, for a given number of encoder blocks, so for our work we generally benchmark against TinyBERT. We take inspiration from TinyBERT and perform a version of their layer-by-layer patient knowledge distillation. However, unlike TinyBERT, we also included some architectural changes that necessitated changes to the knowledge distillation procedure. Furthermore, we did not reproduce the language modeling pre-training used by BERT and all these previously mentioned distillations. Instead, we relied purely on data augmentation to achieve state-of-the-art results.

III Methodology

i Problem Statement

For this work, we are working specifically in the framework of question-answering as posed by the SQuAD v2.0 dataset. To perform question-answering, we are given a question and a context and we must find the concrete answer to the question. More specifically, SQuAD v2.0[11] is an extractive, text-based, question-answering dataset which tasks us to find spans of text within the context which answer the given question or, if the question is unanswerable given the context, to return a null result. Formally, given a tuple $(Q, C)$, where $Q$ is the question and $C$ is the context consisting of a passage of text, find $A$, which is a span in $C$ that answers the question, or return a null result if $Q$ is not answerable given $C$. For training, we are given tuples of training examples $(Q, C, A)$, while for evaluation, we are given the tuple $(Q, C)$ and a set of ground truth answers $\{A_{gt}\}$ that are variants of the answer for the given question.

We use two primary metrics to evaluate our model for accuracy: the EM and F1 metrics. The EM score is the number of examples for which our model’s answer, $A$, exactly matched any of the ground truth answers in $\{A_{gt}\}$, divided by the total number of evaluation examples. The F1 score averages the maximum overlap between $A$ and the answers in $\{A_{gt}\}$ over all the evaluation examples. These are the same two metrics used by the SQuAD v2.0 dataset itself.
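The two metrics can be sketched in a few lines of Python. This is a simplified illustration and not the official SQuAD evaluation script: the `normalize` helper here only lowercases and splits on whitespace (the official script also strips punctuation and articles), and all function names are our own.

```python
from collections import Counter


def normalize(text):
    """Simplified normalization: lowercase and split on whitespace."""
    return text.lower().split()


def exact_match(prediction, ground_truths):
    """1.0 if the prediction equals any ground truth answer, else 0.0."""
    return float(any(normalize(prediction) == normalize(gt) for gt in ground_truths))


def f1(prediction, ground_truth):
    """Token-overlap F1 between one prediction and one ground truth answer."""
    pred, gold = normalize(prediction), normalize(ground_truth)
    if not pred or not gold:
        # A null answer only matches a null ground truth.
        return float(pred == gold)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def evaluate(predictions, references):
    """Return (EM, F1) in percent; references[i] is the list of answers for example i."""
    n = len(predictions)
    em = sum(exact_match(p, r) for p, r in zip(predictions, references)) / n
    f1_score = sum(max(f1(p, gt) for gt in r) for p, r in zip(predictions, references)) / n
    return 100.0 * em, 100.0 * f1_score
```

Unanswerable questions are represented here by an empty string on both sides, so a model that correctly returns a null result receives full credit on both metrics.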

ii Model Architecture

Figure 1: A visualization of WaLDORf’s architecture. There are 8 transformer encoder blocks sandwiched between 1D convolutional layers along the sequence length direction. All convolutional layers use padding to preserve sequence length. Conv_1 and Conv_2 layers are paired with maxpooling to reduce sequence length by a total factor of 4. Conv_3 and Conv_4 layers are paired with up sampling to restore the sequence length to its original size. Skip connections are used after up sampling layers to skip past convolutional layers but before layer normalization.

WaLDORf is a hybrid model that incorporates convolutional layers into the transformer encoder stack architecture of BERT[2]; see figure 1 for an illustration. Taking inspiration from ALBERT, we chose the embedding dimension to be smaller than the internal hidden dimension used in the encoder blocks[22]. We use the same word piece tokenization scheme as BERT (which is itself a sub word tokenization scheme similar to Byte Pair Encoding[26]) in order to allow distillation of the embedding layer as described in section III(iii). Before feeding data into the 8 encoder blocks, we use a time-distributed feed-forward layer to expand the dimensions of the hidden representation to equal that of the encoder blocks. The major architectural insight we bring forth with WaLDORf is to use convolutional layers to shrink and then re-expand the input sequence dimension. The self-attention mechanism employed by the transformer encoder blocks has complexity which is quadratic in the input sequence length[12]. By using two 1D convolutions (along the sequence length dimension) coupled with two max-pooling layers, we shrink the input sequence lengths to the encoder blocks by a factor of 4. In turn, this means each self-attention mechanism needs to do only 1/16th the amount of work. We then use two more 1D convolutions (since there are no transposed convolutions in 1 dimension) coupled with two up sampling layers to re-expand the sequence length dimension back to its original length. By greatly reducing the amount of work carried out by the self-attention mechanism, we are able to make the encoder blocks run much faster so that we can have a deeper model without sacrificing speed. The trade-off is, of course, that the model must learn to encode the input information in a "smeared out" way, with roughly every 4 word-pieces getting 1 internal representation. This trade-off may affect different tasks differently depending on how sensitive a certain task is to specific word-pieces. However, we found that even for the SQuAD v2.0 task, which intuitively should be relatively sensitive to specific words or word-pieces, since the task is specifically about finding spans within the text, we were still able to achieve good accuracy with our model.
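The effect of the shrink-and-expand step on sequence length, and on the size of each attention score matrix, can be illustrated with a small sketch. It only tracks shapes: the convolutions are omitted because they use padding and leave the sequence length unchanged, a pool size of 2 per max-pooling layer is assumed (two layers giving the total factor of 4), and nearest-neighbour repetition is assumed for up sampling. Function names are illustrative.

```python
def max_pool_1d(seq, size=2):
    """Non-overlapping max pooling along the sequence axis (stride == size)."""
    return [max(seq[i:i + size]) for i in range(0, len(seq) - size + 1, size)]


def up_sample_1d(seq, factor=2):
    """Nearest-neighbour up sampling: repeat every position `factor` times."""
    return [x for x in seq for _ in range(factor)]


def attention_score_entries(seq_len):
    """Number of entries in one head's seq_len x seq_len attention score matrix."""
    return seq_len * seq_len


full_len = 384                       # maximum sequence length used for SQuAD
tokens = list(range(full_len))       # stand-in for a single feature channel

# Two pooling stages shrink the sequence by a total factor of 4.
shrunk = max_pool_1d(max_pool_1d(tokens))

# Two up sampling stages restore the original length.
restored = up_sample_1d(up_sample_1d(shrunk))

# Self-attention cost scales with the square of the sequence length.
work_ratio = attention_score_entries(len(shrunk)) / attention_score_entries(full_len)
```

With a length of 384, the encoder blocks see 96 positions, and each attention score matrix has 1/16th as many entries as it would at full length.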

The 8 encoder blocks used in our model are the standard transformer encoder blocks used by BERT which have been scaled down to increase inference speed. A detailed list of architecture hyperparameters can be seen in table 1. The self-attention mechanisms in our model use 16 heads of attention like BERT-Large instead of 12 heads of attention like BERT-Base because we chose to use BERT-Large as our teacher model. In total, our model has a bit less than a quarter of the number of parameters of BERT-Base.

Architecture Hyperparameters

# Encoder Blocks
# Encoder Convolutional Layers
# Decoder Convolutional Layers
Embedding Size
Hidden Size
Feed-Forward Size
# Attention Heads
# Total Parameters
Table 1: hyperparameters for our model’s architecture

iii Training Objectives and Procedure

To train our model, we used a modified version of the patient knowledge distillation procedure. We have 5 main training losses:

  • The distillation loss over the embedding layer: $\mathcal{L}_{emb}$.

  • The distillation loss over the hidden state representations inside the 8 encoder blocks: $\mathcal{L}_{hid}$.

  • The distillation loss over the attention scores used to calculate the attention weights inside the 8 self-attention mechanisms: $\mathcal{L}_{att}$.

  • The distillation loss over the final softmax layer which corresponds to the vanilla model distillation loss introduced by Hinton et al.: $\mathcal{L}_{soft}$.

  • The ground truth loss (for data for which we have the ground truth): $\mathcal{L}_{gt}$.

The student model (our model) learns from a teacher model which we chose to be BERT-Large Whole-Word-Masking (BERT-Large WWM), a variant of BERT-Large that was trained using a masked token objective where the masked tokens were ensured to be over whole words instead of word pieces as in the original. The first 4 losses mentioned, $\mathcal{L}_{emb}$, $\mathcal{L}_{hid}$, $\mathcal{L}_{att}$, and $\mathcal{L}_{soft}$, are all losses for the student model with respect to the teacher model. The final loss, $\mathcal{L}_{gt}$, is the loss with respect to the ground truth and would be the only loss available to us if we did not perform model distillation. As we will show in section IV(iii), if we use only $\mathcal{L}_{gt}$ to try to train our model, our results are extremely poor.

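The embedding loss, $\mathcal{L}_{emb}$, is a simple mean squared error loss that measures how different the embedding representations of the student model are from those of the teacher model. Given $E^S_i \in \mathbb{R}^{l \times d^S_e}$, where $E^S_i$ are the student model embeddings for the $i^{th}$ text sequence, $l$ is the input sequence length, and $d^S_e$ is the embedding size of the student model, and $E^T_i \in \mathbb{R}^{l \times d^T_e}$, where $E^T_i$ are the teacher model embeddings for the $i^{th}$ text sequence, and $d^T_e$ is the teacher model embedding size, then:

$$\mathcal{L}_{emb} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{MSE}\left(E^S_i W_e,\; E^T_i\right)$$

where MSE is the mean-squared error, $N$ is the batch size, and $W_e \in \mathbb{R}^{d^S_e \times d^T_e}$ is a learned tensor that projects the student embeddings to the same dimension size as the teacher embeddings.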

Both the hidden loss, $\mathcal{L}_{hid}$, and the attention score loss, $\mathcal{L}_{att}$, are mean squared error losses that try to match the hidden states and attention scores within the 8 encoder blocks of the student model to those of 8 encoder blocks of the teacher model. We choose to follow the PKD-skip scheme[13] and use every 3rd encoder block in the teacher model for distillation. Just like for the embedding loss, we also have to project the student model hidden states to the same dimension size as the teacher model’s. In addition, we also use an average pooling over the teacher model’s hidden states along the sequence length dimension to shrink the sequence length of the teacher model to match the sequence length inside the student model’s encoder blocks. Because WaLDORf uses convolutions to shrink the sequence length used in the encoder blocks, we can’t match hidden states, word for word, between the two models. If we think of hidden states analogously to word vectors, then we are essentially asking WaLDORf to approximate the average of 4 word vectors with each of its hidden states. This procedure is perhaps the simplest way to make the dimensions of our student model align with the targets given by the teacher model. Empirically, this procedure gives us good results. Thus, given $H^S_{i,j} \in \mathbb{R}^{\frac{l}{4} \times d^S_h}$, where $H^S_{i,j}$ are the hidden state outputs for the $i^{th}$ text sequence of the $j^{th}$ encoder block of the student model, and $d^S_h$ is the hidden dimension size of the student model, and $H^T_{i,j} \in \mathbb{R}^{l \times d^T_h}$, where $H^T_{i,j}$ are the hidden state outputs for the $i^{th}$ text sequence of the teacher encoder block matched to the $j^{th}$ student block, and $d^T_h$ is the hidden dimension size of the teacher model, then:

$$\mathcal{L}^{j}_{hid} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{MSE}\left(H^S_{i,j} W_h,\; \mathrm{AVG}\left(H^T_{i,j}\right)\right)$$

where AVG denotes a 1-D average pooling with size 4 and stride 4 along the sequence length dimension, $\mathcal{L}^{j}_{hid}$ is the hidden loss of the $j^{th}$ encoder block, and $W_h \in \mathbb{R}^{d^S_h \times d^T_h}$ is a learned matrix that projects student model hidden dimension sizes to those of the teacher model. Note that we use the same $W_h$ for every encoder block because we want the distilled learning to be learned by WaLDORf itself and not by these projection matrices, which will be discarded at inference time.
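As a rough illustration of the hidden-state alignment, the following plain-Python sketch average-pools a teacher hidden-state sequence with size 4 and stride 4, projects the student hidden states with a matrix, and takes the mean squared error between the two. The toy dimensions and the function names are ours, chosen only to keep the example readable; a real implementation would use batched tensor operations.

```python
def avg_pool_seq(hidden, size=4):
    """Average groups of `size` consecutive positions (stride == size).

    `hidden` is a list of per-position feature vectors (lists of floats).
    """
    pooled = []
    for start in range(0, len(hidden) - size + 1, size):
        window = hidden[start:start + size]
        pooled.append([sum(col) / size for col in zip(*window)])
    return pooled


def project(hidden, weight):
    """Right-multiply each position vector by a (d_student x d_teacher) matrix."""
    return [[sum(h * w for h, w in zip(row, col)) for col in zip(*weight)]
            for row in hidden]


def mse(a, b):
    """Mean squared error over all entries of two equally shaped matrices."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def hidden_loss(student_hidden, teacher_hidden, weight):
    """Hidden loss for one sequence and one pair of matched encoder blocks."""
    return mse(project(student_hidden, weight), avg_pool_seq(teacher_hidden))


# Toy example: teacher has 8 positions of width 2, student has 2 positions of width 1.
teacher = [[1.0, 0.0], [3.0, 0.0], [5.0, 0.0], [7.0, 0.0],
           [2.0, 2.0], [2.0, 2.0], [2.0, 2.0], [2.0, 2.0]]
student = [[2.0], [1.0]]
w_h = [[2.0, 0.0]]  # hypothetical 1 x 2 projection matrix
loss = hidden_loss(student, teacher, w_h)
```

Each student position is compared against the mean of the four teacher positions it covers, which is the "average of 4 word vectors" target described above.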

For the attention score loss, instead of using average pooling to make sequence dimensions match, we switch to using max pooling. We decided to use max pooling here because we want the student model to learn the strongest attention scores rather than an average of attention scores. Hence, given $\Phi^S_{i,j} \in \mathbb{R}^{h \times \frac{l}{4} \times \frac{l}{4}}$, where $\Phi^S_{i,j}$ are the student model attention scores for the $i^{th}$ text sequence of the $j^{th}$ encoder block and $h$ is the number of attention heads, and $\Phi^T_{i,j} \in \mathbb{R}^{h \times l \times l}$, where $\Phi^T_{i,j}$ are the teacher model attention scores for the $i^{th}$ text sequence of the matched teacher encoder block, then:

$$\mathcal{L}^{j}_{att} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{MSE}\left(\Phi^S_{i,j},\; \mathrm{MAX}\left(\Phi^T_{i,j}\right)\right)$$

where MAX denotes a 2-D max pooling operation with size $4 \times 4$ and stride $4 \times 4$, and $\mathcal{L}^{j}_{att}$ is the attention score loss of the $j^{th}$ encoder block. Note that here we are operating on attention scores and not the attention weights. The attention scores have not had a softmax applied to them and so do not necessarily add up to 1, hence we are free to perform a simple max pooling without destroying any normalization since the softmax will be applied after. One detail that should be noted is that proper masking needs to be maintained on the attention score loss so that the system does not try to learn minor changes to the masking value.
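The max pooling of teacher attention scores and the masking detail can be sketched as follows for a single attention head. This is an illustration under assumptions of our own: the mask value of -10000.0 and the rule of skipping any pooled entry that is still equal to the mask value are hypothetical choices, not details given in the text.

```python
MASK_VALUE = -10000.0  # hypothetical large negative score assigned to padded positions


def max_pool_2d(scores, size=4):
    """2-D max pooling over a square score matrix with stride equal to `size`."""
    n = len(scores) // size
    return [[max(scores[r * size + dr][c * size + dc]
                 for dr in range(size) for dc in range(size))
             for c in range(n)]
            for r in range(n)]


def masked_mse(student_scores, teacher_pooled):
    """MSE over entries whose pooled teacher score is not the mask value."""
    total, count = 0.0, 0
    for s_row, t_row in zip(student_scores, teacher_pooled):
        for s, t in zip(s_row, t_row):
            if t > MASK_VALUE:  # skip windows that contain only masked positions
                total += (s - t) ** 2
                count += 1
    return total / count if count else 0.0


# Toy teacher scores for one head: 8 x 8, with the last 4 key positions padded.
teacher_scores = [[float(r * 8 + c) if c < 4 else MASK_VALUE for c in range(8)]
                  for r in range(8)]
pooled = max_pool_2d(teacher_scores)

# Toy student scores at the shrunken length (2 x 2).
student_scores = [[26.0, 0.0], [59.0, 5.0]]
loss = masked_mse(student_scores, pooled)
```

Because pooling happens before the softmax, taking the maximum of each 4 x 4 window does not disturb any normalization, and excluding fully masked windows keeps the student from being trained to reproduce the mask value itself.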

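The distillation loss over the final softmax layer follows closely the distillation loss introduced in Hinton et al.[25]:

$$\mathcal{L}_{soft} = \frac{\tau^2}{N} \sum_{i=1}^{N} \mathrm{CE}\left(\sigma\left(z^T_i / \tau\right),\; \sigma\left(z^S_i / \tau\right)\right)$$

where CE is the cross entropy, $\sigma$ is the softmax function, $\tau$ is a temperature hyperparameter, and $z^T_i$ and $z^S_i$ are teacher and student logits for the $i^{th}$ text sequence respectively. We multiply the loss by the factor of $\tau^2$ to ensure that the gradients are of the correct scale.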
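A minimal plain-Python sketch of a temperature-scaled distillation loss in the style of Hinton et al. is given below. It handles a single set of logits per example (in the question-answering setting it would be applied separately to the start and end position logits), and the function names are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs):
    """Cross entropy between a target distribution and a predicted distribution."""
    return -sum(t * math.log(p)
                for t, p in zip(target_probs, predicted_probs) if t > 0.0)


def soft_distillation_loss(teacher_logits, student_logits, temperature):
    """Batch-averaged cross entropy of softened distributions, scaled by temperature squared."""
    per_example = [
        cross_entropy(softmax(t, temperature), softmax(s, temperature))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return temperature ** 2 * sum(per_example) / len(per_example)
```

Raising the temperature flattens both distributions, so the student also receives signal from the positions the teacher considers unlikely; the squared-temperature factor compensates for the smaller gradients that the softened distributions produce.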

Lastly, the ground truth loss is the standard cross entropy loss with the ground truth labels:

$$\mathcal{L}_{gt} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{CE}\left(y_i,\; \sigma\left(z^S_i\right)\right)$$

where CE is the cross entropy, $\sigma$ is the softmax function, $z^S_i$ are the student logits, and $y_i$ is the ground truth label for the $i^{th}$ text sequence. This ground truth loss is only available for data for which we have the ground truth labels. When we perform data augmentation, as discussed in section III(iv), we will no longer have access to ground truth labels and so this loss will not apply. Note that for both the ground truth loss and the distillation loss over the final softmax, there are two sets of logits (and labels), one for the start position and one for the end position. The softmax functions are taken over the sequence length dimension.

To train our model, we found it was very beneficial to train each layer from the bottom up (or left to right in figure 1) so that we start with training only using the embedding loss, then combine that loss with the hidden state and attention score losses for layer 0, then for layer 1, etc. After each intermediate layer is trained for some time, we then add the final softmax layer distillation loss and the ground truth loss. This training procedure makes sense intuitively because it should reduce the internal covariate shift experienced by each layer during training (the same goal as for batch normalization). Since we have targets available for each layer in the student model, with the exception of the convolutional layers, we can use those intermediate targets to ensure that the inputs to each layer have stabilized somewhat by the time that layer is trained. Thus, our final loss function is:


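$$\mathcal{L} = \alpha\,\mathcal{L}_{emb} + \beta\,\mathcal{L}_{hid} + \gamma\,\mathcal{L}_{att} + \delta\,\mathcal{L}_{soft} + \epsilon\,\mathcal{L}_{gt}$$

where $\alpha$, $\beta$, $\gamma$, $\delta$, and $\epsilon$ are real number hyperparameters and:

$$\mathcal{L}_{hid} = \sum_{j=0}^{7} \Theta\left(s - (j+1)\,s_l\right)\mathcal{L}^{j}_{hid}, \qquad \mathcal{L}_{att} = \sum_{j=0}^{7} \Theta\left(s - (j+1)\,s_l\right)\mathcal{L}^{j}_{att}$$

where $\Theta$ is the Heaviside step function, $s$ is the current global training step and $s_l$ is an integer hyperparameter specifying how many training steps to take for each layer. The softmax distillation loss and the ground truth loss are added only after the last encoder block has been trained for its $s_l$ steps. We will show in our ablation study in section IV(iii) that this layer-by-layer building up of the loss function significantly improves accuracy.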
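The layer-by-layer build-up can be sketched as a set of step-dependent gates on the loss terms. The sketch below assumes nine equal stages (the embedding loss alone, then one stage per encoder block) before the softmax and ground truth losses are switched on; that exact timing, the grouping of each block's hidden and attention losses under one gate, and all names are our assumptions for illustration.

```python
def heaviside(x):
    """Heaviside step function with the convention heaviside(0) == 1."""
    return 1.0 if x >= 0 else 0.0


def active_loss_terms(step, steps_per_layer, num_blocks=8):
    """Return a gate (0.0 or 1.0) for every loss term at a given global step."""
    gates = {"emb": 1.0}  # the embedding loss is trained from the very first step
    for j in range(num_blocks):
        # Block j (hidden + attention score losses) switches on after j + 1 stages.
        gates[f"block_{j}"] = heaviside(step - (j + 1) * steps_per_layer)
    # Assumed: softmax and ground truth losses start once every block has had a stage.
    gates["soft_and_gt"] = heaviside(step - (num_blocks + 1) * steps_per_layer)
    return gates


def total_loss(loss_values, gates, weights):
    """Weighted sum of the loss terms that are currently switched on."""
    return sum(weights[name] * gates[name] * loss_values[name] for name in gates)


# Toy usage with a hypothetical 100 steps per stage.
gates_at_250 = active_loss_terms(step=250, steps_per_layer=100)
ones = {name: 1.0 for name in gates_at_250}
loss_at_250 = total_loss(ones, gates_at_250, ones)
```

Under this nine-stage assumption, the 517,500 build-up steps reported in section III(vi) would correspond to 57,500 steps per stage; this is an inference from the reported totals rather than a value stated in the text.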

iv Data Augmentation

A large part of why VLLMs perform so well on downstream tasks is the massive amount of data that they pretrain on[3, 5, 7]. For distilled architectures, such as DistilBERT or BERT-PKD, which keep hidden dimensions all the same size as a teacher model (e.g. BERT-Base) and just use fewer encoder blocks, the student encoder blocks can be initialized with the weights from a subset of encoder blocks of a trained teacher model. For distilled architectures, such as TinyBERT or our model, which change the hidden dimension sizes as well as the number of encoder blocks, there is no obvious way to initialize the weights of the model with weights from the teacher model. Hence, distilled architectures also often reproduce the pretraining phase of BERT before being fine-tuned (with or without additional distillation) on downstream tasks. Here, we are focused on task-specific distillation and so we do not wish to reproduce the language modeling pretraining phase. However, not reproducing the pretraining phase would put our model at a huge disadvantage simply due to the much smaller coverage of input data that our model would be trained on. Hence, to help balance out the data volume a bit, we chose to perform extensive data augmentation. It should be noted though that even with data augmentation, the final volume of data, as measured by total token count, that we train on is still much smaller than the pretraining data used to train BERT.

To perform data augmentation, we sourced 500,000 additional paragraphs, between 200 and 320 words long, from 280,250 English Wikipedia articles. For each paragraph, we automatically generated a question using the recently released Text-to-Text Transfer Transformer, T5-Large[7], which we fine-tuned on question-context pairs from the SQuAD v2.0 dataset itself. Visual inspection of the produced questions showed that the automatically generated questions were of very high quality. Our empirical experiments will also show (in section IV) that this data augmentation increased the end-to-end accuracy by a large margin. For comparison, the SQuAD v2.0 training dataset has on the order of 130,000 questions on 20,000 paragraphs sourced from 500 articles.
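The paragraph selection step can be sketched as a simple word-count filter followed by question generation. In this sketch `generate_question` is a hypothetical callable standing in for the fine-tuned question generator; it is not a real library function, and counting words by whitespace splitting is our simplification.

```python
def select_paragraphs(paragraphs, min_words=200, max_words=320):
    """Keep paragraphs whose whitespace-delimited word count is within the range."""
    return [p for p in paragraphs if min_words <= len(p.split()) <= max_words]


def build_augmented_examples(paragraphs, generate_question):
    """Pair each kept paragraph with a generated question.

    `generate_question` is a hypothetical callable (context -> question string).
    No answer label is attached, since augmented data has no ground truth.
    """
    return [{"context": p, "question": generate_question(p)}
            for p in select_paragraphs(paragraphs)]
```

Because the augmented examples carry no ground truth answer, only the distillation losses against the teacher model apply to them during training.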

v Inference Speed Testing

To perform inference speed testing, we set up an NVIDIA V100 GPU running TensorFlow 1.14 and CUDA 10. We must point out here that inference speed and relative speed up are quite sensitive to the exact testing conditions. For example, changing the input sequence length or prediction batch size may change both the absolute inference speed and the relative speed up one architecture gets over another. Inference speed and relative speed up are therefore task dependent and are not universal. For these reasons, the relative speed up reported by us may not be entirely consistent with the relative speed up reported by others. In particular, while TinyBERT found 9.4x speed up over BERT-Base for their particular testing environment, which used a batch size of 128 on the QNLI training set and a maximum sequence length of 128 on an NVIDIA K80 GPU, within our testing environment, TinyBERT showed only 7.4x speed up over BERT-Base. For our test environment, we chose to use a maximum sequence length of 384 and a batch size of 32. We ran inference multiple times on the SQuAD v2.0 dev set and averaged the run-times to average out any network or memory caching dependent performance issues.
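A timing procedure of this kind can be sketched as below. Here `predict_fn` is a hypothetical callable that runs one batch through a model and blocks until the result is available; the names and the number of trials are illustrative only.

```python
import time


def average_inference_time(predict_fn, batches, num_trials=5):
    """Average wall-clock time for one full pass over `batches`, over several trials."""
    timings = []
    for _ in range(num_trials):
        start = time.perf_counter()
        for batch in batches:
            predict_fn(batch)  # assumed to block until the prediction is complete
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)


def relative_speed_up(baseline_seconds, model_seconds):
    """Speed up of a model relative to a baseline measured under the same conditions."""
    return baseline_seconds / model_seconds
```

Both timings must come from the same hardware, batch size, and maximum sequence length for the ratio to be meaningful, which is why speed-up figures measured under different conditions are not directly comparable.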

vi Training Hyperparameters

Training Hyperparameters
Init Learning Rate
Batch Size
Adam Epsilon
Init Temperature
Table 2: hyperparameters for training WaLDORf

We present the hyperparameters used to train WaLDORf in table 2. These are the hyperparameters we found gave us the best results. The initial learning rate was kept constant over the "building up" phase of training where layers and encoder blocks are slowly being added to the training objective. Once the model reached the final phase of training and all of the loss functions were in play, the learning rate was decayed linearly to 0 by the end of training. The initial temperature was also decayed linearly to 1 for the final phase of training. For our choice of data volume, batch size, and steps per layer, 35 epochs of training corresponds roughly to 1 million global steps with 517,500 of those steps being spent during the build-up phase of training.

vii Hardware

All experiments were performed using a single Google v3 TPU running TensorFlow 1.14. Data parallelization among the 8 TPU cores was handled by the TPUEstimator API.

IV Results

Overall, WaLDORf achieved significant speed up over previous state-of-the-art distilled architectures while maintaining a higher level of accuracy on SQuAD v2.0.

i Inference Speed

Inference Speed Results
Model Speed-up
BERT-Base 1x
BERT-Large 0.3x
BERT-6 2x
BERT-4 2.9x
BERT-2 5.8x
TinyBERT 7.4x
Ours 9.1x
Table 3: Results obtained from testing for inference speed. Speed-up is measured relative to BERT-Base and is based on the time to perform inference once over the SQuAD v2.0 cross validation set, averaged over many trials. See section III(v) for details on how the testing was conducted.

Detailed results of our inference speed testing are presented in table 3. Our model is the fastest to perform inference over the SQuAD v2.0 cross validation set by a wide margin. Because of our use of convolutional layers, we were able to have a model which is deeper and has larger hidden dimensions while still being faster than TinyBERT. DistilBERT and BERT-PKD would both fall under the BERT-6 umbrella since in terms of architecture both are simply 6 encoder block versions of BERT-Base. BERT-PKD also has a 3 encoder block version, but we can see that our model is much faster than even a 2 encoder block version of BERT-Base.

ii Accuracy Performance

SQuAD v2.0 Results
Model EM F1
Our Model 66.0 70.3
Table 4: WaLDORf’s accuracy performance on SQuAD v2.0 as compared with TinyBERT, BERT-PKD, and DistilBERT. BERT-PKD, DistilBERT and TinyBERT results are taken from the TinyBERT paper. TinyBERT-6 has scaled up internal dimension sizes to match those of BERT-Base. BERT-Base and BERT-Large-WWM results are from models that we fine-tuned. We used BERT-Large-WWM as our teacher model.

Our model’s performance on SQuAD v2.0 is benchmarked against other models in table 4. The BERT-PKD-4 and DistilBERT-4 models presented have internal dimensions identical with BERT-4 from table 3. Thus, our model would be 3.1x faster than those models for inference while being significantly more accurate. WaLDORf is also somewhat more accurate than TinyBERT while still being 1.24x faster. To obtain EM and F1 scores higher than ours, one would have to go to 6-layer versions of those models, compared to which our model would be 4.5x faster. TinyBERT-6 has internal dimension sizes scaled to be the same as BERT-Base as well and so would also be 4.5x slower than WaLDORf.

iii Ablation Study

Ablation Study
EM F1 +EM +F1
+ Softmax Distil
+ All Layer Distil
+ Slow Build 11.8 12.5
+ Data Augment 66.0 70.3
Table 5: Ablation study for WaLDORf. The baseline model is trained on just ground truth labels. "Softmax Distil" is, in addition, trained on the softmax probabilities provided by the teacher model. "All Layer Distil" adds on the embedding layer, hidden state, and attention score losses as described in section III(iii). "Slow Build" builds each layer of the student model one-by-one. "Data Augment", as described in III(iv), was the final technique we added and gave us our final results. +EM and +F1 denote improvements over the previous best model.

Results from our ablation study are provided in table 5. Each additional technique shown in the table was added on to the sum of the previously used techniques. Each technique which we added produced significant improvements. With the addition of each new technique, we also explored a whole set of hyperparameters. Results are presented for the best set of hyperparameters found within a given framework. When all the techniques are combined, we get a very significant absolute improvement in both EM score and F1 score over the baseline.

The biggest incremental improvement in accuracy that we saw was in moving from trying to distill every layer all at once to slowly building up the layers from the bottom embedding layer up. This is likely because training every layer all at once introduces a significant covariate shift for every layer’s inputs during training. Stabilizing the inputs to each layer by training layers in a bottom-up fashion appears to make optimization much easier.

V Discussion

i Convolutions and Autoencoding

The core architectural change that we bring forth in this work is the use of convolutions, max pooling, and up sampling to reduce and then re-expand the sequence length dimension. We introduced this architectural change with the hope that, analogous to convolutional autoencoders, WaLDORf will find an efficient way to compress the information contained within an example into a shorter sequence. By attaining accuracy performance higher than previous state-of-the-art, we have shown that it is in fact feasible for the model to compress information in this way. Information can be effectively encoded by the input convolutions, processed by the encoder blocks, and then decoded by the output convolutions. In addition, we also showed that by using max pooling and average pooling we are able to use only the signals given by the "important" attention weights and the averaged hidden states from the teacher model to make our student model perform strongly. These properties may be somewhat task dependent, but even for the SQuAD v2.0 task we were able to attain a high level of accuracy.

Due to the quadratic dependence of the complexity of the self-attention mechanism on the sequence length, our architecture will actually see larger relative gains in speed the longer the input sequence length is. We chose a sequence length of 384 for our tests because that is the sequence length used by BERT for the SQuAD data set. Other transformer-based VLLMs may use even longer sequence lengths in order to deal better with longer paragraphs, e.g. XLNet uses a max sequence length of 512. For circumstances where the max input sequence length is even longer than 384, we expect to gain even larger relative speed ups.
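The scaling argument can be made concrete with a rough operation count. The sketch below is a simplified cost model, not a measurement: it counts only multiply-accumulates in one encoder block, ignores the convolution, pooling, and up-sampling overhead, and assumes a hypothetical sequence reduction factor of 4 with BERT-Base-like dimensions.

```python
def layer_cost(seq_len, d_model=768, d_ff=3072):
    """Approximate multiply-accumulate count for one encoder block."""
    attention = 2 * seq_len**2 * d_model        # QK^T scores and weighted sum of V
    projections = 4 * seq_len * d_model**2      # Q, K, V and output projections
    feed_forward = 2 * seq_len * d_model * d_ff  # two feed-forward matrices
    return attention + projections + feed_forward


def relative_speedup(seq_len, reduction=4):
    """Cost at full length divided by cost at the reduced length."""
    return layer_cost(seq_len) / layer_cost(seq_len // reduction)


speedup_384 = relative_speedup(384)
speedup_512 = relative_speedup(512)
```

In this model the ratio always lies between `reduction` (when the terms linear in sequence length dominate) and `reduction**2` (when the quadratic attention term dominates), and it grows with the input length, which is the trend described above.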

ii Distillation

Model distillation is currently a very active area of research. Having access to a teacher model opens up the possibility of using a huge variety of techniques to improve a student model’s performance. The distillation techniques available often depend on the architecture of the student model and how similar it is to the teacher model. A model like BERT-PKD or DistilBERT, which is essentially a copy of a teacher model with fewer layers, can straightforwardly distil a subset of the layers in the teacher model. If one scales down the internal dimensions of the model, as in the case of TinyBERT, then one needs to introduce some way to project the student model’s hidden dimensions to those of the teacher model or vice versa. That projection operator introduces new degrees of freedom to the problem, and the distillation procedure is no longer as straightforward. In our case, we also had to modify the sequence-length dimension of the teacher model’s signals in order to perform intermediate-layer distillation.
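As an illustration of the projection step, the following sketch computes a mean-squared-error loss between projected student hidden states and teacher hidden states. It is a generic example under stated assumptions: the projection is a single matrix (in practice learned jointly with the student), the teacher signal has already been brought to the student's sequence length, and the function name is hypothetical.

```python
import numpy as np


def hidden_distillation_loss(student_hidden, teacher_hidden, projection):
    """MSE between projected student states and teacher states.

    student_hidden: (seq_len, d_student)
    teacher_hidden: (seq_len, d_teacher), already matched in sequence length
    projection:     (d_student, d_teacher), the extra degrees of freedom
    """
    projected = student_hidden @ projection
    return float(np.mean((projected - teacher_hidden) ** 2))
```

Because the projection matrix is itself trainable, the loss can be lowered by changing either the student's representations or the projection, which is why the procedure is less direct than copying a subset of teacher layers.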

In the case that the student model is entirely different from the teacher model, e.g. when an LSTM-based student model tries to learn from a transformer-based teacher model as in Tang et al.[15], then perhaps the only easily distilled layer is the final layer, and no intermediate-layer distillation is feasible. However, we (and others) have shown that there are significant performance improvements to be had by also distilling the intermediate layers and operations (e.g. attention scores) of a VLLM. It is uncertain, at this point, how much of the performance improvement is due to the difference in the raw information contained within the training signals (softmax signals vs intermediate layers and operations signals) and how much is due to a difference in how much those signals help the student model to optimize. The concrete difference between these scenarios is that, if the latter is true, then student models with architectures wildly different from a teacher model may still attain a high level of accuracy if the last-layer distillation is carried out with a very strong optimization scheme. If the former is true, then we may expect that there is no method by which we can optimize a student model using only the signals from the last layer as strongly as we could by using signals from all the layers.
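For reference, last-layer distillation on softmax signals is commonly written as a cross-entropy between the teacher's and the student's output distributions. The sketch below is one generic formulation; the temperature parameter and the function names are assumptions for illustration and are not taken from our experiments.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def soft_target_loss(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy of the student distribution against teacher soft targets."""
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature))
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())
```

This loss only sees the final output distribution, so it applies to any student architecture, whereas intermediate-layer losses require some correspondence between student and teacher layers.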

iii Data Augmentation

As we showed in section IV(iii), data augmentation gives us a significant 8-point boost in EM and F1 scores over distillation using just the SQuAD v2.0 dataset itself. This result shows that even with roughly 130,000 question-context-answer tuples, SQuAD v2.0 is still not a large enough dataset to fully train our student model using pure model distillation. The set of targets that the student model had to hit, given only SQuAD v2.0 inputs, was already enormously large: the embedding-layer targets, the hidden-layer targets, and the attention-score targets each scale with the total number of examples. For a BERT-Large-WWM teacher model over the SQuAD v2.0 dataset, these tensors would comprise roughly 1.2 Terabytes of data. The sheer size of this data makes for a very rich set of signals given to the student model. However, given the success of our data augmentation experiments, the input space provided by SQuAD v2.0 itself did not appear rich enough for the student model to fully exploit those signals.

A huge benefit of performing model distillation and student-teacher learning is the ease with which "labeled" data can be obtained from unlabeled data. Data augmentation for us was simplified further by the introduction of the powerful T5 model, which we fine-tuned to automatically generate questions given a passage. But, in the absence of such a model, gathering a huge amount of unlabeled data in the context of a production environment is often quite feasible. In the case of question-answering, for example, we may be able to mine user-generated questions and run those through a teacher model to provide data for the student model to train on. For task-specific knowledge distillation, we have shown that purely performing data augmentation is a viable alternative to reproducing the language modeling pre-training phase.
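The augmentation loop described above can be sketched in a few lines. Both callables are hypothetical placeholders: `generate_question` stands in for a question generator (such as a fine-tuned sequence-to-sequence model or a source of mined user questions), and `teacher_predict` stands in for whatever teacher signals are recorded as targets.

```python
def build_distillation_set(passages, generate_question, teacher_predict):
    """Turn unlabeled passages into (question, passage, target) examples.

    generate_question: callable passage -> question (hypothetical)
    teacher_predict:   callable (question, passage) -> teacher signals (hypothetical)
    No human labels are used; the teacher output is the training target.
    """
    examples = []
    for passage in passages:
        question = generate_question(passage)
        target = teacher_predict(question, passage)
        examples.append((question, passage, target))
    return examples
```

The student is then trained to reproduce `target` for each generated pair, so the amount of training data is limited only by the supply of unlabeled passages and the cost of running the teacher.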

VI Conclusion and Further Work

In this work, we have shown a set of techniques that together successfully advance the state-of-the-art in accuracy and speed of distilled language models for the SQuAD v2.0 task. There are many promising directions for future work within this field. Of course, the main considerations would be in pushing the limits of model distillation itself. Perhaps by cleverly using adapter layers of some kind, it will become possible in the future to distil the knowledge contained within intermediate layers of a VLLM into a much smaller model that has a totally different architecture from the teacher model. Or maybe a combination of model distillation techniques with model pruning and weight sharing could bring even greater speed gains without loss in accuracy. Lastly, beyond the practical benefits, model distillation techniques may be used to probe just how much of a VLLM’s size is necessary for performing specific NLU tasks. Model distillation is an extremely rich area of research, well deserving of additional exploration.

VII Acknowledgments

The authors would like to thank all the members of the SAP Innovation Center Newport Beach as well as our collaborators at the University of California, Irvine, for helpful discussions and inspiration.