I Introduction
The recent emergence of very large language models (VLLMs)[1, 2, 3, 4, 5, 6, 7] has made a wide variety of previously extremely difficult natural language understanding (NLU) tasks feasible. From an academic perspective, the leaderboards of a broad set of NLU benchmarks, such as the General Language Understanding Evaluation (GLUE[8] or SuperGLUE[9]) and the Stanford Question Answering Dataset (SQuAD[10, 11]), are dominated by fine-tuned versions of such VLLMs, generally based on transformers[12]. In industry, real-world problems are now being solved by putting these VLLMs into production. Recently, both Google and Microsoft have announced that they use some (possibly distilled) version of a VLLM in their search algorithms. As such, it has become increasingly apparent that deploying a (fine-tuned) VLLM into production can generate a lot of value. However, the resource requirements of deploying a VLLM into production can be prohibitively expensive: VLLMs, being very large, require both a lot of memory and a lot of computation for inference. Therefore, there has been much recent work on distilling knowledge from a VLLM into a much smaller, faster model[13, 14, 15, 16]. Our work continues that line of research by examining several novel techniques which, together with previously published techniques, produce a leaner, faster, but still highly accurate model. In concrete terms, while maintaining Exact Match (EM) and F1 scores on SQuAD v2.0 higher than those of TinyBERT, 4-layer Patient Knowledge Distillation of BERT (BERT-PKD), and 4-layer DistilBERT, we achieve 1.24x, 3.1x, and 3.1x speed-ups over those models respectively (4-layer BERT-PKD and 4-layer DistilBERT have the same model architecture, so their inference speed is the same). Compared with BERT-Base and BERT-Large, we achieve 9.13x and 28.63x speed-ups respectively.
For this work, we look into task-specific distillation for the SQuAD v2.0 task. We chose to look more deeply into task-specific distillation so that we could explore task-specific data augmentation and its effects on model distillation in more detail. We chose the SQuAD v2.0 task specifically because it is relatively difficult and flexible: recent advances in NLU have proposed framing a wide variety of NLU tasks in the question-answering format[7, 17].
The major contributions (which we will describe in detail in section III) of our work are:

With inspiration from convolutional autoencoders, we use convolutions, max pooling, and up-sampling to shrink the sequence length of inputs to the transformer encoder blocks and then expand the outputs of those encoder blocks back to the original sequence length.

We modify the hidden-state distillation and attention-weight distillation by using average pooling and max pooling respectively to align the sequence dimension of our model with that of the teacher model.

We modify the patient knowledge distillation procedure to build layers up from the bottom, one layer at a time, to reduce the covariate shift experienced by each encoder block.

We fine-tune a pretrained sequence-to-sequence VLLM to produce high-quality questions from given contexts in order to perform task-specific data augmentation.
The rest of our paper is organized as follows: in section II we discuss some related work and models from which we drew inspiration; in section III we describe in detail our methodology, including model architecture, model distillation procedure, data augmentation, and inference speed testing procedure; in section IV we present all of our results, including results from our ablation study; in section V we discuss our findings in detail; and finally in section VI we present a few concluding remarks and directions for future work.
II Related Work
Model compression has been explored in the past along several parallel paths. There is a whole field of study devoted to weight quantization, weight pruning, and weight encoding to reduce the size of large neural networks, often achieving impressive results[18]. Recently, there have been efforts to apply weight quantization and pruning to transformer models as well[19, 20, 21]. For this work, we instead focus on the parallel path(s) of exploration into compressing large transformers using architectural changes or model distillation.
The recently released ALBERT[22] approached model compression from the point of view of making architectural and training objective changes to the transformer model. The authors of ALBERT use the following three main techniques: 1) factorized embedding parameterization to greatly reduce the number of weights in the embedding layer, 2) weight sharing among all encoder blocks of the transformer, greatly reducing the number of weights in the encoder stack, and 3) an inter-sentence coherence loss instead of a next-sentence prediction loss during BERT-like pretraining. With these architectural and training modifications, ALBERT produces impressive results and achieves state-of-the-art accuracy, with its xxlarge model, on a wide variety of NLU tasks. ALBERT-Base also achieves quite high accuracy while reducing the number of parameters to 12 million (compared with 110 million for BERT-Base and 335 million for BERT-Large). However, because data must still be processed through every layer of ALBERT, even though the weights are shared between all the layers, the inference speed improvement achieved by ALBERT is not so striking. ALBERT-Base achieves a 21.1x speed-up over an extra-large version of BERT built by the authors of ALBERT (BERT-XLarge), which is itself 17.7x slower than BERT-Base. This means that ALBERT-Base achieves only a roughly 1.2x (21.1/17.7) speed-up over BERT-Base itself. For our purposes, we wanted to look at much faster architectures. Therefore, we only incorporated the factorized embedding parameterization idea from ALBERT. Other possibly faster architectures have of course also been explored for language modeling[23].
Model distillation was first applied to distilling ensemble model knowledge into a single model[24] and then expanded to other domains[25]. In this vein, we point out three recent efforts at model distillation of BERT: DistilBERT[16], BERT-PKD[13], and TinyBERT[14]. DistilBERT took inspiration from model distillation as put forth by Hinton et al. and distilled knowledge from the softmax outputs produced by BERT-Base into a 6 encoder block (also called 6-layer) version of BERT. DistilBERT has half as many encoder blocks as BERT-Base but is otherwise identical to BERT-Base (in terms of hidden dimension size, the number of attention heads, etc.). BERT-PKD goes further and distils the hidden state activations, layer by layer, from BERT-Base (and in one experiment, BERT-Large) into 3 and 6 encoder block versions of BERT-Base. The authors of BERT-PKD show that patient knowledge distillation generally outperforms last-layer distillation. Finally, TinyBERT is a 4 encoder block version of BERT-Base that goes even further and includes distillation objectives for the attention scores and the embedding layer. In addition, TinyBERT shrinks the hidden dimension sizes of BERT-Base to produce a much smaller and much faster model than either DistilBERT or BERT-PKD. The accuracy on SQuAD v2.0 attained by TinyBERT also exceeds that attained by DistilBERT and BERT-PKD, for a given number of encoder blocks, so for our work we generally benchmark against TinyBERT. We take inspiration from TinyBERT and perform a version of their layer-by-layer patient knowledge distillation. However, unlike TinyBERT, we also include some architectural changes that necessitate changes to the knowledge distillation procedure. Furthermore, we did not reproduce the language modeling pretraining used by BERT and all of the previously mentioned distillations. Instead, we relied purely on data augmentation to achieve state-of-the-art results.
III Methodology
i Problem Statement
For this work, we are working specifically in the framework of question answering as posed by the SQuAD v2.0 dataset. To perform question answering, we are given a question and a context and we must find the concrete answer to the question. More specifically, SQuAD v2.0[11] is an extractive, text-based, question-answering dataset which tasks us with finding spans of text within the context which answer the given question or, if the question is unanswerable given the context, returning a null result. Formally, given a tuple $(Q, C)$, where $Q$ is the question and $C$ is the context consisting of a passage of text, find $A$, which is a span in $C$, that answers the question, or return a null result if $Q$ is not answerable given $C$. For training, we are given tuples of training examples $(Q, C, A)$, while for evaluation, we are given the tuple $(Q, C)$ and a set of ground truth answers $\mathcal{A}$ that are variants of the answer for the given question.
We use two primary metrics to evaluate our model for accuracy, the EM and F1 metrics. The EM score divides the number of examples for which our model’s answer, $\hat{A}$, exactly matched any of the ground truth answers in $\mathcal{A}$ by the total number of evaluation examples. The F1 score averages, over all the evaluation examples, the maximum overlap between $\hat{A}$ and the answers in $\mathcal{A}$. These are the same two metrics used by the SQuAD v2.0 dataset itself.
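As an illustration of the two metrics, the following is a minimal sketch written in the spirit of the official SQuAD evaluation script, not a copy of it. The function names (`normalize`, `exact_match`, `f1_single`, `evaluate`) and the convention of representing a null answer by an empty string are assumptions made for this example.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truths):
    """1.0 if the prediction equals any ground truth answer after normalization."""
    return float(any(normalize(prediction) == normalize(t) for t in truths))


def f1_single(prediction, truth):
    """Token-level F1 overlap between one prediction and one ground truth answer."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        # Null answers: full credit only if both sides are empty.
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


def evaluate(predictions, references):
    """predictions: list of str; references: list of lists of str ([""] if unanswerable)."""
    n = len(predictions)
    em = sum(exact_match(p, r) for p, r in zip(predictions, references)) / n
    f1 = sum(max(f1_single(p, t) for t in r) for p, r in zip(predictions, references)) / n
    return 100.0 * em, 100.0 * f1
```

For example, `f1_single("in Paris France", "Paris")` has precision 1/3 and recall 1, giving an F1 of 0.5, while its exact match score is 0.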
ii Model Architecture
WaLDORf is a hybrid model that incorporates convolutional layers into the transformer encoder stack architecture of BERT[2]; see figure 1 for an illustration. Taking inspiration from ALBERT, we chose the embedding dimension to be smaller than the internal hidden dimension used in the encoder blocks[22]. We use the same word-piece tokenization scheme as BERT (which is itself a subword tokenization scheme similar to Byte Pair Encoding[26]) in order to allow distillation of the embedding layer as described in section III(iii). Before feeding data into the 8 encoder blocks, we use a time-distributed feed-forward layer to expand the dimensions of the hidden representation to equal that of the encoder blocks. The major architectural insight we bring forth with WaLDORf is to use convolutional layers to shrink and then re-expand the input sequence dimension. The self-attention mechanism employed by the transformer encoder blocks has complexity which is quadratic in the input sequence length[12]. By using two 1D convolutions (along the sequence length dimension) coupled with two max-pooling layers, we shrink the input sequence lengths to the encoder blocks by a factor of 4. In turn, this means each self-attention mechanism needs to do only 1/16th the amount of work. We then use two more 1D convolutions (since there are no transposed convolutions in 1 dimension) coupled with two up-sampling layers to re-expand the sequence length dimension back to its original length. By greatly reducing the amount of work carried out by the self-attention mechanism, we are able to make the encoder blocks run much faster, so that we can have a deeper model without sacrificing speed. The tradeoff is, of course, that the model must learn to encode the input information in a "smeared out" way, with roughly every 4 word pieces getting 1 internal representation. This tradeoff may affect different tasks differently depending on how sensitive a certain task is to specific word pieces. However, we found that even for the SQuAD v2.0 task, which intuitively should be relatively sensitive to specific words or word pieces, since the task is specifically about finding spans within the text, we were still able to achieve good accuracy with our model.
The 8 encoder blocks used in our model are the standard transformer encoder blocks used by BERT, scaled down to increase inference speed. A detailed list of architecture hyperparameters can be seen in the architecture hyperparameters table. The self-attention mechanisms in our model use 16 heads of attention like BERT-Large, instead of 12 heads of attention like BERT-Base, because we chose to use BERT-Large as our teacher model. The total number of parameters of our model is a bit less than a quarter of the number of parameters of BERT-Base.
Architecture Hyperparameters
# Encoder Blocks  8
# Encoder Convolutional Layers  2
# Decoder Convolutional Layers  2
Embedding Size  
Hidden Size  
FeedForward Size  
# Attention Heads  16
# Total Parameters  
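The shrink-and-re-expand bookkeeping described above can be traced with a small sketch. It deliberately omits the convolutions and only follows the sequence length through two pooling and two up-sampling steps; the pool size of 2 per layer is an assumption inferred from the overall factor of 4, and the helper names are hypothetical.

```python
def max_pool_1d(seq, size=2):
    """Non-overlapping max pooling along the sequence dimension (stride == size)."""
    return [max(seq[i:i + size]) for i in range(0, len(seq) - size + 1, size)]


def upsample_1d(seq, factor=2):
    """Nearest-neighbour up-sampling: repeat every position `factor` times."""
    return [x for x in seq for _ in range(factor)]


def attention_cost_ratio(shrink_factor):
    """Self-attention cost is quadratic in sequence length, so shrinking by k saves k**2."""
    return 1.0 / shrink_factor ** 2


seq_len = 384
tokens = list(range(seq_len))                 # stand-in for one feature channel
shrunk = max_pool_1d(max_pool_1d(tokens))     # 384 -> 192 -> 96
restored = upsample_1d(upsample_1d(shrunk))   # 96 -> 192 -> 384
print(len(shrunk), len(restored), attention_cost_ratio(seq_len // len(shrunk)))
# prints: 96 384 0.0625
```

With a maximum sequence length of 384, the encoder blocks therefore see 96 positions, and each self-attention computation costs 1/16 of what it would at full length.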
iii Training Objectives and Procedure
To train our model, we used a modified version of the patient knowledge distillation procedure. We have 5 main training losses:

The distillation loss over the embedding layer: $\mathcal{L}_{emb}$.

The distillation loss over the hidden state representations inside the 8 encoder blocks: $\mathcal{L}_{hid}$.

The distillation loss over the attention scores used to calculate the attention weights inside the 8 self-attention mechanisms: $\mathcal{L}_{att}$.

The distillation loss over the final softmax layer, which corresponds to the vanilla model distillation loss introduced by Hinton et al.: $\mathcal{L}_{soft}$.

The ground truth loss (for data for which we have the ground truth): $\mathcal{L}_{gt}$.
The student model (our model) learns from a teacher model, which we chose to be BERT-Large Whole-Word-Masking (BERT-Large WWM), a variant of BERT-Large that was trained using a masked token objective in which the masked tokens were ensured to cover whole words instead of word pieces as in the original. The first 4 losses mentioned, $\mathcal{L}_{emb}$, $\mathcal{L}_{hid}$, $\mathcal{L}_{att}$, and $\mathcal{L}_{soft}$, are all losses for the student model with respect to the teacher model. The final loss, $\mathcal{L}_{gt}$, is the loss with respect to the ground truth and would be the only loss available to us if we did not perform model distillation. As we will show in section IV(iii), if we use only $\mathcal{L}_{gt}$ to try to train our model, our results are extremely poor.
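The embedding loss, $\mathcal{L}_{emb}$, is a simple mean squared error loss that measures how different the embedding representations of the student model are from those of the teacher model. Given $E^{S}_{i} \in \mathbb{R}^{l \times e_S}$, where $E^{S}_{i}$ are the student model embeddings for the $i$-th text sequence, $l$ is the input sequence length, and $e_S$ is the embedding size of the student model, and $E^{T}_{i} \in \mathbb{R}^{l \times e_T}$, where $E^{T}_{i}$ are the teacher model embeddings for the $i$-th text sequence, and $e_T$ is the teacher model embedding size, then:
$$\mathcal{L}_{emb} = \frac{1}{B} \sum_{i=1}^{B} \mathrm{MSE}\left(E^{S}_{i} W_{e},\; E^{T}_{i}\right) \tag{1}$$
Where MSE is the mean squared error, $B$ is the batch size, and $W_{e} \in \mathbb{R}^{e_S \times e_T}$ is a learned tensor that projects the student embeddings to the same dimension size as the teacher embeddings.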
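The structure of the embedding loss in equation (1) can be sketched as follows, assuming the loss is a batch average of per-sequence mean squared errors after projecting the student embeddings. Matrices are plain nested lists so the example stays self-contained; in practice the projection would be a trainable tensor in the training framework, and the function names here are hypothetical.

```python
def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def mse(a, b):
    """Mean squared error over all entries of two equally shaped matrices."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def embedding_loss(student_batch, teacher_batch, projection):
    """Batch average of MSE(student embeddings @ projection, teacher embeddings)."""
    losses = [
        mse(matmul(e_s, projection), e_t)
        for e_s, e_t in zip(student_batch, teacher_batch)
    ]
    return sum(losses) / len(losses)


# Toy sizes: sequence length 2, student embedding size 2, teacher embedding size 3.
student = [[[1.0, 2.0], [0.0, 1.0]]]
projection = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
teacher = [[[1.0, 2.0, 3.0], [0.0, 1.0, 4.0]]]
print(embedding_loss(student, teacher, projection))  # one entry is off by 3 -> 9 / 6 = 1.5
```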
Both the hidden loss, $\mathcal{L}_{hid}$, and the attention score loss, $\mathcal{L}_{att}$, are mean squared error losses that try to match the hidden states and attention scores within the 8 encoder blocks of the student model to those of 8 encoder blocks of the teacher model. We choose to follow the PKD-skip scheme[13] and use every 3rd encoder block in the teacher model for distillation. Just like for the embedding loss, we also have to project the student model hidden states to the same dimension size as the teacher model’s. In addition, we use an average pooling over the teacher model’s hidden states along the sequence length dimension to shrink the sequence length of the teacher model to match the sequence length inside the student model’s encoder blocks. Because WaLDORf uses convolutions to shrink the sequence length used in the encoder blocks, we cannot match hidden states, word for word, between the two models. If we think of hidden states analogously to word vectors, then we are essentially asking WaLDORf to approximate the average of 4 word vectors with each of its hidden states. This procedure is perhaps the simplest way to make the dimensions of our student model align with the targets given by the teacher model. Empirically, this procedure gives us good results. Thus, given $H^{S}_{i,j} \in \mathbb{R}^{(l/4) \times h_S}$, where $H^{S}_{i,j}$ are the hidden state outputs for the $i$-th text sequence of the $j$-th encoder block of the student model, $l$ is the input sequence length, and $h_S$ is the hidden dimension size of the student model, and $H^{T}_{i,3j} \in \mathbb{R}^{l \times h_T}$, where $H^{T}_{i,3j}$ are the hidden state outputs for the $i$-th text sequence of the $3j$-th encoder block of the teacher model, and $h_T$ is the hidden dimension size of the teacher model, then:
$$\mathcal{L}^{j}_{hid} = \frac{1}{B} \sum_{i=1}^{B} \mathrm{MSE}\left(H^{S}_{i,j} W_{h},\; \mathrm{AVG}\left(H^{T}_{i,3j}\right)\right) \tag{2}$$
Where AVG denotes a 1D average pooling with size 4 and stride 4 along the sequence length dimension, $B$ is the batch size, $\mathcal{L}^{j}_{hid}$ is the hidden loss of the $j$-th encoder block, and $W_{h} \in \mathbb{R}^{h_S \times h_T}$ is a learned matrix that projects student model hidden dimension sizes to those of the teacher model. Note that we use the same $W_{h}$ for every encoder block because we want the distilled learning to be learned by WaLDORf itself and not by these projection matrices, which will be discarded at inference time.
For the attention score loss, instead of using average pooling to make sequence dimensions match, we switch to using max pooling. We decided to use max pooling here because we want the student model to learn the strongest attention scores rather than an average of attention scores. Hence, given $A^{S}_{i,j} \in \mathbb{R}^{a \times (l/4) \times (l/4)}$, where $A^{S}_{i,j}$ are the student model attention scores for the $i$-th text sequence of the $j$-th encoder block and $a$ is the number of attention heads, and $A^{T}_{i,3j} \in \mathbb{R}^{a \times l \times l}$, where $A^{T}_{i,3j}$ are the teacher model attention scores for the $i$-th text sequence of the $3j$-th encoder block, then:
$$\mathcal{L}^{j}_{att} = \frac{1}{B} \sum_{i=1}^{B} \mathrm{MSE}\left(A^{S}_{i,j},\; \mathrm{MAX}\left(A^{T}_{i,3j}\right)\right) \tag{3}$$
Where MAX denotes a 2D max pooling operation with size $4 \times 4$ and stride $4 \times 4$, and $\mathcal{L}^{j}_{att}$ is the attention score loss of the $j$-th encoder block. Note that here we are operating on attention scores and not the attention weights. The attention scores have not had a softmax applied to them and so do not necessarily add up to 1; hence we are free to perform a simple max pooling without destroying any normalization, since the softmax will be applied afterwards. One detail that should be noted is that proper masking needs to be maintained on the attention score loss so that the system does not try to learn minor changes to the masking value.
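The two pooling operations used to build the teacher targets in equations (2) and (3), together with a masked mean squared error, can be sketched as below. This is a sketch for a single sequence and a single attention head using nested lists; the function names and the 0/1 mask convention are assumptions for illustration.

```python
def avg_pool_rows(hidden, size=4):
    """Average every `size` consecutive positions of an (l x h) hidden-state matrix."""
    return [
        [sum(col) / size for col in zip(*hidden[i:i + size])]
        for i in range(0, len(hidden) - size + 1, size)
    ]


def max_pool_2d(scores, size=4):
    """Non-overlapping size x size max pooling of an (l x l) attention-score matrix."""
    n = len(scores) - len(scores) % size
    return [
        [
            max(scores[r][c] for r in range(i, i + size) for c in range(j, j + size))
            for j in range(0, n, size)
        ]
        for i in range(0, n, size)
    ]


def masked_mse(student, teacher, mask):
    """MSE restricted to entries whose mask value is 1 (for example non-padding positions)."""
    terms = [
        (s - t) ** 2
        for rs, rt, rm in zip(student, teacher, mask)
        for s, t, m in zip(rs, rt, rm)
        if m
    ]
    return sum(terms) / max(len(terms), 1)


teacher_hidden = [[float(i), float(2 * i)] for i in range(8)]           # l = 8, h = 2
teacher_scores = [[float(8 * r + c) for c in range(8)] for r in range(8)]
print(avg_pool_rows(teacher_hidden))  # [[1.5, 3.0], [5.5, 11.0]]
print(max_pool_2d(teacher_scores))    # [[27.0, 31.0], [59.0, 63.0]]
```

Restricting the error to unmasked entries is one way to keep the student from spending capacity on reproducing the large negative masking value itself.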
The distillation loss over the final softmax layer closely follows the distillation loss introduced in Hinton et al.[25]:
$$\mathcal{L}_{soft} = -\frac{\tau^{2}}{B} \sum_{i=1}^{B} \mathrm{softmax}\left(\frac{z^{T}_{i}}{\tau}\right) \cdot \log \mathrm{softmax}\left(\frac{z^{S}_{i}}{\tau}\right) \tag{4}$$
Where $\tau$ is a temperature hyperparameter, $B$ is the batch size, and $z^{T}_{i}$ and $z^{S}_{i}$ are the teacher and student logits for the $i$-th text sequence respectively. We multiply the loss by the factor of $\tau^{2}$ to ensure that the gradients are of the correct scale.
Lastly, the ground truth loss is the standard cross entropy loss with the ground truth labels:
$$\mathcal{L}_{gt} = -\frac{1}{B} \sum_{i=1}^{B} y_{i} \cdot \log \mathrm{softmax}\left(z^{S}_{i}\right) \tag{5}$$
Where $y_{i}$ is the ground truth label for the $i$-th text sequence. This ground truth loss is only available for data for which we have the ground truth labels. When we perform data augmentation, as discussed in the next section, we will no longer have access to ground truth labels and so this loss will not apply. Note that for both the ground truth loss and the distillation loss over the final softmax, there are two sets of logits (and labels), one for the start position and one for the end position. The softmax functions are taken over the sequence length dimension.
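For a single sequence and a single set of logits (start or end), the two losses in equations (4) and (5) can be sketched as follows. The function names are hypothetical, and the tempered cross entropy with a squared-temperature factor follows the standard Hinton-style formulation rather than any code released with this work.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over a list of logits, optionally softened by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(teacher_logits, student_logits, temperature):
    """Cross entropy between tempered teacher and student distributions, times T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return -(temperature ** 2) * sum(pi * math.log(qi) for pi, qi in zip(p, q))


def hard_cross_entropy(label_index, student_logits):
    """Standard cross entropy against a one-hot ground truth position."""
    return -math.log(softmax(student_logits)[label_index])


teacher = [4.0, 1.0, 0.0, -2.0]
student = [3.0, 1.5, 0.0, -1.0]
print(soft_cross_entropy(teacher, student, temperature=2.0))
print(hard_cross_entropy(0, student))
```

In the full model the same computation is applied twice, once to the start-position logits and once to the end-position logits, with the softmax taken over the sequence length dimension.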
To train our model, we found it was very beneficial to train each layer from the bottom up (or left to right in figure 1), so that we start by training with only the embedding loss, then combine that loss with the hidden state and attention score losses for layer 0, then for layer 1, etc. After each intermediate layer has been trained for some time, we then add the final softmax layer distillation loss and the ground truth loss. This training procedure makes sense intuitively because it should reduce the internal covariate shift experienced by each layer during training (the same goal as for batch normalization[27]). Since we have targets available for each layer in the student model, with the exception of the convolutional layers, we can use those intermediate targets to ensure that the inputs to each layer have stabilized somewhat by the time that layer is trained. Thus, our final loss function is:
$$\mathcal{L} = \alpha\, \mathcal{L}_{emb} + \beta\, \mathcal{L}_{hid} + \gamma\, \mathcal{L}_{att} + \Theta(s - 9n) \left(\delta\, \mathcal{L}_{soft} + \epsilon\, \mathcal{L}_{gt}\right) \tag{6}$$
Where $\alpha, \beta, \gamma, \delta, \epsilon$ are real number hyperparameters and:
$$\mathcal{L}_{hid} = \sum_{j=1}^{8} \Theta(s - jn)\, \mathcal{L}^{j}_{hid}, \qquad \mathcal{L}_{att} = \sum_{j=1}^{8} \Theta(s - jn)\, \mathcal{L}^{j}_{att} \tag{7}$$
Where $\Theta$ is the Heaviside step function, $s$ is the current global training step, $n$ is an integer hyperparameter specifying how many training steps to take for each layer, and $\mathcal{L}^{j}_{hid}$ and $\mathcal{L}^{j}_{att}$ are the hidden state and attention score losses of the $j$-th encoder block (so that layer 0 corresponds to $j = 1$). We will show in our ablation study in section IV(iii) that this layer-by-layer building up of the loss function significantly improves accuracy.
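The layer-by-layer schedule can be sketched as a function of the global step. This is a sketch under two assumptions: each of the 9 build-up stages (embedding layer plus 8 encoder blocks) lasts the same number of steps, and the value 57,500 used in the example is inferred by dividing the 517,500 build-up steps reported with the training hyperparameters by 9; neither the helper names nor that value are taken directly from the original text.

```python
def heaviside(x):
    """Step function: 1 once x >= 0, else 0."""
    return 1.0 if x >= 0 else 0.0


def active_losses(step, steps_per_layer, num_blocks=8):
    """Return which loss terms are switched on at a given global training step."""
    n = steps_per_layer
    return {
        # The embedding loss is trained from the very first step.
        "embedding": True,
        # Encoder block j (1-indexed) joins the objective after j * n steps.
        "blocks": [j for j in range(1, num_blocks + 1) if heaviside(step - j * n)],
        # Softmax distillation and ground truth losses join after all blocks.
        "softmax_and_ground_truth": bool(heaviside(step - (num_blocks + 1) * n)),
    }


for step in (0, 57_500, 460_000, 517_500):
    print(step, active_losses(step, steps_per_layer=57_500))
```

At step 0 only the embedding loss is active; every further 57,500 steps one more encoder block is added, and from step 517,500 onward all loss terms are in play.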
iv Data Augmentation
A large part of why VLLMs perform so well on downstream tasks is the massive amount of data that they are pretrained on[3, 5, 7]. For distilled architectures, such as DistilBERT or BERT-PKD, which keep hidden dimensions all the same size as a teacher model (e.g. BERT-Base) and just use fewer encoder blocks, the student encoder blocks can be initialized with the weights from a subset of the encoder blocks of a trained teacher model. For distilled architectures, such as TinyBERT or our model, which change the hidden dimension sizes as well as the number of encoder blocks, there is no obvious way to initialize the weights of the model with weights from the teacher model. Hence, distilled architectures also often reproduce the pretraining phase of BERT before being fine-tuned (with or without additional distillation) on downstream tasks. Here, we are focused on task-specific distillation and so we do not wish to reproduce the language modeling pretraining phase. However, not reproducing the pretraining phase would put our model at a huge disadvantage simply due to the much smaller coverage of input data that our model would be trained on. Hence, to help balance out the data volume a bit, we chose to perform extensive data augmentation. It should be noted, though, that even with data augmentation, the final volume of data that we train on, as measured by total token count, is still much smaller than the pretraining data used to train BERT.
To perform data augmentation, we sourced 500,000 additional paragraphs, between 200 and 320 words long, from 280,250 English Wikipedia articles. For each paragraph, we automatically generated a question using a fine-tuned version of the recently released Text-to-Text Transfer Transformer, T5-Large[7]. We fine-tuned T5-Large using question-context pairs from the SQuAD v2.0 dataset itself. Visual inspection of the produced questions showed that the automatically generated questions were of very high quality. Our empirical experiments will also show (in section IV) that this data augmentation increased the end-to-end accuracy by a large margin. For comparison, the SQuAD v2.0 training dataset has on the order of 130,000 questions on 20,000 paragraphs sourced from 500 articles.
v Inference Speed Testing
To perform inference speed testing, we set up an NVIDIA V100 GPU running TensorFlow 1.14 and CUDA 10. We must point out here that inference speed and relative speed-up are quite sensitive to the exact testing conditions. For example, changing the input sequence length or prediction batch size may change both the absolute inference speed and the relative speed-up one architecture gets over another. Inference speed and relative speed-up are therefore task dependent and are not universal. For these reasons, the relative speed-up reported by us may not be entirely consistent with the relative speed-up reported by others. In particular, while TinyBERT found a 9.4x speed-up over BERT-Base in their particular testing environment, which used a batch size of 128 on the QNLI training set and a maximum sequence length of 128 on an NVIDIA K80 GPU, within our testing environment TinyBERT showed only a 7.4x speed-up over BERT-Base. For our test environment, we chose to use a maximum sequence length of 384 and a batch size of 32. We ran inference multiple times on the SQuAD v2.0 dev set and averaged the runtimes to average out any network- or memory-caching-dependent performance issues.
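The averaging procedure can be sketched with a generic timing harness. This is not the benchmarking code used for the reported numbers: `predict_fn` and `batches` are hypothetical placeholders for a model's prediction function and the batched dev set, and on a GPU one would additionally need to make sure that each call has actually finished before the timer is read.

```python
import time
from statistics import mean


def average_runtime(predict_fn, batches, repeats=5):
    """Run a full pass over `batches` `repeats` times; return mean wall-clock seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        for batch in batches:
            predict_fn(batch)
        timings.append(time.perf_counter() - start)
    return mean(timings)


def relative_speed_up(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Toy stand-ins: two "models" of different cost over 10 batches of 32 inputs.
batches = [list(range(32)) for _ in range(10)]
slow = average_runtime(lambda b: [sum(b) for _ in range(200)], batches)
fast = average_runtime(lambda b: [sum(b) for _ in range(20)], batches)
print(relative_speed_up(slow, fast))
```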
vi Training Hyperparameters
Training Hyperparameters  

Init Learning Rate  
Batch Size  
Dropout  
Adam Epsilon  
Init Temperature  
Epochs  
We present the hyperparameters used to train WaLDORf in the training hyperparameters table. These are the hyperparameters we found gave us the best results. The initial learning rate was kept constant over the "building up" phase of training, where layers and encoder blocks are slowly being added to the training objective. Once the model reached the final phase of training and all of the loss functions were in play, the learning rate was decayed linearly to 0 by the end of training. The initial temperature was also decayed linearly to 1 over the final phase of training. For our choice of data volume, batch size, and steps per layer ($n$), 35 epochs of training corresponds roughly to 1 million global steps, with 517,500 of those steps being spent during the build-up phase of training.
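The schedule described above, constant during the build-up phase and then linear decay, can be sketched with a single helper. The step counts come from the text (517,500 build-up steps out of roughly 1 million); the initial learning rate and initial temperature used in the example are placeholders chosen for illustration only, not the values from the training hyperparameters table.

```python
def linear_decay(initial, final, step, start_step, end_step):
    """Hold `initial` until start_step, then interpolate linearly to `final` at end_step."""
    if step <= start_step:
        return initial
    if step >= end_step:
        return final
    fraction = (step - start_step) / (end_step - start_step)
    return initial + fraction * (final - initial)


BUILD_UP_STEPS = 517_500
TOTAL_STEPS = 1_000_000          # "roughly 1 million" global steps
PLACEHOLDER_LR = 1e-4            # hypothetical initial learning rate
PLACEHOLDER_TEMPERATURE = 2.0    # hypothetical initial temperature

for step in (0, BUILD_UP_STEPS, 758_750, TOTAL_STEPS):
    lr = linear_decay(PLACEHOLDER_LR, 0.0, step, BUILD_UP_STEPS, TOTAL_STEPS)
    temp = linear_decay(PLACEHOLDER_TEMPERATURE, 1.0, step, BUILD_UP_STEPS, TOTAL_STEPS)
    print(step, lr, temp)
```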
vii Hardware
All experiments were performed using a single Google v3 TPU running TensorFlow 1.14. Data parallelization among the 8 TPU cores was handled by the TPUEstimator API.
IV Results
Overall, WaLDORf achieved a significant speed-up over previous state-of-the-art distilled architectures while maintaining a higher level of accuracy on SQuAD v2.0.
i Inference Speed
Inference Speed Results
Model  Speed-up over BERT-Base
BERT-Base  1x
BERT-Large  0.3x
BERT-6  2x
BERT-4  2.9x
BERT-2  5.8x
TinyBERT  7.4x
Ours  9.1x
Detailed results of our inference speed testing are presented in the inference speed results table. As clearly shown, our model is the fastest model at performing inference over the SQuAD v2.0 dev set by a wide margin. Because of our use of convolutional layers, we were able to have a model which is deeper and has larger hidden dimensions while still being faster than TinyBERT. DistilBERT and BERT-PKD would both fall under the BERT-6 umbrella since, in terms of architecture, both are simply 6 encoder block versions of BERT-Base. BERT-PKD also has a 3 encoder block version, but we can see that our model is much faster than even a 2 encoder block version of BERT-Base.
ii Accuracy Performance
SQuAD v2.0 Results
Model  EM  F1
BERT-Base  
BERT-Large WWM  
BERT-PKD-4  
DistilBERT-4  
TinyBERT  
Our Model  66.0  70.3
BERT-PKD-6  
DistilBERT-6  
TinyBERT-6  
Our model’s performance on SQuAD v2.0 is benchmarked against other models in the SQuAD v2.0 results table. The BERT-PKD-4 and DistilBERT-4 models presented have internal dimensions identical to those of BERT-4 from the inference speed results table. Thus, our model is 3.1x faster than those models at inference while being significantly more accurate. WaLDORf is also somewhat more accurate than TinyBERT while still being 1.24x faster. To obtain EM and F1 scores higher than ours, one would have to go to the 6-layer versions of those models, relative to which our model is 4.5x faster. TinyBERT-6 also has internal dimension sizes scaled to be the same as those of BERT-Base and so is also 4.5x slower than WaLDORf.
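The relative speed-ups quoted above follow from the inference speed results by simple division, as the sketch below shows. Because it starts from the rounded values in that table, it reproduces the quoted ratios only approximately (for example about 1.23x rather than 1.24x over TinyBERT); the dictionary and function names are hypothetical.

```python
# Speed-ups over BERT-Base, as listed (rounded) in the inference speed results.
speed_up_vs_bert_base = {
    "BERT-Base": 1.0,
    "BERT-Large": 0.3,
    "BERT-6": 2.0,
    "BERT-4": 2.9,
    "BERT-2": 5.8,
    "TinyBERT": 7.4,
    "WaLDORf": 9.1,
}


def relative(model, reference="WaLDORf"):
    """Speed-up of `reference` over `model`, from their speed-ups over BERT-Base."""
    return speed_up_vs_bert_base[reference] / speed_up_vs_bert_base[model]


for name in ("BERT-4", "BERT-6", "TinyBERT"):
    print(name, round(relative(name), 2))
```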
iii Ablation Study
Ablation Study  

EM  F1  +EM  +F1  
Baseline  
+ Softmax Distil  
+ All Layer Distil  
+ Slow Build  11.8  12.5  
+ Data Augment  66.0  70.3 
Ablation study for WaLDORf. The baseline model is trained on just ground truth labels. "Softmax Distil" is, in addition, trained on the softmax probabilities provided by the teacher model. "All Layer Distil" adds on the embedding layer, hidden state, and attention score losses as described in section III(iii). "Slow Build" builds each layer of the student model one by one. "Data Augment", as described in section III(iv), was the final technique we added and gave us our final results. +EM and +F1 denote improvements over the previous best model.
Results from our ablation study are provided in the ablation study table. Each additional technique shown in the table was added on to the sum of the previously used techniques. As we can clearly see, each technique which we added produced significant improvements. With the addition of each new technique, we also explored a whole set of hyperparameters. Results are presented for the best set of hyperparameters found within a given framework. When all the techniques are combined, we get a very significant absolute improvement in both EM and F1 scores over the baseline.
The biggest incremental improvement in accuracy that we saw was in moving from trying to distill every layer all at once to slowly building up the layers from the bottom embedding layer up. This is likely because training every layer all at once introduces a significant covariate shift for every layer’s inputs during training. Stabilizing the inputs to each layer by training layers in a bottom-up fashion appears to make optimization much easier.
V Discussion
i Convolutions and Autoencoding
The core architectural change that we bring forth in this work is the use of convolutions, max pooling, and up-sampling to reduce and then re-expand the sequence length dimension. We introduced this architectural change with the hope that, analogous to convolutional autoencoders, WaLDORf would find an efficient way to compress the information contained within an example into a shorter sequence. By attaining accuracy higher than the previous state of the art, we have shown that it is in fact feasible for the model to compress information in this way. Information can be effectively encoded by the input convolutions, processed by the encoder blocks, and then decoded by the output convolutions. In addition, we also showed that by using max pooling and average pooling we are able to use only the signals given by the "important" attention scores and the averaged hidden states from the teacher model to make our student model perform strongly. These properties may be somewhat task dependent, but even for the SQuAD v2.0 task we were able to attain a high level of accuracy.
Due to the quadratic dependence of the complexity of the self-attention mechanism on the sequence length, our architecture will actually see larger relative gains in speed the longer the input sequence length is. We chose a sequence length of 384 for our tests because that is the sequence length used by BERT for the SQuAD dataset. Other transformer-based VLLMs may use even longer sequence lengths in order to deal better with longer paragraphs, e.g. XLNet uses a max sequence length of 512. For circumstances where the max input sequence length is even longer than 384, we expect to gain even larger relative speed-ups.
ii Distillation
Model distillation is currently a very active area of research. Having access to a teacher model opens up the possibility of using a huge variety of techniques to improve a student model’s performance. The distillation techniques available often depend on the architecture of the student model and how similar it is to the teacher model. A model like BERT-PKD or DistilBERT, which is essentially a copy of a teacher model with just fewer layers, can straightforwardly distil a subset of the layers in the teacher model. If one scales down the internal dimensions of the model, as in the case of TinyBERT, then one needs to introduce some way to project the student model’s hidden dimensions to those of the teacher model or vice versa. That projection operator introduces new degrees of freedom to the problem, and the distillation procedure is no longer as straightforward. In our case, we also had to modify the sequence-length dimension of the teacher model’s signals in order to perform intermediate-layer distillation.
In the case that the student model is entirely different from the teacher model, e.g. when an LSTM-based student model tries to learn from a transformer-based teacher model as in Tang et al.[15], then perhaps the only easily distilled layer is the final layer and no intermediate-layer distillation is feasible. However, we (and others) have shown that there are significant performance improvements to be had by also distilling the intermediate layers and operations (e.g. attention scores) of a VLLM. It is uncertain, at this point, how much of the performance improvement is due to the difference in the raw information contained within the training signals (softmax signals vs. intermediate layer and operation signals) and how much is due to a difference in how much those signals help the student model to optimize. The concrete difference between these scenarios would be that if the latter is true, then student models with architectures wildly different from that of a teacher model may still attain a high level of accuracy if the last-layer distillation is carried out with a very strong optimization scheme. If the former scenario is true, then we may expect that there is no method by which we can optimize a student model using only the signals from the last layer as strongly as we could by using signals from all the layers.
iii Data Augmentation
As we showed in section IV(iii), data augmentation gives us a significant 8-point boost in EM and F1 scores over distillation using just the SQuAD v2.0 dataset itself. This result shows that even with roughly 130,000 question-context-answer tuples, SQuAD v2.0 is still not a large enough dataset to fully train our student model using pure model distillation. The set of targets that the student model had to hit, given only SQuAD v2.0 inputs, was already enormously large. The embedding layer targets are of size $N \times 384 \times 1024$, the hidden layer targets are of size $N \times 8 \times 96 \times 1024$, and the attention score targets are of size $N \times 8 \times 16 \times 96 \times 96$, where $N$ denotes the total number of examples. For a BERT-Large WWM teacher model over the SQuAD v2.0 dataset, these tensors would comprise roughly 1.2 terabytes of data. The sheer size of this data makes for a very rich set of signals given to the student model; however, given the success of our data augmentation experiments, the input space provided by SQuAD v2.0 itself did not appear rich enough for the student model to fully exploit those signals.
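The quoted figure of roughly 1.2 terabytes can be reproduced with a back-of-the-envelope calculation. The sketch below rests on several assumptions that are not stated explicitly in the text: about 130,000 examples, 32-bit floating point values, a teacher embedding and hidden size of 1024 (BERT-Large), and hidden state and attention score targets counted after pooling to a sequence length of 96.

```python
def distillation_target_bytes(num_examples, seq_len=384, shrink=4, num_blocks=8,
                              num_heads=16, teacher_dim=1024, bytes_per_value=4):
    """Total bytes of embedding, pooled hidden-state, and pooled attention-score targets."""
    pooled_len = seq_len // shrink
    embedding = num_examples * seq_len * teacher_dim
    hidden = num_examples * num_blocks * pooled_len * teacher_dim
    attention = num_examples * num_blocks * num_heads * pooled_len * pooled_len
    return (embedding + hidden + attention) * bytes_per_value


total = distillation_target_bytes(130_000)
print(total, round(total / 1e12, 2))  # 1226833920000 bytes, about 1.23 TB
```

Under these assumptions the attention score targets account for half of the total, with the pooled hidden states making up a third and the embeddings the remaining sixth.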
A huge benefit of performing model distillation and studentteacher learning is the ease with which "labeled" data can be obtained from unlabeled data. Data augmentation for us was simplified further by the introduction of the powerful T5 model which we finetuned to automatically generate questions given a passage. But, in the absence of such a model, generating a huge amount of unlabeled data in the context of a production environment is often quite feasible. In the case of questionanswering, for example, we may be able to mine usergenerated questions and run those through a teacher model to provide data for the student model to train on. For task specific knowledge distillation, we have shown that purely performing data augmentation is a viable alternative to reproducing the language modeling pretraining phase.
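The idea of turning unlabeled inputs into distillation data by running them through the teacher can be sketched as follows. This is only an illustration under stated assumptions: `pseudo_label` and `toy_teacher` are hypothetical names, and `toy_teacher` is a stand-in that returns uniform start and end distributions over whitespace tokens, where a real pipeline would call a finetuned teacher model.

```python
def pseudo_label(teacher, unlabeled_examples):
    # Run the teacher over unlabeled (question, context) pairs and keep its
    # outputs as training targets for the student.
    dataset = []
    for question, context in unlabeled_examples:
        targets = teacher(question, context)
        dataset.append(
            {"question": question, "context": context, "targets": targets}
        )
    return dataset


def toy_teacher(question, context):
    # Hypothetical stand-in for a finetuned teacher: a uniform distribution
    # over whitespace tokens for both the answer start and the answer end.
    n_tokens = len(context.split())
    uniform = [1.0 / n_tokens] * n_tokens
    return {"start_probs": list(uniform), "end_probs": list(uniform)}


# Example: mined or generated questions paired with passages need no human labels.
unlabeled = [
    ("Who wrote the report?", "The report was written by the audit team ."),
    ("When was it filed?", "It was filed in March ."),
]
augmented = pseudo_label(toy_teacher, unlabeled)
```

Each record in `augmented` carries the teacher's output distributions, so the student can be trained on it in the same way as on teacher outputs for the original dataset.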
VI Conclusion and Further Work
In this work, we have shown a set of techniques that together successfully advance the state of the art in accuracy and speed of distilled language models for the SQuAD v2.0 task. There are many promising directions for future work within this field. The main consideration is pushing the limits of model distillation itself. Perhaps by cleverly using adapter layers of some kind, it will become possible to distill the knowledge contained within intermediate layers of a VLLM into a much smaller model that has a totally different architecture from the teacher model. Alternatively, a combination of model distillation, model pruning, and weight sharing could bring even greater speed gains without loss in accuracy. Lastly, beyond the practical benefits, model distillation techniques may be used to probe how much of the size of a VLLM is actually necessary for performing specific NLU tasks. Model distillation is an extremely rich area of research that well deserves additional exploration.
VII Acknowledgments
The authors would like to thank all the members of the SAP Innovation Center Newport Beach as well as our collaborators at the University of California, Irvine, for helpful discussions and inspiration.
References
 [1] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” arXiv preprint arXiv:1802.05365, 2018.
 [2] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
 [3] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le, “Xlnet: Generalized autoregressive pretraining for language understanding,” arXiv preprint arXiv:1906.08237, 2019.
 [4] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
 [5] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” OpenAI Blog, vol. 1, no. 8, 2019.
 [6] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, “Megatron-lm: Training multi-billion parameter language models using gpu model parallelism,” arXiv preprint arXiv:1909.08053, 2019.
 [7] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” arXiv preprint arXiv:1910.10683, 2019.
 [8] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “Glue: A multi-task benchmark and analysis platform for natural language understanding,” arXiv preprint arXiv:1804.07461, 2018.
 [9] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “Superglue: A stickier benchmark for general-purpose language understanding systems,” arXiv preprint arXiv:1905.00537, 2019.
 [10] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, “Squad: 100,000+ questions for machine comprehension of text,” arXiv preprint arXiv:1606.05250, 2016.
 [11] P. Rajpurkar, R. Jia, and P. Liang, “Know what you don’t know: Unanswerable questions for squad,” arXiv preprint arXiv:1806.03822, 2018.
 [12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, pp. 5998–6008, 2017.
 [13] S. Sun, Y. Cheng, Z. Gan, and J. Liu, “Patient knowledge distillation for bert model compression,” arXiv preprint arXiv:1908.09355, 2019.
 [14] X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu, “Tinybert: Distilling bert for natural language understanding,” arXiv preprint arXiv:1909.10351, 2019.
 [15] R. Tang, Y. Lu, L. Liu, L. Mou, O. Vechtomova, and J. Lin, “Distilling task-specific knowledge from bert into simple neural networks,” arXiv preprint arXiv:1903.12136, 2019.
 [16] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter,” arXiv preprint arXiv:1910.01108, 2019.
 [17] B. McCann, N. S. Keskar, C. Xiong, and R. Socher, “The natural language decathlon: Multitask learning as question answering,” arXiv preprint arXiv:1806.08730, 2018.
 [18] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015.
 [19] E. Voita, D. Talbot, F. Moiseev, R. Sennrich, and I. Titov, “Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned,” arXiv preprint arXiv:1905.09418, 2019.
 [20] G. Prato, E. Charlaix, and M. Rezagholizadeh, “Fully quantized transformer for improved translation,” arXiv preprint arXiv:1910.10485, 2019.
 [21] R. Cheong and R. Daniel, “transformers.zip: Compressing transformers with pruning and quantization,” tech. rep., Technical report, Stanford University, Stanford, California, 2019. URL https …, 2019.
 [22] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “Albert: A lite bert for self-supervised learning of language representations,” arXiv preprint arXiv:1909.11942, 2019.
 [23] S. Merity, “Single headed attention rnn: Stop thinking with your head,” arXiv preprint arXiv:1911.11423, 2019.
 [24] C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil, “Model compression,” in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 535–541, ACM, 2006.
 [25] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
 [26] P. Gage, “A new algorithm for data compression,” The C Users Journal, vol. 12, no. 2, pp. 23–38, 1994.
 [27] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.