FastBERT: a Self-distilling BERT with Adaptive Inference Time

04/05/2020 ∙ by Weijie Liu, et al. ∙ Peking University, Tencent

Pre-trained language models like BERT have proven to be highly performant. However, they are often computationally expensive in many practical scenarios, as such heavy models can hardly be deployed with limited resources. To improve their efficiency while assuring model performance, we propose a novel speed-tunable FastBERT with adaptive inference time. The speed at inference can be flexibly adjusted under varying demands, while redundant calculation on samples is avoided. Moreover, this model adopts a unique self-distillation mechanism at fine-tuning, further enabling greater computational efficiency with minimal loss in performance. Our model achieves promising results on twelve English and Chinese datasets. It can speed up inference by 1 to 12 times relative to BERT, given different speedup thresholds that trade off speed against performance.


1 Introduction

The last two years have witnessed significant improvements brought by language pre-training, such as BERT (Devlin et al., 2019), GPT (Radford et al., 2018), and XLNet (Yang et al., 2019). By pre-training on unlabeled corpora and fine-tuning on labeled data, BERT-like models have achieved large gains on many Natural Language Processing tasks.

Despite this gain in accuracy, these models incur greater computational costs and slower inference, which severely impairs their practicality. Actual settings, especially with the limited time and resources of industry, can hardly put such models into operation. For example, in tasks like sentence matching and text classification, one often needs to process billions of requests per second. Moreover, the number of requests varies with time. In the case of an online shopping site, the number of requests during holidays is five to ten times that of workdays. A large number of servers need to be deployed to enable BERT in industrial settings, and many spare servers must be reserved to cope with peak request periods, incurring huge costs.

To improve their usability, many attempts at model acceleration have been made, such as quantization (Gong et al., 2014), weight pruning (Han et al., 2015), and knowledge distillation (KD) (Romero et al., 2014). As one of the most popular methods, KD requires an additional, smaller student model that depends entirely on the bigger teacher model and trades task accuracy for ease of computation (Hinton et al., 2015). Reducing model size to achieve an acceptable speed-accuracy balance, however, only solves the problem halfway: the model size is still fixed, leaving it unable to cope with drastic changes in request volume.

By inspecting many NLP datasets (Wang et al., 2018), we discerned that samples have different levels of difficulty. Heavy models may over-compute simple inputs, while lighter ones are prone to fail on complex samples. As recent studies (Kovaleva et al., 2019) have shown redundancy in pre-trained models, it is useful to design a one-size-fits-all model that caters to samples of varying complexity and gains computational efficiency with the least loss of accuracy.

Based on this appeal, we propose FastBERT, a pre-trained model with a sample-wise adaptive mechanism. It can adjust the number of executed layers dynamically to reduce computation. The model also has a unique self-distillation process that requires minimal changes to the structure, achieving faster yet comparably accurate outcomes within a single framework. Our model not only reaches a considerable speedup (2 to 11 times) over the BERT model, but also attains competitive accuracy in comparison to heavier pre-trained models.

Experimental results on six Chinese and six English NLP tasks demonstrate that FastBERT achieves a substantial reduction in computation with very little loss in accuracy. The main contributions of this paper can be summarized as follows:

  • This paper proposes a practical speed-tunable BERT model, namely FastBERT, that balances speed and accuracy in response to varying request volumes;

  • The sample-wise adaptive mechanism and the self-distillation mechanism are proposed in this paper for the first time. Their efficacy is verified on twelve NLP datasets;

  • The code is publicly available at https://github.com/autoliuweijie/FastBERT.

2 Related work

BERT (Devlin et al., 2019) can learn universal knowledge from massive unlabeled data and produce highly performant results. Many works have followed: RoBERTa (Liu et al., 2019) uses larger corpora and longer training; T5 (Raffel et al., 2019) scales up the model size even further; UER (Zhao et al., 2019) pre-trains BERT on different Chinese corpora; K-BERT (Liu et al., 2020) injects knowledge graphs into the BERT model. These models achieve higher accuracy with heavier settings and even more data.

However, such unwieldy sizes are often a hindrance under stringent deployment conditions. To be more specific, BERT-base contains 110 million parameters by stacking twelve Transformer blocks (Vaswani et al., 2017), while BERT-large expands to 24 layers. ALBERT (Lan et al., 2019) shares the parameters across layers to reduce the model size. Obviously, the inference speed of these models is much slower than that of classic architectures (e.g., CNN (Kim, 2014), RNN (Wang, 2018), etc.). We believe a large proportion of this computation is redundant.

Knowledge distillation: Many attempts have been made to distill heavy models (teachers) into their lighter counterparts (students). PKD-BERT (Sun et al., 2019a) adopts an incremental extraction process that learns generalizations from intermediate layers of the teacher model. TinyBERT (Jiao et al., 2019) performs a two-stage learning involving both general-domain pre-training and task-specific fine-tuning. DistilBERT (Sanh et al., 2019) further leverages the inductive bias within large models by introducing a triple loss. As shown in Figure 1, student models often require a separate structure, whose effectiveness, however, depends mainly on the capacity of the teacher. They are as indiscriminate to individual cases as their teachers, and only get faster at the cost of degraded performance.

Figure 1: Classic knowledge distillation approach: Distill a small model using a separate big model.

Adaptive inference: Conventional approaches to adaptive computation work token-wise or patch-wise, either adding recurrent steps to individual tokens (Graves, 2016) or dynamically adjusting the number of executed layers inside discrete regions of images (Figurnov et al., 2017). These local adjustments, however, are unfriendly to parallelization. To the best of our knowledge, there has been no work applying adaptive mechanisms to pre-trained language models for efficiency improvements so far.

Figure 2: The inference process of FastBERT, where the number of executed layers for each sample varies with its complexity. This illustrates a sample-wise adaptive mechanism. Taking a batch of inputs as an example, Transformer0 and Student-classifier0 infer their labels as probability distributions and compute the individual uncertainty. Cases with low uncertainty are immediately removed from the batch, while those with higher uncertainty are sent to the next layer for further inference.

3 Methodology

Distinct from the above efforts, our approach fuses adaptation and distillation into a novel speed-up approach, shown in Figure 2, achieving competitive results in both accuracy and efficiency.

3.1 Model architecture

As shown in Figure 2, FastBERT consists of a backbone and branches. The backbone is built upon a 12-layer Transformer encoder with an additional teacher classifier, while the branches are student classifiers appended to the output of each Transformer layer to enable early exits.

3.1.1 Backbone

The backbone consists of three parts: the embedding layer, the encoder containing stacks of Transformer blocks (Vaswani et al., 2017), and the teacher classifier. The structure of the embedding layer and the encoder conform with those of BERT (Devlin et al., 2019). Given the sentence length $n$, an input sentence $s = [w_0, w_1, \dots, w_n]$ is transformed by the embedding layer into a sequence of vector representations $e$ as in (1),

$e = \mathrm{Embedding}(s)$,    (1)

where $e$ is the summation of the word, position, and segment embeddings. Next, the Transformer blocks in the encoder perform a layer-by-layer feature extraction as in (2),

$h_i = \mathrm{Transformer}_i(h_{i-1})$,    (2)

where $h_i$ ($i = 0, 1, \dots, L-1$) is the output feature at the $i$-th layer, $h_{-1} = e$, and $L$ is the number of Transformer layers.

Following the final encoding output is a teacher classifier that extracts in-domain features for downstream inference. It has a fully-connected layer narrowing the dimension from 768 to 128, a self-attention layer joined with a fully-connected layer without changing the vector size, and a fully-connected layer with a softmax function projecting the vector to an $N$-class indicator $p_t$ as in (3), where $N$ is the task-specific number of classes.

$p_t = \mathrm{Teacher\_Classifier}(h_{L-1})$.    (3)

3.1.2 Branches

To provide FastBERT with more adaptability, multiple branches, i.e., the student classifiers, sharing the same architecture as the teacher, are added to the output of each Transformer block to enable early exits, especially for simple cases. The student classifiers can be described as in (4),

$p_{s_i} = \mathrm{Student\_Classifier}_i(h_i)$.    (4)

The student classifier is designed carefully to balance model accuracy and inference speed: an over-simple network may impair performance, while a heavy attention module severely slows down inference. Our classifier has proven to be lighter while retaining competitive accuracy; detailed verification is given in Section 4.1.
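For concreteness, below is a minimal PyTorch sketch of one classifier branch following the description above (a fully-connected layer narrowing the hidden size to 128, a self-attention layer joined with a fully-connected layer, and a softmax projection to N classes). It is an illustrative reading of the architecture, not the released implementation; the class name `FastBERTClassifier`, the number of attention heads, and the activation choice are our own assumptions.

```python
import torch
import torch.nn as nn

class FastBERTClassifier(nn.Module):
    """Sketch of a teacher/student classifier branch:
    fc (768 -> 128), self-attention over the narrowed features, fc, then an N-class softmax."""
    def __init__(self, hidden_size=768, narrow_size=128, num_classes=2, num_heads=2):
        super().__init__()
        self.narrow = nn.Linear(hidden_size, narrow_size)            # 768 -> 128
        self.self_attn = nn.MultiheadAttention(narrow_size, num_heads,
                                               batch_first=True)     # 128 -> 128
        self.dense = nn.Linear(narrow_size, narrow_size)             # 128 -> 128
        self.project = nn.Linear(narrow_size, num_classes)           # 128 -> N

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden_size) from one Transformer layer
        x = self.narrow(hidden_states)
        attn_out, _ = self.self_attn(x, x, x)
        x = torch.tanh(self.dense(attn_out))       # activation is an illustrative choice
        logits = self.project(x[:, 0])              # first-token vector as sentence summary
        return torch.softmax(logits, dim=-1)        # N-class probability distribution
```

For example, `FastBERTClassifier(num_classes=2)(torch.randn(8, 128, 768))` returns an (8, 2) matrix of class probabilities for a batch of eight sentences.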

3.2 Model training

FastBERT requires separate training steps for the backbone and the student classifiers. The parameters of one module are always frozen while the other module is being trained. The model is prepared for downstream inference in three steps: pre-training of the major backbone, fine-tuning of the entire backbone, and self-distillation for the student classifiers.

3.2.1 Pre-training

The pre-training of the backbone resembles that of BERT, in the same way that our backbone resembles BERT. Any pre-training method used for BERT-like models (e.g., BERT-WWM (Cui et al., 2019), RoBERTa (Liu et al., 2019), and ERNIE (Sun et al., 2019b)) can be applied directly. Note that the teacher classifier, as it is only used for downstream inference, stays unaffected at this stage. Conveniently, FastBERT does not even need to perform pre-training itself, as it can freely load high-quality pre-trained models.

3.2.2 Fine-tuning for backbone

For each downstream task, we feed the task-specific data into the model and fine-tune both the major backbone and the teacher classifier. The structure of the teacher classifier is as previously described. At this stage, the student classifiers are not yet enabled.

3.2.3 Self-distillation for branch

With the backbone well-trained for knowledge extraction, its output, as a high-quality soft label containing both the original embedding and the generalized knowledge, is distilled to train the student classifiers. As the students are mutually independent, their predictions $p_{s_i}$ are each compared with the teacher soft label $p_t$, with the difference measured by the KL-divergence in (5),

$D_{\mathrm{KL}}(p_{s_i}, p_t) = \sum_{j=1}^{N} p_{s_i}(j) \cdot \log \frac{p_{s_i}(j)}{p_t(j)}$.    (5)

As there are $L-1$ student classifiers in FastBERT, the sum of their KL-divergences is used as the total loss for self-distillation, formulated in (6),

$\mathrm{Loss}(p_{s_0}, \dots, p_{s_{L-2}}) = \sum_{i=0}^{L-2} D_{\mathrm{KL}}(p_{s_i}, p_t)$,    (6)

where $p_{s_i}$ refers to the probability distribution output by student classifier $i$.

Since this process only requires the teacher's output, we are free to use an unlimited amount of unlabeled data instead of being restricted to the labeled data. This provides sufficient resources for self-distillation, which means we can always improve the student performance as long as the teacher allows. Moreover, our method differs from previous distillation methods in that the teacher and the students lie within the same model. This learning process requires no additional pre-trained structures, making the distillation entirely a process of learning from oneself, i.e., self-distillation.
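A minimal sketch of the self-distillation objective in (5) and (6), assuming the student and teacher probability distributions have already been produced by the classifier branches; the function name and the batch averaging are illustrative choices, not taken from the released code.

```python
import torch

def self_distillation_loss(student_probs, teacher_probs, eps=1e-12):
    """Sum of KL(student || teacher) over all student classifiers, eq. (5)-(6).

    student_probs: list of tensors, each (batch, num_classes), one per student classifier.
    teacher_probs: tensor (batch, num_classes) from the teacher classifier; detached,
                   since only the students are updated during self-distillation.
    """
    teacher_probs = teacher_probs.detach()
    total = 0.0
    for p_s in student_probs:
        # D_KL(p_s || p_t) = sum_j p_s(j) * log(p_s(j) / p_t(j)), averaged over the batch
        kl = (p_s * (torch.log(p_s + eps) - torch.log(teacher_probs + eps))).sum(dim=-1)
        total = total + kl.mean()
    return total
```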

Operation    | Sub-operation                   | FLOPs    | Total FLOPs
Transformer  | Self-attention (768 → 768)      | 603.0M   | 1809.9M
             | Feedforward (768 → 3072 → 768)  | 1207.9M  |
Classifier   | Fully-connect (768 → 128)       | 25.1M    | 46.1M
             | Self-attention (128 → 128)      | 16.8M    |
             | Fully-connect (128 → 128)       | 4.2M     |
             | Fully-connect (128 → N)         | -        |
Table 1: FLOPs of each operation within FastBERT (M = million, N = the number of labels).

3.3 Adaptive inference

With the above steps, FastBERT is well-prepared to perform inference in an adaptive manner, which means we can adjust the number of executed encoding layers within the model according to the sample complexity.

At each Transformer layer, we measure, for each sample, whether the current inference is credible enough to be terminated.

Given an input sequence, the uncertainty of a student classifier's output $p_s$ is computed as the normalized entropy in (7),

$\mathrm{Uncertainty} = \frac{\sum_{i=1}^{N} p_s(i) \log p_s(i)}{\log \frac{1}{N}}$,    (7)

where $p_s$ is the output probability distribution and $N$ is the number of labeled classes.
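The normalized entropy in (7) can be computed as in the following short sketch; the function name is illustrative.

```python
import math
import torch

def uncertainty(probs, eps=1e-12):
    """Normalized entropy of a probability distribution, eq. (7).

    probs: tensor (batch, N) of class probabilities from a student classifier.
    Returns values in [0, 1]: 0 for a one-hot prediction, 1 for a uniform one.
    """
    n_classes = probs.size(-1)
    entropy = -(probs * torch.log(probs + eps)).sum(dim=-1)
    return entropy / math.log(n_classes)
```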

With the definition of the uncertainty, we make an important hypothesis.

Hypothesis 1.

LUHA: the Lower the Uncertainty, the Higher the Accuracy.

Definition 1.

Speed: The threshold to distinguish high and low uncertainty.

LUHA is verified in Section 4.4. Both Uncertainty and Speed range between 0 and 1. The adaptive inference mechanism can be described as follows: at each layer of FastBERT, the corresponding student classifier predicts the label of each sample and measures its Uncertainty. Samples with Uncertainty below the Speed threshold are sifted out to early exits, while samples with Uncertainty above the Speed move on to the next layer.

Intuitively, with a higher Speed, fewer samples are sent to higher layers and the overall inference is faster, and vice versa. Speed can therefore be used as a halting threshold that trades off inference accuracy against efficiency.
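Putting the pieces together, the sample-wise early exit described above can be sketched as the inference loop below. It assumes `transformer_layers` and `classifiers` are the trained backbone blocks and branch classifiers (the last classifier being the teacher, at which every remaining sample is forced to exit) and reuses the `uncertainty` function from (7); this is an illustration of the mechanism, not the authors' released code.

```python
import torch

@torch.no_grad()
def adaptive_inference(embeddings, transformer_layers, classifiers, speed=0.5):
    """Route each sample through only as many Transformer layers as its uncertainty requires.

    embeddings: (batch, seq_len, hidden) output of the embedding layer.
    transformer_layers: list of L callables mapping hidden states to hidden states.
    classifiers: list of L callables returning (batch, num_classes) probabilities;
                 the last one is the teacher classifier.
    speed: the Speed threshold; samples with uncertainty below it exit early.
    """
    batch_size = embeddings.size(0)
    predictions = torch.empty(batch_size, dtype=torch.long)
    active = torch.arange(batch_size)            # indices of samples still in the batch
    hidden = embeddings
    num_layers = len(transformer_layers)

    for k, (layer, classifier) in enumerate(zip(transformer_layers, classifiers)):
        hidden = layer(hidden)
        probs = classifier(hidden)
        unc = uncertainty(probs)                 # normalized entropy, eq. (7)
        is_last = (k == num_layers - 1)
        done = torch.ones_like(unc, dtype=torch.bool) if is_last else (unc < speed)
        predictions[active[done]] = probs[done].argmax(dim=-1)
        active, hidden = active[~done], hidden[~done]   # keep only the uncertain samples
        if active.numel() == 0:
            break
    return predictions
```

With `speed=0.0` no sample exits early (matching Figure 6(a) later), while larger values push more samples to shallow exits.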

Six Chinese datasets (each cell: Acc.  FLOPs (speedup)):

Dataset/Model          | ChnSentiCorp           | Book review            | Shopping review        | LCQMC                  | Weibo                  | THUCNews
BERT                   | 95.25  21785M (1.00x)  | 86.88  21785M (1.00x)  | 96.84  21785M (1.00x)  | 86.68  21785M (1.00x)  | 97.69  21785M (1.00x)  | 96.71  21785M (1.00x)
DistilBERT (6 layers)  | 88.58  10918M (2.00x)  | 83.31  10918M (2.00x)  | 95.40  10918M (2.00x)  | 84.12  10918M (2.00x)  | 97.69  10918M (2.00x)  | 95.54  10918M (2.00x)
DistilBERT (3 layers)  | 87.33  5428M (4.01x)   | 81.17  5428M (4.01x)   | 94.84  5428M (4.01x)   | 84.07  5428M (4.01x)   | 97.58  5428M (4.01x)   | 95.14  5428M (4.01x)
DistilBERT (1 layer)   | 81.33  1858M (11.72x)  | 77.40  1858M (11.72x)  | 91.35  1858M (11.72x)  | 71.34  1858M (11.72x)  | 96.90  1858M (11.72x)  | 91.13  1858M (11.72x)
FastBERT (speed=0.1)   | 95.25  10741M (2.02x)  | 86.88  13613M (1.60x)  | 96.79  4885M (4.45x)   | 86.59  12930M (1.68x)  | 97.71  3691M (5.90x)   | 96.71  3595M (6.05x)
FastBERT (speed=0.5)   | 92.00  3191M (6.82x)   | 86.64  5170M (4.21x)   | 96.42  2517M (8.65x)   | 84.05  6352M (3.42x)   | 97.72  3341M (6.51x)   | 95.64  1979M (11.00x)
FastBERT (speed=0.8)   | 89.75  2315M (9.40x)   | 85.14  3012M (7.23x)   | 95.72  2087M (10.04x)  | 77.45  3310M (6.57x)   | 97.69  1982M (10.09x)  | 94.97  1854M (11.74x)

Six English datasets (each cell: Acc.  FLOPs (speedup)):

Dataset/Model          | Ag.News                | Amz.F                  | DBpedia                | Yahoo                  | Yelp.F                 | Yelp.P
BERT                   | 94.47  21785M (1.00x)  | 65.50  21785M (1.00x)  | 99.31  21785M (1.00x)  | 77.36  21785M (1.00x)  | 65.93  21785M (1.00x)  | 96.04  21785M (1.00x)
DistilBERT (6 layers)  | 94.64  10872M (2.00x)  | 64.05  10872M (2.00x)  | 99.10  10872M (2.00x)  | 76.73  10872M (2.00x)  | 64.25  10872M (2.00x)  | 95.31  10872M (2.00x)
DistilBERT (3 layers)  | 93.98  5436M (4.00x)   | 63.84  5436M (4.00x)   | 99.05  5436M (4.00x)   | 76.56  5436M (4.00x)   | 63.50  5436M (4.00x)   | 93.23  5436M (4.00x)
DistilBERT (1 layer)   | 92.88  1816M (12.00x)  | 59.48  1816M (12.00x)  | 98.95  1816M (12.00x)  | 74.93  1816M (12.00x)  | 58.59  1816M (12.00x)  | 91.59  1816M (12.00x)
FastBERT (speed=0.1)   | 94.38  6013M (3.62x)   | 65.50  21005M (1.03x)  | 99.28  2060M (10.57x)  | 77.37  16172M (1.30x)  | 65.93  20659M (1.05x)  | 95.99  6668M (3.26x)
FastBERT (speed=0.5)   | 93.14  2108M (10.33x)  | 64.64  10047M (2.16x)  | 99.05  1854M (11.74x)  | 76.57  4852M (4.48x)   | 64.73  9827M (2.21x)   | 95.32  3456M (6.30x)
FastBERT (speed=0.8)   | 92.53  1858M (11.72x)  | 61.70  2356M (9.24x)   | 99.04  1853M (11.75x)  | 75.05  1965M (11.08x)  | 60.66  2602M (8.37x)   | 94.31  2460M (8.85x)

Table 2: Comparison of accuracy (Acc.) and FLOPs (speedup) between FastBERT and the baselines on six Chinese and six English datasets.
Figure 3: The trade-offs of FastBERT on twelve datasets (six Chinese and six English): (a) and (d) are Speed-Accuracy relations, showing how accuracy changes with Speed (the threshold on Uncertainty); (b) and (e) are Speed-Speedup relations, indicating that Speed governs the adaptability of FastBERT; (c) and (f) are Speedup-Accuracy relations, i.e., the trade-off between efficiency and accuracy.

4 Experimental results

In this section, we will verify the effectiveness of FastBERT on twelve NLP datasets (six in English and six in Chinese) with detailed explanations.

4.1 FLOPs analysis

The number of floating-point operations (FLOPs) is a measure of a model's computational complexity, indicating how many floating-point operations the model performs for a single inference. FLOPs are independent of the operating environment (CPU, GPU, or TPU) and reveal only the computational complexity. Generally speaking, the larger a model's FLOPs, the longer its inference time. At the same accuracy, models with lower FLOPs are more efficient and more suitable for industrial use.

We list the measured FLOPs of both structures in Table 1, from which we can infer that the computational load (FLOPs) of the classifier is much lighter than that of the Transformer. This is the basis of FastBERT's speed-up: although it adds extra classifiers, it achieves acceleration by removing far more computation from the Transformers.
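As a rough back-of-the-envelope check based on Table 1, executing all twelve added classifiers costs about 12 × 46.1M ≈ 553M FLOPs, which is less than a third of a single Transformer block (1809.9M), so exiting even one layer early already outweighs the classifier overhead. The snippet below simply reproduces this arithmetic.

```python
# Rough arithmetic from Table 1 (values in millions of FLOPs).
TRANSFORMER_FLOPS = 1809.9   # one Transformer block
CLASSIFIER_FLOPS = 46.1      # one student/teacher classifier
NUM_LAYERS = 12

worst_case_overhead = NUM_LAYERS * CLASSIFIER_FLOPS   # every classifier is executed
saving_per_skipped_layer = TRANSFORMER_FLOPS          # each skipped block saves this much

print(f"classifier overhead (worst case): {worst_case_overhead:.1f}M FLOPs")
print(f"saving from skipping one layer:   {saving_per_skipped_layer:.1f}M FLOPs")
# Skipping a single Transformer layer (1809.9M) already more than pays for
# running all twelve classifiers (~553.2M).
```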

4.2 Baseline and dataset

4.2.1 Baseline

In this section, we compare FastBERT against two baselines: the original 12-layer BERT (Devlin et al., 2019) and DistilBERT (Sanh et al., 2019) with 6, 3, and 1 Transformer layers, respectively.

4.2.2 Dataset

To verify the effectiveness of FastBERT, especially in industrial scenarios, six Chinese and six English datasets that press close to actual applications are used. The six Chinese datasets include five sentence classification tasks (ChnSentiCorp, Book review (Qiu et al., 2018), Shopping review, Weibo, and THUCNews) and a sentence-matching task (LCQMC (Liu et al., 2018)). All the Chinese datasets are available at the FastBERT project. The six English datasets (Ag.News, Amz.F, DBpedia, Yahoo, Yelp.F, and Yelp.P) are sentence classification tasks released in (Zhang et al., 2015).

4.3 Performance comparison

To perform a fair comparison, BERT, DistilBERT, and FastBERT all adopt the same configuration, as follows. In this paper, L = 12. The number of self-attention heads, the hidden dimension of the embedding vectors, and the maximum length of the input sentence are set to 12, 768, and 128, respectively. Both FastBERT and BERT use the pre-trained parameters provided by Google, while DistilBERT is pre-trained following (Sanh et al., 2019). We fine-tune these models using the AdamW algorithm (Loshchilov and Hutter, 2017) with learning-rate warmup, and select the model with the best accuracy within 3 epochs. For the self-distillation of FastBERT, we increase the learning rate and distill for 5 epochs.

We evaluate the text inference capabilities of these models on the twelve datasets and report their accuracy (Acc.) and sample-averaged FLOPs under different Speed values. The comparison results are shown in Table 2, where the speedup is obtained by using BERT as the benchmark. It can be observed that with Speed set to 0.1, FastBERT speeds up 2 to 5 times without losing accuracy on most datasets. If a small loss of accuracy is tolerated, FastBERT can be 7 to 11 times faster than BERT. Compared to DistilBERT, FastBERT trades less accuracy for higher efficiency. Figure 3 illustrates FastBERT's trade-off between accuracy and efficiency. The speedup ratio of FastBERT can be freely adjusted between 1 and 12 while the loss of accuracy remains small, which is a very attractive feature in industry.

Figure 4: The relation between classifier accuracy and average case uncertainty: three classifiers at the bottom, in the middle, and on top of FastBERT were analyzed, and their accuracy within various uncertainty intervals was calculated on the Book review dataset.

4.4 LUHA hypothesis verification

As described in Section 3.3, the adaptive inference of FastBERT is based on the LUHA hypothesis, i.e., "the Lower the Uncertainty, the Higher the Accuracy". Here, we verify this hypothesis on the Book review dataset. We intercept the classification results of Student-Classifier0, Student-Classifier5, and the Teacher-Classifier in FastBERT, then compute their accuracy within each uncertainty interval. As shown in Figure 4, the statistics confirm that each classifier follows the LUHA hypothesis, no matter whether it sits at the bottom, in the middle, or on top of the model.

From Figure 4, it would be easy to conclude mistakenly that the Students outperform the Teacher, since the accuracy of the Students in each uncertainty interval is higher than that of the Teacher. This conclusion is refuted by analyzing Figure 6(a) together with Figure 4: for the Teacher, more samples are located in the low-uncertainty regions, while the Students' samples are nearly uniformly distributed. Therefore, the overall accuracy of the Teacher is still higher than that of the Students.

4.5 In-depth study

In this section, we conduct in-depth analyses of FastBERT from three perspectives: the distribution of exit layers, the distribution of sample uncertainty, and the convergence during self-distillation.

Figure 5: The distribution of executed layers on average in the Book review dataset, with experiments at three different speeds (0.3, 0.5 and 0.8).

4.5.1 Layer distribution

In FastBERT, each sample walks through a different number of Transformer layers due to its varied complexity, and fewer executed layers require less computation. As illustrated in Figure 5, we investigate the distribution of exit layers under different Speed constraints (0.3, 0.5, and 0.8) on the Book review dataset. Taking Speed=0.8 as an example, 61% of the samples complete their inference at the first layer, Transformer0, which eliminates a great deal of unnecessary calculation in the following eleven layers.
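To relate an exit-layer distribution to computational cost, the expected per-sample FLOPs is the sum over layers of the exit fraction times the cumulative cost up to that layer. The sketch below illustrates the calculation with a hypothetical distribution whose first-layer fraction matches the 61% figure above; the remaining fractions are invented for illustration only.

```python
# Expected per-sample cost for a (hypothetical) exit-layer distribution.
TRANSFORMER_FLOPS = 1809.9   # millions, from Table 1
CLASSIFIER_FLOPS = 46.1

# exit_fraction[k] = fraction of samples exiting after layer k+1 (illustrative numbers only)
exit_fraction = [0.61, 0.15, 0.08, 0.05, 0.03, 0.02, 0.02, 0.01, 0.01, 0.01, 0.005, 0.005]

expected_flops = sum(
    frac * (k + 1) * (TRANSFORMER_FLOPS + CLASSIFIER_FLOPS)   # (k+1) blocks and classifiers run
    for k, frac in enumerate(exit_fraction)
)
print(f"expected cost: {expected_flops:.0f}M FLOPs per sample "
      f"vs {12 * TRANSFORMER_FLOPS:.0f}M for the full 12-layer pass")
```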

4.5.2 Uncertainty distribution

Figure 6: The distribution of Uncertainty at different layers of FastBERT on the Book review dataset: (a) Speed is set to 0.0, which means that all samples pass through all twelve layers; (b)-(d) Speed is set to 0.3, 0.5, and 0.8 respectively, and only the samples with Uncertainty higher than Speed are sent to the next layer.

The distribution of sample uncertainty predicted by the different student classifiers varies, as illustrated in Figure 6. Observing these distributions helps us understand FastBERT further. From Figure 6(a), it can be concluded that the higher the layer, the lower the uncertainty at a given Speed, indicating that the higher-layer classifiers are more decisive than the lower ones. It is worth noting that at higher layers there are still samples with uncertainty below the threshold (i.e., the Speed), because these higher-layer classifiers may reverse earlier judgments made by the lower-layer classifiers.

4.5.3 Convergence of self-distillation

Figure 7: The change in accuracy and FLOPs of FastBERT during fine-tuning and self-distillation on the Book review dataset. The accuracy first increases during the fine-tuning stage, while self-distillation then reduces the FLOPs by six times with almost no loss in accuracy.

Self-distillation is a crucial step in enabling FastBERT. This process grants the student classifiers the ability to infer, thereby offloading work from the teacher classifier. Taking the Book review dataset as an example, we fine-tune FastBERT for three epochs and then self-distill it for five more. Figure 7 illustrates its convergence in accuracy and FLOPs during fine-tuning and self-distillation. It can be observed that accuracy increases with fine-tuning, while FLOPs decrease during the self-distillation stage.

4.6 Ablation study

Adaptation and self-distillation are two crucial mechanisms in FastBERT. We have performed ablation studies to investigate the effects of these two mechanisms using the Book review and Yelp.P datasets. The results are presented in Table 3, in which 'without self-distillation' implies that all classifiers, including both the teacher and the students, are trained during fine-tuning, while 'without adaptive inference' means that the number of executed layers for each sample is fixed to two or six.

From Table 3, we observe that: (1) at almost the same level of speedup, FastBERT without self-distillation or adaptation performs worse; (2) when the model is accelerated more than five times, downstream accuracy degrades dramatically without adaptation. It is safe to conclude that both adaptation and self-distillation play key roles in FastBERT, which achieves significant speedups with favorably small losses of accuracy.

Config.                               | Book review             | Yelp.P
                                      | Acc.   FLOPs (speedup)  | Acc.   FLOPs (speedup)
FastBERT (speed=0.2)                  | 86.98  9725M (2.23x)    | 95.90  52783M (4.12x)
FastBERT (speed=0.7)                  | 85.69  3621M (6.01x)    | 94.67  2757M (7.90x)
Without self-distillation (speed=0.2) | 86.22  9921M (2.19x)    | 95.55  4173M (5.22x)
Without self-distillation (speed=0.7) | 85.02  4282M (5.08x)    | 94.54  2371M (9.18x)
Without adaptive inference (layer=6)  | 86.42  11123M (1.95x)   | 95.18  11123M (1.95x)
Without adaptive inference (layer=2)  | 82.88  3707M (5.87x)    | 93.11  3707M (5.87x)
Table 3: Results of the ablation studies on the Book review and Yelp.P datasets.

5 Conclusion

In this paper, we propose a fast version of BERT, namely FastBERT. Specifically, FastBERT adopts a self-distillation mechanism during the training phase and an adaptive mechanism in the inference phase, achieving the goal of gaining efficiency with little accuracy loss. Self-distillation and adaptive inference are proposed here for the first time. In addition, FastBERT has a very practical feature for industrial scenarios: its inference speed is tunable.

Our experiments demonstrate promising results on twelve NLP datasets. Empirical results show that FastBERT can be 2 to 3 times faster than BERT without performance degradation. If we relax the tolerated loss in accuracy, the model is free to tune its speedup between 1 and 12 times. Besides, FastBERT remains compatible with the parameter settings of other BERT-like models (e.g., BERT-WWM, ERNIE, and RoBERTa), which means these publicly available models can be readily loaded for FastBERT initialization.

6 Future work

These promising results point to future work on (1) linearizing the Speed-Speedup curve; (2) extending this approach to other pre-trained architectures such as XLNet (Yang et al., 2019) and ELMo (Peters et al., 2018); and (3) applying FastBERT to a wider range of NLP tasks, such as named entity recognition and machine translation.

Acknowledgments

This work is funded by 2019 Tencent Rhino-Bird Elite Training Program.

References

  • Y. Cui, W. Che, T. Liu, B. Qin, Z. Yang, S. Wang, and G. Hu (2019) Pre-training with whole word masking for chinese BERT. arXiv preprint arXiv:1906.08101. External Links: Link Cited by: §3.2.1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of ACL, pp. 4171–4186. External Links: Link, Document Cited by: §1, §2, §3.1.1, 1st item.
  • M. Figurnov, M. D. Collins, Y. Zhu, L. Zhang, J. Huang, D. Vetrov, and R. Salakhutdinov (2017) Spatially adaptive computation time for residual networks. In Proceedings of CVPR, pp. 1790–1799. External Links: Link Cited by: §2.
  • Y. Gong, L. Liu, M. Yang, and L. Bourdev (2014) Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115. External Links: Link Cited by: §1.
  • A. Graves (2016) Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983. External Links: Link Cited by: §2.
  • S. Han, J. Pool, J. Tran, and W. Dally (2015) Learning both weights and connections for efficient neural network. In Advances in NeurIPS, pp. 1135–1143. External Links: Link Cited by: §1.
  • G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. Computer Science 14 (7), pp. 38–39. External Links: Link Cited by: §1.
  • X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu (2019) TinyBERT: distilling BERT for natural language understanding. arXiv preprint arXiv:1909.10351. External Links: Link Cited by: §2.
  • Y. Kim (2014) Convolutional neural networks for sentence classification. In Proceedings of EMNLP, pp. 1746–1751. External Links: Link, Document Cited by: §2.
  • O. Kovaleva, A. Romanov, A. Rogers, and A. Rumshisky (2019) Revealing the dark secrets of BERT. In Proceedings of EMNLP-IJCNLP, pp. 4356–4365. External Links: Link Cited by: §1.
  • Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2019) ALBERT: a lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942. External Links: Link Cited by: §2.
  • W. Liu, P. Zhou, Z. Zhao, Z. Wang, Q. Ju, H. Deng, and P. Wang (2020) K-BERT: enabling language representation with knowledge graph. In Proceedings of AAAI, External Links: Link Cited by: §2.
  • X. Liu, Q. Chen, C. Deng, H. Zeng, J. Chen, D. Li, and B. Tang (2018) Lcqmc: a large-scale chinese question matching corpus. In Proceedings of the ICCL, pp. 1952–1962. External Links: Link Cited by: §4.2.2.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. External Links: Link Cited by: §2, §3.2.1.
  • I. Loshchilov and F. Hutter (2017) Fixing weight decay regularization in Adam. arXiv preprint arXiv:1711.05101. External Links: Link Cited by: §4.3.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proceedings of NAACL-HLT, pp. 2227–2237. External Links: Link Cited by: §6.
  • Y. Qiu, H. Li, S. Li, Y. Jiang, R. Hu, and L. Yang (2018) Revisiting correlations between intrinsic and extrinsic evaluations of word embeddings. In Proceedings of CCL, pp. 209–221. External Links: Link Cited by: §4.2.2.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. Technical report, OpenAI.. External Links: Link Cited by: §1.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2019) Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683. External Links: Link Cited by: §2.
  • A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio (2014) Fitnets: hints for thin deep nets. arXiv preprint arXiv:1412.6550. External Links: Link Cited by: §1.
  • V. Sanh, L. Debut, J. Chaumond, and T. Wolf (2019) DistilBERT, a distilled version of bert: smaller, faster, cheaper and lighter. In NeurIPS EMC2 Workshop, External Links: Link Cited by: §2, 2nd item, §4.3.
  • S. Sun, Y. Cheng, Z. Gan, and J. Liu (2019a) Patient knowledge distillation for bert model compression. In Proceedings of EMNLP-IJCNLP, pp. 4314–4323. External Links: Link Cited by: §2.
  • Y. Sun, S. Wang, Y. Li, S. Feng, X. Chen, H. Zhang, X. Tian, D. Zhu, H. Tian, and H. Wu (2019b) ERNIE: enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223. External Links: Link Cited by: §3.2.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in NeurIPS, pp. 5998–6008. External Links: Link Cited by: §2, §3.1.1.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman (2018) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of EMNLP, pp. 353–355. External Links: Link Cited by: §1.
  • B. Wang (2018) Disconnected recurrent neural networks for text categorization. In Proceedings of ACL, pp. 2311–2320. External Links: Link, Document Cited by: §2.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237. External Links: Link Cited by: §1, §6.
  • X. Zhang, J. Zhao, and Y. LeCun (2015) Character-level convolutional networks for text classification. In Advances in NeurIPS, pp. 649–657. External Links: Link Cited by: §4.2.2.
  • Z. Zhao, H. Chen, J. Zhang, X. Zhao, T. Liu, W. Lu, X. Chen, H. Deng, Q. Ju, and X. Du (2019) UER: an open-source toolkit for pre-training models. In Proceedings of EMNLP-IJCNLP 2019, pp. 241. External Links: Link Cited by: §2.