1 Introduction
With the recent success of unsupervised language pretraining [17, 5, 31, 16, 3, 11, 9, 18, 13, 22, 15, 23], the power of neural self-attention models (a.k.a. Transformer) [26] has been pushed to a new level, leading to dramatic advancements in machine learning and natural language processing (NLP). More importantly, it has been observed that with more FLOPs invested in longer pretraining and/or larger models, the performance of pretrained Transformer models consistently improves. However, it is extremely expensive to pretrain or even just to finetune the state-of-the-art self-attention models, as they require many more FLOPs and much more memory than traditional models in NLP. This largely limits their application and success in more fields.
Given this challenge, there has been an increasing amount of effort to reduce the costs of pretraining and finetuning self-attention models. From the perspective of post-pretraining processing, typical approaches include distillation, pruning and quantization of various kinds, which try to derive a lighter model from a well-pretrained model by taking advantage of the richer signals in the larger model or learning to remove less important operations. Another line of research aims at designing an architecture that not only has a lower resource-to-performance ratio (more efficient) but also scales as well as the Transformer, at least in certain domains. Most such methods build upon the Transformer backbone and focus on redesigning its building blocks. Representative solutions include searching for better micro operation or macro module designs [21, 2], replacing the full pairwise attention with local operations such as convolution [29] and dynamic convolution [28], and optimizing the hidden size combinations for existing blocks [25].
Across the wide variety of ideas mentioned above, a common strategy is to identify redundant operations or representations and replace them with more efficient ones. Inspired by this line of thinking, in this work we focus on the potential redundancy induced by always maintaining a full-length sequence of hidden representations across all layers in Transformer. Intuitively, for many sequence-level NLP tasks such as text classification and ranking, the most common use case is to extract a single vector from the entire sequence, which does not necessarily preserve all information down to the token-level granularity. Hence, for such tasks, the full-length sequence of hidden states may contain significant redundancy. This is analogous to the case of image recognition, where the convolutional neural network gradually reduces the spatial resolution/size of feature maps as the network goes deeper. In addition, linguistic prior also encourages gradually merging nearby tokens (words) into larger semantic units (phrases), which naturally leads to a shorter sequence of representations.
Concretely, we propose to gradually reduce the sequential resolution (i.e. length) of the hidden representation in self-attention models. Immediately, the reduction in sequence length can lead to significant savings in both FLOPs and memory. More importantly, the saved computational resource can be directly re-invested in constructing a deeper (or wider) model to boost the model capacity without additional computational burden. In addition, to address the challenge that common pretraining objectives such as masked language modeling (MLM) [5] require separate representations for each token, we design a simple strategy to decode a full-length sequence of deep representations from the hidden state of reduced length. As a result, the proposed model can be directly trained without modifying the pretraining objectives, as well as adopted for downstream tasks that require token-level representations.
Empirically, with comparable or even fewer FLOPs, by trading sequential resolution for depth, our proposed model achieves improved performance over the standard Transformer on a wide variety of sequence-level prediction tasks, including text classification, language understanding, and reading comprehension.
2 Method
2.1 Background
Transformer Architecture
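The Transformer architecture [26] is a highly modularized neural network, where each Transformer layer consists of two sub-modules, namely the multi-head self-attention (S-Attn) and position-wise feed-forward network (P-FFN). Both sub-modules are wrapped by a residual connection and layer normalization. Schematically, given a length-$T$ sequence of hidden states $\mathbf{h} = [h_1, \dots, h_T]$, the computation of a single Transformer layer can be expressed as
$$\mathbf{h} \leftarrow \mathrm{LayerNorm}\big(\mathbf{h} + \text{S-Attn}(\mathbf{Q} = \mathbf{h}, \mathbf{KV} = \mathbf{h})\big), \tag{1}$$
$$h_i \leftarrow \mathrm{LayerNorm}\big(h_i + \text{P-FFN}(h_i)\big), \quad \forall i = 1, \dots, T. \tag{2}$$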
Pretraining Objectives
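The most commonly used pretraining objective is the masked language modeling (MLM) proposed by BERT [5]. For a length-$T$ natural language sequence $\mathbf{x}$ sampled from a large unlabeled set $\mathcal{D}$, the MLM objective first constructs a corrupted sequence $\hat{\mathbf{x}}$ by randomly replacing 15% of the tokens of $\mathbf{x}$ with a special token [mask] and then trains a Transformer model [5] to reconstruct the original $\mathbf{x}$ based on $\hat{\mathbf{x}}$, i.e.,
$$\max_{\theta}\ \mathcal{J}_{\mathrm{MLM}}(\theta) = \mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\, \mathbb{E}_{\mathcal{I}} \sum_{i \in \mathcal{I}} \log P_{\theta}(x_i \mid \hat{\mathbf{x}}_{\mathcal{I}}) = \mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\, \mathbb{E}_{\mathcal{I}} \sum_{i \in \mathcal{I}} \log \frac{\exp\big(e(x_i)^{\top} h_i(\hat{\mathbf{x}}_{\mathcal{I}})\big)}{\sum_{x'} \exp\big(e(x')^{\top} h_i(\hat{\mathbf{x}}_{\mathcal{I}})\big)},$$
where $\mathcal{I}$ is the set of positions of masked tokens, the subscript in $\hat{\mathbf{x}}_{\mathcal{I}}$ emphasizes its dependence on $\mathcal{I}$, $e(x)$ denotes the embedding of the token $x$, and $h_i(\hat{\mathbf{x}}_{\mathcal{I}})$ is the last-layer hidden state at position $i$ produced by the Transformer model. After pretraining, the entire model is finetuned in downstream tasks.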
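The corruption step of MLM can be illustrated with a minimal sketch. It assumes whitespace tokens, a fixed number of masked positions (rather than independent per-token sampling), and hypothetical helper names (`corrupt_for_mlm`, `MASK`) that are not from the paper; it only builds the corrupted sequence and the masked positions, not the training loss.

```python
import random

MASK = "[mask]"  # hypothetical string standing in for the special mask token


def corrupt_for_mlm(tokens, mask_rate=0.15, seed=0):
    """Return (corrupted tokens, sorted masked positions)."""
    rng = random.Random(seed)
    # Assumption: mask a fixed count, and at least one position.
    num_to_mask = max(1, round(len(tokens) * mask_rate))
    masked_positions = sorted(rng.sample(range(len(tokens)), num_to_mask))
    corrupted = list(tokens)
    for i in masked_positions:
        corrupted[i] = MASK
    return corrupted, masked_positions


tokens = "the quick brown fox jumps over the lazy dog again and again today".split()
corrupted, positions = corrupt_for_mlm(tokens)
```

The model is then trained to predict `tokens[i]` at every `i` in `positions`, which is why a hidden state is needed at each of those positions.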
To show the generality of our proposed model, we also experiment with another pretraining objective, ELECTRA [3]. Different from MLM, ELECTRA relies on a pair of jointly trained generator and discriminator. Specifically, the generator usually has a smaller size (1/4 of that of the discriminator) and is directly trained via the MLM objective, i.e., $\max_{\theta_G} \mathcal{J}_{\mathrm{MLM}}(\theta_G)$. Then, for each masked position, a token is sampled from the reconstruction distribution of the generator to replace the [mask] token and form a new sequence $\tilde{\mathbf{x}}$, i.e., if $i \in \mathcal{I}$, $\tilde{x}_i \sim P_{\theta_G}(x_i \mid \hat{\mathbf{x}}_{\mathcal{I}})$, else $\tilde{x}_i = x_i$. Given the new sequence $\tilde{\mathbf{x}}$, the discriminator is then trained to distinguish whether each token in $\tilde{\mathbf{x}}$ is real (same as $\mathbf{x}$) or fake (different from $\mathbf{x}$) via binary classification. After pretraining, only the discriminator will be used during finetuning and the generator is simply discarded.
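As a rough illustration of how the discriminator's inputs and targets are formed, the sketch below fills the masked positions with tokens that are assumed to have already been sampled from a generator, then labels each position as real or fake. The function name and the toy tokens are hypothetical; no actual generator or discriminator is involved.

```python
def build_discriminator_example(original, masked_positions, generator_samples):
    """Fill masked positions with generator samples and label every token.

    Label 1 means real (same as the original token), 0 means fake.
    """
    replaced = list(original)
    for i, sampled in zip(masked_positions, generator_samples):
        replaced[i] = sampled
    labels = [int(r == o) for r, o in zip(replaced, original)]
    return replaced, labels


original = ["the", "chef", "cooked", "the", "meal"]
# Assumed generator outputs for positions 1 and 2.
replaced, labels = build_discriminator_example(original, [1, 2], ["chef", "ate"])
```

Note that a sampled token that happens to equal the original one is labeled real, matching the "same as the original" definition in the text.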
Discussion
Note that both pretraining objectives introduced above require the ability to produce a hidden state for each input token, i.e., the last-layer hidden states $h_i(\hat{\mathbf{x}}_{\mathcal{I}})$ for MLM and $h_i(\tilde{\mathbf{x}})$ for ELECTRA. Due to this requirement, it seems natural to keep a full sequence of hidden states. However, in contrast, many sequence-level downstream tasks like classification or ranking only need a single-vector summary of the entire sequence. Fundamentally, this suggests that some kind of compression is usually required to remove the unnecessary redundancy during finetuning. This observation immediately leads to the following two questions:

- Can we design a general model that is equally expressive but more efficient by compressing the full sequence of hidden states into a more compact form?
- With the compressed representations, how can the model retain the ability to produce token-level representations for pretraining?

To answer these two questions, we next present our proposed architecture.
2.2 Proposed Architecture
To inherit the high capacity and optimization advantages of the Transformer architecture, the proposed model keeps the same overall skeleton of interleaved S-Attn and P-FFN sub-modules wrapped by residual connection and layer normalization. But differently, to achieve representation compression and computation reduction, our model employs an encoder that gradually reduces the sequence length of the hidden states as the layer gets deeper. In addition, for tasks involving per-token predictions like pretraining, a simple decoder is used to reconstruct a full sequence of token-level representations from the compressed encoder output.
Encoder
As illustrated in the left part of Fig. 1, the encoder consists of several blocks of consecutive Transformer layers. Within each block, the sequence length of the hidden states always remains the same. But when going from a lower-level block to a higher-level block, the length of the hidden sequence is reduced by performing a certain type of pooling along the sequence dimension, i.e.,
$$\mathbf{h}' \leftarrow \mathrm{Pooling}(\mathbf{h}), \tag{3}$$
where $\mathbf{h} \in \mathbb{R}^{T \times D}$ and $\mathbf{h}' \in \mathbb{R}^{T' \times D}$ for some $T' < T$. Importantly, instead of directly feeding the pooled sequence $\mathbf{h}'$ into the first S-Attn layer of the new block, we only use the pooled sequence $\mathbf{h}'$ to construct the query vector (and the residual signal) of the self-attention, while the unpooled sequence $\mathbf{h}$ serves the role of key and value vectors, i.e.,
$$\mathbf{h} \leftarrow \mathrm{LayerNorm}\big(\mathbf{h}' + \text{S-Attn}(\mathbf{Q} = \mathbf{h}', \mathbf{KV} = \mathbf{h})\big). \tag{4}$$
Note that the output sequence of this special S-Attn module has the same length as the pooled sequence $\mathbf{h}'$. To understand the advantage of this particular design, it is helpful to compare the proposed “pool-query-only” variant with the naive alternative of using $\mathbf{h}'$ for both the query and key-value vectors, i.e., $\text{S-Attn}(\mathbf{Q} = \mathbf{h}', \mathbf{KV} = \mathbf{h}')$:

- Under the naive approach, the compression is solely controlled by the pooling operation, which is finished before the attention module. Hence, relatively simple pooling methods such as average/mean pooling won’t be able to achieve good compression.
- Under the pool-query-only variant, the compression depends not only on how the pooling is performed, but also on how the self-attention forms each pooled vector as a weighted sum of the unpooled sequence. Effectively, the particular attention here can be seen as a type of linear compression that combines $T$ bases into a smaller number of $T'$ “compressed bases”. Therefore, with minimum computational overhead, this variant makes the compression operation more expressive.
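The difference between the two variants can be made concrete with a small sketch of the pool-query-only attention. It is a deliberately simplified illustration, not the authors' implementation: it uses a single head, no learned projections and no layer normalization, plain Python lists instead of tensors, and hypothetical function names.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Single-head dot-product attention without learned projections."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, keys_values))
                        for d in range(dim)])
    return outputs


def mean_pool(h, window=2):
    """Strided mean pooling with stride equal to the window size."""
    return [[sum(col) / len(col) for col in zip(*h[i:i + window])]
            for i in range(0, len(h), window)]


def pool_query_only(h):
    pooled = mean_pool(h)      # pooled sequence, length T' = T / 2
    attn = attend(pooled, h)   # queries are pooled, keys/values are unpooled
    # The residual also uses the pooled sequence; LayerNorm is omitted here.
    return [[p + a for p, a in zip(p_vec, a_vec)]
            for p_vec, a_vec in zip(pooled, attn)]


h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
out = pool_query_only(h)
```

Each of the $T'$ output vectors is a weighted combination of all $T$ unpooled vectors, so the attention weights, and not only the fixed mean, decide what survives the compression. The naive variant would instead call `attend(pooled, pooled)`.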
With this particular pool-query-only design in place, we find that the simplest strided mean pooling applied to each sliding window of the sequence works very well in practice. For simplicity, we only experiment with stride 2 and window size 2 in this work. Hence, the pooling operation reduces the sequence length by half and each pooled hidden state corresponds to a window of 2 unpooled hidden vectors. Intuitively, this type of pooling roughly follows the linguistic prior that nearby tokens could be gradually merged (or compressed) into a larger semantic component. Once the sequence length is halved after the pooling and pool-query-only attention, the rest of the encoder computation simply follows the standard updates in Eqn. (1) and (2).
Finally, as an extra implementation detail, recall that a particular design in language pretraining is to add a special token [cls] to the beginning of the original input sequence, and use the last-layer hidden state corresponding to [cls] (i.e., $h_1$) as the representation of the sequence. To prevent the pooling from destroying this special structure, we first separate the [cls] hidden state from the rest of the hidden states and only apply the pooling to the rest. For some practical implementation issues and an efficient solution, we refer readers to Appendix A.1.
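A minimal sketch of the pooling with a separate [cls] state is given below. It assumes the first vector is the [cls] state and uses plain Python lists; the function name is hypothetical, and the efficient implementation referred to in Appendix A.1 may differ.

```python
def pool_keep_cls(h, window=2):
    """Mean-pool every state except the first ([cls]) with stride == window."""
    cls_state, rest = h[0], h[1:]
    pooled_rest = [
        [sum(col) / len(col) for col in zip(*rest[i:i + window])]
        for i in range(0, len(rest), window)
    ]
    return [cls_state] + pooled_rest


# One [cls] state followed by four token states of dimension 2.
h = [[9.0, 9.0], [1.0, 3.0], [3.0, 5.0], [2.0, 2.0], [4.0, 6.0]]
pooled = pool_keep_cls(h)
```

The [cls] vector passes through unchanged, while each remaining pair of neighbors is averaged into one vector.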
Decoder
In order to recover a full sequence of hidden states from the encoder output of reduced length, a natural idea would be performing some kind of upsampling. For instance, in image generation or super-resolution, deconvolution (transposed convolution) or parameter-free resizing with bilinear interpolation are often used to increase the spatial resolution of the feature map. Hence, we can simply adapt these ideas from 2D processing to our 1D case and apply proper upsampling to the encoder output.
However, instead of performing multiple upsamplings with a small expansion rate (e.g. increasing the sequence length by 2x each time) as in the image domain, we here choose to employ a single upsampling with a large expansion rate, as shown on the right part of Fig. 1. Specifically, given the output sequence $\mathbf{h}^{M}$ of length $T_M = T / 2^{M-1}$ from an $M$-block encoder, we directly upsample it to a full-length sequence $\mathbf{h}^{\mathrm{up}} = [h^{\mathrm{up}}_1, \dots, h^{\mathrm{up}}_T]$ by repeating each hidden vector $2^{M-1}$ times:
$$\forall i = 1, \dots, T, \quad h^{\mathrm{up}}_i = h^{M}_{i // 2^{M-1}}, \tag{5}$$
where $\cdot // \cdot$ denotes floor division. However, note that every $2^{M-1}$ consecutive vectors in $\mathbf{h}^{\mathrm{up}}$ are exactly the same and hence do not contain detailed token-level information. Hence, we further extract the last-layer hidden states from the first block of the encoder, $\mathbf{h}^{1}$, which still has the full length $T$ and contains the uncompressed token-level information. Then, the lower-level representation $\mathbf{h}^{1}$ and the upsampled higher-level representation $\mathbf{h}^{\mathrm{up}}$ are added together to form a deep token-level representation $\mathbf{g} = \mathbf{h}^{1} + \mathbf{h}^{\mathrm{up}}$. Effectively, this forms a residual/skip connection that enables detailed token information and potentially easier optimization. In addition, we stack a few more Transformer layers upon $\mathbf{g}$ to achieve a better deep fusion of the low-level and high-level features. In this work, we always use 2 Transformer layers in the decoder.
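The upsampling and skip connection can be sketched as follows. This is an illustrative simplification with hypothetical function names: it uses 0-based positions, ignores the special handling of [cls], assumes the full length is divisible by the expansion rate, and leaves out the Transformer layers stacked on top.

```python
def upsample_by_repeat(h_top, num_blocks):
    """Repeat each top-level vector 2**(num_blocks - 1) times."""
    rate = 2 ** (num_blocks - 1)
    return [list(vec) for vec in h_top for _ in range(rate)]


def decoder_input(h_first_block, h_top, num_blocks):
    """Add the upsampled top-level states to the full-length first-block states."""
    h_up = upsample_by_repeat(h_top, num_blocks)
    if len(h_up) != len(h_first_block):
        raise ValueError("upsampled length must match the full sequence length")
    return [[a + b for a, b in zip(low, high)]
            for low, high in zip(h_first_block, h_up)]


h1 = [[float(i)] for i in range(8)]  # full-length states, T = 8
h_top = [[10.0], [20.0]]             # output of a 3-block encoder, length 8 / 4
g = decoder_input(h1, h_top, num_blocks=3)
```

With 0-based positions, the upsampled vector at position `i` is `h_top[i // 4]`, and the token-specific detail comes entirely from the first-block states.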
It is important to emphasize that the decoder is only used if the task requires token-level prediction, such as in standard pretraining or sequence labeling. For tasks that only require a single vectorial representation of the sequence like classification, the decoder is discarded after pretraining and only the encoder is finetuned. Finally, to emphasize the filtering/compression property of the encoder as well as its shape, we name the proposed model Funnel-Transformer (F-TFM).
2.3 Complexity & Capacity Analysis
With the architecture design specified, we now analyze how the sequence compression affects the complexity and capacity of the proposed model, especially compared to the standard Transformer.
Firstly, for a Transformer layer with an S-Attn and a P-FFN module of hidden size $D$, the complexity of processing a length-$T$ sequence is $O(T^2 D + T D^2)$. (Footnote 2: Since the corresponding memory complexity is simply $O(T^2 + TD)$, which is always offset by a multiplier $1/D$, we will focus on the computation complexity with the conclusion directly carried through.) Hence, every time the sequence length is reduced by half in the encoder, we enjoy a super-linear (more than half) complexity drop. In practice, as the $O(TD^2)$ term has a large constant, a near-linear speedup is observed more often. The super-linear effect is more detectable when the sequence length is relatively long, as in pretraining. Therefore, given the same FLOPs, we can at least trade a full-length layer in the 1st block for $2^{m-1}$ layers in the $m$-th block, which provides an economical way to increase the depth of the network.
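The super-linear effect can be checked numerically with a small sketch. It assumes the per-layer cost is exactly proportional to $T^2 D + T D^2$ with all constants dropped, so it overstates the gain relative to practice, where the $T D^2$ term carries a large constant; the function name and the chosen lengths are illustrative only.

```python
def layer_cost(seq_len, hidden):
    """Cost proportional to T^2 * D + T * D^2 (constants dropped)."""
    return seq_len ** 2 * hidden + seq_len * hidden ** 2


hidden = 768
# Ratio of the cost at length T to the cost at length T / 2.
halving_gain = {
    seq_len: layer_cost(seq_len, hidden) / layer_cost(seq_len // 2, hidden)
    for seq_len in (128, 512, 2048)
}
```

Under this model the ratio equals $4(T + D)/(T + 2D)$: it is always above 2 and grows with the sequence length, consistent with the remark that the effect is more visible for long sequences.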
On the other hand, the capacity of a compressed-length layer is clearly upper-bounded by that of a normal full-length layer. In most cases where the compression is lossy, reducing the sequence length will inevitably lead to a capacity drop. The good news is that the capacity drop of a single layer could be well compensated by re-investing the saved FLOPs in stacking more, cheaper layers of reduced length or increasing the width of the model.
As a concrete example, for a Transformer of BERT Base size, i.e., 12 layers of hidden size 768 (L12H768), we may construct a Funnel-Transformer of 3 blocks where each block has 6 layers of hidden size 768 (B6-6-6H768). Despite having 18 layers in total, when finetuned for classification, the FLOPs of the B6-6-6H768 architecture only correspond to at most $6 + 6/2 + 6/4 = 10.5$ full-length layers, clearly fewer than that of L12H768. More importantly, as we will show in the experiments, B6-6-6H768 significantly outperforms L12H768. While intuitive, how to construct an optimal block layout given this depth-length trade-off remains an open challenge. For this work, we only consider relatively regular layouts and leave more systematic studies for future work.
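The layer-equivalence arithmetic can be reproduced with a short sketch. It assumes cost linear in sequence length (the same rough assumption used for the FLOPs column of Table 1), halving of the length between consecutive blocks, and encoder-only use as in classification finetuning; the function name is hypothetical.

```python
def full_length_equivalent(block_layers):
    """Number of full-length layers with the same cost under linear complexity.

    A layer in block m (0-based) costs 1 / 2**m of a full-length layer.
    """
    return sum(n / 2 ** m for m, n in enumerate(block_layers))


equiv = full_length_equivalent([6, 6, 6])  # 6 + 6/2 + 6/4
relative_flops = equiv / 12                # relative to the 12-layer baseline
```

Here `equiv` is 10.5 and `relative_flops` is 0.875, in line with the roughly 0.88x reported for this layout in Table 1.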
Finally, notice that trading sequential resolution for depth or width has a side effect of increasing the total number of parameters. For instance, B6-6-6H768 has 1.5x the Transformer parameters of L12H768. In practice, more parameters may increase the communication cost in distributed training as well as the memory consumption and memory access time. A simple remedy is to perform certain parameter sharing, as used in ALBERT, to recover the same parameter count. Taking B6-6-6H768 as an example, one may tie the parameters for every two layers in the 2nd and 3rd blocks, denoted as B6-3x2-3x2H768, which gives back the same number of parameters as L12H768. However, parameter sharing could result in performance loss. Fundamentally, this brings us another trade-off between the gain (capacity) and cost (memory and communication cost) of using more parameters, which can be highly device dependent.
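The parameter bookkeeping behind these layouts can be sketched by counting distinct versus executed layers. The representation of a layout as `(distinct_layers, repeats)` pairs is an assumption made for this illustration, and the count covers Transformer layers only (the #Params column of Table 1 also includes the embedding matrix, so its ratios are smaller).

```python
def layer_counts(layout):
    """layout: list of (distinct_layers, repeats) pairs, one per block.

    Returns (layers with their own parameters, layers actually executed).
    """
    distinct = sum(n for n, _ in layout)
    executed = sum(n * r for n, r in layout)
    return distinct, executed


baseline_layers = 12                                 # L12H768
no_sharing = layer_counts([(6, 1), (6, 1), (6, 1)])  # B6-6-6
shared = layer_counts([(6, 1), (3, 2), (3, 2)])      # B6-3x2-3x2
```

Both layouts execute 18 layers, but sharing brings the number of distinct layers from 18 (1.5x the baseline) back to 12.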
3 Related Work
As far as we know, no previous work achieves performance gain via compressing the sequence length of the hidden states under language pretraining. Meanwhile, our proposed model is quite similar to the bottom-up model proposed by a contemporary work [24] for causal language modeling. The key differences include the pool-query-only design for down-sampling, how the upsampling is performed, and our relative attention parameterization. Another closely related idea is Power-BERT [8], which learns to soft-eliminate word vectors that are less “significant” during finetuning. Hence, for post-finetuning inference, the sequence length can be reduced to achieve acceleration. More generally, our work is also related to previous work on hierarchical recurrent neural networks [14] and Transformer models [34, 7]. Different from these methods, our model does not rely on any predefined hierarchy or boundary of semantic meanings and always captures the full-length dependency input with attention.
In contrast, our work draws many inspirations from the computer vision domain. The contracting encoder and expanding decoder framework with residual connections is conceptually similar to the ResUNet [19] for image segmentation. The strided pooling is also widely used to construct modern image recognition networks [20]. Despite the similarities, apart from the obvious difference in data domain and computation modules, our encoder employs a special pool-query-only design to improve the compression, and our decoder only requires a single upsampling with a large expansion rate.
In addition, a line of research in graph neural networks has tried to gradually reduce the number of nodes in different ways and obtain a single vectorial representation for supervised classification [32, 6, 12]. While these methods could potentially be plugged into our model as alternative compression operations, it remains an open question whether compression techniques developed for supervised graph classification can be extended to large-scale language pretraining.
4 Experiment
In this section, we empirically evaluate the proposed F-TFM by first pretraining it and then finetuning it in downstream tasks. Following previous work, for pretraining, we consider two common settings:

- Base scale: Pretraining models for 1M steps with batch size 256 on Wikipedia + Book Corpus. This is the setting used by the original BERT [5]. We will rely on this setting to perform a fair comparison between F-TFM and the standard Transformer as well as some ablation studies.

For finetuning, we mainly focus on sequence-level tasks that only require a single vectorial representation of the input sequence, since F-TFM is designed with such a purpose in mind. Specifically, such tasks include the GLUE benchmark for language understanding [27], 7 widely used text (sentiment / topic) classification tasks (IMDB, AG, DBpedia, Yelp-2, Yelp-5, Amazon-2, Amazon-5) [33], and the RACE reading comprehension dataset [10]. In addition, to see how F-TFM performs when token-level prediction is needed, we consider the SQuAD question answering task, which requires the model to select a token span from the context paragraph as the answer. For more details of the experiment setting, we refer readers to Appendix B.
Finally, for all models implemented in this work, including the Transformer baselines in the base-scale comparison of Section 4.1, we always use the relative positional attention parameterization proposed by Transformer-XL [4] (see Appendix A.2 for some implementation details of Transformer-XL).
4.1 Base-scale Results
Firstly, we evaluate how F-TFM performs compared to the standard Transformer under a similar amount of computation (i.e., FLOPs). For this purpose, we consider three commonly used model sizes for the standard Transformer, namely large (L24H1024), base (L12H768) and small (L6H768). Then, for each Transformer baseline, we construct F-TFMs of different block layouts and parameters, while ensuring the F-TFMs always have fewer or similar FLOPs. Based on the MLM pretraining objective, the results on the GLUE benchmark and text classification are presented in Table 1, where we also include the relative FLOPs and #Params.
Model size  CoLA  SST-2  MRPC  STS-B  QQP  MNLI  QNLI  RTE  GLUE-AVG
L24H1024  63.2  94.8  91.8/88.5  91.1  88.7/91.7  88.7  94.0  80.5  86.6
B10-10-10  64.8  95.0  92.5/89.5  90.7  88.6/91.5  88.9  94.0  81.5  87.0
B8-8-8  63.5  94.7  92.2/89.0  90.7  88.9/91.7  88.8  93.6  81.2  86.7
L12H768  60.5  93.0  92.2/89.0  89.4  88.1/91.2  86.0  92.2  73.6  84.4
B6-6-6  62.5  94.0  92.2/89.0  89.5  88.4/91.4  87.0  92.7  76.5  85.3
B6-3x2-3x2  60.5  93.6  92.4/89.2  89.4  88.2/91.3  86.4  92.5  75.0  84.7
B4-4-4  59.1  92.7  91.8/88.7  89.1  88.2/91.3  85.5  92.0  73.2  83.9
L6H768  55.2  91.5  91.1/87.8  88.1  87.2/90.6  82.7  90.0  64.6  81.3
B3-4-4  59.0  92.8  91.8/88.5  88.5  87.8/90.9  84.8  91.8  73.2  83.7

Model size  IMDB  AG  DBpedia  Yelp-2  Yelp-5  Amazon-2  Amazon-5  FLOPs  #Params
L24H1024  4.440  4.987  0.646  1.758  28.73  2.409  32.78  1.00x  1.00x
B10-10-10  4.404  5.026  0.617  1.734  28.52  2.400  32.65  0.73x  1.22x
B8-8-8  4.552  5.079  0.664  1.713  28.84  2.438  32.87  0.58x  1.00x
L12H768  5.328  5.184  0.663  2.013  29.35  2.571  33.14  1.00x  1.00x
B6-6-6  4.908  5.079  0.654  1.939  29.03  2.518  32.91  0.88x  1.39x
B6-3x2-3x2  5.144  5.342  0.649  1.892  29.03  2.570  33.01  0.88x  1.00x
B4-4-4  5.348  5.250  0.670  1.979  29.37  2.596  33.16  0.58x  1.00x
L6H768  6.252  5.421  0.697  2.203  30.33  2.801  33.69  1.00x  1.00x
B3-4-4  5.520  5.342  0.670  2.042  29.51  2.603  33.16  1.00x  1.53x
Table 1: MLM pretraining results at the base scale on GLUE (upper panel) and text classification (lower panel). The FLOPs is a rough estimation assuming linear complexity w.r.t. the sequence length. The #Params is exact, including the embedding matrix.
Here, we can make a few key observations:

- Given similar or fewer FLOPs, by trading sequential resolution for more layers, the F-TFM outperforms the standard Transformer in most tasks except STS-B, especially for smaller models.
- When we only compress the sequence length without increasing the depth (and #Params), F-TFM could suffer from some performance loss in certain settings on the GLUE datasets. However, as the model size increases, such performance gaps become smaller or even disappear.
- In addition, we find partial parameter-sharing often harms the performance. Therefore, the practical trade-off should be made according to the actual task and computation device.

To further test the generality of F-TFM, we additionally consider ELECTRA for pretraining. The results are summarized in Table 2. Overall, we see a similar trend, though the gain is slightly smaller on the GLUE benchmark. This could be attributed to reusing two key hyper-parameters (discriminator loss coefficient and generator size multiplier) tuned for Transformer to train F-TFMs without any adjustment at all.
Model size  CoLA  SST-2  MRPC  STS-B  QQP  MNLI  QNLI  RTE  GLUE-AVG
L24H1024  66.5  94.3  92.8/90.0  91.5  89.6/92.2  89.4  94.1  84.5  87.8
B10-10-10  68.6  95.0  93.0/90.0  91.0  88.9/91.7  89.1  93.6  84.5  87.9
B8-8-8  66.6  94.8  92.6/89.7  90.7  88.8/91.7  89.0  93.6  82.1  87.3
L12H768  64.3  93.1  92.1/89.2  90.8  88.7/91.7  86.4  92.1  75.4  85.4
B6-6-6  64.3  94.2  92.8/89.7  90.1  88.7/91.6  87.4  92.5  78.3  86.0
B6-3x2-3x2  63.9  94.2  93.0/90.2  89.5  88.4/91.4  87.0  92.2  77.6  85.7
B4-4-4  62.8  93.6  92.5/89.2  89.2  88.4/91.3  86.0  91.6  74.3  84.8
L6H768  62.1  91.1  90.8/86.8  88.9  88.2/91.3  83.9  89.7  66.7  82.6
B3-4-4  59.0  93.1  90.8/87.5  88.7  88.1/91.0  85.8  91.1  72.5  83.6

Model size  IMDB  AG  DBpedia  Yelp-2  Yelp-5  Amazon-2  Amazon-5  FLOPs  #Params
L24H1024  4.724  5.053  0.653  1.874  28.84  2.425  32.85  1.00x  1.00x
B10-10-10  4.324  5.250  0.639  1.789  28.68  2.419  32.72  0.73x  1.22x
B8-8-8  4.364  5.408  0.651  1.729  28.76  2.447  32.85  0.58x  1.00x
L12H768  5.248  5.355  0.657  1.953  29.24  2.596  33.04  1.00x  1.00x
B6-6-6  4.792  5.237  0.650  1.850  28.73  2.499  32.79  0.88x  1.39x
B6-3x2-3x2  4.924  5.342  0.671  1.913  29.00  2.523  32.85  0.88x  1.00x
B4-4-4  5.152  5.382  0.659  2.032  29.33  2.566  33.03  0.58x  1.00x
L6H768  6.220  5.395  0.674  2.287  30.16  2.759  33.57  1.00x  1.00x
B3-4-4  5.396  5.342  0.653  2.000  29.60  2.591  33.09  1.00x  1.53x
Table 2: ELECTRA pretraining results at the base scale.
Running Time Comparison. While the FLOPs count offers a general idea of the model speed, it still differs from the actual running time, especially when other overhead exists. Hence, for completeness, we show the speedup provided by the F-TFM in terms of actual running time in Appendix C.2. We also compare the actual memory footprint of F-TFM and the standard Transformer (TFM) in Appendix C.2.
4.2 Large-scale Results
Given the encouraging results of F-TFM at base scale, we next consider training F-TFM under the large-scale setting and compare it with previous models pretrained in similar settings. Due to the slightly better performance of ELECTRA over MLM, we will use the ELECTRA objective for all large-scale experiments.
Model  CoLA  SST-2  MRPC  STS-B  QQP  MNLI  QNLI  RTE  WNLI  AVG
Dev set results (single model)
ROBERTA [16]  68.0  96.4  -/90.9  92.4  -/92.2  90.2  94.7  86.6  -  88.9
XLNet [31]  69.0  97.0  -/90.8  92.5  -/92.3  90.8  94.9  85.9  -  89.2
ELECTRA [3]  69.1  96.9  -/90.8  92.6  -/92.4  90.9  95.0  88.0  -  89.5
B10-10-10H1024  72.4  96.8  93.5/90.9  92.1  89.8/92.4  91.1/-  95.1  89.5  -  90.0
B8-8-8H1024  71.3  96.8  93.1/90.7  91.7  89.8/92.4  90.8/-  94.7  89.2  -  89.7
ROBERTA [16]  63.6  94.8  -/90.2  91.2  -/91.9  87.6/-  92.8  78.7  -  86.4
MPNet [23]  65.0  95.4  -/91.5  90.9  -/91.9  88.5/-  93.3  85.2  -  87.7
B6-6-6H768  70.1  96.3  93.2/90.4  91.1  89.2/92.0  89.7/-  93.7  83.4  -  88.3
B6-3x2-3x2H768  68.5  95.6  92.5/89.5  91.0  89.3/92.0  89.1/-  93.0  83.4  -  87.8
B4-4-4H768  68.2  95.0  92.8/90.2  90.3  89.0/91.8  88.6/-  92.6  79.1  -  87.0
Leaderboard test set results (single task & single model)
ELECTRA [3]  68.1  96.7  89.2/92.0  92.1/91.7  74.8/90.4  90.7/90.2  95.5  86.1  65.1  85.2
B10-10-10H1024  68.9  97.2  89.4/92.1  91.6/91.3  74.3/90.2  90.9/90.9  95.5  86.5  65.1  85.4
B8-8-8H1024  68.3  96.9  89.2/92.0  91.5/91.1  73.8/90.1  90.7/90.7  95.1  85.3  65.1  85.0
ELECTRA [3]  64.6  96.0  88.1/91.2  91.0/90.2  73.2/89.5  88.5/88.0  93.1  75.2  65.1  82.7
B6-6-6H768  68.3  96.5  89.1/91.9  90.6/89.9  73.3/89.9  89.7/89.4  94.0  80.4  65.1  84.0
B6-3x2-3x2H768  65.9  96.0  87.8/91.0  90.0/89.6  73.3/89.8  88.9/88.7  93.8  79.9  65.1  83.4
Leaderboard test set results (multi-task & ensemble)
ROBERTA [16]  67.8  96.7  89.8/92.3  92.2/91.9  74.3/90.2  90.8/90.2  95.4  88.2  89.0  88.1
ELECTRA [3]  71.7  97.1  90.7/93.1  92.9/92.5  75.6/90.8  91.3/90.8  95.8  89.8  91.8  89.4
B10-10-10H1024  70.5  97.5  91.2/93.4  92.6/92.3  75.4/90.7  91.4/91.1  95.8  90.0  94.5  89.7
Table 3: Finetuning performance on the GLUE benchmark under the large-scale setting. A dash marks a value that is not reported.
Given the pretrained F-TFMs of different sizes, we first compare the finetuning performance on the GLUE benchmark in Table 3. Similar to the base-scale results, with fewer or comparable FLOPs, F-TFM outperforms the corresponding baselines in the majority of tasks, suggesting the good scalability of F-TFM. We also test the models on the 7 text classification tasks, but due to the page constraint, we refer readers to Appendix C.1.
Next, we consider the RACE dataset, which is quite different from the GLUE benchmark. At its core, RACE is a multiple-choice reading comprehension task requiring complex reasoning, which can nonetheless be formulated as classifying the correct choice. Also, paragraphs in RACE are much longer. To F-TFM, this presents both a challenge, as it requires detailed reasoning, and an opportunity to compress long paragraphs. As we can see in Table 5, F-TFM achieves better performance compared to all previous models. In particular, within the base model group, the gain is very significant. This shows that F-TFM can also excel at sequence-level tasks that involve long text and reasoning.
Finally, although F-TFM is mainly designed for tasks that only require a sequence-level representation, it is possible to apply F-TFM to token-level tasks by additionally finetuning the decoder. To test this ability, we finetune F-TFM on the SQuAD datasets and compare it with previous models in Table 5. While F-TFM outperforms previous models in the base group by a large margin, in the large model group, the F-TFM with about 83% FLOPs (B10-10-10) still falls behind the standard Transformer, which always maintains full-length token-level representations. This suggests sequential compression could harm the performance when detailed token-level information is critical. On the other hand, compared to the results on SQuAD1.1, F-TFMs perform relatively better on SQuAD2.0, which additionally requires the model to make a sequence-level prediction on whether the question is answerable. This again shows the general effectiveness of the F-TFM in sequence-level tasks.
4.3 Ablation Study
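ID  Layout  (FLOPs / Params)  Pool-Op  Pool-query-only  Sep [cls]  Rel-Attn  GLUE-AVG
(1)  B6-6-6  (1.00x / 1.00x)  Mean  ✓  ✓  ✓  83.5
(2)  B6-6-6  (1.00x / 1.00x)  Mean  ✗  ✓  ✓  82.9
(3)  B6-6-6  (1.00x / 1.00x)  Mean  ✓  ✗  ✓  83.0
(4)  B6-6-6  (1.00x / 1.00x)  Mean  ✓  ✓  ✗  81.4
(5)  B6-6-6  (1.00x / 1.00x)  Max  ✓  ✓  ✓  83.4
(6)  B6-6-6  (1.00x / 1.00x)  Top-Attn  ✓  ✓  ✓  75.8
(7)  B8-8  (1.14x / 0.91x)  Mean  ✓  ✓  ✓  83.4
(8)  B5-5-5-5  (0.89x / 1.08x)  Mean  ✓  ✓  ✓  82.9
Table 6: Ablation results on GLUE. A ✗ marks a component that is disabled in that row.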
Finally, based on the GLUE benchmark, we perform a series of ablation studies on the importance of various designs in F-TFM, including the block layout design, the type of pooling operation, the pool-query-only technique, maintaining a separate [cls] vector and the usage of the Transformer-XL parameterization.

- Pooling operation: Including the mean pooling we finally employ in F-TFM, we test two types of pooling operations.
  - The first type is just the strided mean/max pooling as described in Section 2.
  - The second type aims to select a subset of “hub” states, which refer to those hidden vectors that are attended to most in the previous S-Attn layer and hence are likely to carry the most critical information about the sequence. Concretely, given the attention map from the previous S-Attn layer, we sum the attention scores over the head and query-length dimensions to obtain a score for each position. Then, we simply choose the top 50% of states to achieve the same compression rate. Note that this type of pooling operation is essentially the same as the important-states selection procedure in Power-BERT [8].
- Pool-query-only design
- Separating [cls] in the pooling operation
- Block layout design: In our experiments, all models actually utilize a 3-block design. Here, we compare the 3-block design with the 2-block and the 4-block designs.

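The attention-based “hub” selection described in the pooling item above can be sketched as follows. The nested-list layout `attn[head][query][key]`, the function name, and the choice to return positions in their original order are assumptions made for this illustration; the toy attention rows each sum to 1.

```python
def select_hub_positions(attn, keep_ratio=0.5):
    """Return the key positions with the largest total attention.

    attn[head][query][key] holds attention weights; scores are summed
    over the head and query dimensions.
    """
    num_keys = len(attn[0][0])
    scores = [
        sum(head[q][k] for head in attn for q in range(len(head)))
        for k in range(num_keys)
    ]
    num_keep = max(1, int(num_keys * keep_ratio))
    top = sorted(range(num_keys), key=lambda k: scores[k], reverse=True)[:num_keep]
    return sorted(top)  # assumption: keep the selected states in sequence order


attn = [
    [[0.7, 0.1, 0.1, 0.1], [0.6, 0.2, 0.1, 0.1],
     [0.5, 0.1, 0.3, 0.1], [0.4, 0.1, 0.4, 0.1]],
    [[0.4, 0.1, 0.4, 0.1], [0.3, 0.1, 0.5, 0.1],
     [0.2, 0.2, 0.5, 0.1], [0.3, 0.1, 0.5, 0.1]],
]
hubs = select_hub_positions(attn)
```

In this toy example positions 0 and 2 receive the most attention and are kept, which halves the sequence just like stride-2 pooling but by selection rather than averaging.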
The ablation results are included in Table 6. To save the computation resources, the size of model hidden states in table 6 is set as 512. From the ablation results, we can make the following observations:

[leftmargin=*]

Comparing pooling different operation ((1), (5), and (6)), we found that the performance of the mean and max pooling operation is similar. But they are significantly better than the idea of utilizing attention score (TopAttn pooling) to select the “hub” states.

Comparing (1) with (2) and (3) respectively, we see that the two special designs, i.e. “poolqueryonly” and maintaining a separate nonpooled [cls] , can both bring a clear improvement to the proposed model.

Comparing (1) and (4), we find that the relative positional parameterization is key to the performance of the proposed FTFM. We suspect that the pooling operation could destroy the positional information carried by the absolute position encoding, which is only injected to the model in the input embedding layer. As a result, the higher blocks may not have enough positional information to learn a good enough attention pattern. In comparison, the positional information is injected to each layer under the relative positional attention scheme. Therefore, to achieve good result with FTFM based on absolute positional embedding, one may inject the absolute positional embedding into each attention layer. Actually, a contemporary application of Transformer to the detection problem in computer vision shows injecting positional embedding into each layer is important [1].

Finally, we study the influence of the block layout design in our framework. With B6-6-6 as the 3-block benchmark, we consider two other layout designs with similar FLOPs and numbers of parameters. Specifically, we consider B8-8 for the 2-block design and B5-5-5-5 for the 4-block design. Comparing the results in (1), (7), and (8), we find that the 3-block (B6-6-6) design achieves the best performance, which is significantly better than the 4-block design and slightly better than the 2-block design. Moreover, if we further take the FLOPs/#Params into consideration, it is even clearer that the 3-block design is superior. Therefore, in the main paper, we always use the 3-block design.
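The pooling variants compared in the observations above can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: the function names, the list-of-lists layout of the hidden states, the `attn[head][query][key]` layout of the attention map, and the choice to keep the selected states in their original order are all assumptions made for the example.

```python
def mean_pool(states, stride=2):
    """Average every `stride` neighbouring hidden vectors."""
    pooled = []
    for start in range(0, len(states), stride):
        window = states[start:start + stride]
        dim = len(window[0])
        pooled.append([sum(vec[d] for vec in window) / len(window) for d in range(dim)])
    return pooled


def max_pool(states, stride=2):
    """Take the element-wise maximum over every `stride` neighbouring vectors."""
    pooled = []
    for start in range(0, len(states), stride):
        window = states[start:start + stride]
        dim = len(window[0])
        pooled.append([max(vec[d] for vec in window) for d in range(dim)])
    return pooled


def top_attn_pool(states, attn, keep_ratio=0.5):
    """Keep the most-attended ("hub") positions.

    attn[h][q][k] is the attention probability that query q assigns to key k
    in head h of the previous layer (layout assumed for this sketch).
    """
    seq_len = len(states)
    scores = [0.0] * seq_len
    # Sum over the head and query dimensions to get one score per position.
    for head in attn:
        for query_row in head:
            for k, prob in enumerate(query_row):
                scores[k] += prob
    num_keep = max(1, int(seq_len * keep_ratio))
    top = sorted(range(seq_len), key=lambda k: scores[k], reverse=True)[:num_keep]
    kept = sorted(top)  # assumption: preserve the original token order
    return [states[k] for k in kept], kept
```

Mean and max pooling always merge neighbouring positions, whereas the Top-Attn variant discards positions outright, which is one plausible reading of why it behaves differently in the ablation.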
5 Conclusion & Discussion
In this work, under the pretraining-finetuning paradigm, we investigate a largely overlooked dimension of complexity in language processing. With the proposed Funnel-Transformer, we show how sequential resolution can be compressed in a simple form to save computation, and how the saved FLOPs can be reinvested in improving the model capacity and hence the performance. Open challenges for future research include finding better ways to improve the compression scheme, to optimize the block layout design, and to reinvest the saved FLOPs. In addition, combining Funnel-Transformer with model compression techniques such as knowledge distillation and quantization would be an important direction towards enhancing its practical impact.
References
 [1] (2020) Endtoend object detection with transformers. ArXiv abs/2005.12872. Cited by: 3rd item.
 [2] (2020) AdaBERT: taskadaptive bert compression with differentiable neural architecture search. arXiv preprint arXiv:2001.04246. Cited by: §1.
 [3] (2020) Electra: pretraining text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555. Cited by: §B.3.1, §1, §2.1, 2nd item, Table 3, Table 5.
 [4] (2019) Transformerxl: attentive language models beyond a fixedlength context. arXiv preprint arXiv:1901.02860. Cited by: §A.2, 5th item, §4.
 [5] (2018) Bert: pretraining of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §B.1, §1, §1, §2.1, 1st item, 5th item.
 [6] (2019) Graph unets. arXiv preprint arXiv:1905.05178. Cited by: §3.

 [7] (2019) Multiresolution transformer networks: recurrence is not essential for modeling hierarchical structure. arXiv preprint arXiv:1908.10408. Cited by: §3.
 [8] (2020) PoWERbert: accelerating bert inference for classification tasks. arXiv preprint arXiv:2001.08950. Cited by: §3, item (2).
 [9] (2019) A mutual information maximization perspective of language representation learning. arXiv preprint arXiv:1910.08350. Cited by: §1.
 [10] (2017) Race: largescale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683. Cited by: §4.

 [11] (2019) Albert: a lite bert for selfsupervised learning of language representations. arXiv preprint arXiv:1909.11942. Cited by: §1, Table 5.
 [12] (2019) Selfattention graph pooling. arXiv preprint arXiv:1904.08082. Cited by: §3.

 [13] (2019) Bart: denoising sequencetosequence pretraining for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461. Cited by: §1.
 [14] (2015) Hierarchical recurrent neural network for document modeling. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 899–907. Cited by: §3, Table 5.
 [15] (2019) Multitask deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504. Cited by: §1.
 [16] (2019) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §B.3.1, §1, Table 3, Table 5, Table 5.
 [17] (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365. Cited by: §1.

 [18] (2019) Exploring the limits of transfer learning with a unified texttotext transformer. arXiv preprint arXiv:1910.10683. Cited by: §1.
 [19] (2015) Unet: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computerassisted intervention, pp. 234–241. Cited by: §3.
 [20] (2010) Evaluation of pooling operations in convolutional architectures for object recognition. In International conference on artificial neural networks, pp. 92–101. Cited by: §3.
 [21] (2019) The evolved transformer. arXiv preprint arXiv:1901.11117. Cited by: §1.
 [22] (2019) Mass: masked sequence to sequence pretraining for language generation. arXiv preprint arXiv:1905.02450. Cited by: §1.
 [23] (2020) MPNet: masked and permuted pretraining for language understanding. arXiv preprint arXiv:2004.09297. Cited by: §1, Table 3, Table 5.
 [24] (2020) Multiscale transformer language models. arXiv preprint arXiv:2005.00581. Cited by: §3.
 [25] (2020) Mobilebert: a compact taskagnostic bert for resourcelimited devices. arXiv preprint arXiv:2004.02984. Cited by: §1.
 [26] (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1, §2.1.
 [27] (2018) Glue: a multitask benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461. Cited by: §4.
 [28] (2019) Pay less attention with lightweight and dynamic convolutions. arXiv preprint arXiv:1901.10430. Cited by: §1.
 [29] (2020) Lite transformer with longshort range attention. arXiv preprint arXiv:2004.11886. Cited by: §1.
 [30] (2019) Unsupervised data augmentation for consistency training. Cited by: §B.3.1.
 [31] (2019) Xlnet: generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, pp. 5754–5764. Cited by: §A.3, §B.3.1, §B.3.1, §B.3.2, §B.3, §1, 2nd item, Table 3, Table 5.
 [32] (2018) Hierarchical graph representation learning with differentiable pooling. In Advances in neural information processing systems, pp. 4800–4810. Cited by: §3.
 [33] (2015) Characterlevel convolutional networks for text classification. In Advances in neural information processing systems, pp. 649–657. Cited by: §4.

 [34] (2019) HIBERT: document level pretraining of hierarchical bidirectional transformers for document summarization. arXiv preprint arXiv:1905.06566. Cited by: §3.
Appendix A Implementation Optimization
A.1 Sequence Truncation for the Separating-[cls] Trick
As discussed in Section 2.2, to avoid breaking the [cls] structure commonly used in pretraining, we do not apply the pooling operation to [cls] and keep its hidden state intact. While conceptually simple, a naive implementation could slow down the computation by 15% due to the “irregular” sequence length caused by such an operation. Specifically, assume that the sequence length of an input sample is a power of 2, i.e., $2^p$, which is usually 512 in the pretraining phase. After one pooling operation with [cls] kept intact, the length of the pooled sequence becomes $2^{p-1} + 1$, which is no longer a power of 2. As a result, it can cause memory misalignment and waste parallel computation power on accelerators, leading to a substantial speed loss.
To resolve this issue, we employ a simple strategy: truncate the last token after the pooling. Formally, denoting the pooled hidden states as $\mathbf{h} = [h_0, \dots, h_{2^{p-1}}]$, the truncation can be expressed as
$$\hat{\mathbf{h}} = \mathrm{truncate}(\mathbf{h}) = [h_0, \dots, h_{2^{p-1}-1}]. \tag{6}$$
With this simple trick, we can always keep the sequence length a power of 2, hence avoiding the slowdown caused by maintaining an independent [cls] hidden state.
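The length arithmetic behind this trick can be checked with a small sketch. It assumes mean pooling with stride 2 over the non-[cls] positions and a list-of-lists layout for the hidden states; the function names `pool_keep_cls` and `truncate_last` are hypothetical and not taken from the released code.

```python
def pool_keep_cls(hidden, stride=2):
    """Mean-pool every position except the first ([cls]) one.

    hidden[0] is treated as the [cls] state and copied through unchanged;
    the remaining positions are averaged in windows of `stride`.
    """
    cls_state, rest = hidden[0], hidden[1:]
    pooled = [cls_state]
    for start in range(0, len(rest), stride):
        window = rest[start:start + stride]
        dim = len(window[0])
        pooled.append([sum(vec[d] for vec in window) / len(window) for d in range(dim)])
    return pooled


def truncate_last(pooled):
    """Drop the final pooled state so the length is a power of 2 again."""
    return pooled[:-1]


# A length-512 input becomes 257 states after pooling and 256 after truncation.
sequence = [[0.0] for _ in range(512)]
after_pool = pool_keep_cls(sequence)
after_truncate = truncate_last(after_pool)
print(len(after_pool), len(after_truncate))
```

The truncation discards information from the last one or two input positions of the block; the sketch makes no claim about how much this matters in practice.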
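A.2 Relative Positional Attention Implementation
In this work, we use the relative positional attention parameterization proposed in Transformer-XL [4]. To facilitate further discussion, we first review the details of this parameterization, taking single-head attention as the example. Let $T$ and $D$ be the sequence length and hidden dimension, respectively. Then, the pre-softmax attention score $A_{ij}$ between a pair of positions $i$ and $j$ consists of two terms:
$$A_{ij} = \underbrace{(W_Q h_i + v)^\top (W_K h_j)}_{\text{content term}} + \underbrace{(W_Q h_i + u)^\top (W_R r_{i-j})}_{\text{position term}}, \tag{7}$$
where $v, u \in \mathbb{R}^{D}$ are two trainable bias vectors, $W_Q, W_K, W_R \in \mathbb{R}^{D \times D}$ are three trainable projection matrices, and $r_{i-j} \in \mathbb{R}^{D}$ is the sinusoidal positional encoding that represents the relative distance $i - j$ between the two positions.
To compute the entire attention score matrix $A \in \mathbb{R}^{T \times T}$, the content term can easily be obtained via two head projections and an outer product of complexity $O(TD^2 + T^2 D)$:
$$A^{\text{content}} = (H W_Q^\top + v)(H W_K^\top)^\top,$$
where $H = [h_1, \dots, h_T]^\top \in \mathbb{R}^{T \times D}$ collects all hidden states into a matrix, one per row, and the bias is added to every row. However, we cannot compute the position term in the same way, as each $A^{\text{position}}_{ij}$ corresponds to a different $r_{i-j}$. Hence, a naive solution is to stack the $T \times T$ pairs of position encodings into a tensor $\hat{R} \in \mathbb{R}^{T \times T \times D}$, where $\hat{R}_{ij} = r_{i-j}$, and then perform the following tensor product:
$$A^{\text{position}} = \mathrm{einsum}(\text{"id,ijd->ij"},\; H W_Q^\top + u,\; \hat{R} W_R^\top).$$
Note that the head projection $\hat{R} W_R^\top$ now has a complexity of $O(T^2 D^2)$ and a memory footprint of $O(T^2 D)$, dominating all other computations.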
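A.2.1 Standard Solution: Gather / Shift
To resolve the computation burden above, a common technique is to instead collect a matrix $R \in \mathbb{R}^{(2T-1) \times D}$, where
$$R = [r_{T-1}, \dots, r_0, \dots, r_{1-T}]^\top,$$
which includes all possible position encodings arranged from the maximum possible distance value $T-1$ to the minimum one $1-T$. Note that the full $\hat{R}$ can be formed by gathering specific elements from $R$ with an index matrix $I$ of shape $[T \times T]$, i.e.,
$$\hat{R} = \mathrm{gather}(R, I), \qquad I_{ij} = T - i + j,$$
with the rows of $R$ indexed from 1.
Mathematically, this is equivalent to using a permutation tensor $P \in \mathbb{R}^{T \times T \times (2T-1)}$ to multiply $R$, i.e., $\hat{R} = P R$, where $P_{ij} \in \mathbb{R}^{2T-1}$ is a one-hot vector used to select/gather a single position of $R$. As the attention score computation only involves linear operations, we can rearrange the computation of the position term as follows:
$$\begin{aligned} A^{\text{position}} &= \mathrm{einsum}(\text{"id,ijd->ij"},\; H W_Q^\top + u,\; (P R) W_R^\top) \\ &= \mathrm{einsum}(\text{"ijk,ik->ij"},\; P,\; (H W_Q^\top + u)(R W_R^\top)^\top) \\ &= \mathrm{gather}\big((H W_Q^\top + u)(R W_R^\top)^\top,\; I\big). \end{aligned}$$
Note that, assuming gathering $T^2$ elements only has a complexity of $O(T^2)$, which is true for CPU/GPU, this trick reduces the computation complexity back to $O(TD^2 + T^2 D)$. In practice, the gather operation can be implemented via a smart reshape operation, which is even cheaper.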
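The equivalence between the naive computation and the gather-based one can be verified numerically on a toy example. The sketch below uses plain Python lists rather than a tensor library, takes the biased queries (the query projection of each hidden state plus the position bias) as given, and uses 0-based indices; the helper names and the exact sinusoid frequency schedule are assumptions of this example.

```python
import math


def sinusoid(t, dim):
    """Sinusoidal encoding cat(sin_t, cos_t) with dim/2 frequencies each."""
    half = dim // 2
    freqs = [1.0 / (10000 ** (2 * (k + 1) / dim)) for k in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def matvec(mat, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in mat]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_position_term(queries, w_r, dim):
    """Project the encoding of distance i - j separately for all T^2 pairs."""
    T = len(queries)
    return [[dot(queries[i], matvec(w_r, sinusoid(i - j, dim))) for j in range(T)]
            for i in range(T)]


def gather_position_term(queries, w_r, dim):
    """Project only the 2T - 1 distinct encodings, then gather per (i, j)."""
    T = len(queries)
    distances = list(range(T - 1, -T, -1))  # T-1, ..., 0, ..., 1-T
    projected = [matvec(w_r, sinusoid(t, dim)) for t in distances]
    # scores[i][k] pairs query i with the k-th distinct relative distance.
    scores = [[dot(q, p) for p in projected] for q in queries]
    # With 0-based indices, distance i - j lives at column T - 1 - i + j.
    return [[scores[i][T - 1 - i + j] for j in range(T)] for i in range(T)]
```

Both functions return the same T x T matrix; the second one performs only 2T - 1 projections instead of T^2.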
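A.2.2 Optimization for TPU: Factorized Relative Positional Attention
However, on TPUs, the assumption that gathering $T^2$ elements only has a complexity of $O(T^2)$ does not hold. Instead, we found that such a gather operation is dramatically slower on TPU. Hence, we here consider another implementation which is significantly faster on TPU.
Firstly, let's rewrite the position term as follows:
$$A^{\text{position}}_{ij} = (W_Q h_i + u)^\top (W_R r_{i-j}) = \big[\underbrace{W_R^\top (W_Q h_i + u)}_{q_i}\big]^\top r_{i-j} = q_i^\top r_{i-j}. \tag{8}$$
For easier derivation, we have introduced the notation $q_i$. Then, recall that $r_{i-j}$ is the sinusoidal encoding that consists of the sine and cosine components, $r_{i-j} = \mathrm{cat}(\sin_{i-j}, \cos_{i-j})$, where
$$\sin_t = \big[\sin(t / 10000^{2/D}), \sin(t / 10000^{4/D}), \dots, \sin(t / 10000^{D/D})\big] \in \mathbb{R}^{D/2},$$
$$\cos_t = \big[\cos(t / 10000^{2/D}), \cos(t / 10000^{4/D}), \dots, \cos(t / 10000^{D/D})\big] \in \mathbb{R}^{D/2}.$$
Hence, we similarly divide $q_i$ defined above into two parts, i.e.,
$$q_i = \mathrm{cat}(q_i^{\sin}, q_i^{\cos}).$$
Given these definitions, we can further break Eqn. (8) into two terms:
$$A^{\text{position}}_{ij} = q_i^\top r_{i-j} = {q_i^{\sin}}^\top \sin_{i-j} + {q_i^{\cos}}^\top \cos_{i-j}.$$
Now, using the trigonometric identities $\sin(a - b) = \sin a \cos b - \cos a \sin b$ and $\cos(a - b) = \cos a \cos b + \sin a \sin b$, the two terms can be respectively reformulated into
$${q_i^{\sin}}^\top \sin_{i-j} = {q_i^{\sin}}^\top \big[\sin_i \odot \cos_j - \cos_i \odot \sin_j\big] = \big[q_i^{\sin} \odot \sin_i\big]^\top \cos_j + \big[q_i^{\sin} \odot \cos_i\big]^\top (-\sin_j)$$
and
$${q_i^{\cos}}^\top \cos_{i-j} = {q_i^{\cos}}^\top \big[\cos_i \odot \cos_j + \sin_i \odot \sin_j\big] = \big[q_i^{\cos} \odot \cos_i\big]^\top \cos_j + \big[q_i^{\cos} \odot \sin_i\big]^\top \sin_j.$$
Hence, combining these two parts together, it follows that
$$\begin{aligned} q_i^\top r_{i-j} &= \big[\mathrm{cat}(q_i^{\sin}, q_i^{\cos}) \odot \mathrm{cat}(\sin_i, \cos_i)\big]^\top \mathrm{cat}(\cos_j, \cos_j) + \big[\mathrm{cat}(q_i^{\sin}, q_i^{\cos}) \odot \mathrm{cat}(\cos_i, \sin_i)\big]^\top \mathrm{cat}(-\sin_j, \sin_j) \\ &= \big[q_i \odot \phi_i\big]^\top \psi_j + \big[q_i \odot \pi_i\big]^\top \omega_j, \end{aligned}$$
where $\phi_i = \mathrm{cat}(\sin_i, \cos_i)$, $\psi_j = \mathrm{cat}(\cos_j, \cos_j)$, $\pi_i = \mathrm{cat}(\cos_i, \sin_i)$, and $\omega_j = \mathrm{cat}(-\sin_j, \sin_j)$ are simply 4 positional encodings formed by concatenating the cosine and sine vectors of the corresponding $i$ and $j$ in different ways. Note that each term of the last line has a factorized form that can be computed via an outer product, just like the standard content term. Therefore, by stacking $\phi, \psi, \pi, \omega$ of all positions (i.e., $i = 1, \dots, T$ and $j = 1, \dots, T$) into the corresponding $\Phi, \Psi, \Pi, \Omega \in \mathbb{R}^{T \times D}$ respectively, the full position term can be expressed in a simple form
$$A^{\text{position}} = \big\{\big[(H W_Q^\top + u) W_R\big] \odot \Phi\big\} \Psi^\top + \big\{\big[(H W_Q^\top + u) W_R\big] \odot \Pi\big\} \Omega^\top,$$
which leads to a complexity of $O(TD^2 + T^2 D)$, comparable to that of the content term.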
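The factorization can be sanity-checked numerically. The sketch below, in plain Python, takes the already projected query vectors (one per position, with the relative-position projection and bias folded in) as input and compares the direct score against the factorized one. The helper names and the frequency schedule of the sinusoid are assumptions of this example, and positions are numbered from 0, which does not affect the relative distances.

```python
import math


def sin_cos(t, dim):
    """Return the sine and cosine halves of the sinusoidal encoding of t."""
    half = dim // 2
    freqs = [1.0 / (10000 ** (2 * (k + 1) / dim)) for k in range(half)]
    return [math.sin(t * f) for f in freqs], [math.cos(t * f) for f in freqs]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def direct_position_term(q_rows, dim):
    """Score each (i, j) pair against the encoding of the distance i - j."""
    T = len(q_rows)
    out = []
    for i in range(T):
        row = []
        for j in range(T):
            s, c = sin_cos(i - j, dim)
            row.append(dot(q_rows[i], s + c))
        out.append(row)
    return out


def factorized_position_term(q_rows, dim):
    """Use only per-position encodings, combined via two outer products."""
    T = len(q_rows)
    enc = [sin_cos(t, dim) for t in range(T)]
    phi = [s + c for s, c in enc]                    # cat(sin_i, cos_i)
    pi_ = [c + s for s, c in enc]                    # cat(cos_i, sin_i)
    psi = [c + c for s, c in enc]                    # cat(cos_j, cos_j)
    omega = [[-x for x in s] + s for s, c in enc]    # cat(-sin_j, sin_j)
    left_phi = [[a * b for a, b in zip(q, p)] for q, p in zip(q_rows, phi)]
    left_pi = [[a * b for a, b in zip(q, p)] for q, p in zip(q_rows, pi_)]
    return [[dot(left_phi[i], psi[j]) + dot(left_pi[i], omega[j]) for j in range(T)]
            for i in range(T)]
```

No gather or index matrix appears in the factorized version, which is the property that makes it attractive on hardware where gathers are slow.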
A.3 Potential Model Extensions
In this section, we discuss some potential model extensions of Funnel-Transformer. As described in Section 2, Funnel-Transformer can be divided into an encoder with a compression functionality and a decoder that recovers the full-length token-level representations. To further extend the proposed model, first note that the encoder-decoder framework can be formulated in a more general form:
$$\mathbf{h}_{\text{enc}} = \mathrm{Encoder}(\mathbf{x}_{\text{enc}}), \qquad \mathbf{h}_{\text{dec}} = \mathrm{Decoder}(\mathbf{h}_{\text{enc}}, \mathbf{x}_{\text{dec}}),$$
where $\mathbf{x}_{\text{enc}}$ and $\mathbf{x}_{\text{dec}}$ are the encoder input sequence and the optional, problem-specific decoder input, respectively. The goal of the encoder is to compress the input sequence $\mathbf{x}_{\text{enc}}$ into the hidden representations $\mathbf{h}_{\text{enc}}$ with a reduced length. Then, conditioned on the decoder input $\mathbf{x}_{\text{dec}}$, if any, the decoder extracts relevant information/representations from $\mathbf{h}_{\text{enc}}$ to solve the specific NLP problem at hand. Next, we show how the general form of Funnel-Transformer can be instantiated into specific forms to solve the corresponding NLP problems.
Sequence-level prediction
This is essentially the case we consider in most of our experiments, where we want to obtain a vectorial representation of the input sequence, such as in text classification. In this case, we don't really need the decoder input (i.e., $\mathbf{x}_{\text{dec}} = \varnothing$), and the decoder simply extracts the hidden representation corresponding to the [cls] token from $\mathbf{h}_{\text{enc}}$ and feeds it into the task-specific structure (e.g., a classifier).
Token-level prediction
In token-level prediction tasks such as MLM pretraining, SQuAD and sequence labeling, we need a decoder to recover the token-level representations from the compressed sequence $\mathbf{h}_{\text{enc}}$. In many cases, $\mathbf{x}_{\text{dec}}$ could simply be the original sequence or a token-level hidden representation of it, which provides fine-grained low-level information about each token and hence eases the optimization. In this paper, we utilize the last-layer hidden states of the 1st block (before the first pooling operation) as the additional decoder input.
For problems that utilize additional input signals, such as the permutation order used for permuted language modeling in XLNet [31], this additional information can be injected into Funnel-Transformer via the decoder input to (approximately) recover some more complex control of the attention mechanism.
Sequence-to-sequence problems
Another important category of NLP tasks is sequence-to-sequence problems, including machine translation, text summarization, and dialog generation, whose state-of-the-art solution is the conventional encoder-decoder framework. Hence, Funnel-Transformer naturally fits these tasks, where the decoder input $\mathbf{x}_{\text{dec}}$ corresponds to the target text sequence and the encoder input $\mathbf{x}_{\text{enc}}$ to the source text sequence. This way, the key difference compared to conventional models is the source-side compression that Funnel-Transformer provides.
Overall, the above summarizes some potential directions for extending the Funnel-Transformer presented in Section 2.2 to other NLP problems. Finally, although we focus our discussion on NLP tasks in this paper, Funnel-Transformer could be applied to any task dealing with sequential data, such as time series and video stream analysis.
Appendix B Experiment Setting and Hyperparameters
B.1 Preprocessing & Tokenization
For all experiments conducted in this work, we simply adopt the “uncased” WordPiece model originally used by BERT [5], where the vocabulary size is about 30K. Other than lowercasing and the default preprocessing included in the WordPiece tokenizer, the only additional preprocessing we perform is to remove some HTML symbols (e.g. <b>) in the 7 text classification tasks.
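B.2 Pretraining
Table 7: Hyperparameters for pretraining.

| Hparam | Base Scale | Large Scale |
| --- | --- | --- |
| Hidden dropout | 0.1 | 0.1 |
| GeLU dropout | 0.0 | 0.0 |
| Attention dropout | 0.1 | 0.1 |
| Max sequence length | 512 | 512 |
| Batch size | 256 | 8192 |
| Learning rate | 1e-4 | 2e-4 |
| Number of steps | 1M | 500K |
| Warmup steps | 10K | 30K |
| Optimizer | Adam Weight Decay | Adam Weight Decay |
| Learning rate decay | Linear | Linear |
| Adam epsilon | 1e-6 | 1e-6 |
| Weight decay | 0.01 | 0.01 |
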
The hyperparameters used for the two different pretraining settings are summarized in Table 7. One exception is the learning rate used for B10-10-10H1024 at the base scale. Specifically, we find that training can be unstable when the depth goes beyond 24 layers (as in the case of B10-10-10H1024) at the base scale, especially for the MLM objective. Hence, we reduce the learning rate to 8e-5 for the B10-10-10H1024 F-TFM during base-scale pretraining. This has the side effect of a slower training pace and potentially slightly worse finetuning performance. However, we do not observe such instability when the batch size is increased, as in the large-scale setting.
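The base-scale schedule implied by these settings can be sketched as follows. The table only states the peak learning rate, the number of warmup steps, the total number of steps and "Linear" decay; the linear shape of the warmup and the decay to exactly zero at the final step are assumptions of this sketch, and `learning_rate` is a hypothetical helper rather than a function from the released code.

```python
def learning_rate(step, peak_lr=1e-4, warmup_steps=10_000, total_steps=1_000_000):
    """Linear warmup to `peak_lr`, then linear decay to zero at `total_steps`.

    Defaults follow the base-scale column: 1e-4 peak, 10K warmup, 1M steps.
    """
    if step < warmup_steps:
        # Assumed linear warmup from 0 to the peak learning rate.
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


# The large-scale setting would use peak_lr=2e-4, warmup_steps=30_000 and
# total_steps=500_000; the reduced rate for the deepest base-scale model
# corresponds to peak_lr=8e-5.
for step in (0, 5_000, 10_000, 500_000, 1_000_000):
    print(step, learning_rate(step))
```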
For ELECTRA, there are two additional important hyperparameters, i.e., the discriminator loss coefficient and the relative size multiplier of the generator. In this work, we do not tune these two hyperparameters at all and simply use the numbers from the original paper, i.e., a discriminator loss coefficient of 50 and a size multiplier of 1/4 for all architectures trained with ELECTRA. In addition, in ELECTRA training, whenever F-TFM is used as the discriminator, the generator also uses F-TFM.
In addition, in all the experiments, we only annotate the size of the hidden states, as the rest of the model sizes can be derived from it:

The embedding size is equal to the hidden size.

The size of the inner states of the P-FFN is “hidden size × 4”.

The attention head dimension is always 64.

The number of attention heads is “hidden size / 64”.
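As a worked example of this derivation, the sketch below computes the remaining sizes from the hidden size. The head dimension of 64 and the 4x inner expansion of the feed-forward layer are assumed here as the usual BERT-style convention, and `derive_sizes` is a hypothetical helper introduced only for illustration.

```python
def derive_sizes(hidden_size, head_dim=64, ffn_multiplier=4):
    """Derive the remaining model sizes from the annotated hidden size.

    head_dim=64 and ffn_multiplier=4 are assumptions (BERT-style defaults).
    """
    if hidden_size % head_dim != 0:
        raise ValueError("hidden size must be a multiple of the head dimension")
    return {
        "embedding_size": hidden_size,
        "ffn_inner_size": hidden_size * ffn_multiplier,
        "head_dim": head_dim,
        "num_heads": hidden_size // head_dim,
    }


for hidden in (512, 768, 1024):
    print(hidden, derive_sizes(hidden))
```

Under these assumptions, an H768 model has 12 heads and an inner size of 3072, and an H1024 model has 16 heads and an inner size of 4096.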
Finally, another important element in pretraining is the mask sampling strategy. For MLM training, following previous work, we always use complete-word span sampling (up to 5 complete words). However, for ELECTRA training, we notice an unexpected phenomenon: under the base-scale setting, the performance of both the Transformer and the F-TFM drops significantly if we use word span sampling rather than single-token sampling. On the other hand, under the large-scale setting, word span sampling works fine. Hence, we use single-token sampling for base-scale ELECTRA training, and word span sampling for large-scale ELECTRA training.
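B.3 Finetuning
Table 8: Hyperparameters for finetuning on the GLUE benchmark (top) and on the 7 text classification datasets (bottom).

| Hparam | RTE | MRPC | STS-B | CoLA | SST-2 | QNLI | MNLI | QQP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Hidden dropout | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| GeLU dropout | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Attention dropout | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| Max sequence length | 128 | 128 | 128 | 128 | 128 | 128 | 128 | 128 |
| Batch size | 16 | 16 | 16 | 16 | 32 | 32 | 64 | 64 |
| Number of epochs | 10 | 10 | 10 | 10 | 5 | 3 | 3 | 5 |
| Learning rate decay | Linear | Linear | Linear | Linear | Linear | Linear | Linear | Linear |
| Weight decay | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
| Warmup proportion | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| Adam epsilon | 1e-6 | 1e-6 | 1e-6 | 1e-6 | 1e-6 | 1e-6 | 1e-6 | 1e-6 |

| Hparam | IMDB | AG | DBpedia | Yelp-2 | Yelp-5 | Amazon-2 | Amazon-5 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Hidden dropout | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| GeLU dropout | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Attention dropout | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| Max sequence length | 512 | 128 | 128 | 512 | 512 | 512 | 512 |
| Batch size | 32 | 32 | 64 | 128 | 128 | 128 | 128 |
| Number of epochs | 5 | 3 | 3 | 3 | 3 | 3 | 3 |
| Learning rate decay | Linear | Linear | Linear | Linear | Linear | Linear | Linear |
| Weight decay | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
| Warmup proportion | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| Adam epsilon | 1e-6 | 1e-6 | 1e-6 | 1e-6 | 1e-6 | 1e-6 | 1e-6 |

For all the finetuning experiments, we essentially inherit the hyperparameters used by XLNet [31]. All the performance numbers reported are obtained on TPUs with TensorFlow 2.2.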
B.3.1 GLUE & Text Classification
For GLUE and text classification datasets, we first fix the values of most hyperparameters as shown in Table 8. Then, we only search the learning rate over the set [1e-5, 2e-5, 3e-5], and choose the best one according to the validation set.
B.3.2 Reading Comprehension
Again, following XLNet [31], the hyperparameters used for finetuning on the RACE and SQuAD datasets are summarized in Table 9. “Layer-wise decay” means exponentially decaying the learning rates of individual layers in a top-down manner. For example, suppose the 24-th layer uses a learning rate $l$, and the layer-wise decay rate is $\alpha$; then the learning rate of layer $m$ is $l\alpha^{24-m}$. In addition, for the two versions of SQuAD, we simply reuse the model trained on SQuAD v2.0 when evaluating on SQuAD v1.1.
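The layer-wise decay rule can be illustrated with a short sketch. It assumes that layers are numbered from 1 (bottom) to `num_layers` (top) and that the top layer receives the undecayed rate; `layerwise_learning_rates` is a hypothetical helper, and how embeddings or task heads are treated is left out.

```python
def layerwise_learning_rates(top_lr, decay, num_layers):
    """Learning rate per layer: top_lr * decay ** (num_layers - m) for layer m."""
    return {m: top_lr * decay ** (num_layers - m) for m in range(1, num_layers + 1)}


# SQuAD-style setting from the reading comprehension table: lr 3e-5, decay 0.75.
rates = layerwise_learning_rates(top_lr=3e-5, decay=0.75, num_layers=24)
for layer in (24, 23, 12, 1):
    print(layer, rates[layer])
```

With a decay rate of 1.0, as listed for RACE, every layer simply uses the same learning rate.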
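Table 9: Hyperparameters for finetuning on RACE and SQuAD.

| Hparam | RACE | SQuAD |
| --- | --- | --- |
| Dropout | 0.1 | 0.1 |
| Attention dropout | 0.1 | 0.1 |
| Max sequence length | 512 | 512 |
| Training epochs/steps | 5 epochs | 8000 steps |
| Warmup proportion/steps | 0.1 | 1000 steps |
| Batch size | [16, 32] | 48 |
| Learning rate | [1e-5, 2e-5] | 3e-5 |
| Learning rate decay | linear | linear |
| Weight decay | 0.01 | 0.01 |
| Adam epsilon | 1e-6 | 1e-6 |
| Layer-wise lr decay | 1.0 | 0.75 |
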
Appendix C Additional Experimental Results
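C.1 Text Classification at Large Scale
Table 10: Performance comparison on 7 text classification tasks under the large-scale training setting. A dash marks a result that is not reported.

| Model | IMDB | AG | DBpedia | Yelp-2 | Yelp-5 | Amazon-2 | Amazon-5 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BERT-Large | 4.51 | - | 0.64 | 1.89 | 29.32 | 2.63 | 34.17 |
| RoBERTa-Large | 3.50 | - | - | - | - | - | - |
| XLNet-Large | 3.20 | 4.45 | 0.64 | 1.37 | 27.05 | 2.11 | 31.67 |
| B10-10-10H1024 | 3.36 | 4.66 | 0.60 | 1.33 | 27.14 | 2.10 | 31.64 |
| B8-8-8H1024 | 3.42 | 4.96 | 0.63 | 1.39 | 27.20 | 2.14 | 31.74 |
| MPNet | 4.40 | - | - | - | - | - | - |
| B6-6-6H768 | 3.72 | 5.00 | 0.64 | 1.50 | 27.73 | 2.27 | 32.11 |
| B6-3x2-3x2H768 | 3.82 | 5.12 | 0.64 | 1.58 | 27.96 | 2.32 | 32.23 |
| B4-4-4H768 | 4.12 | 5.09 | 0.67 | 1.70 | 28.40 | 2.35 | 32.46 |

Table 10 includes the performance comparison on 7 text classification tasks under the large-scale training setting. Similar to the GLUE benchmark results, the proposed F-TFM achieves results comparable to the previous Transformer-based results while using fewer FLOPs.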
C.2 Training Cost Comparison
In this section, we test the pretraining and finetuning speed of the F-TFM in comparison to the standard Transformer on the TPU and GPU platforms. For the pretraining speed evaluation, we test F-TFM on TPU v3-16 (16 cores x 16 GB) with TensorFlow. For the finetuning speed evaluation, we test F-TFM on TPU v2-8 (8 cores x 8 GB) with TensorFlow and on Nvidia V100 (16 GB) GPUs with PyTorch. The TensorFlow version is 2.2.0, and the PyTorch version is 1.5.0. For the GPU experiments, we use an 8-GPU node on the Google Cloud Platform. All running speeds are reported with the FP16 optimizer. In the PyTorch implementation, we use the “O2” option of the AMP manager in the apex package (https://github.com/NVIDIA/apex) to handle the FP16 optimization. For finetuning, we consider three different sequence lengths, namely 128, 256 and 512. For pretraining, we only consider the sequence length 512. In each case, we choose the maximum possible batch size allowed by the memory size of the device(s). We measure the actual model running time by performing 1000 steps of gradient descent with random input sequences of a fixed length.

Table 11: Finetuning running time and memory comparison on GPUs, with the GLUE score of each model.

| Model | Run time, 1 GPU (len 128) | Run time, 8 GPUs (len 128) | Mem (len 128) | Run time, 1 GPU (len 256) | Run time, 8 GPUs (len 256) | Mem (len 256) | Run time, 8 GPUs (len 512) | Mem (len 512) | GLUE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Batch size / GPU | 64 | 64 | 64 | 32 | 32 | 32 | 16 | 16 | |
| L12H768 | 1.00x | 1.00x | 9.2G | 1.00x | 1.00x | 11.0G | 1.00x | 14.3G | 84.40 |
| B6-6-6 | 0.97x | 0.99x | 9.1G | 0.95x | 0.97x | 10.3G | 0.94x | 12.5G | 85.37 |
| B6-3x2-3x2 | 0.93x | 0.93x | 8.4G | 0.91x | 0.92x | 9.5G | 0.90x | 11.8G | 84.78 |
| B4-4-4 | 0.67x | 0.67x | 6.6G | 0.65x | 0.66x | 7.5G | 0.64x | 9.0G | 83.99 |
| Batch size / GPU | 32 | 32 | 32 | 12 | 12 | 12 | 4 | 4 | |
| L24H1024 | 1.00x | 1.00x | 14.8G | 1.00x | 1.00x | 14.4G | 1.00x | 13.9G | 86.62 |
| B10-10-10 | 0.87x | 0.92x | 14.0G | 0.90x | 0.93x | 13.0G | 0.96x | 12.7G | 87.03 |
| B8-8-8 | 0.70x | 0.73x | 11.6G | 0.73x | 0.75x | 10.8G | 0.78x | 10.5G | 86.70 |

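Table 12: Finetuning running time comparison on 8 TPU cores (TPU v2-8), with the GLUE score of each model.

| Model | Run time (len 128) | Run time (len 256) | Run time (len 512) | GLUE |
| --- | --- | --- | --- | --- |
| Batch size / TPU core | 64 | 32 | 16 | |
| L12H768 | 1.00x | 1.00x | 1.00x | 84.40 |
| B6-6-6 | 0.99x | 0.88x | 0.81x | 85.37 |
| B6-3x2-3x2 | 0.97x | 0.87x | 0.77x | 84.78 |
| B4-4-4 | 0.69x | 0.62x | 0.55x | 83.99 |
| Batch size / TPU core | 16 | 8 | 4 | |
| L24H1024 | 1.00x | 1.00x | 1.00x | 86.62 |
| B10-10-10 | 0.89x | 0.81x | 0.73x | 87.03 |
| B8-8-8 | 0.66x | 0.60x | 0.56x | 86.70 |
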
Firstly, we compare the model speed in the finetuning stage. Note that the decoder is not used in this setting. Tables 11 and 12 summarize the finetuning running time comparison on GPUs and TPUs, respectively.

In the base model (L12H768) group, we observe that the speed of B6-6-6H768 is similar to or faster than that of the base Transformer model, despite the fact that B6-6-6 is deeper and has more parameters. Moreover, B6-6-6H768 achieves better results than the base Transformer model. A similar conclusion applies to the B6-3x2-3x2 model, which has the same number of parameters as the base model. The B4-4-4 model, which has the same depth and number of parameters as the base model, is able to provide a 30%-50% speedup without losing too much performance.

In the large model (L24H1024) group, the conclusion is similar. The speed of the larger model B10-10-10 is almost the same as that of the large model, and B8-8-8 is significantly faster than the large model. In addition, when the sequence length equals 512, the acceleration of F-TFM is more pronounced on the TPU than on the GPU.

In both groups, all the tested F-TFM variants have a smaller memory footprint than the standard TFM models, showing the memory efficiency of F-TFM.
Next, we compare the model speed during pretraining under the MLM objective in Table 13, which incurs an additional cost due to the decoder. The results show that the proposed method can still substantially improve the pretraining speed compared to the standard Transformer, though the speed gain is slightly smaller than in the finetuning stage. In summary, this study demonstrates that the proposed method is more efficient in both the finetuning and pretraining stages on modern parallel computing platforms.
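Table 13: Pretraining running time and FLOPs comparison on TPUs under the MLM objective (sequence length 512).

| Model | Running Time | FLOPs |
| --- | --- | --- |
| #TPU cores / Total bsz | 16 / 512 | 16 / 512 |
| L12H768 | 1.00x | 1.00x |
| B6-6-6H768D2 | 0.99x | 1.04x |
| B6-3x2-3x2H768D2 | 0.97x | 1.04x |
| B4-4-4H768D2 | 0.79x | 0.75x |
| #TPU cores / Total bsz | 16 / 128 | 16 / 128 |
| L24H1024 | 1.00x | 1.00x |
| B10-10-10H1024D2 | 0.83x | 0.81x |
| B8-8-8H1024D2 | 0.71x | 0.66x |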
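
A back-of-the-envelope count shows why halving the sequence length after each block changes the relative cost of a layout. The sketch below is a rough estimate under stated assumptions, not the accounting used for the FLOPs column: it assumes a per-layer cost of 12·T·D² for the four attention projections and a 4x feed-forward layer plus 2·T²·D for the attention scores and weighted sum, halves the length after every block, adds optional decoder layers at full length, and ignores embeddings, the output softmax, the [cls] handling and the pool-query-only saving. The names `layer_cost` and `model_cost` are hypothetical.

```python
def layer_cost(seq_len, hidden):
    """Approximate multiply-adds of one Transformer layer (assumed formula)."""
    projections_and_ffn = 12 * seq_len * hidden ** 2
    attention = 2 * seq_len ** 2 * hidden
    return projections_and_ffn + attention


def model_cost(block_layers, hidden, seq_len=512, decoder_layers=0):
    """Sum layer costs, halving the sequence length after every block."""
    cost = 0
    length = seq_len
    for num_layers in block_layers:
        cost += num_layers * layer_cost(length, hidden)
        length //= 2
    # Decoder layers, if any, operate on the full-length sequence again.
    cost += decoder_layers * layer_cost(seq_len, hidden)
    return cost


base = model_cost([12], hidden=768)
for name, blocks in (("B6-6-6", [6, 6, 6]), ("B4-4-4", [4, 4, 4])):
    ratio = model_cost(blocks, hidden=768, decoder_layers=2) / base
    print(name, "with 2 decoder layers:", round(ratio, 2), "x the L12 cost")
```

Because of the ignored terms, these ratios should not be expected to reproduce the reported numbers exactly.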