RealFormer: Transformer Likes Residual Attention

12/21/2020 ∙ by Ruining He, et al. ∙ Google 74

Transformer is the backbone of modern NLP models. In this paper, we propose RealFormer, a simple Residual Attention Layer Transformer architecture that significantly outperforms canonical Transformers on a spectrum of tasks including Masked Language Modeling, GLUE, and SQuAD. Qualitatively, RealFormer is easy to implement and requires minimal hyper-parameter tuning. It also stabilizes training and leads to models with sparser attentions. Code will be open-sourced upon paper acceptance.




1 Introduction

Transformer (Vaswani et al., 2017) architectures are the backbone of numerous state-of-the-art NLP models such as BERT (Devlin et al., 2019), GPT (Radford et al., 2019), and Meena (Adiwardana et al., 2020), and have seen wide successes across both academia and industry. Typically, a Transformer network consists of a stack of residual layers. The original design follows a “Post-LN” structure which adds Layer Norm (LN) as a “post-processing” step for each sub-layer, as shown in Figure 1 (a). It has been adopted by various state-of-the-art models including BERT, XLNet (Yang et al., 2019), RoBERTa (Liu et al., 2019), ALBERT (Lan et al., 2019), and Transformer-XL (Dai et al., 2019). Another design is to reorganize the order of sub-layers to create a “direct” / clean path to propagate embeddings of tokens in the input sequence through the whole network, as shown in Figure 1 (b). (Note that a final LN module is usually added at the very top of the whole network.) This design adds LN as a “pre-processing” step for each sub-layer, and is often referred to as “Pre-LN” and used by some well-known extra large models such as GPT-2 (Radford et al., 2019) and Megatron (Shoeybi et al., 2019). In some sense, Post-LN and Pre-LN are analogous to ResNet v1 (He et al., 2016a) and ResNet v2 (He et al., 2016b) respectively in the Computer Vision literature. Although ResNet v2 is almost always preferable to v1 for Computer Vision, the same may not hold for the Pre-LN Transformer in the NLP literature. The particularities of self-attention modules and Transformer architectures likely favor (at least slightly) different designs compared to traditional convolutional neural networks.

Figure 1: Comparison of different Transformer layers: (a) The prevalent Post-LN layer used by (e.g.) BERT; (b) Pre-LN layer used by (e.g.) GPT-2 that creates a “direct” path to propagate token embeddings; (c) Our Informer layer that creates a “direct” path to propagate attention scores (by adding a simple skip edge on top of (a)).

In this paper, we propose a simple Transformer-based architecture to show that it is beneficial to create a “direct” path to propagate raw attention scores through the whole network. Our architecture is called ResIdual AtteNtion Transformer, or Informer in short. As shown in Figure 1 (c), each Informer layer takes the raw attention scores of all attention heads from the previous layer and adds “residual scores” (computed the same way as attention scores in regular Transformers) on top. The sum of the two scores is then used to compute attention weights via softmax (again, as in regular Transformers). In other words, Informer can be seen as adding a simple skip connection to the Post-LN Transformer. Note that it does not add any multiplication ops to the computational graph and therefore the performance is expected to be comparable.

Specifically, our main contributions include:

  • We present Informer, a novel and simple model based on the original Transformer (with no more than a few lines of code changes and minimal hyper-parameter tuning).

  • We demonstrate that Informer can be used as a drop-in replacement of Transformer in BERT, outperforming both Post-LN and Pre-LN Transformers across a wide spectrum of model sizes (from small to extra large) for Masked Language Modeling (i.e., pre-training).

  • We also demonstrate that Informer can consistently improve accuracy of downstream tasks including GLUE (Wang et al., 2018), SQuAD v1.1 (Rajpurkar et al., 2016) and SQuAD v2.0 (Rajpurkar et al., 2018). Furthermore, it even achieves competitive downstream results when pre-trained with only half the number of epochs of the strongest baseline.

  • Finally, we demonstrate qualitatively that attention in Informer tends to be sparser and more correlated across layers compared to baselines, which we believe may have some regularization effects that could stabilize training and benefit fine-tuning.

2 Related Work

Vaswani et al. (2017) initially proposed Transformer for machine translation in 2017, and it has profoundly changed the natural language processing field ever since.

Radford et al. (2018) demonstrated that generative pre-training of a Transformer-based language model (GPT) on a diverse corpus of unlabeled text can give large gains to downstream NLP tasks that suffer from scarce labeled data. Following this thread, Devlin et al. (2019) proposed to pre-train a bidirectional Transformer encoder (BERT) with a novel Masked Language Modeling as the main optimization objective. Since then, advances on many NLP tasks have been dominated by the paradigm of self-supervised general-purpose pre-training followed by task-specific fine-tuning. Following BERT, there has been a large stream of work that explores better self-supervision objectives (e.g., Yang et al. (2019); Clark et al. (2020)), larger pre-training data and better hyper-parameters (e.g., Liu et al. (2019)), model parameter sharing (e.g., Lan et al. (2019)), and multi-task pre-training (e.g., Sun et al. (2019); Raffel et al. (2019)). These efforts typically employ a Post-LN Transformer at their core. In this paper we adopt BERT to test different Transformer architectures because it is widely used and representative of this thread of work.

Another notable thread of work focuses on improving the efficiency/scalability of Transformer. Typically, they try to reduce the quadratic complexity of the self-attention mechanism with respect to sequence length via low-rank methods (e.g., Wang et al. (2020)), fixed strided attention patterns (e.g., Child et al. (2019)), learnable attention patterns (e.g., Kitaev et al. (2020); Roy et al. (2020)), memory-based global & local attention (e.g., Ainslie et al. (2020); Beltagy et al. (2020); Zaheer et al. (2020)), and so on. These methods are particularly useful when dealing with long documents that go beyond the capacity of standard Transformer models. We refer the reader to Tay et al. (2020) for a detailed survey. Informer is orthogonal to these methods as it focuses on improving the standard Transformer with a universal technique which can apply to these models as well.

3 Informer

3.1 Standard Transformer

There is an encoder and a decoder in Transformer (Vaswani et al., 2017). Since they work in a similar way, here we only introduce the encoder and refer the reader to the original paper for complete details.

There are two sub-layers inside each layer of a Transformer encoder. The first sub-layer contains a Multi-Head Attention module that computes output embeddings of a set of queries ($Q$) by aggregating the embeddings ($V$) of a set of keys ($K$):

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O \tag{1}$$

where $\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$. $Q$ and $K$ are matrices with dimension $d_k$ and $V$ is a matrix with dimension $d_v$. $W_i^Q$, $W_i^K$, and $W_i^V$ are matrices that linearly project queries, keys, and values into the “attention space” of the $i$-th head. $W^O$ is a matrix that linearly transforms the concatenation of the outputs of all heads.

Attention function is typically implemented with a Scaled Dot-Product Attention module (Vaswani et al., 2017) which computes a weighted sum of the values:

$$\mathrm{Attention}(Q', K', V') = \mathrm{Softmax}\!\left(\frac{Q' K'^{\top}}{\sqrt{d_k}}\right) V' \tag{2}$$

where matrix $\frac{Q' K'^{\top}}{\sqrt{d_k}}$ is the raw attention scores for each (query, key) pair. These scores are normalized via the Softmax function for each query and then act as weights for the corresponding vectors in $V'$.
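As a minimal sketch of Scaled Dot-Product Attention for a single head, the NumPy code below computes the raw (pre-softmax) scores, normalizes them per query, and takes the weighted sum of the values. The function names, the toy shapes, and the random inputs are illustrative assumptions rather than anything taken from the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """q: (len_q, d_k), k: (len_k, d_k), v: (len_k, d_v)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)      # raw attention scores, (len_q, len_k)
    weights = softmax(scores, axis=-1)   # one probability distribution per query
    return weights @ v, scores, weights

# Toy inputs (hypothetical sizes): 3 queries, 5 keys/values.
rng = np.random.default_rng(0)
q = rng.normal(size=(3, 4))
k = rng.normal(size=(5, 4))
v = rng.normal(size=(5, 2))
out, scores, weights = scaled_dot_product_attention(q, k, v)
print(out.shape, scores.shape)
```

Returning the raw scores alongside the output is a design choice of this sketch; it makes explicit which quantity is the pre-softmax score matrix.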


The second sub-layer contains a fully-connected Feed-Forward Network (FFN) module with one hidden layer:

$$\mathrm{FFN}(x) = \sigma(x W_1 + b_1)\, W_2 + b_2 \tag{3}$$

where $\sigma$ is an activation function usually implemented with ReLU or GeLU (e.g., Devlin et al. (2019)). FFN is applied to each position in the sequence separately and identically. Finally, there are Layer Norm (LN) modules inserted into the above two sub-layers to stabilize training.

As shown in Figure 1, there are two canonical designs of the Transformer network which only differ in the ways they organize the modules. Post-LN is the original architecture proposed by Vaswani et al. (2017) which normalizes the outputs at the end of each sub-layer. In contrast, Pre-LN normalizes sub-layer inputs instead and creates a direct path (without LN in the way) to propagate embeddings of the tokens in the sequence.
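The difference between the two designs comes down to where LN sits relative to the residual addition. The sketch below is a simplified illustration under stated assumptions: `layer_norm` omits the learned scale and bias, and `toy_sublayer` is a hypothetical stand-in for an attention or FFN module.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Layer Norm over the last axis (learned scale/bias omitted for brevity)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln_sublayer(x, sublayer):
    """Post-LN: normalize the output after the residual addition."""
    return layer_norm(x + sublayer(x))

def pre_ln_sublayer(x, sublayer):
    """Pre-LN: normalize the sub-layer input; the residual path carries x untouched."""
    return x + sublayer(layer_norm(x))

def toy_sublayer(x):
    # Hypothetical stand-in for Multi-Head Attention or FFN.
    return 0.5 * x

x = np.array([[1.0, 2.0, 3.0, 6.0]])
post = post_ln_sublayer(x, toy_sublayer)
pre = pre_ln_sublayer(x, toy_sublayer)
```

In the Pre-LN variant, `pre - x` equals the sub-layer output exactly, which is the "direct path" for token embeddings; in the Post-LN variant every sub-layer output passes through LN.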

3.2 Residual Attention Transformer: Informer

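Informer closely follows the Post-LN design and simply adds a skip edge to connect Multi-Head Attention modules in adjacent layers, as shown in Figure 1 (c). Formally, it adds $\mathrm{Prev}$, the pre-softmax attention scores from the previous layer with shape $(\#\mathrm{heads}, \ell, \ell')$ (query length $\ell$ by key length $\ell'$; the batch dimension is omitted for brevity), as one additional input to the Multi-Head Attention module in the current layer:

$$\mathrm{ResidualMultiHead}(Q, K, V, \mathrm{Prev}) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O \tag{4}$$

where $\mathrm{head}_i = \mathrm{ResidualAttention}(Q W_i^Q, K W_i^K, V W_i^V, \mathrm{Prev}_i)$ and $\mathrm{Prev}_i$ is the slice of $\mathrm{Prev}$ with shape $(\ell, \ell')$ corresponding to $\mathrm{head}_i$. ResidualAttention adds “residual scores” on top of $\mathrm{Prev}_i$ and then computes the weighted sum as usual:

$$\mathrm{ResidualAttention}(Q', K', V', \mathrm{Prev}') = \mathrm{Softmax}\!\left(\frac{Q' K'^{\top}}{\sqrt{d_k}} + \mathrm{Prev}'\right) V' \tag{5}$$

Finally, new attention scores $\frac{Q' K'^{\top}}{\sqrt{d_k}} + \mathrm{Prev}'$ are passed over to the next layer.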
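The residual attention idea can be sketched in a few lines of NumPy. This is a single-head toy illustration under several assumptions: LN, the FFN sub-layer, and multiple heads are omitted; the projection matrices are random; and the first layer is given no previous scores (`prev=None`), which is one plausible way to start the chain. It is not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def residual_attention(q, k, v, prev=None):
    """Single-head residual attention.

    q: (len_q, d_k), k: (len_k, d_k), v: (len_k, d_v);
    prev: (len_q, len_k) pre-softmax scores from the previous layer, or None.
    Returns the output and the new pre-softmax scores for the next layer.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])   # this layer's "residual scores"
    if prev is not None:
        scores = scores + prev                # skip edge on raw attention scores
    return softmax(scores) @ v, scores

rng = np.random.default_rng(0)
seq_len, d_model, num_layers = 4, 8, 3
x = rng.normal(size=(seq_len, d_model))
# Hypothetical per-layer projection matrices.
layers = [
    {name: rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
     for name in ("wq", "wk", "wv")}
    for _ in range(num_layers)
]

prev = None
own_scores_sum = np.zeros((seq_len, seq_len))
for w in layers:
    q, k, v = x @ w["wq"], x @ w["wk"], x @ w["wv"]
    own_scores_sum += q @ k.T / np.sqrt(d_model)   # track each layer's own scores
    out, prev = residual_attention(q, k, v, prev)
    x = x + out                                    # usual residual on embeddings
```

With this running-sum formulation, the scores handed to the top of the stack equal the sum of every layer's own scores, which the loop tracks in `own_scores_sum` for comparison.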

Implementing Informer takes no more than adding a few lines of code to the “backbone” Transformer. The same idea can be straightforwardly applied even when there is more than one type of attention module in the network. For example, there are encoder-encoder self-attention, encoder-decoder attention, and decoder-decoder self-attention modules for machine translation. In such cases, Informer simply creates multiple direct paths, one for each type of attention module, as long as the attention pattern / mask is the same across the layers along the path (which is almost always the case).

As we will show in Section 4, Post-LN Transformer tends to outperform Pre-LN Transformer for a variety of setups (given a reasonable computing budget). Therefore in this paper we choose Post-LN as the backbone for Informer for simplicity. Note however that it should be straightforward to switch the backbone to different Transformer variants in settings that favor different trade-offs.

4 Experiments

In this section, we conduct comprehensive empirical studies on Informer comparing against two canonical Transformer architectures: Post-LN and Pre-LN. We evaluate the strength of Informer in terms of both pre-training accuracy and fine-tuning accuracy on downstream tasks with minimal (if any) hyper-parameter tuning.

4.1 BERT

BERT (Devlin et al., 2019) has been the standard way of transferring knowledge from large unlabeled text corpora by pre-training a bidirectional Transformer encoder. Numerous downstream NLP tasks suffering from scarcity of supervised data have benefited considerably by fine-tuning a pre-trained BERT model. This drives us to adopt BERT as the main evaluation setup for Informer.

Experiment setup.

We follow the standard BERT pre-training setup (dataset: Wikipedia + BookCorpus, vocab: uncased 30K, max sequence length: 512, dropout: 10%, learning rate: 1e-4, learning rate schedule: warm up and then linearly decay to 0, weight decay: 0.01, optimizer: AdamW, objective: Masked Language Modeling + Next Sentence Prediction, etc.) to compare all three Transformer models: Post-LN, Pre-LN, and Informer. Unlike BERT, which uses a reduced sequence length for the first 90% of steps, we always use 512 for simplicity. We experiment with Transformer architectures with a wide spectrum of sizes, from Small and Base all the way to xLarge, as detailed in Table 1. For simplicity, all models are pre-trained for 1M steps with a mini-batch size of 512. Note that we use a larger mini-batch size than Devlin et al. (2019), i.e., doubling the amount of pre-training epochs, to show more complete behavior of different models.
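The learning rate schedule described above (warm up, then linearly decay to 0) can be sketched as follows. The peak rate and total step count come from the setup; the number of warm-up steps is not stated there, so `warmup_steps=10_000` is purely an assumed placeholder.

```python
def learning_rate(step, peak_lr=1e-4, total_steps=1_000_000, warmup_steps=10_000):
    """Linear warm-up to peak_lr, then linear decay to 0 at total_steps.

    warmup_steps is an assumed value; the experiment setup does not state it.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)

# Sample the schedule at a few points.
for step in (0, 5_000, 10_000, 505_000, 1_000_000):
    print(step, learning_rate(step))
```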

Model L H A I
BERT-Small 4 512 8 2,048
BERT-Base 12 768 12 3,072
BERT-Large 24 1,024 16 4,096
BERT-xLarge 36 1,536 24 6,144
Table 1: Model architectures for BERT evaluation. L: #layers, H: hidden size, A: #heads, I: intermediate size

We use exactly the same setup for all three Transformer architectures except that for the Pre-LN Transformer we follow the initialization strategy suggested by Radford et al. (2019) and Child et al. (2019). (We also experimented with the initialization strategy used by BERT but with similar results.) Note that for simplicity Informer inherits all hyper-parameter setups from Post-LN Transformer unless otherwise specified.

All experiments are performed on 128 or 256 TPU v3 cores depending on model sizes.

Model Post-LN Pre-LN Informer
BERT-Small 61.88% 61.67% 62.02%
BERT-Base 70.20% 69.74% 70.42%
BERT-Large 73.64% 73.21% 73.94%
BERT-xLarge 73.72% 73.53% 74.76%
Table 2: Masked Language Modeling accuracy on development set after pre-training 1M steps. Informer outperforms baselines more as model gets larger. (We found it to be beneficial for Informer to set attention scores at each layer to be the running mean, instead of running sum, for models deeper than 24 layers and therefore used this setup for Informer xLarge.)

4.1.1 Pre-training Results

To evaluate pre-trained models, we report Masked Language Modeling (MLM) accuracy on a randomly held-out development set. (All methods achieved similarly strong results on the Next Sentence Prediction task, presumably because it is much easier.) As shown in Table 2, Informer outperforms the two baseline Transformers considerably, with the gap increasing with model size. Also note that when going from Large to xLarge (the last two rows), only Informer improves substantially: +0.82 points, versus +0.08 for Post-LN and +0.32 for Pre-LN. Our hypothesis is that larger models are inherently harder to train (we did observe that BERT with Post-LN is unstable and sometimes even diverges for xLarge) and Informer can help regularize the model and stabilize training.

We also report the pre-training curves in Figure 2. One interesting finding is that the Pre-LN Transformer seems to favor the combination of extra large models and a small number of steps, though it is consistently outperformed by the other two Transformers in “regular-sized” settings or given enough budget.

Figure 2: Masked Language Modeling (MLM) accuracy on development set (best viewed in color). Improvement gap of Informer over the best baseline tends to increase with model size. Note that these are with zero hyper-parameter tuning for Informer. (As we will show in Section 4.1.3, Informer can be improved considerably by simply using a larger learning rate and thereby even double the gap size over Post-LN.)

4.1.2 Downstream Evaluation Results

We fine-tune the above BERT-Large models with the three Transformers on a variety of downstream tasks including GLUE, SQuAD v1.1 and SQuAD v2.0 to evaluate the performance of Informer on both sentence-level (i.e., GLUE) and token-level (i.e., SQuAD) NLP tasks.

Task          Post-LN          Pre-LN           Informer
MNLI-m        85.96 ± 0.11     85.03 ± 0.12     86.28 ± 0.14
MNLI-mm       85.98 ± 0.14     85.05 ± 0.19     86.34 ± 0.30
QQP           91.29 ± 0.10     91.29 ± 0.16     91.34 ± 0.03
QQP (F1)      88.34 ± 0.15     88.33 ± 0.26     88.28 ± 0.08
QNLI          92.26 ± 0.15     92.35 ± 0.26     91.89 ± 0.17
SST-2         92.89 ± 0.17     93.81 ± 0.13     94.04 ± 0.24
CoLA (MC)     58.85 ± 1.31     58.04 ± 1.50     59.83 ± 1.06
STS-B (PC)    90.08 ± 0.27     90.06 ± 0.33     90.11 ± 0.56
STS-B (SC)    89.77 ± 0.26     89.62 ± 0.28     89.88 ± 0.54
MRPC          87.50 ± 0.67     86.76 ± 5.64     87.01 ± 0.91
MRPC (F1)     91.16 ± 0.45     90.69 ± 3.16     90.91 ± 0.65
RTE           71.12 ± 2.52     68.59 ± 1.52     73.65 ± 0.90
Overall       84.01            83.47            84.53
Table 3: GLUE development set results of fine-tuning BERT-Large models in Table 2. Default metric: accuracy, MC: Matthews correlation, PC: Pearson correlation, SC: Spearman correlation. Overall: first average metrics within each task (if there is more than one) and then across tasks. Numbers after ± are standard deviations. All numbers are scaled by 100.
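The Overall row of Table 3 can be reproduced with a short calculation. In the sketch below, the scores are copied from the table; the grouping is an inference rather than something the caption states: the reported numbers are matched when the two MNLI rows are counted as separate tasks while the paired metrics of QQP, STS-B, and MRPC are averaged within their task.

```python
from statistics import mean

# Scores from Table 3, one inner list per task, in row order:
# MNLI (matched), MNLI (mismatched), QQP [acc, F1], QNLI, SST-2, CoLA,
# STS-B [PC, SC], MRPC [acc, F1], RTE.
glue = {
    "Post-LN": [[85.96], [85.98], [91.29, 88.34], [92.26], [92.89],
                [58.85], [90.08, 89.77], [87.50, 91.16], [71.12]],
    "Pre-LN": [[85.03], [85.05], [91.29, 88.33], [92.35], [93.81],
               [58.04], [90.06, 89.62], [86.76, 90.69], [68.59]],
    "Informer": [[86.28], [86.34], [91.34, 88.28], [91.89], [94.04],
                 [59.83], [90.11, 89.88], [87.01, 90.91], [73.65]],
}

def overall(task_scores):
    """Average metrics within each task, then average across tasks."""
    return mean([mean(metrics) for metrics in task_scores])

for model, scores in glue.items():
    print(model, round(overall(scores), 2))
# Post-LN 84.01
# Pre-LN 83.47
# Informer 84.53
```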


General Language Understanding Evaluation (GLUE) is a canonical benchmark proposed by Wang et al. (2018) for evaluating the performance of models across a diverse set of NLU tasks. Following the fine-tuning recipe in Devlin et al. (2019), we use a mini-batch size of 32 for all models on all GLUE tasks. For each (task, model) pair, we select number of epochs in {2, 3, 4} and learning rate in {6e-6, 8e-6, 1e-5, 2e-5, 3e-5, 4e-5, 5e-5}. (We use a slightly wider range than Devlin et al. (2019) to better accommodate all three models.) For each setup, we run the experiment 5 times and report the best median performance and the corresponding standard deviation on the development set.

Detailed results are tabulated in Table 3. We exclude the problematic WNLI task following Devlin et al. (2019). For each task, we report metric(s) that are suggested by the GLUE benchmark (Wang et al., 2018). Informer achieves the best overall performance and outperforms both Post-LN and Pre-LN Transformers significantly on most tasks, attesting to its strength at tackling sentence-level tasks.


The Stanford Question Answering Dataset (SQuAD v1.1) is a reading comprehension dataset consisting of 100K crowd-sourced question-answer pairs, where the answer to each question is a segment of text from the corresponding reading passage (Rajpurkar et al., 2016). SQuAD v2.0, a later version, further extends it with over 50K unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.

SQuAD        Public    Post-LN          Pre-LN           Informer
v1.1 (F1)    90.9      91.68 ± 0.12     91.06 ± 0.09     91.93 ± 0.12
v1.1 (EM)    84.1      85.15 ± 0.13     83.98 ± 0.24     85.58 ± 0.15
v2.0 (F1)    81.9      82.51 ± 0.12     80.30 ± 0.12     82.93 ± 0.05
v2.0 (EM)    78.7      79.57 ± 0.12     77.35 ± 0.16     79.95 ± 0.08
Table 4: SQuAD development set results of fine-tuning BERT-Large models in Table 2. EM: exact match. Public: Post-LN results from Devlin et al. (2019). Numbers after ± are standard deviations. All numbers are scaled by 100.

We follow the fine-tuning recipe in Devlin et al. (2019) for all three Transformer models on these two datasets without using any additional data such as TriviaQA (Joshi et al., 2017). For both v1.1 and v2.0, we select mini-batch size in {32, 48}, number of epochs in {2, 3, 4}, and learning rate in {2e-5, 3e-5, 4e-5, 5e-5}. For each setup, we run the experiment 5 times and report the best median performance and the corresponding standard deviation on the development set. As we can see from Table 4, Informer outperforms both Post-LN and Pre-LN Transformers considerably, attesting to its strength at tackling token-level tasks.

4.1.3 Research Questions

How well does Informer perform with half the pre-training budget?

Although Informer has outperformed both Post-LN and Pre-LN Transformers considerably when pre-training 1M steps, we are also interested in investigating its potential when the pre-training budget is more limited. For this purpose, we experiment with BERT-Large models. In particular, we take the 500K step checkpoint of the pre-trained Informer in Table 2 and fine-tune it on GLUE and SQuAD datasets using exactly the same procedure as described above. Comparison results against the strongest baseline, Post-LN Transformer pre-trained 500K (checkpoint) and 1M steps respectively, are collected in Table 5. We can see that Informer with merely half the amount of pre-training epochs can beat Post-LN (1M) on GLUE with a significant margin, and almost match its performance on SQuAD.

Task         Post-LN (500K)   Post-LN (1M)     Informer (500K)
GLUE         83.84            84.01            84.34
v1.1 (F1)    91.46 ± 0.18     91.68 ± 0.12     91.56 ± 0.09
v1.1 (EM)    84.87 ± 0.24     85.15 ± 0.13     85.06 ± 0.12
v2.0 (F1)    81.44 ± 0.50     82.51 ± 0.12     82.52 ± 0.55
v2.0 (EM)    78.64 ± 0.48     79.57 ± 0.12     79.54 ± 0.54
Overall      83.97            84.37            84.51
Table 5: Downstream development set results of fine-tuning BERT-Large with Post-LN and Informer pre-trained with different number of steps. v*.*: SQuAD version, EM: exact match. Overall: first average across SQuAD and then with GLUE. Numbers after ± are standard deviations. All numbers are scaled by 100.

How well does Informer perform with a larger learning rate?

As suggested by some recent works (e.g., Xiong et al. (2020)), Pre-LN Transformer may benefit from using larger learning rates compared to Post-LN. To this end, we follow the pre-training procedure detailed earlier and switch to a larger learning rate, 2e-4, to pre-train BERT-Large with the three Transformer models. MLM accuracy on the development set with training steps is shown in Figure 3. We can see that

  • Both Pre-LN and Informer can reap some benefits of using larger learning rates;

  • Informer seems to benefit a bit more in this case (73.94% → 74.31%) compared to Pre-LN (73.21% → 73.46%). Note that Post-LN diverged with the learning rate of 2e-4.

This means that, with only minimal learning rate tuning, Informer outperforms Post-LN, the strongest baseline, by a prominent gap of 0.67% (i.e., 74.31% - 73.64%) in pre-training.

Figure 3: Masked Language Modeling (MLM) accuracy on development set of BERT-Large with different learning rates (best viewed in color). Informer seems to benefit more from using a larger learning rate compared to Pre-LN. Note that Post-LN diverged with 2e-4.

How does Informer qualitatively differ from the baseline Transformers?

We conduct one empirical study to understand the differences between Informer and Post-/Pre-LN Transformers. We randomly sample 8,192 examples from the held-out development set and visualize the distribution of attention probabilities of each token (excluding padding) in these examples across all layers and all heads in the three pre-trained BERT-Base models in Table 2. In particular, for each (token, layer, head) triplet, we compute the entropy of the attention weights (probabilities) as the “sparsity measure” of attentions. Intuitively, as entropy gets lower, the attention weight distribution becomes more skewed and therefore attention is sparser.
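To illustrate the sparsity measure, the sketch below computes the Shannon entropy of a single attention distribution, i.e., the weights one token assigns over all keys at one head. The logarithm base is not stated in the text, so the use of natural logarithms (nats) here is an assumption, and the toy distributions are made up for illustration.

```python
import math

def attention_entropy(probs):
    """Shannon entropy of one attention distribution (natural log: an assumed unit)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

uniform = [1 / 8] * 8           # dense attention: weight spread evenly over 8 keys
peaked = [0.93] + [0.01] * 7    # sparse attention: most weight on a single key
one_hot = [1.0] + [0.0] * 7     # fully sparse: all weight on one key

for name, dist in (("uniform", uniform), ("peaked", peaked), ("one_hot", one_hot)):
    print(name, attention_entropy(dist))
```

The uniform distribution attains the maximum entropy for its length (log 8 here), while the one-hot distribution has entropy zero, matching the intuition that lower entropy means sparser attention.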

Figure 4: Distribution of entropies of the attention probabilities of the tokens of 8,192 held-out examples using the pre-trained BERT-Base with Informer (see Section 4.1.1). Attention heads in each layer are ordered by their medians of entropies for better legibility. Distributions are color-coded based on the median of entropies: RED (median > 4.5), YELLOW (1.5 ≤ median ≤ 4.5), BLUE (median < 1.5). I.e., colder colors mean sparser attentions. There is a clear trend that higher layers tend to have sparser attentions.

In a similar fashion to Ramsauer et al. (2020), we use violin plots to show the entropy distributions of Informer (see Figure 4). Plots for Post-LN and Pre-LN Transformers are included in Appendix (Figure 6 and Figure 7). Each row is a layer in BERT-Base and each column is an attention head. For better legibility, (1) for each layer, we sort the attention heads in ascending order of the median of entropies; and (2) we also color code these plots to help distinguish heads with relatively sparse attentions (BLUE: median < 1.5) and relatively dense attentions (RED: median > 4.5) from the rest (YELLOW: 1.5 ≤ median ≤ 4.5). From the three figures we can see that attention tends to get sparser for later (upper) layers for all three Transformers. However, Informer differs from the other two in the following two ways:

  • Informer has significantly sparser attentions for top layers (layer 9-11);

  • Informer tends to have lower variance across all layers, which means that attention density is less input-dependent.

We hypothesize that the above two properties might be a sign of stability and could benefit fine-tuning.

Figure 5: Distribution of Jensen-Shannon Divergence (JSD) of attention probabilities in (vertically) adjacent attention heads, i.e., $\mathrm{JSD}(\mathrm{head}_i^L, \mathrm{head}_i^{L-1})$. Based on 8,192 held-out examples using the pre-trained BERT-Base with Informer (see Section 4.1.1). Distributions are color-coded based on the median of JSD: RED (median > 0.75), YELLOW (0.25 ≤ median ≤ 0.75), BLUE (median < 0.25). I.e., colder color means more “similar” attention heads across adjacent layers.

How much do attention heads in layer $L$ resemble those in layer $L-1$?

Since Informer uses a residual attention scheme, it is interesting to show to what extent an attention head is “relying on” the corresponding head in the previous layer. To this end, we take each of the three pre-trained BERT-Base models in Table 2 and compute the Jensen-Shannon Divergence (JSD) between attention probabilities in each pair of vertically adjacent heads, i.e., $\mathrm{JSD}(\mathrm{head}_i^L, \mathrm{head}_i^{L-1})$, for every layer $L$ above the first and every head index $i$. Instead of computing one scalar value for each head pair, we show the full distribution based on the tokens in 8,192 held-out examples, i.e., each data point is the JSD between the attention probabilities of a token at these two heads.
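As a sketch of the per-token measurement, the code below computes the Jensen-Shannon Divergence between two attention distributions over the same keys, e.g., those of one token at the same head index in two adjacent layers. The base-2 logarithm is an assumption (it bounds JSD in [0, 1]); the text does not state the base, and the example distributions are made up.

```python
import math

def kl_divergence(p, q, base=2.0):
    """KL(p || q); terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0.0)

def jensen_shannon_divergence(p, q, base=2.0):
    """Symmetric divergence between two distributions via their mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m, base) + 0.5 * kl_divergence(q, m, base)

lower_head = [0.7, 0.2, 0.1, 0.0]   # hypothetical attention of a token in layer L-1
upper_head = [0.6, 0.3, 0.1, 0.0]   # hypothetical attention of the same token in layer L
disjoint_a = [1.0, 0.0]
disjoint_b = [0.0, 1.0]

print(jensen_shannon_divergence(lower_head, upper_head))
print(jensen_shannon_divergence(disjoint_a, disjoint_b))
```

Identical distributions give a JSD of zero, and distributions with disjoint support give the maximum value (1 under base 2), so lower values indicate that the two heads attend more similarly.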

Figure 5 and Figure 8 in Appendix demonstrate the JSD distributions for Informer and Post-LN Transformer respectively via violin plots. (Note that JSD results from Post-LN are used only as a reference. We expect them to be “random” because there is no correspondence between heads in adjacent layers for Post-/Pre-LN: an equivalent Post-/Pre-LN model can be constructed by permuting the order of attention heads in a layer, together with the corresponding variables.) We can see that Informer tends to have significantly lower JSD values, especially for heads in middle layers. This might mean that Informer has some regularization advantages and provides one hypothesis for why it tends to outperform Post-LN more for larger models. Note that $\mathrm{head}_i^L$ can still be useful even if it has exactly the same attention probabilities as $\mathrm{head}_i^{L-1}$ because of the existence of the FFN sub-layer and the potential differences in value matrices (i.e., $V'$ in Eq. 2).

Can dropout outperform residual attention for regularizing large models?

One may wonder whether increasing the dropout rate can already regularize large models well enough that residual attention is not necessary. To this end, we experiment with different dropout rates for pre-training BERT-Large with different Transformers (following the procedures in Section 4.1.1). Results are collected in Table 6, from which we can see that (1) Informer outperforms the two baselines across all dropout settings, and (2) simply increasing the dropout rate cannot regularize Transformer models as well as residual attention appears to.

Dropout Post-LN Pre-LN Informer
0% 71.16% 69.80% 71.30%
10% 73.64% 73.21% 73.94%
20% 73.21% 72.97% 73.66%
Table 6: Masked Language Modeling (MLM) accuracy on development set of BERT-Large with different dropout rates. When dropout rate is 0%, we report the best possible number with early-stop because all models start to overfit at around 500K steps.

5 Conclusions

We propose Informer, a novel and simple Transformer architecture based on the idea of residual attention. We show that it can be used as a drop-in replacement of Transformer in BERT. Quantitatively, it considerably outperforms two canonical Transformer architectures, Post-LN and Pre-LN, on both pre-training and downstream tasks including GLUE and SQuAD. Furthermore, on downstream tasks, Informer even outperforms baselines pre-trained with twice the amount of epochs. Qualitatively, we show that Informer tends to have comparatively sparser attentions, both within heads and across heads in adjacent layers. Finally, we show that Informer can benefit significantly from hyper-parameter tuning, though the main results in this paper are not based on it.

References


  • D. Adiwardana, M. Luong, D. R. So, J. Hall, N. Fiedel, R. Thoppilan, Z. Yang, A. Kulshreshtha, G. Nemade, Y. Lu, et al. (2020) Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977. Cited by: §1.
  • J. Ainslie, S. Ontanon, C. Alberti, P. Pham, A. Ravula, and S. Sanghai (2020) ETC: encoding long and structured data in transformers. arXiv preprint arXiv:2004.08483. Cited by: §2.
  • I. Beltagy, M. E. Peters, and A. Cohan (2020) Longformer: the long-document transformer. arXiv preprint arXiv:2004.05150. Cited by: §2.
  • R. Child, S. Gray, A. Radford, and I. Sutskever (2019) Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509. Cited by: §2, §4.1.
  • K. Clark, M. Luong, Q. V. Le, and C. D. Manning (2020) Electra: pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555. Cited by: §2.
  • Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov (2019) Transformer-xl: attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860. Cited by: §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §1, §2, §3.1, §4.1, §4.1.2, §4.1.2, §4.1.2, §4.1, Table 4, footnote 8.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016a) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016b) Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Cited by: §1.
  • M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer (2017) Triviaqa: a large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551. Cited by: §4.1.2.
  • N. Kitaev, Ł. Kaiser, and A. Levskaya (2020) Reformer: the efficient transformer. arXiv preprint arXiv:2001.04451. Cited by: §2.
  • Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2019) Albert: a lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942. Cited by: §1, §2.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized bert pretraining approach. arXiv. Cited by: §1, §2.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. OpenAI Blog. Cited by: §2.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog. Cited by: §1, §4.1.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2019) Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683. Cited by: §2.
  • P. Rajpurkar, R. Jia, and P. Liang (2018) Know what you don’t know: unanswerable questions for squad. arXiv preprint arXiv:1806.03822. Cited by: 3rd item.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250. Cited by: 3rd item, §4.1.2.
  • H. Ramsauer, B. Schäfl, J. Lehner, P. Seidl, M. Widrich, L. Gruber, M. Holzleitner, M. Pavlović, G. K. Sandve, V. Greiff, et al. (2020) Hopfield networks is all you need. arXiv preprint arXiv:2008.02217. Cited by: §4.1.3.
  • A. Roy, M. Saffar, A. Vaswani, and D. Grangier (2020) Efficient content-based sparse attention with routing transformers. arXiv preprint arXiv:2003.05997. Cited by: §2.
  • M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro (2019) Megatron-lm: training multi-billion parameter language models using gpu model parallelism. arXiv preprint arXiv:1909.08053. Cited by: §1.
  • Y. Sun, S. Wang, Y. Li, S. Feng, H. Tian, H. Wu, and H. Wang (2019) ERNIE 2.0: a continual pre-training framework for language understanding. arXiv preprint arXiv:1907.12412. Cited by: §2.
  • Y. Tay, M. Dehghani, D. Bahri, and D. Metzler (2020) Efficient transformers: a survey. arXiv preprint arXiv:2009.06732. Cited by: §2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1, §2, §3.1, §3.1, §3.1.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2018) Glue: a multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461. Cited by: 3rd item, §4.1.2, §4.1.2.
  • S. Wang, B. Li, M. Khabsa, H. Fang, and H. Ma (2020) Linformer: self-attention with linear complexity. arXiv preprint arXiv:2006.04768. Cited by: §2.
  • R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang, and T. Liu (2020) On layer normalization in the transformer architecture. arXiv preprint arXiv:2002.04745. Cited by: §4.1.3.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le (2019) Xlnet: generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, pp. 5753–5763. Cited by: §1, §2.
  • M. Zaheer, G. Guruganesh, A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, et al. (2020) Big bird: transformers for longer sequences. arXiv preprint arXiv:2007.14062. Cited by: §2.

Appendix A Appendices

Figure 6: Distribution of entropies of the attention probabilities of the tokens of 8,192 held-out examples using the pre-trained BERT-Base with Post-LN Transformer (see Section 4.1.1). Attention heads in each layer are ordered by their medians of entropies for better legibility. Distributions are color-coded based on the median of entropies: RED (median > 4.5), YELLOW (1.5 ≤ median ≤ 4.5), BLUE (median < 1.5). I.e., colder colors mean sparser attentions. Note that here top layers (layer 9-11) tend to have larger entropies compared to Informer, which means that attentions are relatively denser.
Figure 7: Distribution of entropies of the attention probabilities of the tokens of 8,192 held-out examples using the pre-trained BERT-Base with Pre-LN Transformer (see Section 4.1.1). Attention heads in each layer are ordered by their medians of entropies for better legibility. Distributions are color-coded based on the median of entropies: RED (median > 4.5), YELLOW (1.5 ≤ median ≤ 4.5), BLUE (median < 1.5). I.e., colder colors mean sparser attentions. Note that here top layers (layer 9-11) tend to have larger entropies compared to Informer, which means that attentions are relatively denser.
Figure 8: Distribution of Jensen-Shannon Divergence (JSD) of attention probabilities in (vertically) adjacent attention heads, i.e., $\mathrm{JSD}(\mathrm{head}_i^L, \mathrm{head}_i^{L-1})$. Based on 8,192 held-out examples using the pre-trained BERT-Base with Post-LN Transformer (see Section 4.1.1). Distributions are color-coded based on the median of JSD: RED (median > 0.75), YELLOW (0.25 ≤ median ≤ 0.75), BLUE (median < 0.25). I.e., colder color means more “similar” attention heads across adjacent layers.