The transformer layer (Vaswani et al., 2017)
is currently the primary modeling component in natural language processing, playing a lead role in recent innovations such as BERT (Devlin et al., 2019)
and GPT-2 (Radford et al., 2019). Each transformer layer consists of a self-attention sublayer (s) followed by a feedforward sublayer (f), creating an interleaving pattern of self-attention and feedforward sublayers (sfsfsf) throughout a multilayer transformer network. To the best of our knowledge, there is no a priori reason to expect this particular pattern to be optimal. We conduct a series of explorations to obtain insights about the nature of transformer orderings that work well, and based on this, we design a new transformer ordering pattern that improves upon the baseline.
First, we generate random transformer models, varying the number of each type of sublayer and their ordering, while keeping the number of parameters constant. We train these models on the standard WikiText-103 language modeling benchmark (Merity et al., 2016), and observe that some of these random models outperform the original interleaved transformer network, even when the number of self-attention and feedforward sublayers is not equal. Our analysis shows that models with more self-attention toward the bottom and more feedforward sublayers toward the top tend to perform better in general.
Based on this insight, we design a new family of transformer models that follow a distinct sublayer ordering pattern: sandwich transformers (Figure 3). Our experiments demonstrate that a sandwich transformer outperforms the baseline of Baevski and Auli (2018) by 0.44 perplexity. This result is made more interesting by the fact that our sandwich transformer is simply a reordering of the sublayers in the baseline model, and does not require more parameters, memory, or training time.
Each transformer layer consists of a self-attention sublayer followed by a feedforward sublayer, modifying a sequence of vectors X_0 as follows:¹

X_1 = s(X_0) + X_0
X_2 = f(X_1) + X_1

¹ We omit dropout (Srivastava et al., 2014) and layer normalization (Ba et al., 2016) to simplify the notation.
Stacking multiple transformer layers creates an interleaved network of sublayers. We denote these models as strings, with s and f representing self-attention and feedforward sublayers, respectively. A three-layer transformer network, for example, would be denoted sfsfsf, with the flow of computation moving from input on the left to output on the right. Thus, any string in the regular language (s|f)* defines a valid network that uses the same building blocks as the original transformer. For simplicity, we refer to these alternatives as transformers as well.
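To make this notation concrete, the following is a minimal sketch of how a sublayer string such as sfsfsf defines a network. The toy linear maps, dimensions, and weights here are purely illustrative stand-ins for real multi-head attention and feedforward sublayers; only the residual-connection structure matches the notation above.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy model dimension (illustrative only)

# Toy stand-ins for the two sublayer types; a real model would use
# multi-head self-attention and a two-layer MLP here.
W_s = rng.standard_normal((d, d)) * 0.1
W_f = rng.standard_normal((d, d)) * 0.1

def s(X):  # placeholder "self-attention" sublayer
    return X @ W_s

def f(X):  # placeholder "feedforward" sublayer
    return X @ W_f

def apply_network(order, X):
    """Apply sublayers left to right, each with a residual connection:
    X <- sublayer(X) + X."""
    sublayers = {"s": s, "f": f}
    for c in order:
        X = sublayers[c](X) + X
    return X

X0 = rng.standard_normal((4, d))   # a sequence of 4 input vectors
out = apply_network("sfsfsf", X0)  # a three-layer interleaved transformer
assert out.shape == X0.shape
```

Any string over {s, f} can be passed to apply_network, which is what makes alternative orderings easy to enumerate and compare.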
3 Random Search
We conduct a series of experiments to understand which transformer networks work well and whether particular architectural patterns can improve performance. First, we generate random transformer models while keeping constant the number of parameters. We then train these random models to determine whether the interleaving pattern (sfsfsf ) is optimal (Section 3.1), and whether balancing the number of self-attention and feedforward sublayers is desirable (Section 3.2). Finally, we analyze additional properties of these random models, and find that those with more self-attention at the beginning and more feedforward sublayers near the end tend to outperform the standard interleaved model (Section 3.3).
Our baseline is the strong transformer language model of Baevski and Auli (2018), trained on WikiText-103 (Merity et al., 2016).² This model contains 16 transformer layers of dimension d = 1024, with multi-head self-attention sublayers and feedforward sublayers with an inner dimension of 4d = 4096. In this setting, each self-attention sublayer contains 4d² parameters, while each feedforward sublayer contains 8d² parameters (excluding bias terms, which have a marginal contribution). Thus, each f sublayer contains twice the parameters of an s sublayer, following the parameter ratio between self-attention and feedforward sublayers described in Vaswani et al. (2017).

² WikiText-103 contains roughly 103 million tokens from English Wikipedia, split into train, development, and test sets by article.
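The 2:1 parameter ratio can be verified with a quick count. This is a sketch assuming d = 1024 and an inner feedforward dimension of 4d (the common convention, consistent with the stated ratio), ignoring bias terms:

```python
d = 1024  # assumed model dimension

def attn_params(d):
    # Q, K, V, and output projections: four d x d matrices.
    return 4 * d * d

def ffn_params(d, inner=4):
    # Two matrices: d x (inner*d) and (inner*d) x d.
    return 2 * d * (inner * d)

assert ffn_params(d) == 2 * attn_params(d)  # each f costs twice an s
```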
All of our experiments use the same hyperparameters as Baevski and Auli’s original model. To set an accurate baseline, we train the baseline model (the standard interleaved transformer stack) with five different random seeds, achieving 18.65 ± 0.24 perplexity on the development set. Unless otherwise mentioned, we do not modify the random seed in the other experiments.
3.1 Is Interleaving Optimal?
In the baseline 16-layer transformer model, 16 sublayers of each type are interleaved. Can we improve model performance by simply rearranging them? We thus generate 20 random transformer models with 16 self-attention sublayers and 16 feedforward sublayers, randomly permuted, and train these models from scratch, without modifying any of the hyperparameters.
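The sampling procedure here amounts to shuffling a fixed multiset of sublayers. A minimal sketch (a hypothetical helper, not the authors' released code):

```python
import random

def random_balanced_model(n=16, seed=0):
    """Sample a random ordering of n self-attention (s) and n
    feedforward (f) sublayers, keeping the counts balanced."""
    rng = random.Random(seed)
    sublayers = list("s" * n + "f" * n)
    rng.shuffle(sublayers)
    return "".join(sublayers)

model = random_balanced_model()
assert len(model) == 32
assert model.count("s") == 16 and model.count("f") == 16
```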
Figure 4 shows that 7 of the 20 randomly permuted models perform at least as well as the interleaved baseline’s average performance, with the best model achieving a perplexity below the baseline average (full results are in Table 2 in the appendix). While the average performance of the baseline model beats the average performance of these random models, the fact that a third of our random models outperformed the average baseline suggests that a better ordering than interleaving probably exists.
3.2 Are Balanced Stacks Better?
Is it necessary to have an identical number of sublayers of each type, or could models with more self-attention (or more feedforward) sublayers yield better results? To find out, we generate 20 unbalanced transformer models by randomly selecting one sublayer at a time (either s or f with equal probability) until the parameter budget is exhausted. Since a feedforward sublayer contains double the parameters of a self-attention sublayer, the networks’ depth is not necessarily 32 sublayers as before and can range from 24 (all f) to 48 (all s).
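This sampling procedure can be sketched as follows. Measuring the budget in s-units is our own illustrative convention (an s sublayer costs 1 unit, an f costs 2, so the balanced 16+16 stack costs 48 units); the helper is hypothetical, not the authors' code.

```python
import random

def random_unbalanced_model(budget=48, seed=0):
    """Sample sublayers (s or f with equal probability) until the
    parameter budget is exhausted."""
    rng = random.Random(seed)
    cost = {"s": 1, "f": 2}
    model, spent = [], 0
    while spent < budget:
        c = rng.choice("sf")
        if spent + cost[c] > budget:
            continue  # an f would overshoot; only an s still fits
        model.append(c)
        spent += cost[c]
    return "".join(model)

m = random_unbalanced_model()
assert m.count("s") + 2 * m.count("f") == 48  # budget exactly spent
assert 24 <= len(m) <= 48  # all-f gives 24 sublayers, all-s gives 48
```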
Figure 5 shows that four of the generated unbalanced models outperform the average baseline transformer (full results are in Table 3 in the appendix). The best performing random model reaches a perplexity of 18.12 and has 12 self-attention and 18 feedforward sublayers. Both the average and the median perplexities of this sample of unbalanced models are worse than those of the balanced permuted models in Section 3.1. We do not observe any preference for more sublayers of one type over the other; there are self-attention-heavy and feedforward-heavy models in both the top five and the bottom five of the results table. While offering no guarantees – given the small sample sizes and fixed hyperparameters – we take from the above explorations that balancing the number of self-attention and feedforward sublayers appears to be a desirable property, though not a necessary one.
3.3 Attention First, Feedforward Later
So far, it is not clear which characteristics make one transformer model more successful than another; for example, measuring the number of times each sublayer type appears in the network does not reveal any strong correlation with performance. However, analyzing the bottom (or top) half of the network in isolation reveals an interesting property.
We first split the models into those that perform better than the average baseline and those that do not. We then slice each one of the previously generated random models in half by parameter count (e.g., ssssff would be split into ssss and ff, since every f contains twice as many parameters as an s), and count how many sublayers of each type appear in each slice.
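The parameter-count split can be sketched as follows, again weighting an s sublayer as 1 unit and an f as 2 (a hypothetical helper for illustration):

```python
def split_by_params(model):
    """Split a sublayer string into bottom and top halves of (roughly)
    equal parameter count: s costs 1 unit, f costs 2."""
    cost = {"s": 1, "f": 2}
    half = sum(cost[c] for c in model) / 2
    spent = 0
    for i, c in enumerate(model):
        if spent >= half:
            return model[:i], model[i:]
        spent += cost[c]
    return model, ""

bottom, top = split_by_params("ssssff")
assert (bottom, top) == ("ssss", "ff")  # 4 units on each side
```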
Figure 8 shows that models that outperform the average baseline tend to have more self-attention sublayers (s) in the first (bottom) half of the network and more feedforward sublayers (f) in the second (top) half. While we do not have a good hypothesis to explain this phenomenon, we can exploit it to improve transformers (Section 4).
4 Designing a Better Transformer
Our analysis in the previous section motivates designing a transformer model that is heavy on self-attention at the bottom and feedforward sublayers at the top, while at the same time containing a more-or-less balanced amount of both sublayer types. As a first attempt to manually design a better transformer, we take this hypothesis to the extreme, and train a transformer model of 16 self-attention sublayers followed by 16 feedforward sublayers (s^16 f^16). This model achieves 18.82 perplexity, which is comparable to the performance of the baseline with the same number of parameters.
We next generalize this model and the original interleaved transformer, creating the family of sandwich transformers. A sandwich transformer consists of 2n sublayers in total (n of each type), conforming to the regular expression s^k (sf)^(n−k) f^k. The first k sublayers are purely self-attention (s), while the last k are feedforward sublayers (f). In between, we use the original interleaving pattern (sf) to fill the remaining 2(n−k) sublayers. When k = 0, we get the original transformer stack, and when k = n − 1 (its maximal value) we get the previously mentioned s^n f^n model. We refer to k as the transformer’s sandwich coefficient.
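Concretely, a sandwich pattern with coefficient k places k self-attention sublayers first, k feedforward sublayers last, and interleaves the rest. A minimal sketch (hypothetical helper):

```python
def sandwich(n, k):
    """Build the sandwich ordering s^k (sf)^(n-k) f^k for a stack of
    2n sublayers (n of each type) with sandwich coefficient k."""
    assert 0 <= k <= n - 1
    return "s" * k + "sf" * (n - k) + "f" * k

assert sandwich(3, 0) == "sfsfsf"  # k = 0: the interleaved baseline
assert sandwich(3, 2) == "sssfff"  # k = n - 1: fully separated halves
m = sandwich(16, 6)
assert m.count("s") == 16 and m.count("f") == 16
```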
We train sandwich transformers for n = 16 (to remain within the same parameter budget as our baseline language model) and all values of k. Figure 9 shows the transformer’s performance as a function of the sandwich coefficient k. With the exception of the models with the largest sandwich coefficients, all sandwich transformers achieve lower perplexities than the average baseline transformer. Of those, 6 models outperform the best baseline transformer. The best performance of 17.84 perplexity is obtained when k = 6.
We then take our best model and compare it to the best baseline (selected via development set perplexity) on WikiText-103’s test set and find that our sandwich transformer indeed outperforms the original transformer by 0.44 perplexity (Table 1). To check whether this advantage is consistent, we train 4 more sandwich models with different random seeds (5 in total) and evaluate them on the development set (to avoid running more than once on the test set). Figure 10
compares the distribution of sandwich transformer perplexities to the baseline’s; we obtain a mean perplexity value of 17.98 with a standard deviation of 0.10, while the baseline achieves 18.65 ± 0.24 perplexity.
Despite its simple and even heuristic design, the sandwich transformer consistently outperforms the standard interleaved transformer. This improvement comes at no extra cost in parameters, data, memory, or computation.
5 Related Work
Neural Architecture Search
In this paper, we manually searched through a constrained transformer architecture space, after analyzing the results of small-scale random searches. This human-in-the-loop approach to architecture search has advantages over previous methods (Jozefowicz et al., 2015; Zoph and Le, 2016; Tan and Le, 2019), since it requires training only a few dozen models, unlike typical architecture search methods that require training thousands, consuming massive computational resources.
While we do find a better performing transformer, our goal is not only to do so, but to better understand how sublayer ordering affects transformer models. Future work could apply methods from the architecture space literature to the sublayer ordering problem. Furthermore, a better understanding of the inner workings of transformers could inspire more efficient, constrained architecture search.
Unlike recent papers that tried to improve the transformer by modifying its sublayers (So et al., 2019; Guo et al., 2019; Zhang et al., 2019; Correia et al., 2019), in this paper we do not modify the sublayers at all, but simply rearrange their order. The performance gains from sublayer reordering are orthogonal to improvements of the sublayers themselves, and the two could be combined to achieve even better performance.
6 Conclusion
We train random transformer models with reordered sublayers, and find that some perform better than the baseline interleaved transformer in language modeling. We observe that, on average, better models contain more self-attention sublayers at the bottom and more feedforward sublayers at the top. This leads us to design a new transformer stack, the sandwich transformer, which consistently improves performance over the baseline at no extra cost.
Appendix A Appendix
This section contains detailed results of the experiments described in Section 3. Table 2 shows the performance of permuted but balanced transformers (Section 3.1). Table 3 shows the performance of unbalanced transformers (Section 3.2).