where $\gamma$ and $\beta$ are trainable parameters, and $\epsilon$ is a small constant. Recent work has observed that Post-LN transformers tend to have larger magnitude gradients in later layers compared to earlier layers (Xiong et al., 2020) and has advocated moving the LayerNorm operation to the beginning of each sublayer (“Pre-LN”; see Figure 1, left), i.e.:

$$x_{l+1} = x_l + \mathrm{Sublayer}(\mathrm{LN}(x_l))$$

rather than the Post-LN ordering $x_{l+1} = \mathrm{LN}(x_l + \mathrm{Sublayer}(x_l))$.
In practice Pre-LN transformers can be trained with larger learning rates, shorter learning rate warmup and often yield improved performance compared to Post-LN transformers (Xiong et al., 2020), so most recent, large pretrained language models tend to use Pre-LN transformers (Baevski and Auli, 2019; Radford et al., 2019; Raffel et al., 2020; Brown et al., 2020; Lieber et al., 2021).
In this work we show that, while Pre-LN improves stability over Post-LN, it has the opposite side effect: gradients at earlier layers tend to be larger than gradients at later layers. We propose NormFormer, which alleviates the gradient magnitude mismatch by adding 3 normalization operations to each layer (see Figure 1, middle). These operations reduce gradients to early layers and increase gradients to later layers, bringing their magnitudes closer together.
Compared to compute-matched, well-tuned Pre-LN baselines, NormFormer models reach target pretraining perplexities faster and achieve better pretraining perplexities and downstream task performance.
The rest of this paper is organized as follows: Section 2 describes the proposed modifications, Section 3 shows pretraining and downstream task performance for fully trained NormFormer models against well-tuned, compute-matched baselines. Section 4 shows the gradient mismatch introduced by Pre-LN and how NormFormer alleviates it. Section 4.2 analyzes residual scaling, a related technique proposed to stabilize Post-LN architectures (Xiong et al., 2020; Zhu et al., 2021). Section 5 shows that removing any of the added operations degrades performance and that NormFormer improves over the baseline at a wide range of hyperparameter configurations.
NormFormer includes three modifications to the Pre-LN transformer: First, we apply head-wise scaling inside the attention module and add two additional LayerNorm operations: one after the attention module and a second after the first fully connected layer. The modifications introduce a small number of additional learnable parameters, which provide a cost-effective way for each layer to change the magnitude of its features, and therefore the magnitude of the gradients to subsequent components. The changes are visualized in Figure 1 and described below.
Scaling Attention Heads
The standard multi-head attention operation is defined as:

$$\mathrm{MultiHeadAttention}(Q, K, V) = \mathrm{Concat}(h_1, \ldots, h_n)\, W^O, \qquad h_i = \mathrm{softmax}\!\left(\frac{Q W_i^Q \,(K W_i^K)^\top}{\sqrt{d_k}}\right) V W_i^V$$

where $n$ is the number of heads, $i$ is the attention head index, $d_k$ is the dimensionality of the keys, and $W^O, W_i^Q, W_i^K, W_i^V$ are learned projection matrices for the output, query, key and value, respectively.
We propose scaling the output of each attention head via learned scalar coefficients $\gamma_i$:

$$\mathrm{MultiHeadAttention}_{\gamma}(Q, K, V) = \mathrm{Concat}(\gamma_1 h_1, \ldots, \gamma_n h_n)\, W^O$$

where $\gamma_i$ are learnable parameters initialized to 1.
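As an illustration, head-wise scaling can be implemented as a small module applied to the per-head outputs before they are concatenated and projected by $W^O$. The sketch below is our own PyTorch rendering of the idea; the module name and tensor layout are our assumptions, not the released fairseq implementation:

```python
import torch
import torch.nn as nn

class HeadScale(nn.Module):
    """Scale each attention head's output by a learned scalar (initialized to 1)."""
    def __init__(self, num_heads: int):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(num_heads))

    def forward(self, head_outputs: torch.Tensor) -> torch.Tensor:
        # head_outputs: (batch, num_heads, seq_len, head_dim), i.e. before the
        # heads are concatenated and projected by W^O.
        return head_outputs * self.gamma.view(1, -1, 1, 1)
```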
Additional Layer Normalization and Putting it All Together
In the Pre-LN transformer, each layer $l$ modifies an input $x_l$ as follows:

$$x_{l+1} = \mathrm{FF}(\mathrm{MHA}(x_l)),$$
$$\mathrm{MHA}(x) = x + \mathrm{MultiHeadAttention}(\mathrm{LN}(x), \mathrm{LN}(x), \mathrm{LN}(x)), \qquad \mathrm{FF}(x) = x + \sigma(\mathrm{LN}(x)\, W_1)\, W_2$$

In this work $\sigma$ is the GELU non-linear activation introduced in Hendrycks and Gimpel (2016).
Our overall method, NormFormer, instead modifies each input $x_l$ as:

$$x_{l+1} = \mathrm{FF}(\mathrm{MHA}(x_l)),$$
$$\mathrm{MHA}(x) = x + \boldsymbol{\mathrm{LN}}\!\left(\mathrm{MultiHeadAttention}_{\boldsymbol{\gamma}}(\mathrm{LN}(x), \mathrm{LN}(x), \mathrm{LN}(x))\right), \qquad \mathrm{FF}(x) = x + \boldsymbol{\mathrm{LN}}\!\left(\sigma(\mathrm{LN}(x)\, W_1)\right) W_2$$

where bolded operations are newly introduced and $\mathrm{MultiHeadAttention}_{\gamma}$ denotes the head-scaled attention defined above.
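To make the placement of the new operations concrete, the following sketch writes out one NormFormer decoder layer in PyTorch, reusing the HeadScale module above. It is a minimal illustration under our own naming and layout assumptions (single QKV projection, causal masking, no dropout), not the exact fairseq module structure:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormFormerLayer(nn.Module):
    """One causal transformer layer with the NormFormer additions (marked NEW)."""
    def __init__(self, d_model: int, num_heads: int, d_ffn: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.attn_ln = nn.LayerNorm(d_model)        # standard Pre-LN
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.head_scale = HeadScale(num_heads)      # NEW: per-head scaling (see above)
        self.out_proj = nn.Linear(d_model, d_model)
        self.post_attn_ln = nn.LayerNorm(d_model)   # NEW: LN after the attention module
        self.ffn_ln = nn.LayerNorm(d_model)         # standard Pre-LN
        self.fc1 = nn.Linear(d_model, d_ffn)
        self.post_fc1_ln = nn.LayerNorm(d_ffn)      # NEW: LN after the first fully connected layer
        self.fc2 = nn.Linear(d_ffn, d_model)

    def attention(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, seq, head_dim)
        q, k, v = (z.view(b, t, self.num_heads, self.head_dim).transpose(1, 2) for z in (q, k, v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        causal = torch.triu(torch.ones(t, t, device=x.device), diagonal=1).bool()
        scores = scores.masked_fill(causal, float("-inf"))
        heads = F.softmax(scores, dim=-1) @ v
        heads = self.head_scale(heads)              # NEW: scale each head before W^O
        return self.out_proj(heads.transpose(1, 2).reshape(b, t, d))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.post_attn_ln(self.attention(self.attn_ln(x)))  # NEW post-attention LN
        h = F.gelu(self.fc1(self.ffn_ln(x)))
        return x + self.fc2(self.post_fc1_ln(h))                    # NEW LN between fc1 and fc2
```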
Causal Language Models
We pretrain causal LMs (CLM) that roughly match the “Small” (125M parameter), “Medium” (355M), “Large” (1.3B) and “XL” (2.7B) sizes from Brown et al. (2020).
Our model architecture differs from Brown et al. (2020) in two ways: (1) we use only dense attention, while they alternate between dense and locally banded sparse attention; (2) we train our models with sinusoidal positional embeddings, following Shortformer (Press et al., 2020a), since early experiments found this to produce comparable results with fewer learned parameters.
We train the baseline models for 300 billion tokens. We train NormFormer models for an equivalent number of GPU hours, which typically results in 2-6% fewer steps and tokens due to the additional overhead of the normalization operations.
| Model Size | GPT-3 Paper | Baseline | NormFormer |
On our dataset, we find that the learning rates proposed in GPT-3 are suboptimally low (the difference in optimal learning rates may be due partly to architectural differences between our baseline and GPT-3, e.g., not using locally banded sparse attention). For both baseline and NormFormer at each size besides 2.7B, we tune the learning rate by training models for 50,000 steps and selecting the best performing learning rate among a set of candidate values. The learning rates we obtained from this process, shown in Table 1, are 3-5 times larger than those used in the GPT-3 paper. Additionally, we have verified that both the baseline and NormFormer perform worse at the full training budget with the GPT-3 learning rates than with the higher learning rates. Other hyperparameters do not differ from GPT-3 (see Table 2.1 in Brown et al. (2020)).
Standard Post-LN transformers simply sum the previous output (residual) with the new output. Recent work attempts to stabilize Post-LN architectures by weighting the residual connection for each layer (Zhu et al., 2021; Liu et al., 2020). We thus experiment with scaling the residual in each embedding dimension via learned coefficients $\lambda$:

$$x_{l+1} = \lambda \odot x_l + \mathrm{Sublayer}(\mathrm{LN}(x_l))$$

where $\odot$ is elementwise multiplication, and $\lambda \in \mathbb{R}^{d_{\mathrm{model}}}$ are learned parameters initialized to 1. While this can be applied at any residual connection, we find it most effective at the feedforward network (FFN) sub-layer for the smaller sized language models. In this setting,

$$\mathrm{FF}(x) = \lambda \odot x + \mathrm{LN}\!\left(\sigma(\mathrm{LN}(x)\, W_1)\right) W_2.$$
For 1.3B parameter models and larger, scaling residuals hurts performance (see discussion in Section 4.2), so ResScale is not used in our 1.3B and 2.7B CLM results.
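For reference, a minimal sketch of ResScale applied to the FFN sub-layer might look as follows; the module name is ours, and $\lambda$ is a learned per-dimension vector initialized to ones:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledResidualFFN(nn.Module):
    """FFN sub-block whose residual connection is scaled elementwise by a learned vector."""
    def __init__(self, d_model: int, d_ffn: int):
        super().__init__()
        self.ffn_ln = nn.LayerNorm(d_model)
        self.fc1 = nn.Linear(d_model, d_ffn)
        self.post_fc1_ln = nn.LayerNorm(d_ffn)
        self.fc2 = nn.Linear(d_ffn, d_model)
        self.res_scale = nn.Parameter(torch.ones(d_model))  # lambda, initialized to 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = F.gelu(self.fc1(self.ffn_ln(x)))
        return self.res_scale * x + self.fc2(self.post_fc1_ln(h))
```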
Large Scale Experiments
We also train three large-scale models with 2.7B parameters. Our first baseline is a replicated version of GPT-3-2.7B with GELU activations, the published learning rate (1.6e-4) and the same number of training steps and tokens (286K steps; 300B tokens). This model slightly exceeds the reference zero shot performance (Brown et al., 2020). Next, we train two variants of GPT3-2.7B with squared-ReLU activations (So et al., 2021), but use slightly fewer training steps (20% less) for compute efficiency. The first of these uses the baseline learning rate (1.6e-4) and the second is NormFormer-2.7B trained with a higher learning rate of 6e-4. We note that training baseline 2.7B CLMs (i.e., without NormFormer modifications) with the higher 6e-4 learning rate diverged and failed to train. However, as opposed to the smaller architectures, we did not exhaustively tune the learning rate, so it is possible that an intermediate value would perform better.
Zero Shot Evaluation
In addition to validation perplexity, we evaluate CLMs on a subset of the tasks that GPT3 evaluated on in a zero-shot setting (Brown et al., 2020), with the same prompts. We select WinoGrande (Sakaguchi et al., 2020), StoryCloze (Mostafazadeh et al., 2016), OpenBookQA (Mihaylov et al., 2018), HellaSwag (Zellers et al., 2019) and PIQA (Bisk et al., 2020) because GPT3 showed strong performance on these tasks at small scale, as well as consistently improving performance with scale.
Masked Language Models (MLM)
We adopt the RoBERTa-base, Pre-LN architecture and hyperparameters used in Liu et al. (2019). For the baseline, we pretrain for 2 million batches of 1 million tokens. NormFormer runs through 1.92 million batches in the same amount of time.
We pretrain all models on a collection of English language text including the English portion of the CC100 corpus (Conneau et al., 2020) as well as the data from Liu et al. (2019), consisting of BookCorpus (Zhu et al., 2019), English Wikipedia and filtered subsets of Common Crawl. We encode our data with the byte-level Byte Pair Encoding (BPE) vocabulary from Liu et al. (2019), originally introduced in Radford et al. (2019). The combined dataset contains around 450GB of uncompressed text and 110B BPE tokens. We hold out 40M BPE tokens from this data as a validation set on which we report pretraining perplexities.
We train our causal and masked language models in fairseq (Ott et al., 2019), built on PyTorch (Paszke et al., 2019). Although NormFormer introduces fewer than 0.07% additional parameters, it slows individual training updates and increases memory usage by between 2% (2.7B model) and 6% (125M model), due to the FFN LNs. Accordingly, we compare NormFormer to baseline models trained for an equal amount of GPU time, i.e., controlling for compute rather than the number of training updates. Finally, we note that the HeadScale operation can be moved outside the self-attention module to allow the use of the very efficient PyTorch F.multihead_attention. This change reduces overhead without noticeable performance degradation.
| GPT3-125M (High LR) | 124.4 | 3e-3 | - | - | 572K | 21.09 | 35.3 | 67.5 | 50.5 | 66.3 | 35.0 | 50.9 |
| GPT3-355M (High LR) | 354.7 | 1e-3 | - | - | 572K | 14.85 | 48.4 | 71.7 | 53.8 | 73.3 | 43.4 | 58.1 |
| GPT3-1.3B (High LR) | 1313.5 | 6e-4 | - | - | 286K | 12.21 | 57.5 | 74.3 | 59.3 | 76.3 | 50.8 | 63.6 |
We report pretraining perplexities for CLMs and MLMs as a function of training wall-time (GPU days) in Figure 2. We observe that NormFormer trains significantly faster and achieves better validation perplexities for a given training compute budget. The blue stars mark the first validation step where NormFormer matches the baseline’s lowest perplexity, and show that NormFormer matches Pre-LN models while needing only 60% and 57% as much compute for CLM and MLM models, respectively. This is particularly impressive since NormFormer models take 2-6% longer for each training step and thus see less data than Pre-LN models in this comparison. The blue line on the left side of Figure 2 shows the failed attempt to add ResScale to NormFormer-1.3B.
We observe a similar trend on downstream tasks. In Table 2 we report zero shot accuracy for causal LMs using the tasks and prompts from Brown et al. (2020). NormFormer outperforms GPT-3 at all sizes. The gains from NormFormer’s extra parameters and operations outpace the gains expected from normal scaling laws. Changing the hidden dimension of a 125M parameter model from 768 to 780, for example, results in a 127 million parameter model that is only 0.08 perplexity better than the baseline, whereas NormFormer-125M adds only 100,000 parameters and is 0.83 perplexity better than the baseline.
For MLM models, we report fine-tuned accuracy on GLUE in Table 3. We again find that NormFormer MLM models outperform their Pre-LN counterparts on every task (rows 1 vs 2). Adding ResScale improves pre-training performance marginally (3.29 valid PPL vs 3.31), but the gains do not translate to finetuned performance.
4.1 Analysis of gradient norms by layer
We begin by examining the magnitude of the gradients at different layers for Post-LN, Pre-LN and NormFormer models, since large magnitude differences in gradients across layers can destabilize training, particularly when training in mixed precision (Micikevicius et al., 2018). Figure 3 shows the average L1 norm of the gradients to the second fully connected weight in various layers for a 12 layer, 125M parameter CLM model at the beginning of training. As reported in past work (Xiong et al., 2020), we observe that the gradients to later layers in Post-LN models are much larger than for earlier layers, and that the gradients to early layers quickly vanish in the early stages of training. Pre-LN models have the opposite behavior, with early layers instead receiving significantly larger gradients than later layers. NormFormer brings the average gradient norms closer together for different layers in the network.
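Per-layer gradient statistics like those in Figure 3 can be collected with a few lines of PyTorch after each backward pass; the parameter-name filter below assumes layers named as in the sketch in Section 2 and is only illustrative:

```python
def fc2_grad_norms(model):
    """Return the L1 norm of the gradient of each layer's second FFN weight.

    Call after loss.backward(); assumes modules named `fc2` as in our sketch.
    """
    norms = {}
    for name, param in model.named_parameters():
        if name.endswith("fc2.weight") and param.grad is not None:
            norms[name] = param.grad.abs().sum().item()
    return norms
```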
In Figure 4 we present the distribution of scaling parameters learned by NormFormer models. For the FFN LN, the parameters are smaller for earlier layers, reducing the magnitude of the inputs to early fully connected parameters, thereby decreasing the magnitude of their gradients. For the post-attention LN, shown in the middle of Figure 4, all layers have coefficients below 1, indicating downscaling. (The downscaling is also apparent in Figure 7 in the Appendix, which plots the change in grad norm for each operation at each layer: adding extra normalization reduces the gradient norm for all attention parameters at every layer, and only FFN parameters at later layers have increased gradient norms.) The HeadScale parameters, shown in the rightmost plot in Figure 4, vary more than the others and have no relationship with depth in the network. We interpret this as evidence that the HeadScale parameters dynamically increase the importance of well initialized attention heads, as suggested in Chen et al. (2021).
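The quantities plotted in Figure 4 are simply the learned gain and scaling parameters of each layer; assuming the module names from our earlier sketch (not the released code), they can be extracted as follows:

```python
def collect_scaling_params(model):
    """Gather per-layer scaling coefficients for plots like Figure 4 (assumes NormFormerLayer naming)."""
    ffn_ln_gain, post_attn_ln_gain, head_scales = [], [], []
    for layer in model.layers:
        ffn_ln_gain.append(layer.post_fc1_ln.weight.detach().mean().item())
        post_attn_ln_gain.append(layer.post_attn_ln.weight.detach().mean().item())
        head_scales.append(layer.head_scale.gamma.detach().tolist())
    return ffn_ln_gain, post_attn_ln_gain, head_scales
```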
One result of reducing the gradient mismatch, besides better perplexities and downstream task performance, is the ability to train stably with larger learning rates. To measure the stability of an architecture, we train it on a learning rate schedule with a very large peak learning rate, so that the learning rate increases a little each step until the loss explodes. Figure 5 shows that NormFormer models can survive for more updates in this environment than the baseline. For the baseline 125M model (the leftmost blue dot), the loss eventually explodes, with the activations from multiplying the query and key features at layer 0 overflowing the FP16 range. The downscaling of the attention outputs allows NormFormer to avoid this issue and remain stable with larger learning rates. Figure 5 also shows that removing the HeadScale operation reduces the stability improvement at all sizes.
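A stability test of this kind can be set up, for example, as a linear ramp toward an intentionally huge peak learning rate, counting how many updates complete before the loss stops being finite. The sketch below is our own illustration; the exact schedule, optimizer settings, and loss shape used in the paper may differ:

```python
import math
import torch
import torch.nn.functional as F

def updates_until_divergence(model, batches, base_lr=1e-7, peak_lr=1.0, ramp_steps=100_000):
    """Ramp the LR linearly toward an unreasonably large peak and report how long training survives."""
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)
    step = 0
    for step, (inputs, targets) in enumerate(batches):
        for group in optimizer.param_groups:  # increase the LR a little each step
            group["lr"] = base_lr + (peak_lr - base_lr) * min(step / ramp_steps, 1.0)
        logits = model(inputs)                # assumed shape: (batch, seq, vocab)
        loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
        if not math.isfinite(loss.item()):
            return step                       # updates survived before the loss exploded
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return step + 1
```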
4.2 Residual Scaling
By comparing adjacent NormFormer-125M and NormFormer-355M rows in Table 2 we can see that adding ResScale to NormFormer improves perplexity and zero shot performance for small scale CLMs. For the 125M parameter MLM, ResScale improves pre-training perplexity marginally, but hurts fine-tuned performance. At 1.3 billion parameter scale, however, adding ResScale to NormFormer does not improve performance (Figure 2). Although it is not included in our tables, we find that ResScale without NormFormer is stronger than the baseline at small scale, but not at large scale. This suggests that the negative result is caused by scale, rather than by an interaction with NormFormer.
Figure 6 shows the average ResScale ($\lambda$) weights at each layer of different sized CLMs. We can see that at 125M and 355M parameters, the weights in the later layers are lower, indicating down-weighting of the residual connection, whereas at the largest scale, 1.3B, the weights are larger deeper into the network.
Adding ResScale parameters to the other (earlier) residual connection in each layer, or using a scalar instead of a vector for each, does not fix the large scale issue, but hurts small scale performance marginally.
This section provides evidence that removing any of our additions to the transformer block degrades performance on language modeling tasks, and that our additions improve language modeling performance across a wide range of hyperparameter settings. Experiments use 125M parameter CLMs, and are run with the default hyperparameters given in Table 7 in the appendix for 470 V100 Hours (100,000 updates for the baseline) unless otherwise mentioned.
Removing any of the added operations hurts performance
Table 4 shows that none of the four introduced operations can be removed without degrading performance. Rows 2-5 remove each operation one at a time. In all cases perplexity increases, with the removal of HeadScale being the most damaging and the removal of the Post-Attn LN being the least damaging. In Row 6 (+ 3 More LN) we try to introduce more normalization inside self attention, applying LN to the query, key and value features in addition to our 3 other operations, for a total of 6 new operations. In this setting, every other parameterized operation inside the transformer layer is an LN. We find that this does not change perplexities at a fixed number of updates, but reduces training speed by another 5%. This result suggests that there is not much upside to adding even more normalization on top of NormFormer.
| Ablation | Valid PPL |
| - Post-Attn LN | 15.92 |
| - FFN LN | 16.14 |
| - Head Scale | 16.22 |
| - Res Scale | 16.20 |
| + 3 More LN | 15.88 |
Replacing the FFN LN with the FFNGeGlu proposed in Shazeer (2020), which includes scaling but no normalization, degraded performance in our 125M parameter CLM setting, the only place we tried it. We also find that the LN variant proposed in Raffel et al. (2020), which removes the bias and the mean subtraction from the normalization, performs as well as our LN and has fewer trainable parameters, but is about 2x slower than the FusedLayerNorm implementation we use. We therefore do not adopt it.
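For concreteness, the Raffel et al. (2020) normalization variant mentioned above, which drops both the bias and the mean subtraction, can be sketched as follows (our own unfused illustration of the variant, which, as noted, we do not adopt):

```python
import torch
import torch.nn as nn

class ScaleOnlyLayerNorm(nn.Module):
    """LayerNorm without mean subtraction or bias: rescale by the root-mean-square of the features."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).sqrt()
        return self.weight * x / (rms + self.eps)
```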
Ding et al. (2021) propose related stabilization strategies for text-to-image generation tasks with larger models, including a downscaled embedding gradient, a layer norm after the final fully connected layer, and the same post-attention LN. We find that, besides the post-attention LN, these techniques do not help in our setting.
Table 5 in the appendix shows language modeling perplexities for 7 different hyperparameter configurations, separated by horizontal lines. NormFormer outperforms the baseline in all settings.
6 Related Work
Layer normalization (Ba et al., 2016) is an important component of the transformer architecture. Xiong et al. (2020) show that with Post-LN, gradients are too large for later layers, and address this problem with Pre-LN. We build on the Pre-LN architecture to make it even more stable and efficient.
Press et al. (2020b) propose an architecture where, instead of interleaving attention and feed forward sublayers, all of the attention sublayers come first. This increases the number of late FFN parameters, rather than increasing their importance and gradient norm as our FFN LN does, and does not impact stability.
Our HeadScale operation is related to the one used in Chen et al. (2021), but is used differently: whereas that work prunes attention heads with low scaling coefficients, we use the learned coefficients to improve pretraining performance.
These approaches are also related to techniques for initializing neural networks: GradInit (Zhu et al., 2021) introduces a set of scalars and biases for initialization based on a variance heuristic, and Admin (Liu et al., 2020) applies a similar heuristic in profiling and initialization stages. These works also use variants of our ResScale operation, which we find helpful at small scale and harmful at large scale.
Other approaches also target initialization, in particular ReZero (Bachlechner et al., 2020), T-Fixup (Huang et al., 2020) and LookLinear (Balduzzi et al., 2017). We note that DALL-E (Ramesh et al., 2021) also added a per-residual scaling factor (applied only during the backward pass). Our approach, in contrast, only adds new learnable parameters without variance heuristics, and has no extra stages or changes to initialization.
We identify a mismatch in the gradients of Pre-LN transformer weights: earlier layers receive much larger gradients than later layers, while the optimal scaling of residuals is larger at earlier layers than at later layers. We propose NormFormer, which alleviates these issues by adding 3 extra operations to each transformer layer. These modifications reduce the gradient mismatch for fully connected parameters and improve validation perplexity and downstream task performance for both causal and masked language models. None of the operations can be removed without degrading performance back towards the baseline, and adding more normalization – at least of the types we have tried – does not improve performance.

Since NormFormer primarily addresses the gradient mismatch by increasing the gradients to the last FFN layers while decreasing the gradient magnitudes in other parts of the network, future work could examine whether all 3 operations need to be added to every layer. Additionally, the small computational overhead associated with NormFormer could be alleviated by fusing the FFN LN with the preceding fully connected layer, with or without the mean centering and bias, which do not appear to improve pretraining perplexity. In general, we have shown that adding a small number of learnable parameters in the right places in the architecture can alleviate certain issues in current state of the art networks. Future work should ascertain whether there are additional, similarly efficient modifications that can bring gains while helping us understand current deficiencies further.
| Model | Learning Rate | Setting Changes | Valid PPL |
| Baseline | 0.003 | Clip Grad Norms at 0.1 | 16.46 |
| NormFormer | 0.003 | Clip Grad Norms at 0.1 | 16.14 |
Table 6 shows that NormFormer can also provide gains on top of a well tuned language model in settings with much less data. We simply add our three operations to the architecture and hyperparameters of Baevski and Auli (2019). Convergence perplexity improves, and we reach the baseline perplexity in 70% as many steps. In this setting, NormFormer does not improve in the last 30% of training, which suggests that with more tuning the perplexity gap could be widened.
| Steps to Final PPL | PPL |

| Hyperparameter | Value |
| Batch Size | 524K Tokens |
| LR Warmup Updates | 500 |
| LR Scheduler | Linear Decay |
| Train Budget | 470 V100 Hours |
References

- Ba et al. (2016). Layer normalization.
- Bachlechner et al. (2020). ReZero is all you need: fast convergence at large depth. arXiv preprint arXiv:2003.04887.
- Baevski and Auli (2019). Adaptive input representations for neural language modeling. In International Conference on Learning Representations.
- Balduzzi et al. (2017). The shattered gradients problem: if resnets are the answer, then what is the question? In International Conference on Machine Learning, pp. 342–350.
- Bisk et al. (2020). PIQA: reasoning about physical commonsense in natural language. Proceedings of the AAAI Conference on Artificial Intelligence 34(05), pp. 7432–7439.
- Brown et al. (2020). Language models are few-shot learners. In Advances in Neural Information Processing Systems, Vol. 33, pp. 1877–1901.
- Chen et al. (2021). EarlyBERT: efficient BERT training via early-bird lottery tickets.
- Conneau et al. (2020). Unsupervised cross-lingual representation learning at scale.
- Ding et al. (2021). CogView: mastering text-to-image generation via transformers. arXiv preprint arXiv:2105.13290.
- Hendrycks and Gimpel (2016). Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415.
- Huang et al. (2020). Improving transformer optimization through better initialization. In International Conference on Machine Learning, pp. 4475–4483.
- Lieber et al. (2021). Jurassic-1: technical details and evaluation. Technical report, AI21 Labs.
- Liu et al. (2020). Understanding the difficulty of training transformers. arXiv preprint arXiv:2004.08249.
- Liu et al. (2019). RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- Micikevicius et al. (2018). Mixed precision training. In International Conference on Learning Representations.
- Mihaylov et al. (2018). Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 2381–2391.
- Mostafazadeh et al. (2016). A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 839–849.
- Ott et al. (2019). fairseq: a fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.
- Paszke et al. (2019). PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035.
- Press et al. (2020a). Shortformer: better language modeling using shorter inputs. arXiv preprint arXiv:2012.15832.
- Press et al. (2020b). Improving transformer models by reordering their sublayers.
- Radford et al. (2019). Language models are unsupervised multitask learners. Technical report, OpenAI.
- Raffel et al. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21, pp. 1–67.
- Ramesh et al. (2021). Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092.
- Sakaguchi et al. (2020). WinoGrande: an adversarial Winograd schema challenge at scale. Proceedings of the AAAI Conference on Artificial Intelligence 34(05), pp. 8732–8740.
- Shazeer (2020). GLU variants improve transformer.
- So et al. (2021). Primer: searching for efficient transformers for language modeling.
- Vaswani et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems.
- Wang et al. (2018). GLUE: a multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations.
- Xiong et al. (2020). On layer normalization in the transformer architecture.
- Zellers et al. (2019). HellaSwag: can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4791–4800.
- Zhu et al. (2021). GradInit: learning to initialize neural networks for stable and efficient training. arXiv preprint arXiv:2102.08098.
- Zhu et al. (2019). Aligning books and movies: towards story-like visual explanations by watching movies and reading books. arXiv preprint arXiv:1506.06724.