
NormFormer: Improved Transformer Pretraining with Extra Normalization

by   Sam Shleifer, et al.

During pretraining, the Pre-LayerNorm transformer suffers from a gradient magnitude mismatch: gradients at early layers are much larger than at later layers. These issues can be alleviated by our proposed NormFormer architecture, which adds three normalization operations to each layer: a Layer Norm after self attention, head-wise scaling of self-attention outputs, and a Layer Norm after the first fully connected layer. The extra operations incur negligible compute cost (+0.4% parameter increase), but improve pretraining perplexity and downstream task performance for both causal and masked language models ranging from 125 million to 2.7 billion parameters. For example, adding NormFormer on top of our strongest 1.3B parameter baseline can reach equal perplexity 24% faster, or converge 0.27 perplexity better in the same compute budget. This model reaches GPT3-Large (1.3B) zero shot performance 60% faster. For masked language modeling, NormFormer improves fine-tuned GLUE performance by 1.9% on average. Code to train NormFormer models is available in fairseq.



1 Introduction

The original transformer architecture (Vaswani et al., 2017) applies Layer Normalization (Ba et al., 2016) after each sublayer’s residual connection (“Post-LN”) in order to reduce the variance of the inputs to the following sublayer, i.e.:

$$\text{PostLN}(x) = \text{LayerNorm}(x + \text{Sublayer}(x))$$
$$\text{LayerNorm}(x) = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \cdot \gamma + \beta$$

where $\gamma$ and $\beta$ are trainable parameters, and $\epsilon$ is a small constant. Recent work has observed that Post-LN transformers tend to have larger magnitude gradients in later layers compared to earlier layers (Xiong et al., 2020) and has advocated moving the LayerNorm operation to the beginning of each sublayer (“Pre-LN”; see Figure 1, left), i.e.:

$$\text{PreLN}(x) = x + \text{Sublayer}(\text{LayerNorm}(x))$$

In practice Pre-LN transformers can be trained with larger learning rates, shorter learning rate warmup and often yield improved performance compared to Post-LN transformers (Xiong et al., 2020), so most recent, large pretrained language models tend to use Pre-LN transformers (Baevski and Auli, 2019; Radford et al., 2019; Raffel et al., 2020; Brown et al., 2020; Lieber et al., 2021).
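The difference between the two placements can be made concrete with a minimal pure-Python sketch. It is an illustration only, not the fairseq implementation: `layer_norm`, `post_ln`, `pre_ln` and the toy sublayer `double` are hypothetical names, vectors are plain lists, and the LayerNorm gain and bias are left at their usual initial values of 1 and 0.

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize a vector to zero mean and unit variance, then scale and shift."""
    n = len(x)
    gamma = gamma if gamma is not None else [1.0] * n
    beta = beta if beta is not None else [0.0] * n
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * g + b for v, g, b in zip(x, gamma, beta)]

def post_ln(x, sublayer):
    """Post-LN: normalize after the residual sum."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln(x, sublayer):
    """Pre-LN: normalize only the sublayer input; the residual path is untouched."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def double(v):
    """Hypothetical stand-in for an attention or feedforward sublayer."""
    return [2.0 * t for t in v]

x = [1.0, 2.0, 3.0, 4.0]
post = post_ln(x, double)  # output is re-normalized at every layer
pre = pre_ln(x, double)    # output keeps the un-normalized residual x
```

In the Post-LN variant the layer output always has roughly zero mean and unit variance, while in the Pre-LN variant the input `x` passes through the residual path unchanged and only the sublayer sees normalized values.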

In this work we show that, while Pre-LN improves stability over Post-LN, it has the opposite side effect: gradients at earlier layers tend to be larger than gradients at later layers. We propose NormFormer, which alleviates the gradient magnitude mismatch by adding 3 normalization operations to each layer (see Figure 1, middle). These operations reduce gradients to early layers and increase gradients to later layers, bringing their magnitudes closer together.

Compared to compute-matched, well-tuned Pre-LN baselines, NormFormer models reach target pretraining perplexities faster and achieve better pretraining perplexities and downstream task performance.

The rest of this paper is organized as follows: Section 2 describes the proposed modifications. Section 3 shows pretraining and downstream task performance for fully trained NormFormer models against well-tuned, compute-matched baselines. Section 4 shows the gradient mismatch introduced by Pre-LN and how NormFormer alleviates it. Section 4.2 analyzes residual scaling, a related technique proposed to stabilize Post-LN architectures (Xiong et al., 2020; Zhu et al., 2021). Section 5 shows that removing any of the added operations degrades performance and that NormFormer improves over the baseline at a wide range of hyperparameter configurations.

Figure 1: Left: a baseline Pre-LayerNorm transformer layer. Center: NormFormer, with the three proposed additions in bold. Right: a single attention head with our proposed HeadScale operation applied prior to the output projection with trainable parameters $\gamma$. * When applied, residual scaling impacts the second residual connection in each layer.

2 Approach

2.1 NormFormer

NormFormer includes three modifications to the Pre-LN transformer: we apply head-wise scaling inside the attention module and add two additional LayerNorm operations, one after the attention module and a second after the first fully connected layer. The modifications introduce a small number of additional learnable parameters, which provide a cost-effective way for each layer to change the magnitude of its features, and therefore the magnitude of the gradients to subsequent components. The changes are visualized in Figure 1 and described below.

Scaling Attention Heads

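The standard multi-head attention operation is defined as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
$$\text{MultiHeadAttention}(Q, K, V) = \text{Concat}(h_1, \dots, h_n)W^O$$
$$h_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

where $n$ is the number of heads, $i$ is the attention head index, $d_k$ is the dimensionality of the keys and $W^O, W_i^Q, W_i^K, W_i^V$ are learned projection matrices for the output, query, key and value, respectively.

We propose scaling the output of each attention head via learned scalar coefficients $\gamma_i$:

$$\text{HeadScaleMHA}(Q, K, V) = \text{Concat}(\gamma_1 h_1, \dots, \gamma_n h_n)W^O$$

where $\gamma$ are learnable parameters initialized to 1.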
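As a rough illustration of head-wise scaling, the sketch below multiplies each head's output vector by its own scalar before the heads are concatenated for the output projection. The helper names `head_scale` and `concat` and all numeric values are hypothetical and chosen only for the example; the attention computation itself is omitted.

```python
def head_scale(head_outputs, gammas):
    """Scale each attention head's output vector by its own learned scalar."""
    if len(head_outputs) != len(gammas):
        raise ValueError("need exactly one scale per head")
    return [[g * v for v in head] for head, g in zip(head_outputs, gammas)]

def concat(heads):
    """Concatenate per-head vectors into the input of the output projection."""
    return [v for head in heads for v in head]

# Two heads with head dimension 2 (toy values, not taken from the paper).
heads = [[1.0, 2.0], [3.0, 4.0]]

at_init = concat(head_scale(heads, [1.0, 1.0]))  # scales start at 1: a no-op
trained = concat(head_scale(heads, [0.5, 2.0]))  # hypothetical learned scales
```

Because the scales start at 1, the operation leaves the attention module unchanged at initialization and only costs one extra parameter per head.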

Additional Layer Normalization and Putting it All Together

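In the Pre-LN transformer each layer $l$ modifies an input $x_l$ as follows:

$$x_l^{\text{PreLN}} = x_l + \text{MHA}(\text{LN}(x_l))$$
$$x_{l+1}^{\text{PreLN}} = x_l^{\text{PreLN}} + \text{FFN}(\text{LN}(x_l^{\text{PreLN}}))$$
$$\text{FFN}(x) = \sigma(xW_1 + b_1)W_2 + b_2, \qquad \text{LN}(x) = \text{LayerNorm}(x)$$

In this work $\sigma$ is the GELU non-linear activation introduced in Hendrycks and Gimpel (2016).

Our overall method, NormFormer, instead modifies each input $x_l$ as:

$$x_l^{\text{NormScaledMHA}} = x_l + \mathbf{LN}(\mathbf{HeadScale}\text{MHA}(\text{LN}(x_l)))$$
$$x_{l+1}^{\text{NormFormer}} = x_l^{\text{NormScaledMHA}} + \text{NormScaledFFN}(\text{LN}(x_l^{\text{NormScaledMHA}}))$$
$$\text{NormScaledFFN}(x) = \mathbf{LN}(\sigma(xW_1 + b_1))W_2 + b_2$$

where bolded operations are newly introduced.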
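The feedforward half of the layer can be sketched in a few lines of pure Python. This is a toy illustration under stated assumptions, not the fairseq code: the extra LayerNorm is applied to the activated hidden vector, the LayerNorm gain and bias are fixed at 1 and 0, and the function names, the 2x2 weights and the `mid_ln` flag are all hypothetical.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm with gain 1 and bias 0 (learned parameters omitted for brevity)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def gelu(v):
    """Exact GELU: v * Phi(v), with Phi the standard normal CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def linear(x, weight, bias):
    """y_j = sum_i x_i * weight[i][j] + bias[j]."""
    return [sum(xi * row[j] for xi, row in zip(x, weight)) + bias[j]
            for j in range(len(bias))]

def ffn_sublayer(x, w1, b1, w2, b2, mid_ln):
    """Pre-LN feedforward sublayer; mid_ln=True adds the extra FFN LayerNorm."""
    hidden = [gelu(v) for v in linear(layer_norm(x), w1, b1)]
    if mid_ln:
        # Assumption: the added LN acts on the activated hidden vector.
        hidden = layer_norm(hidden)
    out = linear(hidden, w2, b2)
    return [a + b for a, b in zip(x, out)]  # residual connection

# Toy 2-dimensional weights, chosen only for illustration.
w1, b1 = [[1.0, 0.5], [-0.5, 2.0]], [0.0, 0.0]
w2, b2 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
x = [1.0, 3.0]

baseline_out = ffn_sublayer(x, w1, b1, w2, b2, mid_ln=False)
normformer_out = ffn_sublayer(x, w1, b1, w2, b2, mid_ln=True)
```

With the extra LayerNorm, the input to the second projection has a fixed scale regardless of how large the first projection's activations are, which is the mechanism the text credits for changing gradient magnitudes.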

2.2 Experiments

Causal Language Models

We pretrain causal LMs (CLM) that roughly match the “Small” (125M parameter), “Medium” (355M), “Large” (1.3B) and “XL” (2.7B) sizes from Brown et al. (2020).

Our model architecture differs from Brown et al. (2020) in two ways: (1) we use only dense attention, while they alternate between dense and locally banded sparse attention; (2) we train our models with sinusoidal positional embeddings, following Shortformer (Press et al., 2020a), since early experiments found this to produce comparable results with fewer learned parameters.

We train the baseline models for 300 billion tokens. We train NormFormer models for an equivalent number of GPU hours, which typically results in 2-6% fewer steps and tokens due to the additional overhead of the normalization operations.
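The size of the step reduction can be checked against the step counts reported for compute-matched runs in Table 2. The snippet below is a small worked calculation; the helper name `fewer_steps_pct` is hypothetical, and the step counts are the baseline and NormFormer (without residual scaling) values from that table.

```python
def fewer_steps_pct(baseline_steps, normformer_steps):
    """Percentage of training steps lost to the slower NormFormer updates."""
    return 100.0 * (1.0 - normformer_steps / baseline_steps)

# (baseline steps, NormFormer steps) as listed in Table 2.
runs = {
    "125M": (572_000, 540_000),
    "355M": (572_000, 552_000),
    "1.3B": (286_000, 275_000),
}

reductions = {size: round(fewer_steps_pct(b, n), 1) for size, (b, n) in runs.items()}
```

All three values fall inside the 2-6% range quoted above.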

Model Size GPT-3 Paper Baseline NormFormer
125M 6e-4 3e-3 3e-3
355M 3e-4 1e-3 1e-3
1.3B 2e-4 6e-4 6e-4
Table 1: Searching for learning rates on our dataset results in higher values than reported in Brown et al. (2020), providing stronger baselines to compare to our NormFormer architecture.

On our dataset, we find that the learning rates proposed in GPT-3 are suboptimally low (footnote 1). For both baseline and NormFormer at each size besides 2.7B, we tune the learning rate by training models for 50,000 steps and selecting the best performing learning rate from a set of candidate values. The learning rates we obtained from this process, shown in Table 1, are 3-5 times larger than those used in the GPT-3 paper. Additionally, we have verified that the baseline and NormFormer both perform worse at the full training budget with the GPT-3 learning rates than with the higher learning rates. Other hyperparameters do not differ from GPT-3 (footnote 2).

Footnote 1: The difference in optimal learning rates may be due partly to architectural differences between our baseline and GPT-3 (e.g., not using locally banded sparse attention).
Footnote 2: See Table 2.1 in Brown et al. (2020).

Residual Scaling

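Standard Post-LN transformers simply sum the previous output (residual) with the new output. Recent work attempts to stabilize Post-LN architectures by weighting the residual connection for each layer (Zhu et al., 2021; Liu et al., 2020). We thus experiment with scaling the residual in each embedding dimension via learned scalar coefficients $(\lambda_{resid})_i$:

$$\text{ResScale}(x) = \lambda_{resid} \circ x + \text{Sublayer}(\text{LayerNorm}(x))$$

where $\circ$ is elementwise multiplication, and $\lambda_{resid}$ are learned parameters initialized to 1.

While this can be applied at any normalization layer, we find it most effective for normalizing the feedforward network (FFN) submodule for the smaller sized language models. In this setting,

$$x_{l+1}^{\text{NormFormer}} = \lambda_{resid} \circ x_l^{\text{NormScaledMHA}} + \text{NormScaledFFN}(\text{LN}(x_l^{\text{NormScaledMHA}}))$$

where $x_l^{\text{NormScaledMHA}}$ is the output of the layer's attention sublayer and NormScaledFFN is the feedforward block with the added LayerNorm.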

For 1.3B parameter models and larger, scaling residuals hurts performance (see discussion in Section 4.2), so ResScale is not used in our 1.3B and 2.7B CLM results.
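A minimal sketch of the residual scaling operation, assuming plain Python lists for the embedding vectors. The function name `res_scale` and the numbers are hypothetical; the sublayer output is passed in precomputed so that only the weighted residual sum is shown.

```python
def res_scale(residual, sublayer_out, lam):
    """Elementwise-weighted residual connection: lam * residual + sublayer_out."""
    if not (len(residual) == len(sublayer_out) == len(lam)):
        raise ValueError("all three vectors must share the embedding dimension")
    return [l * r + s for l, r, s in zip(lam, residual, sublayer_out)]

residual = [1.0, -2.0, 0.5]   # toy input to the FFN sublayer
ffn_out = [0.1, 0.2, 0.3]     # toy FFN output for that input

plain = res_scale(residual, ffn_out, [1.0, 1.0, 1.0])  # initial value: ordinary residual
down = res_scale(residual, ffn_out, [0.5, 0.5, 0.5])   # hypothetical down-weighted residual
```

With every coefficient at its initial value of 1 the operation reduces to the ordinary residual sum, so it adds one parameter per embedding dimension without changing the model at initialization.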

Large scale experiments

We also train three large-scale models with 2.7B parameters. Our first baseline is a replicated version of GPT-3-2.7B with GELU activations, the published learning rate (1.6e-4) and the same number of training steps and tokens (286K steps; 300B tokens). This model slightly exceeds the reference zero shot performance (Brown et al., 2020). Next, we train two variants of GPT3-2.7B with Relu$^2$ activations (So et al., 2021), but use slightly fewer training steps (20% fewer) for compute efficiency. The first of these uses the baseline learning rate (1.6e-4) and the second uses NormFormer-2.7B with a higher learning rate of 6e-4. We note that training baseline 2.7B CLMs (i.e., without NormFormer modifications) with a higher 6e-4 learning rate diverged and failed to train. However, as opposed to the smaller architectures, we did not exhaustively tune the learning rate, so it is possible that an intermediate value would perform better.

Zero Shot Evaluation

In addition to validation perplexity, we evaluate CLMs on a subset of the tasks that GPT3 evaluated on in a zero-shot setting (Brown et al., 2020), with the same prompts. We select WinoGrande (Sakaguchi et al., 2020), StoryCloze (Mostafazadeh et al., 2016), OpenBookQA (Mihaylov et al., 2018), HellaSwag (Zellers et al., 2019) and PIQA (Bisk et al., 2020) because GPT3 showed strong performance on these tasks at small scale, as well as consistently improving performance with scale.

Masked Language Models (MLM)

We adopt the RoBERTa-base, Pre-LN architecture and hyperparameters used in Liu et al. (2019). For the baseline, we pretrain for 2 million batches of 1 million tokens, which is less than the training budget of the original roberta-base. NormFormer runs through 1.92 million batches in the same amount of time.

We fine-tune both the baseline MLM and NormFormer over a set of candidate learning rates and report the best performance on the validation set for each GLUE task (Wang et al., 2019), following Liu et al. (2019). Other fine-tuning hyperparameters match those used for roberta-base in Liu et al. (2019).

Pretraining data

We pretrain all models on a collection of English language text including the English portion of the CC100 corpus (Conneau et al., 2020) as well as the data from Liu et al. (2019), consisting of BookCorpus (Zhu et al., 2019), English Wikipedia and filtered subsets of Common Crawl. We encode our data with the byte-level Byte Pair Encoding (BPE) vocabulary from Liu et al. (2019), originally introduced in Radford et al. (2019). The combined dataset contains around 450GB of uncompressed text and 110B BPE tokens. We hold out 40M BPE tokens from this data as a validation set on which we report pretraining perplexities.

Implementation details

We train our causal and masked language models in fairseq (Ott et al., 2019; Paszke et al., 2019). Although NormFormer introduces fewer than 0.07% additional parameters, it slows individual training updates and increases memory usage by between 2% (2.7B model) and 6% (125M model) due to the FFN LNs. Accordingly, we compare NormFormer to baseline models trained for an equal amount of GPU time, i.e., controlling for compute rather than the number of training updates. Finally, we note that the HeadScale operation can be moved outside the self attention module to allow the use of the very efficient PyTorch F.multihead_attention. This change reduces overhead without noticeable performance degradation.
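A back-of-the-envelope count shows why the parameter overhead is so small. The sketch below assumes that each added LayerNorm carries a weight and a bias per feature, that the FFN inner dimension is four times the model dimension, and that the 125M model uses 12 layers, a model dimension of 768 (as in Table 7) and 12 attention heads; the function name and the head count are assumptions, not values stated in this section.

```python
def normformer_extra_params(num_layers, d_model, num_heads, ffn_mult=4):
    """Count the parameters added by the three NormFormer operations."""
    post_attn_ln = 2 * d_model        # LN weight + bias after attention
    head_scale = num_heads            # one scalar per attention head
    ffn_ln = 2 * ffn_mult * d_model   # LN weight + bias over the FFN inner dimension
    return num_layers * (post_attn_ln + head_scale + ffn_ln)

# Assumed 125M configuration: 12 layers, d_model 768, 12 heads.
extra = normformer_extra_params(num_layers=12, d_model=768, num_heads=12)
```

Under these assumptions the count comes to 92,304 parameters, in line with the roughly 100,000 added parameters quoted for NormFormer-125M in Section 3; most of them belong to the FFN LayerNorm.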

3 Results

Model                   $|\theta|$  LR      Relu$^2$  $\lambda_{resid}$  Steps  PPL    HS    PI    WG    SC    OB    Avg
Random Baseline         -       -       -  -  -     -      25.0  50.0  50.0  50.0  25.0  40.0
GPT3-125M (paper)       124.4   6e-4    -  -  572K  -      33.7  64.6  52.0  63.3  35.6  49.8
GPT3-125M (replicated)  124.4   6e-4    -  -  572K  21.11  33.7  66.5  52.2  66.1  35.4  50.8
GPT3-125M (High LR)     124.4   3e-3    -  -  572K  21.09  35.3  67.5  50.5  66.3  35.0  50.9
NormFormer-125M         124.5   3e-3    -  -  540K  20.34  34.9  67.1  52.3  66.3  38.0  51.7
NormFormer-125M         124.5   3e-3    -  ✓  539K  20.11  34.9  65.9  53.4  67.5  40.0  52.3
GPT3-355M (paper)       354.7   3e-4    -  -  572K  -      43.6  70.2  52.1  68.5  43.2  55.5
GPT3-355M (replicated)  354.7   3e-4    -  -  572K  15.41  46.1  70.8  54.6  71.1  41.2  56.8
GPT3-355M (High LR)     354.7   1e-3    -  -  572K  14.85  48.4  71.7  53.8  73.3  43.4  58.1
NormFormer-355M         355.0   1e-3    -  -  552K  14.54  49.7  71.8  56.0  73.8  43.6  59.0
NormFormer-355M         355.0   1e-3    -  ✓  550K  14.52  49.7  72.0  56.7  73.2  43.8  59.1
GPT3-1.3B (paper)       1313.5  2e-4    -  -  286K  -      54.7  75.1  58.0  73.4  46.8  61.6
GPT3-1.3B (replicated)  1313.5  2e-4    -  -  286K  12.56  58.5  74.6  58.1  76.8  49.4  63.5
GPT3-1.3B (High LR)     1313.5  6e-4    -  -  286K  12.21  57.5  74.3  59.3  76.3  50.8  63.6
NormFormer-1.3B         1314.0  6e-4    -  -  275K  11.94  60.5  74.5  60.1  77.5  50.8  64.7
GPT3-2.7B (paper)       2648.7  1.6e-4  -  -  286K  -      62.8  75.6  62.3  77.2  53.0  66.2
GPT3-2.7B (replicated)  2648.7  1.6e-4  -  -  286K  10.92  65.9  76.6  61.4  78.2  49.6  66.3
NormFormer-2.7B         2649.5  6e-4    ✓  -  277K  10.55  68.1  78.1  64.4  79.4  53.4  68.7

GPT3-2.7B-Relu          2648.7  1.6e-4  ✓  -  230K  10.99  65.9  76.1  63.2  79.3  49.4  66.8
GPT3-2.7B-Relu          2648.7  6e-4    ✓  -  28K   diverged
NormFormer-2.7B         2649.5  6e-4    ✓  -  222K  10.73  67.4  77.2  64.4  78.9  52.6  68.1
Table 2: Zero-Shot Accuracy for Causal LMs for the following tasks: HS: HellaSwag, PI: PIQA, WG: WinoGrande, SC: StoryCloze, OB: OpenBookQA. PPL is validation perplexity during pretraining. GPT-3 (paper) results taken from Brown et al. (2020). Horizontal lines group compute-matched runs. High LR corresponds to using a larger learning rate than reported in Brown et al. (2020). $\lambda_{resid}$ indicates whether residual scaling was used. $\lambda_{resid}$ did not help at 1.3B scale, as shown in Figure 2, but that run is not compute matched so it is not included here. Model size ($|\theta|$) is reported in millions of parameters.
Figure 2: Pretraining perplexity on held-out validation data for Causal and Masked Language Models as a function of training compute (GPU days). The blue stars show the point where a model matches the baseline’s lowest perplexity.

We report pretraining perplexities for CLMs and MLMs as a function of training wall-time (GPU days) in Figure 2. We observe that NormFormer trains significantly faster and achieves better validation perplexities for a given training compute budget. The blue stars mark the first validation step where NormFormer matches the baseline’s lowest perplexity, and show that NormFormer matches Pre-LN models while needing only 60% and 57% as much compute for CLM and MLM models, respectively. This is particularly impressive since NormFormer models take 2-6% longer for each training step and thus see less data than Pre-LN models in this comparison. The left side blue line in Figure 2 shows the failed attempt to add ResScale to NormFormer-1.3B.

We observe a similar trend on downstream tasks. In Table 2 we report zero shot accuracy for causal LMs using the tasks and prompts from Brown et al. (2020). NormFormer outperforms GPT-3 at all sizes. The gains from NormFormer's extra parameters and operations outpace the gains from normal scaling laws. Changing the hidden dimension of a 125M parameter model from 768 to 780, for example, results in a 127 million parameter model that is only 0.08 perplexity better than the baseline, whereas NormFormer-125M adds only 100,000 parameters and is 0.83 perplexity better than the baseline.

Model       $|\theta|$  $\lambda_{resid}$  PPL   CoLA  MNLI  MRPC  QNLI  QQP   RTE   SST-2  Avg
Baseline    125.42  -  3.42  74.3  85.9  84.6  91.6  90.7  66.4  92.9   83.77
NormFormer  125.50  -  3.31  82.6  86.3  86.0  91.9  91.3  67.9  93.8   85.69
NormFormer  125.51  ✓  3.29  80.9  86.2  85.3  91.5  91.2  62.8  94.2   84.59
Table 3: Masked LM: Pretraining validation perplexity (PPL) and fine-tuned performance on GLUE tasks for Pre-LN and NormFormer models. Note that models are trained for an equal amount of compute, which is less than the publicly-released roberta-base models.

For MLM models, we report fine-tuned accuracy on GLUE in Table 3. We again find that NormFormer MLM models outperform their Pre-LN counterparts on every task (rows 1 vs 2). Adding ResScale improves pre-training performance marginally (3.29 valid PPL vs 3.31), but the gains do not translate to fine-tuned performance.

4 Analysis

4.1 Analysis of gradient norms by layer

Figure 3: Average L1 norm of gradients to the second fully connected weight for layers 0,1,6,10 and 11, early in training.
Figure 4: Distribution of learned scaling parameters in three of the added operations. For FFN LN, earlier layers receive downscaled inputs, keeping their gradients in the same range as the gradients of later layers. This plot is discussed in detail in Section 4.
Figure 5: LR Stability Test: learning rate starts from 0 and linearly increases by 5e-5 at each training step until training destabilizes. NormFormer reaches a higher learning rate before destabilizing. Each data point is the median of 3 runs with a different random seed.

We begin by examining the magnitude of the gradients at different layers for Post-LN, Pre-LN and NormFormer models, since large magnitude differences in gradients across layers can destabilize training, particularly when training in mixed precision (Micikevicius et al., 2018). Figure 3 shows the average L1 norm of the gradients to the second fully connected weight in various layers for a 12 layer, 125M parameter CLM model at the beginning of training. As reported in past work (Xiong et al., 2020), we observe that the gradients to later layers in Post-LN models are much larger than for earlier layers, and that the gradients to early layers quickly vanish in the early stages of training. Pre-LN models have the opposite behavior, with early layers instead receiving significantly larger gradients than later layers. NormFormer brings the average gradient norms closer together for different layers in the network.

In Figure 4 we present the distribution of scaling parameters learned by NormFormer models. For the FFN LN, the $\gamma$ parameters are smaller for earlier layers, reducing the magnitude of the inputs to early fully connected parameters, thereby decreasing the magnitude of their gradients. For the post-attention LN, in the middle of Figure 4, all layers have $\gamma$ coefficients below 1, indicating downscaling (footnote 3). The HeadScale $\gamma$ parameters, shown in the rightmost plot in Figure 4, vary more than the others, and have no relationship with depth in the network. We interpret this as evidence that the HeadScale parameters dynamically increase the importance of well initialized attention heads, as suggested in Chen et al. (2021).

Footnote 3: The downscaling is also apparent in Figure 7 in the Appendix, which plots the change in grad norm for each operation at each layer. It shows that adding extra normalization reduces the gradient norm for all attention parameters at every layer. Only FFN parameters at later layers have increased gradient norms.

One result of reducing the gradient mismatch, besides better perplexities and downstream task performance, is the ability to train stably with larger learning rates. To measure the stability of an architecture, we train it on a learning rate schedule with a very large peak learning rate, so that the learning rate increases a little each step until the loss explodes. Figure 5 shows that NormFormer models can survive for more updates in this environment than the baseline. For the baseline 125M model (the left most blue dot), the loss eventually explodes, with the activations from multiplying the query and key features at layer 0 overflowing the FP16 range. The downscaling of the attention outputs allows NormFormer to avoid this issue and remain stable with larger learning rates. Figure 5 also shows that $\lambda_{resid}$ reduces the stability improvement at all sizes.
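The ramp-until-failure procedure can be sketched as a small harness. Everything below is illustrative: `lr_stability_test` and `make_quadratic_step` are hypothetical names, the divergence threshold `loss_limit` is an assumption, and the "model" is a one-parameter quadratic loss trained by gradient descent rather than a language model. Only the 5e-5 per-step increment comes from the Figure 5 caption.

```python
import math

def lr_stability_test(train_step, lr_increment=5e-5, loss_limit=1e6, max_steps=100_000):
    """Ramp the learning rate linearly from 0; return the value at which the loss explodes."""
    lr = 0.0
    for _ in range(max_steps):
        loss = train_step(lr)
        if not math.isfinite(loss) or loss > loss_limit:
            return lr
        lr += lr_increment
    return lr

def make_quadratic_step(curvature, noise=1e-6):
    """Toy stand-in for a training step: gradient descent on 0.5 * curvature * w**2.

    A small constant perturbation keeps w away from exactly zero, loosely
    mimicking gradient noise. This is not the experiment from the paper.
    """
    state = {"w": 1.0}

    def step(lr):
        w = state["w"]
        state["w"] = w - lr * curvature * w + noise
        return 0.5 * curvature * state["w"] ** 2

    return step

# Gradient descent on a quadratic becomes unstable once lr exceeds 2 / curvature,
# so the sharper toy problem fails at a lower learning rate.
sharp_limit = lr_stability_test(make_quadratic_step(curvature=300.0))
flat_limit = lr_stability_test(make_quadratic_step(curvature=30.0))
```

In the toy setting the flatter problem survives to a higher learning rate, which is the kind of comparison Figure 5 makes between the baseline and NormFormer.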

4.2 Residual Scaling

By comparing adjacent NormFormer-125M and NormFormer-355M rows in Table 2 we can see that adding ResScale to NormFormer improves perplexity and zero shot performance for small scale CLMs. For 125M parameter MLM, ResScale improves pre-training perplexity marginally, but hurts fine-tuned performance. At 1.3 billion parameter scale, however, adding ResScale to NormFormer does not improve performance (Figure 2). Although it’s not included in our tables, we find that ResScale without NormFormer is stronger than the baseline at small scale, but not large scale. This suggests that the negative result is caused by scale, rather than interaction with NormFormer.

Figure 6 shows the average $\lambda_{resid}$ weights at each layer of different sized CLMs. We can see that at 125M and 355M parameters, the weights in the later layers are lower, indicating down weighting of the residual connection, whereas at the largest scale, 1.3B, the weights are larger deeper into the network.

Adding the $\lambda_{resid}$ parameters to the other (earlier) residual connection in each layer, or using a scalar instead of a vector for each $\lambda_{resid}$, does not fix the large scale issue, but hurts small scale performance marginally.

Figure 6: Average $\lambda_{resid}$ weights at each layer of different sized CLMs in the NormFormer+$\lambda_{resid}$ setting. Depth is layer number / total layers.

5 Ablations

This section provides evidence that removing any of our additions to the transformer block degrades performance on language modeling tasks, and that our additions improve language modeling performance across a wide range of hyperparameter settings. Experiments use 125M parameter CLMs, and are run with the default hyperparameters given in Table 7 in the appendix for 470 V100 Hours (100,000 updates for the baseline) unless otherwise mentioned.

Removing any of the added operations hurts performance

Table 4 shows that none of the four introduced operations can be removed without degrading performance. Rows 2-5 remove each operation one at a time. In all cases perplexity increases, with the removal of HeadScale being the most damaging and the removal of the Post-Attn LN being the least damaging. In Row 6 (+ 3 More LN) we try to introduce more normalization inside self attention, applying LN to the query, key and value features in addition to our 3 other operations, for a total of 6 new operations. In this setting, every other parameterized operation inside the transformer layer is an LN. We find that this does not change perplexities at a fixed number of updates, but reduces training speed by another 5%. This result suggests that there is not much upside to adding even more normalization on top of NormFormer.

Architecture Valid PPL
NormFormer+ResScale 15.88
   - Post-Attn LN 15.92
   - FFN LN 16.14
   - Head Scale 16.22
   - Res Scale 16.20
   + 3 More LN 15.88
Baseline 16.37
Table 4: 125M parameter Language Modeling Validation perplexities after 470 V100 Hours of pretraining. Removing any of our proposed additions degrades performance (Rows 2-5). Adding more normalization inside the Multi Headed Attention (Row 6) does not impact perplexity at a fixed number of updates, but reduces throughput such that the model can only complete 87,500 updates vs. 92,500 for Rows 1-5 and 100,000 for Row 7. Note that these PPL scores are not directly comparable to other tables – they use a different validation set.

Other Experiments

Replacing the FFN LN with the FFNGeGlu proposed in Shazeer (2020), which includes scaling but no normalization, degraded performance in our 125M parameter CLM setting, the only place we tried it. We also find that the LN variant proposed in Raffel et al. (2020), which removes the bias and the mean subtraction from the normalization, performs as well as our LN and has fewer trainable parameters, but is about 2x slower than the FusedLayerNorm implementation we use. We therefore do not adopt it.

Ding et al. (2021) propose related stabilization strategies for text to image generation tasks with larger models including a downscaled embedding gradient, a layer norm after the final fully connected layer, and the same post-attention LN. We find that, besides the post attention LN, these techniques do not help in our setting.

Table 5 in the appendix shows language modeling perplexities for 7 different hyperparameter configurations, separated by horizontal lines. NormFormer outperforms the baseline in all settings.

6 Related Work

Layer normalization (Ba et al., 2016) is an important component of the transformer architecture. Xiong et al. (2020) show that in Post-LN transformers the gradients are too large for later layers, and solve this problem with Pre-LN. We build on the Pre-LN architecture to make it even more stable and efficient.

Press et al. (2020b) proposes an architecture where instead of interleaving attention and feed forward sublayers, the attention all happens first. This increases the number of late FFN parameters, rather than increasing their importance and gradient norm, as our FFN LN does, and does not impact stability.

Our HeadScale operation is related to that used in Chen et al. (2021), but used differently. Whereas that work prunes attention heads with low $\gamma$ parameters, we use the $\gamma$ parameters to improve pretraining performance.

These approaches are also related to techniques for initializing neural networks: GradInit (Zhu et al., 2021) introduces a set of scalars and biases for initialization based on a variance heuristic, and Admin (Liu et al., 2020) applies a similar heuristic in profiling and initialization stages. These works also use variants of our ResScale operation, which we find helpful at small scale and harmful at large scale.

Similarly, some other approaches targeted initialization as well, in particular ReZero (Bachlechner et al., 2020), FixUp (Huang et al., 2020) and LookLinear (Balduzzi et al., 2017). We note that DALL-E (Ramesh et al., 2021) also added a per residual scaling factor (only during backprop). Our approach, in contrast, only has new learnable parameters without variance heuristics, and has no extra stages or changes in initialization.

7 Conclusion

We identify a mismatch in the gradients of Pre-LN transformer weights: earlier layers receive much larger gradients than later layers, while the optimal scaling of residuals is larger at earlier layers than at later layers. We propose NormFormer, which alleviates these issues by adding 3 extra operations to each transformer layer. These modifications help the gradient mismatch for fully connected parameters and improve validation perplexity and downstream task performance for both causal and masked language models. None can be removed without degrading performance back towards the baseline, and adding more normalization – at least of the types we have tried – does not improve performance. Since NormFormer primarily addresses the gradient mismatch by increasing the gradients to the last FFN layers while decreasing the gradient magnitudes in other parts of the network, future work could examine whether all 3 operations need to be added to every layer. Additionally, the small computational overhead associated with NormFormer could be alleviated by fusing the FFN LN with the preceding fully connected layer, with or without the mean centering and bias, which do not appear to improve pretraining perplexity. In general, we have shown that adding small numbers of learnable parameters in the right places in our architectures can alleviate certain issues in current state of the art networks. Future work should ascertain if there are additional similarly efficient modifications that can bring gains, while helping us understand current deficiencies further.

8 Appendix

Learning Rate Setting Changes Valid PPL
Baseline 0.001 - 16.80
NormFormer 0.001 - 16.33
Baseline 0.003 - 16.37
NormFormer 0.003 - 15.88
Baseline 0.006 - 16.58
NormFormer 0.006 - 16.22
Baseline 0.003 Longer Warmup 16.50
NormFormer 0.003 Longer Warmup 16.06
Baseline 0.003 GPT3 16.29
NormFormer 0.003 GPT3 15.88
Baseline 0.003 Clip Grad Norms at 0.1 16.46
NormFormer 0.003 Clip Grad Norms at 0.1 16.14
Table 5: Longer Warmup: increase LR Warmup to 6,000 steps (from 500). GPT3: increase sequence length to 2048, increase dropout to 0.1, increase training budget to 1,000 V100 hours. Grad Clip: clip gradient norms at 0.1. NormFormer outperforms the baseline in all settings.


Table 6 shows that NormFormer can also provide gains on top of a well tuned language model in settings with much less data. We simply add our three operations to the architecture and hyperparameters of Baevski and Auli (2019). Convergence perplexity improves, and we reach the baseline perplexity in 70% as many steps. In this setting, NormFormer does not improve in the last 30% of training, which suggests that with more tuning the perplexity gap could be widened.

Steps to Final PPL PPL
Baseline 100% 18.70
NormFormer 70% 18.65
Table 6: Wikitext 103 results following Baevski and Auli (2019). Steps to Final PPL: at what percentage of the 280K steps did the model reach 18.70 perplexity. PPL: Best Perplexity
Figure 7: Change in grad norm with each operation of NormFormer compared to the baseline. Norms are the average between step 950 and 1000, normalized to control for different losses. 2.0 on the Y axis means the gradient to a parameter is twice as large as the baseline, on average. The NormFormer increases the norm to fully connected parameters in later layers, while reducing the gradient norm to attention parameters at all layers. The results are discussed in detail in Section 4.
Learning Rate 0.003
Batch Size 524K Tokens
Parameters 124M+
Layers 12
Layer Dimension 768
Dropout 0
LR Warmup Updates 500
LR Scheduler Linear Decay
Sequence Length 1024
Train Budget 470 V100 Hours
Table 7: Hyperparameters for ablations in Tables 4 and 5. This train budget allows the baseline model to run for 100,000 updates.


  • J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. External Links: 1607.06450 Cited by: §1, §6.
  • T. Bachlechner, B. P. Majumder, H. H. Mao, G. W. Cottrell, and J. McAuley (2020) Rezero is all you need: fast convergence at large depth. arXiv preprint arXiv:2003.04887. Cited by: §6.
  • A. Baevski and M. Auli (2019) Adaptive input representations for neural language modeling. In International Conference on Learning Representations, External Links: Link Cited by: §1, §8, Table 6.
  • D. Balduzzi, M. Frean, L. Leary, J. Lewis, K. W. Ma, and B. McWilliams (2017) The shattered gradients problem: if resnets are the answer, then what is the question?. In International Conference on Machine Learning, pp. 342–350. Cited by: §6.
  • Y. Bisk, R. Zellers, R. Le bras, J. Gao, and Y. Choi (2020) PIQA: reasoning about physical commonsense in natural language. Proceedings of the AAAI Conference on Artificial Intelligence 34 (05), pp. 7432–7439. External Links: Link, Document Cited by: §2.2.
  • T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020) Language models are few-shot learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 1877–1901. External Links: Link Cited by: §1, §2.2, §2.2, §2.2, §2.2, Table 1, Table 2, §3, footnote 2.
  • X. Chen, Y. Cheng, S. Wang, Z. Gan, Z. Wang, and J. Liu (2021) EarlyBERT: efficient bert training via early-bird lottery tickets. External Links: 2101.00063 Cited by: §4.1, §6.
  • A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov (2020) Unsupervised cross-lingual representation learning at scale. External Links: 1911.02116 Cited by: §2.2.
  • M. Ding, Z. Yang, W. Hong, W. Zheng, C. Zhou, D. Yin, J. Lin, X. Zou, Z. Shao, H. Yang, et al. (2021) CogView: mastering text-to-image generation via transformers. arXiv preprint arXiv:2105.13290. Cited by: §5.
  • D. Hendrycks and K. Gimpel (2016) Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415. Cited by: §2.1.
  • X. S. Huang, F. Perez, J. Ba, and M. Volkovs (2020) Improving transformer optimization through better initialization. In International Conference on Machine Learning, pp. 4475–4483. Cited by: §6.
  • O. Lieber, O. Sharir, B. Lenz, and Y. Shoham (2021) Jurassic-1: technical details and evaluation. Technical report AI21 Labs. Cited by: §1.
  • L. Liu, X. Liu, J. Gao, W. Chen, and J. Han (2020) Understanding the difficulty of training transformers. arXiv preprint arXiv:2004.08249. Cited by: §2.2, §6.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §2.2, §2.2, §2.2.
  • P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, and H. Wu (2018) Mixed precision training. In International Conference on Learning Representations, Cited by: §4.1.
  • T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal (2018) Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 2381–2391. External Links: Link, Document Cited by: §2.2.
  • N. Mostafazadeh, N. Chambers, X. He, D. Parikh, D. Batra, L. Vanderwende, P. Kohli, and J. Allen (2016) A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 839–849. External Links: Link, Document Cited by: §2.2.
  • M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli (2019) fairseq: a fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, Cited by: §2.2.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035. Cited by: §2.2.
  • O. Press, N. A. Smith, and M. Lewis (2020a) Shortformer: better language modeling using shorter inputs. arXiv preprint arXiv:2012.15832. Cited by: §2.2.
  • O. Press, N. A. Smith, and O. Levy (2020b) Improving transformer models by reordering their sublayers. External Links: 1911.03864 Cited by: §6.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. Technical report OpenAI. Cited by: §1, §2.2.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21, pp. 1–67. Cited by: §1, §5.
  • A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever (2021) Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092. Cited by: §6.
  • K. Sakaguchi, R. Le Bras, C. Bhagavatula, and Y. Choi (2020) WinoGrande: an adversarial winograd schema challenge at scale. Proceedings of the AAAI Conference on Artificial Intelligence 34 (05), pp. 8732–8740. External Links: Link, Document Cited by: §2.2.
  • N. Shazeer (2020) GLU variants improve transformer. External Links: 2002.05202 Cited by: §5.
  • D. R. So, W. Mańke, H. Liu, Z. Dai, N. Shazeer, and Q. V. Le (2021) Primer: searching for efficient transformers for language modeling. External Links: 2109.08668 Cited by: §2.2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, Cited by: §1.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2019) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, Cited by: §2.2.
  • R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang, and T. Liu (2020) On layer normalization in the transformer architecture. External Links: 2002.04745 Cited by: §1, §1, §4.1, §6.
  • R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi (2019) HellaSwag: can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4791–4800. External Links: Link, Document Cited by: §2.2.
  • C. Zhu, R. Ni, Z. Xu, K. Kong, W. R. Huang, and T. Goldstein (2021) GradInit: learning to initialize neural networks for stable and efficient training. arXiv preprint arXiv:2102.08098. Cited by: §1, §2.2, §6.
  • Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler (2019) Aligning books and movies: towards story-like visual explanations by watching movies and reading books. arXiv preprint arXiv:1506.06724. Cited by: §2.2.