ReZero is All You Need: Fast Convergence at Large Depth

by Thomas Bachlechner, et al.

Deep networks have enabled significant performance gains across domains, but they often suffer from vanishing/exploding gradients. This is especially true for Transformer architectures where depth beyond 12 layers is difficult to train without large datasets and computational budgets. In general, we find that inefficient signal propagation impedes learning in deep networks. In Transformers, multi-head self-attention is the main cause of this poor signal propagation. To facilitate deep signal propagation, we propose ReZero, a simple change to the architecture that initializes an arbitrary layer as the identity map, using a single additional learned parameter per layer. We apply this technique to language modeling and find that we can easily train ReZero-Transformer networks over a hundred layers. When applied to 12 layer Transformers, ReZero converges 56% faster on enwiki8. ReZero applies beyond Transformers to other residual networks, enabling 1,500% faster convergence for deep fully connected networks and 32% faster convergence for a ResNet-56 trained on CIFAR 10.


1 Introduction

Figure 1: ReZero

Deep learning has enabled significant improvements in state-of-the-art performance across domains [14, 10, 13, 23]. The expressivity of neural networks typically grows exponentially with depth [22], enabling strong generalization performance, but often induces vanishing/exploding gradients and poor signal propagation through the model [9]. Researchers have relied on careful initialization [18, 27] and normalization techniques such as BatchNorm [12] and LayerNorm [2] to mitigate this issue, but these techniques can be costly and limited.

In this work, we propose ReZero (footnote 1), a small architectural addition that dynamically facilitates well-behaved gradients and arbitrarily deep signal propagation. The idea is simple: ReZero initializes each layer to perform the identity operation. For each layer, we introduce a residual connection for the input signal $x$ and one trainable parameter $\alpha$ that modulates the non-trivial transformation of the layer $F(x)$,

$$x_{i+1} = x_i + \alpha_i F(x_i), \tag{1}$$

where $\alpha = 0$ at the beginning of training. Initially the gradients for all parameters defining $F$ vanish, but dynamically evolve to suitable values during initial stages of training. We illustrate the architecture in Figure 1.

Footnote 1: Code for ReZero applied to various neural architectures:

ReZero provides two main benefits:

Deeper learning: Signals effectively propagate through deep networks, which allows for learning in otherwise untrainable networks. ReZero successfully trains fully connected networks with 10,000 layers, and we are the first to train Transformers over 100 layers without learning rate warm-up or LayerNorm. In contrast to [1] we find that to get good results at this depth, it is not necessary to add auxiliary losses.

Faster convergence: We observe significantly accelerated convergence in ReZero networks compared to regular residual networks with normalization. When ReZero is applied to Transformers, we converge 56% faster than the vanilla Transformer to reach 1.2 BPB on the enwiki8 language modeling benchmark. When applied to ResNets, we obtain a 32% speed-up to reach 85% accuracy on CIFAR 10.

2 Background and related work

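(1) Deep Network                    $x_{i+1} = F(x_i)$
(2) Residual Network                $x_{i+1} = x_i + F(x_i)$
(3) Deep Network + Norm             $x_{i+1} = \mathrm{Norm}(F(x_i))$
(4) Residual Network + Pre-Norm     $x_{i+1} = x_i + F(\mathrm{Norm}(x_i))$
(5) Residual Network + Post-Norm    $x_{i+1} = \mathrm{Norm}(x_i + F(x_i))$
(6) ReZero                          $x_{i+1} = x_i + \alpha_i F(x_i)$
Table 1: Various forms of normalization and residual connections. $F$ represents the transformation of an arbitrary layer and “Norm” is a normalization (e.g., LayerNorm or BatchNorm).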

Networks with a depth of $L$ layers and width $w$ often have an expressive power that scales exponentially in depth, but not in width [19, 22]. Large depth often comes with difficulty in training via gradient-based methods. During training of a deep model, a signal in the training data has to propagate forward from the input to the output layer, and subsequently, the cost function gradients have to propagate backwards in order to provide a meaningful weight update. If the magnitude of a perturbation is changed by a factor $r$ in each layer, both signals and gradients vanish or explode at a rate of $r^L$, rendering many deep networks untrainable in practice.
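The exponential rate described above can be made concrete with a small numerical sketch. The per-layer factor is written as `r` here, and the depth of 100 and the values 0.9 and 1.1 are arbitrary illustrative choices, not settings from the experiments.

```python
def propagated_magnitude(r, depth, initial=1.0):
    """Magnitude of a perturbation after `depth` layers that each scale it by r."""
    magnitude = initial
    for _ in range(depth):
        magnitude *= r
    return magnitude


depth = 100  # illustrative depth, not taken from the paper
shrinking = propagated_magnitude(0.9, depth)   # r < 1: the signal vanishes
growing = propagated_magnitude(1.1, depth)     # r > 1: the signal explodes
preserved = propagated_magnitude(1.0, depth)   # r = 1: the signal is preserved

print(shrinking, growing, preserved)
```

Even a 10% change per layer moves the magnitude by several orders of magnitude at this depth, which is why only a factor very close to 1 keeps deep propagation stable.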

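To be specific, consider a deep network that propagates an input signal $x_0$ of width $w$ through $L$ layers that perform the non-trivial, but width preserving functions $F[\mathcal{W}_i]: \mathbb{R}^w \to \mathbb{R}^w$, where $\mathcal{W}_i$ denotes all parameters at layer $i = 1, \dots, L$. The signal propagates through the network according to

$$x_{i+1} = F[\mathcal{W}_i](x_i). \tag{2}$$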

There have been many attempts to improve signal propagation through deep networks, and they often fall into one of three categories — initialization schemes, normalization layers, and residual connections. We show some of the popular ways to combine residual networks with normalization in Table 1.

2.1 Careful initialization

In recent years the dynamics of signal propagation in randomly initialized deep and wide neural networks have been formalized via mean field theory [20, 27, 21]. For some deep neural networks, including fully connected and convolutional architectures, the cosine distance of two distinct signals, $x_i \cdot x'_i / (\|x_i\| \|x'_i\|)$, approaches a fixed point that either vanishes or approaches unity at large depths. If this fixed point is $1$ the behavior of the network is stable and every input is mapped to the same output, leading to vanishing weight updates. If this fixed point is $0$ the behavior of the network is chaotic and even similar inputs are mapped to very different outputs, leading to exploding weight updates. To understand whether a network is in a stable or chaotic phase we consider the input-output Jacobian

$$J_{io} \equiv \frac{\partial x_L}{\partial x_0}. \tag{3}$$

The mean squared singular values $\chi$ of this matrix determine the growth/decay of an average input signal perturbation as it propagates through the network. The network exhibits a boundary between the ordered and the chaotic phase, the edge of chaos at $\chi = 1$. Training proceeds efficiently at the edge of chaos.

This behavior was recognized in [7, 9], which motivated a re-scaling of the weights such that $\chi \approx 1$ and on average signal strengths are neither enhanced nor attenuated.

Pennington et al. [20, 21] recognized that a unit mean squared average of the input-output Jacobian is insufficient to guarantee trainability. For example, if the singular vectors of $J_{io}$ corresponding to very large/small singular values align well with the perturbations in the data, training will still be inefficient. They proposed the stronger condition of dynamical isometry [24], which requires that all singular values of $J_{io}$ are close to one. This means that all perturbations of the input signal propagate through the network equally well. The ReLU activation function maps to zero for some perturbations of the input signal, and it is therefore intuitive that deep networks with ReLU activations cannot possibly satisfy dynamical isometry, as was rigorously established in [20]. For some activation functions and network architectures, elaborate initialization schemes allow the network to satisfy dynamical isometry at initialization, which significantly improves training dynamics [25, 22, 29, 6].

2.2 Normalization

An alternative approach to improve the trainability of deep neural networks is to incorporate layers that explicitly provide normalization. Many normalization modules have been proposed, with the two most popular ones being BatchNorm [12] and LayerNorm [2]. In general, normalization aims to ensure that initially, signals have zero mean and unit variance as they propagate through a network, reducing “covariate shift” [12]. For simplicity we will focus primarily on comparisons against LayerNorm because BatchNorm has additional regularizing effects that are orthogonal to our investigation.

Normalization methods have shown success in accelerating the training of deep networks, but they do incur a computational cost to the network and pose additional hyperparameters to tune (e.g., where to place the normalization). In contrast to normalization methods, our proposed method is simple and cheap to implement. ReZero alone is sufficient to train deeper networks, even in the absence of various norms. Although ReZero makes normalization superfluous for convergence, we have found the regularizing effect of BatchNorm to be complementary to our approach.

2.3 Residual connections

The identity mappings introduced in [10] enabled a deep residual learning framework in the context of convolutional networks for image recognition that significantly increased the trainable depth. The complementary use of BatchNorm and ResNets [10] has enabled the training of convolutional neural networks with over 100 layers. The same has not been the case for LayerNorm and Transformer architectures. Yang et al. [29] studied residual fully connected networks and demonstrated that due to the skip connection, signals decay more slowly (polynomially) as they propagate, allowing for effective training of deeper networks.

Concurrently with our work SkipInit [3], an alternative to the BatchNorm, was proposed for ResNet architectures that is similar to ReZero. The authors find that in deep ResNets without BatchNorm, a scalar multiplier is needed to ensure convergence. We arrive at a similar conclusion for the specific case considered in [3], and study more generally signal propagation in deeper networks across multiple architectures and beyond BatchNorm.

3 ReZero

We propose ReZero (residual with zero initialization), a simple change to the architecture of deep residual networks that facilitates dynamical isometry and enables the efficient training of extremely deep networks. Rather than propagating the signal through each of the non-trivial functions $F[\mathcal{W}_i]$ at initialization, we add a skip connection and rescale the function by $L$ learnable parameters $\alpha_i$ (which we call residual weights) that are initialized to zero. The signal now propagates according to

$$x_{i+1} = x_i + \alpha_i F[\mathcal{W}_i](x_i). \tag{4}$$

At initialization the network represents the identity function and it trivially satisfies dynamical isometry. We demonstrate below for a toy model that this architecture can exponentially accelerate training. The architecture modification allows for the training of deep networks even when the individual layers’ Jacobian has vanishing singular values, as is the case for ReLU activation functions or self-attention [26]. The technique also allows us to add arbitrary new layers to existing and trained networks.
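As a minimal sketch of this propagation rule, the following pure-Python example stacks ReLU layers behind ReZero connections and checks that the network is the identity map when every residual weight is zero. The function names, the width of 4, the depth of 32 and the random seed are assumptions made for illustration; this is not the authors' implementation.

```python
import math
import random


def relu_layer(weights, x):
    """A width-preserving layer F[W](x) = relu(W x); W is given as a list of rows."""
    return [max(0.0, sum(w_ij * x_j for w_ij, x_j in zip(row, x))) for row in weights]


def rezero_forward(layers, alphas, x):
    """Propagate x through x_{i+1} = x_i + alpha_i * F[W_i](x_i)."""
    for weights, alpha in zip(layers, alphas):
        fx = relu_layer(weights, x)
        x = [x_j + alpha * f_j for x_j, f_j in zip(x, fx)]
    return x


random.seed(0)
width, depth = 4, 32  # illustrative sizes
layers = [
    [[random.gauss(0.0, math.sqrt(2.0 / width)) for _ in range(width)] for _ in range(width)]
    for _ in range(depth)
]
x0 = [1.0, -2.0, 0.5, 3.0]

# ReZero initialization: all residual weights start at zero,
# so the whole stack acts as the identity map on x0.
out_rezero = rezero_forward(layers, [0.0] * depth, x0)

# Vanilla residual network for comparison (alpha fixed to 1 in every layer).
out_residual = rezero_forward(layers, [1.0] * depth, x0)

print(out_rezero)
print(out_residual)
```

Setting every entry of `alphas` to 1 recovers the plain residual network of row 2 in Table 1, so the same function can be used to compare the two at initialization.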

3.1 A toy example

Figure 2: Contour log plots of a quadratic cost function (left) and gradient norm (right) over the network weight $w$ and the residual weight $\alpha$ during the training of a linear function via gradient descent. Gradient descent trajectories initialized at $\alpha = 0$ are shown in red for five different initial $w$'s. The trajectory dynamics avoid the poorly conditioned regions around $\alpha \approx 1$.

To illustrate how the ReZero connection accelerates training let us consider the toy model of a deep neural network described by $L$ single-neuron hidden layers that have no bias and all share the same weight $w$ and $\alpha_i = \alpha \;\forall i$. The network then simply maps an input $x_0$ to the output

$$x_L = (1 + \alpha w)^L x_0. \tag{5}$$

Fixing the parameter $\alpha = 1$ would represent a toy model for a fully connected residual network, while initializing $\alpha = 0$ and treating $\alpha$ as a learned parameter corresponds to a ReZero network. The input-output Jacobian is given by $J_{io} = (1 + \alpha w)^L$, indicating that for initialization with $w \approx 1$ and $\alpha = 1$ the output signal of a deep (i.e., $L \gg 1$) network is extremely sensitive to any small perturbations of the input, while with $\alpha = 0$ the input signal magnitude is preserved. While this example is too simple to exhibit an order/chaos phase transition, it does accurately model the vanishing and exploding gradient problem familiar in deep networks. Assuming a learning rate $\lambda$ and a cost function $\mathcal{C}$, gradient descent updates the weights $w$ according to

$$w \leftarrow w - \lambda L \alpha x_0 (1 + \alpha w)^{L-1} \, \partial_x \mathcal{C}(x)\big|_{x = x_L}. \tag{6}$$

For $\alpha = 1$, convergence of gradient descent with an initial weight $w \approx 1$ requires steps no larger than 1, and hence a learning rate that is exponentially small in depth $L$,

$$\lambda \propto L^{-1} (1 + w)^{-(L-1)}, \tag{7}$$

where we only retained the parametric dependence on $w$ and $L$. For $w \gg 1$ the gradients in Equation 6 explode, while for $w \approx -1$ the gradients vanish. Initializing $\alpha = 0$ solves both of these problems: assuming a sufficiently well-conditioned cost function, the first step of gradient descent will update the residual weights $\alpha$ to a value that avoids large outputs and keeps the parameter trajectory within a well-conditioned region while retaining the expressive power of the network. The first non-trivial steps of the residual weight are given by

$$\alpha_0 = 0, \qquad \alpha_1 = -\lambda L w x_0 \, \partial_x \mathcal{C}(x)\big|_{x = x_0}, \tag{8}$$

and gradient descent will converge with a learning rate that is polynomial in the depth of the network. In this simple example, the ReZero connection, therefore, allows for convergence with dramatically fewer optimization steps than a vanilla residual network. We illustrate the training dynamics, cost function and gradients in Figure 2.
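The toy model can be checked numerically with a short sketch. It assumes a quadratic cost $(x_L - \text{target})^2 / 2$, a depth of 32 and a target of 2, all chosen here for illustration rather than taken from Figure 2; the function names are likewise hypothetical.

```python
def toy_output(w, alpha, depth, x0):
    """x_L = (1 + alpha * w)^L * x_0 for L shared-weight single-neuron layers."""
    return (1.0 + alpha * w) ** depth * x0


def toy_jacobian(w, alpha, depth):
    """Input-output Jacobian dx_L/dx_0 = (1 + alpha * w)^L."""
    return (1.0 + alpha * w) ** depth


def toy_gradients(w, alpha, depth, x0, target):
    """Gradients of the assumed cost C = (x_L - target)^2 / 2 w.r.t. w and alpha."""
    dC_dx = toy_output(w, alpha, depth, x0) - target
    common = depth * x0 * (1.0 + alpha * w) ** (depth - 1) * dC_dx
    return alpha * common, w * common  # (dC/dw, dC/dalpha)


depth, w, x0, target = 32, 1.0, 1.0, 2.0  # illustrative values

jac_residual = toy_jacobian(w, 1.0, depth)  # alpha = 1: (1 + w)^L = 2^32
jac_rezero = toy_jacobian(w, 0.0, depth)    # alpha = 0: exactly 1

grad_w_res, grad_a_res = toy_gradients(w, 1.0, depth, x0, target)
grad_w_rz, grad_a_rz = toy_gradients(w, 0.0, depth, x0, target)

print(jac_residual, jac_rezero)
print(grad_w_res, grad_w_rz, grad_a_rz)
```

At $\alpha = 0$ the gradient with respect to $w$ is zero and the gradient with respect to $\alpha$ is $L w x_0 \, \partial_x \mathcal{C}$, which grows only linearly in depth, whereas at $\alpha = 1$ the gradient with respect to $w$ carries the exponential factor $(1 + w)^{L-1}$.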

4 Training deep fully connected networks faster

Figure 3: Cross entropy loss during training of four variants of 32 layer fully connected networks with ReLU activations. The bracketed numbers refer to the architectures in the corresponding rows of Table 1. We average over five runs each and show error bands. For all models we use the Adagrad [5] optimizer.

We now study the effect of ReZero on deep ReLU networks, and when combined with various approaches that facilitate deep learning listed in the rows of Table 1. Specifically, we will compare a vanilla deep fully connected network (FC, row 1), a deep network with residual connections (FC+Res, row 2), a deep network with LayerNorm (FC+Norm, row 3), and finally our proposal ReZero (row 6). We choose the initial weights normally distributed with variances optimal for training, i.e. $\sigma_W^2 = 2/w$ for all but the vanilla residual network, for which a different variance is used, see [9, 29].

As a sample toy task, we train four different network architectures on the CIFAR-10 data set for supervised image classification. We are only interested in the training dynamics and investigate how many iterations it takes for the model to fit the data.

We show the evolution of the training loss in Figure 3. In our simple experiment with a 32 layer network, the ReZero architecture converges to fit the training data between 7 and 15 times faster than the other techniques. Note that without an additional normalization layer the residual connection decreases convergence speed compared to a plain fully connected network. We speculate that this is because at initialization the variance of the signal is not independent of depth, see [29].

With increasing depth, the advantages of the ReZero architecture become more apparent. To verify that this architecture ensures trainability to large depths we successfully trained fully connected ReZero networks with up to 10,000 layers on a laptop with one GPU (footnote 2) to overfit the training set.

Footnote 2: To train at these extreme depths we used the Adagrad optimizer.

5 Training deeper Transformers faster

Figure 4: ReZero for Transformers

In this section, we study the signal propagation and application of ReZero to the Transformer architecture [26]. Transformers gained significant popularity and success both in supervised and unsupervised NLP tasks [4, 1]. Transformers are built by stacking modules that first perform self-attention, then a point-wise feed-forward transformation.

The original Transformer [26] implementation can be seen as a residual network with post-normalization (row 5 in Table 1). Inside a Transformer module the output of each sublayer is added via a residual connection and then normalized by LayerNorm,

$$x_{i+1} = \mathrm{LayerNorm}\big(x_i + \mathrm{sublayer}(x_i)\big), \tag{9}$$

where $\mathrm{sublayer} \in \{\text{self-attention}, \text{feed-forward}\}$, as illustrated in the left panel of Figure 4.

5.1 Signal propagation in Transformers

Two crucial components relevant to the signal propagation in the original Transformer layers are LayerNorm [2] and (multi-head) self-attention [26]. We will argue that neither component, by itself or in conjunction with a vanilla residual connection, can satisfy dynamical isometry for all input signals. This finding motivates the use of a ReZero connection to replace both LayerNorm and the vanilla residual connection.

Layer normalization removes the mean and scales the variance over all neurons of a given layer and introduces learnable parameters $\gamma$ and $\beta$ to re-scale the variance and shift the mean according to

$$\mathrm{LayerNorm}(x) = \frac{x - \mathbb{E}(x)}{\sqrt{\mathrm{Var}(x)}} \, \gamma + \beta. \tag{10}$$

It is clear from this definition that perturbing an input $x$ by a transformation that purely shifts either its mean or variance will leave the output unchanged. These perturbations, therefore, give rise to two vanishing singular values of the input-output Jacobian. In the Transformer architecture [26] the norm is applied to each of the $n$ elements of the input sentence, leading to a total of $2n$ vanishing singular values of the Jacobian for each Transformer layer.
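The two invariances behind these vanishing singular values can be verified directly. The sketch below uses a plain-Python layer normalization over a single vector with scalar `gamma` and `beta`; the helper name and the example vector are assumptions for illustration.

```python
import math


def layer_norm(x, gamma=1.0, beta=0.0):
    """Normalize x to zero mean and unit variance, then rescale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [gamma * (v - mean) / math.sqrt(var) + beta for v in x]


x = [0.5, -1.0, 2.0, 3.5]
mean_x = sum(x) / len(x)

base = layer_norm(x)
# Perturbation 1: shift the mean while leaving the variance untouched.
mean_shifted = layer_norm([v + 10.0 for v in x])
# Perturbation 2: rescale the variance while leaving the mean untouched.
variance_scaled = layer_norm([mean_x + 3.0 * (v - mean_x) for v in x])

print(base)
print(mean_shifted)
print(variance_scaled)
```

Both perturbed inputs produce the same output as `x` up to floating-point error, so these two input directions carry no signal through the normalization.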

Figure 5: Histograms for log singular values of the input-output Jacobian matrix for: (a) Transformer encoder network at initialization of depths 4, 12 and 64 layers; (b) ReZero Transformer encoder network with 64 layers before and during training. Deep Transformers are far from dynamical isometry, with many singular values $\sigma \ll 1$, while ReZero Transformers remain closer to dynamical isometry with mean singular value $\sigma \approx 1$.

Self-attention allows the model to relate content located across different positions by computing a weighted sum of an input sequence. Specifically, the matrix $x \in \mathbb{R}^{n \times d}$ contains an input sequence of $n$ rows containing $d$-dimensional embedding vectors, from which we can evaluate the query, key and value matrices $Q, K, V = x \cdot W^{Q,K,V}$, where the matrices $W^{Q,K,V}$ are $d \times d$. The scaled dot-product attention then is given by

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d}}\right) \cdot V. \tag{11}$$

In general, the singular value spectrum of the Jacobian of this attention process is complicated. Rather than studying it in full generality, we now merely argue that for some inputs $x$ and weights $W^{Q,K}$ the Jacobian has a large number of vanishing singular values (a claim we evaluate empirically below). Consider weights or inputs such that each of the arguments of the softmax function is small compared to 1. The softmax function then simply returns an $n \times n$ dimensional matrix filled with entries that all approximate $1/n$. This means that the attention function projects all embedding vectors of the input sequence onto a single diagonal direction. This implies that out of the $n \cdot d$ Jacobian singular values only $d$ are non-vanishing and hence much of the input signal is lost. A residual connection can restore some of the lost signals, but even then some perturbations are amplified while others are attenuated. This example demonstrates that self-attention is incompatible with dynamical isometry and unimpeded signal propagation in deep Transformer networks. It is easy to verify that the same conclusion holds for multi-head attention. A careful initialization of the weights might alleviate some of these issues, but we are not aware of any initialization scheme that would render a Transformer layer consistent with dynamical isometry.
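The small-argument regime of the softmax described above can be illustrated with a single-head sketch in plain Python. The sequence length of 3, the embedding dimension of 2, the tiny query/key weights and the identity value matrix are all assumptions chosen for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Matrix product of two matrices given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention; returns (weights, output)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d = len(x[0])
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return weights, matmul(weights, v)


x = [[1.0, 2.0], [-1.0, 0.5], [3.0, -2.0]]   # n = 3 tokens, d = 2
tiny = [[1e-4, 0.0], [0.0, 1e-4]]            # makes every softmax argument tiny
identity = [[1.0, 0.0], [0.0, 1.0]]

weights, output = attention(x, tiny, tiny, identity)
column_means = [sum(col) / len(x) for col in zip(*x)]

print(weights)
print(output)
```

Every attention weight is close to 1/n, so each output row is approximately the mean of the value rows and the distinction between the individual tokens is lost.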

We gave a theoretical argument that the vanilla Transformer contains elements that inhibit deep signal propagation. Here, we verify these claims in practice by obtaining the input-output Jacobian for the attention process by evaluating its change under an infinitesimal variation of each of the entries of the input sequence . We show the input-output Jacobian for Transformer encoder layers of various depth with Xavier uniform initialized weights in Figure 5a. While shallow Transformers exhibit a singular value distribution peaked around unity, we clearly observe that the Jacobian of deep architectures has a large number of singular values that vanish to machine precision. While the distribution varies depending on the details of the initialization scheme, the qualitative statement holds more broadly. These results are consistent with the common observation that deep Transformer networks are extremely challenging to train.

We apply ReZero to solve the problem of poor signal propagation in Transformer layers by replacing LayerNorm and re-scaling the self-attention block. Specifically, this modifies equation (9) to

$$x_{i+1} = x_i + \alpha_i \, \mathrm{sublayer}(x_i), \tag{12}$$

where $\alpha_i$ is the learned residual weight parameter as in the right panel of Figure 4. We share the same parameter $\alpha_i$ for a pair of multi-head self-attention and feed-forward network within a Transformer layer. At initialization, $\alpha_i = 0$, which allows for unimpeded signal propagation: All singular values of the input-output Jacobian are 1 and the model trivially satisfies dynamical isometry. To verify that the model remains close to dynamical isometry throughout training and for larger $\alpha_i$, we show a histogram of the Jacobian singular values during the training of a 64 layer model on a toy task of language modeling on WikiText-2 [16] in Figure 5b. During training the weight of the residual connection gradually increases, allowing the Transformer to model extremely complex functions while maintaining signal propagation properties close to dynamical isometry.
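The structure of such a layer, with one residual weight shared by both sublayers, might be sketched as follows. The class name is hypothetical, the input is simplified to a flat list of floats, and the two sublayers are stand-in functions (mean mixing and ReLU) rather than real self-attention and feed-forward modules.

```python
class ReZeroTransformerLayer:
    """Sketch of a ReZero layer: x + alpha * sublayer(x), applied twice."""

    def __init__(self, self_attention, feed_forward):
        self.self_attention = self_attention
        self.feed_forward = feed_forward
        self.alpha = 0.0  # single residual weight shared by both sublayers

    def __call__(self, x):
        x = [xi + self.alpha * si for xi, si in zip(x, self.self_attention(x))]
        x = [xi + self.alpha * fi for xi, fi in zip(x, self.feed_forward(x))]
        return x


def mean_mixing(x):
    """Stand-in for self-attention: every position receives the sequence mean."""
    return [sum(x) / len(x)] * len(x)


def relu(x):
    """Stand-in for the point-wise feed-forward sublayer."""
    return [max(0.0, v) for v in x]


layer = ReZeroTransformerLayer(mean_mixing, relu)
x = [1.0, -3.0]

at_init = layer(x)        # alpha = 0: the layer is the identity map

layer.alpha = 0.5         # as if training had increased the residual weight
after_update = layer(x)

print(at_init, after_update)
```

Because no normalization sits between the sublayers, the layer reduces exactly to the identity at initialization, and the sublayers only start to contribute as `alpha` moves away from zero.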

5.2 Convergence speed

Model                     Iterations   Speedup
Post-Norm [26]            Diverged     -
     + Warm-up            13,690       1
Pre-Norm                  17,765       0.77
GPT2-Norm [23]            21,187       0.65
ReZero $\alpha = 1$       14,506       0.94
ReZero $\alpha = 0$       8,800        1.56
Table 2: Comparison of various 12 layer Transformers normalization variants against ReZero and the training iterations required to reach 1.2 BPB on enwiki8 validation set.

Model                              Parameters   BPB
Character TX 12L [1]               41M          1.11
TX 12L + Warm-up                   38M          1.17
TX 12L + ReZero $\alpha = 1$       34M          1.17
TX 12L + ReZero $\alpha = 0$       34M          1.17
Character TX 64L [1]               219M         1.06
TX 64L                             51M          Diverged
TX 64L + Warm-up                   51M          Diverged
TX 64L + ReZero $\alpha = 1$       51M          Diverged
TX 64L + ReZero $\alpha = 0$       51M          1.11
TX 128L + ReZero $\alpha = 0$      101M         1.08
Table 3: Comparison of Transformers (TX) on the enwiki8 test set.

We pick language modeling on enwiki8 [15] as a benchmark because strong language models are a good indicator of downstream NLP task performance [23]. Our aim in these experiments is to measure the convergence speed of each method by measuring the number of iterations it takes for a 12 layer Transformer to reach 1.2 bits per byte (BPB) on enwiki8.

Since the introduction of Transformers [26], there have been several competing placements of the LayerNorm within the Transformer to achieve better convergence [23, 28]. We experiment with 3 Transformer normalization methods and compare against the ReZero Transformer. The Post-Norm (Row 5 in Table 1) method is equivalent to the vanilla Transformer in [26], the Pre-Norm (Row 4 in Table 1) method was recently introduced in [28] and the GPT2-Norm method was used in the training of GPT2 [23], which has successfully trained Transformers up to 48 layers. Finally, we experiment with our proposed ReZero method with $\alpha$ initialized to either zero or one. The hyperparameters are in Appendix A.

Our results (Table 2) show that Post-Norm diverges during training while all other models are able to converge. This is not surprising as the original Transformer implementation required a learning rate warm-up and this is also confirmed in [28]. To verify this, we re-ran the Post-Norm setup with 100 steps of learning rate warm-up and find that the model is able to converge to 1.2 BPB in 13,690 iterations. Under this setting, we compared other LayerNorm placement schemes against Post-Norm. We find that the other placements led to initially faster convergence, but ultimately Post-Norm catches up in performance, resulting in relatively slower convergence for Pre-Norm and GPT2-Norm. However, other LayerNorm placements have an advantage over Post-Norm in that they do not require learning rate warm-up, and thus have fewer hyperparameters to tune. ReZero with $\alpha = 1$ does not show an improvement over the vanilla Transformer, indicating the importance of initializing $\alpha = 0$. With our proposed initialization of $\alpha = 0$, ReZero converges 56% faster than the vanilla Transformer.
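The speedup column can be reproduced from the iteration counts as the ratio of the warmed-up Post-Norm baseline to each variant. The short sketch below uses the iteration counts listed in the convergence table; the dictionary keys are labels chosen here, with the two ReZero rows assigned to $\alpha = 1$ and $\alpha = 0$ following the discussion above.

```python
baseline_iterations = 13690  # Post-Norm with learning rate warm-up

iterations_to_target = {
    "Pre-Norm": 17765,
    "GPT2-Norm": 21187,
    "ReZero alpha=1": 14506,
    "ReZero alpha=0": 8800,
}

# Speedup relative to the baseline: values above 1 mean fewer iterations
# were needed to reach 1.2 BPB than with Post-Norm + warm-up.
speedup = {
    name: round(baseline_iterations / n, 2)
    for name, n in iterations_to_target.items()
}

for name, value in speedup.items():
    print(f"{name}: {value}")
```

A ratio of 1.56 for ReZero with zero initialization is what the text reports as converging 56% faster than the vanilla Transformer.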

5.3 Deeper Transformers

Transformer models that achieve state of the art performance in many NLP tasks [4] usually have fewer than 24 layers. The deepest model as of our work used up to 78 layers [17] and required 256 GPUs for training. In this section, we will scale beyond hundreds of Transformer layers and still remain trainable on a desktop machine. To examine whether our approach scales to deeper Transformers, we extend our 12 layer ReZero Transformer from Section 5.2 to 64 and 128 layers and compare against the vanilla Transformer (Post-Norm). The hyperparameters are in appendix section B.

Our results (Table 3) indicate that a 12 layer ReZero Transformer attains the same BPB as a regular Transformer after convergence, which shows that we do not lose any representational expressivity in our model by replacing LayerNorm with ReZero. We find that trying to train deep vanilla Transformers leads to either convergence difficulties or slow training times. When scaled to 64 layers, the vanilla Transformer fails to converge even with a warm-up schedule. A ReZero Transformer with initialization of $\alpha = 1$ diverges, supporting our theoretically motivated initialization at $\alpha = 0$. The deeper ReZero Transformers are able to attain better performance than the shallower Transformers.

For comparison, we also display results from Character Transformer [1] which had a similar setup for reference. However, Character Transformer uses more parameters and has many additional auxiliary losses to achieve their performance, which is orthogonal to our work. Our 128 layer Transformer achieves similar performance without any intermediate losses, uses half the number of parameters and has larger depth. We did not tune our hyperparameters, and our models can potentially achieve better results with stronger regularization and a learning rate schedule.

Figure 6: Heat map for residual weight evolution during training for 64L ReZero Transformer.

To probe deeper into our model, we examine the behavior of residual weights $\alpha_i$ during training for our 12 layer and 64 layer ReZero Transformer (Figure 6). It is useful to view $|\alpha_i|$ as the amount of contribution each layer provides to the overall signal of the network. We see that an interesting pattern emerges for both the shallow and the deeper ReZero Transformer. During the early iterations of training, the residual weights quickly increase to a peak value, then slowly decay to a small value throughout training. Early in training, the higher layers tend to be dominant (they peak earlier) and towards the end of training each layer is utilized to a similar degree. The average $|\alpha_i|$ at the end of training for both the 12 and 64 layer models is approximately $1/L$, where $L$ is the number of residual layers.

Interestingly, this pattern also occurs in the 12 layer ReZero Transformer when we initialized $\alpha = 1$, except the model spends the initial iterations forcing the $\alpha$'s to small values, before reaching a similar pattern to that shown in Figure 6. This empirical finding supports our proposal that we should initialize $\alpha = 0$ even for shallow models.

6 Training ResNets faster

In the previous sections, we saw how ReZero connections enable training of deep networks that contain layers with vanishing Jacobian singular values, such as ReLU activations or self-attention. Some of these architectures are not trainable without ReZero connections or other architectural changes. In this section, we apply ReZero connections to deep residual networks for image recognition [10]. While these networks are trainable without ReZero connections, we observe that the validation error for a ResNet56 model (footnote 3) trained (up to 200 epochs) on the CIFAR-10 dataset improves significantly after trading all vanilla residual connections in the model for ReZero connections (footnote 4). The number of epochs to decrease the validation error below 15% also dropped after implementing ReZero, corresponding to the 32% speed-up reported in Section 1. While these results provide only limited insight by themselves, they point towards broader applicability of ReZero connections and motivate further study.

Footnote 3: For our experiments we used the implementation by Yerlan Idelbayev, which very closely resembles the original architecture [10].

Footnote 4: Our setup differs from the SkipInit proposal in [3], in that we retain the BatchNorm layer.

7 Conclusion

We introduced ReZero, a simple architecture modification that facilitates signal propagation in deep networks and helps the network maintain dynamical isometry. Applying ReZero to various residual architectures – fully connected networks, Transformers and ResNets – we observed significantly improved convergence speeds. Furthermore, we were able to efficiently train Transformers with hundreds of layers, which has been difficult with the original architecture. We believe deeper Transformers will open doors for future exploration.

While training models with ReZero, we discovered interesting patterns in the values of residual weights of each layer over the course of training. These patterns may hint towards some form of curriculum learning and allow for progressive stacking of layers to further accelerate training [8]. Patterns of residual weights can be crucial to understand the training dynamics of such deeper networks and might be important to model performance, which we will explore in future work.


The work of TB was supported in part by DOE under grants no. DE-SC0009919 and by the Simons Foundation SFARI 560536. The work of BPM and HHM was supported by Amazon via the grant of Alexa Prize Grand Challenge 3.


  • [1] R. Al-Rfou, D. Choe, N. Constant, M. Guo, and L. Jones (2019) Character-level language modeling with deeper self-attention. In

    Proceedings of the AAAI Conference on Artificial Intelligence

    Vol. 33, pp. 3159–3166. Cited by: §1, §5.3, Table 3, §5.
  • [2] J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §1, §2.2, §5.1.
  • [3] S. De and S. L. Smith (2020) Batch normalization biases deep residual networks towards shallow paths. arXiv preprint arXiv:2002.10444. Cited by: §2.3, footnote 4.
  • [4] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL, J. Burstein, C. Doran, and T. Solorio (Eds.), pp. 4171–4186. External Links: Link, Document Cited by: Appendix A, §5.3, §5.
  • [5] J. Duchi, E. Hazan, and Y. Singer (2011) Adaptive subgradient methods for online learning and stochastic optimization.

    Journal of machine learning research

    12 (Jul), pp. 2121–2159.
    Cited by: Figure 3.
  • [6] D. Gilboa, B. Chang, M. Chen, G. Yang, S. S. Schoenholz, E. H. Chi, and J. Pennington (2019) Dynamical isometry and a mean field theory of lstms and grus. arXiv preprint arXiv:1901.08987. Cited by: §2.1.
  • [7] X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256. Cited by: §2.1.
  • [8] L. Gong, D. He, Z. Li, T. Qin, L. Wang, and T. Liu (2019) Efficient training of BERT by progressively stacking. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 2337–2346. Cited by: §7.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034. Cited by: §1, §2.1, §4.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1, §2.3, §6, footnote 3.
  • [11] D. Hendrycks and K. Gimpel (2016) Bridging nonlinearities and stochastic regularizers with Gaussian error linear units. CoRR abs/1606.08415. Cited by: Appendix A.
  • [12] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §1, §2.2.
  • [13] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter (2017) Self-normalizing neural networks. In Advances in neural information processing systems, pp. 971–980. Cited by: §1.
  • [14] Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436–444. Cited by: §1.
  • [15] M. Mahoney (2009) Large text compression benchmark. Cited by: §5.2.
  • [16] S. Merity, C. Xiong, J. Bradbury, and R. Socher (2017) Pointer sentinel mixture models. In ICLR. Cited by: §5.1.
  • [17] Microsoft (2020) Turing-NLG: a 17-billion-parameter language model. Cited by: §5.3.
  • [18] D. Mishkin and J. Matas (2015) All you need is a good init. arXiv preprint arXiv:1511.06422. Cited by: §1.
  • [19] G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio (2014) On the number of linear regions of deep neural networks. In Advances in neural information processing systems, pp. 2924–2932. Cited by: §2.
  • [20] J. Pennington, S. Schoenholz, and S. Ganguli (2017) Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. In Advances in neural information processing systems, pp. 4785–4795. Cited by: §2.1, §2.1.
  • [21] J. Pennington, S. S. Schoenholz, and S. Ganguli (2018) The emergence of spectral universality in deep networks. arXiv preprint arXiv:1802.09979. Cited by: §2.1, §2.1.
  • [22] B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli (2016) Exponential expressivity in deep neural networks through transient chaos. In Advances in neural information processing systems, pp. 3360–3368. Cited by: §1, §2.1, §2.
  • [23] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8), pp. 9. Cited by: §1, §5.2, §5.2, Table 3.
  • [24] A. M. Saxe, J. L. McClelland, and S. Ganguli (2013) Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120. Cited by: §2.1.
  • [25] S. S. Schoenholz, J. Gilmer, S. Ganguli, and J. Sohl-Dickstein (2016) Deep information propagation. arXiv preprint arXiv:1611.01232. Cited by: §2.1.
  • [26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §3, §5.1, §5.1, §5.2, Table 3, §5, §5.
  • [27] L. Xiao, Y. Bahri, J. Sohl-Dickstein, S. S. Schoenholz, and J. Pennington (2018) Dynamical isometry and a mean field theory of CNNs: how to train 10,000-layer vanilla convolutional neural networks. arXiv preprint arXiv:1806.05393. Cited by: §1, §2.1.
  • [28] R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang, and T. Liu (2020) On layer normalization in the transformer architecture. arXiv preprint arXiv:2002.04745. Cited by: §5.2, §5.2.
  • [29] G. Yang and S. Schoenholz (2017) Mean field residual networks: on the edge of chaos. In Advances in neural information processing systems, pp. 7103–7114. Cited by: §2.1, §2.3, §4, §4.
  • [30] Y. You, J. Li, J. Hseu, X. Song, J. Demmel, and C. Hsieh (2019) Reducing BERT pre-training time from 3 days to 76 minutes. CoRR abs/1904.00962. Cited by: Appendix A, Appendix B.

Appendix A Convergence speed experimental hyperparameters

For all model variants in Section 5.2, we fix the batch size at 1080, the number of layers at 12, the feed-forward and attention dropout at 20%, the hidden and embedding size at 512 units, the context length at 512, and the number of attention heads at 2, and we use the GELU [11] activation in the point-wise feed-forward layer. To accommodate large-batch training we use the LAMB optimizer [30] with a fixed learning rate of . Although learning rate schedules tend to improve performance [4], we omit them to simplify our training process.
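For reference, the settings listed in this appendix can be collected into a single configuration object. The following is only a minimal sketch: the class name `ConvergenceConfig` and all field names are illustrative assumptions and do not come from our released code. The learning rate is left unset because its value is not given in the text above, and the `head_dim` helper assumes the common convention that the hidden size is split evenly across attention heads.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConvergenceConfig:
    """Hyperparameters listed in Appendix A (all names are illustrative)."""

    batch_size: int = 1080
    num_layers: int = 12
    dropout: float = 0.2           # feed-forward and attention dropout (20%)
    hidden_size: int = 512         # also used as the embedding size
    context_length: int = 512
    num_heads: int = 2
    activation: str = "gelu"       # point-wise feed-forward activation
    optimizer: str = "lamb"
    # The learning-rate value is not stated in the text, so it is
    # deliberately left unset here rather than guessed.
    learning_rate: Optional[float] = None
    use_lr_schedule: bool = False  # fixed learning rate, no schedule

    def __post_init__(self) -> None:
        # Assumption: the hidden size is divided evenly among the heads.
        if self.hidden_size % self.num_heads != 0:
            raise ValueError("hidden_size must be divisible by num_heads")

    @property
    def head_dim(self) -> int:
        # Per-head width under the even-split assumption above.
        return self.hidden_size // self.num_heads


config = ConvergenceConfig()
print(config)
print("per-head dimension:", config.head_dim)
```

Freezing the dataclass keeps the shared settings immutable, so every model variant in the comparison is guaranteed to be built from the same values.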

Appendix B Deep Transformers experimental hyperparameters

In Section 5.3, to examine whether our approach scales to deeper Transformers, we scale our 12-layer ReZero Transformer from Section 5.2 to 64 and 128 layers and compare it against the vanilla Transformer (Post-Norm). Due to memory constraints, we decreased the hidden size from 512 to 256 and reduced the batch size to 304 and 144 for the 64-layer and 128-layer models, respectively. Following guidelines from [30] we also adjusted the learning rate to according to . For all models in our experiments we limit training to a maximum of 100 training epochs.
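The relation between the 12-layer baseline and the two deeper variants can be sketched as overrides on a shared configuration. This is an illustrative sketch only: `TransformerConfig` and its field names are assumptions made for this example, and the learning rate is left unset because neither its value nor the adjustment rule is given in the text above.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class TransformerConfig:
    """Subset of hyperparameters that change with depth (names are illustrative)."""

    num_layers: int
    hidden_size: int
    batch_size: int
    max_epochs: int = 100                  # training cap stated for all models
    learning_rate: Optional[float] = None  # not stated in the text; left unset


# 12-layer baseline from Section 5.2 (Appendix A).
base = TransformerConfig(num_layers=12, hidden_size=512, batch_size=1080)

# Deeper variants: hidden size and batch size are reduced for memory reasons.
deep_64 = replace(base, num_layers=64, hidden_size=256, batch_size=304)
deep_128 = replace(base, num_layers=128, hidden_size=256, batch_size=144)

for cfg in (base, deep_64, deep_128):
    print(cfg)
```

Using `dataclasses.replace` makes explicit which settings differ from the baseline, while everything else is inherited unchanged.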