1 Introduction
Neural network training has long been a focus of deep learning research. One prominent advance is the application of normalization methods. Ioffe and Szegedy (2015) first introduced the concept of normalizing layers with Batch Normalization (BatchNorm). It is widely believed that by controlling the mean and variance of layer inputs across mini-batches, BatchNorm stabilizes the distribution and improves training efficiency. Following this work, Lei Ba et al. (2016) point out its limitation in Recurrent Neural Networks (RNN) and propose Layer Normalization (LayerNorm), which is performed across the neurons in a layer. LayerNorm is adaptive to RNN and self-attention-based models. A typical example is its application in the state-of-the-art framework, Transformer (Vaswani et al., 2017). LayerNorm enables faster training of Transformer and is irreplaceable in this framework.
Despite its great success, it is still unclear why LayerNorm is so effective. The widely accepted explanation is that forward normalization brings distribution stability (Ioffe and Szegedy, 2015; Lei Ba et al., 2016). Recent studies show that the effects of BatchNorm are not related to the stability of input distribution (Zhang et al., 2017; Santurkar et al., 2018). They also propose that BatchNorm is effective because normalization smooths the optimization landscape. However, it is still unclear whether these theories can explain the success of LayerNorm.
The main contribution of this paper is to explore how LayerNorm works. Through a series of analyses, we find that the derivatives of the mean and variance are important because they recenter and rescale backward gradients. Furthermore, contrary to our expectation, the bias and gain do not work in most cases. The details of our findings are summarized below.
The derivatives of the mean and variance are more important to LayerNorm than forward normalization. Many previous studies believe that forward normalization is the only decisive factor in LayerNorm: it makes the input distribution more stable and thus brings better convergence. Unlike them, our experimental results show that forward normalization has little to do with the effectiveness, and that the derivatives of the mean and variance play a significant role in LayerNorm. To illustrate how these derivatives work, we propose DetachNorm, which adds a detaching operation to LayerNorm to change the mean and variance from variables to constants. It preserves the recentering and rescaling effect but cuts off the derivatives of the mean and variance with respect to the input. DetachNorm performs worse than LayerNorm on six out of eight datasets. This suggests that the derivatives of the mean and variance are useful to LayerNorm. Furthermore, to investigate the reason for the above observation, we analyze the gradients in LayerNorm and DetachNorm, and find that the derivatives of means recenter gradients and the derivatives of variances rescale gradients.
The parameters of LayerNorm, including the bias and gain, increase the risk of overfitting and do not work in most cases. The bias and gain are applied for affine transformation on normalized vectors. They are expected to enhance the expressive power by reshaping the distribution. To evaluate their effects on results, we build a simple version of LayerNorm (LayerNorm-simple) by removing the bias and gain. Our experimental results show that LayerNorm-simple achieves better results than LayerNorm on four datasets. It even achieves state-of-the-art performance on En-Vi machine translation. By comparing loss curves of LayerNorm with and without the bias and gain, we find that the bias and gain cause overfitting. We speculate that the overfitting arises mainly because the bias and gain are learned from the training set and cannot adjust themselves towards different input distributions when testing.
Motivated by this assumption, we propose a novel normalization method, Adaptive Normalization (AdaNorm). AdaNorm replaces the bias and gain with a new transformation function. This function adaptively adjusts scaling weights based on input values. We evaluate AdaNorm and LayerNorm on eight datasets, covering tasks of machine translation, language modeling, text classification, image classification, and dependency parsing. Results show that AdaNorm achieves better results on seven datasets.
2 Preliminaries
In this section, we first review the algorithm of LayerNorm and then introduce the datasets and models used in the following analysis sections.
2.1 LayerNorm Algorithm
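Let $x = (x_1, x_2, \ldots, x_H)$ be the vector representation of an input of size $H$ to normalization layers. LayerNorm recenters and rescales input $x$ as
$$h = g \odot N(x) + b, \quad N(x) = \frac{x - \mu}{\sigma}, \quad \mu = \frac{1}{H}\sum_{i=1}^{H} x_i, \quad \sigma = \sqrt{\frac{1}{H}\sum_{i=1}^{H} (x_i - \mu)^2} \tag{1}$$
where $h$ is the output of a LayerNorm layer and $\odot$ is a dot product (element-wise multiplication) operation. $\mu$ and $\sigma$ are the mean and standard deviation of the input. Bias $b$ and gain $g$ are parameters with the same dimension $H$.
2.2 Experimental Setup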
To investigate how LayerNorm works, we conduct a series of experiments in this paper. Since LayerNorm is a default setting in Transformer (Vaswani et al., 2017) and Transformer-XL (Dai et al., 2019), which have shown state-of-the-art results on a variety of tasks (e.g., machine translation), we primarily consider normalization on Transformer and Transformer-XL networks. Also, to avoid the impact of model architecture, we evaluate the effects of normalization on feed-forward neural networks and convolutional neural networks. The datasets and models are listed below. More details can be found in the Appendix.
Machine translation includes three widely used datasets, WMT English-German (En-De), IWSLT 14 German-English (De-En) (Cettolo et al., 2014) and IWSLT 15 English-Vietnamese (En-Vi) (Cettolo et al., 2015). For all datasets, we use the setting of PreNorm where normalization is applied before each layer. We re-implement Transformer with the released code of Fairseq (Ott et al., 2019; https://github.com/pytorch/fairseq). The evaluation metric is BLEU (Papineni et al., 2002).
For the En-De dataset, we use the same dataset splits and the same compound splitting following previous work (Vaswani et al., 2017). BPE is used to get vocabularies. We use the shared embedding setting and the vocabulary size is 32,765. We use “transformer_wmt_en_de_big_t2t” as our basic model. The dropout rate is 0.3. The learning rate is 0.001. The training batch size is 4,096 tokens. We use optimizer Adam with and . The number of warmup steps is 4K.
The De-En dataset is provided by the IWSLT 2014 Evaluation Campaign. We use the same dataset splits following previous work (Ott et al., 2019; Ranzato et al., 2016; Wiseman and Rush, 2016). It contains 153K sentences for training, 7K sentences for validation, and 7K sentences for testing. BPE is used to get vocabularies. We use the shared embedding setting and the vocabulary size is 10,149. We use “transformer_iwslt_de_en” as our basic model. The dropout rate is 0.3. The attention dropout rate is 0.1. The activation dropout is 0.1. The initialization learning rate is 1e-07 and the learning rate is 0.0015. The training batch size is 4,096 tokens. We update gradients every 2 steps. The number of warmup steps is 8K.
The En-Vi dataset contains 133K training sentence pairs provided by the IWSLT 2015 Evaluation Campaign. We use TED tst2012 (1,553 sentences) as the validation set and TED tst2013 (1,268 sentences) as the test set. BPE is used to get input and output vocabularies. The English and Vietnamese vocabulary sizes are 7,669 and 6,669 respectively. The dropout rate is 0.1. The learning rate is 0.001. The training batch size is 4,096 tokens. The number of warmup steps is 8K. We use “transformer_wmt_en_de” as our basic model. We use optimizer Adam with and .
Language modeling includes a large dataset, Enwiki8 (http://mattmahoney.net/dc/text.html), that contains 100M bytes of unprocessed Wikipedia text. We implement a 12-layer Transformer-XL model. The dimension of each layer is 512. Multi-head attention contains 8 heads and the dimension of each head is 64. The dropout rate is 0.1. The batch size is 22. We use optimizer Adam with a learning rate of 0.00025. We use the average number of Bits-Per-Character (BPC) as the evaluation metric (Al-Rfou et al., 2018; Dai et al., 2019).
Text classification includes two sentence classification datasets: RT (Pang and Lee, 2005) and SST-5 (Socher et al., 2013). RT is a binary sentiment classification dataset from online movie reviews. We randomly divide all examples into 8,608 for training, 964 for validation, and 1,089 for testing. SST-5 is a single-sentence classification dataset built on movie reviews. We run experiments on a five-label set. We build a Transformer model with a 4-layer encoder. The batch size is 4,096 tokens. The word embedding dimension is 128 and the hidden dimension is 128. The dropout rate is 0.2. We use optimizer Adam with $\beta_1 = 0.9$, $\beta_2 = 0.998$. Normalization is applied before each layer. Accuracy is the evaluation metric.
Image classification includes a widely used dataset, MNIST (LeCun et al., 1998). It consists of 55,000 training images, 5,000 validation images, and an additional 10,000 testing images. We implement a 3-layer convolutional neural network for classification. The first 2D-convolution layer has 1 in-channel and 20 out-channels. The second 2D-convolution layer has 20 in-channels and 50 out-channels. We flatten the output of the second 2D-convolution layer and send it to a linear layer. The batch size is . We use optimizer Adam with a learning rate of . We apply LayerNorm before the activation in every linear layer. We train the model for epochs. Normalization is applied before each layer. Accuracy is the evaluation metric.
Dependency parsing includes a dataset, English Penn TreeBank (PTB) (Marcus et al., 1993). We follow the standard split of the corpus with sections 2-21 as the training set (39,832 sentences, 1,900,056 transition examples), section 22 as the validation set (1,700 sentences, 80,234 transition examples), and section 23 as the testing set (2,416 sentences, 113,368 transition examples). We implement an MLP-based parser following Chen and Manning (2014). The dimension of the hidden state is , the batch size is , the dropout rate is . We use optimizer Adam and initialize the learning rate to . We apply normalization before activation in every linear layer. Following Chen and Manning (2014), we use Unlabeled Attachment Score (UAS) as the evaluation metric.
3 Understanding LayerNorm
To investigate how LayerNorm facilitates training, we conduct ablation studies to observe each part’s contribution to the performance. In this section, we analyse the effects of the bias and gain, forward normalization, and backward normalization.
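Table 1: Results of LayerNorm and LayerNorm-simple (LayerNorm without the bias and gain). “(+)” means higher is better and “(-)” means lower is better.
| Models | En-De(+) | De-En(+) | En-Vi(+) | Enwiki8(-) | RT(+) | SST-5(+) | MNIST(+) | PTB(+) |
|---|---|---|---|---|---|---|---|---|
| Task | Machine Translation | Machine Translation | Machine Translation | Language Modeling | Classification | Classification | Classification | Parsing |
| Model Layers | 12 | 12 | 12 | 12 | 4 | 4 | 3 | 3 |
| w/o Norm | Diverge | 34.0 | 28.4 | 1.04 | 76.85 | 38.55 | 99.14 | 88.31 |
| LayerNorm | 28.3 | 35.5 | 31.2 | 1.07 | 77.21 | 39.23 | 99.13 | 89.12 |
| LayerNorm-simple | 28.4 | 35.5 | 31.6 | 1.07 | 76.66 | 40.54 | 99.09 | 89.19 |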
3.1 The Effect of the Bias and Gain in LayerNorm
The bias and gain do not work in most cases. Table 1 shows that LayerNorm is an effective approach. It brings large performance improvements on six out of eight datasets compared with the naive baseline without LayerNorm (“w/o Norm”). By comparing LayerNorm and LayerNorm-simple, we find that dropping the bias and gain (“LayerNorm-simple”) does not decrease the performance on six datasets. Surprisingly, LayerNorm-simple outperforms LayerNorm on four datasets, with a 0.4 BLEU improvement on En-Vi and a 1.31 ACC improvement on SST-5. It is also worth noting that the 31.6 BLEU achieved by LayerNorm-simple is the state-of-the-art result on En-Vi machine translation.
Furthermore, we find that the bias and gain increase the risk of overfitting. Initially, considering that input information may be lost when normalizing input distributions, the bias and gain were designed as an affine transformation on normalized vectors to enhance the expressive power. However, since the bias and gain are learned from the training set and ignore the input distributions of the testing data, the risk of overfitting may increase in LayerNorm. This is supported by the convergence curves in Figure 1. LayerNorm achieves lower training loss (or BPC) but higher validation loss (or BPC) than LayerNorm-simple on En-Vi and Enwiki8. These results indicate that the current affine transformation mechanism has a potential risk of overfitting and needs to be further improved.
3.2 The Effect of Forward Normalization
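For easier analysis, we only consider LayerNorm without the bias and gain here. Let $y = (y_1, y_2, \ldots, y_H)$ be the normalized vector. The calculation process of LayerNorm without the bias and gain can be written as
$$y = \frac{x - \mu}{\sigma}, \quad \mu = \frac{1}{H}\sum_{i=1}^{H} x_i, \quad \sigma = \sqrt{\frac{1}{H}\sum_{i=1}^{H} (x_i - \mu)^2} \tag{2}$$
where $x = (x_1, x_2, \ldots, x_H)$ is the input vector and $H$ is the dimension of $x$. $\mu$ and $\sigma$ are the mean and standard deviation of $x_1, x_2, \ldots, x_H$. Then, suppose $\bar{y}$ and $D_y$ are the mean and variance of $y_1, y_2, \ldots, y_H$. It is easy to verify
$$\bar{y} = 0, \quad D_y = 1. \tag{3}$$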
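The claim that the normalized vector has zero mean and unit variance can be checked numerically. The following is a minimal sketch in plain Python; the function name `layer_norm_simple` and the sample input are illustrative assumptions, not taken from the authors' code.

```python
import math

def layer_norm_simple(x):
    """Normalize a list of floats: y = (x - mu) / sigma (no bias, no gain)."""
    H = len(x)
    mu = sum(x) / H
    # Population (1/H) standard deviation of the input.
    sigma = math.sqrt(sum((xi - mu) ** 2 for xi in x) / H)
    return [(xi - mu) / sigma for xi in x]

x = [0.3, -1.2, 2.5, 4.0, 0.7]  # arbitrary illustrative input
y = layer_norm_simple(x)

# Mean and variance of the normalized vector.
y_mean = sum(y) / len(y)
y_var = sum((yi - y_mean) ** 2 for yi in y) / len(y)
print(f"mean = {y_mean:.2e}, variance = {y_var:.6f}")
```

Up to floating-point rounding, the mean is zero and the variance is one for any input whose entries are not all equal.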
Eq. (3) shows that normalization recenters and rescales input vector $x$. A widely accepted belief is that the effectiveness of LayerNorm comes from steady layer distributions brought by forward normalization (Lei Ba et al., 2016). To evaluate whether forward normalization explains the effectiveness of LayerNorm, we need to separate the effect on forward layer inputs from that on backward gradients. In this paper, we design a new method, called DetachNorm. The difference between LayerNorm and DetachNorm is that DetachNorm detaches the derivatives of the mean and variance (in our implementation, we detach the derivative of the standard deviation, the square root of the variance). Detaching derivatives means treating the mean and variance as changeable constants, rather than variables, so that they do not require gradients in backward propagation. The calculation of DetachNorm can be written as
$$y = \frac{x - \hat{\mu}}{\hat{\sigma}}, \quad \hat{\mu} = \theta(\mu), \quad \hat{\sigma} = \theta(\sigma) \tag{4}$$
where $\mu$ and $\sigma$ are the mean and standard deviation of input $x$, as calculated in Eq. (2). The function $\theta(\cdot)$ can be seen as a special copy function, which copies the values of $\mu$ and $\sigma$ into constants $\hat{\mu}$ and $\hat{\sigma}$. In all, DetachNorm keeps the same forward normalization as LayerNorm does, but cuts off the derivatives of the mean and variance.
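The effect of detaching can be emulated without an autograd framework by freezing the statistics at the current input and differentiating numerically. The sketch below is an illustration under that assumption: `make_detach_norm`, `numerical_grad`, the sample vectors, and the linear loss $\ell = \sum_i g_i y_i$ (chosen so that $\partial \ell / \partial y = g$) are all hypothetical and not the authors' implementation.

```python
import math

def stats(x):
    """Mean and population standard deviation of a list of floats."""
    H = len(x)
    mu = sum(x) / H
    sigma = math.sqrt(sum((xi - mu) ** 2 for xi in x) / H)
    return mu, sigma

def layer_norm_simple(x):
    mu, sigma = stats(x)  # statistics are recomputed, so they vary with x
    return [(xi - mu) / sigma for xi in x]

def make_detach_norm(x0):
    """Freeze the statistics at x0, mimicking a detach (copy) operation."""
    mu_hat, sigma_hat = stats(x0)  # constants from here on
    return lambda x: [(xi - mu_hat) / sigma_hat for xi in x]

def numerical_grad(f, x, g, eps=1e-6):
    """Central finite difference of loss = sum_i g_i * f(x)_i w.r.t. x."""
    grad = []
    for j in range(len(x)):
        xp, xm = list(x), list(x)
        xp[j] += eps
        xm[j] -= eps
        loss_p = sum(gi * yi for gi, yi in zip(g, f(xp)))
        loss_m = sum(gi * yi for gi, yi in zip(g, f(xm)))
        grad.append((loss_p - loss_m) / (2 * eps))
    return grad

x = [1.0, 2.0, 4.0, 7.0]   # illustrative input
g = [0.5, -1.0, 2.0, 0.3]  # illustrative upstream gradient dl/dy
detach_norm = make_detach_norm(x)
_, sigma = stats(x)

# The forward outputs coincide at x.
same_forward = all(
    abs(a - b) < 1e-12 for a, b in zip(layer_norm_simple(x), detach_norm(x))
)
grad_detach = numerical_grad(detach_norm, x, g)      # approximately g / sigma
grad_full = numerical_grad(layer_norm_simple, x, g)  # has approximately zero mean
print(same_forward, grad_detach, grad_full)
```

In this sketch the two functions agree in the forward pass, while the detached version passes the upstream gradient through scaled only by $1/\sigma$, and the full version additionally recenters it.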
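Table 2: Effect of forward normalization (DetachNorm compared with “w/o Norm”, top) and effect of the derivatives of the mean and variance (LayerNorm-simple compared with DetachNorm, bottom). “(+)” means higher is better and “(-)” means lower is better.
| Models | En-De(+) | De-En(+) | En-Vi(+) | Enwiki8(-) | RT(+) | SST-5(+) | MNIST(+) | PTB(+) |
|---|---|---|---|---|---|---|---|---|
| Task | Machine Translation | Machine Translation | Machine Translation | Language Modeling | Classification | Classification | Classification | Parsing |
| Model Layers | 12 | 12 | 12 | 12 | 4 | 4 | 3 | 3 |
| w/o Norm | Diverge | 34.0 | 28.4 | 1.04 | 76.85 | 38.55 | 99.14 | 88.31 |
| DetachNorm | Diverge | 33.9 | 27.7 | 1.12 | 76.40 | 40.04 | 99.10 | 89.79 |
| Improvement | – | -0.1 | -0.7 | -0.08 | -0.45 | 1.49 | -0.04 | 1.48 |

| Models | En-De(+) | De-En(+) | En-Vi(+) | Enwiki8(-) | RT(+) | SST-5(+) | MNIST(+) | PTB(+) |
|---|---|---|---|---|---|---|---|---|
| Task | Machine Translation | Machine Translation | Machine Translation | Language Modeling | Classification | Classification | Classification | Parsing |
| Model Layers | 12 | 12 | 12 | 12 | 4 | 4 | 3 | 3 |
| DetachNorm | Diverge | 33.9 | 27.7 | 1.12 | 76.40 | 40.04 | 99.10 | 89.79 |
| LayerNorm-simple | 28.4 | 35.5 | 31.6 | 1.07 | 76.66 | 40.54 | 99.09 | 89.19 |
| Improvement | – | 1.6 | 3.9 | 0.05 | 0.26 | 0.50 | -0.01 | -0.60 |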
Since DetachNorm keeps the same recentering and rescaling in forward propagation as LayerNorm-simple does, the gap between DetachNorm and “w/o Norm” shows the effect of forward normalization. As we can see, DetachNorm performs worse than “w/o Norm” on five of the seven datasets where training converges, showing that forward normalization has little to do with the success of LayerNorm.
Furthermore, the only difference between DetachNorm and LayerNorm-simple is that DetachNorm detaches the derivatives of the mean and variance. As shown in Table 2, DetachNorm performs worse than LayerNorm-simple on six datasets. This is mainly because DetachNorm converges to much worse local optima than LayerNorm-simple, as shown in Figure 2. The gap between DetachNorm and LayerNorm-simple shows the effectiveness of the derivatives of the mean and variance. By comparing the achieved improvements, we find that the derivatives of the mean and variance bring larger improvements than forward normalization does.
These results demonstrate that the derivatives of the mean and variance play a significant role. In addition, the much worse results of DetachNorm on En-De, De-En and En-Vi indicate that the derivatives of the mean and variance may be more important for deeper models. In the following section, we give a detailed analysis of why and how the derivatives of the mean and variance contribute to the performance.
3.3 The Effect of the Derivatives of the Mean and Variance
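To understand how the derivatives of the mean and variance work, we analyze the gradients of LayerNorm-simple and DetachNorm. According to the chain rule, the gradient of $x$ is (when calculating the gradient, we adopt the denominator layout)
$$\frac{\partial \ell}{\partial x} = \frac{dy}{dx} \frac{\partial \ell}{\partial y} \tag{5}$$
where $\ell$ is the loss function, $x$ is the input vector and $y$ is the normalized vector. We here analyze the effect of detaching the derivatives of the mean and variance on backward gradients. Our results are summarized in the following theorem, whose proof is listed in the Appendix.
Theorem 1.
Given $\frac{\partial \ell}{\partial y} = (g_1, g_2, \ldots, g_H)^T$, let $\bar{g}$ and $D_g$ be the mean and variance of $g_1, g_2, \ldots, g_H$. For the case of detaching the derivatives of $\mu$ and $\sigma$, suppose $\frac{\partial \ell}{\partial x} = (a_1, a_2, \ldots, a_H)^T$ is the gradient of $x$ with mean $\bar{a}$ and variance $D_a$. We have $\bar{a} = \bar{g}/\sigma$ and $D_a = D_g/\sigma^2$.
(1) For the case of standard LayerNorm-simple, suppose $\frac{\partial \ell}{\partial x} = (b_1, b_2, \ldots, b_H)^T$ is the gradient of $x$ with mean $\bar{b}$ and variance $D_b$.
We have $\bar{b} = 0$ and $D_b \le D_g/\sigma^2$.
(2) For the case of detaching the derivative of $\mu$, suppose $\frac{\partial \ell}{\partial x} = (c_1, c_2, \ldots, c_H)^T$ is the gradient of $x$ with mean $\bar{c}$ and variance $D_c$.
We have $\bar{c} = \bar{g}/\sigma$ and $D_c \le D_g/\sigma^2$.
(3) For the case of detaching the derivative of $\sigma$, suppose $\frac{\partial \ell}{\partial x} = (d_1, d_2, \ldots, d_H)^T$ is the gradient of $x$ with mean $\bar{d}$ and variance $D_d$.
We have $\bar{d} = 0$ and $D_d = D_g/\sigma^2$.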
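The four cases of the theorem can be traced through explicit gradient expressions. The following is a sketch under the denominator layout and is not the authors' proof; $I$ (the identity matrix), $\mathbf{1}$ (the all-ones vector) and the shorthand $g = \partial \ell / \partial y$ are notation introduced here for illustration.

```latex
% Building blocks, with y = (x - \mu)/\sigma, so that \mathbf{1}^T y = 0 and y^T y = H:
\frac{\partial \mu}{\partial x_j} = \frac{1}{H}, \qquad
\frac{\partial \sigma}{\partial x_j} = \frac{x_j - \mu}{H\sigma} = \frac{y_j}{H}.

% Gradient with respect to x in each case:
\begin{align*}
\text{detach } \mu \text{ and } \sigma: \quad
  & \frac{\partial \ell}{\partial x} = \frac{1}{\sigma}\, g \\
\text{LayerNorm-simple}: \quad
  & \frac{\partial \ell}{\partial x}
    = \frac{1}{\sigma}\Big(I - \frac{\mathbf{1}\mathbf{1}^T}{H} - \frac{y y^T}{H}\Big) g \\
\text{detach } \mu \text{ only}: \quad
  & \frac{\partial \ell}{\partial x}
    = \frac{1}{\sigma}\Big(I - \frac{y y^T}{H}\Big) g \\
\text{detach } \sigma \text{ only}: \quad
  & \frac{\partial \ell}{\partial x}
    = \frac{1}{\sigma}\Big(I - \frac{\mathbf{1}\mathbf{1}^T}{H}\Big) g
\end{align*}

% Variance when the derivative of \sigma is kept (LayerNorm-simple, or detach \mu only):
D = \frac{D_g}{\sigma^2} - \frac{(y^T g)^2}{H^2 \sigma^2} \;\le\; \frac{D_g}{\sigma^2}.
```

Multiplying by $I - \mathbf{1}\mathbf{1}^T/H$ subtracts the mean of $g$, which is why the gradient mean is zero whenever the derivative of $\mu$ is kept. The term $y y^T / H$ removes the component of $g$ along $y$, which is why the variance cannot increase whenever the derivative of $\sigma$ is kept.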
By comparing the case of detaching the derivative of $\mu$ with that of LayerNorm-simple in Theorem 1, we find that the derivative of $\mu$ recenters $\frac{\partial \ell}{\partial x}$ to zero. By comparing the case of detaching the derivative of $\sigma$ with that of LayerNorm-simple, we find that the derivative of $\sigma$ reduces the variance of $\frac{\partial \ell}{\partial x}$, which can be seen as a kind of rescaling. We refer to gradient recentering and rescaling as gradient normalization.
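The mean and variance statements for the four detaching cases can also be checked numerically. The sketch below uses closed-form gradients derived by hand for $y = (x - \mu)/\sigma$ (subtracting the mean of $g$ when the derivative of $\mu$ is kept, and subtracting the projection of $g$ onto $y$ when the derivative of $\sigma$ is kept). The function `input_gradient`, its flags, and the random test vectors are illustrative assumptions, not the authors' code.

```python
import math
import random

def mean(v):
    return sum(v) / len(v)

def variance(v):
    m = mean(v)
    return sum((t - m) ** 2 for t in v) / len(v)

def input_gradient(x, g, detach_mean=False, detach_std=False):
    """Gradient of the loss w.r.t. x, given the upstream gradient g = dl/dy."""
    H = len(x)
    mu = mean(x)
    sigma = math.sqrt(variance(x))
    y = [(xi - mu) / sigma for xi in x]
    grad = list(g)
    if not detach_mean:
        # The derivative of the mean subtracts the average of g.
        g_bar = mean(g)
        grad = [gr - g_bar for gr in grad]
    if not detach_std:
        # The derivative of the std removes the component of g along y.
        proj = sum(yi * gi for yi, gi in zip(y, g)) / H
        grad = [gr - yi * proj for gr, yi in zip(grad, y)]
    return [gr / sigma for gr in grad], sigma

random.seed(0)
H = 16
x = [random.gauss(0.0, 2.0) for _ in range(H)]
g = [random.gauss(0.5, 1.0) for _ in range(H)]

a, sigma = input_gradient(x, g, detach_mean=True, detach_std=True)  # both detached
b, _ = input_gradient(x, g)                                         # LayerNorm-simple
c, _ = input_gradient(x, g, detach_mean=True)                       # detach mean only
d, _ = input_gradient(x, g, detach_std=True)                        # detach std only

scaled_mean = mean(g) / sigma
scaled_var = variance(g) / sigma ** 2
for name, grad in [("a", a), ("b", b), ("c", c), ("d", d)]:
    print(name, round(mean(grad), 6), round(variance(grad), 6))
```

With these definitions, the cases that keep the derivative of the mean have zero gradient mean, and the cases that keep the derivative of the standard deviation have gradient variance no larger than the scaled upstream variance.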
To further evaluate the effect of gradient normalization on model performance, we test the derivatives of the mean and variance separately. Table 3 shows that detaching the derivative of variance decreases the performance significantly on deeper networks. Therefore, it is necessary to control the variance of gradients for deeper networks.
In conclusion, LayerNorm normalizes forward layer inputs and backward gradients. The derivatives of the mean and variance play more important roles than forward normalization in LayerNorm. Furthermore, unlike previous work (Santurkar et al., 2018) only noticing that normalization smooths gradients, this paper provides deeper insight about how normalization impacts backward gradients.
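Table 3: Results of detaching the derivative of the mean and of the variance separately. “(+)” means higher is better and “(-)” means lower is better.
| Models | En-De(+) | De-En(+) | En-Vi(+) | Enwiki8(-) | RT(+) | SST-5(+) | MNIST(+) | PTB(+) |
|---|---|---|---|---|---|---|---|---|
| Task | Machine Translation | Machine Translation | Machine Translation | Language Model | Classification | Classification | Classification | Parsing |
| Model Layers | 12 | 12 | 12 | 12 | 4 | 4 | 3 | 3 |
| LayerNorm-simple | 28.4 | 35.5 | 31.6 | 1.07 | 76.66 | 40.54 | 99.09 | 89.19 |
| Detach Mean | 28.3 | 35.6 | 31.3 | 1.07 | 75.02 | 40.99 | 99.25 | 89.45 |
| Detach Variance | Diverge | 34.2 | 29.8 | 1.10 | 77.04 | 41.74 | 99.10 | 89.80 |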
4 AdaNorm
AdaNorm adopts a new transformation function which can adaptively control scaling weights towards different inputs. Our code is released at https://github.com/lancopku/AdaNorm.
4.1 AdaNorm Algorithm
Formally, let $y = N(x) = (x - \mu)/\sigma$ be the normalized vector, where $\mu$ and $\sigma$ are the mean and standard deviation of the input $x = (x_1, x_2, \ldots, x_H)$. We use $\phi(y)$, a function with respect to input $x$, to replace the bias and gain with the following equation:
$$z = \phi(y) \odot y = \phi(N(x)) \odot N(x) \tag{6}$$
where $z = (z_1, z_2, \ldots, z_H)$ is the output of AdaNorm and $\odot$ is a dot product operation. Unlike the bias and gain, which are fixed in LayerNorm, $\phi(y)$ can adaptively adjust scaling weights based on inputs. To keep training stable, we expect $\phi(\cdot)$ to have some features. First, $\phi(\cdot)$ must be differentiable. Second, we expect that the average scaling weight is fixed, namely the average of $\phi(y)$ is a constant $C$ where $C > 0$. Third, we expect that the average of $z$ is bounded, which can avoid the problem of exploding loss. Namely, we require that there exists a constant $M$ such that $\left|\frac{1}{H}\sum_{i=1}^{H} z_i\right| < M$. Theorem 2 proves that there exists a unique solution which can satisfy these requirements. The proof is listed in the Appendix.
Theorem 2.
Suppose $\phi(y_i)$ is derivable, $\forall y$, $\frac{1}{H}\sum_{i=1}^{H} \phi(y_i) = C > 0$, and $\exists M$, s.t. $\left|\frac{1}{H}\sum_{i=1}^{H} z_i\right| < M$ $(M > 0)$, where $H$ is the hidden size. There exists only one solution:
$$\phi(y_i) = C(1 - k y_i)$$
which can satisfy these requirements.
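As a quick sanity check, a linear scaling function can be substituted into the two averaging requirements. The sketch below assumes the solution has the form $\phi(y_i) = C(1 - k y_i)$ with a constant $k$, and uses $\bar{y} = 0$ and $D_y = 1$ from Eq. (3). It only checks that this form satisfies the requirements; it does not address uniqueness.

```latex
% Average scaling weight is the constant C:
\frac{1}{H}\sum_{i=1}^{H} \phi(y_i)
  = \frac{1}{H}\sum_{i=1}^{H} C(1 - k y_i)
  = C(1 - k\bar{y}) = C.

% Average output is bounded, using (1/H)\sum_i y_i^2 = D_y + \bar{y}^2 = 1:
\frac{1}{H}\sum_{i=1}^{H} z_i
  = \frac{1}{H}\sum_{i=1}^{H} C(1 - k y_i)\, y_i
  = C\bar{y} - Ck \cdot \frac{1}{H}\sum_{i=1}^{H} y_i^2
  = -Ck.
```

Under these assumptions the average output has magnitude $C|k|$ regardless of the input, so any $M > C|k|$ serves as the bound.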
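Since $1 - k y_i < 0$ will undesirably change the direction of the vector, we expect that $\phi(y_i) > 0$ holds, which means $y_i < 1/k$ must hold. Due to the symmetry of $y_i$, $|y_i| < 1/k$ is required to hold too. Based on Chebyshev’s Inequality, we have
$$P\left(|y_i| < \frac{1}{k}\right) = P\left(|y_i - E(y_i)| < \frac{1}{k}\right) \ge 1 - \frac{D_y}{(1/k)^2} = 1 - k^2 D_y \tag{7}$$
where $D_y$ is the variance of $y = (y_1, y_2, \ldots, y_H)$ and $H$ is the dimension of $y$. Based on Eq. (3), we can verify $D_y = 1$. If we expect that $|y_i| < 1/k$ holds with a probability higher than $99\%$, $k = 1/10$ should be chosen based on Eq. (7). Namely, we choose
$$\phi(y_i) = C\left(1 - \frac{y_i}{10}\right). \tag{8}$$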
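The choice of $k$ follows from a one-line calculation with the Chebyshev bound. The step below is a worked sketch that assumes a target probability of $99\%$, a positive $k$, and $D_y = 1$ from Eq. (3).

```latex
1 - k^2 D_y \ge 0.99
\;\Longleftrightarrow\; k^2 \le 0.01 \quad (D_y = 1)
\;\Longleftrightarrow\; k \le \frac{1}{10},
\qquad\text{so } k = \frac{1}{10} \text{ gives } P\big(|y_i| < 10\big) \ge 0.99 .
```

The bound is distribution-free, so it is conservative: for many inputs the fraction of entries with $|y_i| < 10$ will be well above this level.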
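Given an input vector $x$, the complete calculation process of AdaNorm is
$$z = C(1 - ky) \odot y, \quad y = \frac{x - \mu}{\sigma}, \quad \mu = \frac{1}{H}\sum_{i=1}^{H} x_i, \quad \sigma = \sqrt{\frac{1}{H}\sum_{i=1}^{H} (x_i - \mu)^2} \tag{9}$$
where $C$ is a hyperparameter, $\odot$ is a dot product operation, and $k$ is recommended to be set to $1/10$. To prevent the introduced term $C(1 - ky)$ from dismissing the feature of gradient recentering and rescaling, we detach the gradient of $C(1 - ky)$ and only treat it as a changeable constant in implementation.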
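The forward computation of AdaNorm can be sketched in a few lines of plain Python. This is an illustration only, assuming the scaling term has the form $C(1 - ky)$ with $k = 1/10$. The function name `ada_norm` and the sample input are hypothetical, and the authors' released code should be consulted for the actual implementation. Plain Python has no automatic differentiation, so the detaching of the scaling term is noted in a comment rather than implemented.

```python
import math

def ada_norm(x, C=1.0, k=0.1):
    """Forward pass sketch: z = C * (1 - k * y) * y with y = (x - mu) / sigma."""
    H = len(x)
    mu = sum(x) / H
    sigma = math.sqrt(sum((xi - mu) ** 2 for xi in x) / H)
    y = [(xi - mu) / sigma for xi in x]
    # Adaptive scaling weights. In an autograd framework this factor would be
    # detached so that it is treated as a constant in the backward pass.
    phi = [C * (1.0 - k * yi) for yi in y]
    z = [p * yi for p, yi in zip(phi, y)]
    return z, phi

x = [0.3, -1.2, 2.5, 4.0, 0.7]  # arbitrary illustrative input
z, phi = ada_norm(x, C=1.0, k=0.1)

phi_mean = sum(phi) / len(phi)  # average scaling weight, equals C
z_mean = sum(z) / len(z)        # average output, equals -C * k
print(phi_mean, z_mean)
```

In this sketch the average scaling weight equals $C$ and the average output equals $-Ck$, because the normalized vector has zero mean and unit variance.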
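Table 4: Results of LayerNorm and AdaNorm. “(+)” means higher is better and “(-)” means lower is better.
| Models | En-De(+) | De-En(+) | En-Vi(+) | Enwiki8(-) | RT(+) | SST-5(+) | MNIST(+) | PTB(+) |
|---|---|---|---|---|---|---|---|---|
| Task | Machine Translation | Machine Translation | Machine Translation | Language Model | Classification | Classification | Classification | Parsing |
| w/o Norm | Diverge | 34.0 | 28.4 | 1.04 | 76.85 | 38.55 | 99.14 | 88.31 |
| LayerNorm | 28.3 | 35.5 | 31.2 | 1.07 | 77.21 | 39.23 | 99.13 | 89.12 |
| LayerNorm-simple | 28.4 | 35.5 | 31.6 | 1.07 | 76.66 | 40.54 | 99.09 | 89.19 |
| AdaNorm | 28.5 | 35.6 | 31.4 | 1.07 | 77.50 | 40.54 | 99.35 | 89.23 |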
4.2 Comparison between AdaNorm and LayerNorm
The comparison between LayerNorm and AdaNorm is shown in Table 4 (for the AdaNorm implementation, Kaiming initialization and the setting of PreNorm are recommended). AdaNorm outperforms LayerNorm on seven datasets, with 0.2 BLEU on En-De, 0.1 BLEU on De-En, 0.2 BLEU on En-Vi, 0.29 ACC on RT, 1.31 ACC on SST-5, 0.22 ACC on MNIST, and 0.11 UAS on PTB. Unlike LayerNorm-simple, which only performs well on bigger models, AdaNorm achieves more balanced results. Figure 3 shows the loss curves of LayerNorm and AdaNorm on the validation set of En-Vi, PTB, and De-En. Compared to AdaNorm, LayerNorm has lower training loss but higher validation loss. The lower validation loss indicates that AdaNorm has better convergence.
5 Related Work
Deep neural networks have outperformed shallow models in a variety of fields, such as natural language processing (Sutskever et al., 2014; Bahdanau et al., 2015; Devlin et al., 2018) and computer vision (He et al., 2016; Huang et al., 2017). The improvement mainly comes from the stronger expressive power of deep layers. However, with the increase of depth, the network training process becomes complicated and requires advanced architectural techniques. One important technique among such advances is normalization.
Currently, it is widely accepted that normalization layers assist training by smoothing gradients, enabling large learning rates, accelerating convergence, and improving generalization results (Zhang et al., 2019). First introduced by Ioffe and Szegedy (2015), BatchNorm fixes layer distributions to reduce ICS (Internal Covariate Shift), a phenomenon in which the upper layers need to continuously adapt to the new distributions of lower layers. Following this work, several normalization methods have been proposed, like instance normalization (Ulyanov et al., 2016) and group normalization (Wu and He, 2018). In addition, there are several studies exploring better activation functions (Klambauer et al., 2017) or initialization methods (Zhang et al., 2019) to avoid the dependency on normalization layers.
LayerNorm is proposed to extend BatchNorm to RNN. LayerNorm normalizes the mean and variance of all summed inputs to the neurons in one layer. Unlike BatchNorm, which depends on the size of the mini-batch, LayerNorm has fewer limitations. LayerNorm is adaptive to RNN and self-attention-based models. It has been applied to state-of-the-art frameworks such as Transformer (Vaswani et al., 2017), BERT (Devlin et al., 2018), and Transformer-XL (Dai et al., 2019). LayerNorm brings better performance and is irreplaceable in these frameworks.
Despite the good performance, it is still unclear how layer normalization works. Ioffe and Szegedy (2015) claim that the effectiveness of BatchNorm comes from reducing ICS. This has been a popular belief about BatchNorm (Santurkar et al., 2018). However, some recent studies point out that the success of BatchNorm relates to smoother gradients and has little to do with reducing ICS (Santurkar et al., 2018; Bjorck et al., 2018). Although these studies provide a pioneering perspective to understand BatchNorm, there still remain some unanswered questions, such as how BatchNorm helps smooth gradients. Also, there is little work studying whether these theories can explain the success of LayerNorm. In this paper, we take a further step towards a better understanding of LayerNorm.
6 Conclusion
In this paper, we investigate how layer normalization works. Based on a series of experiments and theoretical analysis, we summarize some interesting conclusions. We find that the derivatives of the mean and variance are important to the success of LayerNorm by recentering and rescaling backward gradients. Furthermore, experiments show that the bias and gain increase the risk of overfitting and do not work in most cases. To address this problem, we propose a normalization method AdaNorm. It replaces the bias and gain in LayerNorm with a new adaptive transformation function that can update scaling weights based on input values. Experiments show that AdaNorm outperforms LayerNorm on seven datasets. In the future work, we would like to explore more alternatives to LayerNorm from the perspective of gradient normalization.
Acknowledgments
We thank all reviewers for providing the thoughtful and constructive suggestions. This work was supported in part by National Natural Science Foundation of China (No. 61673028).
References
Al-Rfou et al. (2018). Character-level language modeling with deeper self-attention. CoRR abs/1808.04444. Cited by: §2.2.
Bahdanau et al. (2015). Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. Cited by: §5.
Bjorck et al. (2018). Understanding batch normalization. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pp. 7705–7716. Cited by: §5.
Cettolo et al. (2015). The IWSLT 2015 evaluation campaign. In IWSLT 2015, International Workshop on Spoken Language Translation. Cited by: §2.2.
Cettolo et al. (2014). The IWSLT 2015 evaluation campaign. In IWSLT 2014, International Workshop on Spoken Language Translation. Cited by: §2.2.
Chen and Manning (2014). A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pp. 740–750. Cited by: §2.2, §7.1.4.
Chung et al. (2017). Hierarchical multiscale recurrent neural networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. Cited by: §7.1.2.
Dai et al. (2019). Transformer-XL: attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860. Cited by: §2.2, §2.2, §5.
Devlin et al. (2018). BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §5, §5.
He et al. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §5.
Huang et al. (2017). Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §5.
Ioffe and Szegedy (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pp. 448–456. Cited by: §1, §1, §5, §5.
Klambauer et al. (2017). Self-normalizing neural networks. In Advances in neural information processing systems, pp. 971–980. Cited by: §5.
LeCun et al. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §2.2, §7.1.3.
Lei Ba et al. (2016). Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §1, §1, §3.2.
Marcus et al. (1993). Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics 19 (2), pp. 313–330. Cited by: §2.2, §7.1.4.
Ott et al. (2019). Fairseq: a fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038. Cited by: §2.2, §2.2, §7.1.1, §7.1.1.
Pang and Lee (2005). Seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the Association for Computational Linguistics (ACL), pp. 115–124. Cited by: §2.2, §7.1.3.
Papineni et al. (2002). Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA, pp. 311–318. Cited by: §2.2.
Raffel et al. (2017). Online and linear-time attention by enforcing monotonic alignments. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 2837–2846. Cited by: §7.1.1.
Ranzato et al. (2016). Sequence level training with recurrent neural networks. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings. Cited by: §2.2, §7.1.1.
Santurkar et al. (2018). How does batch normalization help optimization? In Advances in Neural Information Processing Systems, pp. 2483–2493. Cited by: §1, §3.3, §5.
Socher et al. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642. Cited by: §2.2, §7.1.3.
Sutskever et al. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: §5.
Ulyanov et al. (2016). Instance normalization: the missing ingredient for fast stylization. CoRR abs/1607.08022. Cited by: §5.
Vaswani et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pp. 6000–6010. Cited by: §1, §2.2, §2.2, §5, §7.1.1.
Wiseman and Rush (2016). Sequence-to-sequence learning as beam-search optimization. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pp. 1296–1306. Cited by: §2.2, §7.1.1.
Wu and He (2018). Group normalization. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIII, pp. 3–19. Cited by: §5.
Zhang et al. (2017). Understanding deep learning requires rethinking generalization. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. Cited by: §1.
Zhang et al. (2019). Fixup initialization: residual learning without normalization. CoRR abs/1901.09321. Cited by: §5.
7 Appendix
7.1 Experimental Settings
7.1.1 Neural Machine Translation
For neural machine translation tasks, we re-implement Transformer with the released code of Fairseq [Ott et al., 2019] (https://github.com/pytorch/fairseq).
IWSLT 2015 English-Vietnamese Translation. It contains 133K training sentence pairs provided by the IWSLT 2015 Evaluation Campaign. Following the preprocessing steps in the work of Raffel et al. [2017], we use TED tst2012 (1,553 sentences) as the validation set and TED tst2013 (1,268 sentences) as the test set. BPE is used to get input and output vocabularies. The English and Vietnamese vocabulary sizes are 7,669 and 6,669 respectively. The dropout rate is 0.1. The learning rate is 0.001. The training batch size is 4,096 tokens. The number of warmup steps is 8K. We use “transformer_wmt_en_de” as our basic model. The setting of PreNorm is adopted. We use optimizer Adam with and . For AdaNorm, the hyperparameter $C$ is set to 1. We average the last 10 checkpoints for evaluation and set the beam size to 5.
IWSLT 2014 German-English Translation. It is provided by the IWSLT 2014 Evaluation Campaign. We use the same dataset splits as previous work [Ott et al., 2019, Ranzato et al., 2016, Wiseman and Rush, 2016]. It contains 153K sentences for training, 7K sentences for validation, and 7K sentences for testing. BPE is used to get vocabularies. We use the shared embedding setting and the vocabulary size is 10,149. We use “transformer_iwslt_de_en” as our basic model. The setting of PreNorm is adopted. The dropout rate is 0.3. The attention dropout rate is 0.1. The activation dropout is 0.1. The initial learning rate is 1e-07 and the learning rate is 0.0015. The training batch size is 4,096 tokens. We use optimizer Adam with and . We accumulate gradients and update the parameters every 2 steps. The number of warmup steps is 8K. For AdaNorm, the hyperparameter is set to 2. We average the last 10 checkpoints for evaluation and set the beam size to 5.
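The relationship between the initial learning rate (1e-07), the learning rate (0.0015), and the 8K warmup steps can be illustrated with a short sketch. It assumes an inverse-square-root schedule with linear warmup, as in Fairseq's `inverse_sqrt` scheduler; the text above does not name the scheduler, so this choice is an assumption, and the function name `inverse_sqrt_lr` is hypothetical.

```python
def inverse_sqrt_lr(step, peak_lr=0.0015, init_lr=1e-07, warmup_steps=8000):
    """Learning rate at a given update step.

    Assumed schedule (not stated in the text): linear warmup from init_lr
    to peak_lr over warmup_steps updates, then decay proportional to the
    inverse square root of the update number.
    """
    if step < warmup_steps:
        # Linear interpolation between the initial and the peak learning rate.
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    # After warmup, the rate decays as 1 / sqrt(step).
    return peak_lr * (warmup_steps / step) ** 0.5


if __name__ == "__main__":
    for step in (0, 4000, 8000, 32000):
        print(step, inverse_sqrt_lr(step))
```

Under this assumed schedule the rate peaks at step 8,000 and halves every time the update count quadruples.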
WMT English-German Translation. Following previous work [Vaswani et al., 2017], we use the same dataset splits and the same compound splitting. The preprocessing code is provided by Fairseq. BPE is used to get vocabularies. We use the shared embedding setting and the vocabulary size is 32,765. We use “transformer_wmt_en_de_big_t2t” as our basic model. The setting of PreNorm is adopted. The dropout rate is 0.3. The learning rate is 0.001. The training batch size is 4,096 tokens. We use optimizer Adam with and . The number of warmup steps is 4K. For AdaNorm, the hyperparameter is set to 2. We average the last 10 checkpoints for evaluation and set the beam size to 4.
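All three translation settings average the last 10 checkpoints before evaluation. The following is a minimal sketch of that step, assuming each checkpoint is a plain dictionary mapping a parameter name to a flat list of floats; real checkpoints hold tensors, and the helper names `average_checkpoints` and `average_last` are hypothetical.

```python
def average_checkpoints(checkpoints):
    """Element-wise average of parameters across a list of checkpoints.

    Each checkpoint is assumed to be a dict: name -> list of floats,
    with identical names and shapes in every checkpoint.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter from every checkpoint.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(col) / len(checkpoints) for col in columns]
    return averaged


def average_last(checkpoints, n=10):
    """Average only the n most recent checkpoints (oldest first in the list)."""
    return average_checkpoints(checkpoints[-n:])


# Toy history of 12 checkpoints; only checkpoints 3..12 enter the average.
history = [{"w": [float(i), 2.0 * i]} for i in range(1, 13)]
avg = average_last(history, n=10)
print(avg)
```

Averaging in parameter space is cheap because it needs no extra training and produces a single model for decoding.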
7.1.2 Language Modeling
Enwiki8 (http://www.mattmahoney.net/dc/text.html). This is a character-level language modeling dataset with 100M bytes. We use the same preprocessed dataset as in the work [Chung et al., 2017]. We use the code provided by Transformer-XL (https://github.com/kimiyoung/transformer-xl) with its default hyperparameters. The model contains 12 decoder layers and the dimension of each layer is 512. Multi-head attention contains 8 heads and the dimension of each head is 64. The dropout rate is 0.1. The batch size is 22. We use optimizer Adam with a learning rate of 0.00025. For AdaNorm, the hyperparameter is set to 1. We choose the best checkpoint on the validation set to evaluate the result on the test set.
7.1.3 Classification
RT. The rating inference dataset [Pang and Lee, 2005] is a binary sentiment classification dataset built from online movie reviews. Because there is no standard split, we randomly divide all examples into 8,608 for training, 964 for validation, and 1,089 for testing. We implement a 4-layer Transformer encoder. The setting of PreNorm is adopted. The batch size is 4,096 tokens. The word embedding dimension is 128 and the hidden dimension is 128. The dropout rate is 0.2. The optimization method is the Adam optimizer with $\beta_1$ = 0.9 and $\beta_2$ = 0.998. For AdaNorm, the hyperparameter is set to 0.3.
SST. The Stanford sentiment treebank [Socher et al., 2013] is a single-sentence classification dataset built on movie reviews. We run experiments on the five-label set. It provides a standard split, with 8,544 examples for training, 1,101 for validation, and 2,210 for testing. We use the same model structure as in RT. For AdaNorm, the hyperparameter is set to 0.3. The remaining parameters are set exactly as in the RT settings.
MNIST Image Recognition. The MNIST handwritten digit dataset [LeCun et al., 1998] consists of 55,000 training images, 5,000 validation images, and an additional 10,000 testing images. The task is to recognize the digit (0-9) in each image. We implement a CNN-based classifier. The first 2D-convolution layer has 1 in-channel and 20 out-channels. The second 2D-convolution layer has 20 in-channels and 50 out-channels. We flatten the output of the second 2D-convolution layer and send it to a linear layer. The batch size is . We use Adam optimizer with a learning rate of . We apply LayerNorm before activation in every linear layer. When applying AdaNorm, we set the hyperparameter to 2. We train the model for epochs. We choose the best checkpoint on the validation set for evaluation.
7.1.4 Dependency Parsing
Transition-based Dependency Parsing. Following previous work, we use the English Penn TreeBank (PTB) [Marcus et al., 1993] for experiments. We follow the standard split of the corpus with sections 2-21 as the training set (39,832 sentences, 1,900,056 transition examples), section 22 as the validation set (1,700 sentences, 80,234 transition examples), and section 23 as the testing set (2,416 sentences, 113,368 transition examples). We implement an MLP-based parser following the work [Chen and Manning, 2014]. The dimension of the hidden state is , the batch size is , the dropout rate is . We use optimizer Adam and initialize the learning rate to . We apply LayerNorm before activation in every linear layer. When applying AdaNorm, we set the hyperparameter to 1. We train epochs on the training set. We evaluate the model on the development set every epoch and find the best checkpoint to evaluate the test results.
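The phrase "LayerNorm before activation in every linear layer", used for both the MNIST classifier and the parser, fixes the order of operations inside a layer. A minimal pure-Python sketch of that order follows. Several details are assumptions rather than settings from the text: ReLU as the activation, a stabilizing constant `eps`, and normalization without bias and gain. The function names are hypothetical.

```python
import math


def linear(x, weight, bias):
    """Affine map: one output per row of `weight`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def layer_norm(z, eps=1e-5):
    """Normalize across the neurons of one layer (no bias or gain here)."""
    mu = sum(z) / len(z)
    var = sum((t - mu) ** 2 for t in z) / len(z)
    return [(t - mu) / math.sqrt(var + eps) for t in z]


def relu(v):
    """Assumed activation; the text does not name the one actually used."""
    return [max(0.0, t) for t in v]


def norm_before_activation(x, weight, bias):
    """Linear map, then LayerNorm, then the activation."""
    return relu(layer_norm(linear(x, weight, bias)))


# Toy layer with 2 inputs and 3 outputs; pre-activations are [1.0, 2.0, 3.0].
h = norm_before_activation(
    [1.0, 2.0],
    weight=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    bias=[0.0, 0.0, 0.0],
)
print(h)
```

Normalizing the pre-activations means the activation always sees inputs with zero mean and unit variance across the layer, regardless of the scale of the weights.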
7.2 Proof of Theorem 1
Proof.
Define . It is easy to verify
(10)  
The forward propagation
(11) 
Calculating the gradient in backward propagation
(12)  
To conclude
(13) 
If we detach the gradients of $\mu$ and $\sigma$, then in backward propagation
(14) 
namely
(15) 
Calculating and
(16)  
To conclude, and .
Proof of (1)
(1) In standard LayerNorm, we do not detach the gradients of $\mu$ and $\sigma$, so in backward propagation
(17) 
Define , we can verify that
(18) 
Therefore,
(19) 
For any vector u orthogonal to and y ( is orthogonal to y), we have
(20)  
We extend and y to an orthonormal basis , then for any vector , we have
(21)  
Therefore,
(22)  
To conclude, and .
Proof of (2)
(2) If we detach the gradients of , in backward propagation
(23) 
Define , then
(24) 
Therefore,
(25) 
Consider
(26)  
Therefore,
(27)  
To conclude, , .
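The cases considered in this proof (detaching both statistics, neither, only the mean, or only the variance) can be sanity-checked numerically. The sketch below is not part of the proof. It assumes the normalization $y = (x - \mu)/\sigma$ without bias and gain, uses a loss that is linear in $y$ so that $\partial \ell / \partial y$ equals a fixed vector `g`, and emulates detaching by holding a statistic constant while taking central differences. All function names are hypothetical. Under these assumptions, keeping the derivative of the mean should recenter the gradient to zero mean, and keeping the derivative of the variance should not increase the gradient variance.

```python
import math
import random


def mean(v):
    return sum(v) / len(v)


def variance(v):
    m = mean(v)
    return sum((t - m) ** 2 for t in v) / len(v)


def normalize(x, mu=None, sigma=None):
    """Return y = (x - mu) / sigma.

    Passing mu and/or sigma holds that statistic constant, which is the
    finite-difference analogue of detaching it from the graph.
    """
    m = mean(x) if mu is None else mu
    s = math.sqrt(mean([(t - m) ** 2 for t in x])) if sigma is None else sigma
    return [(t - m) / s for t in x]


def grad_wrt_x(x, g, mu=None, sigma=None, eps=1e-6):
    """Central-difference gradient of loss(x) = sum_i g_i * y_i(x)."""
    def loss(z):
        return sum(gi * yi for gi, yi in zip(g, normalize(z, mu, sigma)))

    grad = []
    for j in range(len(x)):
        up, down = list(x), list(x)
        up[j] += eps
        down[j] -= eps
        grad.append((loss(up) - loss(down)) / (2 * eps))
    return grad


random.seed(0)
H = 16
x = [random.gauss(0.0, 2.0) for _ in range(H)]
g = [random.gauss(0.5, 1.0) for _ in range(H)]  # plays the role of dl/dy
mu0, sigma0 = mean(x), math.sqrt(variance(x))

grads = {
    "detach both": grad_wrt_x(x, g, mu=mu0, sigma=sigma0),
    "standard": grad_wrt_x(x, g),
    "detach mean only": grad_wrt_x(x, g, mu=mu0),
    "detach variance only": grad_wrt_x(x, g, sigma=sigma0),
}

# Reference values: mean and variance of g, rescaled by sigma.
ref_mean = mean(g) / sigma0
ref_var = variance(g) / sigma0 ** 2
for name, grad in grads.items():
    print(f"{name:22s} mean={mean(grad):+.4f} var={variance(grad):.4f} "
          f"(ref mean={ref_mean:+.4f}, ref var={ref_var:.4f})")
```

Finite differences are used here only to keep the example free of dependencies; an automatic-differentiation framework with an explicit detach operation would give the same comparison.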
Proof of (3)
(3) If we detach the gradient of , in backward propagation
(28) 
Define , we can verify that
(29) 
Therefore,
(30) 
For any vector u orthogonal to
(31)  
Note that , namely is orthogonal to and