Understanding the Failure of Batch Normalization for Transformers in NLP

10/11/2022
by Jiaxi Wang, et al.

Batch Normalization (BN) is a core and prevalent technique for accelerating the training of deep neural networks and improving generalization on Computer Vision (CV) tasks. However, it fails to defend its position in Natural Language Processing (NLP), which is dominated by Layer Normalization (LN). In this paper, we seek to answer why BN usually performs worse than LN in NLP tasks with Transformer models. We find that the inconsistency between the training and inference behavior of BN is the leading cause of its failure in NLP. We define Training Inference Discrepancy (TID) to quantitatively measure this inconsistency and show that TID can indicate BN's performance, supported by extensive experiments on image classification, neural machine translation, language modeling, sequence labeling, and text classification tasks. We find that BN can obtain much better test performance than LN when TID stays small throughout training. To suppress the explosion of TID, we propose Regularized BN (RBN), which adds a simple regularization term to narrow the gap between the batch statistics and the population statistics of BN. RBN improves the performance of BN consistently and outperforms or is on par with LN on 17 out of 20 settings, involving ten datasets and two common variants of the Transformer. Our code is available at https://github.com/wjxts/RegularizedBN.
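The sketch below is a minimal PyTorch illustration of the two ideas described above, not the authors' implementation (see the linked repository for that). The class name RegularizedBatchNorm1d, the squared-distance penalty, the reg_weight parameter, and the tid helper are all illustrative assumptions: the penalty pulls the batch mean and variance toward the running (population) statistics, and tid is a rough proxy for the paper's discrepancy measure.

# Minimal sketch (assumed names, not the paper's code): a BatchNorm1d layer that
# exposes a regularization term on its batch statistics, plus a simple proxy for
# the training-inference discrepancy between batch and population statistics.

import torch
import torch.nn as nn


class RegularizedBatchNorm1d(nn.BatchNorm1d):
    """BatchNorm1d with a penalty on the gap between batch and running statistics."""

    def __init__(self, num_features, reg_weight=0.1, **kwargs):
        super().__init__(num_features, **kwargs)
        self.reg_weight = reg_weight
        self.stat_regularizer = torch.tensor(0.0)

    def forward(self, x):
        if self.training:
            # Batch statistics actually used for normalization during training.
            batch_mean = x.mean(dim=0)
            batch_var = x.var(dim=0, unbiased=False)
            # Penalize deviation from the population (running) statistics,
            # one simple way to keep train-time and inference-time statistics close.
            self.stat_regularizer = self.reg_weight * (
                (batch_mean - self.running_mean).pow(2).mean()
                + (batch_var - self.running_var).pow(2).mean()
            )
        return super().forward(x)


def tid(bn: nn.BatchNorm1d, x: torch.Tensor) -> float:
    """Rough proxy for TID: relative gap between batch and running statistics."""
    batch_mean = x.mean(dim=0)
    batch_var = x.var(dim=0, unbiased=False)
    mean_gap = (batch_mean - bn.running_mean).norm() / (bn.running_mean.norm() + 1e-6)
    var_gap = (batch_var - bn.running_var).norm() / (bn.running_var.norm() + 1e-6)
    return (mean_gap + var_gap).item()


if __name__ == "__main__":
    bn = RegularizedBatchNorm1d(16)
    x = torch.randn(32, 16, requires_grad=True)
    y = bn(x)                                      # standard BN forward pass
    loss = y.pow(2).mean() + bn.stat_regularizer   # add the penalty to the task loss
    loss.backward()
    print("TID proxy:", tid(bn, x))

In a full Transformer, the stat_regularizer terms of all BN layers would simply be summed into the training loss; the inference path is unchanged, since the penalty only affects how far the batch statistics are allowed to drift from the population statistics during training.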

Related research

03/17/2020  Rethinking Batch Normalization in Transformers
05/20/2020  Applying the Transformer to Character-level Transduction
08/02/2022  Unified Normalization for Accelerating and Stabilizing Transformers
06/19/2020  Towards an Adversarially Robust Normalization Approach
01/19/2020  Towards Stabilizing Batch Statistics in Backward Propagation of Batch Normalization
06/07/2023  Normalization Layers Are All That Sharpness-Aware Minimization Needs
02/04/2021  SelfNorm and CrossNorm for Out-of-Distribution Robustness
