BranchNorm: Robustly Scaling Extremely Deep Transformers

05/04/2023
by Yijin Liu, et al.

Recently, DeepNorm (Wang et al., 2022) scaled Transformers to extreme depths (i.e., 1,000 layers), revealing the promising potential of deep scaling. To stabilize the training of deep models, DeepNorm attempts to constrain the model update to a constant value. Although applying such a constraint can benefit the early stage of training, it may leave the model undertrained over the whole training procedure. In this paper, we propose BranchNorm, which dynamically rescales the non-residual branch of the Transformer in accordance with the training period. BranchNorm not only theoretically stabilizes training with smooth gradient norms at the early stage, but also encourages better convergence in the subsequent training stage. Experimental results on multiple translation tasks demonstrate that BranchNorm achieves a better trade-off between training stability and convergence performance.
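
The abstract does not spell out the rescaling schedule, but the core idea, downscaling the non-residual branch early in training and relaxing the constraint as training proceeds so the block reverts to a standard Post-LN residual block, can be sketched as follows. This is a minimal PyTorch sketch under stated assumptions: the linear ramp, the `scale_init` and `warmup_steps` parameters, and the `BranchNormBlock` name are illustrative choices, not the paper's exact formulation.

```python
import torch
import torch.nn as nn


class BranchNormBlock(nn.Module):
    """Post-LN Transformer sub-block whose non-residual branch is
    dynamically rescaled over training, in the spirit of BranchNorm.

    Assumption: the branch scale ramps linearly from `scale_init` to 1.0
    over `warmup_steps`; the paper's actual schedule may differ.
    """

    def __init__(self, branch: nn.Module, d_model: int,
                 scale_init: float = 0.1, warmup_steps: int = 4000):
        super().__init__()
        self.branch = branch  # e.g. a self-attention or feed-forward sub-layer
        self.norm = nn.LayerNorm(d_model)
        self.scale_init = scale_init
        self.warmup_steps = warmup_steps
        # Track the training step inside the module so the scale
        # follows the training period automatically.
        self.register_buffer("step", torch.zeros((), dtype=torch.long))

    def branch_scale(self) -> float:
        # Ramp linearly from scale_init to 1.0, then hold at 1.0, at which
        # point the block degenerates to a standard Post-LN residual block.
        progress = min(float(self.step.item()) / self.warmup_steps, 1.0)
        return self.scale_init + (1.0 - self.scale_init) * progress

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            self.step += 1
        # Only the non-residual branch is downscaled; the identity path is
        # left untouched, which keeps early model updates small.
        return self.norm(x + self.branch_scale() * self.branch(x))
```

A usage sketch: wrapping a feed-forward sub-layer as `BranchNormBlock(nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512)), d_model=512)` gives a block whose branch contribution grows as training progresses. The design contrast with DeepNorm is that the branch scale is a function of the training step rather than a constant fixed by the model depth.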


Related research

04/17/2020  Understanding the Difficulty of Training Transformers
Transformers have been proved effective for many deep learning tasks. Tr...

03/01/2022  DeepNet: Scaling Transformers to 1,000 Layers
In this paper, we propose a simple yet effective method to stabilize ext...

06/01/2022  On Layer Normalizations and Residual Connections in Transformers
In the perspective of a layer normalization (LN) position, the architect...

06/04/2021  Scalable Transformers for Neural Machine Translation
Transformer has been widely adopted in Neural Machine Translation (NMT) ...

11/08/2019  Why Deep Transformers are Difficult to Converge? From Computation Order to Lipschitz Restricted Parameter Initialization
The Transformer translation model employs residual connection and layer ...

10/12/2022  Foundation Transformers
A big convergence of model architectures across language, vision, speech...

02/24/2022  Auto-scaling Vision Transformers without Training
This work targets automated designing and scaling of Vision Transformers...
