Stabilizing Transformer Training by Preventing Attention Entropy Collapse

03/11/2023
by Shuangfei Zhai, et al.

Training stability is of great importance to Transformers. In this work, we investigate the training dynamics of Transformers by examining the evolution of the attention layers. In particular, we track the attention entropy of each attention head over the course of training, which serves as a proxy for model sharpness. We identify a common pattern across different architectures and tasks: low attention entropy is accompanied by high training instability, which can take the form of oscillating loss or outright divergence. We refer to this pathologically low attention entropy, corresponding to highly concentrated attention scores, as entropy collapse. As a remedy, we propose σReparam, a simple and efficient solution in which we reparametrize all linear layers with spectral normalization and an additional learned scalar. We demonstrate that the proposed reparameterization successfully prevents entropy collapse in the attention layers, promoting more stable training. Additionally, we prove a tight lower bound on the attention entropy, which decreases exponentially fast with the spectral norm of the attention logits, providing further motivation for our approach. We conduct experiments with σReparam on image classification, image self-supervised learning, machine translation, automatic speech recognition, and language modeling tasks, across a range of Transformer architectures. We show that σReparam provides stability and robustness with respect to the choice of hyperparameters, going so far as to enable (a) training a Vision Transformer to competitive performance without warmup, weight decay, layer normalization, or adaptive optimizers; (b) training deep architectures in machine translation; and (c) training speech recognition models to competitive performance without warmup and adaptive optimizers.
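The two objects named above can be written down compactly. The attention entropy tracked per head is the Shannon entropy of each row of the softmax attention matrix, and σReparam rescales every weight matrix by its spectral norm together with a learned scalar. The sketch below is a gloss on the abstract, not an excerpt from the paper; the row averaging and the symbols n (sequence length), d (key dimension), and γ (the learned scalar) are notational assumptions.

```latex
% Row-wise attention entropy of one head, averaged over the n query rows of
% the softmax attention matrix A (a sketch; the paper's exact averaging
% convention may differ).
\[
A = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right), \qquad
\mathrm{Ent}(A) = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n} A_{ij}\log A_{ij}.
\]
% σReparam: every linear weight W is replaced by a spectrally normalized,
% learnably rescaled version, with γ a learned scalar and σ(W) the spectral norm.
\[
\widehat{W} = \frac{\gamma}{\sigma(W)}\,W.
\]
```

A minimal PyTorch-style sketch of such a reparametrized linear layer follows. It estimates σ(W) with one step of power iteration per training forward pass, in the spirit of standard spectral normalization; the class name, the initialization of γ to 1, and the power-iteration bookkeeping are illustrative assumptions rather than the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SigmaReparamLinear(nn.Module):
    """Linear layer reparametrized as W_hat = (gamma / sigma(W)) * W.

    sigma(W) is the spectral norm of W, estimated by power iteration;
    gamma is a learned scalar. Names and init choices are assumptions.
    """

    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.xavier_uniform_(self.weight)
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None
        self.gamma = nn.Parameter(torch.ones(1))  # learned scalar
        # Running estimates of the top left/right singular vectors of W.
        self.register_buffer("u", F.normalize(torch.randn(out_features), dim=0))
        self.register_buffer("v", F.normalize(torch.randn(in_features), dim=0))

    @torch.no_grad()
    def _power_iteration(self) -> None:
        # One power-iteration step pushes (u, v) toward the top singular pair.
        self.v = F.normalize(self.weight.t() @ self.u, dim=0)
        self.u = F.normalize(self.weight @ self.v, dim=0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            self._power_iteration()
        # Rayleigh-quotient estimate of the spectral norm sigma(W) = u^T W v.
        sigma = torch.einsum("i,ij,j->", self.u, self.weight, self.v)
        w_hat = (self.gamma / sigma) * self.weight
        return F.linear(x, w_hat, self.bias)
```

In a Transformer, every linear layer (the query/key/value/output projections as well as the MLP blocks) would be swapped for a layer of this form, matching the abstract's "reparametrize all linear layers". Since the entropy lower bound cited in the abstract decays exponentially in the spectral norm of the attention logits, keeping that norm in check is what keeps the attention entropy bounded away from collapse.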


