Multi-head or Single-head? An Empirical Comparison for Transformer Training

06/17/2021
by   Liyuan Liu, et al.

Multi-head attention plays a crucial role in the recent success of Transformer models, yielding consistent performance improvements over conventional attention across a wide range of applications. The popular belief is that this effectiveness stems from the ability to jointly attend to multiple positions. In this paper, we first demonstrate that jointly attending to multiple positions is not a unique feature of multi-head attention, as multi-layer single-head attention also attends to multiple positions and is more effective. We then suggest that the main advantage of multi-head attention is training stability, since it requires fewer layers than single-head attention to attend to the same number of positions. For example, a 24-layer 16-head Transformer (BERT-large) and a 384-layer single-head Transformer have the same total number of attention heads and roughly the same model size, while the multi-head one is significantly shallower. Meanwhile, we show that, with recent advances in deep learning, we can successfully stabilize the training of the 384-layer Transformer. As training difficulty is no longer a bottleneck, the substantially deeper single-head Transformer achieves consistent performance improvements without tuning hyper-parameters.
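As a concrete illustration of the head-count equivalence described in the abstract, the following sketch (not from the paper) counts the total attention heads and the attention-projection parameters of the two configurations. The hidden size of 1024 and per-head dimension of 64 are assumptions borrowed from BERT-large, and biases and feed-forward layers are ignored.

```python
# Minimal sketch (assumptions: d_model=1024, d_head=64, biases and FFN ignored)
# comparing a 24-layer 16-head Transformer with a 384-layer single-head one.

def attention_params(num_layers: int, num_heads: int, d_model: int, d_head: int) -> int:
    """Parameters in the Q/K/V and output projections of all attention layers."""
    per_layer = 3 * d_model * (num_heads * d_head)   # Q, K, V projections
    per_layer += (num_heads * d_head) * d_model      # output projection
    return num_layers * per_layer

multi_head = dict(num_layers=24, num_heads=16, d_model=1024, d_head=64)
single_head = dict(num_layers=384, num_heads=1, d_model=1024, d_head=64)

for name, cfg in [("multi-head (24L x 16H)", multi_head),
                  ("single-head (384L x 1H)", single_head)]:
    total_heads = cfg["num_layers"] * cfg["num_heads"]
    print(f"{name}: total heads = {total_heads}, "
          f"attention params = {attention_params(**cfg):,}")
```

Under these assumptions both configurations have 384 attention heads in total and identical attention-projection parameter counts; they differ only in depth, which is the training-stability axis the paper studies.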

research · 06/12/2019 · A Multiscale Visualization of Attention in the Transformer Model
The Transformer is a sequence model that forgoes traditional recurrent a...

research · 12/12/2022 · A Neural ODE Interpretation of Transformer Layers
Transformer layers, which use an alternating pattern of multi-head atten...

research · 06/03/2023 · Memorization Capacity of Multi-Head Attention in Transformers
In this paper, we investigate the memorization capabilities of multi-hea...

research · 02/17/2020 · Low-Rank Bottleneck in Multi-head Attention Models
Attention based Transformer architecture has enabled significant advance...

research · 05/25/2019 · Are Sixteen Heads Really Better than One?
Attention is a powerful and ubiquitous mechanism for allowing neural mod...

research · 09/20/2020 · Repulsive Attention: Rethinking Multi-head Attention as Bayesian Inference
The neural attention mechanism plays an important role in many natural l...

research · 04/15/2019 · Multi-Head Multi-Layer Attention to Deep Language Representations for Grammatical Error Detection
It is known that a deep neural network model pre-trained with large-scal...
