Are Sixteen Heads Really Better than One?

05/25/2019
by Paul Michel, et al.

Attention is a powerful and ubiquitous mechanism for allowing neural models to focus on particular salient pieces of information by taking their weighted average when making predictions. In particular, multi-headed attention is a driving force behind many recent state-of-the-art NLP models such as Transformer-based MT models and BERT. These models apply multiple attention mechanisms in parallel, with each attention "head" potentially focusing on different parts of the input, which makes it possible to express sophisticated functions beyond the simple weighted average. In this paper we make the surprising observation that even if models have been trained using multiple heads, in practice, a large percentage of attention heads can be removed at test time without significantly impacting performance. In fact, some layers can even be reduced to a single head. We further examine greedy algorithms for pruning down models, and the potential speed, memory efficiency, and accuracy improvements obtainable therefrom. Finally, we analyze the results with respect to which parts of the model are more reliant on having multiple heads, and provide precursory evidence that training dynamics play a role in the gains provided by multi-head attention.
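
To make the head-masking idea concrete, below is a minimal PyTorch-style sketch (not the authors' released code): each head's output is gated by a scalar mask, so setting a mask entry to zero removes that head at test time, and the magnitude of the loss gradient with respect to the mask gives a per-head importance score of the kind that can drive greedy pruning. All names here (MaskedMultiHeadAttention, head_mask, score_heads) are illustrative assumptions.

```python
# Sketch: gating attention heads with a mask and scoring head importance
# by the gradient of the loss w.r.t. that mask. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedMultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, head_mask=None):
        # x: (batch, seq, d_model); head_mask: (n_heads,) of 0/1 gates
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, seq, d_head)
        shape = (B, T, self.n_heads, self.d_head)
        q, k, v = (t.view(*shape).transpose(1, 2) for t in (q, k, v))
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        ctx = attn @ v  # (batch, heads, seq, d_head)
        if head_mask is not None:
            # zero out pruned heads; gradients w.r.t. this mask serve as
            # per-head importance scores
            ctx = ctx * head_mask.view(1, -1, 1, 1)
        ctx = ctx.transpose(1, 2).reshape(B, T, -1)
        return self.out(ctx)

def score_heads(layer, x, targets, loss_fn):
    """Proxy importance: |d loss / d mask| for each head (higher = more important)."""
    mask = torch.ones(layer.n_heads, requires_grad=True)
    loss = loss_fn(layer(x, head_mask=mask), targets)
    loss.backward()
    return mask.grad.abs()
```

In a greedy pruning loop of the kind the abstract describes, the heads with the smallest scores would be the first candidates for removal, with performance re-checked after each round.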


Related research

06/17/2021 - Multi-head or Single-head? An Empirical Comparison for Transformer Training
Multi-head attention plays a crucial role in the recent success of Trans...

08/10/2021 - Differentiable Subset Pruning of Transformer Heads
Multi-head attention, a collection of several attention mechanisms that ...

08/03/2021 - A Dynamic Head Importance Computation Mechanism for Neural Machine Translation
Multiple parallel attention mechanisms that use multiple attention heads...

09/20/2020 - Repulsive Attention: Rethinking Multi-head Attention as Bayesian Inference
The neural attention mechanism plays an important role in many natural l...

11/02/2020 - How Far Does BERT Look At: Distance-based Clustering and Analysis of BERT's Attention
Recent research on the multi-head attention mechanism, especially that i...

05/11/2021 - EL-Attention: Memory Efficient Lossless Attention for Generation
Transformer model with multi-head attention requires caching intermediat...

01/22/2021 - The heads hypothesis: A unifying statistical approach towards understanding multi-headed attention in BERT
Multi-headed attention heads are a mainstay in transformer-based models....
