Cascaded Head-colliding Attention

05/31/2021
by Lin Zheng et al.

Transformers have advanced the field of natural language processing (NLP) on a variety of important tasks. The cornerstone of the Transformer architecture is the multi-head attention (MHA) mechanism, which models pairwise interactions between the elements of a sequence. Despite its success, the standard formulation ignores interactions among different heads, so many heads turn out to be redundant in practice, wasting much of the model's capacity. To improve parameter efficiency, we re-formulate MHA as a latent variable model from a probabilistic perspective. We present cascaded head-colliding attention (CODA), which explicitly models the interactions between attention heads through a hierarchical variational distribution. We conduct extensive experiments and demonstrate that CODA outperforms the transformer baseline by 0.6 perplexity in language modeling and by 0.6 BLEU in machine translation, owing to its improved parameter efficiency. Our implementation is publicly available at https://github.com/LZhengisme/CODA.
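For reference, the sketch below shows the standard multi-head attention baseline that the abstract describes: each head attends independently over the sequence, and the heads interact only through the final output projection, which is the head-level independence CODA is designed to address. This is a minimal NumPy illustration under assumed names and shapes; it is not code from the CODA repository, and CODA's hierarchical variational formulation itself is described in the paper.

```python
# Minimal sketch of standard multi-head attention (the baseline CODA re-formulates).
# All names, shapes, and the NumPy implementation are illustrative assumptions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """X: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads

    # Project the input and split it into heads: (n_heads, seq_len, d_head).
    def project(W):
        return (X @ W).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = project(Wq), project(Wk), project(Wv)

    # Each head independently models pairwise interactions between positions;
    # there is no term coupling one head's attention weights to another's.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq, seq)
    attn = softmax(scores, axis=-1)
    heads = attn @ V                                      # (n_heads, seq, d_head)

    # Heads only meet at the end: concatenate and mix with the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

# Example usage with random weights.
rng = np.random.default_rng(0)
d_model, seq_len, n_heads = 64, 10, 8
X = rng.standard_normal((seq_len, d_model))
Ws = [rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(4)]
out = multi_head_attention(X, *Ws, n_heads=n_heads)
print(out.shape)  # (10, 64)
```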

Related research

04/24/2020  Lite Transformer with Long-Short Range Attention
10/16/2021  Transformer with a Mixture of Gaussian Keys
06/18/2020  Multi-branch Attentive Transformer
09/20/2020  Repulsive Attention: Rethinking Multi-head Attention as Bayesian Inference
07/19/2023  Exploring Transformer Extrapolation
10/11/2022  Mixture of Attention Heads: Selecting Attention Heads Per Token
06/03/2023  Memorization Capacity of Multi-Head Attention in Transformers
