Cascaded Head-colliding Attention

05/31/2021
by Lin Zheng, et al.

Transformers have advanced the field of natural language processing (NLP) on a variety of important tasks. A cornerstone of the Transformer architecture is the multi-head attention (MHA) mechanism, which models pairwise interactions between the elements of a sequence. Despite its success, the standard framework ignores interactions among different heads; in practice many heads turn out to be redundant, which wastes model capacity. To improve parameter efficiency, we re-formulate MHA as a latent variable model from a probabilistic perspective. We present cascaded head-colliding attention (CODA), which explicitly models the interactions between attention heads through a hierarchical variational distribution. We conduct extensive experiments and demonstrate that CODA outperforms the Transformer baseline by 0.6 perplexity in language modeling and by 0.6 BLEU in machine translation, owing to its improved parameter efficiency. Our implementation is publicly available at https://github.com/LZhengisme/CODA.
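The abstract describes two ingredients: standard multi-head attention, and head-level latent variables whose distribution is built up hierarchically so that heads depend on one another. The following is a minimal, self-contained PyTorch sketch of that general idea only, not the authors' implementation (see the linked repository for CODA itself). In the sketch, each head's output is modulated by a relaxed binary gate, and the gate for head h is predicted conditioned on the sampled gates of heads 1..h-1, so heads interact through the cascade. The module names (`CascadedHeadGates`, `GatedMultiHeadAttention`), the mean-pooled summary used to predict gates, and the Gumbel-Sigmoid relaxation are all illustrative assumptions, not the paper's parameterization.

```python
# Illustrative sketch only: per-head gates with a cascaded (head-by-head)
# dependency structure, applied on top of vanilla multi-head self-attention.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CascadedHeadGates(nn.Module):
    """Samples a relaxed binary gate g_h for each head; the logit of head h
    is conditioned on the sampled gates of heads 1..h-1 (the "cascade")."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        # One small predictor per head; head h additionally sees g_{<h}.
        self.logit_layers = nn.ModuleList(
            [nn.Linear(d_model + h, 1) for h in range(n_heads)]
        )

    def forward(self, x: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
        # x: (batch, seq, d_model); use a pooled summary to predict the gates.
        summary = x.mean(dim=1)                                   # (batch, d_model)
        gates = []
        for layer in self.logit_layers:
            prev = (torch.cat(gates, dim=-1) if gates
                    else summary.new_zeros(summary.size(0), 0))   # g_{<h}
            logit = layer(torch.cat([summary, prev], dim=-1))     # (batch, 1)
            # Gumbel-Sigmoid (binary concrete) sample keeps sampling differentiable.
            u = torch.rand_like(logit).clamp(1e-6, 1 - 1e-6)
            g = torch.sigmoid((logit + u.log() - (-u).log1p()) / temperature)
            gates.append(g)
        return torch.cat(gates, dim=-1)                           # (batch, n_heads)


class GatedMultiHeadAttention(nn.Module):
    """Vanilla multi-head self-attention whose head outputs are weighted by
    the cascaded gates, so redundant heads can be softly switched off."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.gates = CascadedHeadGates(d_model, n_heads)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, seq, d_head).
        q, k, v = [z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v)]
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        heads = attn @ v                               # (batch, heads, seq, d_head)
        g = self.gates(x).view(b, self.n_heads, 1, 1)  # heads interact via the cascade
        return self.out((g * heads).transpose(1, 2).reshape(b, t, d))


if __name__ == "__main__":
    layer = GatedMultiHeadAttention()
    y = layer(torch.randn(2, 10, 512))
    print(y.shape)  # torch.Size([2, 10, 512])
```

The sketch only mirrors the cascaded dependency structure among heads; in the paper the interaction is modeled probabilistically through a hierarchical variational distribution and trained with a corresponding variational objective, which the sketch does not implement.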

Related Research

04/24/2020

Lite Transformer with Long-Short Range Attention

Transformer has become ubiquitous in natural language processing (e.g., ...

10/16/2021

Transformer with a Mixture of Gaussian Keys

Multi-head attention is a driving force behind state-of-the-art transfor...

06/18/2020

Multi-branch Attentive Transformer

While the multi-branch architecture is one of the key ingredients to the...

09/20/2020

Repulsive Attention: Rethinking Multi-head Attention as Bayesian Inference

The neural attention mechanism plays an important role in many natural l...

09/15/2021

Incorporating Residual and Normalization Layers into Analysis of Masked Language Models

Transformer architecture has become ubiquitous in the natural language p...

07/09/2022

QKVA grid: Attention in Image Perspective and Stacked DETR

We present a new model named Stacked-DETR (SDETR), which inherits the mai...

05/13/2020

A Mixture of h-1 Heads is Better than h Heads

Multi-head attentive neural architectures have achieved state-of-the-art...