Multi-Head Attention: Collaborate Instead of Concatenate

06/29/2020
by Jean-Baptiste Cordonnier, et al.

Attention layers are widely used in natural language processing (NLP) and are beginning to influence computer vision architectures. However, they suffer from over-parameterization: for instance, it has been shown that the majority of attention heads can be pruned without impacting accuracy. This work aims to enhance the current understanding of how multiple heads interact. Motivated by the observation that trained attention heads share common key/query projections, we propose a collaborative multi-head attention layer that enables heads to learn shared projections. Our scheme reduces the computational cost and parameter count of an attention layer and can be used as a drop-in replacement in any transformer architecture. For instance, by allowing heads to collaborate on a neural machine translation task, we can reduce the key dimension by a factor of eight without any loss in performance. We also show that a pre-trained multi-head attention layer can be re-parametrized into our collaborative attention layer. Even without retraining, collaborative multi-head attention halves the size of the key and query projections without sacrificing accuracy. Our code is public.
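To make the idea concrete, below is a minimal PyTorch sketch of an attention layer in which all heads share a single key/query projection and differ only through learned per-head mixing vectors, the kind of shared-projection scheme the abstract describes. The class name, argument names, and dimension choices are illustrative assumptions, not the authors' released implementation (their public code is the reference).

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CollaborativeSelfAttention(nn.Module):
    """Illustrative sketch: heads share one key/query projection and
    re-weight its dimensions with per-head mixing vectors, instead of
    each head owning its own key/query projection (names are hypothetical)."""

    def __init__(self, d_model, num_heads, shared_key_dim, head_value_dim):
        super().__init__()
        self.num_heads = num_heads
        self.shared_key_dim = shared_key_dim
        self.head_value_dim = head_value_dim
        # One shared query/key projection for all heads (d_model -> shared_key_dim).
        self.query = nn.Linear(d_model, shared_key_dim, bias=False)
        self.key = nn.Linear(d_model, shared_key_dim, bias=False)
        # Per-head mixing vectors that blend the shared key/query dimensions.
        self.mixing = nn.Parameter(torch.ones(num_heads, shared_key_dim))
        # Values and the output projection stay per-head, as in standard attention.
        self.value = nn.Linear(d_model, num_heads * head_value_dim, bias=False)
        self.out = nn.Linear(num_heads * head_value_dim, d_model, bias=False)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        b, t, _ = x.shape
        q = self.query(x)                                   # (b, t, shared_key_dim)
        k = self.key(x)                                     # (b, t, shared_key_dim)
        v = self.value(x).view(b, t, self.num_heads, self.head_value_dim)

        # Per-head queries: re-weight the shared query dimensions with mixing vector m_h,
        # so the score for head h is (q * m_h) k^T instead of (x W_Q^h)(x W_K^h)^T.
        q_heads = q.unsqueeze(1) * self.mixing.view(1, self.num_heads, 1, -1)
        scores = torch.matmul(q_heads, k.transpose(-1, -2).unsqueeze(1))
        scores = scores / math.sqrt(self.shared_key_dim)
        attn = F.softmax(scores, dim=-1)                    # (b, heads, t, t)

        ctx = torch.matmul(attn, v.permute(0, 2, 1, 3))     # (b, heads, t, head_value_dim)
        ctx = ctx.permute(0, 2, 1, 3).reshape(b, t, -1)
        return self.out(ctx)
```

In this sketch, choosing shared_key_dim well below num_heads times the usual per-head key dimension is what shrinks the key/query projections relative to standard multi-head attention, which is the kind of saving the abstract reports.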

Related research

06/18/2020 · Multi-branch Attentive Transformer
While the multi-branch architecture is one of the key ingredients to the...

05/22/2023 · GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Multi-query attention (MQA), which only uses a single key-value head, dr...

08/07/2023 · RCMHA: Relative Convolutional Multi-Head Attention for Natural Language Modelling
The Attention module finds common usage in language modeling, presenting...

04/12/2021 · GAttANet: Global attention agreement for convolutional neural networks
Transformer attention architectures, similar to those developed for natu...

03/05/2021 · IOT: Instance-wise Layer Reordering for Transformer Structures
With sequentially stacked self-attention, (optional) encoder-decoder att...

04/15/2019 · Multi-Head Multi-Layer Attention to Deep Language Representations for Grammatical Error Detection
It is known that a deep neural network model pre-trained with large-scal...

02/17/2020 · Low-Rank Bottleneck in Multi-head Attention Models
Attention based Transformer architecture has enabled significant advance...
