Compositional Attention: Disentangling Search and Retrieval

10/18/2021
by Sarthak Mittal, et al.

Multi-head, key-value attention is the backbone of the widely successful Transformer model and its variants. This attention mechanism uses multiple parallel key-value attention blocks (called heads), each performing two fundamental computations: (1) search - selection of a relevant entity from a set via query-key interactions, and (2) retrieval - extraction of relevant features from the selected entity via a value matrix. Importantly, standard attention heads learn a rigid mapping between search and retrieval. In this work, we first highlight how this static nature of the pairing can potentially: (a) lead to learning of redundant parameters in certain tasks, and (b) hinder generalization. To alleviate this problem, we propose a novel attention mechanism, called Compositional Attention, that replaces the standard head structure. The proposed mechanism disentangles search and retrieval and composes them in a dynamic, flexible and context-dependent manner through an additional soft competition stage between the query-key combination and value pairing. Through a series of numerical experiments, we show that it outperforms standard multi-head attention on a variety of tasks, including some out-of-distribution settings. Through our qualitative analysis, we demonstrate that Compositional Attention leads to dynamic specialization based on the type of retrieval needed. Our proposed mechanism generalizes multi-head attention, allows independent scaling of search and retrieval, and can easily be implemented in lieu of standard attention heads in any network architecture.
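To make the distinction concrete, below is a minimal, self-contained PyTorch sketch of the idea described in the abstract: each search head produces its own attention pattern via query-key interactions, each retrieval produces its own value projection, and a learned soft competition decides, per search head and per token, which retrieval to compose with. The module name, dimension names, and the small selection query/key used for the competition are illustrative assumptions, not the authors' exact implementation; see the full paper for the precise formulation.

```python
# Sketch of "Compositional Attention": search and retrieval are computed
# independently and paired through a soft competition (assumed form).
import math
import torch
import torch.nn as nn


class CompositionalAttentionSketch(nn.Module):
    def __init__(self, d_model, n_search, n_retrieval, d_head, d_sel=32):
        super().__init__()
        self.n_search = n_search
        self.n_retrieval = n_retrieval
        self.d_head = d_head
        # Search parameters: one query/key projection per search head.
        self.q_proj = nn.Linear(d_model, n_search * d_head)
        self.k_proj = nn.Linear(d_model, n_search * d_head)
        # Retrieval parameters: one value projection per retrieval.
        self.v_proj = nn.Linear(d_model, n_retrieval * d_head)
        # Soft competition: a small query-key scorer deciding, for each
        # search head, how strongly each retrieval should be used.
        self.sel_q = nn.Linear(d_model, n_search * d_sel)
        self.sel_k = nn.Linear(d_head, d_sel)
        self.out_proj = nn.Linear(n_search * d_head, d_model)

    def forward(self, x):
        B, T, _ = x.shape
        S, R, Dh = self.n_search, self.n_retrieval, self.d_head

        q = self.q_proj(x).view(B, T, S, Dh).transpose(1, 2)   # (B, S, T, Dh)
        k = self.k_proj(x).view(B, T, S, Dh).transpose(1, 2)   # (B, S, T, Dh)
        v = self.v_proj(x).view(B, T, R, Dh).transpose(1, 2)   # (B, R, T, Dh)

        # Search: one attention matrix per search head.
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(Dh), dim=-1)  # (B, S, T, T)

        # Retrieval: pair every search head with every retrieval.
        out = torch.einsum('bstu,brud->bsrtd', attn, v)        # (B, S, R, T, Dh)

        # Soft competition over retrievals, conditioned on the input token.
        sel_q = self.sel_q(x).view(B, T, S, -1).transpose(1, 2)  # (B, S, T, d_sel)
        sel_k = self.sel_k(out)                                   # (B, S, R, T, d_sel)
        logits = torch.einsum('bstd,bsrtd->bsrt', sel_q, sel_k) / math.sqrt(sel_q.shape[-1])
        weights = torch.softmax(logits, dim=2)                    # compete over R

        composed = (weights.unsqueeze(-1) * out).sum(dim=2)       # (B, S, T, Dh)
        composed = composed.transpose(1, 2).reshape(B, T, S * Dh)
        return self.out_proj(composed)
```

In this sketch, setting n_retrieval equal to n_search and hard-coding the competition to a fixed one-to-one pairing between search heads and retrievals reduces to standard multi-head attention, which mirrors the abstract's claim that the mechanism generalizes multi-head attention while allowing search and retrieval to be scaled independently.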

Related research

05/22/2023
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Multi-query attention (MQA), which only uses a single key-value head, dr...

09/30/2020
Learning Hard Retrieval Cross Attention for Transformer
The Transformer translation model based on the multi-head attention...

03/20/2022
Vision Transformer with Convolutions Architecture Search
Transformers exhibit great advantages in handling computer vision tasks...

03/02/2020
Transformer++
Recent advancements in attention mechanisms have replaced recurrent neur...

10/08/2020
Improving Attention Mechanism with Query-Value Interaction
Attention mechanism has played critical roles in various state-of-the-ar...

08/30/2023
Denoising Attention for Query-aware User Modeling in Personalized Search
The personalization of search results has gained increasing attention in...

01/26/2021
Attention Can Reflect Syntactic Structure (If You Let It)
Since the popularization of the Transformer as a general-purpose feature...
