Centroid Transformers: Learning to Abstract with Attention

02/17/2021
by Lemeng Wu, et al.

Self-attention, the key building block of transformers, is a powerful mechanism for extracting features from inputs. In essence, what self-attention does is infer the pairwise relations between the elements of the input and modify each element by propagating information between input pairs. As a result, it maps N inputs to N outputs and incurs a quadratic O(N^2) memory and time complexity. We propose centroid attention, a generalization of self-attention that maps N inputs to M outputs (M ≤ N), such that the key information in the inputs is summarized in the smaller number of outputs (called centroids). We design centroid attention by amortizing the gradient descent update rule of a clustering objective function on the inputs, which reveals an underlying connection between attention and clustering. By compressing the inputs into the centroids, we extract the key information useful for prediction and also reduce the computation of the attention module and of the subsequent layers. We apply our method to various applications, including abstractive text summarization, 3D vision, and image processing. Empirical results demonstrate the effectiveness of our method over standard transformers.
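
The core idea of mapping N inputs to M centroid outputs via attention can be illustrated with a minimal cross-attention sketch in PyTorch. This is not the authors' exact formulation (the paper derives its update by amortizing a clustering objective, which is not reproduced here); the module name CentroidAttention, the use of learned centroid queries, and all dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn


class CentroidAttention(nn.Module):
    """Minimal sketch: compress N input tokens into M centroid tokens.

    M learned centroid queries attend over the N inputs, so the attention
    cost is O(N*M) instead of the O(N^2) of standard self-attention.
    (Illustrative only; the paper's amortized-clustering derivation is
    not reproduced here.)
    """

    def __init__(self, dim: int, num_centroids: int):
        super().__init__()
        self.centroid_queries = nn.Parameter(torch.randn(num_centroids, dim))
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, dim) -> output: (batch, M, dim)
        b = x.size(0)
        q = self.centroid_queries.unsqueeze(0).expand(b, -1, -1)  # (b, M, dim)
        k, v = self.to_k(x), self.to_v(x)                         # (b, N, dim)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (b, M, N)
        return attn @ v                                           # (b, M, dim)


if __name__ == "__main__":
    layer = CentroidAttention(dim=64, num_centroids=16)
    tokens = torch.randn(2, 128, 64)   # N = 128 input tokens
    centroids = layer(tokens)
    print(centroids.shape)             # torch.Size([2, 16, 64])
```

Because the output has only M tokens, any subsequent attention or feed-forward layers also operate on the compressed sequence, which is where the additional savings described in the abstract come from.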

Related research

01/05/2023  Skip-Attention: Improving Vision Transformers by Paying Less Attention
This work aims to improve the efficiency of vision transformers (ViT). W...

04/14/2023  Optimal inference of a generalised Potts model by single-layer transformers with factored attention
Transformers are the type of neural networks that has revolutionised nat...

04/23/2022  Grad-SAM: Explaining Transformers via Gradient Self-Attention Maps
Transformer-based language models significantly advanced the state-of-th...

07/01/2022  Rethinking Query-Key Pairwise Interactions in Vision Transformers
Vision Transformers have achieved state-of-the-art performance in many v...

02/18/2022  DataMUX: Data Multiplexing for Neural Networks
In this paper, we introduce data multiplexing (DataMUX), a technique tha...

02/22/2021  Linear Transformers Are Secretly Fast Weight Memory Systems
We show the formal equivalence of linearised self-attention mechanisms a...

04/14/2021  Pose Recognition with Cascade Transformers
In this paper, we present a regression-based pose recognition method usi...