Primal-Attention: Self-attention through Asymmetric Kernel SVD in Primal Representation

05/31/2023
by Yingyi Chen, et al.

Recently, a new line of work has emerged that seeks to understand and improve self-attention in Transformers by treating it as a kernel machine. However, existing works apply methods designed for symmetric kernels to the asymmetric kernel underlying self-attention, leaving a nontrivial gap between the analytical understanding and the numerical implementation. In this paper, we provide a new perspective for representing and optimizing self-attention through asymmetric Kernel Singular Value Decomposition (KSVD), which is also motivated by the low-rank property of self-attention typically observed in deep layers. Through asymmetric KSVD, i) a primal-dual representation of self-attention is formulated, where the optimization objective is cast as maximizing the projection variances in the attention outputs; ii) a novel attention mechanism, Primal-Attention, is proposed via the primal representation of KSVD, avoiding explicit computation of the kernel matrix in the dual; iii) using the KKT conditions, we prove that the stationary solution of the KSVD optimization in Primal-Attention yields a zero-value objective. Consequently, KSVD optimization can be implemented by simply minimizing a regularization loss, so that the low-rank property is promoted without any extra decomposition. Numerical experiments show state-of-the-art performance of Primal-Attention with improved efficiency. Moreover, we demonstrate that the deployed KSVD optimization regularizes Primal-Attention with a sharper singular value decay than that of canonical self-attention, further verifying the potential of our method. To the best of our knowledge, this is the first work to provide a primal-dual representation for the asymmetric kernel in self-attention and to successfully apply it to modeling and optimization.
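The abstract describes the primal mechanism only at a high level, so the following is a minimal, illustrative sketch of what a primal-style attention layer could look like: two learned feature maps for the query and key sides of the asymmetric kernel, two projection matrices playing the role of the left/right singular directions, and a zero-targeted regularizer standing in for the KSVD objective. All names (PrimalStyleAttention, phi_q, phi_k, W_e, W_r), the tanh feature maps, and the exact form of the regularizer are assumptions made for illustration, not the paper's reference implementation.

```python
# Illustrative sketch only (not the authors' reference code): a primal-style
# attention layer that builds its output from projected query/key features,
# never forming the (seq_len x seq_len) kernel/attention matrix.
import torch
import torch.nn as nn


class PrimalStyleAttention(nn.Module):
    def __init__(self, d_model: int, d_feature: int, rank: int):
        super().__init__()
        # Two distinct (hence asymmetric) feature maps for the query and key
        # sides of the kernel; the tanh nonlinearity is an arbitrary choice here.
        self.phi_q = nn.Linear(d_model, d_feature)
        self.phi_k = nn.Linear(d_model, d_feature)
        # Projection weights playing the role of the left/right singular
        # directions in the primal KSVD representation.
        self.W_e = nn.Parameter(torch.randn(d_feature, rank) / d_feature ** 0.5)
        self.W_r = nn.Parameter(torch.randn(d_feature, rank) / d_feature ** 0.5)

    def forward(self, x: torch.Tensor):
        # x: (batch, seq_len, d_model)
        fq = torch.tanh(self.phi_q(x))   # query-side features, (B, N, d_feature)
        fk = torch.tanh(self.phi_k(x))   # key-side features,   (B, N, d_feature)
        e = fq @ self.W_e                # projection scores,   (B, N, rank)
        r = fk @ self.W_r                # projection scores,   (B, N, rank)
        out = torch.cat([e, r], dim=-1)  # attention output in the primal view
        # Placeholder zero-targeted regularizer coupling the two projections;
        # the paper's actual KSVD regularization loss has a different form.
        reg = 0.5 * (e.pow(2).sum() + r.pow(2).sum()) - (e * r).sum()
        return out, reg


if __name__ == "__main__":
    layer = PrimalStyleAttention(d_model=64, d_feature=128, rank=16)
    x = torch.randn(2, 10, 64)       # (batch, seq_len, d_model)
    out, reg = layer(x)
    print(out.shape, reg.item())     # torch.Size([2, 10, 32]) and a scalar
```

Because the output is built directly from the projected features, no sequence-length-by-sequence-length kernel matrix is ever materialized, which is the source of the efficiency gain claimed in the abstract; in a training loop the returned regularizer would be added to the task loss with a small weight.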


Related research

08/24/2023  Easy attention: A simple self-attention mechanism for Transformers
To improve the robustness of transformer neural networks used for tempor...

07/05/2022  Softmax-free Linear Transformers
Vision transformers (ViTs) have pushed the state-of-the-art for various ...

05/08/2020  Modeling Document Interactions for Learning to Rank with Regularized Self-Attention
Learning to rank is an important task that has been successfully deploye...

06/12/2023  Nonlinear SVD with Asymmetric Kernels: feature learning and asymmetric Nyström method
Asymmetric data naturally exist in real life, such as directed graphs. D...

09/09/2021  Is Attention Better Than Matrix Decomposition?
As an essential ingredient of modern deep learning, attention mechanism,...

11/08/2022  Linear Self-Attention Approximation via Trainable Feedforward Kernel
In pursuit of faster computation, Efficient Transformers demonstrate an ...

08/25/2023  QKSAN: A Quantum Kernel Self-Attention Network
Self-Attention Mechanism (SAM) is skilled at extracting important inform...
