Implicit Kernel Attention

by Kyungwoo Song, et al.

Attention computes the dependency between representations and encourages the model to focus on important, selective features. Among attention methods, scaled dot-product attention is widely used in many models. This paper proposes a generalized structure for scaled dot-product attention with similarity and magnitude terms. We show that scaled dot-product attention is a product of two parts: 1) an RBF kernel that measures the similarity of two instances and 2) an exponential L^2 norm that computes the importance of individual instances. From this decomposition, we improve attention in two ways: implicit modeling of the kernel spectral density and a generalized L^p norm, which together yield a learnable and flexible attention structure. First, we estimate the spectral density of the kernel with implicit probabilistic models, so that an appropriate kernel is learned for a given dataset without manual kernel selection. Second, we introduce a generalized L^p norm on the hidden feature space, where p is a hyperparameter that affects the scale of individual importance and the sparsity of attention weights. We also show how to extend this implicit kernel modeling to multi-head attention in conjunction with a copula augmentation. Our generalized attention shows better performance on text classification, translation, regression, and node classification tasks.
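
The decomposition described in the abstract follows from the identity q·k = (||q||^2 + ||k||^2 - ||q - k||^2) / 2. The sketch below is a numerical check of that identity and not the authors' implementation. It assumes a single head, no masking, and the usual 1/sqrt(d) scaling. The function names are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def dot_product_attention_weights(Q, K):
    # Standard scaled dot-product attention weights, shape (n_queries, n_keys).
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d), axis=-1)


def decomposed_attention_weights(Q, K):
    # Same weights, written as an RBF similarity term plus a key magnitude term.
    d = Q.shape[-1]
    scale = np.sqrt(d)
    # Squared Euclidean distance between every query and every key.
    sq_dist = ((Q[:, None, :] - K[None, :, :]) ** 2).sum(axis=-1)
    log_rbf = -sq_dist / (2.0 * scale)               # log of the RBF kernel
    log_key_norm = (K ** 2).sum(axis=-1) / (2.0 * scale)  # log of exp(||k||^2 / (2 sqrt(d)))
    # The query term exp(||q||^2 / (2 sqrt(d))) is constant across keys,
    # so it cancels in the softmax normalization and is omitted here.
    return softmax(log_rbf + log_key_norm[None, :], axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q = rng.normal(size=(3, 8))
    K = rng.normal(size=(5, 8))
    same = np.allclose(dot_product_attention_weights(Q, K),
                       decomposed_attention_weights(Q, K))
    print("decomposition matches:", same)
```

Written this way, the two parts named in the abstract are separate terms in the logits: the RBF term depends only on the distance between a query and a key, and the magnitude term depends only on the key. This is where a different kernel or a different norm could be substituted.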
