Implicit Kernel Attention

06/11/2020
by Kyungwoo Song, et al.

Attention computes the dependency between representations and encourages the model to focus on important features. Among attention methods, scaled dot-product attention is widely used in many models. This paper proposes a generalized structure of scaled dot-product attention with similarity and magnitude terms. We show that scaled dot-product attention is a product of two parts: 1) an RBF kernel that measures the similarity of two instances and 2) an exponential L^2 norm that computes the importance of individual instances. From this decomposition, we improve attention in two ways: implicit modeling of the kernel spectral density and a generalized L^p norm, which results in a learnable and flexible attention structure. First, we estimate the spectral density of the kernel with implicit probabilistic models, fitting an appropriate kernel to a given dataset without manual kernel selection. Second, we introduce a generalized L^p norm on the hidden feature space, where p is a hyper-parameter that affects the scale of individual importance and the sparsity of the attention weights. We also show how to extend this implicit kernel modeling to multi-head attention in conjunction with a copula augmentation. Our generalized attention shows better performance on text classification, translation, regression, and node classification tasks.
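As a sanity check of the decomposition described above, the following minimal NumPy sketch (not the authors' code; the bandwidth 2*sqrt(d) and exact constants are one reading of the standard identity, not necessarily the paper's formulation) verifies that the unnormalized scaled dot-product weight exp(q.k / sqrt(d)) factors into an RBF similarity term and two exponential L^2 magnitude terms, using q.k = (||q||^2 + ||k||^2 - ||q - k||^2) / 2.

```python
import numpy as np

# Hypothetical illustration: factor the unnormalized scaled dot-product
# attention weight into an RBF similarity term and per-vector magnitude terms.
d = 8
rng = np.random.default_rng(0)
q, k = rng.normal(size=d), rng.normal(size=d)

# Standard unnormalized attention weight before the softmax normalization.
dot_weight = np.exp(q @ k / np.sqrt(d))

# RBF kernel on (q, k): similarity between the two instances.
rbf = np.exp(-np.sum((q - k) ** 2) / (2 * np.sqrt(d)))
# Exponential L^2 norms: individual importance of q and of k.
mag_q = np.exp(np.sum(q ** 2) / (2 * np.sqrt(d)))
mag_k = np.exp(np.sum(k ** 2) / (2 * np.sqrt(d)))

# The product of similarity and magnitude terms recovers the original weight.
assert np.isclose(dot_weight, rbf * mag_q * mag_k)
```

Replacing the RBF factor with a learned kernel (via its spectral density) and the squared L^2 norms with L^p norms is, in this reading, what yields the generalized attention studied in the paper.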


