Transformer Dissection: A Unified Understanding of Transformer's Attention via the Lens of Kernel

08/30/2019
by Yao-Hung Hubert Tsai, et al.

Transformer is a powerful architecture that achieves superior performance on various sequence learning tasks, including neural machine translation, language understanding, and sequence prediction. At the core of the Transformer is the attention mechanism, which concurrently processes all inputs in the streams. In this paper, we present a new formulation of attention via the lens of the kernel. To be more precise, we realize that attention can be seen as applying a kernel smoother over the inputs, with the kernel scores being the similarities between inputs. This new formulation gives us a better way to understand individual components of the Transformer's attention, such as how to better integrate the positional embedding. Another important advantage of our kernel-based formulation is that it paves the way to a larger space of composing Transformer's attention. As an example, we propose a new variant of Transformer's attention that models the input as a product of symmetric kernels. This approach achieves competitive performance to the current state-of-the-art model with less computation. In our experiments, we empirically study different kernel construction strategies on two widely used tasks: neural machine translation and sequence prediction.
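The kernel-smoother view of attention can be made concrete with a short sketch. The NumPy snippet below is a minimal illustration, not the authors' implementation: the exponentiated scaled dot-product kernel and the function name are assumptions. It shows that softmax attention computes each output as a kernel-weighted average of the value vectors, with kernel scores given by query-key similarities.

```python
import numpy as np

def kernel_smoother_attention(queries, keys, values):
    """Attention viewed as a kernel smoother (illustrative sketch only).

    Each output is a kernel-weighted average of the value vectors, where the
    kernel score k(q_i, k_j) measures the similarity between a query and a key.
    With the exponentiated scaled dot-product kernel assumed here, this reduces
    to standard softmax attention.
    """
    d = queries.shape[-1]
    # Kernel scores: k(q_i, k_j) = exp(<q_i, k_j> / sqrt(d))  (assumed kernel choice)
    scores = np.exp(queries @ keys.T / np.sqrt(d))
    # Normalize each row so the weights sum to 1 (the kernel-smoother weights).
    weights = scores / scores.sum(axis=-1, keepdims=True)
    # Smooth the values with the kernel weights.
    return weights @ values

# Example usage with random inputs (3 queries, 5 key/value pairs, dimension 4).
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4)), rng.normal(size=(5, 4)), rng.normal(size=(5, 4))
print(kernel_smoother_attention(Q, K, V).shape)  # (3, 4)
```

Swapping in a different kernel score (for example, one that composes content and positional similarity differently) changes only the `scores` line, which is what makes the larger design space mentioned above easy to explore.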


Related research

03/03/2021
Random Feature Attention
Transformers are state-of-the-art models for a variety of sequence model...

02/24/2020
Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation
Transformer-based models have brought a radical change to neural machine...

03/16/2020
TRANS-BLSTM: Transformer with Bidirectional LSTM for Language Understanding
Bidirectional Encoder Representations from Transformers (BERT) has recen...

11/07/2020
Rethinking the Value of Transformer Components
Transformer becomes the state-of-the-art translation model, while it is ...

11/23/2021
SimpleTron: Eliminating Softmax from Attention Computation
In this paper, we propose that the dot product pairwise matching attenti...

11/06/2022
Parallel Attention Forcing for Machine Translation
Attention-based autoregressive models have achieved state-of-the-art per...

06/02/2021
Transformers are Deep Infinite-Dimensional Non-Mercer Binary Kernel Machines
Despite their ubiquity in core AI fields like natural language processin...
