DA-Transformer: Distance-aware Transformer

10/14/2020
by Chuhan Wu, et al.

Transformer has achieved great success in NLP as the backbone of advanced models such as BERT and GPT. However, Transformer and its existing variants may not be optimal at capturing token distances, because the position or distance embeddings they use usually cannot preserve precise real-distance information, which limits their ability to model the order of and relations among contexts. In this paper, we propose DA-Transformer, a distance-aware Transformer that exploits real token distances. We incorporate the real distances between tokens to re-scale the raw self-attention weights, which are computed from the relevance between the attention query and key. Concretely, in each self-attention head the relative distance between every pair of tokens is weighted by a head-specific learnable parameter, which controls that head's preference for long- or short-range information. Since the raw weighted distances may not be optimal for adjusting self-attention weights, we propose a learnable sigmoid function that maps them to re-scaled coefficients with proper ranges. We first clip the raw self-attention weights with the ReLU function to keep them non-negative and introduce sparsity, and then multiply them by the re-scaled coefficients to encode real distance information into self-attention. Extensive experiments on five benchmark datasets show that DA-Transformer can effectively improve performance on many tasks and outperform the vanilla Transformer and several of its variants.
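
The abstract describes the full re-scaling pipeline: head-specific weighting of real token distances, a learnable sigmoid that maps the weighted distances into a proper range, ReLU clipping of the raw query-key scores, and an element-wise product of the two. The sketch below illustrates one such attention head in PyTorch. It is a minimal illustration rather than the authors' implementation: the exact parameterization of the learnable sigmoid, the row-wise normalization at the end, and all names (DistanceAwareAttentionHead, dist_weight, sigmoid_param) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DistanceAwareAttentionHead(nn.Module):
    """One self-attention head whose scores are re-scaled by real token distances (sketch)."""

    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_head)
        self.k_proj = nn.Linear(d_model, d_head)
        self.v_proj = nn.Linear(d_model, d_head)
        # Head-specific weight on the real distance; its sign and magnitude
        # decide whether this head favors short- or long-range information.
        self.dist_weight = nn.Parameter(torch.zeros(1))
        # Parameter of the learnable sigmoid used to re-scale weighted
        # distances into coefficients with a proper range (assumed form).
        self.sigmoid_param = nn.Parameter(torch.zeros(1))
        self.scale = d_head ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        raw = q @ k.transpose(-2, -1) * self.scale           # (B, L, L) raw query-key scores

        # Real relative distance |i - j| for every pair of tokens.
        pos = torch.arange(x.size(1), device=x.device, dtype=x.dtype)
        dist = (pos[None, :] - pos[:, None]).abs()           # (L, L)

        # Weight the distances per head, then map them through a learnable
        # sigmoid so the coefficients lie in a bounded, head-dependent range.
        s = self.sigmoid_param
        coeff = (1 + torch.exp(s)) * torch.sigmoid(self.dist_weight * dist - s)

        # Clip the raw scores with ReLU (non-negative, sparse), multiply by
        # the distance coefficients, and normalize row-wise.
        scores = F.relu(raw) * coeff
        attn = scores / scores.sum(dim=-1, keepdim=True).clamp_min(1e-6)
        return attn @ v
```

As a quick check, DistanceAwareAttentionHead(d_model=256, d_head=64)(torch.randn(2, 10, 256)) returns a tensor of shape (2, 10, 64).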

Related research

Smart Bird: Learnable Sparse Attention for Efficient and Effective Transformer (08/20/2021)
Transformer has achieved great success in NLP. However, the quadratic co...

Quantifying Attention Flow in Transformers (05/02/2020)
In the Transformer model, "self-attention" combines information from att...

Conditional Self-Attention for Query-based Summarization (02/18/2020)
Self-attention mechanisms have achieved great success on a variety of NL...

Graph Self-Attention for learning graph representation with Transformer (01/30/2022)
We propose a novel Graph Self-Attention module to enable Transformer mod...

BISON: BM25-weighted Self-Attention Framework for Multi-Fields Document Search (07/10/2020)
Recent breakthrough in natural language processing has advanced the info...

Attention Module is Not Only a Weight: Analyzing Transformers with Vector Norms (04/21/2020)
Because attention modules are core components of Transformer-based model...

KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation (05/20/2022)
Relative positional embeddings (RPE) have received considerable attentio...
