Adaptive Attention Span in Transformers

05/19/2019
by Sainbayar Sukhbaatar, et al.

We propose a novel self-attention mechanism that can learn its optimal attention span. This allows us to significantly extend the maximum context size used in Transformers while maintaining control over their memory footprint and computation time. We show the effectiveness of our approach on character-level language modeling, where we achieve state-of-the-art performance on text8 and enwik8 using a maximum context of 8k characters.
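In the paper, each attention head learns its own span z: a soft masking function m_z(x) = clip((R + z - x) / R, 0, 1) of the query-key distance x is applied to the attention weights, which are then renormalized, so heads that only need local context stay cheap. The sketch below illustrates this masking for a single causal head. The soft_mask formula follows the paper; the NumPy attention plumbing around it (function names, the ramp default, the scalar span) is an illustrative assumption, not the authors' implementation.

```python
import numpy as np

def soft_mask(distance, span, ramp=32):
    # Soft masking function m_z(x) = clip((R + z - x) / R, 0, 1):
    # 1 for distances within the learned span z, linear decay to 0 over a ramp of length R.
    return np.clip((ramp + span - distance) / ramp, 0.0, 1.0)

def adaptive_span_attention(scores, span, ramp=32):
    """Apply a learned-span soft mask to causal attention logits and renormalize.

    scores: (T, T) attention logits for a single head, where scores[t, s]
            scores how strongly position t attends to position s (s <= t).
    span:   learned span z for this head (a scalar here; trained jointly
            with the model in the paper).
    """
    T = scores.shape[0]
    dist = np.arange(T)[:, None] - np.arange(T)[None, :]   # query-key distance
    causal = dist >= 0                                      # no attention to future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True)) * causal
    weights = weights * soft_mask(dist, span, ramp)
    return weights / (weights.sum(axis=-1, keepdims=True) + 1e-8)

# Example: with span=4 and ramp=8, keys more than ~12 steps back get near-zero weight.
attn = adaptive_span_attention(np.random.randn(16, 16), span=4, ramp=8)
print(attn.shape, attn[-1, :4])
```

Because the mask is differentiable in z, the span can be trained with a penalty on its size, letting most heads shrink their context while a few grow it toward the 8k-character maximum.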

Related research

Longformer: The Long-Document Transformer (04/10/2020)
Transformer-based models are unable to process long sequences due to the...

Not All Memories are Created Equal: Learning to Forget by Expiring (05/13/2021)
Attention mechanisms have shown promising results in sequence modeling t...

Sum-Product-Attention Networks: Leveraging Self-Attention in Probabilistic Circuits (09/14/2021)
Probabilistic circuits (PCs) have become the de-facto standard for learn...

Robustify Transformers with Robust Kernel Density Estimation (10/11/2022)
Recent advances in Transformer architecture have empowered its empirical...

ASpanFormer: Detector-Free Image Matching with Adaptive Span Transformer (08/30/2022)
Generating robust and reliable correspondences across images is a fundam...

Single Headed Attention RNN: Stop Thinking With Your Head (11/26/2019)
The leading approaches in language modeling are all obsessed with TV sho...

DepthFormer: Multimodal Positional Encodings and Cross-Input Attention for Transformer-Based Segmentation Networks (11/08/2022)
Most approaches for semantic segmentation use only information from colo...
