Staircase Attention for Recurrent Processing of Sequences

06/08/2021
by Da Ju, et al.

Attention mechanisms have become a standard tool for sequence modeling tasks, in particular by stacking self-attention layers over the entire input sequence as in the Transformer architecture. In this work we introduce a novel attention procedure called staircase attention that, unlike self-attention, operates across the sequence (in time), recurrently processing the input by adding a further step of processing for each new part of the sequence. A step in the staircase comprises backward tokens (encoding the sequence seen so far) and forward tokens (ingesting a new part of the sequence); in the extreme Ladder version, the forward step is zero and the model simply repeats the Transformer on each step of the ladder, sharing the weights. We thus describe a family of such models that can trade off performance and compute by increasing the amount of recurrence through time, the amount of sequential processing via recurrence in depth, or both. Thanks to this recurrence, staircase attention can solve tracking tasks that conventional Transformers cannot. Further, for the same model size (number of parameters), it provides improved modeling power compared to self-attentive Transformers on large language modeling and dialogue tasks, yielding significant perplexity gains.
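To make the recurrence described in the abstract concrete, the PyTorch sketch below processes an embedded sequence chunk by chunk with a single shared Transformer encoder: at each step, the carried-over states (backward tokens) are concatenated with a new chunk of input (forward tokens) and re-encoded, and the step's output is carried forward. This is a minimal illustrative sketch, not the authors' implementation; the names StaircaseSketch, chunk_size, and max_backward are assumptions introduced here, and the Ladder variant (re-applying the shared block with no new forward tokens) is only described, not implemented.

import torch
import torch.nn as nn

class StaircaseSketch(nn.Module):
    # Minimal sketch of staircase-style recurrent processing (illustrative,
    # not the paper's implementation). One shared Transformer block is applied
    # repeatedly: each step sees the carried "backward" tokens plus a new
    # "forward" chunk of the input.
    def __init__(self, d_model=64, nhead=4, num_layers=2,
                 chunk_size=16, max_backward=16):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.block = nn.TransformerEncoder(layer, num_layers)  # weights shared across all steps
        self.chunk_size = chunk_size      # forward tokens ingested per step (assumed value)
        self.max_backward = max_backward  # backward tokens carried to the next step (assumed value)

    def forward(self, x):  # x: (batch, seq_len, d_model), already embedded
        carried = x[:, :0, :]             # start with an empty backward state
        outputs = []
        for start in range(0, x.size(1), self.chunk_size):
            forward_tokens = x[:, start:start + self.chunk_size, :]
            step_input = torch.cat([carried, forward_tokens], dim=1)
            step_output = self.block(step_input)                  # recurrence in time and depth
            outputs.append(step_output[:, -forward_tokens.size(1):, :])
            carried = step_output[:, -self.max_backward:, :]      # encodes the sequence seen so far
        return torch.cat(outputs, dim=1)

# Usage: encode a 64-token sequence in 16-token forward steps.
model = StaircaseSketch()
hidden = model(torch.randn(2, 64, 64))
print(hidden.shape)  # torch.Size([2, 64, 64])

Because the same block is reused at every step, increasing the number of steps trades extra compute for more recurrence without adding parameters, which is the performance/compute trade-off the abstract refers to.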

research · 04/04/2021
TransfoRNN: Capturing the Sequential Information in Self-Attention Representations for Language Modeling
In this paper, we describe the use of recurrent neural networks to captu...

research · 05/22/2023
FIT: Far-reaching Interleaved Transformers
We present FIT: a transformer-based architecture with efficient self-att...

research · 07/05/2023
LongNet: Scaling Transformers to 1,000,000,000 Tokens
Scaling sequence length has become a critical demand in the era of large...

research · 07/10/2018
Universal Transformers
Self-attentive feed-forward sequence models have been shown to achieve i...

research · 03/11/2022
Block-Recurrent Transformers
We introduce the Block-Recurrent Transformer, which applies a transforme...

research · 10/01/2019
Dialogue Transformers
We introduce a dialogue policy based on a transformer architecture, wher...

research · 06/04/2021
Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning
We challenge a common assumption underlying most supervised deep learnin...
