OmniNet: Omnidirectional Representations from Transformers

03/01/2021
by Yi Tay, et al.

This paper proposes Omnidirectional Representations from Transformers (OmniNet). In OmniNet, instead of maintaining a strictly horizontal receptive field, each token is allowed to attend to all tokens in the entire network. This process can also be interpreted as a form of extreme or intensive attention with a receptive field spanning the entire width and depth of the network. To this end, the omnidirectional attention is learned via a meta-learner, which is essentially another self-attention based model. To mitigate the high computational cost of a full receptive field, we leverage efficient self-attention models such as kernel-based attention (Choromanski et al.), low-rank attention (Wang et al.), and/or Big Bird (Zaheer et al.) as the meta-learner. Extensive experiments are conducted on autoregressive language modeling (LM1B, C4), machine translation, Long Range Arena (LRA), and image recognition. The experiments show that OmniNet achieves considerable improvements across these tasks, including state-of-the-art performance on LM1B, WMT'14 En-De/En-Fr, and Long Range Arena. Moreover, using omnidirectional representations in Vision Transformers leads to significant improvements on image recognition in both few-shot learning and fine-tuning setups.
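As a rough illustration of the idea, the sketch below implements a simplified omnidirectional attention step in PyTorch: hidden states from every layer are flattened into one long memory sequence, and a meta-learner attends over it so each token can see the full width and depth of the network. This is only a minimal sketch, not the paper's implementation: the meta-learner here is plain multi-head attention rather than an efficient variant (Performer, Linformer, or BigBird), and all class, argument, and variable names are hypothetical.

```python
# Minimal sketch of omnidirectional attention, assuming PyTorch.
# Names and dimensions are illustrative, not from the official OmniNet code.
import torch
import torch.nn as nn

class OmniAttentionSketch(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
             for _ in range(n_layers)]
        )
        # Meta-learner: attends over the hidden states of *all* layers.
        # The paper uses an efficient attention model here; plain
        # multi-head attention is used below only for brevity.
        self.meta_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        h, all_states = x, []
        for layer in self.layers:
            h = layer(h)
            all_states.append(h)
        # Flatten depth into the sequence axis: (batch, n_layers * seq_len, d_model).
        memory = torch.cat(all_states, dim=1)
        # Each token of the top layer attends to every token at every layer.
        omni, _ = self.meta_attn(h, memory, memory)
        return h + omni  # fuse the omnidirectional representation with the top layer

# Quick shape check.
model = OmniAttentionSketch()
print(model(torch.randn(2, 32, 256)).shape)  # torch.Size([2, 32, 256])
```

Because the memory grows linearly with depth (n_layers * seq_len tokens), replacing the meta-learner's quadratic attention with a kernel-based, low-rank, or sparse variant is what keeps the full receptive field tractable in practice.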


