Synthesizer: Rethinking Self-Attention in Transformer Models

05/02/2020
by Yi Tay et al.

Dot-product self-attention is known to be central and indispensable to state-of-the-art Transformer models. But is it really required? This paper investigates the true importance and contribution of the dot-product-based self-attention mechanism to the performance of Transformer models. Via extensive experiments, we find that (1) random alignment matrices surprisingly perform quite competitively and (2) learning attention weights from token-token (query-key) interactions is not that important after all. To this end, we propose Synthesizer, a model that learns synthetic attention weights without token-token interactions. Our experimental results show that Synthesizer is competitive against vanilla Transformer models across a range of tasks, including machine translation (EnDe, EnFr), language modeling (LM1B), abstractive summarization (CNN/Dailymail), dialogue generation (PersonaChat) and multi-task language understanding (GLUE, SuperGLUE).

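The abstract contrasts two synthetic-attention ideas: attention weights predicted from each token on its own (no query-key dot products) and a random alignment matrix shared across all inputs. The sketch below illustrates both, assuming PyTorch; the module names, the two-layer MLP parameterization, and the single-head, fixed-maximum-length setup are simplifying assumptions for illustration, not the authors' implementation.

```python
# A minimal sketch (not the authors' code) of the two synthetic-attention ideas
# described in the abstract. Shapes and parameterizations are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseSynthesizerAttention(nn.Module):
    """Predicts a (seq_len x seq_len) attention matrix from each token alone,
    with no query-key dot products."""

    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        # Two-layer MLP maps each token to a row of attention logits.
        self.proj = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, max_len),
        )
        self.value = nn.Linear(d_model, d_model)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        seq_len = x.size(1)
        logits = self.proj(x)[:, :, :seq_len]  # (batch, seq_len, seq_len)
        attn = F.softmax(logits, dim=-1)
        return attn @ self.value(x)            # (batch, seq_len, d_model)


class RandomSynthesizerAttention(nn.Module):
    """Uses a single learned (or frozen) random alignment matrix shared
    across all inputs -- no token-token interaction at all."""

    def __init__(self, d_model: int, max_len: int, trainable: bool = True):
        super().__init__()
        self.logits = nn.Parameter(torch.randn(max_len, max_len),
                                   requires_grad=trainable)
        self.value = nn.Linear(d_model, d_model)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        seq_len = x.size(1)
        attn = F.softmax(self.logits[:seq_len, :seq_len], dim=-1)
        return attn @ self.value(x)            # broadcasts over the batch
```

Either module can stand in for scaled dot-product attention inside a Transformer block in this sketch; keeping the random variant frozen (trainable=False) corresponds to the abstract's observation that even random alignment matrices perform surprisingly competitively.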