Attention Is Not All You Need Anymore

08/15/2023
by Zhe Chen, et al.

In recent years, the popular Transformer architecture has achieved great success in many application areas, including natural language processing and computer vision. Many existing works aim to reduce the computational and memory complexity of the self-attention mechanism in the Transformer at the cost of performance. However, performance is key to the continuing success of the Transformer. In this paper, a family of drop-in replacements for the self-attention mechanism in the Transformer, called Extractors, is proposed. Four types of Extractor are proposed as examples: the super high-performance Extractor (SHE), the higher-performance Extractor (HE), the worthwhile Extractor (WE), and the minimalist Extractor (ME). Experimental results show that replacing the self-attention mechanism with the SHE clearly improves the performance of the Transformer, whereas the simplified versions of the SHE, i.e., the HE, the WE, and the ME, perform close to or better than the self-attention mechanism with lower computational and memory complexity. Furthermore, the proposed Extractors can potentially run faster than the self-attention mechanism, since their critical paths of computation are much shorter. Additionally, the sequence prediction problem in the context of text generation is formulated using variable-length discrete-time Markov chains, and the Transformer is reviewed based on this understanding.
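The abstract does not spell out the Extractors' internals, but the notion of a drop-in replacement can be made concrete: any module that maps a (batch, sequence, d_model) tensor to a tensor of the same shape, and that respects causality for text generation, can occupy the slot of the self-attention sublayer in a standard Transformer block. The PyTorch sketch below illustrates only that interface; the SimpleTokenMixer class and its learnable causal mixing matrix are hypothetical stand-ins, not the paper's SHE, HE, WE, or ME.

import torch
import torch.nn as nn


class SimpleTokenMixer(nn.Module):
    # Hypothetical stand-in for the attention sublayer: same input/output
    # shape as multi-head self-attention, i.e. (batch, seq_len, d_model).
    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        # Learnable position-mixing matrix (illustrative only, not the Extractor).
        self.mix = nn.Parameter(torch.randn(max_len, max_len) * 0.02)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        t = x.size(1)
        # A lower-triangular mask keeps the mixing causal for text generation.
        causal = torch.tril(self.mix[:t, :t])
        return self.proj(causal @ x)


class Block(nn.Module):
    # Pre-norm Transformer block; the attention sublayer is the swap point.
    def __init__(self, d_model: int, max_len: int, d_ff: int = 2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.mixer = SimpleTokenMixer(d_model, max_len)  # replaces nn.MultiheadAttention
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.mixer(self.norm1(x))
        return x + self.ffn(self.norm2(x))

In this sketch, Block(d_model=512, max_len=1024) can be stacked exactly where standard decoder blocks would be; the paper's actual Extractor computations are defined in the full text.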


