Pay Attention when Required

09/09/2020
by Swetha Mandava, et al.

Transformer-based models consist of interleaved feed-forward blocks, which capture content meaning, and relatively more expensive self-attention blocks, which capture context meaning. In this paper, we explored trade-offs and ordering of the blocks to improve upon the current Transformer architecture and proposed the PAR Transformer. It needs 35% lower compute time than Transformer-XL, achieved by replacing ~63% of self-attention blocks with feed-forward blocks, and retains the perplexity on the WikiText-103 language modelling benchmark. We further validated our results on the text8 and enwiki8 datasets, as well as on the BERT model.
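
The idea of trading self-attention blocks for feed-forward blocks can be sketched in a few lines of code. The PyTorch snippet below is an illustrative sketch, not the authors' implementation: it builds a transformer stack whose per-layer block type follows an explicit pattern string, so a PAR-style model keeps only a fraction of the attention blocks. The class names, dimensions, and example patterns are assumptions chosen for illustration, not the pattern found by the paper's architecture search.

```python
# Minimal sketch (assumed names and sizes) of a transformer stack whose
# block ordering is specified by a pattern string: 's' = self-attention,
# 'f' = feed-forward.

import torch
import torch.nn as nn


class FeedForwardBlock(nn.Module):
    """Position-wise feed-forward block (captures content meaning)."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        # Pre-norm residual connection.
        return x + self.ff(self.norm(x))


class SelfAttentionBlock(nn.Module):
    """Multi-head self-attention block (captures context meaning)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out


class PatternTransformer(nn.Module):
    """Stacks blocks in the order given by a pattern string such as 'sffsff'."""

    def __init__(self, pattern: str, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.blocks = nn.ModuleList(
            [
                SelfAttentionBlock(d_model, n_heads) if c == "s" else FeedForwardBlock(d_model, d_ff)
                for c in pattern
            ]
        )

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x


# A baseline Transformer alternates attention and feed-forward blocks;
# a PAR-style pattern keeps far fewer attention blocks for the same depth.
baseline = PatternTransformer("sf" * 6)          # 6 attention + 6 feed-forward blocks
par_style = PatternTransformer("sffsffsffsff")   # 4 attention + 8 feed-forward blocks
x = torch.randn(2, 16, 512)                      # (batch, sequence, d_model)
print(par_style(x).shape)                        # torch.Size([2, 16, 512])
```

Because the feed-forward blocks avoid the quadratic cost of attention over the sequence length, shifting the block mix in this way is what lets a PAR-style stack reduce compute while keeping depth constant.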
