Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing

06/05/2020
by   Zihang Dai, et al.

With the success of language pretraining, it is highly desirable to develop more efficient architectures with good scalability that can exploit the abundant unlabeled data at a lower cost. To improve efficiency, we examine the much-overlooked redundancy in maintaining a full-length token-level representation, especially for tasks that only require a single-vector representation of the sequence. With this intuition, we propose Funnel-Transformer, which gradually compresses the sequence of hidden states into a shorter one and hence reduces the computation cost. More importantly, by re-investing the saved FLOPs from length reduction in constructing a deeper or wider model, we further improve the model capacity. In addition, to perform token-level predictions as required by common pretraining objectives, Funnel-Transformer can recover a deep representation for each token from the reduced hidden sequence via a decoder. Empirically, with comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks, including text classification, language understanding, and reading comprehension. The code and pretrained checkpoints are available at https://github.com/laiguokun/Funnel-Transformer.
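
To make the compress-then-decode idea concrete, below is a minimal sketch of the general scheme described in the abstract: stacks of Transformer layers separated by strided pooling that shortens the hidden sequence, plus a shallow decoder that up-samples the compressed states back to full length for token-level prediction. This is not the authors' implementation; the class name FunnelSketch, the layer counts, the use of mean pooling, repeat-based up-sampling, and the standard nn.TransformerEncoderLayer (in place of the paper's pool-query-only attention) are illustrative assumptions.

```python
# Minimal sketch of the Funnel-Transformer idea (not the authors' implementation):
# blocks of Transformer layers separated by strided pooling that halves the
# sequence length, plus a simple up-sampling decoder for token-level outputs.
import torch
import torch.nn as nn


class FunnelSketch(nn.Module):
    def __init__(self, d_model=256, nhead=4, layers_per_block=2, num_blocks=3):
        super().__init__()

        def block(n):
            return nn.ModuleList(
                nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
                for _ in range(n)
            )

        self.blocks = nn.ModuleList(block(layers_per_block) for _ in range(num_blocks))
        self.pool = nn.AvgPool1d(kernel_size=2, stride=2, ceil_mode=True)
        self.decoder = block(layers_per_block)  # shallow decoder for token-level tasks

    def forward(self, x):                       # x: (batch, seq_len, d_model)
        first_block_out = None
        for i, blk in enumerate(self.blocks):
            if i > 0:                           # halve the sequence between blocks
                x = self.pool(x.transpose(1, 2)).transpose(1, 2)
            for layer in blk:
                x = layer(x)
            if i == 0:
                first_block_out = x             # kept for the decoder's skip connection
        compressed = x                          # (batch, seq_len / 2**(num_blocks-1), d_model)

        # Decoder: up-sample the compressed sequence back to full length by
        # repetition, add the full-length first-block states, then refine.
        up = compressed.repeat_interleave(2 ** (len(self.blocks) - 1), dim=1)
        up = up[:, : first_block_out.size(1)]   # trim padding from ceil_mode pooling
        h = up + first_block_out
        for layer in self.decoder:
            h = layer(h)
        return compressed, h                    # sequence-level and token-level states


if __name__ == "__main__":
    model = FunnelSketch()
    tokens = torch.randn(2, 128, 256)            # pretend token embeddings
    compressed, token_level = model(tokens)
    print(compressed.shape, token_level.shape)   # (2, 32, 256) (2, 128, 256)
```

The saving comes from running most of the layers on the shortened sequence; the paper re-invests those saved FLOPs in additional depth or width, and the decoder is only attached when a pretraining objective or downstream task needs per-token outputs.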

Related research

03/24/2022 · Token Dropping for Efficient BERT Pretraining
Transformer-based models generally allocate the same amount of computati...

06/30/2021 · ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information
Recent pretraining models in Chinese neglect two important aspects speci...

10/14/2020 · Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime with Search
Although transformers have achieved impressive accuracies in various tas...

12/31/2020 · AraELECTRA: Pre-Training Text Discriminators for Arabic Language Understanding
Advances in English language representation enabled a more sample-effici...

05/24/2022 · History Compression via Language Models in Reinforcement Learning
In a partially observable Markov decision process (POMDP), an agent typi...

06/26/2023 · Pretraining task diversity and the emergence of non-Bayesian in-context learning for regression
Pretrained transformers exhibit the remarkable ability of in-context lea...

05/20/2023 · Autoregressive Modeling with Lookahead Attention
To predict the next token, autoregressive models ordinarily examine the ...
