Finetuning Pretrained Transformers into RNNs

by Jungo Kasai, et al.

Transformers have outperformed recurrent neural networks (RNNs) in natural language generation. This comes at a significant computational cost, as the attention mechanism's complexity scales quadratically with sequence length. Efficient transformer variants have received increasing interest in recent work. Among them, linear-complexity recurrent variants have proven well suited to autoregressive generation. They approximate the softmax attention with randomized or heuristic feature maps, but can be difficult to train and may yield suboptimal accuracy. This work aims to convert a pretrained transformer into its efficient recurrent counterpart, improving efficiency while retaining accuracy. Specifically, we propose a swap-then-finetune procedure: in an off-the-shelf pretrained transformer, we replace the softmax attention with its linear-complexity recurrent alternative and then finetune. With a learned feature map, our approach provides an improved tradeoff between efficiency and accuracy over the standard transformer and other recurrent variants. We also show that the finetuning process requires a lower training cost than training these recurrent variants from scratch. As many models for natural language tasks increasingly depend on large-scale pretrained transformers, this work presents a viable approach to improving inference efficiency without repeating the expensive pretraining process.
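The "linear-complexity recurrent alternative" above can be sketched in code: once attention weights are computed from feature maps phi(q) and phi(k) instead of a softmax, the causal attention output can be maintained as a constant-size running state, i.e. an RNN. The sketch below uses the heuristic elu(x)+1 feature map from prior linear-attention work; the paper instead learns the feature map during finetuning. Function and variable names (`linear_attention`, `S`, `z`) are illustrative, not from the paper.

```python
import numpy as np

def elu_feature_map(x):
    # Heuristic positive feature map phi(x) = elu(x) + 1.
    # (The paper's approach would replace this with a small learned map.)
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v):
    """Causal attention with feature maps, computed as an RNN.

    q, k: (T, d) query/key arrays; v: (T, d_v) value array.
    Instead of a (T, T) attention matrix, we carry a running
    outer-product state S (d_feat x d_v) and normalizer z (d_feat,),
    so each step costs O(d_feat * d_v) and memory is constant in T.
    """
    phi_q, phi_k = elu_feature_map(q), elu_feature_map(k)
    T, d_v = v.shape
    d_feat = phi_q.shape[1]
    S = np.zeros((d_feat, d_v))   # running sum of phi(k_s) v_s^T
    z = np.zeros(d_feat)          # running sum of phi(k_s)
    out = np.zeros((T, d_v))
    for t in range(T):
        S += np.outer(phi_k[t], v[t])
        z += phi_k[t]
        # Numerator/denominator of the attention average at step t.
        out[t] = phi_q[t] @ S / (phi_q[t] @ z + 1e-6)
    return out
```

This recurrence produces exactly the same outputs as the quadratic form with a causal mask (weights proportional to phi(q_t)·phi(k_s) for s ≤ t), which is what makes the swap a drop-in replacement at generation time.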




Related papers

- Linearizing Transformer with Key-Value Memory Bank: "Transformer has brought great success to a wide range of natural languag..."
- Neural Architecture Search on Efficient Transformers and Beyond: "Recently, numerous efficient Transformers have been proposed to reduce t..."
- Thinking Like Transformers: "What is the computational model behind a Transformer? Where recurrent ne..."
- On Learning the Transformer Kernel: "In this work we introduce KERNELIZED TRANSFORMER, a generic, scalable, d..."
- Predicting Attention Sparsity in Transformers: "A bottleneck in transformer architectures is their quadratic complexity ..."
- Investigating Efficiently Extending Transformers for Long Input Summarization: "While large pretrained Transformer models have proven highly capable at ..."
- A Practical Survey on Faster and Lighter Transformers: "Recurrent neural networks are effective models to process sequences. How..."

Code Repositories


Running massive simulations using RNNs on CPUs for building bots and all kinds of things.
