Finetuning Pretrained Transformers into RNNs

by Jungo Kasai, et al.

Transformers have outperformed recurrent neural networks (RNNs) in natural language generation. This comes at a significant computational cost, as the attention mechanism's complexity scales quadratically with sequence length. Efficient transformer variants have received increasing interest in recent work. Among them, linear-complexity recurrent variants have proven well suited to autoregressive generation. They approximate the softmax attention with randomized or heuristic feature maps, but can be difficult to train and may yield suboptimal accuracy. This work aims to convert a pretrained transformer into its efficient recurrent counterpart, improving efficiency while retaining accuracy. Specifically, we propose a swap-then-finetune procedure: in an off-the-shelf pretrained transformer, we replace the softmax attention with its linear-complexity recurrent alternative and then finetune. With a learned feature map, our approach provides an improved tradeoff between efficiency and accuracy over the standard transformer and other recurrent variants. We also show that the finetuning process requires a lower training cost than training these recurrent variants from scratch. As many models for natural language tasks increasingly depend on large-scale pretrained transformers, this work presents a viable approach to improving inference efficiency without repeating the expensive pretraining process.
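The "linear-complexity recurrent alternative" above can be sketched in code: once attention weights are computed from feature maps phi(q) and phi(k) instead of a softmax, the causal attention output can be maintained as a constant-size running state, i.e. an RNN. The sketch below uses the heuristic elu(x)+1 feature map from prior linear-attention work; the paper instead learns the feature map during finetuning. Function and variable names (`linear_attention`, `S`, `z`) are illustrative, not from the paper.

```python
import numpy as np

def elu_feature_map(x):
    # Heuristic positive feature map phi(x) = elu(x) + 1.
    # (The paper's approach would replace this with a small learned map.)
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v):
    """Causal attention with feature maps, computed as an RNN.

    q, k: (T, d) query/key arrays; v: (T, d_v) value array.
    Instead of a (T, T) attention matrix, we carry a running
    outer-product state S (d_feat x d_v) and normalizer z (d_feat,),
    so each step costs O(d_feat * d_v) and memory is constant in T.
    """
    phi_q, phi_k = elu_feature_map(q), elu_feature_map(k)
    T, d_v = v.shape
    d_feat = phi_q.shape[1]
    S = np.zeros((d_feat, d_v))   # running sum of phi(k_s) v_s^T
    z = np.zeros(d_feat)          # running sum of phi(k_s)
    out = np.zeros((T, d_v))
    for t in range(T):
        S += np.outer(phi_k[t], v[t])
        z += phi_k[t]
        # Numerator/denominator of the attention average at step t.
        out[t] = phi_q[t] @ S / (phi_q[t] @ z + 1e-6)
    return out
```

This recurrence produces exactly the same outputs as the quadratic form with a causal mask (weights proportional to phi(q_t)·phi(k_s) for s ≤ t), which is what makes the swap a drop-in replacement at generation time.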




Related papers

- Linearizing Transformer with Key-Value Memory Bank: "Transformer has brought great success to a wide range of natural languag..."
- Neural Architecture Search on Efficient Transformers and Beyond: "Recently, numerous efficient Transformers have been proposed to reduce t..."
- Thinking Like Transformers: "What is the computational model behind a Transformer? Where recurrent ne..."
- On Learning the Transformer Kernel: "In this work we introduce KERNELIZED TRANSFORMER, a generic, scalable, d..."
- Predicting Attention Sparsity in Transformers: "A bottleneck in transformer architectures is their quadratic complexity ..."
- Investigating Efficiently Extending Transformers for Long Input Summarization: "While large pretrained Transformer models have proven highly capable at ..."
- A Practical Survey on Faster and Lighter Transformers: "Recurrent neural networks are effective models to process sequences. How..."

Code Repositories


Running massive simulations using RNNs on CPUs for building bots and all kinds of things.
