Transkimmer: Transformer Learns to Layer-wise Skim

05/15/2022
by   Yue Guan, et al.

The Transformer architecture has become the de facto model for many machine learning tasks, from natural language processing to computer vision. As such, improving its computational efficiency is paramount. One of the major computational inefficiencies of Transformer-based models is that they spend an identical amount of computation on every token throughout all layers. Prior works have proposed to augment the Transformer model with the ability to skim tokens to improve its computational efficiency, but they lack effective end-to-end optimization of the discrete skimming predictor. To address this limitation, we propose the Transkimmer architecture, which learns to identify hidden-state tokens that are not required by each layer. The skimmed tokens are forwarded directly to the final output, reducing the computation of all successive layers. The key idea in Transkimmer is to add a parameterized predictor before each layer that learns to make the skimming decision. We also adopt the reparameterization trick and add a skim loss to train Transkimmer end to end. Transkimmer achieves a 10.97x average speedup on the GLUE benchmark over the vanilla BERT-base baseline with less than 1% accuracy degradation.
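The abstract describes the mechanism only at a high level: a parameterized skim predictor sits before each Transformer layer, its discrete keep/skim decision is made differentiable with a reparameterization trick (a straight-through Gumbel-softmax is the usual choice), and a skim loss rewards dropping tokens. The PyTorch sketch below illustrates that idea under stated assumptions; the names (SkimPredictor, skim_loss, layer_with_skimming), the two-layer MLP predictor, and the exact form of the skim loss are illustrative and not taken from the paper's implementation.

```python
# Illustrative sketch only (assumptions noted above), not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SkimPredictor(nn.Module):
    """Parameterized predictor placed before a Transformer layer.

    Emits a hard 0/1 keep mask per token, sampled with the straight-through
    Gumbel-softmax so the discrete decision still passes gradients.
    """

    def __init__(self, hidden_size: int):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, 2),  # logits for (skim, keep)
        )

    def forward(self, hidden_states: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        logits = self.classifier(hidden_states)               # (batch, seq, 2)
        sample = F.gumbel_softmax(logits, tau=tau, hard=True)
        return sample[..., 1]                                 # 1 = keep, 0 = skim


def skim_loss(keep_masks):
    """One simple skim regularizer: mean fraction of kept tokens per layer.

    Minimizing it (weighted against the task loss) pushes layers to skim more.
    """
    return torch.stack([m.mean() for m in keep_masks]).mean()


def layer_with_skimming(layer, predictor, hidden_states):
    """Apply one encoder layer with token skimming.

    `layer` is any callable mapping (batch, seq, hidden) -> (batch, seq, hidden).
    During training the mask is applied multiplicatively so tensor shapes stay
    fixed; skimmed tokens simply carry their input forward unchanged.
    """
    keep_mask = predictor(hidden_states)                      # (batch, seq)
    updated = layer(hidden_states)
    mask = keep_mask.unsqueeze(-1)
    return mask * updated + (1.0 - mask) * hidden_states, keep_mask
```

At inference time the speedup comes from actually removing skimmed tokens from the sequence before the remaining layers run and forwarding them to the final output; the multiplicative masking above is only a shape-preserving training-time view, and the total objective would combine the task loss with the skim loss under some weighting, e.g. task_loss + lambda * skim_loss, where lambda is an assumed tuning knob.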

Related research

12/16/2021  Block-Skim: Efficient Question Answering for Transformer
Transformer models have achieved promising results on natural language p...

03/03/2022  Multi-Tailed Vision Transformer for Efficient Inference
Recently, Vision Transformer (ViT) has achieved promising performance in...

05/07/2023  Vcc: Scaling Transformers to 128K Tokens or More by Prioritizing Important Tokens
Transformer models are foundational to natural language processing (NLP)...

07/14/2022  Forming Trees with Treeformers
Popular models such as Transformers and LSTMs use tokens as its unit of ...

02/11/2020  Superbloom: Bloom filter meets Transformer
We extend the idea of word pieces in natural language models to machine ...

06/01/2021  DoT: An efficient Double Transformer for NLP tasks with tables
Transformer-based approaches have been successfully used to obtain state...

04/07/2021  FSR: Accelerating the Inference Process of Transducer-Based Models by Applying Fast-Skip Regularization
Transducer-based models, such as RNN-Transducer and transformer-transduc...
