AdapLeR: Speeding up Inference by Adaptive Length Reduction

by Ali Modarressi et al.

Pre-trained language models have shown stellar performance in various downstream tasks, but this usually comes at the cost of high latency and computation, hindering their use in resource-limited settings. In this work, we propose a novel approach for reducing the computational cost of BERT with minimal loss in downstream performance. Our method dynamically eliminates less-contributing tokens through the layers, resulting in shorter sequence lengths and, consequently, lower computational cost. To determine the importance of each token representation, we train a Contribution Predictor for each layer using a gradient-based saliency method. Our experiments on several diverse classification tasks show speedups of up to 22x during inference with little sacrifice in performance. We also validate the quality of the tokens selected by our method against human annotations from the ERASER benchmark. Compared to other widely used strategies for selecting important tokens, such as saliency and attention, our proposed method has a significantly lower false positive rate in generating rationales. Our code is freely available.
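The core idea in the abstract can be illustrated with a minimal sketch of the per-layer token-pruning step, assuming per-token contribution scores (as produced by a Contribution Predictor) are already available. The function name `prune_tokens`, the threshold scheme, and the toy scores below are illustrative assumptions, not the paper's actual implementation:

```python
def prune_tokens(token_ids, saliency, threshold, keep_first=True):
    """Keep tokens whose contribution score meets the layer's cutoff.

    token_ids : token ids present at the current layer
    saliency  : per-token contribution scores (same length as token_ids)
    threshold : layer-specific cutoff; tokens scoring below it are dropped
    keep_first: always retain the first token (e.g. [CLS]), which carries
                the classification representation
    """
    kept = []
    for i, (tok, score) in enumerate(zip(token_ids, saliency)):
        if (keep_first and i == 0) or score >= threshold:
            kept.append(tok)
    return kept

# Toy example: six tokens with mock contribution scores.
tokens = [101, 2023, 2003, 1037, 7099, 102]
scores = [0.05, 0.40, 0.10, 0.02, 0.90, 0.30]
print(prune_tokens(tokens, scores, threshold=0.25))
# -> [101, 2023, 7099, 102]
```

Applied layer by layer, this kind of filtering shortens the sequence as it moves through the network, which is where the reported inference speedup comes from.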

