QuaLA-MiniLM: a Quantized Length Adaptive MiniLM

10/31/2022
by Shira Guskin, et al.

Limited computational budgets often prevent transformers from being used in production and from benefiting from their high accuracy. Knowledge distillation addresses computational efficiency by self-distilling BERT into a smaller transformer with fewer layers and a smaller internal embedding. However, the performance of these models drops as the number of layers is reduced, notably in advanced NLP tasks such as span question answering. In addition, a separate model must be trained for each inference scenario with its distinct computational budget. Dynamic-TinyBERT tackles both limitations by partially applying the Length-Adaptive Transformer (LAT) technique to TinyBERT, achieving a 3x speedup over BERT-base with minimal accuracy loss. In this work, we extend the Dynamic-TinyBERT approach to generate a much more efficient model. We use MiniLM distillation jointly with the LAT method, and further improve efficiency by applying low-bit quantization. Our quantized length-adaptive MiniLM model (QuaLA-MiniLM) is trained only once, dynamically fits any inference scenario, and achieves an accuracy-efficiency trade-off superior to other efficient approaches at any computational budget on the SQuAD1.1 dataset (up to an 8.8x speedup with less than 1% accuracy loss). The code to reproduce this work will be released publicly on GitHub soon.
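
As a rough illustration of how the three ingredients fit together at inference time, the sketch below loads a generic MiniLM checkpoint, applies off-the-shelf dynamic int8 quantization, and shows where a length configuration would enter. This is a minimal sketch, not the released QuaLA-MiniLM code: the checkpoint name, the length-configuration values, and the use of PyTorch's quantize_dynamic (rather than the paper's quantization recipe) are all assumptions, and stock Transformers has no hook for LAT-style token dropping, so the configuration appears only as data.

```python
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# Assumption: in practice a MiniLM checkpoint fine-tuned on SQuAD1.1 would be used.
model_name = "microsoft/MiniLM-L12-H384-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
model.eval()

# Low-bit quantization: standard PyTorch dynamic int8 on all Linear layers.
# (The paper applies its own quantization recipe; this is a generic stand-in.)
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Length-adaptive inference: one trained model, many accuracy/speed operating
# points. A length configuration states how many tokens each of the 12 layers
# keeps; LAT searches for a good configuration per computational budget.
# These values are made up purely for illustration.
length_config = (384, 320, 256, 224, 192, 160, 128, 96, 80, 64, 48, 32)

question = "Who wrote Hamlet?"
context = "Hamlet is a tragedy written by William Shakespeare."
inputs = tokenizer(question, context, return_tensors="pt")

# Stock Transformers has no hook for per-layer token dropping, so this forward
# pass ignores length_config; a LAT-style model would consume it to prune tokens.
with torch.no_grad():
    outputs = quantized_model(**inputs)

start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
print(tokenizer.decode(inputs["input_ids"][0, start : end + 1]))
```

In the full approach, the length configuration is chosen by a search over speed/accuracy operating points at deployment time, so a single trained model serves every computational budget without retraining.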


Related research

11/18/2021 · Dynamic-TinyBERT: Boost TinyBERT's Inference Efficiency by Dynamic Sequence Length
10/14/2020 · Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime with Search
08/05/2021 · Decoupled Transformer for Scalable Inference in Open-domain Question Answering
12/16/2021 · Block-Skim: Efficient Question Answering for Transformer
07/28/2022 · SDBERT: SparseDistilBERT, a faster and smaller BERT model
10/28/2022 · BEBERT: Efficient and robust binary ensemble BERT
06/02/2021 · On the Distribution, Sparsity, and Inference-time Quantization of Attention Values in Transformers
