TangoBERT: Reducing Inference Cost by using Cascaded Architecture

04/13/2022
by Jonathan Mamou, et al.

The remarkable success of large transformer-based models such as BERT, RoBERTa and XLNet in many NLP tasks comes with a large increase in monetary and environmental cost due to their high computational load and energy consumption. In order to reduce this computational load at inference time, we present TangoBERT, a cascaded model architecture in which instances are first processed by an efficient but less accurate first tier model, and only a subset of those instances are additionally processed by a less efficient but more accurate second tier model. The decision of whether to apply the second tier model is based on a confidence score produced by the first tier model. Our simple method has several appealing practical advantages compared to standard cascading approaches based on multi-layered transformer models. First, it enables higher speedup gains (lower average latency). Second, it takes advantage of batch size optimization for cascading, which increases the relative inference cost reductions. We report TangoBERT inference CPU speedup on four text classification GLUE tasks and on one reading comprehension task. Experimental results show that TangoBERT outperforms efficient early exit baseline models; on the SST-2 task, it achieves an accuracy of 93.9% with a CPU speedup of 8.2x.
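To make the routing rule concrete, the sketch below shows one way such a confidence-based cascade can be wired up with off-the-shelf Hugging Face pipelines: the cheap model classifies every instance, and only the low-confidence instances are re-run, as a single batch, through the expensive model. The model checkpoints and the 0.95 threshold are illustrative assumptions, not the exact configuration reported in the paper.

```python
from transformers import pipeline

# Illustrative tier models (not the paper's exact checkpoints):
# a small distilled classifier as the cheap first tier, a RoBERTa-Large
# classifier as the accurate second tier.
small_clf = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
large_clf = pipeline(
    "sentiment-analysis",
    model="siebert/sentiment-roberta-large-english",
)

CONFIDENCE_THRESHOLD = 0.95  # hypothetical value; tuned on a dev set in practice


def cascaded_predict(texts):
    """Classify all texts with the small model, then escalate only the
    low-confidence instances to the large model in one batch."""
    first_pass = small_clf(texts)
    results = list(first_pass)

    # Indices of instances whose first-tier confidence is below threshold.
    escalate_idx = [
        i for i, pred in enumerate(first_pass)
        if pred["score"] < CONFIDENCE_THRESHOLD
    ]

    # Escalated instances are processed together, so the expensive model
    # only ever sees a small, batched subset of the traffic.
    if escalate_idx:
        second_pass = large_clf([texts[i] for i in escalate_idx])
        for i, pred in zip(escalate_idx, second_pass):
            results[i] = pred

    return results


if __name__ == "__main__":
    print(cascaded_predict([
        "a gorgeous, witty, seductive movie",
        "it was ... fine, i guess",
    ]))
```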

Related research

08/05/2021 - Decoupled Transformer for Scalable Inference in Open-domain Question Answering
Large transformer models, such as BERT, achieve state-of-the-art results...

10/05/2021 - MoEfication: Conditional Computation of Transformer Models for Efficient Inference
Transformer-based pre-trained language models can achieve superior perfo...

12/16/2021 - Block-Skim: Efficient Question Answering for Transformer
Transformer models have achieved promising results on natural language p...

12/13/2019 - WaLDORf: Wasteless Language-model Distillation On Reading-comprehension
Transformer based Very Large Language Models (VLLMs) like BERT, XLNet an...

05/24/2022 - BabyBear: Cheap inference triage for expensive language models
Transformer language models provide superior accuracy over previous mode...

10/26/2020 - FastFormers: Highly Efficient Transformer Models for Natural Language Understanding
Transformer-based models are the state-of-the-art for Natural Language U...

04/16/2020 - The Right Tool for the Job: Matching Model and Instance Complexities
As NLP models become larger, executing a trained model requires signific...
