Decoupled Transformer for Scalable Inference in Open-domain Question Answering

08/05/2021
by Haytham ElFadeel, et al.

Large transformer models, such as BERT, achieve state-of-the-art results in machine reading comprehension (MRC) for open-domain question answering (QA). However, transformers have a high computational cost at inference time, which makes them hard to apply in online QA systems for applications like voice assistants. To reduce computational cost and latency, we propose decoupling the transformer MRC model into an input-component and a cross-component. The decoupling allows part of the representation computation to be performed offline and cached for online use. To retain the accuracy of the decoupled transformer, we devise a knowledge distillation objective from a standard transformer model. Moreover, we introduce learned representation compression layers, which reduce the storage requirement for the cache by four times. In experiments on the SQuAD 2.0 dataset, a decoupled transformer reduces the computational cost and latency of open-domain MRC by 30-40% compared to a standard transformer.
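The decoupling can be sketched roughly as follows. This is a minimal PyTorch illustration, not the authors' implementation: the layer split (9 lower / 3 upper), hidden size, and 4x compression ratio are assumptions chosen only to show the offline/online separation described in the abstract.

```python
# Illustrative sketch of a decoupled transformer for extractive MRC.
# Assumptions (not from the paper's code): BERT-like encoder split into
# lower "input" layers and upper "cross" layers; 768-dim hidden states;
# a learned linear compression to shrink cached passage vectors ~4x.
import torch
import torch.nn as nn

class DecoupledTransformer(nn.Module):
    def __init__(self, hidden=768, n_heads=12,
                 n_input_layers=9, n_cross_layers=3, compressed=192):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(
            d_model=hidden, nhead=n_heads, batch_first=True)
        # Input-component: processes question and passage independently,
        # so passage representations can be computed offline and cached.
        self.input_layers = nn.ModuleList(
            [make_layer() for _ in range(n_input_layers)])
        # Learned compression/decompression for the offline cache.
        self.compress = nn.Linear(hidden, compressed)
        self.decompress = nn.Linear(compressed, hidden)
        # Cross-component: joint question-passage interaction, run online only.
        self.cross_layers = nn.ModuleList(
            [make_layer() for _ in range(n_cross_layers)])
        self.qa_head = nn.Linear(hidden, 2)  # start/end logits for span extraction

    def encode_input(self, embeddings):
        """Run the lower layers on one segment (question OR passage) alone."""
        h = embeddings
        for layer in self.input_layers:
            h = layer(h)
        return h

    def encode_passage_offline(self, passage_embeddings):
        """Offline pass: encode a passage and compress it for the cache."""
        return self.compress(self.encode_input(passage_embeddings))

    def forward(self, question_embeddings, cached_passage):
        # Online pass: only the (short) question goes through the input layers.
        q = self.encode_input(question_embeddings)
        p = self.decompress(cached_passage)   # restore cached passage vectors
        h = torch.cat([q, p], dim=1)          # concatenate along the sequence axis
        for layer in self.cross_layers:       # joint cross-component layers
            h = layer(h)
        return self.qa_head(h)                # (batch, seq_len, 2) span logits
```

In this sketch the heavy per-passage computation happens once offline via encode_passage_offline, while at query time only the short question and the few cross layers are executed; during training, a distillation loss against a standard (fully coupled) transformer would be added to recover accuracy, as the abstract describes.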
