Block-Skim: Efficient Question Answering for Transformer

12/16/2021
by Yue Guan, et al.

Transformer models have achieved promising results on natural language processing (NLP) tasks, including extractive question answering (QA). Common Transformer encoders used in NLP tasks process the hidden states of all input tokens in the context paragraph throughout all layers. However, unlike tasks such as sequence classification, answering the raised question does not require all of the tokens in the context paragraph. Following this motivation, we propose Block-Skim, which learns to skim unnecessary context in higher hidden layers, improving and accelerating Transformer inference. The key idea of Block-Skim is to identify the context blocks that must be processed further and those that can be safely discarded early during inference. Critically, we find that this information can be sufficiently derived from the self-attention weights inside the Transformer model. We further prune the hidden states corresponding to unnecessary positions in lower layers, achieving significant inference-time speedup. Surprisingly, we observe that models pruned in this way outperform their full-size counterparts. Block-Skim improves QA models' accuracy on different datasets and achieves a 3x speedup on the BERT-base model.
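The abstract describes scoring context blocks from self-attention weights and pruning the low-relevance ones before later layers. The following is a minimal, hypothetical PyTorch sketch of that idea, not the authors' implementation: per-block relevance is estimated from the attention mass each block receives, and the hidden states of low-scoring blocks are dropped. The function name, block size, and keep ratio are illustrative assumptions.

```python
# Hypothetical sketch of block skimming (not the paper's code): prune
# hidden-state blocks that receive little self-attention.
import torch

def block_skim(hidden_states, attention_probs, block_size=32, keep_ratio=0.5):
    """Keep only the context blocks that attract the most attention.

    hidden_states:   (batch, seq_len, hidden)         -- one layer's output
    attention_probs: (batch, heads, seq_len, seq_len)  -- softmaxed attention
    Returns the pruned hidden states and the indices of the kept tokens.
    """
    batch, seq_len, hidden = hidden_states.shape
    num_blocks = seq_len // block_size

    # Attention mass each token receives, averaged over heads and query positions.
    token_scores = attention_probs.mean(dim=1).mean(dim=1)            # (batch, seq_len)

    # Aggregate token scores into contiguous blocks of the context.
    block_scores = token_scores[:, :num_blocks * block_size]
    block_scores = block_scores.view(batch, num_blocks, block_size).mean(dim=-1)

    # Keep the highest-scoring blocks; the rest are skimmed away.
    num_keep = max(1, int(num_blocks * keep_ratio))
    keep_blocks = block_scores.topk(num_keep, dim=-1).indices.sort(dim=-1).values

    # Expand block indices to token indices and gather the surviving states.
    offsets = torch.arange(block_size, device=hidden_states.device)
    keep_tokens = (keep_blocks.unsqueeze(-1) * block_size + offsets).flatten(1)
    pruned = hidden_states.gather(
        1, keep_tokens.unsqueeze(-1).expand(-1, -1, hidden)
    )
    return pruned, keep_tokens
```

In a layer-wise scheme like the one the abstract outlines, a step of this kind would be applied after selected encoder layers so that subsequent layers only attend over the retained blocks, which is where the inference-time savings come from.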


Related research

- 05/02/2020 · DeFormer: Decomposing Pre-trained Transformers for Faster Question Answering
  Transformer-based QA models use input-wide self-attention – i.e. across ...
- 06/01/2021 · DoT: An efficient Double Transformer for NLP tasks with tables
  Transformer-based approaches have been successfully used to obtain state...
- 05/15/2022 · Transkimmer: Transformer Learns to Layer-wise Skim
  Transformer architecture has become the de-facto model for many machine ...
- 10/26/2022 · DyREx: Dynamic Query Representation for Extractive Question Answering
  Extractive question answering (ExQA) is an essential task for Natural La...
- 04/13/2022 · TangoBERT: Reducing Inference Cost by using Cascaded Architecture
  The remarkable success of large transformer-based models such as BERT, R...
- 08/24/2021 · Auto-Parsing Network for Image Captioning and Visual Question Answering
  We propose an Auto-Parsing Network (APN) to discover and exploit the inp...
- 10/31/2022 · QuaLA-MiniLM: a Quantized Length Adaptive MiniLM
  Limited computational budgets often prevent transformers from being used...
