Blockwise Self-Attention for Long Document Understanding

11/07/2019
by Jiezhong Qiu, et al.

We present BlockBERT, a lightweight and efficient BERT model that is designed to better model long-distance dependencies. Our model extends BERT by introducing sparse block structures into the attention matrix to reduce both memory consumption and training time, which also enables attention heads to capture either short- or long-range contextual information. We conduct experiments on several benchmark question answering datasets with various paragraph lengths. Results show that BlockBERT uses 18.7-36.1% less memory, reduces training time by 12.0-25.1%, and achieves comparable and sometimes better prediction accuracy, compared to an advanced BERT-based model, RoBERTa.
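
To make the idea of sparse block structures in the attention matrix concrete, below is a minimal sketch of blockwise self-attention for a single head, written in PyTorch. The function name, tensor shapes, and the fixed block size are illustrative assumptions, not the paper's exact implementation; the full model also permutes block assignments across heads so that some heads attend to long-range (off-diagonal) blocks.

```python
# A minimal sketch of block-diagonal (blockwise) self-attention,
# assuming PyTorch and a single attention head. Names are illustrative.
import math
import torch

def blockwise_attention(x, w_q, w_k, w_v, block_size):
    """x: (batch, seq_len, d_model); seq_len must be divisible by block_size."""
    batch, seq_len, d_model = x.shape
    n_blocks = seq_len // block_size

    q = x @ w_q  # (batch, seq_len, d_model)
    k = x @ w_k
    v = x @ w_v

    # Reshape so attention is computed independently inside each block:
    # (batch, n_blocks, block_size, d_model)
    q = q.view(batch, n_blocks, block_size, d_model)
    k = k.view(batch, n_blocks, block_size, d_model)
    v = v.view(batch, n_blocks, block_size, d_model)

    # Scores have shape (batch, n_blocks, block_size, block_size) instead of
    # (batch, seq_len, seq_len), so memory grows with block_size, not seq_len.
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_model)
    attn = scores.softmax(dim=-1)
    out = attn @ v

    return out.view(batch, seq_len, d_model)

# Toy usage: a 512-token sequence split into 4 blocks of 128 tokens.
d_model, seq_len = 64, 512
x = torch.randn(2, seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
y = blockwise_attention(x, w_q, w_k, w_v, block_size=128)
print(y.shape)  # torch.Size([2, 512, 64])
```

This block-diagonal case shows where the memory savings come from: each token only forms attention scores against the tokens in its own block, so the quadratic cost is paid per block rather than over the whole sequence.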


