ERNIE-DOC: The Retrospective Long-Document Modeling Transformer

by Siyu Ding, et al.

Transformers are not suited to processing long document input because of their quadratically increasing memory and time consumption. Simply truncating a long document or applying a sparse attention mechanism incurs either the context fragmentation problem or inferior modeling capability at a comparable model size. In this paper, we propose ERNIE-DOC, a document-level language pretraining model based on Recurrence Transformers. Two well-designed techniques, namely the retrospective feed mechanism and the enhanced recurrence mechanism, give ERNIE-DOC a much longer effective context length, enabling it to capture the contextual information of a whole document. We pretrain ERNIE-DOC to explicitly learn the relationships among segments with an additional document-aware segment-reordering objective. Various experiments are conducted on both English and Chinese document-level tasks. ERNIE-DOC achieves a state-of-the-art language modeling result of 16.8 perplexity on WikiText-103 and outperforms competitive pretraining models on most language understanding tasks, such as text classification and question answering, by a large margin.
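The two mechanisms named in the abstract can be illustrated schematically. In Transformer-XL-style recurrence, layer n of the current segment attends over cached hidden states from layer n-1 of the previous segment, while ERNIE-DOC's enhanced recurrence routes the cache from the same layer n, enlarging the effective context; the retrospective feed mechanism then feeds the document's segments twice (a skimming phase followed by a retrospective phase), so second-pass segments can draw on memories of the whole document. The following is a minimal toy sketch of these ideas, not the paper's implementation: `layer_forward` is a stand-in for a real Transformer layer, and all function names are illustrative.

```python
import numpy as np

def layer_forward(x, mem):
    # Toy stand-in for a Transformer layer: mixes the segment with its
    # cached memory. A real layer would use self-attention over [mem; x];
    # here we just add the mean of the combined context for brevity.
    ctx = np.concatenate([mem, x], axis=0) if mem is not None else x
    return x + ctx.mean(axis=0, keepdims=True)

def process_document(segments, n_layers=3, enhanced=True):
    """Run segments through a stack of toy layers with segment-level recurrence.

    enhanced=True  -> memory for layer n is the previous segment's output of
                      the SAME layer n (enhanced-recurrence-style routing).
    enhanced=False -> memory for layer n is the previous segment's states one
                      layer below (Transformer-XL-style routing).
    """
    mems = [None] * n_layers
    outputs = []
    for seg in segments:
        h = seg
        new_states = []
        for n in range(n_layers):
            h = layer_forward(h, mems[n])
            new_states.append(h)
        if enhanced:
            mems = new_states               # same-layer memory
        else:
            mems = [seg] + new_states[:-1]  # shifted one layer down
        outputs.append(h)
    return outputs

def retrospective_feed(segments, **kw):
    # Skimming phase then retrospective phase: the second pass over the
    # segments starts with memories accumulated over the whole document.
    return process_document(list(segments) + list(segments), **kw)
```

This only demonstrates how the memory caches are routed between segments and passes; the actual model applies these mechanisms inside multi-head attention with stop-gradient on the cached states.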



