Cross-Document Language Modeling

01/02/2021
by Avi Caciularu, et al.

We introduce a new pretraining approach for language models geared toward multi-document NLP tasks. Our cross-document language model (CD-LM) improves masked language modeling for these tasks with two key ideas. First, we pretrain with multiple related documents in a single input, via cross-document masking, which encourages the model to learn cross-document and long-range relationships. Second, extending the recent Longformer model, we pretrain with long contexts of several thousand tokens and introduce a new attention pattern that uses sequence-level global attention to predict masked tokens, while retaining the familiar local attention elsewhere. We show that CD-LM sets new state-of-the-art results for several multi-text tasks, including cross-document event and entity coreference resolution, paper citation recommendation, and document plagiarism detection, while using significantly fewer trainable parameters than prior work.
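
The two ideas in the abstract can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation: the function name `build_cross_document_input`, the separator strings `<doc-s>` and `</doc-s>`, the `<mask>` string, and the 15% masking rate are all assumptions made for illustration. The sketch concatenates several related documents into one input, masks tokens at random across all of them, and marks each masked position for global attention (1) while every other position keeps local attention (0).

```python
import random

MASK = "<mask>"  # hypothetical mask token string


def build_cross_document_input(docs, mask_prob=0.15, seed=0):
    """Concatenate tokenized documents and mask tokens across all of them.

    Returns three parallel lists:
      tokens           - the input sequence with some tokens replaced by MASK
      labels           - the original token at masked positions, else None
      global_attention - 1 at masked positions (global), 0 elsewhere (local)
    """
    rng = random.Random(seed)
    tokens, labels, global_attention = [], [], []
    for doc in docs:
        # Hypothetical separators marking where each document starts and ends.
        for tok in ["<doc-s>"] + list(doc) + ["</doc-s>"]:
            is_separator = tok in ("<doc-s>", "</doc-s>")
            if not is_separator and rng.random() < mask_prob:
                tokens.append(MASK)
                labels.append(tok)
                global_attention.append(1)  # masked token attends to the whole sequence
            else:
                tokens.append(tok)
                labels.append(None)
                global_attention.append(0)  # unmasked token keeps local attention
    return tokens, labels, global_attention


if __name__ == "__main__":
    docs = [
        "the company announced a merger on monday".split(),
        "shares rose after the merger was announced".split(),
    ]
    tokens, labels, global_attention = build_cross_document_input(docs)
    for tok, lab, g in zip(tokens, labels, global_attention):
        print(f"{tok:12s} label={lab!s:10s} global={g}")
```

In a Longformer-style model, a 0/1 vector like `global_attention` would typically be passed alongside the token ids so that only the flagged positions attend to, and are attended by, the full multi-document sequence.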
