SPECTER: Document-level Representation Learning using Citation-informed Transformers

04/15/2020
by Arman Cohan, et al.

Representation learning is a critical ingredient for natural language processing systems. Recent Transformer language models like BERT learn powerful textual representations, but they are trained with token- and sentence-level objectives and do not leverage information on inter-document relatedness, which limits their document-level representation power. For applications on scientific documents, such as classification and recommendation, document embeddings power strong performance on end tasks. We propose SPECTER, a new method to generate document-level embeddings of scientific documents by pretraining a Transformer language model on a powerful signal of document-level relatedness: the citation graph. Unlike existing pretrained language models, SPECTER can be easily applied to downstream applications without task-specific fine-tuning. Additionally, to encourage further research on document-level models, we introduce SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction to document classification and recommendation. We show that SPECTER outperforms a variety of competitive baselines on the benchmark.
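As a concrete illustration of the approach described in the abstract, the sketch below shows how a SPECTER-style model embeds a paper (title and abstract joined by the tokenizer's separator, with the final-layer [CLS] vector as the document embedding) and the triplet margin loss that the citation graph supplies as a relatedness signal: a query paper is pulled toward a paper it cites and pushed away from a non-cited one. This is a minimal sketch under stated assumptions, not the authors' released training code; the "allenai/specter" checkpoint is the one published with the paper, while the helper names (embed, triplet_loss) and the margin value are illustrative.

    # Minimal sketch of SPECTER-style embedding and its citation-informed
    # training signal. Only the "allenai/specter" checkpoint name comes from
    # the paper's release; the rest is an illustrative approximation.
    import torch
    import torch.nn.functional as F
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("allenai/specter")
    model = AutoModel.from_pretrained("allenai/specter")

    def embed(papers):
        # Each paper is represented as "title [SEP] abstract"; the
        # final-layer [CLS] vector serves as the document embedding.
        texts = [p["title"] + tokenizer.sep_token + p.get("abstract", "")
                 for p in papers]
        inputs = tokenizer(texts, padding=True, truncation=True,
                           max_length=512, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs)
        return out.last_hidden_state[:, 0, :]  # [CLS] embeddings

    def triplet_loss(query, positive, negative, margin=1.0):
        # Pull the query paper toward a paper it cites (positive) and push
        # it away from a non-cited paper (negative) by at least the margin.
        d_pos = torch.norm(query - positive, dim=-1)
        d_neg = torch.norm(query - negative, dim=-1)
        return F.relu(d_pos - d_neg + margin).mean()

At inference time only embed is needed: per the abstract, the pretrained embeddings feed downstream tasks directly, without task-specific fine-tuning.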


research
05/07/2023

MIReAD: Simple Method for Learning High-quality Representations from Scientific Documents

Learning semantically meaningful representations from scientific documen...
research
10/20/2021

Contrastive Document Representation Learning with Graph Attention Networks

Recent progress in pretrained Transformer-based language models has show...
research
01/23/2020

Navigation-Based Candidate Expansion and Pretrained Language Models for Citation Recommendation

Citation recommendation systems for the scientific literature, to help a...
research
11/23/2022

SciRepEval: A Multi-Format Benchmark for Scientific Document Representations

Learned representations of scientific documents can serve as valuable in...
research
09/08/2023

Encoding Multi-Domain Scientific Papers by Ensembling Multiple CLS Tokens

Many useful tasks on scientific documents, such as topic classification ...
research
03/14/2023

Finding the Needle in a Haystack: Unsupervised Rationale Extraction from Long Text Classifiers

Long-sequence transformers are designed to improve the representation of...
