SPECTER: Document-level Representation Learning using Citation-informed Transformers

04/15/2020
by   Arman Cohan, et al.

Representation learning is a critical ingredient for natural language processing systems. Recent Transformer language models like BERT learn powerful textual representations, but they are trained with token- and sentence-level objectives and do not leverage information on inter-document relatedness, which limits their document-level representation power. For applications on scientific documents, such as classification and recommendation, accurate document-level embeddings are essential for strong performance on end tasks. We propose SPECTER, a new method to generate document-level embeddings of scientific documents by pretraining a Transformer language model on a powerful signal of document-level relatedness: the citation graph. Unlike existing pretrained language models, SPECTER can be easily applied to downstream applications without task-specific fine-tuning. Additionally, to encourage further research on document-level models, we introduce SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction to document classification and recommendation. We show that SPECTER outperforms a variety of competitive baselines on the benchmark.
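The citation-graph pretraining signal described above can be illustrated with a triplet margin loss, the objective the SPECTER paper uses: a query paper's embedding is pulled toward the embedding of a paper it cites (positive) and pushed away from an uncited paper (negative). The sketch below uses toy NumPy vectors and illustrative names; real SPECTER embeddings come from a Transformer encoder, not hand-set arrays.

```python
import numpy as np

def triplet_margin_loss(query, positive, negative, margin=1.0):
    """Citation-informed triplet loss: the query paper should be closer
    to a cited paper (positive) than to an uncited one (negative)."""
    d_pos = np.linalg.norm(query - positive)  # L2 distance to cited paper
    d_neg = np.linalg.norm(query - negative)  # L2 distance to uncited paper
    return max(0.0, d_pos - d_neg + margin)

# Toy 2-d embeddings for illustration only.
q = np.array([1.0, 0.0])
pos = np.array([0.9, 0.1])   # cited paper: should be close to q
neg = np.array([-1.0, 0.0])  # uncited paper: should be far from q
loss = triplet_margin_loss(q, pos, neg)  # 0.0 here: the margin is already satisfied
```

Because the positive is already much closer to the query than the negative by more than the margin, the loss is zero; swapping the positive and negative roles yields a positive loss that gradient descent would reduce by moving the embeddings.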

