CiteWorth: Cite-Worthiness Detection for Improved Scientific Document Understanding

05/23/2021
by Dustin Wright, et al.

Scientific document understanding is challenging because the data is highly domain-specific and diverse. However, datasets for tasks with scientific text require expensive manual annotation and tend to be small and limited to only one or a few fields. At the same time, scientific documents contain many potential training signals, such as citations, which can be used to build large labelled datasets. Given this, we present an in-depth study of cite-worthiness detection in English, where a sentence is labelled for whether or not it cites an external source. To accomplish this, we introduce CiteWorth, a large, contextualized, rigorously cleaned labelled dataset for cite-worthiness detection built from a massive corpus of extracted plain-text scientific documents. We show that CiteWorth is high-quality, challenging, and suitable for studying problems such as domain adaptation. Our best-performing cite-worthiness detection model is a paragraph-level contextualized sentence labelling model based on Longformer, exhibiting a 5 F1 point improvement over SciBERT, which considers only individual sentences. Finally, we demonstrate that language model fine-tuning with cite-worthiness as a secondary task leads to improved performance on downstream scientific document understanding tasks.
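To make the labelling idea concrete, the following is a minimal sketch of how citation markers could serve as a training signal: a sentence gets label 1 if it contains a citation marker and 0 otherwise, and the marker is stripped so a model cannot simply read the label off the text. The regular expression, the two citation styles it covers, and the function name `label_sentence` are assumptions made for illustration; this is not the cleaning pipeline used to build CiteWorth.

```python
import re
from typing import Tuple

# Assumed citation styles (illustrative only, not the CiteWorth rules):
#   numeric brackets such as [3] or [1, 2-4]
#   author-year parentheses such as (Smith et al., 2020)
CITATION = re.compile(
    r"\s*(?:\[\d+(?:\s*[,\-]\s*\d+)*\]"
    r"|\([A-Z][^()]*?\d{4}[a-z]?\))"
)


def label_sentence(sentence: str) -> Tuple[str, int]:
    """Return (sentence without citation markers, cite-worthiness label)."""
    # subn returns the cleaned text and the number of markers removed.
    cleaned, n_markers = CITATION.subn("", sentence)
    return cleaned.strip(), int(n_markers > 0)


if __name__ == "__main__":
    paragraph = [
        "Transformers work well for this task [12].",
        "We propose a new method.",
        "Prior work studied this (Smith et al., 2020).",
    ]
    for text, label in map(label_sentence, paragraph):
        print(label, text)
```

Keeping the sentences of a paragraph together, as in the `paragraph` list above, matters for the contextualized setting the abstract describes, where a model labels each sentence with its neighbours in view.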


