Fact or Fiction: Verifying Scientific Claims

04/30/2020 ∙ by David Wadden, et al. ∙ University of Washington ∙ Allen Institute for Artificial Intelligence

We introduce the task of scientific fact-checking. Given a corpus of scientific articles and a claim about a scientific finding, a fact-checking model must identify abstracts that support or refute the claim. In addition, it must provide rationales for its predictions in the form of evidentiary sentences from the retrieved abstracts. For this task, we introduce SciFact, a dataset of 1.4K expert-written scientific claims paired with evidence-containing abstracts, and annotated with labels and rationales. We present a baseline model and assess its performance on SciFact. We observe that, while fact-checking models trained on Wikipedia articles or political news have difficulty generalizing to our task, simple domain adaptation techniques represent a promising avenue for improvement. Finally, we provide initial results showing how our model can be used to verify claims relevant to COVID-19 on the CORD-19 corpus. Our dataset will be made publicly available at https://scifact.apps.allenai.org.




1 Introduction

Figure 1: A SciFact claim refuted by evidence. To refute this claim, the system must recognize that (1) “CSF” is an acronym for “cerebral spinal fluid”, found in the brain, (2) “Citalopram” is a type of antidepressant, but “placebo” is not, and (3) “Slowing by 37%” indicates a reversal in effect relative to the claim.

Fact-checking – a task in which the veracity of an input claim is verified against a corpus of documents that support or refute the claim – has seen increased attention as an important research area. This attention is motivated by the proliferation of misinformation in political news, social media, and on the web. In turn, interest in fact-checking has spurred the creation of many datasets across different domains to support research and development of automated fact-checking systems. Yet, to our knowledge, no such dataset exists to facilitate research on another important domain for fact-checking – scientific literature. The ability to verify claims about scientific concepts, especially those related to biomedicine, is an important application area for fact-checking. Furthermore, this line of research also offers a unique opportunity to explore the capabilities of modern neural models, since successfully verifying most scientific claims requires expert background knowledge, complex language understanding, and reasoning capability, as demonstrated in Figure 1.

In this paper, we introduce the task of scientific fact-checking. To facilitate research on this task, we construct SciFact, a dataset of 1,409 scientific claims accompanied by scientific abstracts that support or refute each claim, and annotated with rationales justifying each support / refute decision. To curate this dataset, we use a novel annotation protocol that takes advantage of a plentiful source of naturally-occurring claims in the scientific literature – citation sentences, or “citances” Nakov et al.

To establish performance baselines on this new task, we develop a pipeline model following the “BERT-to-BERT” approach from DeYoung et al. (2019), which achieves strong performance on Fever. Our model, which we call VeriSci, retrieves abstracts related to a given claim, uses a BERT-based Devlin et al. (2019) sentence selector to identify rationale sentences, and then labels each abstract as Supports, Refutes, or NotEnoughInfo with respect to the claim. Our system is able to identify correctly-labeled and rationalized evidence abstracts with performance of 46.5 F1, indicating that the task is doable but leaving ample room for improvement. Despite the small size of SciFact, training VeriSci on it leads to better performance than training on fact-checking datasets constructed from Wikipedia articles Thorne et al. (2018) and political news Hanselowski et al. (2019). The strongest performance is achieved using a simple domain adaptation strategy, pretraining on Fever and then finetuning on SciFact.

To evaluate the real-world applicability of our dataset and approach, we showcase the ability of our model to verify expert-written claims concerning the novel coronavirus COVID-19 against the newly-released CORD-19 corpus Wang et al. (2020). Medical student reviewers judge the retrieved evidence to be plausible in 23 of the 36 claims. (While we find these results promising, we emphasize that our model is a research prototype and should not be used to make any medical decisions whatsoever.) Our data and models will be released publicly at https://scifact.apps.allenai.org, along with a demo for fact-checking claims against the CORD-19 corpus.

2 Related work

We discuss SciFact in relation to existing fact-checking datasets and other related scientific NLP tasks.

2.1 Fact checking

Fact-checking datasets include PolitiFact Vlachos and Riedel (2014), Emergent Ferreira and Vlachos (2016), LIAR Wang (2017), SemEval 2017 Task 8 RumorEval Derczynski et al. (2017), Snopes Popat et al. (2017), CLEF-2018 CheckThat! Barrón-Cedeño et al. (2018), Verify Baly et al. (2018), Fever Thorne et al. (2018), and UKP Snopes Hanselowski et al. (2019). Notably, the latter two datasets are additionally annotated with sentence-level rationales; we refer the reader to Hanselowski et al. (2019) for a thorough review. Yet, to our knowledge, there is no prior work on scientific fact checking. We summarize key characteristics of other fact-checking datasets and explain how they differ from those in SciFact.

Natural vs synthetic claims

We distinguish between synthetic and natural claims. Fever uses synthetic claims created by annotators by mutating Wikipedia sentences selected as related evidence. Most other prior work uses natural claims curated from fact checking sites, Twitter, debates, or news articles. The claims in SciFact are natural, since they are derived from citation sentences that occur naturally in scientific articles, and annotators do not see the evidence at the time of claim writing. We discuss this claim-writing process further in §3.2.

Labeling claims vs claim-document pairs

In fact-checking, a claim is a statement of actuality whose veracity is a fixed target for investigation. Therefore, claims can be assigned a global supported or refuted label. For example, in Fever, the claim “Barack Obama was the President of the United States” can be verified as globally supported given sufficient evidence.

While SciFact claims are indeed factual assertions, we do not attempt to assign them global labels because the asserted “fact” may still be under active scientific research. Instead of labeling claims, we label claim-document pairs with support or refute relations. This is similar to the task in Perspectrum Chen et al. (2019), which identifies evidence-backed “perspective” statements as agreeing or disagreeing with an opinion-based claim, such as “Animals should have lawful rights.” We discuss this claim-document labeling process further in §3.3.

2.2 Related scientific NLP tasks

The SciFact task is closely related to two other scientific NLP tasks – citation contextualization and evidence inference. The goal of citation contextualization is to identify all spans in a cited document that are relevant to a particular citation in a citing document Cohan et al. (2015). A dataset of 20 biomedical articles annotated with contextualized citations was released at TAC 2014 for this task (https://tac.nist.gov/2014/BiomedSumm/index.html). While the dataset was annotated by domain experts, the average inter-annotator agreement rate on annotated spans was only 21.7%. (Aggregating the dataset’s span-level annotations to the sentence level gives an agreement of 0.27 Fleiss’ $\kappa$.) More recently, the SciSummNet dataset Yasunaga et al. (2019) was released, focusing on NLP papers rather than biomedicine. Similar to these datasets, the annotation in SciFact involves contextualizing citances in the cited document, but in SciFact, citances are first converted into claims, and evidence is restricted to the abstracts of the cited documents.

The evidence inference task Lehman et al. (2019) involves predicting the effect of a medical intervention on a specified outcome. Like SciFact, the evidence inference task requires the model to identify evidence justifying its label predictions. Unlike the full-sentence claims given as input to SciFact, the inputs for evidence inference are individual text spans specifying an intervention, comparator, and treatment outcome.

3 The SciFact dataset

For this task, we introduce SciFact, a dataset of 1,409 scientific claims fact-checked against a corpus of 5,183 abstracts. Abstracts that support or refute a claim are additionally annotated with rationales. We describe our corpus creation and annotation protocol.

3.1 Data source

To construct SciFact, we use S2ORC Lo et al. (2020), a publicly-available corpus of millions of scientific articles. We restrict articles to those with at least 10 citations and with full text freely available. (While we focus on abstracts, this choice leaves open the opportunity to extend our work to full text.) To ensure that documents in our dataset are of high quality, we randomly sample articles from a manually curated collection of well-regarded journals spanning domains from basic science (e.g., Cell, Nature) to clinical medicine (e.g., JAMA, BMJ). The full list is in Appendix B. We refer to the resulting collection of articles as our seed set.

We use the S2ORC citation graph to sample citances (from citing articles) that cite these seed articles. If a citance cites other articles not in the seed set, we refer to these as co-cited articles.

3.2 Claim writing


In SciFact, a scientific claim is an atomic factual statement expressing a finding about one aspect of a scientific entity or process, which can be verified from a single source. (Requiring annotators to search multiple sources to verify a single claim increases cognitive burden and decreases the quality of annotation.) For instance, “The $R_0$ of the novel coronavirus is 2.5” is considered a valid scientific claim. Opinion-based statements like “The government should require people to stand six feet apart to slow the spread of coronavirus” are not considered scientific claims. Compound claims like “Aerosolized coronavirus droplets can travel at least 6 feet and can remain in the air for 3 hours” should be split into two atomic claims.


Citances Nakov et al. are an ideal source for claims since they contain expert-written assertions about important findings reported in related research articles, and, unlike claims found on the web, they specify the documents where supporting evidence can be found.

Annotators are shown a citance – the source citance – in the context of its source article, and are asked to write up to three claims based on the content of the citance while ensuring the produced claims conform to our claim definition. This results in natural claims because the annotator does not see the cited article’s abstract – the cited abstract – at the time of claim writing. Figure 2 shows an example. See Appendix C for screenshots of the claim and evidence interfaces.

Figure 2: A claim written based on a citance. Material unrelated to the citation is removed. The acronym “CVD” is expanded to “cardiovascular disease”.


The annotators include four experts with background in scientific NLP, fifteen undergraduates studying life sciences, and four graduate students (doctoral or medical) in the life sciences. Student claim writers attend an in-person training session where they are introduced to the task and receive feedback from the four experts. Following training, student annotators continue writing claims remotely. The expert annotators monitor these claims for quality and provide feedback when necessary. As a final check, all submitted claims are proofread by an undergraduate whose claims are deemed especially high-quality by the expert annotators.

Claim negation

Unless the authors of the source citance were mistaken, cited articles should provide supporting evidence for the claims made in a citance. To obtain examples where an abstract Refutes a claim, we create claim negations. Performing this task improperly can introduce biases into the dataset; for instance, a model could learn to associate the word “not” with a Refuted label Schuster et al. (2019). To mitigate these effects, a scientific NLP expert performed the negations, skipping claims that could not be negated without introducing obvious dataset artifacts. The majority of claim negations involved a reversal of effect direction; for instance “A high microerythrocyte count protects against severe anemia” can be negated as “A high microerythrocyte count raises vulnerability to severe anemia”.

3.3 Claim verification


Annotators are shown a claim, together with one of the claim’s cited abstracts, and asked to label the claim-abstract pair as Supports, Refutes, or NotEnoughInfo. If the abstract is not relevant to the claim, they are instructed to label it NotEnoughInfo. If the annotator assigns a Supports or Refutes label, they must also identify all valid rationales justifying the label. A rationale is a minimal collection of sentences sufficient to justify the label. An abstract may have multiple rationales, as in Figure 3, but they must be mutually exclusive – i.e., they may not share any sentences. (This mirrors the setup in Fever, where a single claim may be supported by multiple rationales.)

Figure 3: A claim supported by two rationales from the same abstract. The text of each rationale on its own provides sufficient evidence to verify the claim.


The annotators include three NLP experts, five undergraduates studying life sciences, and five graduate students studying life sciences. Annotations are performed remotely through a web interface. Annotators are required to pass a 10-question “quiz” before annotating their own claims. After passing the quiz, subsequent submissions are reviewed by an NLP expert until that expert deems the annotator reliable. Approved annotators are then assigned to review each other’s submissions. In general, graduate students are assigned to review annotations from undergraduates.


We assign 232 claim-abstract pairs for independent re-annotation. The label agreement is 0.75 Cohen’s $\kappa$, comparable with the 0.68 Fleiss’ $\kappa$ reported in Thorne et al. (2018), and 0.70 Cohen’s $\kappa$ reported in Hanselowski et al. (2019). To measure rationale agreement, we treat each sentence as either classified as “part of a rationale” or “not part of a rationale” and compute sentence-level agreement on abstracts where annotators agreed on the entailment label. The resulting Cohen’s $\kappa$ is 0.71. Additional statistics on the dataset can be found in Appendix B.
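As an illustration of how an agreement statistic like the ones above is computed, the following is a minimal sketch of Cohen's kappa for two annotators. The function name and the toy labels are hypothetical and are not taken from the SciFact annotation data.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (p_observed - p_expected) / (1 - p_expected)

# Toy labels (hypothetical, not SciFact annotations).
annotator_1 = ["Supports", "Refutes", "NotEnoughInfo", "Supports"]
annotator_2 = ["Supports", "Refutes", "Supports", "Supports"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

For sentence-level rationale agreement, the same function could be applied to per-sentence binary labels ("part of a rationale" or not).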

3.4 Adding distractors to the corpus

Our initial corpus is defined as the union of the seed and co-cited abstract sets from §3.1. To simulate a more realistic corpus for retrieval, we introduce additional distractor abstracts. In doing so, we observe a tradeoff. Adding too many distractors (e.g., all biomedical papers in S2ORC) increases the likelihood of false negatives – that is, when a distractor actually contains evidence relevant to a written claim, but may have been unknown to the authors who wrote the source citance. However, adding a small number of uniformly-sampled distractors does not pose a retrieval challenge, since these documents may not share much lexical overlap with the claims.

We address this problem as follows: for each citance, we sample articles that are cited in the same document as the citance, but in a different paragraph (see Figure 4). These articles should cover topics related to the evidence articles. At the same time, the citance authors were clearly aware of these articles, and presumably would have mentioned them in the citance if they were relevant. We add five distractor articles per citance.
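The sampling rule above can be sketched as follows. This is an illustrative sketch only: the input format (a mapping from paragraph index to cited article IDs) and the function name are assumptions, as is the choice to exclude any article that is also cited in the citance's own paragraph.

```python
import random

def sample_distractors(citing_doc, citance_paragraph, n=5, seed=0):
    """Sample up to n distractors cited in other paragraphs of the citing document.

    citing_doc: dict mapping paragraph index -> list of cited article IDs
                (hypothetical format).
    """
    cited_in_citance = set(citing_doc[citance_paragraph])
    # Candidates: cited elsewhere in the document, never alongside the citance.
    candidates = sorted({
        article
        for paragraph, cited in citing_doc.items()
        if paragraph != citance_paragraph
        for article in cited
        if article not in cited_in_citance
    })
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return rng.sample(candidates, min(n, len(candidates)))

doc = {0: ["A", "B"], 1: ["B", "C"], 2: ["D", "E", "F"]}
distractors = sample_distractors(doc, citance_paragraph=0, n=2)
```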

Figure 4: Citance and abstract selection. Citing abstracts are identified for each seed document. A claim is written based on the citation in the citing abstract. Co-cited and distractor abstracts are added to the corpus.

4 The SciFact task

We formalize our definition of the SciFact task and define how we perform evaluation.

4.1 Task Formulation

The inputs to our fact-checking task are a scientific claim $c$ and a corpus of abstracts $\mathcal{A}$. All abstracts $a \in \mathcal{A}$ are labeled as $y(c, a) \in \{\text{Supports}, \text{Refutes}, \text{NotEnoughInfo}\}$ with respect to a claim $c$. The abstracts that either Support or Refute $c$ are referred to as evidence abstracts for $c$. We denote the set of evidence abstracts $\mathcal{E}(c)$. Each evidence abstract $a \in \mathcal{E}(c)$ is annotated with rationales. A single rationale $R$ is a collection of sentences $\{r_1(c, a), \ldots, r_m(c, a)\}$ sufficient to justify the label $y(c, a)$, where $m$ is the number of sentences in rationale $R$. We denote the set of all rationales as $\mathcal{R}(c, a) = \{R_1(c, a), \ldots, R_n(c, a)\}$, where $n$ is the number of rationales.

Given a claim $c$ and a corpus $\mathcal{A}$, the system must predict a set of evidence abstracts $\hat{\mathcal{E}}(c)$. For each abstract $a \in \hat{\mathcal{E}}(c)$, it must predict a label $\hat{y}(c, a)$ and a collection of rationale sentences $\hat{S}(c, a)$. Note that although the gold annotations may contain multiple separate rationales, to simplify the prediction task we simply require the model to predict a single collection of rationale sentences; these sentences may come from multiple gold rationales.

4.2 Task Evaluation

Abstract-level evaluation

Abstract-level evaluation is inspired by the Fever Score and measures the system’s ability to correctly identify evidence abstracts. A predicted abstract $a \in \hat{\mathcal{E}}(c)$ is correctly identified if (1) $a$ is a gold evidence abstract for $c$, (2) the predicted label is correct: $\hat{y}(c, a) = y(c, a)$, and (3) the predicted rationale sentences contain a gold rationale, i.e., there exists some gold rationale $R(c, a) \subseteq \hat{S}(c, a)$. Like Fever, which limits the maximum number of predicted rationale sentences to five, SciFact limits predictions to three rationale sentences. (The longest rationale in SciFact is three sentences.)

Overall performance is measured by the F1 of the precision and recall of correctly-identified evidence abstracts, which we refer to as $\text{F1}_{\text{abstract}}$.
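The abstract-level criterion can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the official scorer: the dictionary formats and the function name are hypothetical, and capping predictions by keeping the first three sentences is one possible reading of the three-sentence limit.

```python
def abstract_prf(gold, pred, max_sentences=3):
    """Abstract-level precision, recall, and F1.

    gold: {(claim_id, abstract_id): (label, [set of sentence ids, ...])},
          containing gold evidence abstracts only (hypothetical format).
    pred: {(claim_id, abstract_id): (label, [sentence ids])},
          containing predicted evidence abstracts only.
    """
    correct = 0
    for key, (pred_label, pred_sents) in pred.items():
        if key not in gold:
            continue  # (1) must be a gold evidence abstract
        gold_label, gold_rationales = gold[key]
        kept = set(pred_sents[:max_sentences])  # assumed truncation rule
        # (2) label is correct and (3) some gold rationale is fully covered.
        if pred_label == gold_label and any(r <= kept for r in gold_rationales):
            correct += 1
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

gold = {("c1", "a1"): ("Supports", [{0, 1}, {4}]),
        ("c1", "a2"): ("Refutes", [{2}])}
pred = {("c1", "a1"): ("Supports", [4, 5]),
        ("c1", "a3"): ("Supports", [0])}
scores = abstract_prf(gold, pred)
```

In the toy example, abstract a1 counts as correct because the single-sentence gold rationale {4} is covered, while a3 is not a gold evidence abstract.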


Sentence-level evaluation

Sentence-level evaluation measures the system’s performance at identifying individual rationale sentences. We consider this evaluation in addition to the abstract-level evaluation because the abstract-level evaluation does not penalize the prediction of extra rationale sentences. To address this, we define an additional evaluation criterion at the level of individual rationale sentences. When the model correctly identifies all the sentences in a gold rationale, it is rewarded for each sentence in that rationale, but it is also penalized for all other sentences it predicts. More formally, a rationale sentence $\hat{s} \in \hat{S}(c, a)$ is correctly identified if (1) the abstract $a$ is correctly labeled, (2) $\hat{s}$ is a member of a gold rationale $R(c, a)$, and (3) all other members of $R(c, a)$ are among the predicted $\hat{S}(c, a)$.

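Denote the set of correctly predicted rationale sentences for claim $c$ and abstract $a$ as $S^*(c, a)$. We compute rationale sentence precision and recall as

$$P_{\text{sentence}} = \frac{\sum_c \sum_a |S^*(c, a)|}{\sum_c \sum_a |\hat{S}(c, a)|}, \qquad R_{\text{sentence}} = \frac{\sum_c \sum_a |S^*(c, a)|}{\sum_c \sum_a \sum_{R \in \mathcal{R}(c, a)} |R|},$$

where the sums range over claims and their abstracts.

Overall performance is measured as the F1 of the precision and recall, denoted as $\text{F1}_{\text{sentence}}$. For sentence-level evaluation, we do not limit the number of predicted rationale sentences, since the evaluation penalizes models that over-predict.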
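The sentence-level criterion described above can be sketched as follows. The data formats and function name are hypothetical, and the sketch relies on gold rationales being mutually exclusive so that no sentence is counted twice; it is not the official evaluation script.

```python
def sentence_prf(gold, pred):
    """Sentence-level precision, recall, and F1.

    gold: {(claim_id, abstract_id): (label, [set of sentence ids, ...])}
    pred: {(claim_id, abstract_id): (label, [sentence ids])}
    (hypothetical formats)
    """
    n_correct = n_pred = 0
    n_gold = sum(len(r) for _, rationales in gold.values() for r in rationales)
    for key, (pred_label, pred_sents) in pred.items():
        predicted = set(pred_sents)
        n_pred += len(predicted)  # every predicted sentence is counted
        if key not in gold or gold[key][0] != pred_label:
            continue  # abstract not correctly labeled
        for rationale in gold[key][1]:
            # Sentences count only if their whole gold rationale is predicted.
            if rationale <= predicted:
                n_correct += len(rationale)
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

gold = {("c1", "a1"): ("Supports", [{0, 1}, {4}]),
        ("c1", "a2"): ("Refutes", [{2}])}
pred = {("c1", "a1"): ("Supports", [0, 4, 5]),
        ("c1", "a2"): ("Refutes", [2])}
scores = sentence_prf(gold, pred)
```

Here sentence 0 earns no credit because its partner sentence 1 was not predicted, and sentence 5 lowers precision as an extra prediction.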

5 VeriSci: Baseline model

We develop a baseline for scientific fact checking by adapting the “BERT-to-BERT” model for “hard” rationale selection presented in DeYoung et al. (2019) for a number of rationalized NLP tasks including Fever; this approach is also similar to the fact-checking model presented in Soleimani et al. (2019). Our baseline (called VeriSci) takes a claim $c$ and corpus $\mathcal{A}$ as input, identifies evidence abstracts $\hat{\mathcal{E}}(c)$, and predicts a label $\hat{y}(c, a)$ and rationale sentences $\hat{S}(c, a)$ for each $a \in \hat{\mathcal{E}}(c)$. VeriSci is a pipeline of three components:

  1. AbstractRetrieval, which retrieves $k$ abstracts with highest TF-IDF similarity to the input claim.

  2. RationaleSelection, which identifies rationales for each candidate abstract (§5.1).

  3. LabelPrediction, which makes the final label prediction (§5.2).
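The retrieval step can be illustrated with a small, dependency-free sketch of TF-IDF ranking. The whitespace tokenizer, the smoothed IDF formula, and the default `k=3` are assumptions made for this example and are not details confirmed by the text above.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()  # simplistic tokenizer (assumption)

def tfidf_vectors(docs):
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed inverse document frequency (one common variant).
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(toks).items()}
               for toks in tokenized]
    return vectors, idf

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def retrieve(claim, abstracts, k=3):
    """Return indices of the k abstracts most similar to the claim."""
    vectors, idf = tfidf_vectors(abstracts)
    # Claim terms unseen in the corpus carry no weight.
    claim_vec = {t: tf * idf[t]
                 for t, tf in Counter(tokenize(claim)).items() if t in idf}
    ranked = sorted(range(len(abstracts)),
                    key=lambda i: cosine(claim_vec, vectors[i]), reverse=True)
    return ranked[:k]

abstracts = ["citalopram slows amyloid production in csf",
             "climate affects virus transmission",
             "lopinavir ritonavir reduce coronavirus viral load"]
top = retrieve("lopinavir treats coronavirus", abstracts, k=1)
```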

5.1 Rationale selection

Given a claim $c$ and candidate abstract $a$, we train a model to predict $z_i$ for each abstract sentence $a_i$, where $z_i = \mathbb{1}[a_i \text{ is a rationale sentence}]$. For each sentence, we encode the concatenated sequence $w_i = [a_i, \text{SEP}, c]$ using BERT and predict a score $\tilde{z}_i = \sigma[f(\text{CLS}(w_i))]$, where $\sigma$ is the sigmoid function, $f$ is a linear layer and $\text{CLS}(w_i)$ is the CLS token from the BERT encoding of $w_i$. (We use BERT to refer to the class of models with the BERT architecture. Our final system uses RoBERTa-large Liu et al. (2019).) We minimize cross-entropy loss between $\tilde{z}_i$ and $z_i$ during training. We train the model on pairs of claims and their cited abstracts from our corpus. For each claim, we use sentences from cited abstracts labeled NotEnoughInfo, as well as non-rationale sentences from abstracts labeled Supports and Refutes, as negative examples. We threshold the sigmoid values when performing selection.
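The scoring and thresholding step can be sketched without a neural encoder by treating the CLS embeddings as given. In this sketch the toy two-dimensional vectors stand in for BERT outputs, and the weights, bias, and the 0.5 threshold are hypothetical values chosen for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sentence_score(cls_vector, weights, bias):
    # Linear layer followed by a sigmoid, applied to one CLS embedding.
    return sigmoid(sum(w * x for w, x in zip(weights, cls_vector)) + bias)

def binary_cross_entropy(score, target):
    # Per-sentence training loss between predicted score and 0/1 target.
    return -(target * math.log(score) + (1 - target) * math.log(1 - score))

def select_rationales(cls_vectors, weights, bias, threshold=0.5):
    """Return indices of sentences whose score meets the threshold."""
    return [i for i, h in enumerate(cls_vectors)
            if sentence_score(h, weights, bias) >= threshold]

# Toy stand-ins for the CLS embeddings of three abstract sentences.
cls_vectors = [[2.0, 0.0], [-1.0, 0.5], [0.5, 1.0]]
selected = select_rationales(cls_vectors, weights=[1.0, 1.0], bias=0.0)
```

In a real system the embeddings would come from encoding each sentence together with the claim, and the threshold would be tuned on held-out data.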

5.2 Label prediction

Sentences identified by the rationale selector are passed to a separate BERT model to make the final labeling decision. Given a claim $c$ and abstract $a$, we concatenate the claim and the rationale sentences $u = [\hat{s}_1(c, a), \ldots, \hat{s}_\ell(c, a), \text{SEP}, c]$ and predict $\tilde{y}(c, a) = \phi[f(\text{CLS}(u))]$, where $\phi$ is the softmax function, and $f$ is a linear layer with three outputs representing the {Supports, Refutes, NotEnoughInfo} labels. (We truncate the rationale input if it exceeds the BERT token limit. $c$ is never truncated.) We minimize the cross-entropy loss between $\tilde{y}(c, a)$ and the true label $y(c, a)$.

We train the model on pairs of claims and their cited abstracts using gold rationales as input. For abstracts labeled NotEnoughInfo, we randomly choose $k$ sentences from the cited abstract as input rationales. ($k$ is chosen from $\{1, 2\}$ with 0.5 probability.)

When making predictions, we use the predicted rationale sentences $\hat{S}(c, a)$ as input and predict $\hat{y}(c, a) = \arg\max \tilde{y}(c, a)$. The system predicts NotEnoughInfo when given an abstract with no rationale sentences.
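The prediction rule, including the fallback for abstracts with no selected rationale sentences, can be sketched as below. The `logits_fn` argument is a hypothetical stand-in for the BERT encoder plus linear layer; the label order is also an assumption.

```python
import math

LABELS = ["Supports", "Refutes", "NotEnoughInfo"]  # assumed output order

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(rationale_sentences, logits_fn):
    """Return the label with the highest softmax probability.

    logits_fn: hypothetical callable mapping the rationale sentences
               to three logits, one per label.
    """
    if not rationale_sentences:
        return "NotEnoughInfo"  # no rationale sentences were selected
    probs = softmax(logits_fn(rationale_sentences))
    best = max(range(len(LABELS)), key=probs.__getitem__)
    return LABELS[best]

empty_case = predict_label([], lambda s: [9.0, 0.0, 0.0])
refute_case = predict_label(["a rationale sentence"], lambda s: [0.1, 2.0, -1.0])
```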

6 Experiments

In our experiments, we (1) establish a performance baseline on SciFact using VeriSci, (2) analyze the performance of the three components of VeriSci, (3) demonstrate the importance of in-domain training data, and (4) present promising qualitative results on verifying claims about COVID-19 using VeriSci.

6.1 Results

Table 1 shows the full-pipeline performance of VeriSci on the SciFact test set, evaluated using the abstract-level and sentence-level metrics defined in §4. The $\text{F1}_{\text{abstract}}$ value of 46.5 indicates that, for roughly half of the claim-abstract pairs, VeriSci correctly identifies the Supports or Refutes label and provides reasonable evidence to justify the decision. Given the difficulty of the task and limited in-domain training data, we consider this a promising result, while leaving plenty of room for improvement.

Abstract-   Rationale-  Label-        Sentence-level          Abstract-level
Retrieval   Selection   Prediction    P       R       F1      P       R       F1

Oracle experiments
VeriSci     Oracle      Oracle        100.0   68.1    81.0    100.0   68.5    81.3
Oracle      VeriSci     Oracle        75.1    75.7    75.4    100.0   84.2    91.4
Oracle      Oracle      VeriSci       89.6    72.2    79.9    90.1    77.5    83.3
Oracle      VeriSci     VeriSci       66.5    55.7    60.6    84.9    63.5    72.7
VeriSci     Oracle      VeriSci       87.6    49.5    63.2    88.9    54.1    67.2
VeriSci     VeriSci     Oracle        73.5    54.6    62.6    100.0   59.9    74.9

Final system
VeriSci     VeriSci     VeriSci       38.6    40.5    39.5    46.6    46.4    46.5

Table 1: Test set performance of our VeriSci system on SciFact, as measured by the sentence-level and abstract-level performance evaluations defined in §4.2. For the Oracle experiments, the first three columns in the table indicate whether each module in the pipeline has been replaced by an oracle, or uses a VeriSci system component. In Final system, we present the results of the full VeriSci pipeline.

Oracle experiments

To examine the performance of each system component, we run the VeriSci pipeline, replacing some components with “oracles” that always make correct predictions when given correct inputs. (For instance, the “Label” oracle always predicts the correct label as long as it is provided with at least one full gold rationale as input.) The first three rows in Table 1 are double-oracle; in these experiments, we isolate the performance of a single model component together with two oracles. The next three rows are single-oracle, and examine performance using two of the three model components combined with one oracle.

Interestingly, the three pipeline components share similar levels of responsibility for model errors as measured by $\text{F1}_{\text{sentence}}$. The double-oracle models all have $\text{F1}_{\text{sentence}}$ values around 80. The single-oracle models have values around 60, and the final system is roughly 40. Thus, replacing a single oracle with its VeriSci counterpart introduces a loss of roughly 20 $\text{F1}_{\text{sentence}}$. These results suggest that no single module is serving as a performance “bottleneck”; improvements at each stage of the pipeline are likely to improve overall performance.

Training datasets

During model development, we train the RationaleSelection and LabelPrediction modules on four different datasets: Fever, UKP Snopes, SciFact, and Fever pretraining followed by SciFact fine-tuning. The RationaleSelection module is evaluated on its ability to identify rationale sentences given gold abstracts. (Our Fever-trained RationaleSelection module achieves 79.9 sentence-level F1 on the Fever test set, virtually identical to the value of 79.6 reported in DeYoung et al. (2019).) The LabelPrediction module is evaluated on its classification accuracy given gold rationales from evidence abstracts (including evidence documents labeled NotEnoughInfo). The results of these experiments are shown in Table 2. For RationaleSelection, training on SciFact alone produces good results, perhaps because domain-specific lexical cues are sufficient in most cases for identifying rationale sentences. For the more complex reasoning involved in LabelPrediction, domain adaptation was the most effective approach, training first on the large Fever dataset and then the smaller in-domain SciFact training set. Based on these results, we use the RationaleSelection module trained on SciFact only, and the LabelPrediction module trained on Fever + SciFact for our final end-to-end system VeriSci. Additional implementation details can be found in Appendix A.

                   RationaleSelection      LabelPrediction
Training data      P       R       F1      Accuracy
Fever              41.5    57.9    48.4    67.6
UKP Snopes         42.5    62.3    50.5    71.3
SciFact            73.7    70.5    72.1    75.7
Fever + SciFact    72.4    67.2    69.7    81.9

Table 2: Comparison of different training datasets for RationaleSelection and LabelPrediction, evaluated on the SciFact dev set.

6.2 Verifying claims about COVID-19

We conduct exploratory experiments using our system to fact-check claims concerning COVID-19. We ask a medical student to write 36 COVID-related claims. For each claim $c$, we use VeriSci to predict evidence abstracts $\hat{\mathcal{E}}(c)$. The same medical student annotator assigns a label to each $(c, \hat{\mathcal{E}}(c))$ pair. A pair is labeled plausible if at least half of the evidence abstracts in $\hat{\mathcal{E}}(c)$ are judged to have reasonable rationales and labels. It is labeled missed if $\hat{\mathcal{E}}(c) = \emptyset$. Finally, it is labeled implausible if the majority of the abstracts in $\hat{\mathcal{E}}(c)$ have irrelevant rationales or incorrect labels.
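The three-way judgement rule above can be written down directly. This is a small sketch with a hypothetical function name; it assumes each retrieved abstract receives a single boolean judgement from the annotator, so an exact tie counts as plausible under the "at least half" wording.

```python
def judge_claim(abstract_judgements):
    """Map per-abstract judgements for one claim to a claim-level label.

    abstract_judgements: list of booleans, True when an abstract's
    rationales and label were judged reasonable (hypothetical format).
    """
    if not abstract_judgements:
        return "missed"  # no evidence abstracts were predicted
    n_reasonable = sum(abstract_judgements)
    if 2 * n_reasonable >= len(abstract_judgements):
        return "plausible"  # at least half judged reasonable
    return "implausible"

examples = [judge_claim([]),
            judge_claim([True, False]),
            judge_claim([True, False, False])]
```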

Table 3 shows two example claims, both with supporting and refuting evidence identified by VeriSci. For the majority of these COVID-related claims (23 out of 36), the rationales produced by VeriSci were deemed plausible by our annotator, demonstrating that VeriSci is able to successfully retrieve and classify evidence in many cases. An examination of errors reveals that the system can be confused by context, where abstracts are labeled Supports or Refutes even though the rationale sentences reference a different disease or drug from the claim. An example of this is also provided in Table 3.

Claim: Lopinavir / ritonavir have exhibited favorable clinical responses when used as a treatment for coronavirus.
Supports: The 54-year old male is the third patient diagnosed with COVID-19 in Korea … Interestingly, after lopinavir/ritonavir (Kaletra, AbbVie) was administered, -coronavirus viral loads significantly decreased and no or little coronavirus titers were observed.
Refutes: The focused drug repurposing of known approved drugs (such as lopinavir/ritonavir) has been reported failed for curing SARS-CoV-2 infected patients. It is urgent to generate new chemical entities against this virus …
Wrong context: There are no approved treatments for MERS-CoV infection although a combination of lopinavir, ritonavir and interferon beta … In mice, both prophylactic and therapeutic RDV improve pulmonary function and reduce lung viral loads and severe lung pathology.
Claim: The coronavirus cannot thrive in warmer climates.
Supports: …most outbreaks display a pattern of clustering in relatively cool and dry areas…This is because the environment can mediate human-to-human transmission of SARS-CoV-2, and unsuitable climates can cause the virus to destabilize quickly…
Refutes: …significant cases in the coming months are likely to occur in more humid (warmer) climates, irrespective of the climate-dependence of transmission and that summer temperatures will not substantially limit pandemic growth.
Table 3: Results of our system on several claims concerning COVID-19. In some cases, the label is predicted given the wrong context, e.g. the third evidence sentence for the first claim is a finding about Lopinavir, but for the wrong disease (MERS-CoV).

7 Discussion and Future Directions

Though SciFact represents progress in scientific fact-checking, we look forward to making further improvements. In several cases described below, we attempt to collect more fine-grained data for certain subtasks, but are impeded by annotation challenges. We also discuss how the task of scientific fact-checking can be naturally extended to involve evidence synthesis.

7.1 Partial evidence

During pilot annotations for entailment labeling, annotators are instructed to label abstracts as one of Supports, PartiallySupports, NotEnoughInfo, PartiallyRefutes, or Refutes. The Perspectrum dataset Chen et al. (2019) features a similar annotation scheme for annotating evidence in online debates. The Partial categorization is useful in cases like the one shown in Figure 5, where the abstract contains relevant evidence, but the context is different (mouse vs. human). When an annotator selects a Partial label, they are also instructed to edit the claim being verified, making as few changes as possible, such that the evidence would provide full support / contradiction for the edited claim.

Figure 5: An abstract that partially supports a claim. The edited claim is fully supported.

Unfortunately, inter-annotator label agreement is only 0.48 Cohen’s $\kappa$ on this more granular annotation task, largely due to disagreement over the Partial label. This is unsurprising given the subjectivity of the task, and is consistent with the findings from Chen et al. (2019). Based on this low agreement, we completely remove partially-supported claims from the task dataset. (We do make these claims and their edits available as a supplement to the dataset.) Improving agreement on partial labels is part of ongoing work.

7.2 Modeling contextual information

Similarly, for claim verification, we initially instruct annotators to identify primary and supplemental rationale sentences for each rationale. Primary sentences are those that are needed to verify the claim, while supplemental sentences provide important context that is missing from the primary sentences but still necessary for appropriately selecting the Supports or Refutes label. For example, in Figure 1, the claim specifies “in experimental animals,” yet no part of the rationale sentence indicates that its content applies to experimental animals. In this case, another sentence in the rationale abstract supplying information that the experiment was conducted in mice would qualify as a supplemental sentence for this rationale.

We provide some guidance on when and how to select supplemental sentences, such as defining context to be aspects of the claim such as country or population, or instructing annotators to select the first sentence in an abstract that provides the supplementary information. However, agreement on supplemental rationale sentences is low among annotators (Cohen’s $\kappa$ = 0.45). Consequently, we remove supplemental rationale sentences from the task dataset, though we continue to work with annotators on improving agreement.

7.3 Evidence synthesis

Evidence synthesis Marshall et al. (2017) is the task of combining relevant information across different sources to inform decision making. Evaluating the veracity of a scientific statement is challenging, even for human experts. It requires assessing the strength of conflicting evidence from documents of varying degrees of support, credibility, and recency, and synthesizing the results in a meaningful and actionable way.

Evidence synthesis is not a current part of our task definition. Though we do not ask our system to make corpus-level decisions about a claim’s veracity, the extracted evidence and entailment labels produced by VeriSci can naturally be extended for evidence synthesis. However, because performance degrades with each additional pipeline component, further understanding of the scientific fact-checking task and its subtasks is necessary before such a system could be useful in practice. Accurate representations of partial evidence and contextual knowledge are necessary steps towards this goal.

8 Conclusion

Fact checking is important in the scientific domain because it allows us to trace the sources and measure the veracity of scientific claims. These abilities have emerged as particularly important in the context of the reproducibility crisis in science and the rise of disinformation in society. In this article, we formalize the definition of scientific fact checking, and release a dataset (SciFact) and models (VeriSci) to support work on this task.

Scientific fact checking poses a set of unique challenges, pushing the limits of neural models on complex language understanding and reasoning. Domain-adaptation techniques show promise, but our findings suggest that additional work is necessary to improve the performance of end-to-end fact-checking systems. We also demonstrate how fact checking might work in practice, by applying our system to the real-world problem of verifying claims related to COVID-19. We hope that these resources encourage others to pursue and expand upon our work, and to further shed light on the broader and more challenging goal of scientific document understanding.

References


  • R. Baly, M. Mohtarami, J. Glass, L. Màrquez, A. Moschitti, and P. Nakov (2018) Integrating stance detection and fact checking in a unified corpus. In North American Association for Computational Linguistics (NAACL), Cited by: §2.1.
  • A. Barrón-Cedeño, T. Elsayed, R. Suwaileh, L. M. i Villodre, P. Atanasova, W. Zaghouani, S. Kyuchukov, G. D. S. Martino, and P. Nakov (2018) Overview of the clef-2018 checkthat! lab on automatic identification and verification of political claims. task 2: factuality. In CLEF, Cited by: §2.1.
  • I. Beltagy, K. Lo, and A. Cohan (2019) SciBERT: a pretrained language model for scientific text. In Empirical Methods in Natural Language Processing (EMNLP), Cited by: §A.1.
  • S. Chen, D. Khashabi, W. Yin, C. Callison-Burch, and D. Roth (2019) Seeing things from a different angle: discovering diverse perspectives about claims. In North American Association for Computational Linguistics (NAACL), Cited by: §2.1, §7.1, §7.1.
  • A. Cohan, L. Soldaini, and N. Goharian (2015) Matching citation text and cited spans in biomedical literature: a search-oriented approach. In North American Association for Computational Linguistics (NAACL), Cited by: §2.2.
  • L. Derczynski, K. Bontcheva, M. Liakata, R. Procter, G. Wong Sak Hoi, and A. Zubiaga (2017) SemEval-2017 task 8: RumourEval: determining rumour veracity and support for rumours. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Cited by: §2.1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In North American Association for Computational Linguistics (NAACL), Cited by: §1.
  • J. DeYoung, S. Jain, N. F. Rajani, E. Lehman, C. Xiong, R. Socher, and B. C. Wallace (2019) ERASER: a benchmark to evaluate rationalized nlp models. ArXiv abs/1911.03429. Cited by: §1, §5, footnote 12.
  • W. Ferreira and A. Vlachos (2016) Emergent: a novel data-set for stance classification. In North American Association for Computational Linguistics (NAACL), Cited by: §2.1.
  • S. Gururangan, A. Marasovic, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, and N. A. Smith (2020) Don’t stop pretraining: adapt language models to domains and tasks. In Association for Computational Linguistics (ACL), Cited by: §A.1.
  • A. Hanselowski, C. Stab, C. Schulz, Z. Li, and I. Gurevych (2019) A richly annotated corpus for different tasks in automated fact-checking. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), Cited by: §1, §2.1, §3.3.
  • E. Lehman, J. DeYoung, R. Barzilay, and B. C. Wallace (2019) Inferring which medical treatments work from reports of clinical trials. In North American Association for Computational Linguistics (NAACL), Cited by: §2.2.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized bert pretraining approach. ArXiv abs/1907.11692. Cited by: footnote 8.
  • K. Lo, L. L. Wang, M. Neumann, R. Kinney, and D. S. Weld (2020) S2ORC: The Semantic Scholar Open Research Corpus. In Association for Computational Linguistics (ACL), Cited by: §3.1.
  • I. J. Marshall, J. Kuiper, E. Banner, and B. C. Wallace (2017) Automating biomedical evidence synthesis: robotreviewer. Association for Computational Linguistics (ACL). Cited by: §7.3.
  • P. I. Nakov, A. S. Schwartz, and M. Hearst (2004) Citances: citation sentences for semantic analysis of bioscience text. In Proceedings of the SIGIR’04 workshop on Search and Discovery in Bioinformatics, Cited by: §1, §3.2.
  • K. Popat, S. Mukherjee, J. Strötgen, and G. Weikum (2017) Where the truth lies: explaining the credibility of emerging claims on the web and social media. In Proceedings of the International World Wide Web Conference (WWW), Cited by: §2.1.
  • T. Schuster, D. J. Shah, Y. J. S. Yeo, D. Filizzola, E. Santus, and R. Barzilay (2019) Towards debiasing fact verification models. In Empirical Methods in Natural Language Processing (EMNLP), Cited by: §3.2.
  • A. Soleimani, C. Monz, and M. Worring (2019) BERT for evidence retrieval and claim verification. ArXiv abs/1910.02655. Cited by: §5.
  • J. Thorne, A. Vlachos, C. Christodoulopoulos, and A. Mittal (2018) FEVER: a large-scale dataset for fact extraction and verification. In North American Association for Computational Linguistics (NAACL), Cited by: §1, §2.1, §3.3.
  • A. Vlachos and S. Riedel (2014) Fact checking: task definition and dataset construction. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, Cited by: §2.1.
  • L. L. Wang, K. Lo, Y. Chandrasekhar, R. Reas, J. Yang, D. Eide, K. Funk, R. Kinney, Z. Liu, W. Merrill, P. Mooney, D. A. Murdick, D. Rishi, J. Sheehan, Z. Shen, B. Stilson, A. D. Wade, K. Wang, C. Wilhelm, B. Xie, D. M. Raymond, D. S. Weld, O. Etzioni, and S. Kohlmeier (2020) CORD-19: the covid-19 open research dataset. ArXiv abs/2004.10706. Cited by: §1.
  • W. Y. Wang (2017) “liar, liar pants on fire”: a new benchmark dataset for fake news detection. In Association for Computational Linguistics (ACL), Cited by: §2.1.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew (2019) HuggingFace’s transformers: state-of-the-art natural language processing. ArXiv abs/1910.03771. Cited by: Appendix A.
  • M. Yasunaga, J. Kasai, R. Zhang, A. R. Fabbri, I. Li, D. Friedman, and D. R. Radev (2019) ScisummNet: a large annotated corpus and content-impact models for scientific paper summarization with citation networks. In AAAI, Cited by: §2.2.

Appendix A Model implementation details

All models are implemented using the HuggingFace Transformers package Wolf et al. (2019).

A.1 Parameters for the final VeriSci system

For the AbstractRetrieval module, VeriSci retrieves the top k documents ranked by TF-IDF similarity using unigram + bigram features. These parameters are tuned on the SciFact development set.
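The retrieval step can be illustrated with a small, dependency-free sketch of TF-IDF ranking over unigram and bigram features. This is not the VeriSci implementation: the whitespace tokenizer, the smoothed IDF variant, the cosine scoring, the value of k, and the toy abstracts are all assumptions made for illustration.

```python
import math
from collections import Counter

def ngrams(text):
    """Lowercased whitespace tokens plus adjacent-token bigrams."""
    tokens = text.lower().split()
    return tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict) per document, plus the IDF table."""
    counts = [Counter(ngrams(d)) for d in docs]
    df = Counter(term for c in counts for term in c)
    # Smoothed IDF; one of several common variants (an assumption here).
    idf = {t: math.log((1 + len(docs)) / (1 + n)) + 1 for t, n in df.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts], idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def retrieve(claim, docs, k):
    """Indices of the k documents most similar to the claim."""
    doc_vecs, idf = tfidf_vectors(docs)
    # Claim terms that never occur in the corpus receive no weight.
    claim_counts = Counter(ngrams(claim))
    claim_vec = {t: tf * idf[t] for t, tf in claim_counts.items() if t in idf}
    scores = [cosine(claim_vec, d) for d in doc_vecs]
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Hypothetical toy corpus (not taken from SciFact).
abstracts = [
    "citalopram slows amyloid production in mice",
    "statins reduce cardiovascular risk in adults",
    "amyloid plaques accumulate in aging mice",
]
top = retrieve("citalopram reduces amyloid production", abstracts, k=2)
```

Here the bigram "amyloid production" contributes to the first abstract's score in addition to the matching unigrams, which is the kind of signal bigram features add over unigrams alone.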

For both the RationaleSelection and LabelPrediction modules, we experiment with SciBERT Beltagy et al. (2019), BioMedRoBERTa Gururangan et al. (2020), RoBERTa-base, and RoBERTa-large. RoBERTa-large achieves the best development set performance for both subtasks and is used in the final model.

When making predictions using the RationaleSelection module described in §5.1, we find that the usual decision rule of predicting a sentence to be a rationale when its predicted probability is at least 0.5 works well for models trained on SciFact. However, for models trained on Fever and UKP Snopes, we achieve better performance by tuning the classification threshold t on the SciFact dev set, such that a sentence is predicted to be a rationale when its predicted probability is at least t. The best threshold was selected separately for VeriSci trained on Fever and for VeriSci trained on UKP Snopes.
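Tuning a classification threshold on a development set can be sketched as a simple grid search. The selection metric (F1 here), the candidate grid, and the toy probabilities are assumptions for illustration; the text does not specify which metric or grid was used.

```python
def f1_score(gold, pred):
    """F1 for binary labels given as lists of booleans."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def tune_threshold(probs, gold, candidates):
    """Pick the candidate t maximizing dev-set F1 for 'rationale if prob >= t'."""
    best_t, best_f1 = None, -1.0
    for t in candidates:
        pred = [p >= t for p in probs]
        score = f1_score(gold, pred)
        if score > best_f1:  # ties keep the earlier candidate
            best_t, best_f1 = t, score
    return best_t, best_f1

# Hypothetical dev-set scores from a model whose probabilities are
# poorly calibrated for the target domain (not real VeriSci outputs).
dev_probs = [0.40, 0.30, 0.60, 0.10, 0.05, 0.20]
dev_gold = [True, True, True, False, False, False]
best_t, best_f1 = tune_threshold(dev_probs, dev_gold, [0.1, 0.25, 0.5, 0.75])
```

In this toy case the default threshold of 0.5 recovers only one of the three rationale sentences, while a lower threshold separates the two classes cleanly.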

A.2 Training the RationaleSelection module

We experiment with various learning rates when training SciBERT, BioMedRoBERTa, RoBERTa-base, and RoBERTa-large. Below we describe the setting for training RoBERTa-large.

For models trained on SciFact, we use an initial learning rate of 1e-5 on the transformer base and 1e-3 on the linear layer. For Fever + SciFact, the learning rate is set to 1e-5 for the entire model for pre-training on Fever and fine-tuning on SciFact. We use a batch size of 256 through gradient accumulation and apply cosine learning rate decay over 20 epochs to find the best performing model on the dev set.
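The schedule described above can be sketched as follows. The learning rates, the 20 epochs, and the effective batch size of 256 come from the text; the per-epoch granularity, decay to zero without warmup, and the micro-batch size of 8 are assumptions, since the text does not specify them.

```python
import math

def cosine_decay_lr(initial_lr, epoch, total_epochs):
    """Learning rate at the start of `epoch` (0-indexed), decaying to zero."""
    return 0.5 * initial_lr * (1 + math.cos(math.pi * epoch / total_epochs))

def accumulation_steps(effective_batch, micro_batch):
    """Micro-batches whose gradients are summed before one optimizer step."""
    if effective_batch % micro_batch:
        raise ValueError("effective batch must be a multiple of the micro-batch")
    return effective_batch // micro_batch

# From the text: 1e-5 for the transformer base, 1e-3 for the linear layer,
# 20 epochs, effective batch size 256.
# Assumed for illustration: a micro-batch of 8 examples per forward pass.
schedule = [
    (cosine_decay_lr(1e-5, e, 20), cosine_decay_lr(1e-3, e, 20))
    for e in range(20)
]
steps = accumulation_steps(256, 8)
```

Each entry of `schedule` pairs the transformer-base rate with the linear-layer rate for one epoch, so the two parameter groups decay along the same curve while keeping their ratio fixed.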

For models trained on Fever, we set the learning rate to 0.5e-5 for the transformer base and 0.5e-4 for the linear layer. For models trained on UKP Snopes, we set the learning rate to 1e-5 for the transformer base and 1e-4 for the linear layer. We find that these learning rates help the models converge. We train these models for only 5 epochs because Fever and UKP Snopes are larger datasets and the models converge within the first 5 epochs.

A.3 Training the LabelPrediction module

We adopt settings similar to those used for the RationaleSelection module, changing only the learning rate, which is set to 1e-5 for the transformer base and 1e-4 for the linear layer for models trained on SciFact, Fever, and UKP Snopes.

Appendix B Detailed corpus statistics

We compute statistics separately for structured abstracts, abstracts that are organized into well-defined sections, and for unstructured abstracts.

Table 4 provides statistics summarizing the lengths of abstracts and rationales. Table 5 shows the counts for each claim-abstract label category in the train, dev, and test sets. Table 6 shows the number of evidence documents supporting each claim. The majority of claims are supported by a single evidence document.

Figure 6(a) shows the distribution of the number of rationales in structured and unstructured abstracts. Structured abstracts are more likely to have two evidence sets – for instance, one in the “results” section, and one in the “conclusions” section. Figure 6(b) shows the distribution of sentences per rationale.

Figure 7 shows the fraction of sentences in each abstract that are part of a rationale. Unstructured abstracts have a heavier “right tail”, representing cases where the abstract is short and the entire abstract supports the claim.
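The rationale fraction plotted here can be computed as in the sketch below. The data layout (a sentence count plus a list of rationales, each a list of sentence indices) and the toy abstracts are assumptions for illustration, not the SciFact file format.

```python
from statistics import median

def rationale_fraction(n_sentences, rationales):
    """Fraction of an abstract's sentences that belong to at least one rationale."""
    covered = {i for rationale in rationales for i in rationale}
    return len(covered) / n_sentences

# Hypothetical abstracts: (number of sentences, rationales as sentence indices).
abstracts = [
    (7, [[2]]),              # one single-sentence rationale
    (13, [[8], [11, 12]]),   # two rationales, e.g. results and conclusions
    (3, [[0, 1, 2]]),        # short abstract that supports the claim in full
]
fractions = [rationale_fraction(n, r) for n, r in abstracts]
median_fraction = median(fractions)
```

The third toy abstract has a fraction of 1.0, which is the kind of case that produces the heavier right tail for unstructured abstracts.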

MeSH terms for evidence documents appear in Figure 8. Terms like Human, Risk factors, and Treatment outcome are common to randomized controlled trial reports. Terms like DNA, RNA, and Cell differentiation indicate molecular biology research.

Statistic                     Unstructured   Structured
# evidence abstracts          340            146
Abstract length (median)      7              13
Number of rationales (mean)   1.53           1.89
Rationale fraction (median)   0.14           0.17

Table 4: Summary statistics on the abstracts in the corpus. The Abstract length is measured in number of sentences. The Rationale fraction is the fraction of sentences in each abstract that are rationales.

Split   Supports   NotEnoughInfo   Refutes   Total
Train   332        304             173       809
Dev     124        112             64        300
Test    100        100             100       300

Table 5: Distribution of claim-abstract labels by train-dev-test split in SciFact.

[Table with columns: # evidence abstracts, Count]

Table 6: The number of evidence documents supporting each claim. For instance, 37 claims are supported by 2 documents.
(a) The number of rationales supporting each claim. For instance, roughly 150 unstructured abstracts contain 2 rationales.
(b) The number of sentences per rationale.
Figure 6: Statistics on rationales. Most claims are supported by a single rationale, but structured abstracts are more likely to have two rationales. The great majority of rationales are a single sentence.
Figure 7: Distribution showing the fraction of each abstract (in sentences) that is part of a rationale.
Figure 8: Fraction of evidence abstracts in which each MeSH term occurs.

[Table with columns: Journal, Structured, Unstructured, Total]

Table 7: Number of cited documents by journal.

Appendix C Annotation interfaces

The claim and evidence interfaces are shown in Figures 9 and 10.

Figure 9: The claim-writing interface. The citation sentence is highlighted in blue on the top left. Additional context is provided on bottom left. The right side shows two claims that could be written based on this citation sentence.
Figure 10: The evidence collection interface.