Misinformation has High Perplexity

06/08/2020 ∙ by Nayeon Lee, et al. ∙ The Hong Kong University of Science and Technology

Debunking misinformation is an important and time-critical task as there could be adverse consequences when misinformation is not quashed promptly. However, the usual supervised approach to debunking via misinformation classification requires human-annotated data and is not suited to the fast time-frame of newly emerging events such as the COVID-19 outbreak. In this paper, we postulate that misinformation itself has higher perplexity compared to truthful statements, and propose to leverage the perplexity to debunk false claims in an unsupervised manner. First, we extract reliable evidence from scientific and news sources according to sentence similarity to the claims. Second, we prime a language model with the extracted evidence and finally evaluate the correctness of given claims based on the perplexity scores at debunking time. We construct two new COVID-19-related test sets, one scientific and the other political in content, and empirically verify that our system performs favorably compared to existing systems. We are releasing these datasets publicly to encourage more research in debunking misinformation on COVID-19 and other topics.




1 Introduction

Debunking misinformation is a process of exposing the falseness of given claims based on relevant evidence. The failure to debunk misinformation in a timely manner can result in catastrophic consequences, as illustrated by the recent death of a man who tried to self-medicate with chloroquine phosphate to prevent COVID-19 Vigdor (2020). Amid the COVID-19 pandemic and infodemic, the need for an automatic debunking system has never been more dire.

Figure 1: Proposed approach of using the language model (LM) as a debunker. We prime an LM with extracted evidence relevant to the whole set of claims, and then we compute the perplexities during the debunking stage.

A lack of data is the major obstacle for debunking misinformation related to newly emerged events like the COVID-19 pandemic. There are no labeled data available to adopt supervised deep-learning approaches Etzioni et al. (2008); Wu et al. (2014); Ciampaglia et al. (2015); Popat et al. (2018); Alhindi et al. (2018), and an insufficient amount of social- or meta-data to apply feature-based approaches Long et al. (2017); Karimi et al. (2018); Kirilin and Strube (2018). To overcome this bottleneck, we propose an unsupervised approach of using a large language model (LM) as a debunker.

Misinformation is a piece of text that contains false information regarding its subject, resulting in a rare co-occurrence of the subject and its neighboring words in a truthful corpus. When a language model primed with truthful knowledge is asked to reconstruct the missing words in a claim, such as “5G communication transmits … ”, it would predict the word “information” with the highest probability. On the other hand, it would predict the word “COVID-19” in a false claim such as “5G communication transmits COVID-19” with very low probability. It follows that truthful statements would give low perplexity whereas false claims tend to have high perplexity, when scored by a truth-grounded language model. Since perplexity is a score for quantifying the likelihood of a given sentence based on a previously encountered distribution, we propose a novel interpretation of perplexity as a degree of falseness.

To further address the problem of data scarcity, we leverage large pre-trained LMs, such as GPT-2 Radford et al. (2019), which have been shown to be helpful in low-resource settings by allowing the transfer of features learned from their huge training corpora Devlin et al. (2018); Radford et al. (2018); Liu et al. (2019); Lewis et al. (2019). Prior work has also illustrated the potential of LMs to learn useful knowledge without any explicit task-specific supervision and to perform well on tasks such as question answering and summarization Radford et al. (2019); Petroni et al. (2019).

Moreover, it is crucial to ensure that the LM is populated with “relevant and clean evidence” before assessing the claims, especially when these are related to newly emerging events. There are two main ways of obtaining evidence in fact-checking tasks. One way is to rely on evidence from a structured knowledge base such as Wikipedia or a knowledge graph Wu et al. (2014); Ciampaglia et al. (2015); Thorne et al. (2018); Yoneda et al. (2018b); Nie et al. (2019). Another approach is to obtain evidence directly from unstructured data online Etzioni et al. (2008); Magdy and Wanas (2010). However, the former approach faces a challenge in maintaining up-to-date knowledge, making it vulnerable to unprecedented outbreaks. On the other hand, the latter approach suffers from credibility issues and noisy evidence. In our work, we attempt to combine the best of both worlds in the evidence selection step by extracting evidence from unstructured data and ensuring quality by filtering noise.

The contribution of our work is threefold. First, we propose a novel way of using language model perplexity to debunk false claims, as illustrated in Figure 1. It is not only data efficient but also achieves promising results comparable to supervised baseline models. We also carry out qualitative analysis to understand the optimal ways of exploiting perplexity as a debunker. Second, we present an additional evidence-filtering step to improve the evidence quality, which consequently improves the overall debunking performance. Finally, we construct and release two new COVID-19-pandemic-related debunking test sets.

| Label | Claim | Perplexity |
|---|---|---|
| False | Ordering or buying products shipped from overseas will make a person get COVID-19. | 556.2 |
| False | Sunlight actually can kill the novel COVID-19. | 385.0 |
| False | 5G helps COVID-19 spread. | 178.2 |
| False | Home remedies can cure or prevent the COVID-19. | 146.2 |
| True | The main way the COVID-19 spreads is through respiratory droplets. | 5.8 |
| True | The most common symptoms of COVID-19 are fever, tiredness, and dry cough. | 6.0 |
| True | The source of SARS-CoV-2, the coronavirus (CoV) causing COVID-19 is unknown. | 8.1 |
| True | Currently, there is no vaccine and no specific antiviral medicine to prevent or treat COVID-19. | 8.4 |

Table 1: Relationship between claims and perplexity. False claims have higher perplexity compared to True claims.

2 Motivation

Language Modeling

Given a sequence of tokens $X = \{x_0, \dots, x_n\}$, language models are trained to compute the probability of the sequence, $p(X)$. This probability is factorized as a product of conditional probabilities by applying the chain rule Manning et al. (1999); Bengio et al. (2003):

$$p(X) = \prod_{i=1}^{n} p(x_i \mid x_0, \dots, x_{i-1})$$

In recent years, large transformer-based Vaswani et al. (2017) language models have been trained to minimize the negative log-likelihood over a large collection of documents.

Leveraging Perplexity

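Perplexity, a commonly used metric for measuring the performance of an LM, is defined as the inverse of the probability of the test set normalized by the number of words. For a sequence $X = \{x_0, \dots, x_n\}$:

$$PPL(X) = \sqrt[n]{\prod_{i=1}^{n} \frac{1}{p(x_i \mid x_0, \dots, x_{i-1})}}$$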

Another way of interpreting perplexity is the measure of the likelihood of a given test sentence in reference to the training corpus. Based on this intuition, we hypothesize the following:

“When a language model is primed with a collection of relevant evidence about given claims, the perplexity can serve as an indicator for falseness.”

The rationale is that the evidence sentences extracted for a True claim would share more similarities (e.g., common terms or synonyms) with the claim itself. As a result, True claims have lower perplexity, while False claims retain higher perplexity.
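The definition of perplexity above can be made concrete with a few lines of Python. This is a minimal sketch, assuming we already have the per-token conditional probabilities assigned by some language model; the probability values below are hypothetical and chosen only to show how a single improbable token inflates the score.

```python
import math

def sequence_log_prob(cond_probs):
    """Log-probability of a sequence via the chain rule:
    the sum of the log conditional probabilities of its tokens."""
    return sum(math.log(p) for p in cond_probs)

def perplexity(cond_probs):
    """Inverse sequence probability, normalized by the number of tokens.
    Computed in log space for numerical stability."""
    n = len(cond_probs)
    return math.exp(-sequence_log_prob(cond_probs) / n)

# Hypothetical per-token probabilities (not taken from the paper).
plausible_claim = [0.5, 0.4, 0.6, 0.5]
implausible_claim = [0.5, 0.4, 0.6, 1e-6]  # same prefix, one very unlikely token

print(perplexity(plausible_claim))
print(perplexity(implausible_claim))
```

Working in log space avoids underflow when the product of many small probabilities is taken directly.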

3 Methodology

Task Definition

Debunking is the task of exposing the falseness of a given claim by extracting relevant evidence and verifying the claim against it. More formally, given a claim with its corresponding source documents, the task of debunking is to assign a label from {True, False} by retrieving and utilizing a relevant set of evidence from those documents.

Our approach involves three steps in the inference phase: 1) Evidence selection to retrieve the most relevant evidence from the source documents. 2) Evidence grounding step to obtain our evidence-primed language model (LM Debunker). 3) Debunking step to obtain perplexity scores from the evidence-primed LM Debunker and debunking labels.

3.1 Evidence Selection

Given a claim, our Evidence Selector retrieves the top-3 relevant evidence in the following two steps.

Evidence Candidate Selection

Given the source documents, we select the top-10 most relevant evidence sentences for each claim. Regardless of the domain of the claim and source documents, we rely on a generic TF-IDF method to select tuples of evidence candidates with their corresponding relevancy scores. Note that any evidence extractor can be used here.
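The candidate selection step can be sketched as a small TF-IDF retriever. This is an illustrative sketch only: the tokenizer, the smoothed IDF formula, the cosine-similarity scoring, and the function names are assumptions made here, not details given in the paper.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using a smoothed IDF."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def select_candidates(claim, sentences, top_k=10):
    """Return up to top_k (sentence, relevancy score) tuples, best first."""
    vectors = tfidf_vectors(list(sentences) + [claim])
    claim_vec = vectors[-1]
    scored = [(s, cosine(claim_vec, v)) for s, v in zip(sentences, vectors[:-1])]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

claim = "The main way the COVID-19 spreads is through respiratory droplets."
sentences = [
    "The main mode of COVID-19 transmission is via respiratory droplets.",
    "Bananas are rich in potassium.",
    "Respiratory droplets spread at a close distance.",
]
for sentence, score in select_candidates(claim, sentences, top_k=10):
    print(round(score, 3), sentence)
```

Since the text notes that any evidence extractor can be used, this component is interchangeable; the only contract the later steps rely on is a list of (evidence, score) tuples.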

Evidence Filtering

After selecting the top candidate tuples for the claim, we attempt to filter out noisy and unreliable evidence based on the following rules:

1) When an evidence candidate is a quote from a low-credibility speaker such as an Internet meme (an idea, image, or video that is spread very quickly on the internet) or a social media post, we discard it (e.g., “quote according to a social media post.”). Note that this approach leverages the speaker information inherent in the extracted evidence; 2) If a speaker of the claim is available, any quote or statement by the speaker him/herself is inadmissible to the evidence candidate pool; 3) Any evidence identical to the given claim is considered to be “invalid” evidence (i.e., a direct quotation of a true/false claim); 4) We filter out reciprocal questions, which, from our observation, only add noise and carry no supporting or contradicting information about the claim. Examples of evidence before and after this filtering are shown in the Appendix.

The final top-3 evidence is selected after the filtering based on the provided extractor score. An example of a claim and its corresponding extracted evidence is shown in Table 2.

Claim: The main way the COVID-19 spreads is through respiratory droplets. Label: True
Evidence 1: The main mode of COVID-19 transmission is via respiratory droplets, although the
potential of transmission by opportunistic airborne routes via aerosol-generating procedures
in health care facilities, and environmental factors, as in the case of Amoy Gardens, is known.
Evidence 2: The main way that influenza viruses are spread is from person to person via
virus-laden respiratory droplets (particles with size ranging from 0.1 to 100 μm in diameter) that
are generated when infected persons cough or sneeze.
Evidence 3: The respiratory droplets spread can occur only through direct person-to-person
contact or at a close distance.
Table 2: Illustration of evidence extracted using our Evidence Selector
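The four filtering rules and the final top-3 selection described in this section could be sketched as follows. The candidate record layout (`text`, `score`, `speaker`), the list of low-credibility speaker strings, and the question test (a trailing question mark) are all hypothetical simplifications introduced for illustration; the scores are made up.

```python
# Hypothetical labels for low-credibility speakers (rule 1).
LOW_CREDIBILITY_SPEAKERS = {"internet meme", "social media post"}

def normalize(text):
    """Lowercase, collapse whitespace, and drop surrounding punctuation."""
    return " ".join(text.lower().split()).strip(" .!?")

def is_admissible(candidate, claim, claim_speaker=None):
    speaker = (candidate.get("speaker") or "").lower()
    text = candidate["text"]
    if speaker in LOW_CREDIBILITY_SPEAKERS:                # rule 1
        return False
    if claim_speaker and speaker == claim_speaker.lower():  # rule 2
        return False
    if normalize(text) == normalize(claim):                 # rule 3
        return False
    if text.strip().endswith("?"):                          # rule 4 (simplified)
        return False
    return True

def select_evidence(candidates, claim, claim_speaker=None, top_k=3):
    """Filter candidates, then keep the top_k by extractor score."""
    kept = [c for c in candidates if is_admissible(c, claim, claim_speaker)]
    kept.sort(key=lambda c: c["score"], reverse=True)
    return kept[:top_k]

claim = "Taking ibuprofen worsens symptoms of COVID-19."
candidates = [
    {"text": "There is no current evidence indicating that ibuprofen worsens "
             "the clinical course of COVID-19.", "score": 0.71, "speaker": None},
    {"text": "Taking ibuprofen worsens symptoms of COVID-19.",
     "score": 0.99, "speaker": None},
    {"text": "Ibuprofen makes the virus ten times stronger.",
     "score": 0.65, "speaker": "social media post"},
    {"text": "Does ibuprofen worsen COVID-19?", "score": 0.60, "speaker": None},
    {"text": "Ibuprofen is a nonsteroidal anti-inflammatory drug.",
     "score": 0.40, "speaker": None},
]
for evidence in select_evidence(candidates, claim):
    print(evidence["score"], evidence["text"])
```

Filtering before truncating to the top 3 matters: the verbatim copy of the claim has the highest extractor score here and would otherwise crowd out usable evidence.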

3.2 Grounding LM with Evidence

For the purpose of priming the language model, all the extracted evidence for a batch of claims is aggregated as $E$. We obtain our evidence-grounded language model (LM Debunker) by minimizing the following loss $\mathcal{L}(E)$:

$$\mathcal{L}(E) = -\sum_{i=1}^{n} \log p_\theta(x_i \mid x_0, \dots, x_{i-1})$$

where $x_i$ denotes a token in the evidence $E$, and $\theta$ the parameters of the language model. It is important to highlight that none of the debunking labels or claims are involved in this evidence grounding step and that our proposed methodology is model agnostic.
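To make the grounding step tangible without a large model, the sketch below swaps GPT-2 for a toy count-based bigram model with add-alpha smoothing. This is explicitly a stand-in: "grounding" here means counting bigrams in the evidence rather than gradient updates on a pre-trained network, and the class name, smoothing constant, and example sentences are assumptions. What it does share with the method is the shape of the loss (a sum of negative log conditional probabilities) and the use of evidence only, with no claims or labels.

```python
import math
from collections import Counter

class ToyBigramLM:
    """Add-alpha smoothed bigram model; a toy stand-in, not the paper's GPT-2."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.bigrams = Counter()
        self.contexts = Counter()
        self.vocab = set()

    @staticmethod
    def tokenize(sentence):
        return ["<s>"] + sentence.lower().split()

    def ground(self, evidence_sentences):
        """'Prime' the model by counting bigrams in the evidence only."""
        for sentence in evidence_sentences:
            tokens = self.tokenize(sentence)
            self.vocab.update(tokens[1:])
            for prev, tok in zip(tokens, tokens[1:]):
                self.bigrams[(prev, tok)] += 1
                self.contexts[prev] += 1

    def cond_prob(self, prev, tok):
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen tokens
        numerator = self.bigrams[(prev, tok)] + self.alpha
        return numerator / (self.contexts[prev] + self.alpha * vocab_size)

    def loss(self, sentence):
        """Negative log-likelihood: -sum_i log p(x_i | x_{i-1})."""
        tokens = self.tokenize(sentence)
        return -sum(math.log(self.cond_prob(p, t))
                    for p, t in zip(tokens, tokens[1:]))

    def perplexity(self, sentence):
        n = len(sentence.split())
        return math.exp(self.loss(sentence) / n)

evidence = [
    "The main mode of COVID-19 transmission is via respiratory droplets",
    "The respiratory droplets spread can occur only through direct person-to-person contact",
]
lm = ToyBigramLM()
lm.ground(evidence)

true_claim = "COVID-19 spreads through respiratory droplets"
false_claim = "COVID-19 spreads through 5G networks"
print(lm.perplexity(true_claim), lm.perplexity(false_claim))
```

The two claims differ only in their last two words; the pair "respiratory droplets" was seen in the evidence while "5G networks" was not, so the true claim receives the lower perplexity in this toy setting.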

3.3 Debunking with Perplexity

The last step is to obtain debunking labels based on the perplexity values from the LM Debunker. As shown in Table 1, perplexity values reveal a pattern that aligns with our hypothesis regarding its association with falseness; the false claims have higher perplexity than the true claims (for more examples of perplexity values, refer to the Appendix). Perplexity scores can be translated to debunking labels by comparing to a perplexity threshold that defines the False boundary in the perplexity space. Any claim with a perplexity score higher than the threshold is classified as False, and vice versa for True.
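The thresholding rule amounts to a one-line comparison. In the sketch below, the perplexity values are those listed in Table 1, while the threshold of 50.0 is a hypothetical value picked for illustration, not one reported in the paper.

```python
def debunk(perplexity, threshold):
    """Label a claim False if its perplexity exceeds the threshold, else True."""
    return "False" if perplexity > threshold else "True"

# (claim, perplexity) pairs taken from Table 1.
table1 = [
    ("5G helps COVID-19 spread.", 178.2),
    ("Home remedies can cure or prevent the COVID-19.", 146.2),
    ("The main way the COVID-19 spreads is through respiratory droplets.", 5.8),
    ("The most common symptoms of COVID-19 are fever, tiredness, and dry cough.", 6.0),
]

HYPOTHETICAL_THRESHOLD = 50.0  # assumed for illustration only

for claim, ppl in table1:
    print(debunk(ppl, HYPOTHETICAL_THRESHOLD), ppl, claim)
```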

The optimal method of selecting the hyper-parameter threshold is an open research question. From an application perspective, any value can serve as a threshold depending on the desired level of “strictness” towards false claims. We define “strictness” as the degree of precaution towards false negative errors, which are the most undesirable form of error in debunking (refer to Section 7 for details). From an experimental analysis perspective, a small validation set could be leveraged for hyper-parameter tuning of the threshold. In this paper, since we have small test sets, we do k-fold cross-validation to obtain the average performance reported in Section 6.
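One way to combine threshold tuning with k-fold cross-validation is sketched below. The text does not spell out which split is used for tuning, so this sketch assumes the conventional arrangement: tune the threshold on k-1 folds by maximizing accuracy, then score the held-out fold. The candidate thresholds (midpoints between adjacent perplexity values) and the fold assignment are also assumptions. The eight data points are the Table 1 examples, so the resulting number only demonstrates the mechanics and says nothing about real performance.

```python
def accuracy(scored, threshold):
    """scored: list of (perplexity, is_false) pairs."""
    hits = sum((ppl > threshold) == is_false for ppl, is_false in scored)
    return hits / len(scored)

def candidate_thresholds(scored):
    """Midpoints between adjacent distinct perplexity values."""
    values = sorted({ppl for ppl, _ in scored})
    return [(a + b) / 2 for a, b in zip(values, values[1:])]

def tune_threshold(scored):
    """Pick the candidate threshold with the best accuracy on the tuning split."""
    return max(candidate_thresholds(scored), key=lambda t: accuracy(scored, t))

def cross_validate(scored, k):
    """Average held-out accuracy over k folds (round-robin fold assignment)."""
    folds = [scored[i::k] for i in range(k)]
    fold_accuracies = []
    for i, held_out in enumerate(folds):
        tuning = [x for j, fold in enumerate(folds) if j != i for x in fold]
        threshold = tune_threshold(tuning)
        fold_accuracies.append(accuracy(held_out, threshold))
    return sum(fold_accuracies) / len(fold_accuracies)

# Perplexities from Table 1: first four claims are False, last four are True.
TABLE1 = [(556.2, True), (385.0, True), (178.2, True), (146.2, True),
          (5.8, False), (6.0, False), (8.1, False), (8.4, False)]

print(cross_validate(TABLE1, k=4))
```

Averaging the held-out accuracy, rather than averaging the per-fold thresholds, matches the paper's stated use of cross-validation to obtain average performance.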

4 Dataset

4.1 COVID19 Related Test Sets


Covid19-scientific

A new test set is constructed by collecting COVID-19-related myths and scientific truths labeled by reliable sources like MedicalNewsToday, Centers for Disease Control and Prevention (CDC), and World Health Organization (WHO). It consists of the most common scientific or medical myths about COVID-19, which must be debunked correctly to ensure the safety of the public (e.g., “drinking a bleach solution will prevent you from getting the COVID-19.”). There are 142 claims (Table 3) with labels obtained from the aforementioned reliable sources. According to WHO and CDC, some myths are unverifiable from current findings, and we assigned False labels to them (e.g., “The coronavirus will die off when temperatures rise in the spring.”).


Covid19-politifact

Another test set is constructed by crawling COVID-19-related claims fact-checked by journalists from a website called Politifact (https://www.politifact.com/). Unlike the Covid19-scientific test set, it contains non-scientific and political claims such as “For the coronavirus, the death rate in Texas, per capita of 29 million people, we’re one of the lowest in the country”. Such political claims may not be life-and-death matters, but they still have the potential to bring negative sociopolitical effects. Originally, these claims are labeled into six classes {pants-fire, false, barely-true, half-true, mostly-true, true}, which represent a decreasing degree of fakeness. We use a binary setup for consistency with our setup for Covid19-scientific by assigning the first three classes as False and the rest as True. For detailed data statistics, refer to Table 3.

| Test sets | False | True | Total |
|---|---|---|---|
| Covid19-scientific | 101 | 41 | 142 |
| Covid19-politifact | 263 | 77 | 340 |

Table 3: COVID-19 Related Test Set Statistics
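The six-to-two label conversion used for Covid19-politifact is a simple lookup. The sketch below follows the split stated in the text (first three classes map to False, the rest to True); the dictionary and function names are made up for illustration.

```python
# Politifact's six classes, ordered by decreasing degree of fakeness.
POLITIFACT_TO_BINARY = {
    "pants-fire": "False",
    "false": "False",
    "barely-true": "False",
    "half-true": "True",
    "mostly-true": "True",
    "true": "True",
}

def to_binary(politifact_label):
    """Map an original Politifact rating to the binary debunking label."""
    return POLITIFACT_TO_BINARY[politifact_label.strip().lower()]

print(to_binary("Pants-Fire"), to_binary("half-true"))
```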

Gold Source Documents

Different gold source documents are used depending on the domain of the test sets. For Covid19-scientific, we use the CORD-19 dataset (https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge), a free open research resource for combating the COVID-19 pandemic. It is a resource of over 59,000 scholarly articles, including over 47,000 with full text, about COVID-19, SARS-CoV-2, and other related coronaviruses.

For the Covid19-politifact, we leverage the resources of the Politifact website. When journalists verify the claims on Politifact, they provide pieces of text that contain: i) the snippets of relevant information from various sources, and ii) a paragraph of their justification for the verification decisions. We only take the first part (i) to be our gold source documents, to avoid using explicit verdicts on test claims as evidence.

| Models | Covid19-scientific Accuracy | Covid19-scientific F1-Macro | Covid19-scientific F1-Binary | Covid19-politifact Accuracy | Covid19-politifact F1-Macro | Covid19-politifact F1-Binary |
|---|---|---|---|---|---|---|
| Fever-HexaF | 64.8% | 58.1% | 74.8% | 46.6% | 37.9% | 61.2% |
| LiarPlusMeta | 42.3% | 41.1% | 32.8% | 80.3% | 66.8% | 86.5% |
| LiarPlus | 45.1% | 44.8% | 44.9% | 54.4% | 52.8% | 61.5% |
| LiarOurs | 61.5% | 59.2% | 82.8% | 78.5% | 67.7% | 86.4% |
| LM Debunker | 75.4% | 69.8% | 83.1% | 74.4% | 58.8% | 84.2% |

Table 4: Result comparison of our LM Debunker and baselines on two COVID-19 related test sets. Blue highlights represent the models tested in out-of-distribution setting (i.e., train set and test set are from different distribution). Note that all accuracy scores are statistically significant.

5 Experiments

5.1 Baseline Models

Although unrelated to COVID-19 misinformation, there are notable state-of-the-art (SOTA) models and their associated datasets in the fact-based misinformation field.

Fever Thorne et al. (2018)

Fact Extraction and Verification (FEVER) is a publicly released large-scale dataset generated by altering sentences extracted from Wikipedia to promote research in fact-checking systems. As our first baseline model, we use one of the winning systems from the FEVER workshop (https://fever.ai/2018/task.html). We use the 2nd-place team's system because we had problems running the 1st-place team’s codebase; note that the difference in accuracy between the 1st and 2nd teams is minimal.

  • Fever-HexaF Yoneda et al. (2018a): Given a claim and a set of evidence, it leverages a natural language inference model to get entailment scores between the claim and each piece of evidence, and obtains the final prediction label by aggregating the entailment scores using a Multi-Layer Perceptron (MLP).

LIAR-Politifact Wang (2017)

LIAR is a publicly available dataset collected from the Politifact website, which consists of 12.8k claims. The label setup is the same as our Covid19-politifact test set, and the data characteristics are very similar, but LIAR does not contain any claims related to COVID-19.

We also report three strong BERT-based Devlin et al. (2018) baseline models trained on LIAR data:

  • LiarPlusMeta: Our BERT-large-based replication of the SOTA paper from Alhindi et al. It uses meta-information and the “justification” (human-written reasoning for the verification decision in the Politifact article) as evidence for the claim. Our replication is a more robust baseline, outperforming the reported SOTA accuracy (refer to Appendix for detailed result table).

  • LiarPlus: Similar to the LiarPlusMeta model, but without meta-information. Our replication also outperforms the SOTA in accuracy. This baseline is important because the focus of this paper is to explore the debunking ability in a data-scarce setting, where meta-information may not exist.

  • LiarOurs: Another BERT-large model fine-tuned on LIAR-Politifact claims with evidence from our Evidence Selector, instead of using human-written “justification.”

5.2 Experiment Settings

Out-of-distribution Setting

For the Covid19-scientific test set, all models are tested in the out-of-distribution (OOD) setting because the test set comes from a different distribution than all the train sets used in the baseline models: the Fever-HexaF model is trained on the FEVER dataset, and all the Liar baseline models are trained on the LIAR-Politifact dataset. For the Covid19-politifact test set, the Fever-HexaF model is again tested in the OOD setting. However, the Liar models are not, because both their train sets and the Covid19-politifact test set come from a similar distribution (Politifact). We use blue highlights in Table 4 to indicate models tested in OOD settings.

Evidence Input for Testing

Recalling the task definition explained in Section 3, we test a claim with its relevant evidence. To make fair comparisons among all baseline models and our LM Debunker, we use the same evidence extracted in the Evidence Selector step while evaluating the models on the COVID-19-related test sets.

Language Model Setting

For our experiments, the GPT-2 model Wolf et al. (2019) is selected as our base language model to build LM Debunker. We use the pre-trained GPT-2 model (base), with 117 million parameters. Since the COVID-19 pandemic is a recent event, it is guaranteed that GPT-2 has not seen any COVID-19 related information during its pre-training. Thus, it is a clean and unbiased language model with which to test our hypothesis.

Covid19-politifact

| Train Size | LiarPlusMeta Accuracy | LiarPlusMeta F1-Macro | LiarPlusMeta F1-Binary | LiarOurs Accuracy | LiarOurs F1-Macro | LiarOurs F1-Binary |
|---|---|---|---|---|---|---|
| 0 | N/A | N/A | N/A | N/A | N/A | N/A |
| 10 | 22.6% | 18.5% | 0.0% | 22.6% | 18.5% | 0.0% |
| 100 | 22.6% | 18.5% | 0.0% | 22.6% | 18.5% | 0.0% |
| 500 | 73.5% | 65.2% | 82.2% | 64.4% | 59.4% | 73.6% |
| 1000 | 72.4% | 63.4% | 81.5% | 70.9% | 64.2% | 79.7% |
| 10000 (All) | 80.3% | 66.8% | 86.5% | 78.5% | 67.8% | 86.4% |
| LM Debunker | 74.4% | 58.8% | 84.2% | 74.4% | 58.8% | 84.2% |

Table 5: Performance comparison between our LM Debunker and two baseline models trained on varying train set sizes. LiarPlusMeta and LiarOurs have shown the best performance on Covid-politifact test set in accuracy and F1-Macro, respectively. Gray highlights represent the first scores that surpass the LM Debunker scores.

5.3 Experiment Details

We evaluate the performance of LM Debunker by comparing it to other baselines on two commonly used metrics: accuracy and F1-Macro score. Since identifying False claims is important in debunking, we also report the F1 of the False class (F1-Binary).
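The three reported metrics can be computed as in the sketch below (function names are illustrative). As a sanity check, it evaluates a degenerate classifier that labels every Covid19-politifact claim as True, using the class counts from Table 3 (263 False, 77 True). The resulting 22.6% accuracy, 18.5% F1-Macro, and 0.0% F1-Binary are consistent with the smallest-train-size rows of Table 5, which suggests, though the text does not state it, that those models collapsed to a single class.

```python
def f1_for_class(gold, pred, cls):
    """F1 score treating `cls` as the positive class."""
    tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
    fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def evaluate(gold, pred):
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    f1_false = f1_for_class(gold, pred, "False")
    f1_true = f1_for_class(gold, pred, "True")
    return {
        "accuracy": accuracy,
        "f1_macro": (f1_false + f1_true) / 2,  # unweighted mean over both classes
        "f1_binary": f1_false,                 # F1 of the False class
    }

# Class counts for Covid19-politifact from Table 3.
gold = ["False"] * 263 + ["True"] * 77
all_true = ["True"] * len(gold)

scores = evaluate(gold, all_true)
print({name: round(100 * value, 1) for name, value in scores.items()})
```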

Recall that we report average results obtained via k-fold cross-validation, with a separate threshold used in each fold for Covid-politifact and for Covid-scientific. Note that we use k-fold cross-validation to obtain the average performance, not the average optimal threshold.

For the evidence grounding step, a learning rate of 5e-5 was adopted, and different numbers of epochs were explored. We report the choice with the highest performance in both accuracy and F1-Macro. Each trial was run on an Nvidia GTX 1080 Ti, taking 39 seconds per epoch for Covid-scientific and 113 seconds per epoch for Covid-politifact.

6 Experimental Results

6.1 Performance of LM Debunker

From Table 4, we can observe that our unsupervised LM Debunker shows notable strength in the out-of-distribution setting (highlighted with blue) over other supervised baselines. For the Covid19-scientific test set, it achieved state-of-the-art results across all metrics, improving on the strongest baseline by 10.6 percentage points in accuracy (75.4% vs. 64.8%) and by 0.3 percentage points in F1-Binary (83.1% vs. 82.8%). Considering the severe consequences Covid19-scientific myths could potentially bring, this result is valuable.

For the Covid19-politifact test set, our LM debunker also outperformed Fever-HexaF model and LiarPlus with a significant margin, but it underperformed the LiarOurs model and the LiarPlusMeta model. Nonetheless, it is still encouraging considering the fact that these two models were trained with task-specific supervision on Politifact dataset (LIAR-Politifact), which is similar to the Covid19-politifact test set.

The results of LiarPlus and LiarPlusMeta clearly show the incongruity of the meta-based approach for cases in which meta-information is not guaranteed. LiarPlusMeta struggled to debunk the claims from the Covid19-scientific test set in contrast to achieving SOTA in Covid19-politifact test set. This is because the absence of meta-information for Covid19-scientific test set hindered LiarPlusMeta from performing to its maximum capacity. Going further, the fact that LiarPlusMeta performed even worse than LiarPlus emphasizes the difficulty faced by meta-based models in the absence of meta-information.

Figure 2: Trend of highest F1-Macro score over different numbers of epochs for evidence grounding

The FEVER dataset and many FEVER-trained systems successfully showcased the advancement of systematic fact-checking. Nevertheless, the Fever-HexaF model exhibited rather low performance on the COVID-19 test sets (10.6 and 27.8 percentage points behind LM Debunker in accuracy on Covid19-scientific and Covid19-politifact, respectively, per Table 4). One possible explanation is how the FEVER data was constructed. FEVER claims were generated by altering sentences from Wikipedia (e.g., “Hearts is a musical composition by Minogue”, label: SUPPORTS). As a result, FEVER claims differ in nature from the naturally occurring myths and false claims flooding the internet. This implies that FEVER might not be the most suitable dataset for training non-Wikipedia-based fact-checkers.

6.2 Data Efficiency Comparison

In Table 5, we report the performance of LiarOurs and LiarPlusMeta classifiers trained on randomly sampled train sets of differing sizes {10, 100, 500, 1000, 10000}.

As shown by the gray highlights in Table 5, both classifiers overtake our debunker in F1-Macro score with 500 labeled training examples, but they require 10000 to outperform it on the rest of the evaluation metrics. Considering the scarcity of labeled misinformation data for newly emerged events, a data-efficient debunker is extremely meaningful.
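The crossover points can be read off Table 5 mechanically. The short sketch below hard-codes the Table 5 numbers and, for each classifier and metric, finds the smallest train size whose score exceeds the LM Debunker score; the variable and function names are illustrative.

```python
METRICS = ("accuracy", "f1_macro", "f1_binary")
LM_DEBUNKER = (74.4, 58.8, 84.2)

# Scores from Table 5, keyed by train size: (accuracy, F1-Macro, F1-Binary).
TABLE5 = {
    "LiarPlusMeta": {
        10: (22.6, 18.5, 0.0), 100: (22.6, 18.5, 0.0), 500: (73.5, 65.2, 82.2),
        1000: (72.4, 63.4, 81.5), 10000: (80.3, 66.8, 86.5),
    },
    "LiarOurs": {
        10: (22.6, 18.5, 0.0), 100: (22.6, 18.5, 0.0), 500: (64.4, 59.4, 73.6),
        1000: (70.9, 64.2, 79.7), 10000: (78.5, 67.8, 86.4),
    },
}

def first_size_surpassing(rows, metric_index, baseline):
    """Smallest train size whose score exceeds the baseline, or None."""
    for size in sorted(rows):
        if rows[size][metric_index] > baseline:
            return size
    return None

crossovers = {
    model: {
        metric: first_size_surpassing(rows, i, LM_DEBUNKER[i])
        for i, metric in enumerate(METRICS)
    }
    for model, rows in TABLE5.items()
}
print(crossovers)
```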

7 Analysis and Discussion

7.1 LM Debunker Behavior Analysis

Number of Epoch for Evidence Grounding

The relationship between the number of epochs in the evidence grounding step and the debunking performance is explored. The best performance is obtained with epoch=5, as shown in Figure 2. We believe this is because a low number of epochs does not allow enough updates to sufficiently encode the content of the evidence into the language model. In contrast, a higher number of epochs over-fits the language model to the given evidence and harms its generalizability.

Threshold Perplexity Selection

As mentioned above, the threshold is controllable to reflect the desired “strictness” of the debunking system. Figure 3 shows that decreasing the threshold helps to reduce false negative (FN) errors, which are the most dangerous form of error. Such controllability over strictness would be beneficial to real-world applications, where the level of “strictness” matters greatly depending on the purpose of the application.

Meanwhile, FN reduction comes with a trade-off of increased false positive errors (FP). For a more balanced debunker, an alternative threshold choice could be the intersection point of FN and FP frequencies.
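The trade-off can be illustrated by counting both error types at several thresholds. In this sketch a false negative is a False claim that slips through as True, and a false positive is a True claim flagged as False. The first eight perplexities come from Table 1; the last two data points and the threshold grid are hypothetical, added so that the two classes overlap and the trade-off becomes visible.

```python
def error_counts(scored, threshold):
    """Return (false_negatives, false_positives) at a given threshold.

    scored: list of (perplexity, is_false) pairs.
    A claim is predicted False when its perplexity exceeds the threshold.
    """
    fn = sum(is_false and ppl <= threshold for ppl, is_false in scored)
    fp = sum((not is_false) and ppl > threshold for ppl, is_false in scored)
    return fn, fp

scored = [
    (556.2, True), (385.0, True), (178.2, True), (146.2, True),   # Table 1, False claims
    (5.8, False), (6.0, False), (8.1, False), (8.4, False),       # Table 1, True claims
    (40.0, True),    # hypothetical fluent-sounding false claim
    (90.0, False),   # hypothetical awkwardly worded true claim
]

for threshold in (10.0, 50.0, 100.0, 200.0):  # hypothetical grid
    print(threshold, error_counts(scored, threshold))
```

Lowering the threshold never increases the false negative count and never decreases the false positive count, which is the monotone behavior behind the "strictness" dial.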

Figure 3: Trend of false negative and false positive counts over varying threshold.

| | Covid19-scientific Acc. | Covid19-scientific F1-Macro | Covid19-politifact Acc. | Covid19-politifact F1-Macro |
|---|---|---|---|---|
| Before | 74.6% | 56.3% | 75.0% | 50.6% |
| After | 75.4% | 69.8% | 74.4% | 58.9% |

Table 6: Performance comparison between the “before” and “after” filtering steps in Evidence Selector

7.2 Evidence Analysis

Our ablation study of the evidence filtering and cleaning steps (Table 6) shows that improved evidence quality brings big gains in F1-Macro scores (+13.5 percentage points on Covid19-scientific and +8.3 on Covid19-politifact) with only a 0.6-point loss in accuracy, on Covid19-politifact.

Moreover, comparing the performance of LM Debunker on the two test sets, Covid19-scientific scores surpass Covid19-politifact scores, especially in F1-Macro (69.8% vs. 58.8% in Table 4). This is due to the disparate natures of the gold source documents used in evidence selection; the Covid19-scientific claims obtain evidence from scholarly articles, whereas Covid19-politifact claims extract evidence from news articles and other unverified internet sources. Consequently, the extracted evidence differs in quality. An important insight, therefore, is that evidence quality is crucial to our approach, and further improvement in evidence quality should bring additional performance gains.

7.3 Error analysis and Future Work

We identified areas for improvement in future work through qualitative analysis of wrongly-predicted samples from the LM debunker. First, since perplexity originally serves as a measure of sentence likelihood, when a true claim has an abnormal sentence structure, our LM debunker makes a mistake by assigning high perplexity. For example, a true claim “So Oscar Helth, the company tapped by Trump to profit from covid tests, is a Kushner company. Imagine that, profits over national safety” has extremely high perplexity. One interesting future direction would be to explore a way of disentangling “perplexity as a sentence quality measure” from “perplexity as a falseness indicator”.

Second, our LM debunker makes mistakes when the selected evidence refutes a False claim by simply negating a paraphrase of the claim. For instance, for the false claim “Taking ibuprofen worsens symptoms of COVID-19,” the most relevant evidence from the scholarly articles is “there is no current evidence indicating that ibuprofen worsens the clinical course of COVID-19.” Because the evidence shares most of its wording with the claim, the claim's perplexity stays low. Another future direction would be to learn a better way of assigning additional weight or emphasis to special linguistic features such as negation.

8 Related Work

8.1 Misinformation

Previous approaches Long et al. (2017); Karimi et al. (2018); Kirilin and Strube (2018); Shu et al. (2018); Monti et al. (2019) show that using meta-information (e.g., credibility score of the speaker) with text input helps improve the performance of misinformation detection. Considering that the availability of meta-information is not always guaranteed, building a model independent from it is crucial to detect misinformation. There exist works with fact-based approaches, using evidence from external sources for assessing the truthfulness of information Etzioni et al. (2008); Wu et al. (2014); Ciampaglia et al. (2015); Popat et al. (2018); Alhindi et al. (2018); Baly et al. (2018); Lee et al. (2018); Hanselowski et al. (2018). These approaches are based on the logic of “the information is correct if evidence from credible sources or a group of online sources is supporting it.” Furthermore, some works focus on reasoning and evidence selecting ability by restricting the scope of facts to those from Wikipedia Thorne et al. (2018); Nie et al. (2019); Yoneda et al. (2018b).

8.2 Language Model Applications

Large pre-trained language models have led to significant advancements in a wide variety of NLP tasks, including question-answering, commonsense reasoning, and semantic relatedness Devlin et al. (2018); Radford et al. (2019); Peters et al. (2018); Radford et al. (2018). These models are typically trained on documents mined from Wikipedia (among other websites). Recently, a number of works have found that LMs store a surprising amount of world knowledge, focusing particularly on the task of open-domain question answering Petroni et al. (2019); Roberts et al. (2020). Going further, Guu et al.; Roberts et al. show that task-specific fine-tuning of LMs can achieve impressive results, demonstrating the power of LMs. In this paper, we explore whether large pre-trained LMs can also be helpful in the field of debunking.

9 Conclusion

In this paper, we show that misinformation has high perplexity from the language model primed with relevant evidence. By proposing the new application of perplexity, we build an unsupervised debunker that shows promising results, especially in the absence of labeled data. Moreover, we emphasize the importance of evidence quality in our methodology by showing the improvement in the final performance with the addition of a filtering step in the evidence selection. We are also releasing two new COVID-19 related test sets publicly to promote transparency and prevent the spread of misinformation. Based on this successful leverage of language model perplexity for debunking, we hope to foster more research in this new direction.


We would like to thank Madian Khabsa for the helpful discussion and inspiration.


  • T. Alhindi, S. Petridis, and S. Muresan (2018) Where is your evidence: improving fact-checking by justification modeling. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pp. 85–90. Cited by: §1, 1st item, §8.1.
  • R. Baly, M. Mohtarami, J. Glass, L. Màrquez, A. Moschitti, and P. Nakov (2018) Integrating stance detection and fact checking in a unified corpus. arXiv preprint arXiv:1804.08012. Cited by: §8.1.
  • Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin (2003) A neural probabilistic language model.

    Journal of machine learning research

    3 (Feb), pp. 1137–1155.
    Cited by: §2.
  • G. L. Ciampaglia, P. Shiralkar, L. M. Rocha, J. Bollen, F. Menczer, and A. Flammini (2015) Computational fact checking from knowledge networks. PLoS ONE 10 (6), pp. e0128193. Cited by: §1, §1, §8.1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1, §5.1, §8.2.
  • O. Etzioni, M. Banko, S. Soderland, and D. S. Weld (2008) Open information extraction from the web. Commun. ACM 51 (12), pp. 68–74. ISSN 0001-0782. Cited by: §1, §1, §8.1.
  • K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. Chang (2020) REALM: retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909. Cited by: §8.2.
  • A. Hanselowski, H. Zhang, Z. Li, D. Sorokin, B. Schiller, C. Schulz, and I. Gurevych (2018) UKP-Athene: multi-sentence textual entailment for claim verification. arXiv preprint arXiv:1809.01479. Cited by: §8.1.
  • H. Karimi, P. Roy, S. Saba-Sadiya, and J. Tang (2018) Multi-source multi-class fake news detection. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 1546–1557. Cited by: §1, §8.1.
  • A. Kirilin and M. Strube (2018) Exploiting a speaker’s credibility to detect fake news. In Proceedings of Data Science, Journalism & Media workshop at KDD (DSJM’18). Cited by: §1, §8.1.
  • N. Lee, C. Wu, and P. Fung (2018) Improving large-scale fact-checking using decomposable attention models and lexical tagging. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1133–1138. Cited by: §8.1.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2019) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461. Cited by: §1.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §1.
  • Y. Long, Q. Lu, R. Xiang, M. Li, and C. Huang (2017) Fake news detection through multi-perspective speaker profiles. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp. 252–256. Cited by: §1, §8.1.
  • A. Magdy and N. Wanas (2010) Web-based statistical fact checking of textual documents. In Proceedings of the 2nd international workshop on Search and mining user-generated contents, pp. 103–110. Cited by: §1.
  • C. D. Manning and H. Schütze (1999) Foundations of statistical natural language processing. MIT press. Cited by: §2.
  • F. Monti, F. Frasca, D. Eynard, D. Mannion, and M. M. Bronstein (2019) Fake news detection on social media using geometric deep learning. arXiv preprint arXiv:1902.06673. Cited by: §8.1.
  • Y. Nie, H. Chen, and M. Bansal (2019) Combining fact extraction and verification with neural semantic matching networks. In Association for the Advancement of Artificial Intelligence (AAAI). Cited by: §1, §8.1.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365. Cited by: §8.2.
  • F. Petroni, T. Rocktäschel, P. Lewis, A. Bakhtin, Y. Wu, A. H. Miller, and S. Riedel (2019) Language models as knowledge bases? arXiv preprint arXiv:1909.01066. Cited by: §1, §8.2.
  • K. Popat, S. Mukherjee, A. Yates, and G. Weikum (2018) DeClarE: debunking fake news and false claims using evidence-aware deep learning. arXiv preprint arXiv:1809.06416. Cited by: §1, §8.1.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf. Cited by: §1, §8.2.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8), pp. 9. Cited by: §1, §8.2.
  • A. Roberts, C. Raffel, and N. Shazeer (2020) How much knowledge can you pack into the parameters of a language model? arXiv preprint arXiv:2002.08910. Cited by: §8.2.
  • K. Shu, S. Wang, and H. Liu (2018) Understanding user profiles on social media for fake news detection. In 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), pp. 430–435. Cited by: §8.1.
  • J. Thorne, A. Vlachos, C. Christodoulopoulos, and A. Mittal (2018) FEVER: a large-scale dataset for fact extraction and verification. arXiv preprint arXiv:1803.05355. Cited by: §1, §5.1, §8.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §2.
  • N. Vigdor (2020) Man fatally poisons himself while self-medicating for coronavirus, doctor says. Cited by: §1.
  • W. Y. Wang (2017) “Liar, liar pants on fire”: a new benchmark dataset for fake news detection. arXiv preprint arXiv:1705.00648. Cited by: §5.1.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew (2019) HuggingFace’s transformers: state-of-the-art natural language processing. arXiv preprint arXiv:1910.03771. Cited by: §5.2.
  • Y. Wu, P. K. Agarwal, C. Li, J. Yang, and C. Yu (2014) Toward computational fact-checking. Proceedings of the VLDB Endowment 7 (7), pp. 589–600. Cited by: §1, §1, §8.1.
  • T. Yoneda, J. Mitchell, J. Welbl, P. Stenetorp, and S. Riedel (2018a) UCL machine reading group: four factor framework for fact finding (HexaF). In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), Brussels, Belgium. Cited by: 1st item.
  • T. Yoneda, J. Mitchell, J. Welbl, P. Stenetorp, and S. Riedel (2018b) UCL machine reading group: four factor framework for fact finding (HexaF). In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), Brussels, Belgium, pp. 97–102. Cited by: §1, §8.1.