The recent COVID-19 pandemic broke down geographical boundaries and led to an infodemic of fake news and conspiracy theories. Evidence-based claim verification (English only) has been studied as a weapon against fake news and disinformation. However, conspiracy theories and disinformation can propagate from one language to another, and polyglotism is not uncommon: according to a 2017 Pew Research study, most European students learn English in school (https://www.pewresearch.org/fact-tank/2020/04/09/most-european-students-learn-english-in-school/). Furthermore, recent machine translation advances are increasingly bringing down language barriers [20, 15]. Disinformation can be defined as intentionally misleading information. A multilingual approach to evidence retrieval for claim verification aims at combating global disinformation during globally significant events. Often called the "good cop" of the Internet, Wikipedia has become a source of ground truth, as seen in the recent literature on evidence-based claim verification. There are more than 6 million English Wikipedia articles (https://meta.wikimedia.org/wiki/List_of_Wikipedias), but resources are lower in other language editions, such as Romanian (400K), which motivates retrieving evidence across languages.
As a case study, in Fig. 1 we evaluate a claim about Ion Mihai Pacepa, a former agent of the Romanian secret police during communism and the author of books on disinformation [24, 23]. Related conspiracy theories can be found on internet platforms. For example, a Romanian online publication claimed that he was deceased.
Twitter posts in multiple languages, with strong for and against language, exist, such as (English and Portuguese) https://twitter.com/MsAmericanPie_/status/1287969874036379649 or (English and Polish) https://twitter.com/hashtag/Pacepa. Strong-language claim examples are "We were tricked by Pacepa" (against) vs. "Red Horizons is one of the best political books of the 20th century" (for). Strong language has been associated with propaganda and fake news. In the following sections we review the relevant literature, present our methodology and experimental results, and conclude with final notes.
2 Related Work
The literature review touches on three topics: online disinformation, multilingual NLP, and evidence-based claim verification. Online Disinformation. Previous disinformation studies focused on election-related activity on social media platforms like Twitter, botnet-generated hyperpartisan news, and the 2016 US presidential election [5, 3, 4, 13]. To combat online disinformation via claim verification, one must retrieve reliable evidence at scale, since fake news tends to be more viral and to spread faster.
Multilingual Natural Language Processing Advances. Recent multilingual applications leverage pre-training of massive language models that can be fine-tuned for multiple tasks. For example, the cased multilingual BERT (mBERT) (https://github.com/google-research/bert/blob/master/multilingual.md) is pre-trained on a corpus of the top 104 Wikipedia languages (https://huggingface.co/bert-base-multilingual-cased). It has 12 layers, 768 hidden units, 12 attention heads, and 110M parameters. Cross-lingual transfer learning has been evaluated for tasks such as natural language inference, document classification, question answering, and fake Indic-language tweet detection. English-Only Evidence Retrieval and Claim Verification. Fact-based claim verification is framed as a textual entailment task over retrieved evidence. An annotated dataset was shared and a shared task was set up to retrieve evidence from Wikipedia documents and predict claim verification status. Recently published SotA results rely on pre-trained BERT flavors or XLNet. DREAM, GEAR, and KGAT achieved SotA with graph-based architectures. Dense Passage Retrieval is used in RAG in an end-to-end approach for claim verification.
The system depicted in Fig. 2 is a pipeline with a multilingual evidence retrieval component and a multilingual claim verification component. Given an input claim in one language, the system retrieves evidence from Wikipedia editions in possibly different languages and supports, refutes, or abstains on (not enough info) the claim. We employ English and Romanian as sample languages.
Multilingual Document Retrieval. To retrieve the top Wikipedia documents for each language, we employ an ad-hoc entity linking system based on named entity recognition, similar to prior work. Entities are parsed from the (English) claim using the AllenNLP constituency parser. We search for the entities and retrieve 7 English and 1 Romanian Wikipedia pages using the MediaWiki API (https://www.mediawiki.org/wiki/API:Main_page); this relies on the internationally recognized nature of the claim entities (144.9K out of 145.5K training claims have Romanian Wikipedia search results). Multilingual Sentence Selection. All sentences from each retrieved document are supplied as input to the sentence selection model. For Romanian sentences we removed diacritics. We prepend evidence sentences with the page title to compensate for missed co-reference pronouns [31, 37]. We frame multilingual sentence selection as a two-way classification task [14, 26]. Our architecture includes an mBERT encoder (https://github.com/huggingface/transformers) and an MLP classification layer with softmax output. During training, all parameters are fine-tuned and the MLP weights are trained from scratch. One input example is a pair of one evidence sentence and the claim [39, 37]. The encoding of the first token is supplied to the MLP classification layer, and the model estimates the probability that the candidate sentence is evidence for the claim.
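As a concrete illustration of the diacritics-removal preprocessing applied to Romanian sentences, here is a minimal sketch (the function name is ours) using Unicode decomposition:

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove diacritics by decomposing characters (NFD) and
    dropping the combining marks, e.g. 'ș' -> 's' + mark -> 's'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("Informația științifică"))  # -> Informatia stiintifica
```

This maps Romanian letters such as ă, â, î, ș, ț onto their ASCII base letters while leaving plain text untouched.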
We include only the verifiable claims in training. The annotated evidence sentences form positive examples, and we randomly sample 32 negative example sentences from the retrieved documents. We train two flavors of the fine-tuned model: EnmBERT selects only English negative sentences, while EnRomBERT selects English (5) and Romanian (27) negative sentences. Claims are in English. We optimize the cross-entropy loss:
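The loss expression itself is missing from the extracted text; a plausible reconstruction, assuming the standard binary cross-entropy over claim-sentence pairs (with $y_i = 1$ for annotated evidence and $y_i = 0$ for sampled negatives), is:

```latex
\mathcal{L} = -\sum_{i=1}^{N} \Big[\, y_i \log \hat{p}_i + (1 - y_i) \log \big(1 - \hat{p}_i\big) \Big],
\qquad \hat{p}_i = p_{\theta}\big(\text{evidence} \mid c, s_i\big)
```

where $c$ is the claim, $s_i$ a candidate sentence, and $\hat{p}_i$ the softmax output of the MLP head.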
Multilingual Claim Verification. The claim verification step takes as input the top 5 sentence-claim pairs ranked by the sentence selection model (pointwise ranking). The architecture includes an EnmBERT or EnRomBERT encoder and an MLP. We fine-tune the natural language inference model in a three-way classification task. A prediction is made for each of the 5 pairs, and the predictions are aggregated based on logic rules. In training, for both models we use the Adam optimizer, a batch size of 32, a learning rate of , cross-entropy loss, and 1 and 2 epochs of training, respectively.
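The exact "logic rules" for aggregation follow the cited prior work; a common instantiation in FEVER-style systems, shown here as a hedged sketch (this is one plausible reading, not necessarily the paper's exact rule), is: any SUPPORTS prediction wins, otherwise any REFUTES wins, otherwise NOT ENOUGH INFO.

```python
def aggregate_labels(pair_labels):
    """Combine per-(evidence, claim) predictions into one verdict.

    Rule (a common choice in FEVER systems): a single supporting
    pair suffices to support; failing that, a single refuting pair
    suffices to refute; otherwise abstain.
    """
    if "SUPPORTS" in pair_labels:
        return "SUPPORTS"
    if "REFUTES" in pair_labels:
        return "REFUTES"
    return "NOT ENOUGH INFO"


print(aggregate_labels(["NOT ENOUGH INFO", "REFUTES", "NOT ENOUGH INFO"]))  # -> REFUTES
```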
Conceptual End-to-End Multilingual Retrieve-Verify System. The ad-hoc entity linking document retrieval step has limitations for non-English languages, multilingual annotation is expensive, and including retrieved Romanian sentences only as negative examples in the supervised sentence selection step introduces bias. We propose a novel end-to-end multilingual evidence retrieval and claim verification approach, similar to the English-only RAG, that automatically retrieves relevant evidence passages from a multilingual corpus for a claim in any supported language. In Fig. 2, the 2-step multilingual evidence retrieval is replaced with a multilingual version of dense passage retrieval (DPR)
with an mBERT backbone. The DPR-retrieved documents form a latent probability distribution. The claim verification model conditions on the claim and the latent retrieved documents to generate the label. The probabilistic model is
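The model expression is missing from the extracted text; following the RAG formulation it cites, a reconstruction that marginalizes the label over the latent retrieved passages is:

```latex
p(y \mid c) \;\approx \sum_{z \,\in\, \text{top-}k\left(p_{\eta}(\cdot \mid c)\right)} p_{\eta}(z \mid c)\; p_{\theta}(y \mid c, z)
```

where $c$ is the claim, $z$ a retrieved passage, $y$ the verification label, $p_{\eta}$ the DPR retriever, and $p_{\theta}$ the claim verification model.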
The multilingual retrieve-verify system is jointly trained, and the only supervision is at the claim-verification level. We leave this promising avenue for future experimental evaluation.
4 Experimental Results
There are no equivalent multilingual claim verification baselines, so we calibrate the model results by calculating the official FEVER score (https://github.com/sheffieldnlp/fever-scorer). To evaluate the zero-shot transfer learning ability of the trained models, we translate 10 supported and 10 refuted claims with 5 evidence sentences each and combine them into a mix-and-match development set of 400 examples. Calibration results on development and test sets. In Table 1 and Fig. 3 we compare EnmBERT and EnRomBERT label accuracy and evidence recall on a fair dev set, on the test set, and on the golden-forcing dev set. The golden-forcing dev set adds all golden evidence to the sentence selection input, effectively forcing perfect document retrieval recall. Note that any of the available English-only systems with a BERT backbone, such as KGAT and GEAR, can be given an mBERT backbone to lift the multilingual system performance. We come close to similar BERT-based English-only systems, though our training differs, so the gap is not directly attributable to the multilingual nature of our approach. Our evidence recall is likewise close to that of English-only KGAT and better than that of other published systems.
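For reference, the FEVER score credits a claim only when the predicted label is correct and, for verifiable claims, a complete golden evidence set has been retrieved. A simplified sketch of that scoring rule (ignoring details of the official scorer, such as the five-sentence evidence cap):

```python
def fever_score(predictions):
    """Simplified FEVER score over (pred_label, gold_label,
    pred_evidence, gold_evidence_sets) tuples: the label must match,
    and for verifiable claims some golden evidence set must be fully
    contained in the predicted evidence."""
    correct = 0
    for pred_label, gold_label, pred_ev, gold_sets in predictions:
        if pred_label != gold_label:
            continue  # wrong label: no credit regardless of evidence
        if gold_label == "NOT ENOUGH INFO" or any(
            set(gold_set) <= set(pred_ev) for gold_set in gold_sets
        ):
            correct += 1
    return correct / len(predictions)
```

This is why golden-forcing (injecting all golden evidence into sentence selection) isolates the verification component from document retrieval errors.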
To better understand strengths, weaknesses, and the impact of including Romanian evidence, we perform a per-class performance analysis and also calculate FEVER-2 (the score over only "SUPPORTS" and "REFUTES" claims). The SotA on FEVER-2 is likely the result reported for RAG without golden evidence (fair dev set), which our EnRomBERT model approaches. The inclusion of the Romanian sentences improves the FEVER-2 score (see Fig. 3), bringing us close to the English-only FEVER-2 SotA on the golden-forcing set.
On the SUPPORTS and REFUTES classes, EnRomBERT outperforms EnmBERT on both the fair and golden-forcing datasets. In EnRomBERT, the additional noise from including a second language likely improves generalization on the English-language claims. Both models struggle on the NEI class, which is not surprising since no NEI claims were included in the training set.
Transfer Learning Performance. Table 2 shows EnmBERT's and EnRomBERT's zero-shot transfer learning ability. We evaluate the two models on the mixed 400 examples (mixed column) and on the En-En, En-Ro (English evidence, Romanian claims), Ro-En, and Ro-Ro scenarios, directly evaluating the claim verification step. It is interesting to see the differences in cross-lingual transfer learning ability across the Ro-En, En-Ro, and Ro-Ro scenarios. EnmBERT's label accuracy on Ro-Ro, relative to its En-En accuracy, is better than EnRomBERT's, and the pattern is similar for Ro-En and En-Ro. It is not surprising that EnmBERT outperforms EnRomBERT: EnRomBERT learned that Romanian evidence sentences are NEI (they were included as negative examples in sentence selection training), which biased it against Romanian evidence.
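The 400-example figure is consistent with 20 translated claims x 5 evidence sentences x 2 claim languages x 2 evidence languages. A sketch of this mix-and-match construction (the data layout and field names are our assumptions, not the paper's released format):

```python
def build_mix_and_match(claims):
    """Cross every claim translation with every evidence translation.

    Each claim dict carries 'text' and 'evidence' keyed by language
    ('en'/'ro'), plus a gold 'label'. 20 claims with 5 evidence
    sentences each yield 20 * 5 * 2 * 2 = 400 examples.
    """
    examples = []
    for claim in claims:
        for claim_lang in ("en", "ro"):
            for ev_lang in ("en", "ro"):
                for sentence in claim["evidence"][ev_lang]:
                    examples.append({
                        "claim": claim["text"][claim_lang],
                        "evidence": sentence,
                        "scenario": f"{claim_lang.capitalize()}-{ev_lang.capitalize()}",
                        "label": claim["label"],
                    })
    return examples
```

Each of the four scenarios (En-En, En-Ro, Ro-En, Ro-Ro) then contributes 100 claim-evidence pairs.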
Disinformation Case Study. We now evaluate the EnRomBERT system on the case study in Fig. 1. We retrieve supporting evidence in English, Romanian, and Portuguese. The page titles and summaries are retrieved directly using the MediaWiki API (https://www.mediawiki.org/wiki/API:Main_page). The system will be exposed as a demo service, with limitations on the number of requests and latency. Based on the top predicted evidence (in 3 languages), the system predicts that the claim is supported.
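Such a title-and-summary lookup can be formed with the MediaWiki Action API's `extracts` property (provided by the TextExtracts extension, which Wikipedia deploys); a sketch, with the helper name being ours:

```python
def summary_query_params(title, lang="ro"):
    """Build the endpoint URL and query parameters for fetching a
    Wikipedia page's plain-text lead section via the MediaWiki API."""
    endpoint = f"https://{lang}.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "prop": "extracts",   # TextExtracts extension
        "exintro": 1,         # only the lead (summary) section
        "explaintext": 1,     # plain text instead of HTML
        "titles": title,
        "format": "json",
    }
    return endpoint, params
```

The pair can then be passed to an HTTP client, e.g. `requests.get(endpoint, params=params)`.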
5 Final Notes
We present a first approach to multilingual evidence retrieval and claim verification to combat global disinformation. We evaluate two systems, EnmBERT and EnRomBERT, and their cross-lingual transfer learning ability for claim verification. We make available a mixed English-Romanian dataset of translated claims and evidence for future multilingual research evaluation.
-  Andrei, A.: impact.ro Homepage (2020 (accessed October 28, 2020)), https://www.impact.ro/exclusiv-ce-se-intampla-acum-cu-ion-mihai-pacepa
-  Artetxe, M., Schwenk, H.: Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics 7, 597–610 (2019)
-  Bastos, M.T., Mercea, D.: The Brexit botnet and user-generated hyperpartisan news. Social Science Computer Review 37(1), 38–54 (2019)
-  Bessi, A., Ferrara, E.: Social bots distort the 2016 US presidential election online discussion. First Monday 21(11-7) (2016)
-  Brachten, F., Stieglitz, S., Hofeditz, L., Kloppenborg, K., Reimann, A.: Strategies and influence of social bots in a 2017 German state election-a case study on twitter. arXiv preprint arXiv:1710.07562 (2017)
-  Cao, Z., Qin, T., Liu, T.Y., Tsai, M.F., Li, H.: Learning to rank: from pairwise approach to listwise approach. In: Proceedings of the 24th international conference on Machine learning. pp. 129–136 (2007)
-  Clark, J.H., Choi, E., Collins, M., Garrette, D., Kwiatkowski, T., Nikolaev, V., Palomaki, J.: TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. arXiv preprint arXiv:2003.05002 (2020)
-  Cohen, N.: Conspiracy videos? fake news? enter Wikipedia, the 'good cop' of the Internet. The Washington Post (2018)
-  Conneau, A., Lample, G., Rinott, R., Williams, A., Bowman, S.R., Schwenk, H., Stoyanov, V.: Xnli: Evaluating cross-lingual sentence representations. arXiv preprint arXiv:1809.05053 (2018)
-  Cucerzan, S.: Large-scale named entity disambiguation based on Wikipedia data. In: Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL). pp. 708–716 (2007)
-  Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805 (2018), http://arxiv.org/abs/1810.04805
-  Gardner, M., Grus, J., Neumann, M., Tafjord, O., Dasigi, P., Liu, N., Peters, M., Schmitz, M., Zettlemoyer, L.: AllenNLP: A deep semantic natural language processing platform. arXiv preprint arXiv:1803.07640 (2018)
-  Grinberg, N., Joseph, K., Friedland, L., Swire-Thompson, B., Lazer, D.: Fake news on twitter during the 2016 us presidential election. Science 363(6425), 374–378 (2019)
-  Hanselowski, A., Zhang, H., Li, Z., Sorokin, D., Schiller, B., Schulz, C., Gurevych, I.: UKP-Athene: Multi-sentence textual entailment for claim verification. arXiv preprint arXiv:1809.01479 (2018)
-  Johnson, M., Schuster, M., Le, Q.V., Krikun, M., Wu, Y., Chen, Z., Thorat, N., Viégas, F., Wattenberg, M., Corrado, G., et al.: Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics 5, 339–351 (2017)
-  Kar, D., Bhardwaj, M., Samanta, S., Azad, A.P.: No rumours please! a multi-Indic-lingual approach for COVID fake-tweet detection. arXiv preprint arXiv:2010.06906 (2020)
-  Karpukhin, V., Oğuz, B., Min, S., Wu, L., Edunov, S., Chen, D., Yih, W.t.: Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906 (2020)
-  Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
-  Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.t., Rocktäschel, T., et al.: Retrieval-augmented generation for knowledge-intensive NLP tasks. arXiv preprint arXiv:2005.11401 (2020)
-  Liu, Y., Gu, J., Goyal, N., Li, X., Edunov, S., Ghazvininejad, M., Lewis, M., Zettlemoyer, L.: Multilingual denoising pre-training for neural machine translation. arXiv preprint arXiv:2001.08210 (2020)
-  Liu, Z., Xiong, C., Sun, M., Liu, Z.: Fine-grained fact verification with kernel graph attention network. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. pp. 7342–7351 (2020)
-  Malon, C.: Team Papelo: Transformer networks at FEVER. arXiv preprint arXiv:1901.02534 (2019)
-  Pacepa, I.M.: Red Horizons: Chronicles of a Communist Spy Chief. Gateway Books (1987)
-  Pacepa, I.M., Rychlak, R.J.: Disinformation: Former Spy Chief Reveals Secret Strategy for Undermining Freedom, Attacking Religion, and Promoting Terrorism. Wnd Books (2013)
-  Saez-Trumper, D.: Online disinformation and the role of wikipedia. arXiv preprint arXiv:1910.12596 (2019)
-  Sakata, W., Shibata, T., Tanaka, R., Kurohashi, S.: Faq retrieval using query-question similarity and bert-based query-answer relevance. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 1113–1116 (2019)
-  Schroepfer, M.: Creating a data set and a challenge for deepfakes. Facebook Artificial Intelligence (2019)
-  Schwenk, H., Li, X.: A corpus for multilingual document classification in eight languages. arXiv preprint arXiv:1805.09821 (2018)
-  Sennrich, R., Haddow, B., Birch, A.: Edinburgh neural machine translation systems for wmt 16. arXiv preprint arXiv:1606.02891 (2016)
-  Silverman, C.: This Analysis Shows How Viral Fake Election News Stories Outperformed Real News On Facebook (2016 (accessed October 28, 2020)), https://www.buzzfeednews.com/article/craigsilverman/viral-fake-election-news-outperformed-real-news-on-facebook
-  Soleimani, A., Monz, C., Worring, M.: BERT for evidence retrieval and claim verification. In: European Conference on Information Retrieval. pp. 359–366. Springer (2020)
-  Thorne, J., Vlachos, A.: Avoiding catastrophic forgetting in mitigating model biases in sentence-pair classification with elastic weight consolidation. arXiv preprint arXiv:2004.14366 (2020)
-  Thorne, J., Vlachos, A., Christodoulopoulos, C., Mittal, A.: FEVER: a large-scale dataset for fact extraction and verification. arXiv preprint arXiv:1803.05355 (2018)
-  Thorne, J., Vlachos, A., Cocarascu, O., Christodoulopoulos, C., Mittal, A.: The fact extraction and verification (FEVER) shared task. arXiv preprint arXiv:1811.10971 (2018)
-  Vosoughi, S., Roy, D., Aral, S.: The spread of true and false news online. Science 359(6380), 1146–1151 (2018)
-  Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.R., Le, Q.V.: XLNet: Generalized autoregressive pretraining for language understanding. In: Advances in neural information processing systems. pp. 5753–5763 (2019)
-  Yoneda, T., Mitchell, J., Welbl, J., Stenetorp, P., Riedel, S.: Ucl machine reading group: Four factor framework for fact finding (hexaf). In: Proceedings of the First Workshop on Fact Extraction and VERification (FEVER). pp. 97–102 (2018)
-  Zhong, W., Xu, J., Tang, D., Xu, Z., Duan, N., Zhou, M., Wang, J., Yin, J.: Reasoning over semantic-level graph for fact checking. arXiv preprint arXiv:1909.03745 (2019)
-  Zhou, J., Han, X., Yang, C., Liu, Z., Wang, L., Li, C., Sun, M.: Gear: Graph-based evidence aggregating and reasoning for fact verification. arXiv preprint arXiv:1908.01843 (2019)
-  Zhou, X., Mulay, A., Ferrara, E., Zafarani, R.: ReCOVery: A multimodal repository for COVID-19 news credibility research. arXiv preprint arXiv:2006.05557 (2020)
-  Zhou, X., Zafarani, R.: A survey of fake news: Fundamental theories, detection methods, and opportunities. ACM Computing Surveys (CSUR) 53(5), 1–40 (2020)