The Internet has become an extension of our physical world, in which almost everyone is connected. As a result, one small piece of false news can spread worldwide and ruin the economy or reputation of a person, a company, or even a country. To mitigate the impact of fake news, recognizing and flagging it at scale is necessary, given the amount of news released.
In the research community, many researchers have devoted efforts to studying automatic fake news detection. Among these efforts, automated fact-checking has attracted a great deal of attention [18, 11]. The task is to check whether a claim is factually correct based on evidence retrieved from reliable sources. However, according to a recent survey, human fact-checkers generally do not trust automated solutions. Some works have tried to bridge the gap between humans and machines. For example, Yang et al. proposed summarizing claims for more scalable fact-checking and involved a human in the loop to evaluate the summarization results.
Another popular line of work to increase trust in fact-checking is to generate explanations for the predicted results. Atanasova et al. pioneered explanation generation, jointly optimizing veracity prediction and explanation extraction from evidence. Subsequently, Kotonya et al. proposed a joint extractive and abstractive text summarization method for explanation generation. The authors also published a survey specifically about generating fact-checking explanations.
However, although generating explanations can provide more precise evidence for understanding fact-checking decisions, existing systems lack a way to evaluate the explanations properly. This is especially true for explanations based on abstractive document summarization, where researchers have shown that models suffer from hallucination [3, 12], generating summaries that are factually inconsistent with the original document. Several works have been proposed to deal with this issue [10, 4, 19]. In particular, Pagnoni et al. summarized the different types of errors that models make and the metrics used to evaluate them. Among these evaluation metrics, leveraging question answering (QA) as a proxy has been the focus of some work [4, 19]. The idea is to rely on a question answering mechanism to evaluate the faithfulness of summaries. Wang et al. extracted answers and questions from summaries and fine-tuned a QA model to generate answers from the documents; the answers obtained from the document and from its summary for the same questions are compared to determine the factual consistency of the summary. Recently, Nan et al. proposed an improvement over this method: instead of generating answers for both the summary and the document, they model the likelihood of the summary and the document conditioned on question-answer pairs generated from the summaries. This makes the likelihood metric suitable as a training objective for improving the factual consistency of summaries.
A few works have been proposed to leverage QA to help in fact-checking. For example, in PathQG, Wang et al. generated questions from facts. They accomplished this task in two steps: first, they identified facts from an input text to build a knowledge graph (KG) and generated an ordered sequence as a query path; second, they used a seq2seq model to learn to generate questions based on the query path. Human evaluation showed that their model could generate informative questions. In another work, Fan et al. generated question-answer pairs as one type of brief, along with passage briefs and entity briefs, and provided them to human fact-checkers to improve their checking efficiency.
Inspired by QA-based work on checking the factual consistency between documents and their summaries, we believe QA is also suitable for the fact-checking task, where we assess whether claims are factually consistent with retrieved evidence. Therefore, we propose to leverage automated QA protocols and integrate them into the traditional fact-checking pipeline. As a result, we can provide explainable fact-checking results through question answering. The answer comparison model predicts a label and pinpoints the wrong part of a claim by showing which questions are most important for the decision. In this way, human fact-checkers can easily interpret the results and correct them if necessary. Our work differs from prior works [20, 6] because we not only generate question-answer pairs but also fully integrate QA protocols into the fact-checking pipeline to automatically compare answers and predict their labels. We compare the proposed method with several baselines, achieving near state-of-the-art results while adding the critical feature of explainability to the fact-checking process.
We summarize our contributions as follows.
- We propose a novel pipeline for using question answering as a proxy for explainable fact-checking;
- We introduce an answer comparison model with an attention mechanism on questions to learn their importance to the claims.
2 Proposed Methodology
We introduce question answering (QA) into the fact-checking process. Although QA has previously been mentioned in connection with fact-checking, no previous work has explored integrating QA protocols into the fact-checking pipeline. Our proposed solution is described as follows:
1) Given a claim $c$, generate multiple questions $q_1, \dots, q_n$ and answers $a^c_1, \dots, a^c_n$ from it;
2) Retrieve and re-rank evidence based on the claim (and possibly the questions);
3) For each question generated in 1), ask the retrieved evidence for answers $a^e_1, \dots, a^e_n$, respectively;
4) Compare the answer pairs and transform the result into a label of SUPPORTS or REFUTES.
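The four steps above can be sketched as a small orchestration function. This is only an illustrative sketch: the function `fact_check` and its callable parameters (`generate_qa`, `retrieve`, `answer`, `compare`) are hypothetical placeholders for the models described later, not the authors' implementation.

```python
from typing import Callable, List, Tuple

# All names here are hypothetical placeholders, not the authors' code.
QAPair = Tuple[str, str]  # (question, answer taken from the claim)


def fact_check(
    claim: str,
    generate_qa: Callable[[str], List[QAPair]],
    retrieve: Callable[[str, List[str]], str],
    answer: Callable[[str, str], str],
    compare: Callable[[str, List[str], List[Tuple[str, str]]], str],
) -> Tuple[str, List[Tuple[str, str, str]]]:
    """Run the four pipeline steps; return the label and a per-question trace."""
    # Step 1: questions and claim-side answers generated from the claim.
    qa_pairs = generate_qa(claim)
    questions = [q for q, _ in qa_pairs]
    claim_answers = [a for _, a in qa_pairs]
    # Step 2: evidence for the claim (gold evidence can be plugged in here).
    evidence = retrieve(claim, questions)
    # Step 3: ask the evidence the same questions.
    evidence_answers = [answer(q, evidence) for q in questions]
    # Step 4: compare the answer pairs and map them to SUPPORTS / REFUTES.
    answer_pairs = list(zip(claim_answers, evidence_answers))
    label = compare(claim, questions, answer_pairs)
    # The trace exposes every intermediate result for inspection.
    trace = [(q, ac, ae) for q, (ac, ae) in zip(questions, answer_pairs)]
    return label, trace
```

Returning the trace alongside the label reflects the point made in the text: each intermediate step can be inspected separately.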
Our proposed pipeline leads to more explainability as we break down the fact-checking process into more steps, allowing a more fine-grained analysis of each part of the process (e.g., question generation, question answering, or answer comparison). In addition, through answer generation from claims and evidence, we vastly reduce the information (from claims and their evidence to only answer pairs) fed to the final classification model. Thus, the model learns from more direct and precise inputs.
To focus on how question answering empowers explainability, we use gold evidence instead of retrieved evidence: for step 2), we take the gold evidence directly and leave retrieval to future work. We therefore concentrate on steps 1), 3), and 4) of the pipeline, which we detail next.
2.1 Question and Answer Generation
To generate questions from a text, answers are usually extracted from the text first so that more relevant questions can be generated [4, 19]. Answers are usually extracted based on named entities and noun phrases; questions are then generated given the claim and the answers. Answers can also be generated in parallel with questions. We adopt this approach and generate questions and answers from claims simultaneously. In particular, we fine-tune the BART-large model to generate question-answer pairs $(q_1, a^c_1), \dots, (q_n, a^c_n)$ from a given claim $c$. Using beam search, 64 question-answer pairs are generated; pairs are then removed if the claim does not contain their answers. For the evidence answers $a^e_1, \dots, a^e_n$, we use a pre-trained extractive QA model to answer the questions previously generated from the claim. The model generates multiple answers, and we choose the one with the highest score (the most likely answer).
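The two post-processing rules in this step (dropping pairs whose answer is not in the claim, and keeping the highest-scoring evidence answer) can be sketched as follows. The function names are hypothetical, and the case-insensitive substring match is an assumption; the text does not specify how containment is tested.

```python
from typing import List, Tuple


def filter_qa_pairs(claim: str, qa_pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only (question, answer) pairs whose answer occurs in the claim.

    Assumption: containment is tested as a case-insensitive substring match.
    """
    claim_lower = claim.lower()
    kept = []
    for question, answer in qa_pairs:
        normalized = answer.strip().lower()
        if normalized and normalized in claim_lower:
            kept.append((question, answer))
    return kept


def best_answer(candidates: List[Tuple[str, float]]) -> str:
    """Return the candidate answer with the highest model score."""
    return max(candidates, key=lambda candidate: candidate[1])[0]
```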
2.2 Answer Pair Comparison
For answer comparison, the token-level F1 score is usually used to measure the similarity between answer pairs; however, it fails when the two answers share no words but are semantically similar. We propose to fine-tune a transformer model to learn answer comparison. Different questions serve different purposes and therefore vary in importance. To account for this, we add attention over the questions to learn their importance weights. The structure of the model is shown in Fig. 1.
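The weakness of token-level F1 mentioned above is easy to see in a small example. The sketch below uses a simplified tokenization (lowercasing and whitespace splitting); standard QA evaluation scripts typically also strip punctuation and articles, which is omitted here.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between two answers (simplified whitespace tokenization)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty answers match; one empty answer does not.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Partial lexical overlap gives a partial score.
print(token_f1("the United States", "United States"))
# Semantically equivalent answers with no shared tokens score zero.
print(token_f1("USA", "United States"))
```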
Specifically, we rely on a pre-trained masked language model to encode the claim $c$, the questions $q_1, \dots, q_n$, and the answer pairs $(a^c_1, a^e_1), \dots, (a^c_n, a^e_n)$. For the answer pairs, we add a SEP token between the two answers to the same question. We use a single encoder for all inputs, so the weights are shared. After encoding, we take the representation of the CLS token as the embedding of each input, thus transforming the claim, questions, and answer pairs into features $h_c$, $h_{q_1}, \dots, h_{q_n}$, and $h_{a_1}, \dots, h_{a_n}$, respectively, where $n$ is the number of questions for each claim. We then use the additive attention proposed by Bahdanau et al. to learn the importance of each question, treating the claim as the query, the questions as keys, and the answers as values. The details are formulated as follows.

$$ s_i = v^\top \tanh\left(W_c h_c + W_q h_{q_i}\right) \tag{1} $$

$$ \alpha_i = \frac{\exp(s_i)}{\sum_{j=1}^{n} \exp(s_j)} \tag{2} $$

$$ h = \sum_{i=1}^{n} \alpha_i h_{a_i} \tag{3} $$

where $s_i$ is the attention score between $h_c$ and $h_{q_i}$ ($i = 1, \dots, n$), and $W_c$, $W_q$, and $v$ are learnable parameters.
In Eq. (1) and (2), the attention weights are calculated and normalized by the softmax function. Eq. (3) then uses a weighted sum to combine all answer features into the final feature $h$. The feature $h$ is fed to a fully connected layer to produce the final prediction of SUPPORTS or REFUTES. Notice that all the information present in both claim and evidence is reduced to their respective answers. The claim and the generated questions take part in selecting the most relevant answers by assigning higher weights to them.
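As a rough illustration of the attention step, the sketch below implements standard additive (Bahdanau-style) attention in plain Python, with the claim feature as query, question features as keys, and answer-pair features as values. The parameter names `W_c`, `W_q`, and `v` and the exact parameterization are assumptions made for this sketch; the encoder and the final classification layer are omitted.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def additive_attention(
    h_c: Vector, h_q: List[Vector], h_a: List[Vector],
    W_c: Matrix, W_q: Matrix, v: Vector,
) -> Tuple[List[float], Vector]:
    """Return (attention weights over questions, pooled answer feature)."""
    proj_c = matvec(W_c, h_c)
    scores = []
    for h_qi in h_q:
        proj_q = matvec(W_q, h_qi)
        hidden = [math.tanh(a + b) for a, b in zip(proj_c, proj_q)]
        # Unnormalized score between the claim and question i.
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    # Softmax over questions (shifted by the max for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the answer-pair features.
    dim = len(h_a[0])
    pooled = [sum(w * h_ai[d] for w, h_ai in zip(weights, h_a)) for d in range(dim)]
    return weights, pooled
```

The returned weights are what would be visualized per question to explain a prediction.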
3 Experimental Setup
3.1 Dataset
We adopt the Fool-Me-Twice (FM2) dataset, which comprises 12,968 claims and their associated evidence. FM2 is a recently published dataset collected through a multi-player game. In the game, one player generates a claim and tries to fool the other players, who have to decide whether the claim is true or false, based on evidence retrieved by the game, before a timer runs out. The game’s setting makes this dataset challenging, as the players are motivated to generate claims that are hard to verify. FM2 is a more difficult and less biased dataset than the seminal FEVER dataset, in which a model can exploit specific words from the claim to achieve reasonable accuracy (79.1% for two classes). In contrast, FM2 is shown not to have such biases: a classifier based only on claims achieved a low prediction accuracy (61.9%).
3.2 Implementation details
For question-answer pair generation, we follow code released with prior work to fine-tune a BART-large model on the XSUM and CNNDM datasets (https://bit.ly/3iBZyqR). To generate answers from the evidence, we use the FARM framework from deepset (https://github.com/deepset-ai/FARM) with the question-answering model deepset/electra-base-squad2. For answer comparison, we use the microsoft/mpnet-base model to encode all inputs, as it has been shown to perform well in question answering tasks. As the question generation model does not output the same number of questions for every claim, we keep the first ten questions when a claim has more than ten; when it has fewer, we repeat the first question until there are ten. We choose this number because the average number of questions per claim is 11.5. The hyperparameters for training the answer comparison models are: number of epochs = 5, batch size = 32, learning rate = 2e-5 (the standard for fine-tuning a masked language model), and maximum token length = 32. To account for variance across runs, we run each experiment five times and report the mean and standard deviation. As the dataset is well balanced, we use macro-average accuracy as the evaluation metric.
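The rule for fixing the number of questions per claim at ten can be sketched as below. The helper name `fix_length` is hypothetical, and appending the repeated first question at the end of the list is an assumption; the text only says the first question is repeated.

```python
from typing import List, TypeVar

T = TypeVar("T")


def fix_length(items: List[T], target: int = 10) -> List[T]:
    """Truncate to the first `target` items, or pad by repeating the first item."""
    if not items:
        raise ValueError("at least one question is required")
    if len(items) >= target:
        return items[:target]
    # Assumption: the repeated copies of the first item are appended at the end.
    return items + [items[0]] * (target - len(items))


print(fix_length(["q1", "q2", "q3"], target=5))
```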
3.3 Baselines
We use the same questions and answers for all baselines and vary only the answer comparison method.
- Blackbox method: we compare our results with the originally proposed method. We refer to it as the black box method because it concatenates claim and evidence for the prediction without providing interpretability. We used the code provided by the authors (https://bit.ly/2ZO6CtR) and ran it five times to obtain an average result.
- QUALS score: an automatic metric for checking factual consistency. It does not generate answers for the evidence. Instead, it calculates the likelihood of the evidence given the question-answer pairs from the claim, which compromises explainability.
- Token-level F1 score: a standard metric for question answering tasks. It measures the word overlap between two answers.
- BERTscore: a common metric for measuring the similarity of two sentences. We use the default model, roberta-large.
- Cosine similarity: a metric also used for measuring sentence similarity. We use the sentence transformer all-mpnet-base-v2 to embed the answers and calculate the cosine similarity between the embeddings.
Only the black box method requires training. The others are metrics for evaluating the answer pairs. These metrics calculate a similarity score for each answer pair, except for QUALS, which outputs a single score for all answers of the same claim. As each claim has several questions, we compute the average score for the claim and apply a threshold to convert the score into a binary label.
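The conversion from per-pair similarity scores to a claim-level label can be sketched as follows. The function name is hypothetical, and treating scores at or above the threshold as SUPPORTS is an assumption about the direction and tie-breaking of the rule.

```python
from statistics import mean
from typing import List


def claim_label(pair_scores: List[float], threshold: float) -> str:
    """Average the answer-pair similarity scores of one claim and threshold them.

    Assumption: higher similarity means the evidence agrees with the claim,
    and scores equal to the threshold count as SUPPORTS.
    """
    return "SUPPORTS" if mean(pair_scores) >= threshold else "REFUTES"


print(claim_label([0.9, 0.1, 0.2], threshold=0.305))
print(claim_label([0.1, 0.2], threshold=0.305))
```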
4 Results and Analysis
4.1 Comparison with baselines
We show the results for the different baselines in Table 1. For the metric-based methods, we select the threshold with a binary search that maximizes accuracy on the development set.
|Methods||Dev Acc||Test Acc|
|Blackbox (No X-AI)||76.17 ± 1.23||74.58 ± 1.66|
|cosine sim (th=0.305)||61.16||62.75|
|Attention C-Q-AA (ours, X-AI)||75.44 ± 0.52||73.43 ± 0.83|
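Threshold selection on the development set for the metric-based baselines can be sketched as below. Note that this sketch uses an exhaustive sweep over the observed scores as a simple stand-in, not the binary search mentioned in the text; the function names and the encoding of labels (1 for SUPPORTS, 0 for REFUTES) are assumptions.

```python
from typing import Sequence, Tuple


def accuracy(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """Accuracy when predicting 1 (SUPPORTS) for scores at or above the threshold."""
    correct = sum(int(s >= threshold) == y for s, y in zip(scores, labels))
    return correct / len(labels)


def select_threshold(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Try every observed score as a threshold; return (best threshold, accuracy)."""
    best_threshold, best_accuracy = 0.0, -1.0
    for candidate in sorted(set(scores)):
        acc = accuracy(scores, labels, candidate)
        if acc > best_accuracy:
            best_threshold, best_accuracy = candidate, acc
    return best_threshold, best_accuracy


dev_scores = [0.1, 0.2, 0.6, 0.8]  # toy claim-level scores
dev_labels = [0, 0, 1, 1]          # 1 = SUPPORTS, 0 = REFUTES
print(select_threshold(dev_scores, dev_labels))
```

The selected threshold would then be applied unchanged to the test set.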
The results show that training an answer comparison model specifically for the fact-checking task improves accuracy compared with the methods that require no training. Our attention-based method achieves slightly lower accuracy than the black box method. However, our method is more suitable for real-world applications than the black box one because: 1) it substantially reduces the input needed for prediction while retaining almost the same accuracy, 2) it enables error analysis of fact-checking across several steps, and 3) it provides additional explainability by learning the importance of each question.
4.2 Attention visualization
To illustrate how attention helps explainability, we show an example of our generated questions with attention weights and their answers from claim and evidence in Fig. 2.
The question with the highest weight is shown in bold, and the one with the second-highest weight is underlined. Although some answers are incorrect and some answer pairs do not match, the model attends more to the questions and answers that are most relevant to the factuality of the claim, showing our approach’s potential. We also see that, because the claim is short, most questions are repetitive.
4.3 Ablation study
We carry out an ablation study to test whether our attention mechanism improves performance compared with simple classification. To this end, we remove the attention layer from our proposed model; the resulting network structure is shown in Fig. 3. Specifically, to use all available questions, we concatenate all questions and all answers, so the model has two inputs: [CLS] $c$ [SEP] $q_1 \dots q_n$ and [CLS] $a^c_1 \dots a^c_n$ [SEP] $a^e_1 \dots a^e_n$ (note that $n$ can differ between claims). As the inputs are concatenated, the maximum token length here is 128. The embedding model then transforms each input into a feature vector, and the two vectors are concatenated and fed into the classification layer.
To study the effect of different components of our proposed model, we design the inputs as follows:
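- C: only the claim, [CLS] $c$;
- Q: only the concatenated questions, [CLS] $q_1 \dots q_n$;
- AA: only the answer pairs, [CLS] $a^c_1 \dots a^c_n$ [SEP] $a^e_1 \dots a^e_n$;
- Q-AA: the concatenated questions, [CLS] $q_1 \dots q_n$, and the answer pairs, [CLS] $a^c_1 \dots a^c_n$ [SEP] $a^e_1 \dots a^e_n$;
- CQ-AA: our full model without attention (shown in Fig. 3);
- Attention C-Q-AA: our full model with attention.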
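The input variants of the ablation study can be sketched as plain string templates. The layout below (all claim-side answers, then a separator, then all evidence-side answers) is an assumption, as is the helper name `build_inputs`; in practice a tokenizer would normally insert the special tokens itself, and they are spelled out here only to mirror the description.

```python
from typing import Dict, List, Tuple


def build_inputs(
    claim: str, questions: List[str], answer_pairs: List[Tuple[str, str]]
) -> Dict[str, List[str]]:
    """Build the text inputs for each non-attention ablation variant."""
    all_questions = " ".join(questions)
    # Assumed layout: claim answers first, evidence answers after [SEP].
    claim_answers = " ".join(a_claim for a_claim, _ in answer_pairs)
    evidence_answers = " ".join(a_evid for _, a_evid in answer_pairs)

    c_input = f"[CLS] {claim}"
    q_input = f"[CLS] {all_questions}"
    aa_input = f"[CLS] {claim_answers} [SEP] {evidence_answers}"
    cq_input = f"[CLS] {claim} [SEP] {all_questions}"

    return {
        "C": [c_input],
        "Q": [q_input],
        "AA": [aa_input],
        "Q-AA": [q_input, aa_input],
        "CQ-AA": [cq_input, aa_input],
    }
```

Each variant with two inputs yields two feature vectors that are concatenated before the classification layer.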
|Inputs||Dev Acc||Test Acc|
With the ablation study, we want to know how much each input affects the model’s performance. In Table 2, we can see that with C or Q only, the model does not perform well, indicating that it cannot rely solely on the claim information to achieve high accuracy. Our result agrees with the original paper, in which the model using only claims achieved an accuracy of 61.9%. Also, when Q and CQ information is added to AA, Q-AA and CQ-AA perform only slightly better. This indicates that the model can learn most of the information from the answer pairs alone, which is reasonable because the answer pairs carry most of the critical information from both claims and evidence. Comparing CQ-AA with our attention-based C-Q-AA, we see that the attention mechanism helps increase performance because it uses the claim and questions to weight the most important answer pairs.
4.4 Discussion
Question Generation. Generating diverse and relevant questions that target the factuality of a claim is challenging. Claims can be altered by changing the subject, object, time, place, or actions, or even by combining several such edits. In some cases, we observed that the questions fail to recognize complete phrases of the claim, and sometimes most questions for a claim are semantically similar because the claim is too short. For example, in Fig. 2, we can see that most of the questions are paraphrases. Hence, better ways of generating questions and of filtering out less relevant and repetitive questions are needed to improve performance.
Question Answering. Answering correctly given the context is a non-trivial and crucial step in the pipeline. Unfortunately, state-of-the-art models can fail to answer correctly in some cases, as these cases require reasoning and logical thinking to derive the correct answer from the context. We show a failing example here:
Evidence: Weber was born in Eutin, Bishopric of Lübeck, the eldest of the three children of Franz Anton von Weber and his second wife, Genovefa Weber, a Viennese singer.
Question: How many siblings did Albert Weber have?
Answer for evidence: three.
In the example, the model is not able to give the correct answer, two, because it is an extractive QA model; this is a limitation of this type of model. Nevertheless, the explainability provided by the questions and answers gives us a better idea of which part of the claim is wrong and what could help us improve the model.
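The limitation in this example can be illustrated with a simple span check: an extractive model can only return text that literally occurs in the context. The helper below is a hypothetical illustration using a crude case-insensitive substring test, not part of any QA library.

```python
def is_extractable(answer: str, context: str) -> bool:
    """True if the answer occurs verbatim (case-insensitive) in the context."""
    return answer.lower() in context.lower()


evidence = (
    "Weber was born in Eutin, Bishopric of Lübeck, the eldest of the three "
    "children of Franz Anton von Weber and his second wife, Genovefa Weber, "
    "a Viennese singer."
)

# "three" is a span of the evidence, so an extractive model can return it.
print(is_extractable("three", evidence))
# "two" requires arithmetic (three children minus one) and is not a span.
print(is_extractable("two", evidence))
```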
Reasoning over text is a very challenging task; transforming the claim into another format, such as tabular data, may also help simplify the reasoning and thus improve performance.
5 Conclusion
In this paper, we proposed a novel pipeline for using QA as a proxy for fact-checking. Based on this pipeline, we proposed an answer comparison model with an attention mechanism, which learns to attend to critical questions and thereby provides interpretability.
Our ablation study showed that the model can achieve near state-of-the-art performance with only information from answer pairs. Thus, using QA, we can encourage the model to learn from more precise evidence; this can aid fact-checkers in better understanding models’ decisions. Then, when necessary, they can compare the answers and make decisions for themselves.
In future work, we plan to add the retrieval step to the pipeline instead of using gold evidence, as retrieval is also a crucial part of fact-checking. Alternatively, we could answer questions directly from a larger set of evidence documents. In addition, we plan to work on more datasets to assess the generalization capabilities of the method. Finally, we intend to conduct human evaluations of the questions and answers to improve the generation and potentially help human fact-checkers by providing high-quality QA pairs.
References
- [1] (2020) Generating fact checking explanations. In Annual Meeting of the Association for Computational Linguistics (ACL).
- [2] (2015) Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR).
- [3] (2018) Faithful to the original: fact aware neural abstractive summarization. In AAAI Conference on Artificial Intelligence (AAAI).
- [4] (2020) FEQA: a question answering evaluation framework for faithfulness assessment in abstractive summarization. In Annual Meeting of the Association for Computational Linguistics (ACL).
- [5] (2021) Fool me twice: entailment from Wikipedia gamification. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL).
- [6] Generating fact checking briefs. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
- [7] (2021) Is my model using the right evidence? Systematic probes for examining evidence-based tabular reasoning. arXiv preprint.
- [8] (2020) Explainable automated fact-checking for public health claims. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
- [9] (2020) Explainable automated fact-checking: a survey. In International Conference on Computational Linguistics (COLING).
- [10] (2020) Evaluating the factual consistency of abstractive text summarization. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
- [11] (2020) Fine-grained fact verification with kernel graph attention network. In Annual Meeting of the Association for Computational Linguistics (ACL).
- [12] (2020) On faithfulness and factuality in abstractive summarization. In Annual Meeting of the Association for Computational Linguistics (ACL).
- [13] (2021) Automated fact-checking for assisting human fact-checkers. In International Joint Conference on Artificial Intelligence (IJCAI).
- [14] (2021) Improving factual consistency of abstractive summarization via question answering. In Joint Conference of Annual Meeting of the Association for Computational Linguistics and International Joint Conference on Natural Language Processing (ACL-IJCNLP).
- [15] (2021) Understanding factuality in abstractive summarization with FRANK: a benchmark for factuality metrics. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL).
- [16] (2019) Towards debiasing fact verification models. In Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
- [17] (2020) MPNet: masked and permuted pre-training for language understanding. arXiv preprint.
- [18] (2018) FEVER: a large-scale dataset for fact extraction and verification. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL).
- [19] (2020) Asking and answering questions to evaluate the factual consistency of summaries. In Annual Meeting of the Association for Computational Linguistics (ACL).
- [20] (2020) PathQG: neural question generation from facts. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
- [21] (2021) Scalable fact-checking with human-in-the-loop. arXiv preprint.