Reasoning Over Semantic-Level Graph for Fact Checking

by Wanjun Zhong, et al.
Peking University

We study fact-checking in this paper, which aims to verify a textual claim given textual evidence (e.g., retrieved sentences from Wikipedia). Existing studies typically either concatenate retrieved sentences as a single string or use feature fusion on top of sentence features, while ignoring semantic-level information, including the participants, location, and temporality of an event occurring in a sentence and the relationships among multiple events. Such semantic-level information is crucial for understanding the relational structure of evidence and for reasoning deeply over it. In this paper, we address this issue by proposing a graph-based reasoning framework, called the Dynamic REAsoning Machine (DREAM) framework. We first construct a semantic-level graph, where nodes are extracted by semantic role labeling toolkits and are connected by inner- and inter-sentence edges. Given the automatically constructed graph, we use XLNet as the backbone of our approach and propose a graph-based contextual word representation learning module and a graph-based reasoning module to leverage the information of graphs. The first module is designed by considering a claim as a sequence, in which case we use the graph structure to re-define the relative distance of words. On top of this, we propose the second module by considering both the claim and the evidence as graphs and use a graph neural network to capture the semantic relationship at a more abstract level. We conduct experiments on FEVER, a large-scale benchmark dataset for fact-checking. Results show that both of the graph-based modules improve performance. Our system is the state-of-the-art system on the public leaderboard in terms of both accuracy and FEVER score.





The Internet provides an efficient way for individuals and organizations to quickly spread information to massive audiences. However, some of this information is true while some is false. Malicious people spread false news, which may have significant influence on public opinion, stock prices, and even presidential elections [4]. Some research shows that false news reaches more people than the truth [18]. In this paper, we study fact checking, a fundamental task in natural language processing whose goal is to automatically assess the truthfulness of a textual claim given textual evidence.

Figure 1: An example from the FEVER dataset.

Specifically, we study FEVER [14], short for Fact Extraction and VERification, which is one of the most influential benchmark datasets for fact checking. In FEVER, evidence comes from Wikipedia. A running example, used throughout the paper, is given in Figure 1. Given a claim with supporting evidence as the input, our goal is to predict whether the evidence supports or refutes the claim, or whether there is not enough information to make the decision. Existing approaches are dominated by natural language inference models [1] because the task essentially requires matching between the claim and the evidence. In most cases, a claim is a sentence while the evidence contains multiple sentences. Therefore, mining adequate and concise information from pieces of evidence is useful for matching to the claim.

Existing studies typically concatenate evidence sentences into a single string, which is used in the top-ranked system during the official FEVER challenge [15], or add a feature fusion layer on top of evidence features to further aggregate information from evidence sentences [21]. However, both methods ignore the important relational structure of evidence sentences at the semantic level, including the participants, location, and temporality of events. Let us take the example in Figure 1. Making a correct prediction for this claim requires a model to understand that “Rodney King riots” occurred in “Los Angeles County” from the first evidence, and that “Los Angeles County” is “the most populous county in the USA” from the second evidence. Simply concatenating evidence sentences as a single string would put a large distance between relevant pieces of information from different evidence sentences. Feature fusion aggregates all the information in an implicit way, which makes it hard to reason over structural information.

To address the aforementioned issues, we present a graph-based reasoning approach for fact checking. We represent evidence sentences as a graph, where nodes are extracted by SRL (Semantic Role Labelling) [12]. Nodes belonging to the same predicate-argument structure are fully connected. We further use string similarity based measurement to connect nodes of certain types (e.g. arguments, location and temporal) which are extracted from different sentences. After obtaining the constructed graph, we present a graph-based model with XLNet [19] as the backbone. We present a graph-based contextual word representation learning module and a graph-based reasoning module to leverage the graph information. In the first module, we use graph information to redefine the distance between words and produce contextual word embedding for each word in both claim and evidence sentences. In the graph-based reasoning module, we take the contextual word representations as input, and match the claim and evidence sentences over two graphs.

Experiments show that both graph-based modules improve the performance. At the time of paper submission, our system (DREAM on the official leaderboard) achieves state-of-the-art claim verification accuracy and FEVER score. This paper makes the following contributions:

  • We propose a graph-based reasoning approach, namely Dynamic REAsoning Machine (DREAM), for fact checking. We use SRL to construct graphs and propose two novel graph-based modules for graph-based representation learning and graph-based reasoning.

  • Results verify that both proposed modules bring improvements, and our final system achieves state-of-the-art performance.

Task Definition

FEVER (Fact Extraction and VERification) is a shared task proposed by Thorne et al. [14], in which systems are required to assess the veracity of given claims with integrated information from multiple pieces of evidence. Evidence in this task needs to be retrieved from all the documents in Wikipedia. Specifically, given a claim, the system is asked to search for potential sentence-level evidence and label the claim as “SUPPORTED”, “REFUTED” or “NOT ENOUGH INFO (NEI)”, which indicate that the claim is supported or refuted by the given evidence or is not verifiable. As the example in Figure 1 shows, verification of a claim requires the ability to aggregate pieces of information from multiple pieces of evidence and to reason over them. In FEVER, there are two official evaluation metrics. The first one is accuracy for the three-way classification (SUPPORTED/REFUTED/NEI), which is also the main focus of this work because it directly shows the verification performance of our graph-based reasoning approach. For comparison with existing studies, we also report results in terms of the second metric, i.e. FEVER score, which additionally measures whether the retrieved evidence is correct for the “SUPPORTED” and “REFUTED” categories.


In this section, we present an overview of our pipeline. At the high level, our pipeline consists of three main components: a document retrieval model, a sentence-level evidence selection model, and a claim verification model.

Figure 2 gives an overview of our pipeline, called the Dynamic REAsoning Machine (DREAM). Given a claim as input, the document retrieval model retrieves the top related documents from a dump of Wikipedia. With the retrieved documents, the sentence-level evidence selection model aims to select the top relevant sentences as the predicted evidence. Finally, the claim verification model performs reasoning over the claim and the predicted evidence, and states the veracity of the claim. We propose our reasoning framework in the claim verification model.

Figure 2: The pipeline of our DREAM system.

In this section, we briefly introduce our strategies for the first two models. The main contribution of this work is the graph-based reasoning approach we propose in the claim verification model, which we detail in the next section.

Document Retrieval Model

The document retrieval model takes a claim and a dump of Wikipedia as input, and returns the most relevant documents. We mainly follow the approach of UNC-NLP [9], which is the top-performing system in the competition hosted for the FEVER shared task [15].

The document retrieval model first adopts a keyword matching mechanism [9] to filter candidate documents from the large-scale Wikipedia dump. Since a large proportion (10%) of document titles carry disambiguation information (e.g. “Vedam (film)”), which is hard to identify with literal matching, we further apply the NSMN [9] model to perform semantic matching between claims and candidate documents with a disambiguation title. For a document with a disambiguation title, the normalized matching score between the claim and the document is the normalized probability that NSMN outputs when given the claim and the concatenation of the title and the first sentence of the document.

Documents without a disambiguation title are assigned the highest matching score, and documents with a disambiguation title are assigned the calculated matching score. These documents are then ranked and added to the resulting list. Finally, our system selects the top-ranked documents from the resulting list as the retrieved documents.
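The ranking rule described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the names `has_disambiguation` and `rank_documents`, the `match_score` callback standing in for NSMN, and the value 1.0 used for "the highest matching score" are all hypothetical.

```python
import re


def has_disambiguation(title):
    """True for titles such as 'Vedam (film)' that end in a parenthesised qualifier."""
    return re.search(r"\s\([^()]+\)$", title) is not None


def rank_documents(claim, candidates, match_score, top_k):
    """Rank candidate document titles for a claim.

    match_score(claim, title) is a placeholder for a neural matcher that
    returns a normalized probability in [0, 1] (an assumption of this sketch).
    """
    scored = []
    for title in candidates:
        if has_disambiguation(title):
            score = match_score(claim, title)
        else:
            score = 1.0  # assumed value for "the highest matching score"
        scored.append((score, title))
    # Python's sort is stable, so ties keep their original candidate order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored[:top_k]]


# Toy matcher used only for the example below.
toy_matcher = lambda claim, title: 0.9 if "film" in title else 0.1
ranked = rank_documents(
    "Vedam is a film.",
    ["Vedam (album)", "Vedam", "Vedam (film)"],
    toy_matcher,
    top_k=2,
)
```

In this toy run the title without a qualifier is ranked first, followed by the qualified title that the matcher prefers.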

Sentence-Level Evidence Selection Model

Evidence selection model selects the top potential evidence sentences by ranking all the candidate sentences from the documents retrieved by document retrieval model.

The evidence selector is required to conduct semantic matching between a claim and each evidence candidate. We employ pre-trained models like RoBERTa [7] and XLNet [19] as the sentence encoder. In our experiments, we use RoBERTa because it performs better. The input of our sentence encoder is

$$x = [\,c;\ \mathrm{SEP};\ e;\ \mathrm{SEP};\ \mathrm{CLS}\,]$$

where $c$ and $e$ indicate the tokenized word-pieces of the original claim and the evidence candidate, and $\mathrm{SEP}$ and $\mathrm{CLS}$ are symbols indicating the end of a sentence and the end of the whole input, respectively. The final representation $h \in \mathbb{R}^{d}$ is obtained by extracting the hidden vector of the [CLS] token, where $d$ denotes the dimension of the hidden vector.

Then we employ an MLP layer and a softmax layer to compute a score $s$ for each evidence candidate:

$$s = \mathrm{softmax}(W h + b)$$

where $W$ is a weight matrix and $b$ denotes a bias vector. Afterwards, we rank all the evidence sentences by score $s$ and select the top-ranked potential evidence sentences.
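The scoring and ranking step can be illustrated with a small sketch. It assumes, hypothetically, that the MLP produces two logits per candidate (relevant, not relevant) and that the softmax probability of the first logit is used as the ranking score; the function names and the toy logits are illustrative and do not come from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_evidence(candidates, logits, top_k):
    """Rank candidate sentences by the probability of the 'relevant' class.

    candidates: list of sentences.
    logits: list of (relevant, not_relevant) pairs, assumed to come from an
            encoder plus MLP that is not shown here.
    Returns the top_k (score, sentence) pairs, best first.
    """
    scored = [
        (softmax(list(pair))[0], sentence)
        for sentence, pair in zip(candidates, logits)
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


top = select_evidence(
    ["sentence a", "sentence b", "sentence c"],
    [(0.0, 1.0), (2.0, 0.0), (1.0, 1.0)],
    top_k=2,
)
```

Here "sentence b" has the largest margin in favour of the relevant class and is ranked first, followed by "sentence c", whose two logits are equal.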

Claim Verification Model

In this section, we introduce our claim verification model, which is the main contribution of this work. The task requires the ability to aggregate pieces of information from multiple pieces of evidence and to reason over them to draw a conclusion. Such information across multiple evidence sentences has intrinsic structure, including both intra-sentence structure, such as the arguments, location, and time of an event, and inter-sentence structure, such as the same mention of an argument in two sentences. Instead of simply concatenating evidence sentences into a single string, we propose to reason over a semantic-level graph for claim classification.

Our approach contains three modules, including (1) graph construction module, which constructs two semantic graphs for evidence and claim separately, (2) graph-based contextual word representation learning module, which takes the constructed graph as the input to learn a graph-enhanced contextual representation for each word in the input, and (3) graph reasoning module, which takes the outputs from the previous two modules to conduct graph-level representation learning and reasoning, and makes the prediction. Details of each module will be described below.

Graph Construction

Figure 3: An example of the constructed graph. Each box describes a result extracted by SRL regarding different verbs. The blue solid lines indicate inner-tuple edges and red dotted lines indicate inter-tuple edges.

We first introduce the common notation about graph networks that will be used throughout the paper. Then we will introduce the details of graph construction.

A graph network framework is a relational learning framework built on a graph structure. A graph is denoted as $G = \{V, E\}$, where $V$ denotes a set of nodes and $E$ represents the edges connecting them. $N_v$ and $N_e$ denote the number of nodes and edges, respectively. $\mathcal{N}(v)$ denotes the set of neighboring nodes that have an edge connecting to node $v$. The common input of our claim verification model is the tokenized word-pieces of length $T$:

$$x = [\,c;\ \mathrm{SEP};\ e;\ \mathrm{SEP};\ \mathrm{CLS}\,]$$

where $c$ is the claim and $e$ is the concatenation of the top-ranked evidence sentences.

We use the same method to construct graphs for evidence sentences and the claim. Below we take evidence as the example to describe the graph construction procedure. With given evidence or claim, our graph construction module operates in following steps.

  • Tuples (sets of argument nodes) are extracted via SRL toolkits. SRL is performed to identify arguments and their roles in a sentence. The nodes belonging to the same tuple are fully connected by inner-tuple edges.

  • We add an inter-tuple edge between each pair of nodes from different tuples if they potentially mention the same entity. We first employ NER (Named Entity Recognition) [10] toolkits to extract entities mentioned in the content of nodes. Assuming entity $A$ and entity $B$ come from different tuples, we add one inter-tuple edge if one of the following rules is satisfied: (1) $A$ is equal to $B$; (2) $A$ contains $B$; (3) the number of overlapping words between $A$ and $B$ is larger than half of the minimum number of words in $A$ and $B$.

Figure 3 shows an example of the constructed graph.
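The two construction steps can be sketched in a few lines. This is a simplified illustration, not the authors' code: it assumes the SRL tuples are already available as lists of strings, compares node texts directly instead of NER-extracted entities, lower-cases the text, and applies the containment rule in both directions. The names `same_entity` and `build_graph` are hypothetical.

```python
from itertools import combinations


def same_entity(a, b):
    """Apply the three string rules (equal, contains, word overlap) to two mentions."""
    a, b = a.lower(), b.lower()
    if a == b or a in b or b in a:
        return True
    words_a, words_b = a.split(), b.split()
    overlap = len(set(words_a) & set(words_b))
    return overlap > min(len(words_a), len(words_b)) / 2


def build_graph(tuples):
    """Build node and edge sets from SRL tuples.

    tuples: list of lists of node strings, one inner list per SRL tuple.
    Returns (nodes, inner_edges, inter_edges); nodes are (tuple_index, text)
    pairs and edges are pairs of node indices.
    """
    nodes = [(t, text) for t, tup in enumerate(tuples) for text in tup]
    inner, inter = set(), set()
    for i, j in combinations(range(len(nodes)), 2):
        if nodes[i][0] == nodes[j][0]:
            inner.add((i, j))  # same tuple: fully connected
        elif same_entity(nodes[i][1], nodes[j][1]):
            inter.add((i, j))  # different tuples, likely the same entity
    return nodes, inner, inter


nodes, inner_edges, inter_edges = build_graph([
    ["The Rodney King riots", "took place", "in Los Angeles County"],
    ["Los Angeles County", "is", "the most populous county in the USA"],
])
```

For this example, adapted from the running example in Figure 1, the only inter-tuple edge links "in Los Angeles County" to "Los Angeles County".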

Contextual Word Representation with Graph-Based Distance

Traditional reasoning approaches usually concatenate the pieces of evidence in a sequential way and feed them into a pre-trained model (e.g., XLNet) to learn the contextual word representation. Since pre-trained models adopt the absolute distance of two words in the input sequence, some closely linked nodes in the constructed graph are far away from each other. To better model the structural information in the extracted graph, we present an algorithm to re-calculate the distance between each pair of nodes in the text by introducing the distance of two nodes in the constructed graph.

However, the full distance matrix would take a huge amount of memory and computation, because each word in the extracted graph has a distinct distance vector and each element in the vector is mapped to an embedding vector. Assuming the length of the input is 512 and the dimension of each distance embedding is 1,024, the distance tensor holds 512 × 512 × 1,024 ≈ 268 million values for a single sample, which makes the full distance matrix infeasible to implement. To address this problem, we adopt a trade-off approach that uses a topological sort algorithm to order the words in the extracted graph. First, we use topological sort to order the nodes of the constructed graph, which shortens the distance between two closely linked nodes. Second, we feed the sorted sequence into XLNet to obtain the relative positions of words. Furthermore, topological sort ensures that the nodes preceding a given node are either its parent nodes or its sibling nodes. This characteristic helps the model learn the dependencies in the extracted graph.

The details of the topological sort algorithm are shown in Algorithm 1. The algorithm begins from nodes without incoming relations. For each such node, we recursively visit its child nodes in a depth-first manner.

In this way, we obtain the graph-guided distances between words, which are used as the input to the XLNet model. Then, XLNet maps the input of length $T$ into a sequence of hidden vectors $h_1, \dots, h_T$, one for each word-piece.

Input: a sequence of nodes S; a set of relations R
function DFS(node, visited, sorted_sequence)
      for each child in node's children do
            if visited[child] == 0 then
                 visited[child] = 1
                 DFS(child, visited, sorted_sequence)
            end if
      end for
      sorted_sequence.insert(0, node)
end function
sorted_sequence = []
visited = [0 for i in range(n)]
S, R = changed_to_acyclic_graph(S, R)
for each node s_i in S do
      if s_i has no incoming edges and visited[i] == 0 then
            visited[i] = 1
            DFS(s_i, visited, sorted_sequence)
      end if
end for
return sorted_sequence
Algorithm 1 Graph-based Distance Calculation Algorithm.
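A runnable sketch of a depth-first topological sort in the spirit of Algorithm 1 is given below. It assumes the graph has already been made acyclic (the `changed_to_acyclic_graph` step is not reproduced) and that relations are given as (parent, child) pairs; the function name `topological_order` and the toy graph are illustrative only.

```python
from collections import defaultdict


def topological_order(nodes, edges):
    """Return the nodes so that every parent precedes its children.

    nodes: list of hashable node ids.
    edges: list of (parent, child) pairs forming a DAG (assumed acyclic).
    """
    children = defaultdict(list)
    has_parent = set()
    for parent, child in edges:
        children[parent].append(child)
        has_parent.add(child)

    visited = set()
    post_order = []

    def dfs(node):
        visited.add(node)
        for child in children[node]:
            if child not in visited:
                dfs(child)
        # A node is recorded only after all of its descendants.
        post_order.append(node)

    # Start from nodes without incoming edges, as in Algorithm 1.
    for node in nodes:
        if node not in has_parent and node not in visited:
            dfs(node)

    # Reversing the post-order is equivalent to inserting each node at position 0.
    post_order.reverse()
    return post_order


order = topological_order(
    ["a", "b", "c", "d"],
    [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")],
)
```

For the toy graph above, `order` is `["a", "c", "b", "d"]`: the root comes first and the shared child `d` comes after both of its parents.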
Figure 4: The overview of our graph-based claim verification model.

Graph-Based Reasoning Network

Taking the graphs and graph-based distance matrices as input, we first initialize node representation based on the contextual word representation. Afterwards, we update the graph by propagating the information from neighboring nodes. Finally, after obtaining graph-level representations for claim-based and evidence-based graphs, we make the alignment between two graphs and make the final prediction.

Node Representation

The reasoning module, built on top of XLNet, takes the hidden vectors learned from XLNet to initialize the representation of nodes.

Each node in the graph is a word span in the input text. The initial representation of each node is the average of the hidden vectors at the corresponding positions. Afterwards, the representation is updated by the graph learning module.
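The span averaging can be written as a short helper. This sketch assumes that each node corresponds to a contiguous span given as (start, end) with an exclusive end index, and that hidden vectors are plain lists of floats; both choices, and the name `node_representation`, are assumptions for illustration.

```python
def node_representation(hidden, span):
    """Average the hidden vectors of the word-pieces covered by a node.

    hidden: list of T hidden vectors (each a list of floats).
    span: (start, end) positions of the node, end exclusive (assumed convention).
    """
    start, end = span
    vectors = hidden[start:end]
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


hidden_states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
first_node = node_representation(hidden_states, (0, 2))
second_node = node_representation(hidden_states, (2, 3))
```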

Graph Representation Learning

In this part, we present the graph learning module, which is designed to update the representation of nodes by aggregating information from their neighbors. To capture multi-hop relational information, we employ multi-layer graph convolutional networks (GCNs) [6] to update the node representation. Our motivation for using GCNs is their ability to automatically aggregate information through edges.

Here we describe the GCNs. Formally, we denote $G$ as the graph constructed by the previous graph construction method and let $H \in \mathbb{R}^{N_v \times d}$ be a matrix containing the representation of all nodes, where $N_v$ and $d$ denote the number of nodes and the dimension of the node representation, respectively. Each row $H_i \in \mathbb{R}^{d}$ is the representation of node $i$. We introduce an adjacency matrix $A$ of graph $G$ and its degree matrix $D$, where we add self-loops to matrix $A$ and $D_{ii} = \sum_{j} A_{ij}$.

Specifically, one-layer GCNs aggregate information through one-hop edges. We describe it as follows:

$$H_i^{(1)} = \rho\Big(\sum_{j=1}^{N_v} \tilde{A}_{ij} H_j^{(0)} W_0\Big)$$

where $H_i^{(1)} \in \mathbb{R}^{d}$ is the new $d$-dimension representation of node $i$, $\tilde{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ is the normalized symmetric adjacency matrix, $W_0$ is a weight matrix, and $\rho$ is an activation function. To exploit information from the multi-hop neighboring nodes, we stack multiple GCNs layers:

$$H_i^{(l+1)} = \rho\Big(\sum_{j=1}^{N_v} \tilde{A}_{ij} H_j^{(l)} W_l\Big)$$

where $l$ denotes the layer number and $H_i^{(0)}$ is the initial representation of node $i$, initialized from the contextual representation. We simplify $H^{(k)}$ as $H$ for later use, where $H$ indicates the representation of all nodes updated by $k$-layer GCNs. The graph learning mechanism is performed separately for the claim-based and evidence-based graphs. Therefore, we denote $H^{c}$ and $H^{e}$ as the representation of all nodes in the claim-based graph and the evidence-based graph, respectively. Afterwards, we utilize the graph matching module to align the graph-level node representations learned for the two graphs and make the final prediction.
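To make the propagation rule concrete, the following dependency-free sketch stacks GCN layers on a tiny graph. It assumes the symmetric normalization of Kipf and Welling [6] with self-loops, uses ReLU as the activation, and takes fixed weight matrices instead of learned ones; the helper names are illustrative, and a real model would use a tensor library.

```python
import math


def normalize_adjacency(adj):
    """Add self-loops to adj and return D^{-1/2} (A + I) D^{-1/2}."""
    n = len(adj)
    a = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(x, y):
    """Plain matrix product of two lists of lists."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


def relu(x):
    return [[max(0.0, v) for v in row] for row in x]


def gcn(adj, h, weights):
    """Apply one GCN layer per weight matrix: H <- ReLU(A_norm H W)."""
    a_norm = normalize_adjacency(adj)
    for w in weights:
        h = relu(matmul(matmul(a_norm, h), w))
    return h


# Path graph 0 - 1 - 2 with one-hot node features and identity weights.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
eye = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
one_layer = gcn(path, eye, [eye])
two_layers = gcn(path, eye, [eye, eye])
```

With one layer, node 0 receives nothing from node 2, which is two hops away; with two layers it does, which is the multi-hop effect that stacking is meant to provide.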

Graph Matching

We need to explore the related information between two graphs and make semantic alignment for final prediction.

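Formally, let $H^{e} \in \mathbb{R}^{N_v^{e} \times d}$ and $H^{c} \in \mathbb{R}^{N_v^{c} \times d}$ denote matrices containing the representation of all nodes in the evidence-based and claim-based graphs respectively, where $N_v^{e}$ and $N_v^{c}$ denote the number of nodes in the corresponding graph.

We first employ a graph attention mechanism [17] to generate a claim-specific evidence representation for each node in the claim-based graph. Specifically, we first take each $h_i^{c} \in H^{c}$ as the query, and take all node representations $h_j^{e} \in H^{e}$ as keys. We then perform graph attention on the nodes, an attention mechanism $a: \mathbb{R}^{F} \times \mathbb{R}^{F} \rightarrow \mathbb{R}$ that computes the attention coefficient as follows:

$$e_{ij} = a(W_c h_i^{c},\ W_e h_j^{e})$$

which means the importance of evidence node $j$ to the claim node $i$. $W_c \in \mathbb{R}^{F \times d}$ and $W_e \in \mathbb{R}^{F \times d}$ are weight matrices and $F$ is the dimension of the attention feature. We use the dot-product function as $a$ here. We then normalize $e_{ij}$ using the softmax function:

$$\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k=1}^{N_v^{e}} \exp(e_{ik})}$$

Afterwards, we calculate the claim-centric evidence representation $X = [x_1, \dots, x_{N_v^{c}}]$ using the weighted sum over $H^{e}$:

$$x_i = \sum_{j=1}^{N_v^{e}} \alpha_{ij} h_j^{e}$$

We then perform node-to-node alignment and calculate aligned vectors $Z = [z_1, \dots, z_{N_v^{c}}]$ from the claim node representation $H^{c}$ and the claim-centric evidence representation $X$:

$$z_i = f_{\mathrm{align}}(h_i^{c}, x_i)$$

where $f_{\mathrm{align}}$ denotes the alignment function. Inspired by Shen et al. (2018), we design our alignment function as:

$$f_{\mathrm{align}}(x, y) = W_a [x,\ y,\ x - y,\ x \odot y]$$

where $W_a \in \mathbb{R}^{d \times 4d}$ is a weight matrix and $\odot$ is the element-wise Hadamard product. The final output $g$ is obtained by mean pooling over $Z$. We then feed the concatenated vector of $g$ and the final hidden vector from XLNet through an MLP layer for the final prediction.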
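The matching steps (attention, weighted sum, alignment, mean pooling) can be sketched without any learned parameters. The code below is a simplification under explicit assumptions: the attention projections are taken to be the identity, the alignment features are assumed to be the concatenation of the two vectors, their difference, and their element-wise product, and the final weight matrix is omitted. All function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def claim_centric_evidence(claim_nodes, evidence_nodes):
    """For each claim node, return the attention-weighted sum of evidence nodes."""
    dim = len(evidence_nodes[0])
    result = []
    for h_c in claim_nodes:
        # Dot-product attention scores, normalized over all evidence nodes.
        alpha = softmax([dot(h_c, h_e) for h_e in evidence_nodes])
        result.append(
            [sum(a * h_e[k] for a, h_e in zip(alpha, evidence_nodes)) for k in range(dim)]
        )
    return result


def align_features(x, y):
    """Concatenate [x, y, x - y, x * y]; the learned projection is left out."""
    diff = [a - b for a, b in zip(x, y)]
    prod = [a * b for a, b in zip(x, y)]
    return x + y + diff + prod


def mean_pool(vectors):
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


claim = [[1.0, 0.0]]
evidence = [[1.0, 0.0], [0.0, 1.0]]
attended = claim_centric_evidence(claim, evidence)
aligned = [align_features(h, x) for h, x in zip(claim, attended)]
graph_vector = mean_pool(aligned)
```

In this toy case the claim node attends more strongly to the first evidence node, which points in the same direction, and `graph_vector` has four times the node dimension because of the concatenation.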


We conduct experiments on FEVER [14], a benchmark dataset for fact extraction and verification. Each instance in the FEVER dataset consists of a claim, groups of ground-truth evidence from Wikipedia, and a label (i.e., SUPPORTED, REFUTED, NOT ENOUGH INFO) indicating its veracity. Furthermore, FEVER comes with a dump of Wikipedia, which contains 5,416,537 preprocessed documents. The statistics of FEVER are shown in Table 1.

Split      SUPPORTED   REFUTED   NEI
Training   80,035      29,775    35,659
Dev        6,666       6,666     6,666
Test       6,666       6,666     6,666
Table 1: Split size of SUPPORTED, REFUTED and NOT ENOUGH INFO (NEI) classes in FEVER.

The two official evaluation metrics of FEVER are label accuracy and FEVER score. Label accuracy is the primary evaluation metric we apply for our experiments because it directly represents the performance of the claim verification model. We also report FEVER score, which measures whether both the predicted label and the retrieved evidence are correct. FEVER score is calculated with the following equation, where $y$ is the ground-truth label, $\hat{y}$ is the predicted label, $E$ is a set of ground-truth evidence, and $\hat{E}$ is a set of predicted evidence:

$$\mathrm{Instance\_Correct}(y, \hat{y}, E, \hat{E}) \overset{\mathrm{def}}{=} (y = \hat{y}) \wedge \big(y = \mathrm{NEI} \vee \mathrm{Evidence\_Correct}(E, \hat{E})\big)$$

No evidence is required if the predicted label is NEI.
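A simplified version of this metric can be sketched as follows. It assumes that ground-truth evidence is given as a list of groups, that an instance counts as evidence-correct when at least one complete group is contained in the predicted evidence, and that evidence items are (page, sentence id) pairs; any limit on the number of predicted sentences applied by the official scorer is not modelled. The function names are illustrative.

```python
def evidence_correct(gold_groups, predicted):
    """True if at least one complete ground-truth group was retrieved."""
    predicted = set(predicted)
    return any(set(group) <= predicted for group in gold_groups)


def instance_correct(gold_label, pred_label, gold_groups, predicted):
    """Label must match; evidence is only checked for non-NEI instances."""
    if gold_label != pred_label:
        return False
    return gold_label == "NOT ENOUGH INFO" or evidence_correct(gold_groups, predicted)


def fever_score(instances):
    """instances: list of (gold_label, pred_label, gold_groups, predicted) tuples."""
    return sum(instance_correct(*inst) for inst in instances) / len(instances)


examples = [
    # Correct label, full evidence group retrieved.
    ("SUPPORTED", "SUPPORTED", [[("P", 0)]], [("P", 0), ("Q", 1)]),
    # Correct label, but only half of the two-sentence group retrieved.
    ("REFUTED", "REFUTED", [[("P", 0), ("P", 1)]], [("P", 0)]),
    # NEI needs no evidence.
    ("NOT ENOUGH INFO", "NOT ENOUGH INFO", [], []),
    # Wrong label.
    ("SUPPORTED", "REFUTED", [[("P", 0)]], [("P", 0)]),
]
score = fever_score(examples)
```

Two of the four toy instances are counted as correct, so `score` is 0.5, whereas plain label accuracy on the same instances would be 0.75.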


We first select three top-performing systems on FEVER shared task as the baselines.

  • The UNC-NLP [9] employed a semantic matching neural network for both evidence selection and claim verification. They also employed additional features (e.g., WordNet features) and symbolic rules (e.g., keywords matching).

  • The UCL Machine Reading Group [20] verifies the veracity of each claim-evidence pair and aggregates the predicted labels.

  • The Athene UKP TU Darmstadt team [5] encodes each claim-evidence pair, followed by a pooling function.

We also compare to GEAR [21], which uses BERT [3] to generate claim-specific evidence representation and applies graph network to compute evidence-wise node representation for final prediction.

Model Comparison

Table 2 reports the overall performance of our model on the blind test set, with the scores shown on the public leaderboard for perpetual evaluation of FEVER (DREAM is our user name on the leaderboard). As shown in Table 2, our model significantly outperforms previous systems with 76.85% label accuracy and 70.60% FEVER score. At the time of paper submission, our system achieves state-of-the-art performance compared with other methods on the leaderboard.

Method                      Label Acc. (%)   FEVER Score (%)
Athene                      65.46            61.58
UCL Machine Reading Group   67.62            62.52
UNC-NLP                     68.21            64.21
GEAR-single                 71.60            67.10
DREAM (our approach)        76.85            70.60
Table 2: Performance on the blind test set.

Ablation Study

Table 3 presents the label accuracy on the development set after eliminating different components (including the graph-based relational distance and graph-based reasoning network) separately in our model. We also report the performance of our XLNet-based baseline, which does not take any graph information, equivalent to removing both components simultaneously.

Model                    Label Accuracy
DREAM                    79.16
-w/o Relative position   78.35
-w/o Graph Reasoning     77.12
XLNet baseline           75.40
Table 3: Ablation study on the development set.

As shown in Table 3, compared to the XLNet baseline, incorporating both graph-based modules brings 3.76% improvement on label accuracy. Removing the graph-based distance drops 0.81% in terms of label accuracy. The graph-based distance mechanism can shorten the distance of two closely-linked nodes and help the model to learn their dependency. Removing the graph-based reasoning module drops 2.04% because graph reasoning module captures the structural information and performs deep reasoning over that.

Document Retrieval Results

We evaluate the performance of our document retrieval module using recall metric, which is defined as the proportion of ground-truth documents that are successfully retrieved.
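The recall metric defined above can be computed as in the sketch below. It assumes micro-averaging over all ground-truth documents across claims, which is one reasonable reading of the definition; the function name and the toy titles are illustrative.

```python
def document_recall(gold_docs, retrieved_docs):
    """Proportion of ground-truth documents that appear in the retrieved lists.

    gold_docs, retrieved_docs: parallel lists with one collection of document
    titles per claim. Counts are pooled over all claims (micro-average, an
    assumption of this sketch).
    """
    hit = 0
    total = 0
    for gold, retrieved in zip(gold_docs, retrieved_docs):
        retrieved = set(retrieved)
        hit += sum(1 for doc in gold if doc in retrieved)
        total += len(gold)
    return hit / total if total else 0.0


recall = document_recall(
    [["A", "B"], ["C"]],
    [["A", "X"], ["C", "Y"]],
)
```

In the toy example two of the three ground-truth documents are retrieved, so the recall is two thirds.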

Table 4 reports the results of an efficient system (first row), which is built purely on keywords from the claim and the titles of all the documents with Elastic Search, and the results of its combination with a neural network based model (second row). The recall of the symbolic system is good, yet it can be improved by the neural model. In a real application, the choice between them is a trade-off between efficiency and performance.

Method                         Train Recall   Dev. Recall
Keywords+Elastic Search        80.46          83.33
Keywords+Elastic Search+NSMN   89.16          89.85
Table 4: Results of document retrieval module.

Evidence Selection Results

In this part, we present the performance of the sentence-level evidence selection module that we develop with different backbones. We take the concatenation of the claim and each evidence candidate as input, and take the last hidden vector to calculate the score for evidence ranking. Results in Table 5 indicate that RoBERTa performs slightly better than XLNet.

Model     Dev. Acc.   Dev. Rec.   Dev. F1   Test Acc.   Test Rec.   Test F1
XLNet     26.60       87.33       40.79     25.55       85.34       39.33
RoBERTa   26.67       87.64       40.90     25.63       85.57       39.45
Table 5: Results of evidence selection models.

Error Analysis

We randomly select 200 incorrectly predicted instances and summarize the primary types of errors.

The first type of error is caused by failing to match the semantic meaning between phrases that describe the same event. For example, the claim states “Winter’s Tale is a book.” while the evidence states “Winter ’s Tale is a 1983 novel by Mark Helprin.”. The model fails to realize that a “novel” is a kind of “book” and states that the claim is refuted. Solving this type of error requires external knowledge (e.g. ConceptNet [13]) that can indicate logical relationships between different events.

Misleading information in the retrieved evidence causes the second type of error. For example, the claim states “The Gifted is a movie”, and the ground-truth evidence states “The Gifted is an upcoming American television series”. However, the retrieved evidence also contains “The Gifted is a 2014 Filipino dark comedy-drama movie.”, which misleads the model into making the wrong judgement.

Related Work

In general, fact checking involves assessing the truthfulness of a claim. In the literature, a claim can be a text or a subject-predicate-object triple [8]. In this work, we only consider textual claims. Existing datasets differ in data source and in the type of supporting evidence for verifying the claim. An early work by Vlachos and Riedel (2014) constructs 221 labeled claims in the political domain from POLITIFACT.COM and CHANNEL4.COM, giving meta-data of the speaker as the evidence. POLITIFACT is further investigated by following works, including Ferreira and Vlachos (2016), who build Emergent with 300 labeled rumors and about 2.6K news articles, Wang (2017), who builds LIAR with 12.8K annotated short statements and six fine-grained labels, and Rashkin et al. (2017), who collect claims without meta-data while providing 74K news articles. We study FEVER [14], which requires aggregating information from multiple pieces of evidence from Wikipedia to reach a conclusion. FEVER contains 185,445 annotated instances, which to the best of our knowledge makes it the largest benchmark dataset in this area. We plan to study fact checking with adversarial attacks [16, 11] in the future.

The majority of participating teams in the FEVER challenge [15] use the same pipeline consisting of three components, namely document selection, evidence sentence selection, and claim verification. In the document selection phase, participants typically extract named entities from a claim as the query and use the Wikipedia search API. In the evidence selection phase, participants measure the similarity between the claim and an evidence sentence candidate by training a classification model like Enhanced LSTM [2] in a supervised setting or by using a string similarity function like TFIDF without trainable parameters. In this work, our focus is the claim classification phase. The three top-ranked systems aggregate pieces of evidence by concatenating evidence sentences into a single string [9], by classifying each evidence-claim pair separately and merging the results [20], and by encoding each evidence-claim pair followed by a pooling operation [5]. A recent work by Zhou et al. [21] is the first to use BERT to calculate claim-specific evidence sentence representations, and then develops a graph network to aggregate the information on top of BERT, regarding each evidence sentence as a node in the graph. Our work differs from Zhou et al. [21] in (1) that the construction of our graph requires understanding the syntax of each sentence, which could be viewed as a more fine-grained graph, and (2) that both the contextual representation learning module and the reasoning module contain model innovations that take the graph information into consideration. Instead of training each component separately, Yin and Roth (2018) show that joint learning could improve both claim verification and evidence identification.


In this work, we present a graph-based approach for fact checking. When assessing the veracity of a claim given multiple evidence sentences, our approach does not conduct text-based matching at the word or sentence level. Instead, our approach is built upon an automatically constructed graph, which is derived from semantic role labeling. To better exploit the graph information, we propose two graph-based modules, one for calculating contextual word embeddings using graph-based distance in XLNet, and another for learning representations of graph components and reasoning over the graph. Experiments show that both graph-based modules bring improvements and that our final system is the state of the art on the public leaderboard at the time of paper submission. In the future, we plan to leverage external background knowledge about the claim and evidence to improve the model’s reasoning ability.


  • [1] G. Angeli and C. D. Manning (2014) NaturalLI: natural logic inference for common sense reasoning. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 534–545. Cited by: Introduction.
  • [2] Q. Chen, X. Zhu, Z. Ling, S. Wei, H. Jiang, and D. Inkpen (2016) Enhanced LSTM for natural language inference. arXiv preprint arXiv:1609.06038. Cited by: Related Work.
  • [3] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: Baselines.
  • [4] R. Faris, H. Roberts, B. Etling, N. Bourassa, E. Zuckerman, and Y. Benkler (2017) Partisanship, propaganda, and disinformation: online media and the 2016 US presidential election. Cited by: Introduction.
  • [5] A. Hanselowski, H. Zhang, Z. Li, D. Sorokin, B. Schiller, C. Schulz, and I. Gurevych (2018) UKP-athene: multi-sentence textual entailment for claim verification. arXiv preprint arXiv:1809.01479. Cited by: 3rd item, Related Work.
  • [6] T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: Graph Representation Learning.
  • [7] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: Sentence-Level Evidence Selection Model.
  • [8] N. Nakashole and T. M. Mitchell (2014) Language-aware truth assessment of fact candidates. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1009–1019. Cited by: Related Work.
  • [9] Y. Nie, H. Chen, and M. Bansal (2019) Combining fact extraction and verification with neural semantic matching networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6859–6866. Cited by: Document Retrieval Model, 1st item, Related Work.
  • [10] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365. Cited by: 2nd item.
  • [11] T. Schuster, D. J. Shah, Y. J. S. Yeo, D. Filizzola, E. Santus, and R. Barzilay (2019) Towards debiasing fact verification models. arXiv preprint arXiv:1908.05267. Cited by: Related Work.
  • [12] P. Shi and J. Lin (2019) Simple BERT models for relation extraction and semantic role labeling. arXiv preprint arXiv:1904.05255. Cited by: Introduction.
  • [13] R. Speer, J. Chin, and C. Havasi (2017) ConceptNet 5.5: an open multilingual graph of general knowledge. In Thirty-First AAAI Conference on Artificial Intelligence. Cited by: Error Analysis.
  • [14] J. Thorne, A. Vlachos, C. Christodoulopoulos, and A. Mittal (2018) FEVER: a large-scale dataset for fact extraction and verification. arXiv preprint arXiv:1803.05355. Cited by: Introduction, Experiments, Related Work.
  • [15] J. Thorne, A. Vlachos, O. Cocarascu, C. Christodoulopoulos, and A. Mittal (2018) The fact extraction and verification (FEVER) shared task. arXiv preprint arXiv:1811.10971. Cited by: Introduction, Document Retrieval Model, Related Work.
  • [16] J. Thorne and A. Vlachos (2019) Adversarial attacks against fact extraction and verification. arXiv preprint arXiv:1903.05543. Cited by: Related Work.
  • [17] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2017) Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: Graph Matching.
  • [18] S. Vosoughi, D. Roy, and S. Aral (2018) The spread of true and false news online. Science 359 (6380), pp. 1146–1151. Cited by: Introduction.
  • [19] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237. Cited by: Introduction, Sentence-Level Evidence Selection Model.
  • [20] T. Yoneda, J. Mitchell, J. Welbl, P. Stenetorp, and S. Riedel (2018) UCL Machine Reading Group: four factor framework for fact finding (HexaF). In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pp. 97–102. Cited by: 2nd item, Related Work.
  • [21] J. Zhou, X. Han, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun (2019) GEAR: graph-based evidence aggregating and reasoning for fact verification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 892–901. Cited by: Introduction, Baselines.