Recent years have seen the proliferation of deceptive information online. With the increasing necessity to validate information from the Internet, automatic fact-checking has emerged as an important research topic. Fact-checking is at the core of multiple applications, e.g., discovery of fake news (Lazer et al., 2018), rumor detection in social media (Vosoughi et al., 2018), information verification in question answering systems (Mihaylova et al., 2018), detection of information manipulation agents (Chen et al., 2013; Mihaylov et al., 2015b; Darwish et al., 2017), and assistive technologies for investigative journalism (Hassan et al., 2015). It touches many aspects, such as credibility of users and sources, information veracity, information verification, and linguistic aspects of deceptive language. There has been work on automatic claim identification (Hassan et al., 2015, 2016), and also on checking the factuality/credibility of a claim, of a news article, or of an information source (Castillo et al., 2011; Ba et al., 2016; Zubiaga et al., 2016; Ma et al., 2016; Hardalov et al., 2016; Karadzhov et al., 2017a, b; Nakov et al., 2017b; Rashkin et al., 2017). In general, previous work has not paid much attention to explicitly modeling contextual information and linguistic properties of the discourse in order to identify and verify claims, with some rare recent exceptions (Popat et al., 2017; Gencheva et al., 2017).
In this article, we study the role of contextual information and discourse, which provide important information that is typically missing from the usual feature sets; those are mostly based on properties of the target claim and on its similarity to a set of validation documents or snippets. In particular, we focus on the following tasks:
Check-worthy claim identification
We address the automatic identification of claims in political debates which a journalist should fact-check. In this case, the text is dialog-style: with long turns by the candidates and orchestrated by a moderator around particular topics. Journalists had to challenge the veracity of claims in the 2016 US presidential campaign, and this was particularly challenging during the debates as a journalist had to prioritize which claims to fact-check first. Thus, we developed a model that ranks the claims by their check-worthiness.
We address the automatic verification of answers in community-driven Web forums (e.g., Quora, StackOverflow). The text is thread-style, but is subject to potential dialogues: a user posts a question and others post potential answers. That is, the answers are verified in the context of discussion threads in a forum and are also interpreted in the context of an initial question. Here we deal with social media content. The text is noisier and the information being shared is not always factual; mainly due to misunderstanding, ignorance, or maliciousness of the responder.
We run extensive experiments for both tasks by training and applying classifiers based on neural networks, kernel-based support vector machines, and combinations thereof. The results confirm that the contextual and the discourse information are crucial to boost the models and to achieve state-of-the-art results for both tasks (footnote 1: we make available the datasets and source code for both tasks at https://github.com/pgencheva/claim-rank and https://github.com/qcri/QLFactChecking). In the former task, using context yields 4.2 MAP points of absolute improvement, while using discourse information adds 1.5 MAP absolute points; in the latter task, considering the discourse and the contextual information improves the performance by a total of 4.5 MAP absolute points.
The rest of this article is organized as follows: Section 2 describes our supervised approach to predicting the check-worthiness of text fragments with focus on political debates. Section 3 presents our approach to verifying the factuality of the answers in a community question answering forum. Section 4 provides a more qualitative analysis of the outcome of all our experiments. Section 5 discusses related work. Finally, Section 6 presents the conclusions and the lessons learned, and further outlines some possible directions for future research.
2. Claim Identification
In this section, we focus on the problem of automatically identifying which claims in a given document are most check-worthy and thus should be prioritized for fact-checking. We focus on how contextual and discourse information can help in this task. We further study how to learn from multiple sources simultaneously (e.g., PolitiFact, FactCheck, ABC), with the objective of mimicking the selection strategies of one particular target source; we do this in a multi-task learning setup.
We used the CW-USPD-2016 dataset, which is centered around political debates (Gencheva et al., 2017). It contains four transcripts of the 2016 US Presidential election debates: one vice-presidential and three presidential. Each debate is annotated at the sentence level as check-worthy or not, but the sentences are kept in the context of the full debate, including metadata about the speaker, speaker turns, and system messages about the public reaction. The annotations were derived using publicly-available manual analysis of these debates by nine reputable fact-checking sources, shown in Table 1. This analysis was converted into a binary annotation: whether a particular sentence was annotated for factuality by a given source. Whenever one or more annotations were about part of a sentence, the entire sentence was selected, and when an annotation spanned over multiple sentences, each of them was selected. The dataset with the four debates contains 5,415 sentences, out of which 880 are positive examples (i.e., selected for fact-checking by at least one of the sources). Table 2 presents an excerpt of this corpus.
Note that the investigative journalists did not select the check-worthy claims in isolation, ignoring the context. Our analysis shows that these include claims that were highly disputed during the debate, that were relevant to the topic introduced by the moderator, etc. We will make use of these contextual dependencies below.
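Table 1 (excerpt): per-debate counts for each fact-checking source (one column for each of the four debates), followed by the total.
|The New York Times||26||25||46||52||149|
|The Washington Post||26||19||33||17||95|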
|Clinton:||So we’re now on the precipice of having a potentially much better economy, but the last thing we need to do is to go back to the policies that failed us in the first place.||0||0||0||0||0||0||0||0||0||0||No|
|Clinton:||Independent experts have looked at what I’ve proposed and looked at what Donald’s proposed, and basically they’ve said this, that if his tax plan, which would blow up the debt by over $5 trillion and would in some instances disadvantage middle-class families compared to the wealthy, were to go into effect, we would lose 3.5 million jobs and maybe have another recession.||1||1||0||0||1||1||0||1||1||6||Yes|
|Clinton:||They’ve looked at my plans and they’ve said, OK, if we can do this, and I intend to get it done, we will have 10 million more new jobs, because we will be making investments where we can grow the economy.||1||0||0||0||0||0||0||0||0||1||Yes|
|Clinton:||Take clean energy.||0||0||0||0||0||0||0||0||0||0||No|
|Clinton:||Some country is going to be the clean- energy superpower of the 21st century.||0||0||0||0||0||0||0||0||0||0||No|
|Clinton:||Donald thinks that climate change is a hoax perpetrated by the Chinese.||1||1||1||1||0||0||1||0||1||6||Yes|
|Clinton:||I think it’s real.||0||0||0||0||0||0||0||0||0||0||No|
|Trump:||I did not.||1||1||0||1||1||1||0||0||0||5||Yes|
2.2. Modeling Context and Discourse
We developed a rich input representation in order to model and to predict the check-worthiness of a sentence. In particular, we included a variety of contextual and discourse-based features. They characterize the sentence in the context of the full segment by the same speaker, sometimes also looking at the previous and the following segments. We define a segment as a maximal set of consecutive sentences by the same speaker, without intervention by another speaker or the moderator, i.e., a turn. We start by describing these context-based features, which are the focus of attention of this work.
2.2.1. Position (3 features)
A sentence on the boundaries of a speaker’s segment could contain a reaction to another statement or could provoke a reaction, which in turn could signal a check-worthy claim. Thus, we added information about the position of the target sentence in its segment: whether it is first/last, as well as its reciprocal rank in the list of sentences in that segment.
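The three position features can be illustrated with a minimal sketch. The function name and the 0-based `index` argument are assumptions made for this example; the source only specifies the first/last indicators and the reciprocal rank.

```python
def position_features(index, segment_length):
    """Return (is_first, is_last, reciprocal_rank) for one sentence.

    `index` is the 0-based position of the sentence inside its segment
    (a hypothetical convention chosen for this sketch).
    """
    if not 0 <= index < segment_length:
        raise ValueError("index must lie inside the segment")
    is_first = 1.0 if index == 0 else 0.0
    is_last = 1.0 if index == segment_length - 1 else 0.0
    reciprocal_rank = 1.0 / (index + 1)  # ranks are 1-based
    return is_first, is_last, reciprocal_rank
```

For a one-sentence segment, the sentence is both first and last, so all three values equal 1.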
2.2.2. Segment sizes (3 features)
The size of the segment belonging to one speaker might indicate whether the target sentence is part of a long speech, makes a short comment or is in the middle of a discussion with lots of interruptions. The size of the previous and of the next segments is also important in modeling the dialogue flow. Thus, we include three features with the size of the previous, the current, and the next segments.
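As an illustration, the sketch below groups consecutive sentences by speaker into segments and derives the three size features. Measuring size in sentences and using 0 for a missing neighboring segment are assumptions of this example, not details confirmed by the source.

```python
from itertools import groupby


def split_into_segments(speakers):
    """Group consecutive sentences by the same speaker.

    `speakers` holds one speaker name per sentence, in debate order.
    Returns a list of (speaker, [sentence indices]) pairs.
    """
    segments = []
    position = 0
    for speaker, run in groupby(speakers):
        size = len(list(run))
        segments.append((speaker, list(range(position, position + size))))
        position += size
    return segments


def segment_size_features(speakers):
    """Per sentence: (previous, current, next) segment size in sentences."""
    segments = split_into_segments(speakers)
    features = [None] * len(speakers)
    for i, (_, indices) in enumerate(segments):
        # Assumption: a missing neighboring segment has size 0.
        prev_size = len(segments[i - 1][1]) if i > 0 else 0
        next_size = len(segments[i + 1][1]) if i + 1 < len(segments) else 0
        for idx in indices:
            features[idx] = (prev_size, len(indices), next_size)
    return features
```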
2.2.3. Metadata (8 features)
Check-worthy claims often contain accusations about the opponents, as the example below shows (from the 2nd presidential debate):
|Trump:||Hillary Clinton attacked those same women and attacked them viciously.|
|Clinton:||They’re doing it to try to influence the election for Donald Trump.|
Thus, we use a feature that indicates whether the target sentence mentions the name of the opponent, whether the speaker is the moderator, and also who is speaking (3 features). We further use three binary features, indicating whether the target sentence is followed by a system message: applause, laugh, or cross-talk.
2.2.4. Topics (303 features)
Some topics are more likely to be associated with check-worthy claims, and thus we have features modeling the topics in the target sentence as well as in the surrounding context. We trained a Latent Dirichlet Allocation (LDA) topic model (Blei et al., 2003) on all political speeches and debates in The American Presidency Project (footnote 2: http://www.presidency.ucsb.edu/debates.php) using all US presidential debates in the 2007–2016 period (footnote 3: https://github.com/paigecm/2016-campaign). We had 300 topics, and we used the distribution over the topics as a representation for the target sentence. We further modeled the context using cosines with such representations for the previous, the current, and the next segment.
2.2.5. Embeddings (303 features)
We also modeled semantics using word embeddings. We used the pre-trained 300-dimensional Google News word embeddings by Mikolov et al. (2013a) to compute an average embedding vector for the target sentence, and we used the 300 dimensions of that vector. We also modeled the context as the cosine between that vector and the vectors for three segments: the previous, the current, and the following one.
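The sketch below shows one way to assemble such a sentence representation: an averaged word vector plus three cosines with the neighboring segments. It uses a toy embedding table passed in as a plain dictionary rather than the Google News vectors, and it assumes that each segment is also represented by the average of its word vectors, which the source does not state explicitly. All function names are hypothetical.

```python
import math


def average_vector(tokens, embeddings, dim):
    """Average the vectors of the tokens found in the embedding table."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embedding_features(sentence, prev_seg, cur_seg, next_seg, embeddings, dim):
    """Return dim + 3 values: the sentence vector and three context cosines.

    Every text argument is a list of tokens.
    """
    target = average_vector(sentence, embeddings, dim)
    context = [
        cosine(target, average_vector(segment, embeddings, dim))
        for segment in (prev_seg, cur_seg, next_seg)
    ]
    return target + context
```

With 300-dimensional embeddings, this yields the 303 values mentioned above. The same layout would apply to the topic features if the topic distribution replaced the averaged embedding.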
2.2.6. Contradictions (5 features)
Many claims selected for fact-checking contain contradictions to what has been said earlier, as in the example below (from the 3rd presidential debate):
|Clinton:||[…] about a potential nuclear competition in Asia, you said, you know, go ahead, enjoy yourselves, folks.|
|Trump:||I didn’t say nuclear.|
We model this by counting the negations in the target sentence as found in a dictionary of negation cues such as not, didn’t, and never. We further model the context as the number of such cues in the two neighboring sentences from the same segment and the two neighboring segments.
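A minimal sketch of these five counts follows. The cue set contains only the three examples named above and stands in for the full dictionary, which is not reproduced here; the tokenization and the use of empty strings for missing neighbors are assumptions of this example.

```python
import re

# Illustrative cue set: only the examples named in the text, not the full dictionary.
NEGATION_CUES = {"not", "didn't", "never"}


def count_negations(text):
    """Count negation cues in a text, treating curly apostrophes as straight ones."""
    normalized = text.lower().replace("\u2019", "'")
    tokens = re.findall(r"[a-z']+", normalized)
    return sum(1 for token in tokens if token in NEGATION_CUES)


def negation_features(target, prev_sentence, next_sentence,
                      prev_segment, next_segment):
    """Five counts: the target sentence, its two neighboring sentences in the
    same segment, and the two neighboring segments ('' when absent)."""
    texts = (target, prev_sentence, next_sentence, prev_segment, next_segment)
    return [count_negations(text) for text in texts]
```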
2.2.7. Similarity of the sentence to known positive/negative examples (3 features)
We used three more features that measure the similarity of the target sentence to other known examples. The first one computes the maximum over the training sentences of the number of matching words between the target and the training sentence, which is further multiplied by -1 if the latter was not check-worthy. We also used another version of the feature, where we multiplied it by 0 if the speakers were different. A third version took as a training set all claims checked by PolitiFact (footnote 4: http://www.politifact.com/), excluding the target sentence.
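The first two similarity features could be computed roughly as follows. This is a sketch under several assumptions: matching words are counted as the size of the intersection of lowercased, whitespace-split word sets; ties are broken in favor of the first training sentence; and the sign and speaker checks are applied to the best-matching training sentence. The third variant would reuse the same function with a different training set.

```python
def overlap(a, b):
    """Number of distinct lowercased words shared by two sentences."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def similarity_features(target, target_speaker, training):
    """Return (signed_overlap, same_speaker_signed_overlap).

    `training` is a list of (sentence, speaker, is_check_worthy) triples.
    """
    if not training:
        return 0, 0
    # Best-matching training sentence (first one wins on ties).
    sentence, speaker, is_check_worthy = max(
        training, key=lambda example: overlap(target, example[0])
    )
    score = overlap(target, sentence) * (1 if is_check_worthy else -1)
    # Second variant: zero out the score when the speakers differ.
    same_speaker_score = score if speaker == target_speaker else 0
    return score, same_speaker_score
```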
2.2.8. Discourse (20 features)
We saw above that contradiction can signal the presence of check-worthy claims and contradiction can be expressed by a discourse relation such as Contrast. As other discourse relations such as Background, Cause, and Elaboration can also be useful, we used a discourse parser (Joty et al., 2015) to parse the entire segment. This parser follows the Rhetorical Structure Theory (RST). It produces a hierarchical representation of the discourse by first linking the elementary discourse units with binary discourse relations (indicating also which unit is the nucleus and which is the satellite), and then building up the tree by connecting the more general cross-sentence nodes with the same type of discourse relations until a root node covers all the text. From this tree, we focused on the direct relationship between the target sentence and the other sentences in its segment; this gave rise to 18 contextual indicator features. We further analyzed the internal structure of the target sentence (how many nuclei and how many satellites it contains), which gave rise to two sentence-level features.
2.3. Other Features
2.3.1. ClaimBuster-based (1,045 core features)
In order to be able to compare our model and features directly to the previous state of the art, we re-implemented, to the best of our ability, the sentence-level features of ClaimBuster (Hassan et al., 2015), namely TF-IDF-weighted bag of words (998 features), part-of-speech tags (25 features), named entities as recognized by Alchemy API (footnote 5: http://www.ibm.com/watson/alchemy-api.html) (20 features), sentiment score from Alchemy API (1 feature), and number of tokens in the target sentence (1 feature). Apart from providing means of comparison to the state of the art, these features also make a solid contribution to our final system for check-worthiness estimation. However, note that we did not have access to the training data of ClaimBuster, which is not publicly available, and we thus train on our own dataset.
2.3.2. Sentiment (2 features)
Some sentences are highly negative, which can signal the presence of an interesting claim to check, as the two following example sentences show (from the 1st and the 2nd presidential debates):
|Trump:||Murders are up.|
|Clinton:||Bullying is up.|
We used the NRC sentiment lexicon (Mohammad and Turney, 2013) as a source of words and n-grams with positive/negative sentiment, and we counted the number of positive and of negative words in the target sentence. These features are different from those in ClaimBuster, where these lexicons were not used.
2.3.3. Named entities (NE) (1 feature)
Sentences that contain named entity mentions are more likely to contain a claim that is worth fact-checking as they discuss particular people, organizations, and locations. Thus, we have a feature that counts the number of named entities in the target sentence; we use the NLTK toolkit (Loper and Bird, 2002). Unlike the ClaimBuster features above, here we only have one feature; we also use a different toolkit for named entity recognition.
2.3.4. Linguistic features (13 features)
We use as features the presence and the frequency of occurrence of linguistic markers such as factives and assertives from (Hooper, 1974), implicatives from (Karttunen, 1971), hedges from (Hyland, 2005), Wiki-bias terms from (Recasens et al., 2013), subjectivity cues from (Riloff and Wiebe, 2003), and sentiment cues from (Liu et al., 2005) (footnote 6: most of these bias cues can be found at http://people.mpi-sws.org/~cristian/Biased_language.html). We compute a feature vector according to Equation (1), where for each bias type $B_i$ and answer $A_j$, the frequency of the cues for $B_i$ in $A_j$ is computed and then normalized by the total number of words in $A_j$:

$$B_i(A_j) = \frac{\sum_{cue \in B_i} count(cue, A_j)}{\sum_{w_k \in A_j} count(w_k, A_j)} \qquad (1)$$
Below we describe these cues in more detail.
Factives (1 feature) (Hooper, 1974) are verbs that imply the veracity of their complement clause. In E1, know suggests that “they will open a second school …” and “they provide a qualified french education …” are factually true statements.
E1: know that they will open a second school; and they are a nice french school…I know that they provide a qualified french education and add with that the history and arabic language to be adapted to the qatar. I think that’s an interesting addition.
Assertives (1 feature) (Hooper, 1974) are verbs that imply the veracity of their complement clause with a level of certainty. E.g., in E1, think indicates some uncertainty, while verbs like claim cast doubt on the certainty of their complement clause.
Implicatives (1 feature) (Karttunen, 1971) are verbs that imply the (un)truthfulness of their complement clause, e.g., decline and succeed.
Hedges (1 feature) (Hyland, 2005) reduce the person’s commitment to the truth, e.g., may and possibly.
Reporting verbs (1 feature) are used to report a statement from a source, e.g., argue and express.
Wiki-bias cues (1 feature) (Recasens et al., 2013) are extracted from the NPOV corpus from Wikipedia and cover bias cues (e.g., provide in E1), and controversial words, such as abortion and execute. These words are not available in any of the other bias lexicons.
Modals (1 feature) are used to change the certainty of the statement (e.g., will or can), make an offer (e.g., shall), ask permission (e.g., may), or express an obligation or necessity (e.g., must).
Negations (1 feature) are used to deny or make negative statements such as no, never.
Subjectivity cues (2 features) (Riloff and Wiebe, 2003) are used when expressing personal opinions and feelings. There are strong and weak cues, e.g., in E1, nice and interesting are strong, while qualified is weak.
Sentiment cues (2 features). We use positive and negative sentiment cues (Liu et al., 2005) to model the attitude, thought, and emotions of the speaker. In E1, nice, interesting and qualified are positive cues.
The above bias and subjectivity cues are mostly single words. Sometimes a multi-word cue (e.g., “we can guarantee”) can be a stronger signal for user’s certainty/uncertainty in their answers. We thus further generate multi-word cues (1 feature) by combining implicative, assertive, factive and report verbs with first person pronouns (I/we), modals and strong subjective adverbs, e.g., I/we+verb (e.g. “I believe”), I/we+adverb+verb (e.g., “I certainly know”), I/we+modal+verb (e.g., “we could figure out”) and I/we+modal+adverb+verb (e.g., “we can obviously see”).
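The normalized cue frequency used for the lexicon-based features in this subsection can be sketched as below. The two tiny lexicons contain only the example cues mentioned in the text and stand in for the real resources; the simple regular-expression tokenizer is also an assumption of this example.

```python
import re

# Illustrative lexicons built from the examples in the text (not the full resources).
EXAMPLE_LEXICONS = {
    "hedges": {"may", "possibly"},
    "implicatives": {"decline", "succeed"},
}


def cue_frequency(text, cues):
    """Number of cue occurrences divided by the total number of words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if token in cues) / len(tokens)


def linguistic_features(text, lexicons=EXAMPLE_LEXICONS):
    """One normalized frequency per lexicon, in a fixed (sorted) order."""
    return [cue_frequency(text, lexicons[name]) for name in sorted(lexicons)]
```

Multi-word cues such as "I certainly know" would need pattern matching over token sequences rather than single-token lookup, which this sketch does not cover.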
2.3.5. Tense (1 feature)
Most of the check-worthy claims mention past events. In order to detect when the speaker is making a reference to the past or is talking about his/her future vision and plans, we include a feature with three values, indicating whether the text is in the past, present, or future tense. The feature is extracted in a simplified fashion from the verbal expressions, using POS tags and a list of auxiliary phrases. In particular, we consider a sentence to be in the past tense if it contains a past verb (VBD), and in the future tense if it contains will or have to; otherwise, we assume it to be in the present tense.
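The tense rule above can be written as a short function. The sketch takes already POS-tagged tokens as input (Penn Treebank tags), so that it does not depend on a particular tagger, and it assumes that the past-tense check takes precedence when both cues occur in one sentence; the source does not specify the precedence.

```python
def tense_feature(tagged_tokens):
    """Return 'past', 'future', or 'present' for a POS-tagged sentence.

    `tagged_tokens` is a list of (token, tag) pairs with Penn Treebank tags.
    """
    words = [token.lower() for token, _ in tagged_tokens]
    # Assumption: a past verb (VBD) wins over future markers.
    if any(tag == "VBD" for _, tag in tagged_tokens):
        return "past"
    has_will = "will" in words
    has_have_to = any(
        first == "have" and second == "to"
        for first, second in zip(words, words[1:])
    )
    if has_will or has_have_to:
        return "future"
    return "present"
```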
2.3.6. Length (1 feature)
Shorter sentences are generally less likely to contain a check-worthy claim (footnote 7: one notable exception is short sentences with negations, e.g., Wrong., Nonsense., etc.). Thus, we have a feature for the length of the sentence in terms of characters. Note that this feature was not part of the ClaimBuster features, as there length was modeled in terms of tokens, but here we do so using characters.
We used ReLU (Glorot et al., 2011) as the activation function, and we trained the network ([…] et al., 1998) for 300 epochs with a batch size of 550. We set the L2 regularization to 0.0001, and we kept a constant learning rate of 0.04. We further enhanced the learning process by using Nesterov momentum (Sutskever et al., 2013) of 0.9.
We trained the models to classify sentences as positive if one or more media had fact-checked a claim inside the target sentence, and negative otherwise. We then used the classifier scores to rank the sentences with respect to check-worthiness (footnote 9: we also tried using ordinal regression, and SVM-perf (an instantiation of SVM-struct), to directly optimize precision, but they performed worse). We tuned the parameters and we evaluated the performance using 4-fold cross-validation, using each of the four debates in turn for testing while training on the remaining three.
We use ranking measures such as Precision at k (P@k) and Mean Average Precision (MAP). As Table 1 shows, most media rarely check more than 50 claims per debate, which means that there is no need to fact-check more than 50 sentences. Thus, we report P@k for k ∈ {5, 10, 20, 50} (footnote 10: since the P@k metrics, especially P@5 and P@10, are computed over only a few sentences, the differences between them can seem large even when they are caused by a few correctly or wrongly classified sentences). MAP is the mean of the Average Precision across the four debates. Finally, we also measure the recall at the R-th position of returned sentences for each debate, where R is the number of relevant documents for that debate; this metric is known as R-Precision (R-Pr). As with MAP, we provide the average across the four debates.
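For reference, these ranking measures can be computed as in the sketch below, which follows their standard definitions. Each ranking is a list of 0/1 relevance labels ordered by decreasing classifier score; the function names are chosen for this example, and the sketch assumes the ranking covers all sentences of a debate, so that every relevant sentence appears in it.

```python
def precision_at_k(labels, k):
    """Fraction of the top-k ranked sentences that are check-worthy."""
    return sum(labels[:k]) / k


def average_precision(labels):
    """Mean of the precision values at the rank of each relevant sentence."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(labels, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def r_precision(labels):
    """Precision at rank R, where R is the number of relevant sentences."""
    r = sum(labels)
    return sum(labels[:r]) / r if r else 0.0


def mean_average_precision(rankings):
    """Average precision averaged over several rankings (e.g., debates)."""
    return sum(average_precision(labels) for labels in rankings) / len(rankings)
```

At rank R, precision and recall coincide, because both divide the number of relevant sentences found in the top R by R.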
Table 4 shows all the results of our claim ranking system with several feature variants. In order to put the numbers in perspective, we also show the results for four increasingly competitive baselines (‘Reference Systems’). The first one is a random baseline. It is then followed by an SVM classifier based on a bag-of-words representation with TF-IDF weights estimated on the training data. Then come two versions of the ClaimBuster system: Claimbuster–Platform refers to the performance of ClaimBuster using the scores obtained from their online demo (footnote 11: http://idir-server2.uta.edu/claimbuster/demo), which we accessed on December 20, 2016, and Claimbuster–Features is our re-implementation of ClaimBuster using our FNN classifiers trained on our dataset with their features.
We can see that our system with all features outperforms all reference systems by a large margin for all metrics. The two versions of ClaimBuster also outperform the TF-IDF baseline on most measures. Moreover, our re-implementation of ClaimBuster is better than the online platform, especially in terms of MAP. This is expected as their system is trained on a different dataset and it may suffer from testing on slightly out-of-domain data. Our advantage with respect to ClaimBuster implies that the extra information coded in our model, mainly more contextual, structural, and linguistic features, has an important contribution to the final performance.
Rows 2–4 in Table 4 show the effect of the discourse and of the contextual features implemented in our system. The contextual features have a major impact on performance: excluding them yields a major drop for all measures, e.g., MAP drops from 0.427 to 0.385, and P@5 drops from 0.800 to 0.550. The discourse features also have an impact, although it is smaller. The most noticeable difference is in the quality at the lower positions in the rank, e.g., P@5 does not vary when removing discourse features, but P@10, P@20 and P@50 all drop by 2.5 to 5 percentage points. Finally, row 4 in the table shows that contextual+discourse features alone already yield a competitive system, performing about the same as Claimbuster–Platform (which uses no contextual features at all). In Section 4, we will present a further qualitative description of the results including some examples.
2.5. Multi-task Learning Experiments
Unlike the above single-source approaches, in this subsection, we explore a multi-source neural network framework, in which we try to predict the selections of each and every fact-checking organization simultaneously. We show that, even when the goal is to mimic the selection strategy of one particular fact-checking organization, it is beneficial to leverage the selection choices of multiple such organizations.
We approach the task of check-worthiness prediction using the same features, while at the same time modeling the problem as multi-task learning, using different sources of annotation over the same training dataset. As a result, we can learn to mimic the selection strategy of each and every one of these individual sources. As we have explained above, in our dataset the individual judgments come from nine independent fact-checking organizations, and we thus predict the selection choices of each of them in isolation plus a collective label ANY, which indicates whether at least one source would judge that claim as check-worthy.
Figure 1 illustrates the architecture of the full neural multi-source learning model, which predicts the selection choices of each of the nine individual sources (tasks) and of the special cumulative source: task ANY. There is a hidden layer (of size 300) that is shared between all ten tasks. Then, each task has its own task-specific hidden layer (each of size 300). Finally, each task-specific layer is followed by an output layer: a single sigmoid unit that provides the prediction of whether the utterance was fact-checked by the corresponding source. Eventually, we make use of the probability of the prediction to prioritize claims for fact-checking. During training, each task modifies the weights of both its own task-specific layer and of the shared layer. For our neural network architecture, we used ReLU units, Stochastic Gradient Descent with Nesterov momentum of 0.7, iterating for 100 epochs with batches of size 500 and a learning rate of 0.08.
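To make the layer layout concrete, the following sketch implements only the forward pass of such a network in plain Python, with randomly initialized and untrained weights. It is not the implementation used in the experiments: the task names, the weight initialization, and all function names are placeholders chosen for this example, and training is omitted.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """Fully connected layer: one output per (weight row, bias) pair."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def init_layer(rng, n_in, n_out):
    """Small random weights and zero biases (an arbitrary choice for the sketch)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_model(n_features, tasks, hidden=300, seed=0):
    """One shared hidden layer, then a hidden layer and a sigmoid unit per task."""
    rng = random.Random(seed)
    return {
        "shared": init_layer(rng, n_features, hidden),
        "tasks": {
            task: {
                "hidden": init_layer(rng, hidden, hidden),
                "output": init_layer(rng, hidden, 1),
            }
            for task in tasks
        },
    }


def predict(model, features):
    """Return one probability per task for a single feature vector."""
    shared = dense(features, *model["shared"], relu)  # shared by all tasks
    scores = {}
    for task, layers in model["tasks"].items():
        specific = dense(shared, *layers["hidden"], relu)
        scores[task] = dense(specific, *layers["output"], sigmoid)[0]
    return scores
```

Calling `build_model` with ten task names (nine placeholder source names plus "ANY") reproduces the ten-output layout; ranking sentences by the score of the target task then gives the prioritized list.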
This kind of neural network architecture for multi-task learning is known in the literature as hard parameter sharing (Caruana, 1993), and it can greatly reduce the risk of overfitting. In particular, it has been shown that the risk of overfitting the shared parameters in the hidden layer is an order N smaller than overfitting the task-specific parameters in the output layers, where N is the number of tasks at hand (Baxter, 1997). The input to our neural network consists of the various domain-specific features that have been previously described.
We implemented the neural network using Keras. We tried adding more shared and task-specific layers as well as having some task-specific layers linked directly to the input, but we eventually settled on the architecture in Figure 1. We also tried to optimize directly for average precision and adding loss weights to task ANY, but using the standard binary cross-entropy loss yielded the best results.
As before, we perform 4-fold cross-validation, where each time we leave one debate out for testing. Moreover, in order to stabilize the results, we repeat each experiment three times with different random seeds, and we report the average over these three reruns (footnote 12: having multiple reruns is a standard procedure to stabilize an optimization algorithm that is sensitive to the random seed, e.g., this strategy has been argued for when using MERT for tuning hyper-parameters in Statistical Machine Translation (Foster and Kuhn, 2009)).
We should note that in most cases this was not really needed, as the standard deviation for the reruns was generally tiny: 0.001 or less, absolute.
Table 5 presents the results, with all evaluation metrics, when predicting each of the nine sources. We experiment with three different configurations of the model described in the previous section. All of them aim at learning to mimic the selection choices by one single fact-checking organization (source). The first one is a single-task baseline, singleton, where a separate neural network is trained for each source. The other two are multi-task learning configurations: multi trains to predict labels for each of the nine tasks, one for each fact-checker; and multi+any trains to predict labels for each of the nine tasks (one for each fact-checker), and also for task ANY (as shown in Figure 1). We can see in Table 5 that, for most of the sources, multi-task learning improves over the single-source system. The results of the multi-task variations that improve over the single baseline are boldfaced in the table. The improvements are consistent across evaluation metrics and vary largely depending on the source and the metric. One notable exception is NYT, for which the single-task learning shows the highest scores. We hypothesize that the network has found some distinctive features of NYT, which make it easy to predict. These relations are blurred when we try to optimize for multiple tasks at once. However, it is important to state that removing NYT from the learning targets worsens the results for the other sources, i.e., it carries some important relations that are worth modeling.
The first three rows of Table 6 present the same results but averaged over the nine sources. Again, we can see that multi-task learning yields sizable improvement over the single-task learning baseline for all evaluation measures. Another conclusion that can be drawn is that including the task any does not help to improve the multi-task model. This is probably due to the fact that this information is already contained in the multi-task model with nine distinct sources only. The last two rows in Table 6 present two additional variants of the model: the single-task learning any system, which is trained on the union of the selected sentences by all nine fact-checkers to predict the target fact-checker only; and the system singleton+any that predicts labels for two tasks: (i) for the target fact-checker, and (ii) for task ANY.
We can see that the model any performs comparably to the singleton baseline, thus being clearly inferior to the multi-task learning variants. Finally, singleton+any is also better than the single-task learning variants, but it falls short compared to the other multi-task learning variants. Including output units for all nine individual media seems crucial for taking advantage of the multi-task learning, i.e., considering only an extra output prediction node for task ANY is not enough.
With the ever growing amount of unreliable content online, veracity will almost certainly become an important component of question answering systems in the future. In this section, we focus on fact-checking in the context of community question answering (cQA), i.e., predicting whether an answer to a given question is likely to be true. This aspect has been ignored, e.g., in recent cQA tasks at NTCIR and SemEval (Ishikawa et al., 2010; Nakov et al., 2015, 2016, 2017a), where an answer is considered as Good if it tries to address the question, irrespective of its veracity. Yet, veracity is an important aspect, as high-quality automatic fact-checking can offer a better experience to the users of cQA systems; e.g., a possible application scenario would be that in which the user could be presented with a ranking of all good answers accompanied by veracity scores, where low scores would warn her not to completely trust the answer or to double-check it.
|q:||If wife is under her husband’s sponsorship and is willing to come Qatar on visit, how long she can stay after extending the visa every month? I have heard it’s not possible to extend visit visa more than 6 months? …|
|a1:||Maximum period is 9 Months….|
|a2:||6 months maximum|
|a3:||This has been answered in QL so many times. Please do search for information regarding this. BTW answer is 6 months.|
Figure 2 presents an excerpt of an example from the Qatar Living forum, with one question (q) and three plausible answers (a1, a2, and a3) selected from a longer thread. According to the SemEval-2016 Task 3 annotation instructions (Nakov et al., 2016), all three answers are considered Good since they address the question. Nevertheless, a1 contains false information, while a2 and a3 are true (footnote 13), as can be checked on an official governmental website (footnote 14: https://www.moi.gov.qa/site/english/departments/PassportDept/news/2011/01/03/23385.html). Footnote 13: One could also guess that answers a2 and a3 are more likely to be true from the fact that the 6 months answer fragment appears many times in the current thread (it also happens to appear more often in related threads as well). While these observations serve as the basis for useful features for classification, the real verification for a gold standard annotation requires finding support from a reliable external information source: in this case, an official government information portal.
We use the CQA-QL-FACT dataset, which stresses the difference between (a) distinguishing a good vs. a bad answer, and (b) distinguishing between a factually true vs. a factually false one. We added the factuality annotations on top of the CQA-QL-2016 dataset from the SemEval-2016 Task 3 on community Question Answering (Nakov et al., 2016). In CQA-QL-2016, the data is organized in question–answer threads extracted from the Qatar Living forum. Each question has a subject, a body, and metadata: ID, category (e.g., Computers and Internet, Education, and Moving to Qatar), date and time of posting, and user name.
First, we annotated the questions using the following labels:
Factual: The question is asking for factual information, which can be answered by checking various information sources, and it is not ambiguous.
Opinion: The question asks for an opinion or an advice, not for a fact.
Socializing: Not a real question, but rather socializing/chatting. This can also mean expressing an opinion or sharing some information without really asking anything of general interest.
We annotated 1,982 questions with the above factuality labels. Of these, 625 instances contained multiple questions, and we excluded them from further analysis. Table 7 shows the annotation results for the remaining 1,357 questions, including examples.
|Label||Questions||Example|
|Factual||373||What is Ooredoo customer service number?|
|Opinion||689||Can anyone recommend a good Vet in Doha?|
|Socializing||295||What was your first car?|
Next, we annotated for veracity the answers to the factual questions. We only annotated the answers originally judged as Good (ignoring both Bad and Potentially Useful ones), and we used the following labels:
Factual - True: The answer is True and this can be verified using an external resource. (q: “I wanted to know if there were any specific shots and vaccinations I should get before coming over [to Doha].”; a: “Yes there are; though it varies depending on which country you come from. In the UK; the doctor has a list of all countries and the vaccinations needed for each.”). (This can be verified at https://wwwnc.cdc.gov/travel/destinations/traveler/none/qatar)
Factual - False: The answer gives a factual response, but it is false. (q: “Can I bring my pitbulls to Qatar?”, a: “Yes you can bring it but be careful this kind of dog is very dangerous.”). (The answer is not true because pitbulls are included in the list of banned breeds in Qatar: http://canvethospital.com/pet-relocation/banned-dog-breed-list-qatar-2015/)
Factual - Partially True: We could only verify part of the answer. (q: “I will be relocating from the UK to Qatar […] is there a league or TT clubs / nights in Doha?”, a: “Visit Qatar Bowling Center during thursday and friday and you’ll find people playing TT there.”). (The place has table tennis, but we do not know on which days: https://www.qatarbowlingfederation.com/bowling-center/)
Factual - Conditionally True: The answer is True in some cases, and False in others, depending on some conditions that the answer does not mention. (q: “My wife does not have NOC from Qatar Airways; but we are married now so can i bring her legally on my family visa as her husband?”, a: “Yes you can.”). (This answer can be true, but this depends upon some conditions: http://www.onlineqatar.com/info/dependent-family-visa.aspx)
Factual - Responder Unsure: The person giving the answer is not sure about the veracity of his/her statement. (e.g., “Possible only if government employed. That’s what I heard.”)
NonFactual: The answer is not factual. It could be an opinion, an advice, etc. that cannot be verified. (e.g., “Its better to buy a new one.”)
We further discarded items whose factuality was very time-sensitive (e.g., “It is Friday tomorrow.”, “It was raining last week.”), or for which the annotators were unsure. Arguably, many answers are somewhat time-sensitive, e.g., “There is an IKEA in Doha.” is true only after IKEA opened, but not before that. In such cases, we just used the present situation as a point of reference.
|Coarse-Grained Label||Answers||Fine-Grained Label||Answers|
|Positive||128||Factual - True||128|
|Negative||121||Factual - False||22|
| || ||Factual - Partially True||38|
| || ||Factual - Conditionally True||16|
| || ||Factual - Responder Unsure||26|
We considered all questions from the Dev and the Test partitions of the CQA-QL-2016 dataset. We targeted very high quality, and thus we did not crowdsource the annotation, as pilot annotations showed that the task was very difficult and that it was not possible to guarantee that Turkers would do all the necessary verification, e.g., gather evidence from trusted sources. Instead, all examples were first annotated independently by four annotators, and then they discussed each example in detail to come up with a final label. We ended up with 249 Good answers to 71 different questions, which we annotated for factuality: 128 Positive and 121 Negative examples. This is comparable in size to other fact-checking datasets, e.g., Ma et al. (2015) used 226 rumors, and Popat et al. (2016) had 100 Wiki hoaxes. See Table 8 for details.
3.2. Modeling Context and Discourse
We model the context of an answer with respect to the entire answer thread in which it occurs, and with respect to other high-quality posts from the entire Qatar Living forum. We further use discourse features as in Section 2.2.8.
3.2.1. Support from the current thread (5 features)
We use the cosine similarity between an answer vector and a thread vector of all Good answers using Qatar Living embeddings. For this purpose, we use 100-dimensional in-domain word embeddings (Mihaylov and Nakov, 2016b), which were trained using word2vec (Mikolov et al., 2013b) on a large dump of Qatar Living data (2M answers), available at http://alt.qcri.org/semeval2016/task3/data/uploads/QL-unannotated-data-subtaskA.xml.zip. The idea is that if an answer is similar to other answers in the thread, it is more likely to be true. To this, we add thread-level features related to the rank of the answer in the thread: (i) the reciprocal rank of the answer in the thread and (ii) the percentile of the answer’s rank in the thread. For the percentile, as there are exactly ten answers per thread in the dataset, the first answer gets the score of 1.0, the second one gets 0.9, the next one gets 0.8, and so on. We calculate these two ranking features twice: once for the full list of answers, and once for the list of Good answers only.
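The thread-level features described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes that the answer and thread vectors are plain averages of word embeddings and that answer positions are 1-based, and the names `average_vector`, `cosine`, and `rank_features` are hypothetical.

```python
import math


def average_vector(tokens, embeddings, dim):
    """Average the embeddings of the tokens that have a vector (assumed pooling)."""
    total = [0.0] * dim
    count = 0
    for tok in tokens:
        vec = embeddings.get(tok)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    return [t / count for t in total] if count else total


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_features(position, thread_length):
    """Reciprocal rank and rank percentile for a 1-based position in a thread."""
    reciprocal_rank = 1.0 / position
    percentile = 1.0 - (position - 1) / thread_length
    return reciprocal_rank, percentile


# Toy 2-dimensional embeddings (the paper uses 100-dimensional ones).
toy_embeddings = {"six": [1.0, 0.0], "months": [0.0, 1.0], "nine": [-1.0, 0.0]}
answer_vec = average_vector(["six", "months"], toy_embeddings, dim=2)
thread_vec = average_vector(["six", "months", "six", "months", "maximum"], toy_embeddings, dim=2)
similarity = cosine(answer_vec, thread_vec)
```

With ten answers per thread, `rank_features(2, 10)` gives a reciprocal rank of 0.5 and a percentile of 0.9, matching the scoring described in the text.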
3.2.2. Support from all of Qatar Living (60 features)
We further collect supporting evidence from all threads in the Qatar Living forum. To do this, we query a search engine, limiting the search to the forum only. See Section 3.3.3 for more detail about how the search for evidence on the Web is performed and what features are calculated.
|Question:||does anyone know if there is a french speaking nursery in doha?|
|Answer:||there is a french school here. don’t know the ages but my neighbor’s 3 yr old goes there…|
|Best Matched Sentence for Q&A: there is a french school here.|
|35639076||15||1||10||the pre-school follows the english program but also gives french and arabic lessons.|
|32448901||4||2||11||france bought the property in 1952 and since 1981 it has been home to the french institute.|
|31704366||7||3||1||they include one indian school, two french, seven following the british curriculum…|
|27971261||6||4||4||the new schools include six qatari, four indian, two british, two american and a finnish…|
3.2.3. Support from high-quality posts in Qatar Living
Among the active users of the Qatar Living forum, there is a community of 38 trusted users, who have written high-quality articles on topics that attract a lot of interest, e.g., issues related to visas, work legislation, etc. We try to verify the answers against these high-quality posts. (i) Since an answer can combine both relevant and irrelevant information with respect to its question, we first generate a query against a search engine for each Q&A. (ii) We then compute cosines between the query and the sentences in the high-quality posts, and we select the best-matching sentences. (iii) Finally, we compute textual entailment scores (Kouylekov and Negri, 2010) for the answer given these best matches, which we then use as features. An example is shown in Table 9.
3.2.4. Discourse features
We use the same discourse features as for the claim identification task (cf. Section 2.2.8).
3.3. Other Features
3.3.1. Linguistic bias, subjectivity and sentiment
Forum users, consciously or not, often put linguistic markers in their answers, which can signal the degree of the user’s certainty in the veracity of what they say. We thus use the linguistic features from the previous task (see above).
3.3.2. Credibility (31 features)
We use features that have been previously proposed for credibility detection (Castillo et al., 2011): number of URLs/images/emails/phone numbers; number of tokens/sentences; average number of tokens; number of positive/negative smileys; number of single/double/triple exclamation/interrogation symbols. To this set, we further add the number of interrogative sentences; the number of nouns/verbs/adjectives/adverbs/pronouns; and the number of words that are not in word2vec’s Google News vocabulary (such OOV words could signal slang, foreign language, etc.). We also use the number of 1st, 2nd, and 3rd person pronouns in the comments, both (i) as an absolute count and (ii) normalized by the total number of pronouns in the comment.
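A few of these surface features can be sketched as follows. This is only an illustration under stated assumptions: the regular expressions, the pronoun lists, the treatment of runs of three or more symbols as "triple", and the function names `count_runs` and `credibility_features` are all hypothetical choices, and the sketch covers only a small subset of the 31 features.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")

# Illustrative pronoun lists (an assumption, not the authors' lexicon).
PRONOUNS = {
    "first": {"i", "me", "my", "mine", "we", "us", "our", "ours"},
    "second": {"you", "your", "yours"},
    "third": {"he", "him", "his", "she", "her", "hers", "it", "its",
              "they", "them", "their", "theirs"},
}


def count_runs(text, symbol):
    """Count runs of a symbol by length: single, double, triple (3 or more)."""
    counts = {"single": 0, "double": 0, "triple": 0}
    for run in re.findall(re.escape(symbol) + "+", text):
        key = ("single", "double", "triple")[min(len(run), 3) - 1]
        counts[key] += 1
    return counts


def credibility_features(text):
    """Compute a small subset of surface credibility features for one comment."""
    without_urls = URL_RE.sub(" ", text)  # keep URL characters out of other counts
    tokens = re.findall(r"[a-z']+", without_urls.lower())
    features = {"num_urls": len(URL_RE.findall(text)), "num_tokens": len(tokens)}
    for symbol, name in (("!", "exclamation"), ("?", "question")):
        for length, value in count_runs(without_urls, symbol).items():
            features[f"{length}_{name}"] = value
    pronoun_counts = {p: sum(t in words for t in tokens) for p, words in PRONOUNS.items()}
    total = sum(pronoun_counts.values())
    for person, count in pronoun_counts.items():
        features[f"{person}_person"] = count  # absolute count
        features[f"{person}_person_ratio"] = count / total if total else 0.0  # normalized
    return features


example = credibility_features(
    "I think you should ask them!! See http://example.com for details. Is it open?"
)
```

Keeping both the absolute and the normalized pronoun counts mirrors the two variants mentioned in the text.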
|Question: Hi; Just wanted to confirm Qatar’s National Day. Is it 18th of December? Thanks.|
|Answer: yes; it is 18th Dec.|
|Query generated from Q&A: "National Day" "Qatar" National December Day confirm wanted|
|qppstudio.net||No||Other||Public holidays and national …the world’s source of Public holidays information|
|dohanews.co||Yes||Reputed||culture and more in and around Qatar …The documentary features human interest pieces that incorporate the day-to-day lives of Qatar residents|
|iloveqatar.net||Yes||Forum||Qatar National Day - Short Info …the date of December 18 is celebrated each year as the National Day of Qatar…|
|cnn.com||No||Reputed||The 2022 World Cup final in Qatar will be held on December 18 …Qatar will be held on December 18 – the Gulf state’s national day. Confirm. U.S …|
|icassociation.co.uk||No||Other||In partnership with ProEvent Qatar, ICA can confirm that the World Stars will be led on the 17 December, World Stars vs. Qatar Stars - Qatar National Day.|
3.3.3. Support from the Web (60 features)
We tried to verify whether an answer’s claim is factually true by searching for supporting information on the Web. We started with the concatenation of an answer to the question that heads the respective thread. Then, following Potthast et al. (2013), we extracted nouns, verbs, and adjectives, sorted by TF-IDF (we computed IDF on the Qatar Living dump). We further extracted and added the named entities from the text, and we generated a query of 5-10 words. If we did not obtain ten results, we dropped some terms and tried again.
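The query generation step can be sketched as follows. This is a simplified illustration, not the authors' code: it replaces part-of-speech filtering with a small stopword list, takes the named entities as a precomputed input, uses a smoothed IDF formula chosen for the example, and the names `build_query` and `relax_query` are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list standing in for POS-based filtering (assumption).
STOPWORDS = {"the", "and", "for", "was", "are", "just", "can", "any", "that", "this"}


def build_query(text, doc_freq, num_docs, entities=(), max_terms=10):
    """Quote the named entities, then add content words sorted by TF-IDF."""
    tokens = [t for t in re.findall(r"\b[a-z]{3,}\b", text.lower()) if t not in STOPWORDS]
    tf = Counter(tokens)

    def tfidf(term):
        # Smoothed IDF over the background collection (assumed formula).
        idf = math.log((1 + num_docs) / (1 + doc_freq.get(term, 0)))
        return tf[term] * idf

    ranked = sorted(tf, key=lambda t: (-tfidf(t), t))
    quoted = ['"%s"' % e for e in entities]
    budget = max(0, max_terms - len(quoted))
    return quoted + ranked[:budget]


def relax_query(terms):
    """Drop the lowest-ranked term, for retrying when too few results come back."""
    return terms[:-1]


query = build_query(
    "Qatar national day december december",
    doc_freq={"qatar": 90, "national": 10, "day": 50, "december": 5},
    num_docs=100,
    entities=["National Day"],
    max_terms=3,
)
```

In this toy run the rare, repeated term `december` ranks first and the very frequent `qatar` ranks last, which is the intended effect of sorting by TF-IDF.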
We automatically queried Bing and Google, and we extracted features from the resulting pages, considering Qatar-related websites only. An example is shown in Table 10. Based on the results, we calculated similarities: (i) cosine with TF-IDF vectors, (ii) cosine using Qatar Living embeddings, and (iii) containment (Lyon et al., 2001). We calculated these similarities between, on the one side, (i) the question or (ii) the answer or (iii) the question–answer pair, vs. on the other side, (i) the snippets or (ii) the web pages. To calculate the similarity to a webpage, we first converted the page to a list of rolling sentence triplets, then we calculated the score of the Q/A/Q-A vs. each triplet, and finally we took the average and also the maximum similarity over these triplets. Now, as we had up to ten Web results, we further took the maximum and the average over all the above features over the returned Qatar-related pages. We created three copies of each feature, depending on whether it came (i) from a reputed source (e.g., news, government websites, official sites of companies, etc.), (ii) from a forum-type site (forums, reviews, social media), or (iii) from some other type of website.
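The page-level similarity computation can be sketched as follows. This is a minimal illustration under assumptions: containment is computed here as the fraction of the claim's word n-grams that also occur in the evidence, with `n` left as a parameter, whitespace tokenization is used for brevity, and the helper names are hypothetical.

```python
def ngrams(tokens, n):
    """Set of word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(claim_tokens, evidence_tokens, n=1):
    """Fraction of the claim's n-grams that also appear in the evidence."""
    claim_grams = ngrams(claim_tokens, n)
    evidence_grams = ngrams(evidence_tokens, n)
    return len(claim_grams & evidence_grams) / len(claim_grams) if claim_grams else 0.0


def rolling_triplets(sentences):
    """Overlapping windows of three consecutive sentences."""
    if not sentences:
        return []
    if len(sentences) <= 3:
        return [sentences]
    return [sentences[i:i + 3] for i in range(len(sentences) - 2)]


def page_similarity(claim, sentences, sim=containment):
    """Average and maximum similarity of a claim against all triplets of a page."""
    claim_tokens = claim.lower().split()
    scores = [
        sim(claim_tokens, " ".join(triplet).lower().split())
        for triplet in rolling_triplets(sentences)
    ]
    if not scores:
        return 0.0, 0.0
    return sum(scores) / len(scores), max(scores)


page = ["qatar national day", "it is 18th dec", "a", "b", "c"]
avg_sim, max_sim = page_similarity("it is 18th dec", page)
```

The same `page_similarity` wrapper could take a TF-IDF or embedding-based cosine as `sim`, which is how the three similarity types in the text would share one aggregation step.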
Finally, we used as features the embeddings of the claim (i.e., the answer), of the best-scoring snippet, and of the best-scoring sentence triplet from a webpage. We calculated these embeddings using long short-term memory (LSTM) representations, which we trained for the task as part of a deep neural network (NN). We also used a task-specific embedding of the question and of the answer together with all the above evidence about it, which comes from the last hidden layer of the neural network.
3.4. Classification Model
Our model combines an LSTM-based neural network with kernel-based support vector machines. In particular, we use a bi-LSTM recurrent neural network to train abstract feature representations of the examples. We then feed these representations into a kernel-based SVM, together with other features. The architecture is shown in Figure 2(a). We have five LSTM sub-networks, one for each of the text sources from two search engines: Claim, Google Web page, Google snippet, Bing Web page, and Bing snippet. We feed the claim (i.e., the answer) into the neural network as-is. As we can have multiple snippets, we only use the best-matching one as described above. Similarly, we only use a single best-matching triple of consecutive sentences from a Web page. We further feed the neural network with the similarity features described above. All these vectors are concatenated and fully connected to a much more compact hidden layer that captures the task-specific embeddings. This layer is connected to a softmax output unit that classifies the claim as true or false. Figure 2(b) shows the general architecture of each of the LSTM components. The input text is transformed into a sequence of word embeddings, which are then passed to the bidirectional LSTM layer to obtain a representation for the input text.
Next, we extract the last three layers of the neural network, namely (i) the concatenation layer, (ii) the embedding layer, and (iii) the classification node, and we feed them into an SVM with a radial basis function (RBF) kernel. In this way, we use the neural network to train task-specific embeddings of the input text fragments, and also of the entire input example. Ultimately, this yields a combination of deep learning and task-specific embeddings with RBF kernels.
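The way the extracted layers enter the kernel can be sketched as follows. This is only an illustration of the RBF kernel on concatenated layer outputs: the layer values are made-up toy numbers, the names `svm_input` and `rbf_kernel` are hypothetical, and a real system would pass such vectors to an SVM library rather than compute kernel values by hand.

```python
import math


def svm_input(concat_layer, embedding_layer, class_node):
    """Concatenate the three extracted layer outputs into one feature vector."""
    return list(concat_layer) + list(embedding_layer) + list(class_node)


def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and y)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Toy layer outputs for two examples (hypothetical values and sizes).
example_a = svm_input([0.2, 0.7], [0.1, -0.3, 0.5], [0.9])
example_b = svm_input([0.1, 0.6], [0.0, -0.2, 0.4], [0.2])

self_similarity = rbf_kernel(example_a, example_a, gamma=0.5)
cross_similarity = rbf_kernel(example_a, example_b, gamma=0.5)
```

An example is maximally similar to itself (kernel value 1), and the value decays toward 0 as the layer outputs of two examples diverge, at a rate controlled by `gamma`.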
3.5.1. Question Classification
Table 11 shows the results of our run for classification of the three question categories (Factual, Opinion, Socializing), using an SVM with bag-of-words and some other features. We can see an absolute improvement of over 11 points over the majority-class baseline (62.0 vs. 50.7), which means the task is feasible. This also leaves plenty of room for further improvement, which is beyond the scope of this work. Instead, below we focus on the more interesting task of checking the factuality of Good answers to Factual questions.
|System||Accuracy|
|Baseline: All Opinion (majority class)||50.7|
|Our pilot: SVM, bag-of-words||62.0|
|Our pilot: SVM, text features||60.3|
3.5.2. Answer Classification
Setting and Evaluation
We perform leave-one-thread-out cross validation, where each time we exclude and use for testing one of the 71 questions together with all its answers. This is done in order to respect the structure of the threads when splitting the data. We report Accuracy, Precision, Recall, and F1 for the classification setting.
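The splitting scheme can be sketched as follows. This is a minimal illustration, assuming each example is stored as a `(thread_id, features, label)` tuple; the function name `leave_one_thread_out` and the toy data are hypothetical.

```python
def leave_one_thread_out(examples):
    """Yield (held_out_thread, train, test) folds, one fold per thread.

    All answers of the held-out question go to the test set together,
    so no thread is ever split between training and testing.
    """
    thread_ids = sorted({thread_id for thread_id, _, _ in examples})
    for held_out in thread_ids:
        train = [ex for ex in examples if ex[0] != held_out]
        test = [ex for ex in examples if ex[0] == held_out]
        yield held_out, train, test


# Toy data: three threads with a varying number of answers.
toy_examples = [
    ("Q1", [0.1], True), ("Q1", [0.4], False),
    ("Q2", [0.9], True),
    ("Q3", [0.3], False), ("Q3", [0.8], True), ("Q3", [0.5], True),
]
folds = list(leave_one_thread_out(toy_examples))
```

On the actual dataset this would produce 71 folds, one per question.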
We used a bidirectional LSTM with 25 units and a hard-sigmoid activation, which we trained using an RMSprop optimizer with 0.001 initial learning rate, L2 regularization with a coefficient of 0.1, and 0.5 dropout after the LSTM layers. The size of the hidden layer was 60 with tanh activations. We used a batch size of 32 and we trained for 400 epochs. Similarly to the bi-LSTM layers, we used an L2 regularizer with a coefficient of 0.01 and dropout with a probability of 0.3.
For the SVM, we used grid search to find the best values for the parameters C and γ. We optimized the SVM for classification Accuracy.
Table 12 shows the results from our experiments for several feature combinations and for two baselines. First, we can see that our system with all features performs better than the baseline for Accuracy. The ablation study shows the importance of the context and of the discourse features. When we exclude the discourse and the contextual features, the accuracy drops from 0.683 to 0.659 and 0.574, respectively. When both the context and the discourse features are excluded, the accuracy drops even further, to 0.542. The F1 results are consistent with this trend. This is similar to the trend for check-worthiness estimation (cf. Table 4). Finally, using the discourse and the contextual features, without any other features, yields an accuracy of 0.635, which is quite competitive. Overall, these results show the importance of the contextual and of the discourse features for the fact-checking task, with the former being more important than the latter.
|System||Accuracy||Precision||Recall||F1|
|All information sources||0.683||0.693||0.688||0.690|
|All without discourse and context||0.542||0.554||0.563||0.558|
|All positive (majority class)||0.514||0.514||1.000||0.679|
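The majority-class baseline row can be reproduced from the dataset statistics alone (128 Positive and 121 Negative answers). The sketch below uses the standard definitions of the four measures; the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


# Predicting Positive for all 249 answers: 128 true positives, 121 false positives.
acc, prec, rec, f1 = binary_metrics(tp=128, fp=121, fn=0, tn=0)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
# prints: 0.514 0.514 1.0 0.679
```

Because recall is trivially 1.0 for this baseline, its F1 (0.679) is close to that of the full system (0.690), which is why Accuracy is the more informative comparison here.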
Here we look at some examples that illustrate how context and discourse help for our two tasks.
4.1. Impact of Context
First, we give some examples where the use of contextual information yields the correct prediction for the check-worthiness task (Section 2). In each of these examples, there is a particular contextual feature type that turned out to be critical for making the correct prediction, namely that these are check-worthy sentences (they were all misclassified as not check-worthy when excluding that feature type):
Metadata - using opponent’s name.
Similarity of the sentence to known positive/negative examples.
The sentence “For the last seven-and-a-half years, we’ve seen America’s place in the world weakened.” is similar to the already fact-checked sentence “We’ve weakened America’s place in the world.” Thus, the former is to be classified as check-worthy.
Below are some examples for the cQA fact-checking task, where the use of particular contextual features allowed the system to predict correctly the factuality of the answers (they were all misclassified when the corresponding contextual feature was turned off):
Support from the current thread.
The example in Figure 5(a) shows how the thread information (e.g., similarity of one answer to the other answers in the thread) helps to predict the answer’s factuality. The question has four answers that should all be True, but they had been misclassified without the support from the current thread.
Support from high-quality posts in Qatar Living.
The example in Figure 5(b) was correctly classified as True when using the high-quality posts, and misclassified as False otherwise. The high-quality posts in the QL forum contain verified information about common topics discussed by people living in Qatar such as visas, driving regulations, customs, etc. The example shows one piece of relevant evidence selected by our method from the high-quality posts, which possibly helps in making the right classification.
Support from all of Qatar Living
The example in Figure 5(c) shows the evidence found in the search results in the entire Qatar Living forum. It was classified correctly as True when using the support from all of the Qatar Living forum, and it was misclassified without it.
4.2. Impact of Discourse
As the evaluation results have shown, discourse also played an important role. Let us take the check-worthiness task as an example. In the sentence “But what President Clinton did, he was impeached, he lost his license to practice law.”, the discourse parser identified the fragment “But what President Clinton did” as Background (text that facilitates understanding), the segment “he was impeached” as Elaboration (additional information), and “to practice law” as Enablement (referring to the action). These relations are associated with factually-true claims.
Similarly, for cQA fact-checking, using discourse information yielded correct classification as True for the example in Figure 5(d). The question and the answer were parsed together and the segment containing the answer was identified as Elaboration. The answer further contains a Background segment (“In the UK; the doctor has a list of all countries and the vaccinations needed for each.”) and an Attribution segment (“they have the same in the US”). These discourse relations are also associated with factually-true answers (as we have also seen in Figure 5(c)).
5. Related Work
Journalists, web users, and researchers are aware of the proliferation of false information on the Web, and as a result, topics such as information credibility and fact-checking are becoming increasingly important as research directions (Lazer et al., 2018; Vosoughi et al., 2018). For instance, there was a recent special issue of the ACM Transactions on Information Systems journal on Trust and Veracity of Information in Social Media (Papadopoulos et al., 2016), there was a SemEval-2017 shared task on Rumor Detection (Derczynski et al., 2017), and there was a lab at CLEF-2018 on Automatic Identification and Verification of Claims in Political Debates (Nakov et al., 2018; Atanasova et al., 2018; Barrón-Cedeño et al., 2018).
5.1. Detecting Check-Worthy Claims
The task of detecting check-worthy claims has received relatively little research attention so far. Hassan et al. (2015) developed ClaimBuster, which assigns each sentence in a document a score, i.e., a number between 0 and 1 showing how worthy it is for fact-checking. The system is trained on their own dataset of about 8,000 debate sentences (1,673 of them check-worthy), annotated by students, university professors, and journalists. Unfortunately, this dataset is not publicly available, and it contains sentences without context as about 60% of the original sentences had to be thrown away due to lack of agreement. In contrast, we developed a new publicly-available dataset based on manual annotations of political debates by nine highly-reputed fact-checking sources, where sentences are annotated in the context of the entire debate. This allows us to explore a novel approach, which focuses on the context. Note also that the ClaimBuster dataset is annotated following guidelines from Hassan et al. (2015) rather than trying to mimic a real fact-checking website; yet, it was later evaluated against PolitiFact (Hassan et al., 2016). In contrast, we train and evaluate directly on annotations from fact-checking websites, and thus we learn to fit them better. Our model is released as an online demo that supports both English and Arabic (Jaradat et al., 2018): http://claimrank.qcri.org/
Patwari et al. (2017) also focused on the 2016 US Presidential election campaign and independently obtained their data following a similar approach. Their setup asked to predict whether any of the fact-checking sources would select the target sentence. They used a boosting-like model that combines SVMs focusing on different clusters of the dataset, where the final outcome is the one coming from the most confident classifier. The features they considered range from LDA topic modeling to POS tuples and bag-of-words representations. Unlike that work, we further mimic the selection strategy of one particular fact-checking organization by learning to jointly predict the selection choices by various such organizations.
The above-mentioned lab on fact-checking at CLEF-2018 was partially based on a variant of our data, but it focused on one fact-checking organization only (Atanasova et al., 2018), unlike our multi-source setup here.
Beyond the document context, it has been proposed to mine check-worthy claims on the Web. For example, Ennals et al. (2010a) searched for linguistic cues of disagreement between the author of a statement and what is believed, e.g., “falsely claimed that X”. The claims matching the patterns go through a statistical classifier, which marks the text of the claim. This procedure can be used to acquire a corpus of disputed claims from the Web. Given a set of disputed claims, Ennals et al. (2010b) approached the task as locating new claims on the Web that entail the ones that have already been collected. Thus, the task can be framed as recognizing textual entailment, which is analyzed in detail by Dagan et al. (2009). Finally, Le et al. (2016) argued that the top terms in claim vs. non-claim sentences are highly overlapping, which is a problem for bag-of-words approaches. Thus, they used a CNN, where each word is represented by its embedding and each named entity is replaced by its tag, e.g., person, organization, location.
5.2. Fact-Checking and Credibility
The credibility of content on the Web has been questioned by researchers for a long time. While in the early days the main research focus was on online news portals (Brill, 2001; Hardalov et al., 2016), the interest has eventually shifted towards social media (Castillo et al., 2011; Zubiaga et al., 2016; Popat et al., 2017; Karadzhov et al., 2017a; Vosoughi et al., 2018), which are abundant in sophisticated malicious users such as opinion manipulation trolls (Mihaylov et al., 2018), either paid (Mihaylov et al., 2015b) or just perceived (Mihaylov et al., 2015a; Mihaylov and Nakov, 2016a), sockpuppets (Maity et al., 2017), Internet water army (Chen et al., 2013), and seminar users (Darwish et al., 2017).
Most of the efforts on assessing credibility have focused on micro-blogging websites. For instance, Canini et al. (2011) studied the credibility of Twitter accounts (as opposed to tweet posts), and found that both the topical content of information sources and social network structure affect source credibility. Another work, closer to ours, aims at addressing credibility assessment of rumors on Twitter as a problem of finding false information about a newsworthy event (Castillo et al., 2011). Their model considered a variety of features including user reputation, writing style, and various time-based features, among others.
Other efforts have focused on news communities. For example, several truth discovery algorithms were studied and combined in an ensemble method for veracity estimation in the VERA system (Ba et al., 2016). They proposed a platform for end-to-end truth discovery from the Web: extracting unstructured information from multiple sources, combining information about single claims, running an ensemble of algorithms, and visualizing and explaining the results. They also explored two different real-world application scenarios for their system: fact-checking for crisis situations and evaluation of trustworthiness of a rumor. However, the input to their model is structured data, while here we are interested in unstructured text. Similarly, the task defined by Mukherjee and Weikum (2015) combines three objectives: assessing the credibility of a set of posted articles, estimating the trustworthiness of sources, and predicting users’ expertise. They considered a manifold of features characterizing language, topics and Web-specific statistics (e.g., review ratings) on top of a continuous conditional random fields model. In follow-up work, Popat et al. (2016) proposed a model to support or refute claims from snopes.com and Wikipedia by considering supporting information gathered from the Web. In another follow-up work, Popat et al. (2017) proposed a complex model that considers stance, source reliability, language style, and temporal information.
Another important research direction is on using tweets and temporal information for checking the factuality of rumors. For example, Ma et al. (2015) used temporal patterns of rumor dynamics to detect false rumors and to predict their frequency. They focused on detecting false rumors in Twitter using time series. They used the change of social context features over a rumor’s life cycle in order to detect rumors at an early stage after they were broadcast.
A more general approach for detecting rumors is explored by Ma et al. (2016), who used recurrent neural networks to learn hidden representations that capture the variation of contextual information of relevant posts over time. Unlike this work, we do not use microblogs, but we query the Web directly in search for evidence.
In the context of question answering, there has been work on assessing the credibility of an answer, e.g., based on intrinsic information, i.e., without any external resources (Banerjee and Han, 2009). In this case, the reliability of an answer is measured by computing the divergence between language models of the question and of the answer. The emergence of community-based question answering websites also allowed for the use of other kinds of information. Click counts, link analysis (e.g., PageRank), and user votes have been used to assess the quality of a posted answer (Agichtein et al., 2008; Jeon et al., 2006; Jurczyk and Agichtein, 2007). Nevertheless, these studies address the answers’ credibility level just marginally.
Efforts to estimate the credibility of an answer in order to assess its overall quality required the inclusion of content-based information (Su et al., 2010), e.g., verbs and adjectives such as suppose and probably, which cast doubt on the answer. Similarly, Lita et al. (2005) used source credibility (e.g., does the document come from a government website?), sentiment analysis, and answer contradiction compared to other related answers. Another way to assess the credibility of an answer is to incorporate textual entailment methods to find out whether a text (question) can be derived from a hypothesis (answer). Overall, the credibility assessment for question answering has been mostly modeled at the feature level, with the goal of assessing the quality of the answers. A notable exception is the work of Nakov et al. (2017b), where credibility is treated as a task in its own right. Yet, credibility is different from factuality (our focus here) as the former is a subjective perception about whether a statement is credible, rather than verifying it as true or false; still, these notions are often wrongly mixed in the literature. To the best of our knowledge, no previous work has targeted fact-checking of answers in the context of community Question Answering by gathering external support.
6. Conclusion and Future Work
We have studied the role of context and discourse information for two factuality tasks: (i) detecting check-worthy claims in political debates, and (ii) fact-checking answers in a community question answering forum. We have developed annotated resources for both tasks, which we have made publicly available, and we have proposed rich input representations (including discourse and contextual features), and also a complementary set of core features to make our systems as strong as possible. The definition of context varies between the two tasks. For check-worthiness estimation, a target sentence occurs in the context of a political debate, where we model the current intervention by a debate participant in relationship to the previous and to the following participants’ turns, together with meta information about the participants, about the reaction of the debate’s public, etc. In the answer’s factuality checking task, the context for the answer involves the full question-answering thread, the related threads in the entire forum, or the set of related high-quality posts in the forum.
We trained classifiers for both tasks using neural networks, kernel-based support vector machines, and combinations thereof, and we ran a rigorous evaluation, comparing against alternative systems whenever possible. We also discussed several cases from the test set where the contextual information helped make the right decisions. Overall, our experimental results and the posterior manual analysis have shown that discourse cues, and especially modeling the context, play an important role and thus should be taken into account when developing models for these tasks.
In future work, we plan to study the role of context and discourse for other related tasks, e.g., for checking the factuality of general claims (not just answers to questions), and for stance classification in the context of factuality. We also plan to experiment with a joint model for check-worthiness estimation, for stance classification, and for fact-checking, which would be useful in an end-to-end system (Baly et al., 2018; Mohtarami et al., 2018).
We would also like to extend our datasets (e.g., with additional debates, but also with interviews and general discussions), thus enabling better exploitation of deep learning. Especially for the answer verification task, we would like to try distant supervision based on known facts, e.g., from high-quality posts, which would allow us to use more training data. We also want to improve user modeling, e.g., by predicting factuality for the user’s answers and then building a user profile based on that. Finally, we want to explore the possibility of providing justifications for the verified answers, and ultimately of integrating our system in a real-world application.
References
- Finding high-quality content in social media. In Proceedings of the International Conference on Web Search and Data Mining, WSDM ’08, Palo Alto, California, USA, pp. 183–194. Cited by: §5.2.
- Overview of the CLEF-2018 CheckThat! lab on automatic identification and verification of political claims, task 1: check-worthiness. In CLEF 2018 Working Notes, Avignon, France. Cited by: §5.1, §5.
- VERA: a platform for veracity estimation over web data. In Proceedings of the 25th International Conference Companion on World Wide Web, WWW ’16, Montréal, Québec, Canada, pp. 159–162. Cited by: §1, §5.2.
- Integrating stance detection and fact checking in a unified corpus. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT ’18, New Orleans, Louisiana, USA, pp. 21–27. Cited by: §6.
- Answer credibility: a language modeling approach to answer validation. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT ’09, Boulder, Colorado, USA, pp. 157–160. Cited by: §5.2.
- Overview of the CLEF-2018 CheckThat! lab on automatic identification and verification of political claims, task 2: factuality. In Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, Avignon, France. Cited by: §5.
- A Bayesian/information theoretic model of learning to learn via multiple task sampling. Machine Learning 28 (1), pp. 7–39. Cited by: §2.5.
- Latent Dirichlet allocation. Journal of Machine Learning Research 3 (1), pp. 993–1022. Cited by: §2.2.4.
- Online journalists embrace new marketing function. Newspaper Research Journal 22 (2), pp. 28. Cited by: §5.2.
- API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, Prague, Czech Republic, pp. 108–122. Cited by: §2.4.
- Finding credible information sources in social networks based on content and social structure. In Proceedings of the IEEE International Conference on Privacy, Security, Risk, and Trust, and the IEEE International Conference on Social Computing, SocialCom/PASSAT ’11, Boston, Massachusetts, USA, pp. 1–8. Cited by: §5.2.
- Multitask learning: a knowledge-based source of inductive bias. In Proceedings of the Tenth International Conference on Machine Learning, ICML ’93, Amherst, Massachusetts, USA, pp. 41–48. Cited by: §2.5.
- Information credibility on twitter. In Proceedings of the 20th International Conference on World Wide Web, WWW ’11, Hyderabad, India, pp. 675–684. Cited by: §1, §3.3.2, §5.2, §5.2.
- Battling the Internet Water Army: detection of hidden paid posters. In Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM ’13, Niagara, Ontario, Canada, pp. 116–120. Cited by: §1, §5.2.
- Recognizing textual entailment: rational, evaluation and approaches. Natural Language Engineering 15 (4), pp. i–xvii. Cited by: §5.1.
- Seminar users in the Arabic Twitter sphere. In Proceedings of the 9th International Conference on Social Informatics, SocInfo ’17, Oxford, UK, pp. 91–108. Cited by: §1, §5.2.
- SemEval-2017 Task 8: RumourEval: determining rumour veracity and support for rumours. In Proceedings of the 11th International Workshop on Semantic Evaluation, SemEval ’17, Vancouver, Canada, pp. 60–67. Cited by: §5.
- What is disputed on the web?. In Proceedings of the 4th Workshop on Information Credibility, WICOW ’10, New York, New York, USA, pp. 67–74. Cited by: §5.1.
- Highlighting disputed claims on the web. In Proceedings of the 19th International Conference on World Wide Web, WWW ’10, Raleigh, North Carolina, USA, pp. 341–350. Cited by: §5.1.
- Stabilizing minimum error rate training. In Proceedings of the Fourth Workshop on Statistical Machine Translation, StatMT ’09, Athens, Greece, pp. 242–249. Cited by: footnote 12.
- A context-aware approach for detecting worth-checking claims in political debates. In Proceedings of Recent Advances in Natural Language Processing, RANLP ’17, Varna, Bulgaria, pp. 267–276. Cited by: §1, §2.1, footnote 8.
- Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS ’11, Vol. 15, Fort Lauderdale, Florida, USA, pp. 315–323. Cited by: §2.4.
- In search of credible news. In Proceedings of the 17th International Conference on Artificial Intelligence: Methodology, Systems, and Applications, AIMSA ’16, Varna, Bulgaria, pp. 172–180. Cited by: §1, §5.2.
- Detecting check-worthy factual claims in presidential debates. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, CIKM ’15, pp. 1835–1838. Cited by: §1, §2.3.1, §5.1.
- Comparing automated factual claim detection against judgments of journalism organizations. In Computation + Journalism Symposium, Stanford, California, USA. Cited by: §1, §5.1.
- On assertive predicates. Indiana University Linguistics Club. Cited by: 1st item, 2nd item, §2.3.4.
- Metadiscourse: exploring interaction in writing. Continuum Discourse, Bloomsbury Publishing. Cited by: 4th item, §2.3.4.
- Overview of the NTCIR-8 Community QA Pilot Task (Part I): The Test Collection and the Task. In Proceedings of NTCIR-8 Workshop Meeting, Tokyo, Japan, pp. 421–432. Cited by: §3.
- ClaimRank: detecting check-worthy claims in Arabic and English. In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT ’18, New Orleans, Louisiana, USA, pp. 26–30. Cited by: footnote 22.
- A framework to predict the quality of answers with non-textual features. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’06, Seattle, Washington, USA, pp. 228–235. Cited by: §5.2.
- CODRA: a novel discriminative framework for rhetorical analysis. Computational Linguistics 41 (3), pp. 385–435. Cited by: §2.2.8.
- Discovering authorities in question answer communities by using link analysis. In Proceedings of the 16th ACM Conference on Conference on Information and Knowledge Management, CIKM ’07, Lisbon, Portugal, pp. 919–922. Cited by: §5.2.
- We built a fake news & click-bait filter: what happened next will blow your mind!. In Proceedings of the 2017 International Conference on Recent Advances in Natural Language Processing, RANLP ’17, Varna, Bulgaria, pp. 334–343. Cited by: §1, §5.2.
- Fully automated fact checking using external sources. In Proceedings of the 2017 International Conference on Recent Advances in Natural Language Processing, RANLP ’17, Varna, Bulgaria, pp. 344–353. Cited by: §1.
- Implicative verbs. Language 47 (2), pp. 340–358. Cited by: 3rd item, §2.3.4.
- An open-source package for recognizing textual entailment. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, ACL ’10, Uppsala, Sweden, pp. 42–47. Cited by: §3.2.3.
- The science of fake news. Science 359 (6380), pp. 1094–1096. Cited by: §1, §5.
- Towards a text analysis system for political debates. In Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, LaTeCH ’16, Berlin, Germany, pp. 134–139. Cited by: §5.1.
- Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §2.4.
- Qualitative dimensions in question answering: extending the definitional QA task. In Proceedings of the 20th National Conference on Artificial Intelligence, AAAI ’05, Pittsburgh, Pennsylvania, USA, pp. 1616–1617. Cited by: §5.2.
- Opinion observer: analyzing and comparing opinions on the web. In Proceedings of the 14th International Conference on World Wide Web, WWW ’05, New York, New York, USA, pp. 342–351. Cited by: 10th item, §2.3.4.
- NLTK: the natural language toolkit. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, ETMTNLP ’02, Philadelphia, Pennsylvania, USA, pp. 63–70. Cited by: §2.3.3, §2.4.
- Detecting short passages of similar text in large document collections. In Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing, EMNLP ’01, Pittsburgh, Pennsylvania, USA. Cited by: §3.3.3.
- Detecting rumors from microblogs with recurrent neural networks. In Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI ’16, New York, New York, USA, pp. 3818–3824. Cited by: §1, §5.2.
- Detect rumors using time series of social context information on microblogging websites. In Proceedings of the International on Conference on Information and Knowledge Management, CIKM ’15, Melbourne, Australia, pp. 1751–1754. Cited by: §5.2, footnote 20.
- Detection of sockpuppets in social media. In Proceedings of the ACM Conference on Computer Supported Cooperative Work and Social Computing, CSCW ’17, Portland, Oregon, USA, pp. 243–246. Cited by: §5.2.
- Finding opinion manipulation trolls in news community forums. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning, CoNLL ’15, Beijing, China, pp. 310–314. Cited by: §5.2.
- Exposing paid opinion manipulation trolls. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP ’15, Hissar, Bulgaria, pp. 443–450. Cited by: §1, §5.2.
- The dark side of news community forums: opinion manipulation trolls. Internet Research 28 (5), pp. 1292–1312. Cited by: §5.2.
- Hunting for troll comments in news community forums. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL ’16, Berlin, Germany, pp. 399–405. Cited by: §5.2.
- SemanticZ at SemEval-2016 Task 3: ranking relevant answers in community question answering using semantic similarity based on fine-tuned word embeddings. In Proceedings of the 10th International Workshop on Semantic Evaluation, SemEval ’16, San Diego, California, USA, pp. 879–886. Cited by: §3.2.1.
- Fact checking in community forums. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, AAAI ’18, New Orleans, Louisiana, USA, pp. 5309–5316. Cited by: §1.
- Exploiting similarities among languages for machine translation. CoRR abs/1309.4168. Cited by: §2.2.5.
- Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT ’13, Atlanta, Georgia, USA, pp. 746–751. Cited by: §3.2.1.
- Crowdsourcing a word-emotion association lexicon. Computational Intelligence 29 (3), pp. 436–465. Cited by: §2.3.2.
- Automatic stance detection using end-to-end memory networks. In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT ’18, New Orleans, Louisiana, USA, pp. 767–776. Cited by: §6.
- Leveraging joint interactions for credibility analysis in news communities. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, CIKM ’15, Melbourne, Australia, pp. 353–362. Cited by: §5.2.
- Overview of the CLEF-2018 CheckThat! lab on automatic identification and verification of political claims. In Proceedings of the Ninth International Conference of the CLEF Association: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Lecture Notes in Computer Science, Avignon, France, pp. 372–387. Cited by: §5.
- SemEval-2017 task 3: community question answering. In Proceedings of the International Workshop on Semantic Evaluation, SemEval ’17, Vancouver, Canada, pp. 27–48. Cited by: §3.
- SemEval-2015 Task 3: answer selection in community question answering. In Proceedings of the International Workshop on Semantic Evaluation, SemEval ’15, Denver, Colorado, USA, pp. 269–281. Cited by: §3.
- SemEval-2016 Task 3: Community Question Answering. In Proceedings of the 10th International Workshop on Semantic Evaluation, SemEval ’16, San Diego, California, USA, pp. 525–545. Cited by: §3.1, §3, §3.
- Do not trust the trolls: predicting credibility in community question answering forums. In Proceedings of the 2017 International Conference on Recent Advances in Natural Language Processing, RANLP ’17, Varna, Bulgaria, pp. 551–560. Cited by: §1, §5.2.
- Overview of the special issue on trust and veracity of information in social media. ACM Transactions on Information Systems 34 (3), pp. 14:1–14:5. Cited by: §5.
- TATHYA: a multi-classifier system for detecting check-worthy statements in political debates. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM, Singapore, pp. 2259–2262. Cited by: §5.1.
- Credibility assessment of textual claims on the web. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, CIKM ’16, Indianapolis, Indiana, USA, pp. 2173–2178. Cited by: §5.2, footnote 20.
- Where the truth lies: explaining the credibility of emerging claims on the web and social media. In Proceedings of the 26th International Conference on World Wide Web Companion, WWW ’17, Perth, Australia, pp. 1003–1012. Cited by: §1, §5.2, §5.2.
- Overview of the 5th international competition on plagiarism detection. In Proceedings of the CLEF Conference on Multilingual and Multimodal Information Access Evaluation, Valencia, Spain, pp. 301–331. Cited by: §3.3.3.
- Truth of varying shades: analyzing language in fake news and political fact-checking. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’17, Copenhagen, Denmark, pp. 2931–2937. Cited by: §1.
- Linguistic models for analyzing and detecting biased language. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL ’13, Sofia, Bulgaria, pp. 1650–1659. Cited by: 6th item, §2.3.4.
- Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, Valletta, Malta, pp. 45–50 (English). Cited by: §2.4.
- Learning extraction patterns for subjective expressions. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, EMNLP ’03, Sapporo, Japan, pp. 105–112. Cited by: 9th item, §2.3.4.
- Incorporate credibility into context for the best social media answers. In Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation, PACLIC ’10, Sendai, Japan, pp. 535–541. Cited by: §5.2.
- On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning, ICML ’13, Vol. 28, Atlanta, Georgia, USA, pp. 1139–1147. Cited by: §2.4.
- The spread of true and false news online. Science 359 (6380), pp. 1146–1151. Cited by: §1, §5.2, §5.
- Analysing how people orient to and spread rumours in social media by looking at conversational threads. PLoS ONE 11 (3), pp. 1–29. Cited by: §1, §5.2.