Community Question Answering (cQA) forums are very popular nowadays, as they represent effective means for communities around particular topics to share information. Unfortunately, this information is not always factual. Thus, here we explore a new dimension in the context of cQA, which has been ignored so far: checking the veracity of answers to particular questions in cQA forums. As this is a new problem, we create a specialized dataset for it. We further propose a novel multi-faceted model, which captures information from the answer content (what is said and how), from the author profile (who says it), from the rest of the community forum (where it is said), and from external authoritative sources of information (external support). Evaluation results show a MAP value of 86.54, which is 21 points absolute above the baseline.
Community Question Answering (cQA) forums such as StackOverflow, Yahoo! Answers, and Quora are very popular nowadays, as they represent effective means for communities around particular topics to share information and to collectively satisfy their information needs. However, the information being shared is not always factual. There are multiple factors explaining the presence of incorrect answers in cQA forums, e.g., misunderstanding, ignorance, or maliciousness of the responder. This is exacerbated by the fact that most cQA forums are barely moderated and lack systematic quality control. Moreover, in our dynamic world of today, truth is often time-sensitive: what was true yesterday may become false today.
We explore a new dimension in the context of cQA: checking the veracity of answers to a given question. This aspect has been ignored so far, e.g., in recent cQA tasks at NTCIR and SemEval [Ishikawa, Sakai, and Kando 2010; Nakov et al. 2015; Nakov et al. 2016; Nakov et al. 2017a], where an answer is considered as Good if it tries to address the question, irrespective of its veracity. Yet, veracity is an important aspect, as high-quality automatic fact checking can offer a better user experience for cQA systems. For instance, users could be presented with veracity scores, where low scores would warn them not to completely trust the answer or to double-check it.
Figure 1 presents an excerpt of an example from the Qatar Living forum, with one question and three answers selected from a longer thread. According to SemEval-2017 Task 3 [Nakov et al. 2017a], all three answers are good since they address the question. Nevertheless, one of them contains false information, while the other two are true, as can be checked on an official governmental website (https://www.moi.gov.qa/site/english/departments/PassportDept/news/2011/01/03/23385.html). One could also guess that the two true answers are more likely to be true from the fact that the 6 months answer appears many times in the thread, as well as in other threads. While these observations serve as the basis for useful features for classification, the real verification for a gold standard annotation requires finding support from a credible external source.
Determining the veracity of a claim is a very difficult problem, and solving it in full would require language understanding and inference, integration of several sources of information, and world knowledge, among other things. Here, we approach it as a supervised classification task, and we propose a novel model based on multi-faceted modeling of the facts, which integrates knowledge from several complementary sources, such as the answer content (what is said and how), the author profile (who says it), the rest of the community forum (where it is said), and external authoritative sources of information (external support).
The main contributions of this paper are as follows: (i) we are the first to study factuality in cQA, and we create a new high-quality dataset, CQA-QL-2016-fact, which we release to the research community (the dataset and the source code are available online at https://github.com/qcri/QLFactChecking); to the best of our knowledge, this is the first publicly-available dataset specifically targeting factuality in a cQA setting; (ii) we approach the problem of fact-checking using a multi-faceted model based on a rich input representation, including new features that have not been compared in such a configuration before; (iii) this rich representation allows us to obtain strong results that are applicable in practice to the application scenario outlined above; and (iv) we perform a qualitative analysis of what works well and what does not.
To the best of our knowledge, no previous work has targeted fact-checking of answers in the context of community question answering. Yet, there has been work on credibility assessment in cQA [Nakov et al. 2017b]. However, credibility is different from veracity (our focus here) as it is a subjective perception about whether a statement is credible, rather than verifying it as true/false as a matter of fact.
In the context of general QA, there has been work on credibility assessment, but it has only been modeled at the feature level, with the goal of improving Good answer identification. For example, Jurczyk and Agichtein (2007) modeled author authority using link analysis, while Agichtein et al. (2008) used PageRank and HITS in addition to intrinsic content quality (e.g., punctuation and typos, syntactic and semantic complexity, and grammaticality), and usage analysis (e.g., number of clicks and dwell time).
In [Lita et al. 2005] the focus was on source credibility, sentiment analysis, and answer contradiction compared to other answers, while in [Su, yun Chen, and Huang 2010] the emphasis was on verbs and adjectives that cast doubt. Other authors used language modeling to validate the reliability of an answer’s source [Banerjee and Han 2009] or focused on non-textual features such as click counts, answer activity level, and copy counts [Jeon et al. 2006]. There has also been work on curating social media content using syntactic, semantic, and social signals [Pelleg et al. 2016]. Unlike this research, we (i) target factuality rather than credibility, (ii) address it as a task in its own right, (iii) use a specialized dataset, and (iv) use a much richer text representation.
Information credibility, fact-checking and rumor detection have also been studied in the area of social computing. Castillo, Mendoza, and Poblete (2011) used user reputation, author writing style, and various time-based features. Canini et al. (2011) analyzed the interaction of content and social network structure, and Morris et al. (2012) and Zubiaga et al. (2016) studied how people handle rumors in social media. Lukasik, Cohn, and Bontcheva (2015) used temporal patterns to detect rumors and to predict their frequency. Ma et al. (2015) further used recurrent neural networks, and Zubiaga et al. (2016) focused on conversational threads. Other authors have gone beyond social media and have been querying the Web to gather support for accepting or refuting a claim [Popat et al. 2016]. Finally, there has also been work on studying credibility, trust, and expertise in news communities [Mukherjee and Weikum 2015]. However, none of this work was about QA or cQA.
As fact-checking of answers in the context of cQA is a new problem for which no dataset exists, we had to create our own. We chose to augment a pre-existing cQA dataset with factuality annotations, which allows us to stress the difference between (a) distinguishing a Good vs. a Bad answer, and (b) distinguishing between a factually-true vs. a factually-false one. In particular, we added annotations for factuality to the CQA-QL-2016 dataset from SemEval-2016 Task 3 on Community Question Answering (http://alt.qcri.org/semeval2016/task3/).
In CQA-QL-2016, the data is organized in question–answer threads from the Qatar Living forum (http://www.qatarliving.com/forum). Each question has a subject, a body, and meta information: ID, category (e.g., Computers and Internet, Education, and Moving to Qatar), date and time of posting, user name and ID.
We selected for annotation only the factual questions such as “What is Ooredoo customer service number?” In particular, we filtered out all (i) socializing, e.g., “What was your first car?”, (ii) requests for opinion/advice/guidance, e.g., “Which is the best bank around??”, and (iii) questions containing multiple sub-questions, e.g., “Is there a land route from Doha to Abudhabi. If yes; how is the road and how long is the journey?”
Next, we annotated for veracity the answers to the questions that we retained in the previous step. In CQA-QL-2016, each answer has a subject, a body, meta information (answer ID, user name and ID), and a judgment about how well it answers the question of its thread: Good vs. Bad vs. Potentially Useful. We only annotated the Good answers, using the following labels:
Factual - True: The answer is True and this can be manually verified using a trusted external resource. (Q: “I wanted to know if there were any specific shots and vaccinations I should get before coming over [to Doha].”; A: “Yes there are; though it varies depending on which country you come from. In the UK; the doctor has a list of all countries and the vaccinations needed for each.”). Note: this can be verified at https://wwwnc.cdc.gov/travel/destinations/traveler/none/qatar
Factual - False: The answer gives a factual response, but it is false. (Q: “Can I bring my pitbulls to Qatar?”, A: “Yes you can bring it but be careful this kind of dog is very dangerous.”). Note: the answer is not true because pitbulls are included in the list of breeds that are banned in Qatar: http://canvethospital.com/pet-relocation/banned-dog-breed-list-qatar-2015/
Factual - Partially True: We could only verify part of the answer. (Q: “I will be relocating from the UK to Qatar […] is there a league or TT clubs / nights in Doha?”, A: “Visit Qatar Bowling Center during thursday and friday and you’ll find people playing TT there.”). Note: the place has table tennis, but we do not know on which days: https://www.qatarbowlingfederation.com/bowling-center/
Factual - Conditionally True: The answer is True in some cases, and False in others, depending on some conditions that the answer does not mention. (Q: “My wife does not have NOC from Qatar Airways; but we are married now so can i bring her legally on my family visa as her husband?”, A: “Yes you can.”). Note: this answer can be true, but this depends upon some conditions: http://www.onlineqatar.com/info/dependent-family-visa.aspx
Factual - Responder Unsure: The person giving the answer is not sure about the veracity of his/her statement. (e.g., “Possible only if government employed. That’s what I heard.”)
NonFactual: The answer is not factual. It could be an opinion, advice, etc. that cannot be verified. (e.g., “Its better to buy a new one.”)
We further discarded answers whose factuality was very time-sensitive (e.g., “It is Friday tomorrow.”, “It was raining last week.”), or for which the annotators were unsure. Arguably, many answers are somewhat time-sensitive, e.g., “There is an IKEA in Doha.” is true only after IKEA opened, but not before that. In such cases, we just used the present situation (Summer 2017) as a point of reference.
We considered all questions from the Dev and the Test partitions of the CQA-QL-2016 dataset. We targeted very high quality, and thus we did not use crowdsourcing for the annotation, as pilot annotations showed that the task was very difficult and that it was not possible to guarantee that Turkers would do all the necessary verification, e.g., gathering evidence from trusted sources. Instead, all examples were first annotated independently by four annotators, and then they discussed each example in detail to come up with a final consensus label. We ended up with 249 Good answers to 71 different questions, which we annotated for factuality: 128 Positive and 121 Negative examples. This is comparable in size to other fact-checking datasets, e.g., Ma et al. (2015) experimented with 226 rumors, and Popat et al. (2016) used 100 Wiki hoaxes. See Table 1 for more detail.
|Label||Answers|
|Factual - True||128|
|Factual - False||22|
|Factual - Partially True||38|
|Factual - Conditionally True||16|
|Factual - Responder Unsure||26|
We use a multi-faceted model, based on a rich input representation that models (i) the user profile, (ii) the language used in the answer, (iii) the context in which the answer is located, and (iv) external sources of information.
These are features characterizing the user who posted the answer, previously proposed for predicting credibility in cQA [Nakov et al.2017b].
User posts categories (396 individual features) We count the answers a user has posted in each of the 197 categories in Qatar Living. We have each feature twice: once raw and once normalized by the total number of answers the user has posted. We further use as features this total number of answers, and the number of distinct categories the user has posted in.
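The category counts described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `category_features` and the input format (one category string per posted answer) are assumptions made for the example.

```python
from collections import Counter

def category_features(user_answer_categories, all_categories):
    """Per-category answer counts (raw and normalized) plus two summaries.

    user_answer_categories: one category name per answer the user posted
    (hypothetical input format). all_categories: the fixed category list.
    """
    counts = Counter(user_answer_categories)
    total = len(user_answer_categories)
    raw = [counts.get(c, 0) for c in all_categories]
    # Normalize by the user's total number of answers; guard against zero.
    norm = [r / total if total else 0.0 for r in raw]
    distinct = sum(1 for r in raw if r > 0)
    return raw + norm + [total, distinct]
```

With 197 categories this yields 2 × 197 + 2 = 396 values, which matches the stated size of the feature group.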
User posts quality (13 features) We first use the CQA-QL-2016 data to train a Good vs. Bad answer classifier, as described by Barrón-Cedeño et al. (2015). We then run this classifier (which has 80+% accuracy) on the entire unannotated Qatar Living database (2M answers, provided by the SemEval-2016 Task 3 organizers) and we aggregate its predictions to build a user profile: number of Good/Bad answers, total number of answers, percentage of Good/Bad answers, sum of the classifier’s probabilities for Good/Bad answers, total sum of the classifier’s probabilities over all answers, average score for the probability of Good/Bad answers, and highest absolute score for the probability of a Good/Bad answer.
User activity (19 features) These features describe the overall activity of the user. We include the number of answers posted, number of distinct questions answered, number of questions asked, number of posts in the Jobs and in the Classifieds sections, number of days since registering in the forum, and number of active days. We also have features modeling the number of answers posted during working hours (7:00-15:00h in forum time, i.e., local Qatar time), after work, at night, early in the morning, and before noon. We also model the day of posting: during a working day vs. during the weekend. Finally, we track the number of answers posted among the first k in a question–answer thread, for several values of k.
These features model what the answer says, and how. Such features were previously used by Gencheva et al. (2017).
Forum users, consciously or not, often put linguistic markers in their answers, which can signal the degree of the user’s certainty in the veracity of what they say. Table 2 lists some categories of such markers, together with examples.
We use linguistic markers such as factives from [Hooper 1974], assertives from [Hooper 1974], implicatives from [Karttunen 1971], hedges from [Hyland 2005], Wiki-bias terms from [Recasens, Danescu-Niculescu-Mizil, and Jurafsky 2013], subjectivity cues from [Riloff and Wiebe 2003], and sentiment cues from [Liu, Hu, and Cheng 2005]. Most of these bias cues can be found at https://people.mpi-sws.org/~cristian/Biased_language.html
Factives (1 feature) are verbs that imply the veracity of their complement clause. For example, in E1 below, know suggests that “they will open a second school …” and “they provide a qualified french education …” are factually true statements.
E1. Q: What do you recommend as a French school; Lycee Voltaire or Lycee Bonaparte?
A: … About Voltaire; I know that they will open a second school; and they are a nice french school… I know that they provide a qualified french education and add with that the history and arabic language to be adapted to the qatar. I think that’s an interesting addition.
|Bias Type||Sample Cues|
|Factives||realize, know, discover, learn|
|Implicatives||cause, manage, hesitate, neglect|
|Assertives||think, believe, imagine, guarantee|
|Hedges||approximately, estimate, essentially|
|Report-verbs||argue, admit, confirm, express|
|Wiki-bias||capture, create, demand, follow|
|Modals||can, must, will, shall|
|Negations||neither, without, against, never, none|
|Strong-subj||admire, afraid, agreeably, apologist|
|Weak-subj||abandon, adaptive, champ, consume|
|Positives||accurate, achievements, affirm|
|Negatives||abnormal, bankrupt, cheat, conflicts|
Assertives (1 feature) are verbs that imply the veracity of their complement clause with some level of certainty. For example, in E1, think indicates some uncertainty, while verbs like claim cast doubt on the certainty of their complement clause.
Implicatives (1 feature) imply the (un)truthfulness of their complement clause, e.g., decline and succeed.
Hedges (1 feature) reduce commitment to the truth, e.g., may and possibly.
Reporting verbs (1 feature) are used to report a statement from a source, e.g., argue and express.
Wiki-bias (1 feature) This feature involves bias cues extracted from the NPOV Wikipedia corpus [Recasens, Danescu-Niculescu-Mizil, and Jurafsky2013], e.g., provide (in E1), and controversial words such as abortion and execute.
Modals (1 feature) can change certainty (e.g., will or can), make an offer (e.g., shall), ask permission (e.g., may), or express an obligation or necessity (e.g., must).
Negation (1 feature) cues are used to deny or make negative statements, e.g., no, never.
Subjectivity (2 features) is used when a question is answered with personal opinions and feelings. There are two types of subjectivity cues: strong and weak. For example, in E1, nice and interesting are strong subjectivity cues, while qualified is a weak one.
Sentiment cues (2 features) We use positive and negative sentiment cues to model the attitude, thought, and emotions of the person answering. For example, in E1, nice, interesting and qualified are positive cues.
The above cues are about single words. We further generate multi-word cues by combining implicative, assertive, factive and report verbs with first person pronouns (I/we), modals and strong subjective adverbs, e.g., I/we+verb (e.g. “I believe”), I/we+adverb+verb (e.g., “I certainly know”), I/we+modal+verb (e.g., “we could figure out”) and I/we+modal+adverb+verb (e.g., “we can obviously see”).
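The single-word and multi-word cues above can be illustrated with a small matcher. This is only a sketch under stated assumptions: the tokenizer is a simple regular expression, the lexicons passed in are tiny hypothetical stand-ins for the published cue lists, and whether the original features are raw or normalized counts is not specified in the text.

```python
import re

FIRST_PERSON = {"i", "we"}

def tokenize(text):
    """Lowercase and split into alphabetic tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def cue_count(tokens, lexicon):
    """Number of tokens that appear in a single-word cue lexicon."""
    return sum(1 for t in tokens if t in lexicon)

def multiword_cues(tokens, verbs, modals, adverbs):
    """Match I/we + optional modal + optional adverb + cue verb."""
    found = []
    for i, tok in enumerate(tokens):
        if tok not in FIRST_PERSON:
            continue
        j = i + 1
        if j < len(tokens) and tokens[j] in modals:
            j += 1
        if j < len(tokens) and tokens[j] in adverbs:
            j += 1
        if j < len(tokens) and tokens[j] in verbs:
            found.append(" ".join(tokens[i:j + 1]))
    return found
```

The four patterns in the text (I/we+verb, I/we+adverb+verb, I/we+modal+verb, I/we+modal+adverb+verb) are all covered by treating the modal and the adverb as optional slots.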
Credibility (31 features) We use features that have been previously proposed for credibility detection [Castillo, Mendoza, and Poblete2011]: number of URLs/images/emails/phone numbers; number of tokens/sentences; average number of tokens; number of 1st/2nd/3rd person pronouns; number of positive/negative smileys; number of single/double/triple exclamation/interrogation symbols. To this set, we further add number of interrogative sentences; number of nouns/verbs/adjectives/adverbs/pronouns; and number of words not in word2vec’s Google News vocabulary (such OOV words could signal slang, foreign language, etc.).
Google News embeddings (300 features) We use the pre-trained, 300-dimensional embedding vectors that Mikolov, Yih, and Zweig (2013) trained on 100 billion words from Google News. We compute a vector representation for an answer by simply averaging the embeddings of the words it contains.
Qatar Living embeddings (100 features) We also use 100-dimensional word embeddings from [Mihaylov and Nakov 2016], trained on all Qatar Living.
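Averaging word embeddings into an answer vector can be sketched as below. The lookup table `vectors` is a hypothetical dictionary from word to list of floats, and skipping out-of-vocabulary words (returning a zero vector if nothing is known) is an assumption; the text does not say how such words are handled.

```python
def average_embedding(tokens, vectors, dim):
    """Mean of the embeddings of the in-vocabulary tokens of an answer."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # Assumption: fall back to a zero vector when no token is known.
        return [0.0] * dim
    return [sum(v[k] for v in known) / len(known) for k in range(dim)]
```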
Following Karadzhov et al. (2017), we tried to verify whether an answer’s claim is true by searching for support on the Web. We started with the concatenation of an answer to its question. Then, following Potthast et al. (2013), we extracted nouns, verbs and adjectives, sorted by TF-IDF (IDF computed on Qatar Living). We further extracted and added the named entities from the text and we generated a query of 5-10 words. If we did not obtain ten results, we dropped some terms from the query and we tried again.
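A rough sketch of this query generation step follows. It assumes the input tokens have already been restricted to nouns, verbs, and adjectives by some part-of-speech tagger, and that named entities come from a separate recognizer; the smoothed IDF formula, the placement of entities at the front of the query, and the rule of dropping the lowest-ranked term when relaxing are all illustrative assumptions.

```python
import math
from collections import Counter

def build_query(tokens, doc_freq, n_docs, entities=(), max_terms=10):
    """Rank content words by TF-IDF and prepend named entities."""
    tf = Counter(tokens)

    def tfidf(term):
        # Smoothed IDF; doc_freq holds document frequencies from the forum.
        return tf[term] * math.log((1 + n_docs) / (1 + doc_freq.get(term, 0)))

    ranked = sorted(tf, key=lambda t: (-tfidf(t), t))
    # dict.fromkeys removes duplicates while keeping order.
    query = list(dict.fromkeys(list(entities) + ranked))
    return query[:max_terms]

def relax(query):
    """Drop the lowest-ranked term, for retrying when results are too few."""
    return query[:-1]
```

In use, one would call `build_query` once and then apply `relax` repeatedly until the search engine returns enough results or the query becomes too short.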
Support from the Web (180 features): We automatically queried Bing and extracted features from the resulting webpages, excluding those that are not related to Qatar. We also experimented with Google and with the aggregation of Bing and Google, with slightly worse results. In particular, we calculated similarities: (i) cosine with TF-IDF weighting, (ii) cosine using Qatar Living embeddings, and (iii) containment [Lyon, Malcolm, and Dickerson 2001].
We calculated these similarities between, on the one hand, (i) the question or (ii) the answer or (iii) the question-answer pair, vs. on the other hand, (a) the snippets or (b) the web pages. In order to calculate the similarity against a webpage, we first converted that webpage into a list of rolling sentence triplets. Then we calculated the score of the Q/A/Q-A vs. this triplet, and finally we took the average and also the maximum similarity over these triplets. Now, as we had up to ten Web results, we further took the maximum and the average over all the above features over the returned Qatar-related pages.
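The triplet-based page scoring can be sketched as follows, using containment as the similarity. This is an illustration under assumptions: sentences are given as token lists, containment is computed over word n-grams with a configurable `n` (unigrams by default here), and pages with fewer than three sentences are treated as a single window.

```python
def ngrams(tokens, n):
    """Set of word n-grams of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def containment(query_tokens, passage_tokens, n=1):
    """Fraction of the query's n-grams that also occur in the passage."""
    q = ngrams(query_tokens, n)
    if not q:
        return 0.0
    return len(q & ngrams(passage_tokens, n)) / len(q)

def rolling_triplets(sentences):
    """Concatenate each window of three consecutive sentences."""
    if len(sentences) <= 3:
        return [sum(sentences, [])]
    return [sum(sentences[i:i + 3], []) for i in range(len(sentences) - 2)]

def page_similarity(query_tokens, sentences, sim=containment):
    """Average and maximum similarity over all rolling triplets of a page."""
    scores = [sim(query_tokens, t) for t in rolling_triplets(sentences)]
    return sum(scores) / len(scores), max(scores)
```

The same `page_similarity` wrapper could take a cosine function as `sim`, and the per-page pairs would then be aggregated again (maximum and average) over the returned pages.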
We created three copies of each feature, depending on whether it came (i) from a reputed source (e.g., news, government websites, official sites of companies), (ii) from a forum-type site (forums, reviews, social media), or (iii) from some other type of websites.
Support from the current thread (3 features): We use the cosine similarity between an answer-vector and a thread-vector of all Good answers, using the Google News embeddings and the Qatar Living embeddings. The idea is that if an answer is similar to other answers in the thread, it is more likely to be true. To this, we add a feature for the reciprocal rank of the answer in the thread, assuming that more recent answers are more likely to be up-to-date and factually true.
Support from all of Qatar Living (60 features) We further collect supporting evidence from all threads in the Qatar Living forum. We use a search engine as for the external evidence features above, but this time we limit the search to the Qatar Living forum only.
Support from high-quality posts in Qatar Living (10 features) Among the 60,000 active users of the Qatar Living forum, there is a community of 38 trusted users who have written 5,230 high-quality articles on topics that attract a lot of interest, e.g., visas, work legislation, etc. We try to verify the answers against these high-quality posts. (i) Since an answer can combine both relevant and irrelevant information with respect to its question, we first generate a query as explained above for each Q&A. (ii) We then compute cosines between the query and the sentences in the high-quality posts, and we select the n-best matches. (iii) Finally, we compute textual entailment scores [Kouylekov and Negri 2010] for the answer given the n-best matches, which we then use as features.
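Step (ii), the retrieval of the best-matching sentences, can be sketched as below. The sketch uses bag-of-words cosine for simplicity, which is an assumption (the vector representation actually used may differ), and it leaves out step (iii), since the entailment scores come from an external textual entailment tool.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def n_best(query_tokens, sentences, n=10):
    """Indices of the n sentences most similar to the query."""
    q = Counter(query_tokens)
    scored = [(cosine(q, Counter(s)), i) for i, s in enumerate(sentences)]
    # Sort by descending similarity; ties keep the original sentence order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:n]]
```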
|Rank||Feature Group / System||Acc||P||R||F1||MAP|
|2||Support from the Web||63.45||59.59||89.84||71.65||67.71|
|1||Support from all of Qatar Living||65.46||66.41||66.41||66.41||83.97|
|4||Support from high-quality posts in Qatar Living||60.24||61.60||60.16||60.87||74.50|
|7||Support from the current thread||53.41||53.53||71.09||61.07||64.15|
|3||Linguistic bias, subjectivity and sentiment||60.64||60.42||67.97||63.97||78.81|
|10||User posts categories||42.57||46.67||82.03||59.49||68.50|
|11||User posts quality||28.92||31.01||31.25||31.13||67.43|
|Optimizing for Accuracy||72.29||70.63||78.91||74.54||74.32|
|Optimizing for MAP||69.88||70.87||70.31||70.59||86.54|
|Credibility [Castillo, Mendoza, and Poblete2011]||56.23||56.21||67.19||61.21||64.92|
|All Positive (majority class)||51.41||51.41||100.00||67.91||—|
|Thread order (chronological)||—||—||—||—||63.75|
We train an SVM classifier [Joachims 1999] on the 249 examples as described above, where each example is one question–answer pair. For the evaluation, we use leave-one-thread-out cross validation, where each time we exclude and use for testing one of the 71 questions together with all its answers. We do so in order to respect the structure of the threads when splitting the data. We report Accuracy, Precision, Recall, and F1 for the classification setting. We also calculate Mean Average Precision (MAP).
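The evaluation protocol can be sketched as follows. The helper names are hypothetical, the classifier itself is left out, and two details are assumptions: answers within a thread are ranked by classifier score before computing average precision, and a thread with no positive answers contributes an average precision of zero.

```python
from collections import defaultdict

def leave_one_thread_out(thread_ids):
    """Yield (thread, train indices, test indices), one fold per thread."""
    by_thread = defaultdict(list)
    for idx, tid in enumerate(thread_ids):
        by_thread[tid].append(idx)
    for tid, test_idx in by_thread.items():
        held_out = set(test_idx)
        train_idx = [i for i in range(len(thread_ids)) if i not in held_out]
        yield tid, train_idx, test_idx

def average_precision(ranked_labels):
    """AP for one thread; labels are 1/0 in ranked (best-first) order."""
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_average_precision(threads):
    """Mean of the per-thread average precision values."""
    aps = [average_precision(t) for t in threads]
    return sum(aps) / len(aps) if aps else 0.0
```

Splitting by thread keeps all answers to the same question on the same side of the train/test boundary, which is the point of the protocol described above.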
Table 3 shows results for each of the above-described feature groups, further grouped by type of evidence (external, internal, answer-based, or user-related), as well as for ensemble systems and for some baselines.
We can see that the best-performing feature group, both in terms of accuracy and MAP (65.46 and 83.97, respectively), is the one looking for intra-forum evidence based on search for similar answers in Qatar Living. It is closely followed by the feature group looking for external evidence in Qatar-related web sites, excluding Qatar Living, which achieved accuracy of 63.45, and the best overall F1 score of 71.65.
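As a quick consistency check, and assuming that the F column reports the balanced F1 measure (the harmonic mean of precision and recall), the score of the Web-support group can be recomputed from its precision (59.59) and recall (89.84) in Table 3:

```latex
\[
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 59.59 \times 89.84}{59.59 + 89.84}
    \approx 71.65
\]
```

The result agrees with the tabulated value, which supports reading the column as F1.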
Evidence from high-quality posts in Qatar Living ranks 4th with accuracy of 60.24, and support from the current thread only comes 7th with accuracy of just 53.41. These results show the importance of forum-level evidence that goes beyond the target thread and beyond known high-quality posts in the forum.
Answer-related features are the third most important feature family. In particular, linguistic features rank third overall with accuracy of 60.64; this should not be surprising as such features have been shown to be important in previous work [Popat et al.2016]. We can also see the strong performance of using knowledge about the domain in the form of word embeddings trained on Qatar Living, which are ranked 5th with accuracy of 59.44. However, general word embeddings, e.g., those trained on Google News, do not work well: with accuracy of 52.61, they are barely above the majority class baseline, which has an accuracy of 51.41.
The answer content feature family also contains a group of features that have been previously proposed for modeling credibility. This group achieves an accuracy of 56.23, and we also use it as one of the baselines in the bottom of the table. There are two reasons for its modest performance: (i) credibility is different from veracity as the former is subjective while the latter is not, and (ii) these features are generally not strong enough by themselves, as they have been originally proposed to work together with features modeling the user (age, followers, friends, etc.), a target topic, and propagation (spreading tree) on Twitter [Castillo, Mendoza, and Poblete2011].
Interestingly, the feature types about the user profile perform the worst. They are also below the majority class baseline in terms of accuracy; however, they outperform the baselines in terms of MAP. We believe that the poor performance is due to modeling a user based on her activity, posting categories, and goodness (whether she tries to answer the question irrespective of the veracity of the given answer) of her posts in the past, which do not target factuality directly. In future work, we could run our factuality classifier over all of Qatar Living, and we can then characterize a user based on our predicted veracity of his/her answers.
The bottom of the table shows the results for two ensemble systems that combine the above feature groups, yielding accuracy of 72.29 (about 21 points of improvement over the majority class baseline of 51.41, absolute) and MAP of 86.54 (23 points of improvement over the chronological baseline, absolute). These results indicate that our system might already be usable in real applications.
|Q: does anyone know if there is a french speaking nursery in doha?|
|A: there is a french school here. don’t know the ages but my neighbor’s 3 yr old goes there…|
|Best Matched Sentence for Q&A: there is a french school here.|
|35639076||15||1||10||the pre-school follows the english program but also gives french and arabic lessons.|
|32448901||4||2||11||france bought the property in 1952 and since 1981 it has been home to the french institute.|
|31704366||7||3||1||they include one indian school, two french, seven following the british curriculum…|
|27971261||6||4||4||the new schools include six qatari, four indian, two british, two american and a finnish…|
|Q: Hi; Just wanted to confirm Qatar’s National Day. Is it 18th of December? Thanks.|
|A: yes; it is 18th Dec.|
|Query generated from Q&A: "National Day" "Qatar" National December Day confirm wanted|
|Source||Qatar-related||Source type||Snippet|
|qppstudio.net||No||Other||Public holidays and national … the world’s source of Public holidays information|
|dohanews.co||Yes||Reputed||culture and more in and around Qatar … The documentary features human interest pieces that incorporate the day-to-day lives of Qatar residents|
|iloveqatar.net||Yes||Forum||Qatar National Day - Short Info … the date of December 18 is celebrated each year as the National Day of Qatar…|
|cnn.com||No||Reputed||The 2022 World Cup final in Qatar will be held on December 18 … Qatar will be held on December 18 – the Gulf state’s national day. Confirm. U.S …|
|icassociation.co.uk||No||Other||In partnership with ProEvent Qatar, ICA can confirm that the World Stars will be led on the 17 December, World Stars vs Qatar Stars - Qatar National Day.|
As explained above, we use a three-step approach to extract supporting evidence from the high-quality posts, namely query generation (Step 1), evidence retrieval using vector-based similarity (Step 2), and re-ranking based on entailment (Step 3). We conducted an ablation experiment in order to investigate the individual contribution of steps 1 and 3. We considered the following settings:
Full system. All features from the three steps are used.
No re-ranking. Only steps 1 and 2 are applied.
No query generation. The entire answer is used to extract evidence instead of using the generated query, i.e., only steps 2 and 3 are applied.
No query generation and no re-ranking. Only step 2 is applied. As in the previous setting, the entire answer is used to retrieve evidence.
The results confirmed (i) the importance of generating a good query: discarding step 1 yields sizable drops in performance, of 12 and 4 accuracy points in the two comparisons between settings that differ only in query generation; and (ii) the importance of re-ranking based on textual entailment: discarding step 3 yields decreases of 11 and 3 accuracy points in the two comparisons between settings that differ only in re-ranking. More detailed results are omitted for the sake of brevity.
Table 4 illustrates the effect of the entailment-based re-ranking (step 3). It shows a question (Q), an answer to verify (A), and the top-4 supporting sentences retrieved by our system, sorted according to the entailment-based re-ranking scores. The table also shows the ranking for the same sentences using vector-based similarity (i.e., without applying step 3). We can see that using re-ranking yields better results. For example, the first piece of support in the entailment-based ranking is the best overall, while the same sentence is ranked 10th by vector-based similarity. Moreover, the top-ranked evidence by vector-based similarity, although clearly pertinent, is not better than the best one in the entailment-based ranking.
We further investigated the effectiveness of the linguistic features. The experimental results show that the top-5 linguistic features are (in this order) strong subjectivity cues, implicatives, modals, negatives, and assertives.
The query generated from the question–answer pair provides enough context for a quality Web search. The results returned by the search engine are mostly relevant, which indicates that the query generation works well. More importantly, as Table 5 shows, the results returned by the search engine are relevant with respect to both the query and the question–answer pair. Note also that, as expected, the results that are Qatar-related and also from a reputed or a forum source tend to be generally more relevant.
We have explored a new dimension in the context of community question answering, which has been ignored so far: checking the veracity of forum answers. As this is a new problem we created CQA-QL-2016-fact, a specialized dataset which we are releasing freely to the research community. We further proposed a novel multi-faceted model, which captures information from the answer content (what is said and how), from the author profile (who says it), from the rest of the community forum (where it is said), and from external authoritative sources of information (external support). The evaluation results have shown very strong performance.
In future work, we plan to extend our dataset with additional examples. We would also like to try distant supervision based on known facts, e.g., from high-quality posts, which would allow us to use more training data, thus enabling more sophisticated learning architectures, e.g., based on deep learning. We also want to improve user modeling, e.g., by predicting factuality for the user’s answers and then building a user profile based on that. Finally, we want to explore the possibility of providing justifications for the verified answers and to integrate our system in a real application.
This research is developed by the Arabic Language Technologies (ALT) group at Qatar Computing Research, HBKU in collaboration with MIT-CSAIL. It is part of the Interactive sYstems for Answer Search (Iyas) project.
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, 687–693.
Joachims, T. 1999. Making large-scale support vector machine learning practical. In Schölkopf, B.; Burges, C. J. C.; and Smola, A. J., eds., Advances in Kernel Methods. 169–184.
Proceedings of the AAAI Conference on Artificial Intelligence, 1616–1617.