Answering Comparative Questions: Better than Ten-Blue-Links?

by Matthias Schildwächter, et al.

We present CAM (comparative argumentative machine), a novel open-domain IR system to argumentatively compare objects with respect to information extracted from the Common Crawl. In a user study, the participants obtained 15 accurate answers using CAM compared to a "traditional" keyword-based search and were 20



1. Introduction

Everyone faces choice problems on a daily basis. Besides choosing what to wear or what to have for lunch, people compare all kinds of options: cameras to buy, universities to study at, or even programming languages to use. Question answering platforms like Quora, Reddit, or StackExchange are packed with comparative questions like “How does X compare to Y with respect to Z?”. An informed choice then is often based on an objective argumentation why to favor one of the candidates (e.g., comparing important aspects).

Specific product comparison systems, such as Check24, allow comparing any subset of objects in narrow domains such as cameras. Other systems like WolframAlpha aim at providing comparative functionality across domains, but often rely only on a (limited) structured database while ignoring the rich textual content available on the web. Somewhat surprisingly, no system is currently able to satisfy comparative information needs for the general domain with sufficient coverage and explanations: none supports comparisons across a broad range of object types with arguments about relative qualities, let alone objective arguments about the best choice. Web search engines directly answer many factoid questions but do not treat comparative questions any differently, returning only the default search results. Advanced question answering systems, such as IBM’s Watson (Ferrucci et al., 2010), answer factoid questions very well, but do not really handle the comparative questions of everyday users.

We present CAM (comparative argumentative machine), a system that aims at solving the shortcomings mentioned above. CAM is a tool for answering comparative general-domain questions based on information extracted from the web-scale Common Crawl.

2. Related Work

Commercial systems like GoCompare offer high-precision comparison capabilities based on well-curated structured data sources focusing on single domains. But their low coverage outside their focus domains rules out answering most of the comparative questions found on portals like Quora or Yahoo! Answers, which themselves form a good source of (argumentative) comparisons and results.

Previous text-based comparison approaches have mostly focused on the biomedical domain. Fiszman et al. (2007) collected sentences comparing drug therapies using manually crafted patterns to recognize the subjects of comparison and the comparison direction. They reached a very high precision at moderate recall. On a set of full-text articles on toxicology, Park and Blake (2012) succeeded in training a highly precise Bayesian Network for identifying comparative sentences relying on lexical clues and dependency parsers. More recently, Gupta et al. (2017) described a system based on manually collected patterns on the basis of lexical matches and dependency parses in order to identify comparison targets and to classify the type of comparison into the classes given by Jindal and Liu (2006): gradable vs. non-gradable and superlative comparisons.

Building a general-domain argumentative comparison facility comes with the additional challenge of argument mining from user-generated content (Snajder, 2017). Such text is typically noisy, lacks argument structures, and contains poorly formulated claims. On the other hand, the specialized jargon and idiosyncrasies of a platform can be utilized (Dusmanu et al., 2017), e.g., hashtags for mining argumentative tweets.

Aker et al. (2017) confirmed the findings of Stab and Gurevych (2014) that information from dependency parsers does not help to find the (general) argument structure in persuasive essays and Wikipedia articles, while simpler structural features such as punctuation are more effective. Daxenberger et al. (2017) noted that claims across different domains share lexical clues and further stated that current datasets are too small for recent DNN-based classifiers, so that argument mining still resorts to traditional feature engineering.

Some argument mining systems work on larger corpora of user-generated content to find the most relevant argument for a given claim (Hua and Wang, 2017) or to oppose different argumentative viewpoints (Wachsmuth et al., 2017). Web-scale systems for comparing query results (Sun et al., 2006) or for retrieving single arguments matching a user query (Stab et al., 2018) form the inspiration for our new CAM system (comparative argumentative machine).

3. The CAM System Design

To ensure a wide coverage, a comparative answer of our CAM system for two objects is based on argumentative structures extracted from web-scale text resources. The system looks for textual structures asserting that one of the compared objects is superior to the other, that they are equal, or that they are not comparable. A comparison of two objects $o_1$ and $o_2$ in the CAM sense is defined as “$o_1$ ? $o_2$ w.r.t. $A$”, where ? is in $\{>, <, =, \neq\}$ and $A$ is the set of comparison aspects of $o_1$ and $o_2$. We thus focus on mining sentences like “Python=$o_1$ is better than Matlab=$o_2$ for web development=$A$.”

Figure 1. Design of the CAM system.

The design of our CAM system is shown in Figure 1. It consists of the following generic stages, which are described in detail below: (1) retrieval of relevant sentences, (2) classification of comparative sentences, (3) ranking of the comparative sentences, (4) extraction of object aspects, and (5) presentation of the answer.

3.1. Sentence Retrieval

Our CAM system uses an Elasticsearch full text index of a linguistically pre-processed corpus (Panchenko et al., 2018) containing 14.3 billion English sentences from the Common Crawl. To retrieve textual argumentative structures relevant to a comparative user input, the index is queried for sentences matching the input objects and containing comparison aspects; sentences without aspects are used as a fall-back. Questions are removed from the initial retrieval results since they usually do not help in returning an argumentative answer.
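The retrieval step can be illustrated with a minimal sketch that only builds an Elasticsearch query body and filters out questions. The field name `text`, the result size, and the helper names are assumptions made for illustration; the actual index layout of CAM is not described here.

```python
def build_query(obj_a, obj_b, aspects=(), size=500):
    """Build an Elasticsearch bool query body.

    Assumption: sentences are stored in a field called "text".
    Both objects must occur; if aspects are given, at least one must occur.
    """
    bool_query = {
        "must": [
            {"match_phrase": {"text": obj_a}},
            {"match_phrase": {"text": obj_b}},
        ]
    }
    if aspects:
        bool_query["should"] = [{"match_phrase": {"text": a}} for a in aspects]
        bool_query["minimum_should_match"] = 1
    return {"size": size, "query": {"bool": bool_query}}


def drop_questions(sentences):
    """Remove sentences that end with a question mark."""
    return [s for s in sentences if not s.rstrip().endswith("?")]


with_aspect = build_query("python", "matlab", ["speed"])
fallback = build_query("python", "matlab")  # fall-back query without aspects
```

Calling `build_query` without aspects yields the fall-back query; the resulting dictionaries could be passed as the request body of a search call.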

Figure 2. CAM answer presentation for the question “Is Python faster than Matlab?”. Pro and con sentences are shown.
Figure 3. CAM input form.

3.2. Sentence Classification

We use a classifier to distinguish between four classes: the first object from the user input is better / equal / worse than the second one ($>$ / $=$ / $<$) w.r.t. a comparison aspect, or no comparison is found ($\neq$). The classifier uses the text between both objects to identify the “polarity” and is inspired by the best model reported by Franzek et al. (2018): XGBoost (Chen and Guestrin, 2016) using word frequencies as representations, which achieves a high F1 score of 0.92 for $\neq$, a good F1 of 0.74 for $>$ but a rather low F1 of 0.46 for $<$. We identified missing negation handling as the main issue, for which we added a simple keyword-based heuristic to our CAM system that inverts common negations.
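A keyword-based negation heuristic of this kind might look as follows. This is only a sketch: the negation list, the labels `">"` (first object better) and `"<"` (first object worse), and the function names are assumptions, not the rules actually used in CAM.

```python
import re

# Assumed list of negation cues; the real keyword list is not given.
NEGATIONS = {"not", "no", "never", "cannot"}
FLIP = {">": "<", "<": ">"}


def text_between(sentence, first, second):
    """Return the lower-cased text between the two object mentions, or None."""
    low = sentence.lower()
    i, j = low.find(first.lower()), low.find(second.lower())
    if i < 0 or j < 0 or i >= j:
        return None
    return low[i + len(first):j]


def apply_negation(label, sentence, first, second):
    """Invert a better/worse label if a negation cue occurs between the objects."""
    span = text_between(sentence, first, second)
    if span is None or label not in FLIP:
        return label
    tokens = re.findall(r"[a-z']+", span)
    negated = any(t in NEGATIONS or t.endswith("n't") for t in tokens)
    return FLIP[label] if negated else label


flipped = apply_negation(">", "Python is not better than Matlab", "Python", "Matlab")
```

Here a classifier output of `">"` for “Python is not better than Matlab” is inverted because “not” appears between the two objects; labels other than better or worse are left unchanged.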

3.3. Sentence Ranking and Object Comparison

To rank comparative sentences (category $>$ or $<$, i.e., better or worse), we score them with a heuristic that combines the classifier confidence and the Elasticsearch score. The heuristic takes into account the Elasticsearch score of the sentence, the maximum Elasticsearch score of any comparative sentence retrieved for the user input, and whether the user-specified aspect is present in the sentence. For the aspect boost, the weights are specifiable in the user interface. Confidently classified sentences obtain a boost, while the scores of low-confidence sentences are decreased by a constant factor; the corresponding parameters were fixed in our experiments.
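To make the interplay of these ingredients concrete, the following sketch shows one possible scoring function. It is not the formula used in CAM: the additive form, the confidence threshold of 0.8, and the damping factor of 10 are placeholder assumptions chosen only for illustration.

```python
def sentence_score(es_score, es_max, confidence, aspect_weight=0.0,
                   threshold=0.8, damping=10.0):
    """Score one comparative sentence (assumed form, placeholder parameters).

    es_score:      Elasticsearch score of the sentence
    es_max:        maximum Elasticsearch score among the retrieved sentences
    confidence:    classifier confidence in [0, 1]
    aspect_weight: user-chosen weight if the aspect is present, else 0
    """
    score = es_score + aspect_weight * es_max  # assumed aspect boost
    if confidence >= threshold:
        return score + es_max                  # assumed boost for confident sentences
    return score / damping                     # assumed damping for uncertain ones


def rank(sentences):
    """Sort (text, es_score, confidence, aspect_weight) tuples by score."""
    if not sentences:
        return []
    es_max = max(s[1] for s in sentences)
    return sorted(
        sentences,
        key=lambda s: sentence_score(s[1], es_max, s[2], s[3]),
        reverse=True,
    )


ranked = rank([("a", 1.0, 0.9, 0.0), ("b", 4.0, 0.3, 0.0)])
```

In this toy example the confidently classified sentence "a" is ranked above "b" although "b" has the higher Elasticsearch score.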

For scoring a CAM output “$o_1$ ? $o_2$ w.r.t. $A$” for two objects $o_1$, $o_2$ and aspects $A$, we sum up the scores of all sentences supporting the statement. To this end, we have developed a heuristic that includes both directions of comparison, taking into account that a statement like “Python is better than Matlab” (class $>$) is also supported by “Matlab is worse than Python” (class $<$); the important factors are the object ordering and the polarity.
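The two-directional aggregation can be sketched as follows. The tuple layout and the labels `">"` (the first-mentioned object is better) and `"<"` (the first-mentioned object is worse) are assumptions for illustration.

```python
def aggregate(scored_sentences, first, second):
    """Sum sentence scores per winning object.

    scored_sentences: iterable of (object_a, object_b, label, score), where
    object_a is the object mentioned first in the sentence and label is
    ">" (object_a better) or "<" (object_a worse).
    """
    totals = {first: 0.0, second: 0.0}
    for obj_a, obj_b, label, score in scored_sentences:
        if {obj_a, obj_b} != {first, second} or label not in (">", "<"):
            continue  # skip unrelated or non-comparative sentences
        winner = obj_a if label == ">" else obj_b
        totals[winner] += score
    return totals


totals = aggregate(
    [
        ("python", "matlab", ">", 3.0),  # "Python is better than Matlab"
        ("matlab", "python", "<", 2.0),  # "Matlab is worse than Python"
        ("matlab", "python", ">", 1.5),  # "Matlab is better than Python"
    ],
    "python",
    "matlab",
)
```

Both the first and the second sentence count towards Python, since object ordering and polarity together determine the supported statement.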

3.4. Aspect Extraction

In addition to user-specified comparison aspects, CAM generates up to ten supplementary aspects (even when no comparison aspect at all was provided by the user). We use three different methods for aspect mining: (1) searching for comparative adjectives and adverbs; (2) searching for phrases with comparative adjectives / adverbs and a preposition like to, for, etc. (e.g., “quicker to develop code” or “better for scientific computing”); (3) searching for specific hand-crafted patterns like “because of higher speed”, “since it has more options”, “as we have proven its resilience” or “reason for this is the price”. An extracted aspect is assigned to the object with the higher co-occurrence frequency (cf. Figure 2 for examples).
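The second mining method (a comparative adjective or adverb followed by a preposition) can be approximated with a regular expression. This is a deliberately crude sketch: it guesses comparatives from surface forms (“better”, “worse”, “more/less X”, words ending in “-er”) instead of part-of-speech tags, so it will also match non-comparative words such as “other”; the pattern and the three-word limit are assumptions.

```python
import re

# Surface-form approximation of comparative adjectives/adverbs (assumption).
COMPARATIVE = r"(?:better|worse|(?:more|less)\s+\w+|\w+er)"
# Comparative + "to"/"for" + up to three following words, stopping at a conjunction.
PATTERN = re.compile(
    rf"\b({COMPARATIVE})\s+(to|for)\s+((?:(?!(?:and|or|but)\b)\w+\s?){{1,3}})",
    re.IGNORECASE,
)


def extract_aspects(sentence):
    """Return candidate aspect phrases such as 'better for scientific computing'."""
    return [" ".join(m.group(1, 2, 3)).strip() for m in PATTERN.finditer(sentence)]


aspects = extract_aspects(
    "Python is quicker to develop code and better for scientific computing."
)
```

For the example sentence, the sketch yields the two phrases “quicker to develop code” and “better for scientific computing”.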

3.5. CAM User Interface

The user interface consists of a question input form (Figure 3) and an answer presentation component (Figure 2). The input interface allows the user to submit a comparative question in the form of two compared objects and their aspects. The answer presentation summarizes the sentences retrieved from the Common Crawl, providing decision-making support for an informed choice.

Input Form

The input form is divided into three parts (cf. Figure 3). On the top, the user enters two comparison target objects. In the middle, the interface allows adding an arbitrary number of aspects and assigning each a weight indicating its importance (1 to 5; used to boost the scores of the sentences containing the aspect). On the bottom, one of three search models can be selected: Default is based on keyphrases like “better than” or “faster than” to find comparative sentences; BoW is built upon the word frequency-based XGBoost classifier described above; and Infersent uses sentence embeddings. The Faster Search option limits the number of queried fall-back sentences to 500 in order to speed up the answer construction.

Answer Presentation

At the top of the comparative answer presentation (cf. Figure 2), different score bars are given. The overall score distribution gives the user a general impression of the answer to the entered comparative question, while the aspect-specific score bars show the distribution for the individual user-specified aspects.

Additionally, up to ten automatically generated aspects are presented in a clickable manner to allow the user to only display result sentences for such aspects (disjunctive filter interpretation). The user-specified aspects are used on both result sentence sides while the generated aspects only filter the corresponding column.

The objects in the displayed sentences are highlighted with their respective colors from the score bars, while the aspect highlighting uses different colors. Clicking on a result sentence reveals its Common Crawl context—by default the sentences around it, with the possibility of expanding to the whole original document.

4. Evaluation

We compare our new CAM system to a keyword-based search in two user studies with 14 and 9 participants on 34 comparison topics.

4.1. Experimental Setup

The 34 topics (two compared objects + one aspect) for our studies were created from comparative Quora questions containing the phrase “better than” and also being present on comparison pages. For each topic, we manually double-checked that the underlying corpus of our CAM system and the keyword-based search (i.e., the 14.3 billion Common Crawl sentences) allows answering the comparison; we only included topics with at least 20 supporting sentences. One of the topics, for instance, has mp3 and wma as objects and compression ratio as the aspect. Given the ground truth answer worse, a study participant should answer that mp3 is worse than wma with respect to compression ratio. To clarify potential ambiguities or subjectivities, descriptions for the participants were added to the topics (e.g., to inform a potential music aficionado who might claim a worse compression rate being better since it might come with an improved sound quality).

In the Group A setup, we focus on the question of whether users are faster at answering correctly when using the CAM system. A G*Power analysis (Faul et al., 2007) yielded a required sample size of 272 comparisons to be able to measure a statistically significant difference in answer times. We thus decided to engage 14 participants on all 34 topics (476 comparisons). Each participant used both experimental systems alternatingly (CAM / keyword-based search). Since every participant should work on each topic just once with one of the systems but not the other, we randomly split the 34 topics into two groups (one for each system). To avoid any order bias, the topics of each group were presented in random order.
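The topic assignment described above can be sketched in a few lines. Whether the split was drawn once or per participant is not stated, so the per-call split below, the equal group sizes, and the system labels are assumptions.

```python
import random


def assign_topics(topics, seed=None):
    """Randomly split topics into two equally sized groups, one per system.

    Shuffling also randomizes the presentation order within each group.
    """
    rng = random.Random(seed)
    shuffled = list(topics)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"CAM": shuffled[:half], "keyword": shuffled[half:]}


plan = assign_topics(range(1, 35), seed=42)  # 34 topic ids, hypothetical seed
```

Each topic ends up in exactly one of the two groups, so a participant never sees the same topic with both systems.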

The participants were informed that they should give an answer as quickly as possible; the whole study took about one hour per participant and ended with a questionnaire. We measured the time for different phases (e.g., the time needed to enter a query and the time needed to determine an answer) and the correctness of a participant’s answer with respect to manually derived gold labels.

In the Group B setup, we focused on collecting some more “natural” feedback using a less forced study environment. We had 9 different participants (not from Group A) who were allowed to just “play” with the systems for any and as many of our 34 comparison topics as they liked. In total, 85 comparisons were performed.

4.2. Study Participants

Among the 14 Group A participants, 9 were male (5 female), 13 indicated 18–24 as their age (1 was in the 25–34 range), 8 participants had an Engineering & Computer Science background (3 from Arts, Culture & Entertainment, 1 from Law & Public Policy, 2 selected “other”) with 9 having a Bachelor’s degree. The participants characterized themselves as having a proficient (nine) or intermediate (five) English level. Seven participants stated to use comparison websites rarely or never (once a year or less), whereas five used them once a month and two even once a week.

Group B consisted of five female and four male participants: one participant was 13–17 years old, two fell in the 18–24 age range, five in the 25–34 range, and one in the 35–44 range. This group was dominated by an Engineering & Computer Science background (five out of nine); the others had a background in Education, Business, Arts, Culture & Entertainment, and “other” (one each). Four participants already had a Master’s degree, two were students, one had a Bachelor’s degree, one had a doctorate degree, and one selected “other”. Six participants rated their English level as proficient and three as intermediate. Five participants stated they used comparison websites rarely or never, whereas two used them once a month and two even once a week or more.

4.3. Results and Discussion

A Shapiro-Wilk test (Shapiro and Wilk, 1965) verified the visual assumption of a log-normal distribution of the different measured times for CAM usage and keyword-based search. Therefore, t-tests were used to check whether the null hypothesis of same answer determination times or same total times can be rejected.
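Since log-normally distributed times become normally distributed after taking logarithms, a t-test can be applied to the log-transformed values. The sketch below computes Welch's t statistic with the standard library only; the choice of Welch's variant is an assumption (the exact test variant is not specified), and the times are invented purely for illustration.

```python
import math
from statistics import mean, variance


def log_times(seconds):
    """Log-transform measured times (log-normal data becomes normal)."""
    return [math.log(s) for s in seconds]


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    va = variance(sample_a) / len(sample_a)
    vb = variance(sample_b) / len(sample_b)
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (
        va ** 2 / (len(sample_a) - 1) + vb ** 2 / (len(sample_b) - 1)
    )
    return t, df


# Invented example times in seconds, not study data.
cam_times = [10.0, 12.0, 11.0, 9.0]
keyword_times = [20.0, 22.0, 21.0, 25.0]
t_stat, dof = welch_t(log_times(cam_times), log_times(keyword_times))
```

A negative statistic indicates shorter times in the first sample; a p-value would then be read from the t distribution with the returned degrees of freedom, for example with a statistics library.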

Figure 4 shows the time distributions of Group A. The until typing phase shows that the participants started entering a query about 19% faster with the CAM system (in Group B, the CAM users were even about 25% faster). Typing is the time from the first keystroke until the query is submitted. The Group A participants again were faster with the CAM interface (about 24% on average); the Group B participants needed about twice as long, being slightly faster with the CAM system. The loading phase measures the time the system needs to show the answer (from sending the query until the result is presented). On average, keyword-based search loads slightly faster than CAM since CAM uses a keyword-based search subroutine with some further post-processing.

Most importantly, the time the users need to give their answer (determination in Figure 4) shows that the Group A participants were significantly faster when using the CAM system (about 39% difference). In Group B, the participants were slower in general, but interestingly they were also slightly slower using CAM than keyword-based search. One potential reason is that the participants explored the new CAM interface more even providing verbal feedback during their work (remember that Group B was allowed to “play” with the systems).

For the overall task (total in Figure 4), Group A was significantly faster when using the CAM interface, while the more exploratory Group B was overall slower but with no substantial advantage for either system. Our main focus in Group B was on observing the participants’ behavior, which is why they were allowed to test, play with, and comment on the systems while using them.

Figure 4. Times of question answering phases (Group A).

Besides giving statistically significantly quicker answers, the Group A participants also made fewer errors using the CAM system (cf. Figure 5). The average CAM accuracy in Group A is 95% (9 of 14 participants reached 100%), whereas for the keyword-based search it is 81% (with a best result of 94%). The Group B participants also were more accurate using CAM (84%) than using keyword-based search (75%).

In the evaluation questionnaire, we asked the participants to rate the system features on a scale from 1 (very negative) to 5 (very positive). The question “How convenient was it to use the CAM system?” and the statement “Learning the usage of CAM is…” achieved values between 4 and 5 for both groups, which is very positive. In addition, the participants of both groups on average were almost one point more confident that an answer determined by CAM was correct than for keyword-based search (cf. Figure 6 for Group A; 5 being the highest confidence).

Figure 5. Answer accuracy (Group A).
Figure 6. Responses on the question “How confident are you that the determined answer is correct?” (Group A).

5. Conclusion

Our new CAM system helps users find answers to comparative questions faster and more confidently than a keyword-based search. Moreover, a summary provided in the answer serves to support a decision-making process. While the objects of comparison and the important aspects still have to be stated explicitly, this gives rise to comparative question handling in search engines once respective questions can be identified automatically. A demo of our CAM system is online and available as open source.

This work has been supported by the Deutsche Forschungsgemeinschaft (DFG) within the project “Argumentation in Comparative Question Answering (ACQuA)” (grant BI 1544/7-1 and HA 5851/2-1) that is part of the Priority Program “Robust Argumentation Machines (RATIO)” (SPP-1999).


  • Aker et al. (2017) Ahmet Aker, Alfred Sliwa, Yuan Ma, Ruishen Lui, Niravkumar Borad, Seyedeh Ziyaei, and Mina Ghobadi. 2017. What works and what does not: Classifier and feature analysis for argument mining. In Proceedings of ArgMining@EMNLP 2017. 91–96.
  • Chen and Guestrin (2016) Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. In Proceedings of KDD 2016. 785–794.
  • Daxenberger et al. (2017) Johannes Daxenberger, Steffen Eger, Ivan Habernal, Christian Stab, and Iryna Gurevych. 2017. What is the essence of a claim? Cross-domain claim identification. In Proceedings of EMNLP 2017. 2055–2066.
  • Dusmanu et al. (2017) Mihai Dusmanu, Elena Cabrio, and Serena Villata. 2017. Argument mining on Twitter: Arguments, facts and sources. In Proceedings of EMNLP 2017. 2317–2322.
  • Faul et al. (2007) Franz Faul, Edgar Erdfelder, Albert-Georg Lang, and Axel Buchner. 2007. G* Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods 39, 2 (2007), 175–191.
  • Ferrucci et al. (2010) David A. Ferrucci, Eric W. Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya Kalyanpur, Adam Lally, J. William Murdock, Eric Nyberg, John M. Prager, Nico Schlaefer, and Christopher A. Welty. 2010. Building Watson: An overview of the DeepQA project. AI Magazine 31, 3 (2010), 59–79.
  • Fiszman et al. (2007) Marcelo Fiszman, Dina Demner-Fushman, François-Michel Lang, Philip Goetz, and Thomas C. Rindflesch. 2007. Interpreting comparative constructions in biomedical text. In Proceedings of BioNLP@ACL 2007. 137–144.
  • Franzek et al. (2018) Mirco Franzek, Alexander Panchenko, and Chris Biemann. 2018. Categorization of comparative sentences for argument mining. CoRR abs/1809.06152 (2018).
  • Gupta et al. (2017) Samir Gupta, A. S. M. Ashique Mahmood, Karen Ross, Cathy H. Wu, and K. Vijay-Shanker. 2017. Identifying comparative structures in biomedical text. In Proceedings of BioNLP@ACL 2017. 206–215.
  • Hua and Wang (2017) Xinyu Hua and Lu Wang. 2017. Understanding and detecting supporting arguments of diverse types. In Proceedings of ACL 2017 (Volume 2: Short Papers). 203–208.
  • Jindal and Liu (2006) Nitin Jindal and Bing Liu. 2006. Mining comparative sentences and relations. In Proceedings of AAAI 2006. 1331–1336.
  • Panchenko et al. (2018) Alexander Panchenko, Eugen Ruppert, Stefano Faralli, Simone Paolo Ponzetto, and Chris Biemann. 2018. Building a web-scale dependency-parsed corpus from CommonCrawl. In Proceedings of LREC 2018.
  • Park and Blake (2012) Dae Hoon Park and Catherine Blake. 2012. Identifying comparative claim sentences in full-text scientific articles. In Proceedings of DSSD@ACL 2012. 1–9.
  • Shapiro and Wilk (1965) Samuel Sanford Shapiro and Martin B. Wilk. 1965. An analysis of variance test for normality (complete samples). Biometrika 52, 3/4 (1965), 591–611.
  • Snajder (2017) Jan Snajder. 2017. Social media argumentation mining: The quest for deliberateness in raucousness. CoRR abs/1701.00168 (2017).
  • Stab et al. (2018) Christian Stab, Johannes Daxenberger, Chris Stahlhut, Tristan Miller, Benjamin Schiller, Christopher Tauchmann, Steffen Eger, and Iryna Gurevych. 2018. ArgumenText: Searching for arguments in heterogeneous sources. In Proceedings of NAACL 2018 (Demonstrations). 21–25.
  • Stab and Gurevych (2014) Christian Stab and Iryna Gurevych. 2014. Identifying argumentative discourse structures in persuasive essays. In Proceedings of EMNLP 2014. 46–56.
  • Sun et al. (2006) Jian-Tao Sun, Xuanhui Wang, Dou Shen, Hua-Jun Zeng, and Zheng Chen. 2006. CWS: A comparative web search system. In Proceedings of WWW 2006. 467–476.
  • Wachsmuth et al. (2017) Henning Wachsmuth, Martin Potthast, Khalid Al Khatib, Yamen Ajjour, Jana Puschmann, Jiani Qu, Jonas Dorsch, Viorel Morari, Janek Bevendorff, and Benno Stein. 2017. Building an argument search engine for the web. In Proceedings of ArgMining@EMNLP 2017. 49–59.