“Question answering” (QA) is a deceptively simple description of an incredibly broad range of phenomena. Its original use in the natural language processing (NLP) and information retrieval (IR) literature had a very narrow scope: answering open-domain factoid questions that a person might pose to a retrieval system (Voorhees, 1999; Kwok et al., 2001). As NLP systems have improved, people have started using question answering as a format to perform a much wider variety of tasks, leading to a dilution of the term “question answering”. This dilution is natural: questions are simply a class of sentences that can have arbitrary semantics, so “question answering” per se also has arbitrary scope.
In this paper we aim to give some clarity to what “question answering” is and when it is useful in the current NLP and computer vision literature. Some researchers advocate for only using question answering when the task involves questions that humans would naturally ask in some setting (Kwiatkowski et al., 2019; Yogatama et al., 2019; Clark et al., 2019), while others push for treating every NLP task, even classification and translation, as question answering (Kumar et al., 2016; McCann et al., 2018). Additionally, some in the NLP community have expressed fatigue at the proliferation of question answering datasets, to the point that some conference presentations and paper submissions feel the need to address head-on the complaint of “yet another QA dataset”.
We argue that “question answering” should be considered a format (as opposed to other formats such as slot filling) instead of a phenomenon or task in itself. Question answering is mainly a useful format in tasks that require question understanding, i.e., tasks where understanding the language of the question is a non-trivial part of the task itself (detailed in Section 3). If the questions can be replaced by integer ids without meaningfully altering the nature of the task, then question understanding is not required. Sometimes, question answering is a useful format even for datasets that do not require question understanding, and we elaborate on this in Section 3.3.
We argue that there are three broad motivations for using question answering as a format for a particular task. The first, and most obvious, motivation is (1) to fill human information needs, where the data is naturally formatted as question answering, because a person is asking a question. This is not the only valid use of question answering, however. It is also useful to pose a task as question answering (2) to probe a system’s understanding of some context (such as an image, video, sentence, paragraph, or table) in a flexible way that is easy to annotate, and (3) to transfer learned parameters or model architectures from one task to another (in certain limited cases). We give a detailed discussion of and motivation for these additional uses.
In short: question answering is a format, not a task. Calling a dataset a “QA dataset” is not an informative description of the nature of the task, nor is it meaningful to talk about “question answering datasets” as a cohesive group without some additional qualifier describing what kind of question answering is being performed. The community should not be concerned that many different datasets choose to use this format, but we should be sure that all datasets purporting to be “QA datasets” are in fact reasonably described this way, and that the choice of question answering as a format makes sense for the task at hand.
2 Question Answering is a Format
A question is just a particular kind of natural language sentence. The space of things that can be asked in a natural language question is arbitrarily large. Consider a few examples: “What is 2 + 3?”, “What is the sentiment of this sentence?”, “What’s the fastest route to the hospital?”, and “Who killed JFK?”. It seems ill-advised to treat all of these questions as conceptually similar; just about the only unifying element between them is the question mark. “Question answering” is clearly not a cohesive phenomenon that should be considered as a “task” by NLP researchers.
What, then, is question answering? It is a format, a way of posing a particular problem to a machine, just as classification or natural language inference are formats. The phrase “yet another question answering dataset” is similar in meaning to the phrase “yet another classification dataset”: both question answering and classification are formats for studying particular phenomena. Just as classification tasks differ wildly in their complexity, domain, and scope, so do question answering tasks. Question answering tasks additionally differ in their output types (consider the very different ways that one would provide an answer to the example questions above), so it is not even a particularly unified format. The community should stop thinking of “question answering” as a task and recognize it as a format that is useful in some situations and not in others. The question to ask of any given task is whether posing it in a question answering format is useful.
Question answering is mainly useful in tasks where understanding the question language is itself part of the task. This typically means that the phenomena being queried (i.e., the questions in the dataset) do not lend themselves well to enumeration, because the task is either unbounded or inherently compositional. If every question in the data can be replaced by an integer id without fundamentally changing the nature of the task, then it is usually not useful to pose the task as question answering.
To demonstrate this point, we begin with an extreme example. Some works treat traditional classification or tagging problems as question answering, using questions such as “What’s the sentiment?” or “What are the POS tags?” (Kumar et al., 2016; McCann et al., 2018). In these cases, not only can every question be replaced by a single integer, they can all be replaced by the same integer. There is no meaningful question understanding component in this formulation. This kind of reformulation of classification or tagging as question answering is occasionally useful, but only in rare circumstances when trying to transfer models across related datasets (Section 3.3).
As a less extreme example, consider the WikiReading dataset (Hewlett et al., 2016). In this dataset, a system must read a Wikipedia page and predict values from a structured knowledge base. The type of the value, or “slot” in the knowledge base, can be represented by an integer id. One could also pose this task as question answering, by writing a question template for each slot. These templates, however, are easily memorized by a learning system given enough data, meaning that understanding the language in the templates is not a significant part of the underlying task. The template could be replaced by the integer id of the slot without changing the task; the purpose of the template is largely for humans to understand the example, not the machines (or to enable transfer from a QA dataset; cf. QA-ZRE (Levy et al., 2017), discussed in §3.3). Attempts to have multiple surface realizations of each template do not help here; the system can still memorize template cluster ids. Even when formatted as question answering, we argue this kind of dataset is more appropriately called “slot filling”. Similarly, a dataset with a question template that involves an entity variable (e.g., When was [PERSON] born?) is simply a pairing of an integer id with an entity id, and does not require meaningful question understanding. This is still appropriately called “slot filling”.
Some templated language datasets can be considered “question answering”, however. The CLEVR and GQA datasets (Johnson et al., 2017; Hudson and Manning, 2019), for example, use synthetic questions generated from a complex grammar. While these certainly aren’t natural language questions, the datasets still require question understanding, because the questions are complex, compositional queries, and replacing them with single integer ids misses the opportunity to model the compositional structure and dramatically reduce sample complexity. There is admittedly a fuzzy line between complex slot filling cases with multiple variables and grammar-based templated generation, but we believe the basic principle is still valid: if it is reasonable to solve the problem by assigning each question an id (or an id paired with some variable instantiations), then the task does not require significant question understanding and is likely more usefully posed with a format other than question answering.
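The integer-id test described above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the entity list is assumed to be known in advance, the helper names (`question_to_template`, `assign_template_ids`) are hypothetical, and the example questions are invented. Entity mentions are masked with a placeholder, and each distinct remaining string is assigned an integer id.

```python
import re

def question_to_template(question, entities):
    """Mask every known entity mention so that only the template remains."""
    template = question
    # Longer mentions are replaced first so that a mention contained in
    # another mention does not break the longer one apart.
    for entity in sorted(entities, key=len, reverse=True):
        template = re.sub(re.escape(entity), "[ENTITY]", template)
    return template

def assign_template_ids(questions, entities):
    """Return one integer id per question, plus the template-to-id table."""
    template_ids = {}
    assigned = []
    for question in questions:
        template = question_to_template(question, entities)
        # New templates get the next unused id, in order of first appearance.
        template_ids.setdefault(template, len(template_ids))
        assigned.append(template_ids[template])
    return assigned, template_ids

# Invented slot-filling style questions, for illustration only.
questions = [
    "When was Marie Curie born?",
    "When was Alan Turing born?",
    "Where was Marie Curie born?",
    "Where was Alan Turing born?",
]
entities = ["Marie Curie", "Alan Turing"]

assigned, template_ids = assign_template_ids(questions, entities)
print(assigned)           # [0, 0, 1, 1]
print(len(template_ids))  # 2
```

When a whole dataset collapses to a handful of ids in this way, each question is effectively an (id, entity) pair, which is the slot filling situation described above. Compositional questions such as those generated from a grammar would not be expected to collapse under such a simple masking step.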
3 When QA is useful
In the previous section we argued that question answering is best thought of as a format for posing particular tasks, and we gave a concrete definition for when a task should be called question answering. In this section we move to a discussion of when this format is useful.
There are three very different motivations from which researchers arrive at question answering as a format for a particular task, all of which are potentially useful. The first is when the goal is to fill human information needs, where the end task involves humans naturally asking questions. The second is when the complexity or scope of a task goes beyond the bounds of a fixed formalism, requiring the flexibility of natural language as an annotation and/or querying mechanism. The third is that question answering, or a unified input/output format in general, might be a way to transfer learned knowledge from one task to a related task.
3.1 Filling human information needs
There are many scenarios where people naturally pose questions to machines, wanting to receive an answer. These scenarios are too varied to enumerate, but a few examples are search queries (Dunn et al., 2017; Kwiatkowski et al., 2019), natural language interfaces to databases (Zelle and Mooney, 1996; Berant et al., 2013; Iyer et al., 2017), and virtual assistants (Dahl et al., 1994). In addition to practical usefulness, natural questions prevent some of the biases found in artificial settings, as analyzed by Lee et al. (2019) (though they will naturally have their own biased distribution). These are “natural” question answering settings, and keeping the natural format of the data is an obvious choice that does not need further justification, so we will not dwell on this section. The danger is to think that this is the only valid use of question answering as a format. It is not, as the next two sections will show.
3.2 QA as annotation / probe
When considering building a dataset to study a particular phenomenon, a researcher has many options for how that dataset should be formatted. The most common approach in NLP is to define a formalism, typically drawn from linguistics, and train people to annotate data in that formalism (e.g., part of speech tagging, syntactic parsing, coreference resolution, etc.). In computer vision research, a similar approach is taken for image classification, object detection, scene recognition, etc. When the phenomenon being studied can be cleanly captured by a fixed formalism, this is a natural approach.
There are times, however, when defining a formalism is not optimal. This can be either because the formalism is too expensive to annotate, or because the phenomenon being annotated does not fit nicely into a fixed formalism. In these cases, the flexibility and simplicity of natural language annotations can be incredibly useful. For example, researchers often rely on crowd workers when constructing datasets, and training them in a linguistic formalism can be challenging. However, there are many areas of semantics that any native speaker could easily annotate in natural language, without needing to be taught a formalism (cf. QA-SRL, described below).
Having decided on natural language annotations instead of a fixed formalism still leaves a lot of room for choice of formats. Free-form generation, such as in image captioning and summarization, and natural language inference are also flexible formats that use natural language as a primary annotation mechanism (Poliak et al., 2018). In what circumstances should one use question answering instead of these other options?
Question answering is often a good choice over summarization or captioning-style formats when (1) there are many things about a given context that could be queried. In summarization and captioning, only one output per input image or passage is generated. Question answering allows the dataset designer to query several different aspects of the context. Question answering may also be preferred over summarization-style formats (2) for easier evaluation. Current metrics for automatically evaluating natural language generation are not very robust (Edunov et al., 2019, inter alia). In question answering formats, restricted answer types, such as span extraction, are often available with more straightforward evaluation metrics, though those restrictions often come with their own problems, such as reasoning shortcuts or other biases (Jia and Liang, 2017; Min et al., 2019).
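To illustrate why restricted answer types are easier to score, here is a minimal sketch of two string-overlap metrics in the spirit of the exact-match and token-level F1 scores commonly reported for span-extraction datasets. The normalization steps (lowercasing, dropping punctuation and English articles) are assumptions made for this sketch and do not reproduce any particular dataset's official evaluation script.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))

def token_f1(prediction, gold):
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # If either side is empty, score 1.0 only when both are empty.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it occurs on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower."))  # 1.0
print(token_f1("in Paris France", "Paris"))              # 0.5
```

Both functions compare a predicted string against a single gold answer. Nothing comparable in simplicity exists for free-form summaries or captions, which is the evaluation advantage noted above.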
Question answering is strictly more general than natural language inference (NLI) as a format, as an NLI example can always be converted to a QA example by phrasing the hypothesis as a question and using yes, no, or maybe as the answer. The opposite is not true, as questions with answers other than yes, no, or maybe are challenging to convert to NLI format without losing information. The question and answer can be converted into a declarative hypothesis with label “entailed”, but coming up with a useful negative training signal is non-trivial and very prone to introducing biases. Because the output space is larger in QA, there is a richer learning signal from each example. We recommend using QA over NLI as a format for new datasets in almost all cases.
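The NLI-to-QA direction of this argument can be sketched in a few lines. The label names (`entailment`, `contradiction`, `neutral`), the dataclass names, and the “Is it true that ...?” wrapper are assumptions chosen for illustration; a real conversion might rephrase each hypothesis as a more natural question.

```python
from dataclasses import dataclass

@dataclass
class NLIExample:
    premise: str
    hypothesis: str
    label: str  # assumed to be "entailment", "contradiction", or "neutral"

@dataclass
class QAExample:
    context: str
    question: str
    answer: str

# Each NLI label maps to exactly one of the three answers, so no
# information is lost in this direction.
LABEL_TO_ANSWER = {"entailment": "yes", "contradiction": "no", "neutral": "maybe"}

def nli_to_qa(example):
    """Wrap the hypothesis in a yes/no/maybe question over the premise."""
    statement = example.hypothesis.strip().rstrip(".")
    question = f'Is it true that "{statement}"?'
    return QAExample(example.premise, question, LABEL_TO_ANSWER[example.label])

# Invented example, for illustration only.
nli = NLIExample(
    premise="A man is playing a guitar on stage.",
    hypothesis="A man is playing an instrument.",
    label="entailment",
)
qa = nli_to_qa(nli)
print(qa.question)  # Is it true that "A man is playing an instrument"?
print(qa.answer)    # yes
```

The reverse mapping has no equally simple form: a question whose answer is a span or a number yields only an “entailed” hypothesis, and the negative examples would have to be constructed separately.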
The remainder of this section looks at specific examples (in a non-exhaustive manner) where question answering is used to good effect as an annotation mechanism for particular phenomena. In none of these cases would a human seeking information actually ask any of the questions in the dataset; the person would just look at the given context (sentence, image, paragraph, etc.) to answer their own question. There is no “natural distribution” of questions for this kind of task.
QA-SRL / QAMR
The motivation for this work is explicitly to make annotation cheaper by having crowd workers label semantic dependencies in sentences using natural language instead of training them in a formalism (He et al., 2015; FitzGerald et al., 2018; Michael et al., 2018). In addition, the QA pairs can also be seen as a probe for understanding the structure of the sentence.
Visual question answering
Reading comprehension
The motivation for this task is to demonstrate understanding of a passage of text, using various kinds of questions (Rajpurkar et al., 2016; Joshi et al., 2017; Dua et al., 2019; Amini et al., 2019). These questions aim at evaluating different phenomena, from understanding simple relationships about entities, to numerical analysis, to multi-hop reasoning. There are two recent surveys that give a good overview of the kinds of reading comprehension datasets that have been built so far (Liu et al., 2019; Zhang et al., 2019). The open-domain nature of the reading comprehension task makes it very unlikely that a formalism could be developed, leaving question answering as the natural way to probe a system’s understanding of longer passages of text.
Background knowledge and common sense
3.3 As a transfer mechanism
There has been a lot of work on transferable representation learning, trying to share linguistic or other information learned between a diverse set of tasks. The dominant, and most successful, means of doing this is by sharing a common language representation layer, and having several different task-specific heads that output predictions in particular formats. An alternative approach is to pose a large number of disparate tasks in the same format. This has generally been less successful, though there are a few specific scenarios in which it appears promising. Below, we highlight two of them.
The first case in which it helps to pose a non-QA task as QA is when the non-QA task is closely related to a QA task, and one can reasonably hope to get few-shot (or even zero-shot) transfer from a trained QA model to the non-QA task. This model transfer was successfully demonstrated by Levy et al. (2017), who took a relation extraction task and used templates to pose it as QA in order to transfer a SQuAD-trained model to the task. However, as in other transfer learning or multi-task learning scenarios, this is only likely to succeed when the source and target tasks are very similar to each other. McCann et al. (2018) attempted to do multi-task learning with many diverse tasks, including machine translation, summarization, sentiment analysis, and more, all posed as question answering. In most cases, this hurt performance over training on each task independently. It seems likely that having a shared representation layer and separate prediction heads would be a more fruitful avenue for achieving this kind of transfer than posing everything as QA.
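The template-based reformulation of relation extraction mentioned above can be sketched as follows. This is not the procedure of any cited paper: the relation names, the templates, and the `QAModel` interface are hypothetical, and the QA model itself is treated as a black box that maps a question and a context to an answer span (or `None`).

```python
from typing import Callable, Dict, Optional

# Hypothetical relation names and question templates, for illustration only.
QUESTION_TEMPLATES: Dict[str, str] = {
    "educated_at": "Where did {entity} study?",
    "date_of_birth": "When was {entity} born?",
}

# Assumed interface of an already-trained QA model:
# (question, context) -> answer span, or None if no answer is found.
QAModel = Callable[[str, str], Optional[str]]

def relation_to_question(relation: str, entity: str) -> str:
    """Instantiate the question template for one (relation, entity) pair."""
    return QUESTION_TEMPLATES[relation].format(entity=entity)

def extract_slot(qa_model: QAModel, relation: str, entity: str,
                 sentence: str) -> Optional[str]:
    """Pose the relation extraction query as QA and return the model's answer."""
    question = relation_to_question(relation, entity)
    return qa_model(question, sentence)

print(relation_to_question("educated_at", "Ada Lovelace"))
# Where did Ada Lovelace study?
```

Because the model only ever sees a question and a context, a relation that was unseen during training needs nothing more than a new template, which is what makes few-shot or zero-shot transfer plausible when the source and target tasks are close.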
The second case in which it helps to pose a non-QA task as QA is when the model architectures used in QA are helpful for the task. Das et al. (2019) and Li et al. (2019) achieve significant improvement by converting the initial format of their data (entity tracking and relation extraction, respectively) to a QA format via question templates and using a QA model with no pretraining. We hypothesize that in these cases, forcing the model to compute similarities between the input text and the words in the human-written templates provides a useful inductive bias. For example, the template “Where is CO2 created?” will encourage the model to map “where” to locations in the passage and to find synonyms of “CO2”, inductive biases which may be difficult to inject in other model architectures.
4 Conclusion
In this paper, we argued that the community should think of question answering as a format, not a task, and that we should not be concerned when many datasets choose to use this format. Question answering is a useful format predominantly when the task has a non-trivial question understanding component, and the questions cannot simply be replaced with integer ids. We observed three different situations in which posing a task as question answering is useful: (1) when filling human information needs, and the data is already naturally formatted as QA; (2) when the flexibility inherent in natural language annotations is desired, either because the task does not fit into a formalism, or training people in the formalism is too expensive; and (3) to transfer learned representations or model architectures from a QA task to another task. As NLP moves beyond sentence-level linguistic annotation, many new datasets are being constructed, often without well-defined formalisms backing them. We encourage those constructing these datasets to think carefully about what format is most useful for them, and we have given some guidance about when question answering might be appropriate.
- Amini et al. (2019) Aida Amini, Saadia Gabriel, Peter Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. MathQA: Towards interpretable math word problem solving with operation-based formalisms. In NAACL.
- Antol et al. (2015) Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In ICCV.
- Berant et al. (2013) Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In EMNLP.
- Clark et al. (2019) Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In NAACL.
- Dahl et al. (1994) D. A. Dahl, M. Bates, M. Brown, W. Fisher, K. Hunicke-Smith, D. Pallett, C. Pao, A. Rudnicky, and E. Shriberg. 1994. Expanding the scope of the ATIS task: The ATIS-3 corpus. In Workshop on Human Language Technology, pages 43–48.
- Das et al. (2019) Rajarshi Das, Tsendsuren Munkhdalai, Xingdi Yuan, Adam Trischler, and Andrew … 2019. Building dynamic knowledge graphs from text using machine reading comprehension. In ICLR.
- Dua et al. (2019) Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In NAACL.
- Dunn et al. (2017) Matthew Dunn, Levent Sagun, Mike Higgins, V Ugur Guney, Volkan Cirik, and Kyunghyun Cho. 2017. SearchQA: A new Q&A dataset augmented with context from a search engine. arXiv preprint arXiv:1704.05179.
- Edunov et al. (2019) Sergey Edunov, Myle Ott, Marc’Aurelio Ranzato, and Michael Auli. 2019. On the evaluation of machine translation systems trained with back-translation.
- FitzGerald et al. (2018) Nicholas FitzGerald, Julian Michael, Luheng He, and Luke Zettlemoyer. 2018. Large-scale QA-SRL parsing. In ACL.
- He et al. (2015) Luheng He, Mike Lewis, and Luke Zettlemoyer. 2015. Question-answer driven semantic role labeling: Using natural language to annotate natural language. In EMNLP.
- Hewlett et al. (2016) Daniel Hewlett, Alexandre Lacoste, Llion Jones, Illia Polosukhin, Andrew Fandrianto, Jay Han, Matthew Kelcey, and David Berthelot. 2016. WikiReading: A novel large-scale language understanding task over Wikipedia. In ACL.
- Hudson and Manning (2019) Drew A. Hudson and Christopher D. Manning. 2019. GQA: A new dataset for real-world visual reasoning and compositional question answering. In CVPR.
- Iyer et al. (2017) S. Iyer, I. Konstas, A. Cheung, J. Krishnamurthy, and L. Zettlemoyer. 2017. Learning a neural semantic parser from user feedback. In ACL.
- Jia and Liang (2017) Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In EMNLP.
- Johnson et al. (2017) Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross B. Girshick. 2017. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR.
- Joshi et al. (2017) Mandar S. Joshi, Eunsol Choi, Daniel S. Weld, and Luke S. Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In ACL.
- Kumar et al. (2016) Ankit Kumar, Ozan Irsoy, Jonathan Su, James Bradbury, Robert English, Brian Pierce, Peter Ondruska, Ishaan Gulrajani, and Richard Socher. 2016. Ask me anything: Dynamic memory networks for natural language processing. In ICML.
- Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural questions: a benchmark for question answering research. TACL.
- Kwok et al. (2001) C. Kwok, O. Etzioni, and D. S. Weld. 2001. Scaling question answering to the web. ACM Transactions on Information Systems (TOIS), 19:242–262.
- Lee et al. (2019) Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. In ACL.
- Levy et al. (2017) Omer Levy, Minjoon Seo, Eunsol Choi, and Luke S. Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. In CoNLL.
- Li et al. (2019) Xiaoya Li, Fan Yin, Zijun Sun, Xiayu Li, Arianna Yuan, Duo Chai, Mingxin Zhou, and Jiwei Li. 2019. Entity-relation extraction as multi-turn question answering. In ACL.
- Liu et al. (2019) Shanshan Liu, Xin Zhang, Sheng Zhang, Hui Wang, and Weiming Zhang. 2019. Neural machine reading comprehension: Methods and trends. ArXiv, abs/1907.01118.
- McCann et al. (2018) Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. 2018. The natural language decathlon: Multitask learning as question answering. ArXiv, abs/1806.08730.
- Michael et al. (2018) Julian Michael, Gabriel Stanovsky, Luheng He, Ido Dagan, and Luke Zettlemoyer. 2018. Crowdsourcing question-answer meaning representations. In NAACL-HLT.
- Mihaylov et al. (2018) Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? a new dataset for open book question answering. In EMNLP.
- Min et al. (2019) Sewon Min, Eric Wallace, Sameer Singh, Matt Gardner, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019. Compositional questions do not necessitate multi-hop reasoning. In ACL.
- Poliak et al. (2018) Adam Poliak, Aparajita Haldar, Rachel Rudinger, J. Edward Hu, Ellie Pavlick, Aaron Steven White, and Benjamin Van Durme. 2018. Collecting diverse natural language inference problems for sentence representation evaluation. In BlackboxNLP@EMNLP.
- Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP.
- Sap et al. (2019) Maarten Sap, Hannah Rashkin, Derek Chen, Ronan LeBras, and Yejin Choi. 2019. Social IQa: Commonsense reasoning about social interactions. In EMNLP.
- Talmor et al. (2019) A. Talmor, J. Herzig, N. Lourie, and J. Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In NAACL.
- Voorhees (1999) Ellen M. Voorhees. 1999. The TREC-8 question answering track report. In TREC.
- Yogatama et al. (2019) Dani Yogatama, Cyprien de Masson d’Autume, Jerome Connor, Tomás Kociský, Mike Chrzanowski, Lingpeng Kong, Angeliki Lazaridou, Wang Ling, Lei Yu, Chris Dyer, and Phil Blunsom. 2019. Learning and evaluating general linguistic intelligence. ArXiv, abs/1901.11373.
- Zelle and Mooney (1996) M. Zelle and R. J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In AAAI.
- Zellers et al. (2019) Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. From recognition to cognition: Visual commonsense reasoning. In CVPR.
- Zhang et al. (2019) Xin Zhang, An Yang, Sujian Li, and Yizhong Wang. 2019. Machine reading comprehension: a literature review. ArXiv, abs/1907.01686.