Human-level natural language understanding involves reading between the lines and relying on implicit background knowledge. Consider the scene in Figure 1: Alice let Bob stand in front of her at the concert. Using physical and social commonsense – (i) Bob and Alice want to see the stage, and (ii) if Bob were taller, he would block Alice’s view – one can infer that Alice is taller than Bob. Such examples are ubiquitous across natural language understanding (NLU) tasks such as reading comprehension Hirschman et al. (1999) and recognizing textual entailment Dagan et al. (2013), and even more so in tasks dedicated to commonsense reasoning such as the Winograd schema challenge (WSC; Levesque et al., 2012).
Most current NLU models rely on pre-trained language models (LMs; e.g. Radford et al., 2019; Devlin et al., 2019; Liu et al., 2019). The standard practice is to fine-tune a pre-trained LM on task-specific data in a supervised manner. Alternatively, the LM score is used to rank answer choices in a zero-shot setup Wang et al. (2019); Sakaguchi et al. (2020). In both setups, pre-trained LMs improve performance over prior methods, largely due to the world knowledge that such LMs capture, having been trained on massive amounts of text Petroni et al. (2019); Davison et al. (2019).
Table 1: An example instance from each task.

| Dataset | Context + Question | Choices |
|---|---|---|
| COPA | The man broke his toe. What was the cause of this? | 1) He got a hole in his sock. 2) He dropped a hammer on his foot. |
| CommonSenseQA | Where on a river can you hold a cup upright to catch water on a sunny day? | 1) waterfall 2) bridge 3) valley 4) pebble 5) mountain |
| MC-TACO | […] dream of becoming a judge. How many years did it take for Mark to become a judge? | 1) 63 years 2) 7 weeks 3) 7 years 4) 7 seconds 5) 7 hours |
| Social IQa | In the school play, Robin played a hero in the struggle to the death with the angry villain. How would others feel as a result? | 1) sorry for the villain 2) hopeful that Robin will succeed 3) like Robin should lose the fight |
| PIQA | To separate egg whites from the yolk using a water bottle, you should | 1) […] Release, which creates suction and lifts the yolk. 2) […] Keep pushing, which creates suction and lifts the yolk. |
| WinoGrande | Katrina had the financial means to afford a new car while Monica did not, since _ had a high paying job. | 1) Katrina 2) Monica |
Despite the performance boost, LMs as knowledge providers suffer from various shortcomings: (i) insufficient coverage: due to reporting bias, many trivial facts might not be captured by LMs (purple set in Figure 1), because they are rarely written about Gordon and Van Durme (2013); (ii) insufficient precision: the distributional training objective increases the probability of non-facts (light green set in Figure 1) that are semantically similar to true facts, as in negation (“birds cannot fly”; Kassner and Schütze, 2019). LMs excel at predicting the semantic category of a missing word, but might predict the wrong instance in that category (e.g., depending on the phrasing, BERT sometimes predicts red as the color of a dove). Finally, (iii) it is unclear whether LMs are capable of performing multiple reasoning steps involving implicit knowledge.
To increase the coverage of high-precision world knowledge and facilitate multi-hop reasoning by making intermediate reasoning steps explicit, prior work incorporated KBs (e.g. ConceptNet; Speer and Havasi, 2012) and knowledge-informed models into LM-based models Xia et al. (2019); Bosselut and Choi (2019); Chen et al. (2019).
In this paper, we study pre-trained LMs as an alternative to external KBs for providing knowledge to commonsense question answering tasks. We propose an unsupervised model that uses one LM as the answer scorer and a (possibly different) LM as a knowledge source. We formulate the process of obtaining relevant knowledge as self-talk, a form of inquiry-based discovery learning Bruner (1961), with the following steps: 1) seeking out knowledge by generating natural-language “clarification questions” conditioned on a given context; 2) generating their corresponding answers (“clarifications”); and 3) incorporating the clarifications as additional context.
Our model does not rely on external knowledge or additional supervision. Yet, we show that on 4 out of 6 tasks it substantially improves upon a zero-shot baseline that relies on the LM score alone, and performs on par with, and sometimes better than, models that use external knowledge sources.
Integrating external knowledge requires discerning which facts are relevant and helpful for solving a particular instance. LMs as knowledge sources further require identifying whether a clarification is factually correct. We show that even among the clarifications that helped the prediction, humans perceived many as unhelpful or even incorrect, demonstrating that LM-based models often solve problems correctly for seemingly incorrect reasons. Our results call for future research on robust and correct knowledge integration into LM-based question answering systems.
We focus on the multiple-choice question answering tasks exemplified in Table 1 and detailed below. Each instance consists of an optional context, an optional question, and several answer choices. The development set sizes vary from 100 (COPA) to 1,954 (Social IQa).
COPA: Choice of Plausible Alternatives Gordon et al. (2012).
Each instance asks about either a plausible cause or a plausible result, among two alternatives, of an event expressed in a simple sentence.
CommonSenseQA: commonsense Question Answering Talmor et al. (2019).
General questions about concepts from ConceptNet. To increase the challenge, the distractors are related to the target concept either by a relationship in ConceptNet or as suggested by crowdsourcing workers.
MC-TACO: Multiple Choice Temporal commonsense Zhou et al. (2019).
Questions about temporal aspects of events such as ordering (Table 1), duration, stationarity, frequency, and typical time. The distractors were selected in an adversarial way using BERT. (To make this task compatible with the other tasks, we only kept a single correct answer per instance, making our results not comparable to previously reported results.)
Social IQa: Social Interaction Question Answering Sap et al. (2019).
Questions regarding social interactions, based on the ATOMIC dataset Sap et al. (2019). Contexts describe social interactions, and questions refer to one of several aspects (e.g. the subject’s motivation or subsequent actions). The answers were crowdsourced.
PIQA: Physical Interaction Question Answering Bisk et al. (2020).
Questions regarding physical commonsense knowledge. Contexts are goals derived from an instruction website, typically involving less prototypical uses of everyday objects (e.g., using a bottle to separate eggs). The answers were crowdsourced, and an adversarial filtering algorithm was used to remove annotation artifacts. (Adversarial filtering identifies word associations and dataset-specific features that are not informative for the task using a strong baseline, and removes them; Gururangan et al., 2018; Zellers et al., 2018.)
WinoGrande Sakaguchi et al. (2020).
A large-scale version of WSC that exhibits less bias thanks to adversarial filtering and the use of placeholders instead of pronouns. As opposed to WSC, which was curated by experts, WinoGrande was crowdsourced with a carefully designed approach that produces diverse examples that are nonetheless trivial for humans.
A given instance consists of an optional context $c$, an optional question $q$, and answer choices $a_1, \dots, a_k$. We first describe the baseline model, which makes the prediction based on the instance alone (Section 3.1). We then describe a knowledge-informed model that relies on external resources (Section 3.2). Finally, we discuss the proposed inquiry-based model, which uses pre-trained LMs to produce clarifications (Section 3.3).
3.1 LM-only Baseline
We use a pre-trained language model to score the plausibility of different text fragments. We experiment with the various LMs provided by the transformers package Wolf et al. (2019): GPT Radford et al. (2018), GPT2 (Radford et al., 2019, all sizes), a distilled GPT2 Sanh et al. (2019), and XLNet (Yang et al., 2019, both sizes).
We substitute each of the answer choices into the combination of the context and the question, and obtain a text $t_i = \mathrm{combine}(c, q, a_i)$. The combine function is computed differently for each task. For example, in COPA, where the question might be either about the cause or the effect of the context, we create the following texts for cause questions: “[context]. The cause for it was that [choice]”, and for effect questions: “[context]. As a result, [choice]”.
We denote the score of each answer choice by $s_i = \mathrm{CE}(t_i)$, where CE is the cross-entropy loss of the text $t_i = w_1 \dots w_n$ under the LM:

$\mathrm{CE}(w_1 \dots w_n) = -\frac{1}{n} \sum_{j=1}^{n} \log P_{\mathrm{LM}}(w_j \mid w_1, \dots, w_{j-1})$

We predict the answer choice with the lowest score as the correct answer, which is the most likely option according to the LM: $\hat{a} = \operatorname{argmin}_i s_i$.
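This scoring rule can be sketched as follows. Here `lm` is a toy conditional-probability function standing in for a real pre-trained LM (in practice one would obtain token log-probabilities from, e.g., GPT-2), and `combine` is the task-specific template function; both names are illustrative, not the paper's code.

```python
import math

def cross_entropy(text, lm):
    # Average negative log-likelihood of `text` under `lm`, where `lm`
    # maps (previous token, current token) to a probability. This is a
    # toy stand-in for per-token scores from a pre-trained LM.
    tokens = text.split()
    losses = [-math.log(lm(tokens[j - 1] if j > 0 else None, tokens[j]))
              for j in range(len(tokens))]
    return sum(losses) / len(losses)

def predict(context_and_question, choices, lm, combine):
    # Score each candidate text and return the index of the choice with
    # the lowest loss, i.e. the most plausible text according to the LM.
    scores = [cross_entropy(combine(context_and_question, a), lm)
              for a in choices]
    return min(range(len(choices)), key=scores.__getitem__)
```

With a real LM, `cross_entropy` would sum the model's token log-probabilities instead of the toy lookup; the argmin over choices is unchanged.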
3.2 Baseline Model with External Knowledge
In the setup illustrated in Figure 2, each instance is accompanied by an additional list of clarifications $cl_1, \dots, cl_m$: text fragments containing potentially relevant knowledge for solving the instance. For instance, the clarification “The purpose of the internship is to help people find jobs” might help answer the question “Which of Brett and Ian found a job less quickly after graduation?”. We do not expect all the clarifications to be relevant and helpful for answering the main question. Instead, the model relies on the single clarification that most increases its belief in a certain answer choice. Thus, the score of each answer choice is the score of the text containing the clarification that most supports it, i.e., whose combination with it yields the minimal loss: $s_i = \min_j \mathrm{CE}(\mathrm{combine}(c, q, a_i, cl_j))$. Again, we predict $\hat{a} = \operatorname{argmin}_i s_i$.
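The only change from the baseline is thus how a choice's score is computed: each choice takes the loss of its most supportive clarification. A minimal sketch, assuming the cross-entropies of the combined texts have already been computed:

```python
def predict_with_clarifications(loss_matrix):
    # loss_matrix[i][j]: cross-entropy of the text combining the context,
    # question, answer choice i, and clarification j.
    # A choice's score is the loss under its most supportive clarification.
    scores = [min(choice_losses) for choice_losses in loss_matrix]
    # Predict the choice with the minimal score, as in the baseline.
    return min(range(len(scores)), key=scores.__getitem__)
```

Taking the min over clarifications (rather than, say, the mean) is what lets a single helpful clarification dominate even when most clarifications are noise.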
We extract clarifications from the following sources, exemplified in Figure 3.
ConceptNet Speer and Havasi (2012). Similarly to previous work, we extract relation paths between words from the context and question and words from the answer choices. Since we incorporate the knowledge into the model as text, we convert each ConceptNet relation to a natural language template as in Davison et al. (2019). We limit the path length to 2 edges in order to maintain high precision.
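Converting ConceptNet edges to text can be sketched as below; the template wordings here are illustrative placeholders, not the exact templates of Davison et al. (2019).

```python
# Illustrative natural-language templates for a few ConceptNet relations
# (hypothetical wordings; real systems curate or LM-score the templates).
TEMPLATES = {
    "UsedFor": "{} is used for {}",
    "CapableOf": "{} can {}",
    "AtLocation": "you are likely to find {} in {}",
}

def verbalize(head, relation, tail):
    # Render a (head, relation, tail) ConceptNet edge as a text fragment.
    return TEMPLATES[relation].format(head, tail) + "."
```

A 2-edge path is then verbalized as the concatenation of its two edge sentences.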
Google N-grams Brants and Franz (2006). For pairs of words from the context and question and from the answer choices, we extract their joint occurrences (with a minimum frequency of 100). This yields text fragments of up to 5 words rather than well-formed sentences, with the potential to describe the relationship between the two words Shwartz and Dagan (2018).
COMET Bosselut et al. (2019). A knowledge model trained on the ATOMIC resource Sap et al. (2019), which consists of everyday situations along with multiple commonsense dimensions such as their causes, effects, and pre- and post-conditions. We generate all the dimensions unless we can generate specific relations that are more likely to help. Specifically, in Social IQa, we heuristically try to determine which type of COMET relation the question asks about. In COPA, we use the pre-condition relations for cause questions (xIntent, xNeed) and the post-condition relations for effect questions (xEffect, xReact, xWant, oEffect, oReact, oWant). When possible, we replace PersonX with the syntactic subject of the context or the question.
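The COPA relation selection above amounts to a small lookup. The relation names are the ATOMIC/COMET dimensions listed in the text; the fallback to all dimensions is our sketch of the "generate all the dimensions" default.

```python
# Pre-condition relations explain causes; post-condition relations
# explain effects (relation names are ATOMIC/COMET dimensions).
COPA_RELATIONS = {
    "cause": ["xIntent", "xNeed"],
    "effect": ["xEffect", "xReact", "xWant", "oEffect", "oReact", "oWant"],
}

def relations_for(question_type, all_relations):
    # Default to generating every COMET dimension when the question
    # type does not indicate a more specific set of relations.
    return COPA_RELATIONS.get(question_type, list(all_relations))
```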
3.3 Self-talk Model
Our proposed model makes its prediction identically to the external-knowledge model (Figure 2), but extracts the clarifications from pre-trained LMs. We treat knowledge extraction from LMs as a process of self-asking clarification questions about the context and “discovering” their answers. Figure 4 exemplifies this process for WinoGrande with a generator language model. For the sake of simplicity, the illustration depicts the generation of a single clarification question and answer pair.
We start by generating multiple clarification questions conditioned on the context, by 1) concatenating one of several question prefixes, which we curated for each task (e.g. “What is the purpose of”; see the appendix); and 2) generating 5 questions for each prefix using Nucleus sampling with $p = 0.2$, i.e., sampling from the top 20% of the token distribution Holtzman et al. (2019). (This value was chosen in preliminary experiments and is significantly lower than the standard value of $p$ in the literature, which is typically around 0.9. We use a low value because we optimize for factual correctness, and our preliminary experiments showed that lower values produce texts that are more “faithful” to the training corpus, at the price of being more bland.) We limit the question length to up to 6 tokens, excluding the prefix.
For each well-formed question obtained in the previous step, e.g. “What is the purpose of the internship?”, we generate multiple answers using a similar method. Each question prefix corresponds to an answer prefix. We use the concatenation of the context, the generated clarification question, and the answer prefix as the prompt for generating an answer (clarification). We limit the answer length to 10 generated tokens and again use Nucleus sampling. We generate 10 answers for each clarification question and keep all the well-formed clarifications. Note that the clarification questions themselves are only a means to generate the clarifications; they are not used by our model.
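The two-stage generation loop can be sketched as follows. `generate(prompt, n, max_tokens, p)` stands in for Nucleus sampling from a pre-trained LM; the well-formedness check is reduced to "ends with a question mark", and we reuse $p = 0.2$ for the answers here purely for simplicity (the answer-stage value is not restated in the text).

```python
def self_talk(context, prefix_pairs, generate):
    # Sketch of the clarification-generation loop. `prefix_pairs` maps a
    # curated question prefix to its answer prefix; `generate` returns n
    # sampled continuations of the prompt under Nucleus sampling.
    clarifications = []
    for q_prefix, a_prefix in prefix_pairs:
        # 1) Sample question completions conditioned on context + prefix.
        completions = generate(context + " " + q_prefix,
                               n=5, max_tokens=6, p=0.2)
        for question in (q_prefix + " " + c for c in completions):
            if not question.endswith("?"):  # keep only well-formed questions
                continue
            # 2) Sample answers from context + question + answer prefix.
            prompt = " ".join([context, question, a_prefix])
            for answer in generate(prompt, n=10, max_tokens=10, p=0.2):
                # 3) The answer prefix plus its completion is a clarification.
                clarifications.append(a_prefix + " " + answer)
    return clarifications
```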
In some datasets, an instance consists of both a context and a question. In this case, we can use the instance question as a “clarification” question and generate additional clarification questions similar to it. Figure 5 exemplifies this shortcut for Social IQa: instead of generating a clarification question, the given question “Why did Austin do this?” is used, and together with a heuristically matched answer prefix, the model can generate a potentially direct solution: “Austin did this because they wanted to keep him alive”.
Since we did not train the clarification generator to ask sensible, relevant, and helpful questions, nor the answer generator to produce coherent and factually correct answers, we can assume that some of the generated clarifications do not provide useful information to the model.
Table 2 displays the performance of the best model in each category according to development accuracy. We report the performance of the following models: a majority baseline, the LM baseline (Baseline), the LM-based model with external knowledge (Ext. Knowledge), Self-talk, supervised models from prior work where applicable (Pre. Sup; excluding unpublished leaderboard submissions), and human performance. Our zero-shot models are highlighted in purple.
Table 3: Ranking of LMs by mean development accuracy, averaged across knowledge sources.

| Dataset | Rank (Mean Dev Acc.) |
|---|---|
| COPA | Distil-GPT2 (63.7), GPT2-M (61.8), GPT2-L (60.6), GPT2 (59.7), GPT (58.6), GPT2-XL (57.9), XLNet-base (51.9), XLNet-L (49.5) |
| CSQA | GPT2-L (31.8), GPT2-XL (31.2), GPT2-M (27.7), GPT (27.6), GPT2 (25.6), Distil-GPT2 (25.4), XLNet-base (21.5), XLNet-L (20.8) |
| MC-TACO | GPT2-XL (58.1), GPT2-L (56.6), GPT2-M (53.0), GPT2 (50.1), Distil-GPT2 (48.8), GPT (47.7), XLNet-L (37.0), XLNet-base (34.2) |
| Social IQa | GPT2-XL (45.5), GPT2-L (44.4), GPT2-M (43.4), GPT2 (41.8), GPT (41.6), Distil-GPT2 (40.4), XLNet-L (33.6), XLNet-base (33.1) |
| PIQA | GPT2-XL (69.6), GPT2-L (67.9), GPT2-M (65.6), GPT2 (62.0), Distil-GPT2 (59.6), GPT (57.9), XLNet-base (49.2), XLNet-L (48.8) |
| WinoGrande | GPT2-XL (54.0), GPT2-L (52.9), GPT (52.2), GPT2 (51.2), Distil-GPT2 (50.9), GPT2-M (50.2), XLNet-base (49.1), XLNet-L (48.7) |
Table 4: Ranking of knowledge sources by mean development accuracy, averaged across LMs.

| Dataset | Rank (Mean Dev Acc.) |
|---|---|
| COPA | COMET (61.1), GPT2-XL (58.6), Google Ngrams (58.4), GPT2-M (58.2), XLNet-L (58.2), GPT (58.1), GPT2 (58.0) |
| CSQA | COMET (29.8), Google Ngrams (29.1), GPT2-M (26.3), ConceptNet (26.1), GPT2-L (26.1), XLNet-L (25.8), GPT2 (25.8) |
| MC-TACO | Google Ngrams (49.1), ConceptNet (48.9), GPT2 (48.7), GPT2-L (48.6), GPT2-XL (48.5), Distil-GPT2 (48.1), GPT2-M (48.1) |
| Social IQa | COMET (41.4), GPT2-XL (40.9), GPT2-L (40.6), Distil-GPT2 (40.5), XLNet-L (40.4), GPT2-M (40.4), XLNet-base (40.4) |
| PIQA | Google Ngrams (60.5), XLNet-L (60.2), ConceptNet (60.2), GPT (60.1), GPT2-XL (60.1), GPT2-M (60.0), GPT2-L (60.0) |
| WinoGrande | GPT (51.3), GPT2-XL (51.3), GPT2-L (51.2), COMET (51.2), ConceptNet (51.2), GPT2 (51.2), GPT2-M (51.2) |
As expected, the overall performance of the zero-shot models is worse than that of state-of-the-art supervised models, but they perform substantially better than the majority baselines on most tasks, with the exception of WinoGrande, where they only slightly outperform it. Among the LM-based models, self-talk performs on par with or within a few points of the external knowledge model.
Best LM.
Table 3 shows the ranking of the LMs according to their development accuracy, averaged across the different knowledge sources. In general, GPT-2 is preferred, in particular its larger variants, except for COPA, where the distilled version works best. A possible explanation is that language model distillation reduces the likelihood of rare words Tang and Lin (2018), which suits the simple sentences in COPA. The XLNet models perform poorly, perhaps due to their smaller training corpus (16GB vs. 40GB for GPT-2, both web text).
Best Knowledge Source.
Among the knowledge-informed models, COMET achieves the best performance across tasks. This is likely because, first, COMET can dynamically generate predictions for any context, while the other two knowledge sources are static and lack coverage. Second, as expected, COMET improves the predictions for Social IQa, which was built based on the ATOMIC resource on which COMET is trained.
Table 4 sorts the knowledge sources by average development accuracy across LMs. PIQA and MC-TACO, tasks that require types of knowledge other than social commonsense, work well with ConceptNet and Google N-grams. Among the self-talk models, there is a rather small difference in performance between the different LMs used as knowledge sources, with a slight preference for GPT-2 in most datasets.
We also experimented with combining the clarifications from all knowledge sources, which did not prove beneficial except for MC-TACO (where it added 7.9 points to the development accuracy, bringing it to 66.7). We assume that some resources added noise, making the whole smaller than the sum of its parts.
5 Human Evaluation of the Clarifications
While the performance on the end task serves as an extrinsic evaluation of the quality of the generated clarifications, we are also interested in evaluating them intrinsically. From preliminary experiments we know that the ratio of noisy clarifications is high. Thus, we analyze the clarifications that help predict the correct answer, i.e. clarifications with the best LM score in their instance whose presence changes the answer from an incorrect prediction by the baseline to a correct prediction by the model.
We sampled up to 50 such clarifications for each combination of task and knowledge source, using the best performing LM. (We omitted COPA from the analysis due to its small size; see the appendix for examples.) We showed crowdsourcing workers an instance along with a clarification question and its answer, and asked them: 1) whether the question is grammatical, not entirely grammatical but understandable, or completely not understandable; and, if the answer to 1) was anything but “completely not understandable”, 2) whether the question is relevant, i.e. on topic with the instance. We asked the same questions about the answer, in addition to: 3) whether the answer is factually correct or likely true; and 4) whether the answer adds helpful information for solving the instance.
The annotation task was carried out on Amazon Mechanical Turk. To ensure annotation quality, we required that workers be located in the US, UK, or Canada, and have a 99% approval rate for at least 5,000 prior tasks. We aggregated annotations from 3 workers using majority vote. The annotations yielded moderate levels of agreement, with Fleiss’ Kappa of 0.43 Landis and Koch (1977). Among the different categories of annotations we measured pairwise accuracy, which ranged from 60.41% (whether the answer is factually correct) to 92.26% (whether the question is completely not understandable).
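The agreement statistic can be computed with a standard Fleiss' Kappa implementation (a generic sketch, not the authors' analysis script); `ratings[i][k]` counts the raters who assigned item `i` to category `k`.

```python
def fleiss_kappa(ratings):
    # ratings[i][k]: number of raters assigning item i to category k.
    # All items must be rated by the same number of raters.
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    # Observed agreement: mean fraction of agreeing rater pairs per item.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ) / n_items
    # Expected chance agreement from the overall category proportions.
    totals = [sum(row[k] for row in ratings) for k in range(len(ratings[0]))]
    p_exp = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_bar - p_exp) / (1 - p_exp)
```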
For the sake of brevity, we focus on the analysis of the answers to the clarification questions. Figure 6 shows the human evaluation results for each combination of task and knowledge source. The top part of the figure shows that across tasks and resources, most clarifications are grammatical or at least understandable, with the exception of XLNet. The bottom part shows the percentage of clarifications considered relevant, correct, and helpful. (If a worker considered an answer “completely not understandable”, we marked it as not relevant, correct, or helpful.) Most clarifications were considered relevant to the context, around half were considered factually correct, and some 20–40% were considered helpful. Considering that these are all clarifications that indeed helped the model, this is an interesting though not entirely unexpected finding: the model utilizes knowledge that humans would not consider helpful, and likely also vice versa.
Breaking down by knowledge source, we observe that when a dataset was created using a particular knowledge source (ConceptNet for CommonSenseQA, and ATOMIC, on which COMET is trained, for Social IQa), clarifications from that resource are considered correct. We also note that, somewhat surprisingly, relatively few ConceptNet clarifications were considered correct, despite our limiting the relation paths to 2 edges.
6 Related Work
6.1 External Knowledge in Neural Models
Approaches for incorporating external knowledge into a neural model consist of several components: (1) the task addressed; (2) the neural model; (3) the knowledge sources; and (4) the incorporation method. Most models target tasks that require commonsense knowledge, such as the story cloze test (RocStories; Mostafazadeh et al., 2016) and machine comprehension tasks Kočiský et al. (2018); Ostermann et al. (2018); Clark et al. (2018); Talmor et al. (2019). The neural component has recently shifted from biLSTMs to transformer-based representations, specifically pre-trained LMs such as BERT Devlin et al. (2019) and RoBERTa Liu et al. (2019).
With respect to the knowledge source, the vast majority of papers rely on ConceptNet to extract relation paths between concepts and entities identified in the input (Speer and Havasi, 2012, see an example in Figure 3). Additional resources include WordNet Lin et al. (2017); Wang and Jiang (2019), mining scripts from corpora Lin et al. (2017), knowledge base embeddings Chen et al. (2019); Xiong et al. (2019), hand-crafted rules Lin et al. (2017); Tandon et al. (2018), and tools such as sentiment analyzers Chen et al. (2019) and knowledge-informed LMs Bosselut and Choi (2019).
The external knowledge is typically incorporated into the neural model by learning a vector representation of the symbolic knowledge (e.g. subgraphs from ConceptNet) and attending to it via an attention mechanism when representing the inputs Bauer et al. (2018); Paul and Frank (2019); Lin et al. (2019). Alternative approaches include using the knowledge to score answer candidates and prune implausible ones Lin et al. (2017); Tandon et al. (2018), and training in a multi-task setup via auxiliary tasks pertaining to knowledge Xia et al. (2019).
6.2 Extracting Knowledge from LMs
Pre-trained LMs such as GPT2 Radford et al. (2019) and BERT Devlin et al. (2019) capture various types of world knowledge. Petroni et al. (2019) showed that such LMs can be used for a KB completion task over ConceptNet and Wikidata Vrandečić and Krötzsch (2014) by converting KB relations into natural language templates and querying the LM for the missing part of the (concept, relation, concept) triplet. For instance, querying BERT for suitable substitutes for the mask in “Dante was born in [MASK]” assigns the highest probability to Rome. Davison et al. (2019) similarly showed that BERT assigns higher scores to natural language fragments of true rather than fictitious ConceptNet triplets, and semi-automated the template creation by using GPT2 to score hand-crafted templates.
While both works showed somewhat promising results, other work has shown that knowledge extracted from LMs is, as expected, not always accurate. Specifically, Kassner and Schütze (2019) showed that negated facts are also considered likely by the LM, while Logan et al. (2019) pointed out that LMs may over-generalize and produce incorrect facts such as “Barack Obama’s wife is Hillary”.
6.3 Generating Questions and Explanations
There are numerous research directions investigating automatic question generation Vanderwende (2008). Motivations range from data augmentation for QA tasks Du et al. (2017); Dhingra et al. (2018); Du and Cardie (2018); Sachan and Xing (2018), through conversational machine reading Saeidi et al. (2018); Pan et al. (2019) and simplifying questions to make them more easily answerable Buck et al. (2018); Talmor and Berant (2018); Perez et al. (2020), to using questions as a means to other ends such as sentence representation and summarization Guo et al. (2018); Potash and Suleman (2019).
In particular, our work is pertinent to previous work on producing clarification questions and explanations. Rao and Daumé III (2019) worked on questions from forums (e.g. Stack Exchange). They proposed a model that generates clarification questions and corresponding answers for a given question, using the question’s comments (clarification questions and answers) as supervision. Question-answer pairs were scored by how much relevant information they add to the context.
Shen et al. (2019) developed an active learning framework for image captioning that learns to detect uncertainty about generated words and to ask natural language questions to reduce that uncertainty. A visual question answering (VQA) model provides an answer, which is then used to revise the caption. The framework is trained with reinforcement learning, but the gold-standard captions are used during warmup steps and the VQA model is supervised.
Klein and Nabi (2019) proposed a joint question generation and question answering framework. They fine-tuned GPT2 on a question answering dataset to generate a question and an answer span for a given passage, and trained BERT to answer the generated question given the passage. Finally, Rajani et al. (2019) proposed a model for CommonSenseQA that generates explanations for its predictions. They collected human explanations and used them to fine-tune LMs to automatically generate explanations, which were then added as additional inputs. The shortcoming of this approach is that it requires collecting human explanations for each new dataset.
7 Discussion and Conclusion
We presented an unsupervised framework for multiple choice commonsense tasks that generates and integrates background knowledge from pre-trained LMs. On most tasks, it performs substantially better than the baseline and similarly to a model that had access to external knowledge resources.
By design, our model makes a single additional reasoning step explicit. A preliminary experiment in which we incorporated pairs of clarifications to facilitate two reasoning hops yielded mixed results. An interesting future direction is to generate each clarification in response to the previous ones, in a dialogue setup Saeidi et al. (2018). Another challenge is the “needle in a haystack” nature of the clarifications; one way to address it is to develop a model capable of “introspection”, specifically knowing what it does not know. More structured knowledge generation might also make the combination of various knowledge sources more successful.
Filling in knowledge gaps and making implicit intermediate reasoning steps explicit is imperative going forward. We hope that our framework will facilitate future research in this area. Our code and data are available at github.com/vered1986/self_talk.
This research was supported in part by NSF (IIS-1524371, IIS-1714566), DARPA under the CwC program through the ARO (W911NF-15-1-0543), and DARPA under the MCS program through NIWC Pacific (N66001-19-2-4031).
Commonsense for generative multi-hop question answering tasks.
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 4220–4230. External Links: Cited by: §6.1.
- PIQA: reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence, Cited by: §2.
Dynamic knowledge graph construction for zero-shot commonsense question answering. ArXiv abs/1911.03876. Cited by: §1, §6.1.
- COMET: commonsense transformers for automatic knowledge graph construction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4762–4779. External Links: Cited by: §3.2.
-  Web 1t 5-gram version 1 (2006). Linguistic Data Consortium, Philadelphia. Cited by: §3.2.
- The act of discovery. Harvard educational review 31, pp. 21–32. Cited by: Unsupervised Commonsense Question Answering with Self-Talk, §1.
- Ask the right questions: active question reformulation with reinforcement learning. In International Conference on Learning Representations, External Links: Cited by: §6.3.
- Incorporating structured commonsense knowledge in story completion. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6244–6251. Cited by: §1, §6.1.
- Think you have solved question answering? try ARC, the AI2 reasoning challenge. External Links: Cited by: §6.1.
- Recognizing textual entailment: models and applications. Synthesis Lectures on Human Language Technologies 6 (4), pp. 1–220. Cited by: §1.
- Commonsense knowledge mining from pretrained models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 1173–1178. External Links: Cited by: §1.
- Bert: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, Minnesota. Cited by: §1, §6.1, §6.2.
- Simple and effective semi-supervised question answering. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, pp. 582–587. External Links: Cited by: §6.3.
- Harvesting paragraph-level question-answer pairs from Wikipedia. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 1907–1917. External Links: Cited by: §6.3.
- Learning to ask: neural question generation for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1342–1352. External Links: Cited by: §6.3.
- SemEval-2012 task 7: choice of plausible alternatives: an evaluation of commonsense causal reasoning. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), Montréal, Canada, pp. 394–398. External Links: Cited by: §2.
- Reporting bias and knowledge acquisition. In Proceedings of the 2013 workshop on Automated knowledge base construction, pp. 25–30. Cited by: §1.
- Soft layer-specific multi-task summarization with entailment and question generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 687–697. External Links: Cited by: §6.3.
- Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 107–112. Cited by: footnote 2.
- Deep read: a reading comprehension system. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, pp. 325–332. Cited by: §1.
- The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751. Cited by: §3.3.
- Negated LAMA: birds cannot fly. arXiv preprint arXiv:1911.03343. Cited by: §1.
- The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics 6, pp. 317–328. Cited by: §6.1.
- The measurement of observer agreement for categorical data. Biometrics, pp. 159–174. Cited by: §5.
- The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning. Cited by: §1.
- KagNet: knowledge-aware graph networks for commonsense reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 2829–2839. Cited by: §6.1.
- Reasoning with heterogeneous knowledge for commonsense machine comprehension. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 2032–2043. Cited by: §6.1, §6.1.
- RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §1, §6.1.
- A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 839–849. Cited by: §6.1.
- SemEval-2018 task 11: machine comprehension using commonsense knowledge. In Proceedings of the 12th International Workshop on Semantic Evaluation, pp. 747–757. Cited by: §6.1.
- Reinforced dynamic reasoning for conversational question generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 2114–2124. Cited by: §6.3.
- Ranking and selecting multi-hop knowledge paths to better predict human needs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 3671–3681. Cited by: §6.1.
- Unsupervised question decomposition for question answering. In RCQA workshop @ AAAI 2020. Cited by: §6.3.
- Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 2463–2473. Cited by: §1.
- Playing log(n)-questions over sentences. In EmeCom workshop @ NeurIPS 2019. Cited by: §6.3.
- Improving language understanding by generative pre-training. OpenAI Technical Report. Cited by: §3.1.
- Language models are unsupervised multitask learners. OpenAI Technical Report. Cited by: §1, §3.1, §6.2.
- Self-training for jointly learning to ask and answer questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 629–640. Cited by: §6.3.
- Interpretation of natural language rules in conversational machine reading. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 2087–2097. Cited by: §6.3, §7.
- WinoGrande: an adversarial Winograd schema challenge at scale. In AAAI. Cited by: §1, §2.
- DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108. Cited by: §3.1.
- ATOMIC: an atlas of machine commonsense for if-then reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3027–3035. Cited by: §2, §3.2.
- Social IQa: commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 4463–4473. Cited by: §2.
- Paraphrase to explicate: revealing implicit noun-compound relations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 1200–1211. Cited by: §3.2.
- Representing general relational knowledge in ConceptNet 5. In LREC, pp. 3679–3686. Cited by: §1, §6.1.
- The web as a knowledge-base for answering complex questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 641–651. Cited by: §6.3.
- CommonsenseQA: a question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4149–4158. Cited by: §2, §6.1.
- Reasoning about actions and state changes by injecting commonsense knowledge. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 57–66. Cited by: §6.1, §6.1.
- Adaptive pruning of neural language models for mobile devices. arXiv preprint arXiv:1809.10282. Cited by: §4.
- The importance of being important: question generation. In Proceedings of the Workshop on the Question Generation Shared Task and Evaluation Challenge. Cited by: §6.3.
- Wikidata: a free collaborative knowledgebase. Communications of the ACM 57 (10), pp. 78–85. Cited by: §6.2.
- Explicit utilization of general knowledge in machine reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 2263–2272. Cited by: §6.1.
- Does it make sense? and why? a pilot study for sense making and explanation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4020–4026. Cited by: §1.
- HuggingFace’s Transformers: state-of-the-art natural language processing. ArXiv abs/1910.03771. Cited by: §3.1.
- Incorporating relation knowledge into commonsense reading comprehension with multi-task learning. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 2393–2396. Cited by: §1, §6.1.
- Improving question answering over incomplete KBs with knowledge-aware reader. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4258–4264. Cited by: §6.1.
- XLNet: generalized autoregressive pretraining for language understanding. CoRR abs/1906.08237. Cited by: §3.1.
- SWAG: a large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 93–104. Cited by: footnote 2.
- “Going on a vacation” takes longer than “going for a walk”: a study of temporal commonsense understanding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3354–3360. Cited by: §2.