Unsupervised Commonsense Question Answering with Self-Talk

Natural language understanding involves reading between the lines with implicit background knowledge. Current systems either rely on pre-trained language models as the sole implicit source of world knowledge, or resort to external knowledge bases (KBs) to incorporate additional relevant knowledge. We propose an unsupervised framework based on self-talk as a novel alternative for multiple-choice commonsense tasks. Inspired by inquiry-based discovery learning (Bruner, 1961), our approach queries language models with a number of information-seeking questions such as "what is the definition of ..." to discover additional background knowledge. Empirical results demonstrate that the self-talk procedure substantially improves the performance of zero-shot language model baselines on four out of six commonsense benchmarks, and competes with models that obtain knowledge from external KBs. While our approach improves performance on several benchmarks, the self-talk-induced knowledge, even when leading to correct answers, is not always seen as useful by human judges, raising interesting questions about the inner workings of pre-trained language models for commonsense reasoning.



1 Introduction

Human-level natural language understanding involves reading between the lines and relying on implicit background knowledge. Consider the scene in Figure 1: Alice let Bob stand in front of her at the concert. Using physical and social commonsense – (i) Bob and Alice want to see the stage, and (ii) if Bob were taller, he would block Alice’s view – one can infer that Alice is taller than Bob. Such examples are ubiquitous across natural language understanding (NLU) tasks such as reading comprehension Hirschman et al. (1999) and recognizing textual entailment Dagan et al. (2013), and even more so in tasks dedicated to commonsense reasoning such as the Winograd schema challenge (WSC; Levesque et al., 2012).

Most current NLU models rely on pre-trained language models (LMs; e.g. Radford et al., 2019; Devlin et al., 2019; Liu et al., 2019). The standard practice is to use task-specific data to fine-tune a pre-trained LM in a supervised manner. Alternatively, the LM score is used to rank answer choices in a zero-shot setup Wang et al. (2019); Sakaguchi et al. (2020). In both setups, pre-trained LMs improve performance over prior methods, largely due to the world knowledge that such LMs capture, having been trained on massive texts Petroni et al. (2019); Davison et al. (2019).

Figure 1: Illustration of world knowledge, knowledge captured by LMs, and needed in a specific context.
Dataset | Context + Question | Choices
COPA | The man broke his toe. What was the cause of this? | 1) He got a hole in his sock. 2) *He dropped a hammer on his foot.
CommonSenseQA | Where on a river can you hold a cup upright to catch water on a sunny day? | 1) *waterfall 2) bridge 3) valley 4) pebble 5) mountain
MC-TACO | […] dream of becoming a judge. How many years did it take for Mark to become a judge? | 1) 63 years 2) 7 weeks 3) *7 years 4) 7 seconds 5) 7 hours
Social IQa | In the school play, Robin played a hero in the struggle to the death with the angry villain. How would others feel as a result? | 1) sorry for the villain 2) *hopeful that Robin will succeed 3) like Robin should lose the fight
PIQA | To separate egg whites from the yolk using a water bottle, you should | 1) *[…] Release, which creates suction and lifts the yolk. 2) […] Keep pushing, which creates suction and lifts the yolk.
WinoGrande | Katrina had the financial means to afford a new car while Monica did not, since _ had a high paying job. | 1) *Katrina 2) Monica
Table 1: An example from each dataset used in this paper. The correct choice in each example is marked with *.

Despite the performance boost, LMs as knowledge providers suffer from various shortcomings: (i) insufficient coverage: due to reporting bias, many trivial facts might not be captured by LMs (purple set in Figure 1), because they are rarely written about Gordon and Van Durme (2013); (ii) insufficient precision: the distributional training objective increases the probability of non-facts (light green set in Figure 1) that are semantically similar to true facts, as in negation (“birds cannot fly”; Kassner and Schütze, 2019). LMs excel at predicting the semantic category of a missing word, but might predict the wrong instance in that category (e.g., depending on the phrasing, BERT sometimes predicts red as the color of a dove). Finally, (iii) it is unclear whether LMs are capable of performing multiple reasoning steps involving implicit knowledge.

To increase the coverage of high-precision world knowledge and facilitate multi-hop reasoning by making intermediate reasoning steps explicit, prior work incorporated KBs (e.g. ConceptNet; Speer and Havasi, 2012) and knowledge-informed models into LM-based models Xia et al. (2019); Bosselut and Choi (2019); Chen et al. (2019).

In this paper, we study pre-trained LMs as an alternative to external KBs in providing knowledge to commonsense question answering tasks. We propose an unsupervised model that uses an LM as the answer scorer, and a (possibly different) LM as a knowledge source. We formulate the process of obtaining relevant knowledge as self-talk, an inquiry-based discovery learning process Bruner (1961), with the following steps: 1) seeking out knowledge by generating natural-language “clarification questions” conditioned on a given context; 2) generating their corresponding answers (“clarifications”); and 3) incorporating the clarifications as additional context.

Our model does not rely on external knowledge or additional supervision. Yet, we show that on 4 out of 6 tasks it substantially improves upon a zero-shot baseline that relies on the LM score alone, and performs on par with, and sometimes better than, models that use external knowledge sources.

Integrating external knowledge requires discerning which facts are relevant and helpful for solving a particular instance. With LMs as knowledge sources, it further requires identifying whether a clarification is factually correct. We show that even among the clarifications that helped the prediction, humans perceived many as unhelpful or even incorrect, demonstrating that LM-based models often solve problems correctly for seemingly incorrect reasons. Our results call for future research on robust and correct knowledge integration into LM-based question answering systems.

2 Tasks

We focus on the multiple-choice question answering tasks exemplified in Table 1 and detailed below. Each instance consists of an optional context, an optional question, and several answer choices. The development set sizes vary from 100 (COPA) to 1,954 (Social IQa).

COPA: Choice of Plausible Alternatives Gordon et al. (2012):

Asking about either a plausible cause or a plausible result, among two alternatives, of a certain event expressed in a simple sentence.

CommonSenseQA: commonsense Question Answering Talmor et al. (2019).

General questions about concepts from ConceptNet. To increase the challenge, the distractors are related to the target concept either by a relationship in ConceptNet or as suggested by crowdsourcing workers.

MC-TACO: Multiple Choice Temporal commonsense Zhou et al. (2019).

Questions about temporal aspects of events such as ordering (Table 1), duration, stationarity, frequency, and typical time. The distractors were selected in an adversarial way using BERT. (To make this task compatible with the other tasks, we kept only a single correct answer per instance, making our results not comparable to previously reported results.)

Figure 2: Model illustration for WinoGrande. Each answer choice (Brett, Ian) is assigned to the concatenation of the context and a clarification. The score for each choice is the best LM score across clarifications (2 in this case).

Social IQa: Social Interaction Question Answering Sap et al. (2019).

Questions regarding social interactions, based on the ATOMIC dataset Sap et al. (2019). Contexts describe social interactions and questions refer to one of a few aspects (e.g. the subject’s motivation, following actions, etc.). The answers were crowdsourced.

PIQA: Physical Interaction Question Answering Bisk et al. (2020).

Questions regarding physical commonsense knowledge. Contexts are goals derived from an instruction website, typically involving less prototypical uses of everyday objects (e.g., using a bottle to separate eggs). The answers were crowdsourced, and an adversarial filtering algorithm was used to remove annotation artifacts: word associations and dataset-specific features that are not informative for the task, identified by a strong baseline and removed Gururangan et al. (2018); Zellers et al. (2018).

WinoGrande Sakaguchi et al. (2020).

A large-scale version of WSC that exhibits less bias thanks to adversarial filtering and use of placeholders instead of pronouns. As opposed to WSC that was curated by experts, WinoGrande was crowdsourced with a carefully designed approach that produces diverse examples which are trivial for humans.

3 Models

A given instance consists of an optional context c, an optional question q, and n answer choices a_1, ..., a_n. We first describe the baseline model, which makes the prediction based on the instance alone (Section 3.1). We then describe a knowledge-informed model that relies on external resources (Section 3.2). Finally, we describe the proposed inquiry-based model, which uses pre-trained LMs to produce clarifications (Section 3.3).

3.1 LM-only Baseline

We use a pre-trained language model to score the plausibility of different text fragments. We experiment with the various LMs provided by the transformers package Wolf et al. (2019): GPT Radford et al. (2018), GPT2 (Radford et al., 2019, all sizes), a distilled GPT2 Sanh et al. (2019), and XLNet (Yang et al., 2019, both sizes).

We assign each of the answer choices into the combination of the context and the question, obtaining a text T_i = combine(c, q, a_i). The combine function is computed differently for each task. For example, in COPA, where the question may be about either the cause or the effect of the context, we create the following texts for cause questions: “[context]. The cause for it was that [choice]”, and for effect questions: “[context]. As a result, [choice]”.
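One plausible way to sketch this verbalization in code (the helper name and exact template strings are illustrative, not the paper's implementation; note that a cause question asks what led to the context, and an effect question asks what followed from it):

```python
def combine_copa(context, question_type, choice):
    """One plausible verbalization of a COPA instance for LM scoring.

    question_type is "cause" or "effect". The template strings here are
    illustrative stand-ins for the task-specific combine function.
    """
    context = context.rstrip(".")
    if question_type == "cause":
        return f"{context}. The cause for it was that {choice}"
    return f"{context}. As a result, {choice}"
```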

We denote the score of each answer choice as score(a_i) = CE(T_i), where CE is the cross-entropy loss defined as:

CE(t_1, ..., t_k) = -(1/k) Σ_{j=1}^{k} log P_LM(t_j | t_1, ..., t_{j-1})

We predict the answer choice with the lowest score, i.e., the most likely option according to the LM: â = argmin_i score(a_i).
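The scoring step can be sketched in a few lines; here, lists of per-token log-probabilities stand in for the output of a real pre-trained LM, and the function names are our own:

```python
def lm_cross_entropy(token_log_probs):
    """Cross-entropy of a text: average negative log-probability of its
    tokens under the LM (supplied here directly as a list of numbers)."""
    return -sum(token_log_probs) / len(token_log_probs)

def predict(choice_log_probs):
    """Return the index of the answer choice whose combined text has the
    lowest cross-entropy, i.e. the most likely option under the LM."""
    scores = [lm_cross_entropy(lp) for lp in choice_log_probs]
    return min(range(len(scores)), key=scores.__getitem__)
```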

3.2 Baseline Model with External Knowledge

Figure 3: Generating a single clarification using ConceptNet, Google Ngrams, and COMET (Social IQa instance).

In the setup illustrated in Figure 2, each instance comes with an additional list of clarifications cl_1, ..., cl_m: text fragments containing potentially relevant knowledge for solving the instance. For instance, the clarification “The purpose of the internship is to help people find jobs” might help answer the question “which of Brett and Ian found a job less quickly after graduation?”. We do not expect all the clarifications to be relevant and helpful for answering the main question. Instead, the model relies on the single clarification that most increases its belief in a certain answer choice. Thus, the score of each answer choice is the score of the text containing the clarification that most supports it, i.e., whose combination with it yields the minimal loss:

score(a_i) = min_j CE(combine(c ⊕ cl_j, q, a_i)),

where ⊕ denotes appending the clarification to the context. Again, we predict â = argmin_i score(a_i).
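A minimal sketch of this selection rule, assuming a cross-entropy function ce over raw text and a simple concatenation order (both illustrative assumptions, not the paper's exact code):

```python
def best_choice(ce, context, question, choices, clarifications):
    """Pick an answer: each choice is scored by its single most
    supportive clarification (the minimal loss over all combined
    texts), and the choice with the lowest such score wins."""
    scores = [
        min(ce(f"{context} {clar} {question} {choice}")
            for clar in clarifications)
        for choice in choices
    ]
    return scores.index(min(scores))
```

With a real LM, ce would compute the per-token cross-entropy of the text; any stand-in loss works for testing the selection logic.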

We extract clarifications from the following sources, exemplified in Figure 3.


ConceptNet Speer and Havasi (2012). Similarly to previous work, we extract relation paths between words from the context and the question and words from the answer choices. Since we incorporate the knowledge into the model as text, we convert each ConceptNet relation to a natural language template, as in Davison et al. (2019). We limit the path length to 2 edges in order to maintain high precision.
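The template conversion can be sketched as follows; the templates below are hypothetical examples in the spirit of Davison et al. (2019), and the exact wording used by the model may differ:

```python
# Hypothetical natural-language templates for ConceptNet relations;
# illustrative only, not the paper's exact template set.
TEMPLATES = {
    "IsA": "{0} is a {1}",
    "UsedFor": "{0} is used for {1}",
    "AtLocation": "You are likely to find {0} in {1}",
    "Causes": "{0} causes {1}",
}

def verbalize(head, relation, tail):
    """Convert a single ConceptNet edge into a natural-language clarification."""
    return TEMPLATES[relation].format(head, tail) + "."
```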


Google N-grams Brants and Franz (2006). For pairs of one word from the context and question and one word from the answer choices, we extract their joint occurrences (with a minimum frequency of 100) in Google N-grams. This yields text fragments of up to 5 words rather than well-formed sentences, with the potential of describing the relationship between the two words Shwartz and Dagan (2018).


COMET Bosselut et al. (2019). COMET is a knowledge base construction model trained on the ATOMIC resource Sap et al. (2019), which consists of everyday situations along with multiple commonsense dimensions such as their causes, effects, and pre- and post-conditions. We generate all the dimensions unless we can target specific relations that are more likely to help. Specifically, in Social IQa, we heuristically determine which type of COMET relation the question asks for. In COPA, we use the pre-condition relations for cause questions (xIntent, xNeed) and the post-condition relations for effect questions (xEffect, xReact, xWant, oEffect, oReact, oWant). When possible, we replace PersonX with the syntactic subject of the context or the question.
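The COPA relation selection reduces to a simple lookup (the helper name is ours; the relation names follow ATOMIC as listed above):

```python
# Mapping from COPA question type to COMET/ATOMIC relations,
# as described in the text.
COPA_RELATIONS = {
    "cause": ["xIntent", "xNeed"],
    "effect": ["xEffect", "xReact", "xWant", "oEffect", "oReact", "oWant"],
}

def relations_for(question_type):
    """Return the COMET relations to query for a COPA question."""
    return COPA_RELATIONS[question_type]
```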

3.3 Self-talk Model

Figure 4: Generating a clarification with an LM: 1) Generate a question, conditioned on the context (pink) and a question prefix (yellow). 2) Generate an answer, conditioned on the context, the generated question, and a corresponding answer prefix. The clarification is the concatenation of the answer prefix and the generated text (green).

Our proposed model makes the prediction in the same way as the model in Figure 2, but extracts the clarifications from pre-trained LMs. We treat the knowledge extraction from LMs as a process of self-asking clarification questions about the context and “discovering” their answers. Figure 4 exemplifies this process for WinoGrande with a generator language model. For the sake of simplicity, the illustration depicts the process of generating a single pair of clarification question and answer.

We start by generating multiple clarification questions conditioned on the context, by 1) concatenating one of several question prefixes, which we curated for each task (e.g. “What is the purpose of”; see the appendix); and 2) generating 5 questions for each prefix using Nucleus sampling with p = 0.2, i.e., sampling from the tokens comprising the top 20% of the probability mass Holtzman et al. (2019). This value was chosen in preliminary experiments and is significantly lower than the standard value of p in the literature, which is typically around 0.9. We use a low value because we optimize for factual correctness, and our preliminary experiments showed that lower p values produce texts that are more “faithful” to the training corpus, at the price of being more bland. We limit the question length to up to 6 tokens, excluding the prefix.

For each well-formed question obtained in the previous step, e.g. “What is the purpose of the internship?”, we generate multiple answers using a similar method. Each question prefix corresponds to an answer prefix. We use the concatenation of the context, the generated clarification question, and the answer prefix as the prompt for generating an answer (clarification). We limit the answer length to 10 generated tokens, and use Nucleus sampling. We generate 10 answers for each clarification question and keep all the well-formed clarifications. Note that the clarification questions themselves are only a means to generate the clarifications, and they are not used by our model.
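The top-p truncation underlying Nucleus sampling can be sketched as a toy implementation over an explicit token distribution (real systems operate on LM logits over the full vocabulary; the function name is ours):

```python
import random

def nucleus_sample(dist, p=0.2, rng=random):
    """Toy Nucleus (top-p) sampling (Holtzman et al., 2019): sample from
    the smallest set of highest-probability tokens whose cumulative
    probability reaches p, renormalizing within that set."""
    items = sorted(dist.items(), key=lambda kv: -kv[1])
    nucleus, total = [], 0.0
    for token, prob in items:
        nucleus.append((token, prob))
        total += prob
        if total >= p:
            break
    r = rng.random() * total  # renormalize and sample within the nucleus
    for token, prob in nucleus:
        r -= prob
        if r <= 0:
            return token
    return nucleus[-1][0]
```

With p = 0.2, a single dominant token often fills the nucleus by itself, which is why low p yields blander but more faithful generations.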

Figure 5: Generating a clarification for Social IQa conditioned on the context, the given question (pink), and a heuristically matched answer prefix (yellow).

In some datasets, an instance consists of both a context and a question. In this case, we can use the instance question as a “clarification” question and generate additional clarification questions similar to it. Figure 5 exemplifies this shortcut for Social IQa: instead of generating a clarification question, the given question “Why did Austin do this?” is used, and together with a heuristically matched answer prefix, the model can generate a potentially direct solution: “Austin did this because they wanted to keep him alive”.

Since we did not train the clarification generator to ask sensical, relevant, and helpful questions, nor did we train the answer generator to generate coherent and factually correct answers, we can assume that some of the generated clarifications do not provide useful information to the model.

4 Results

Dataset | Model | LM | Knowledge Source | Dev Acc. | Test Acc.
COPA | Majority | - | - | 55.0 | -
COPA | Baseline | Distil-GPT2 | - | 53.0 | -
COPA | Ext. Knowledge | GPT2-L | COMET | 69.0 | -
COPA | Self-talk | Distil-GPT2 | Distil-GPT2 | 66.0 | -
COPA | Pre. Sup | T5 | - | - | 94.8
COPA | Human | - | - | - | 100.0
CommonSenseQA | Majority | - | - | 20.9 | -
CommonSenseQA | Baseline | GPT2-L | - | 37.2 | -
CommonSenseQA | Ext. Knowledge | GPT2-XL | COMET | 39.7 | -
CommonSenseQA | Self-talk | GPT2-L | GPT2-M | 32.4 | -
CommonSenseQA | Pre. Sup | Albert ensemble | - | 83.7 | 76.5
CommonSenseQA | Human | - | - | 88.9 | 88.9
MC-TACO | Majority | - | - | 40.3 | 43.0
MC-TACO | Baseline | GPT2-M | - | 53.1 | 50.6
MC-TACO | Ext. Knowledge | GPT2-XL | COMET | 58.8 | 55.6
MC-TACO | Self-talk | GPT2-XL | GPT2-XL | 59.9 | 58.0
Social IQa | Majority | - | - | 33.6 | 33.7
Social IQa | Baseline | GPT2-L | - | 41.1 | 41.1
Social IQa | Ext. Knowledge | GPT2-XL | COMET | 47.5 | 45.3
Social IQa | Self-talk | GPT2-XL | GPT2-L | 46.2 | 43.9
Social IQa | Pre. Sup | RoBERTa-large | - | 76.6 | 77.1
Social IQa | Human | - | - | 86.9 | 84.4
PIQA | Majority | - | - | 50.5 | 50.4
PIQA | Baseline | GPT2-XL | - | 62.6 | 63.4
PIQA | Ext. Knowledge | GPT2-XL | COMET | 69.6 | 68.4
PIQA | Self-talk | GPT2-XL | GPT2-M | 70.2 | 69.5
PIQA | Pre. Sup | RoBERTa-large | - | 79.2 | 77.1
PIQA | Human | - | - | 94.9 | 94.9
WinoGrande | Majority | - | - | 50.4 | 50.4
WinoGrande | Baseline | GPT2-XL | - | 54.8 | 54.8
WinoGrande | Ext. Knowledge | GPT2-XL | COMET | 55.4 | 53.7
WinoGrande | Self-talk | GPT2-XL | GPT | 54.7 | 55.1
WinoGrande | Pre. Sup | RoBERTa-large | - | 79.3 | 79.1
WinoGrande | Human | - | - | 94.1 | 94.0
Table 2: Best setup for each model type on each task, selected according to development accuracy. Test accuracy is reported when the labels are available or when the development accuracy justified a leaderboard submission.

Table 2 displays the performance of the best model in each category according to the development accuracy. We report the performance of the following models: a majority baseline, the LM baseline (Baseline), the LM-based model with external knowledge (Ext. Knowledge), Self-talk, supervised models from prior work when applicable (Pre. Sup; excluding unpublished leaderboard submissions), and human performance. Baseline, Ext. Knowledge, and Self-talk are our zero-shot models.

Dataset Rank (Mean Dev Acc.)
COPA: Distil-GPT2 (63.7) > GPT2-M (61.8) > GPT2-L (60.6) > GPT2 (59.7) > GPT (58.6) > GPT2-XL (57.9) > XLNet-base (51.9) > XLNet-L (49.5)
CSQA: GPT2-L (31.8) > GPT2-XL (31.2) > GPT2-M (27.7) > GPT (27.6) > GPT2 (25.6) > Distil-GPT2 (25.4) > XLNet-base (21.5) > XLNet-L (20.8)
MC-TACO: GPT2-XL (58.1) > GPT2-L (56.6) > GPT2-M (53) > GPT2 (50.1) > Distil-GPT2 (48.8) > GPT (47.7) > XLNet-L (37) > XLNet-base (34.2)
Social IQa: GPT2-XL (45.5) > GPT2-L (44.4) > GPT2-M (43.4) > GPT2 (41.8) > GPT (41.6) > Distil-GPT2 (40.4) > XLNet-L (33.6) > XLNet-base (33.1)
PIQA: GPT2-XL (69.6) > GPT2-L (67.9) > GPT2-M (65.6) > GPT2 (62) > Distil-GPT2 (59.6) > GPT (57.9) > XLNet-base (49.2) > XLNet-L (48.8)
WinoGrande: GPT2-XL (54) > GPT2-L (52.9) > GPT (52.2) > GPT2 (51.2) > Distil-GPT2 (50.9) > GPT2-M (50.2) > XLNet-base (49.1) > XLNet-L (48.7)
Table 3: Ranking of LMs according to their dev accuracy averaged across knowledge sources for each dataset.
Dataset Rank (Mean Dev Acc.)
COPA: COMET (61.1) > GPT2-XL (58.6) > Google Ngrams (58.4) > GPT2-M (58.2) > XLNet-L (58.2) > GPT (58.1) > GPT2 (58.0)
CSQA: COMET (29.8) > Google Ngrams (29.1) > GPT2-M (26.3) > ConceptNet (26.1) > GPT2-L (26.1) > XLNet-L (25.8) > GPT2 (25.8)
MC-TACO: Google Ngrams (49.1) > ConceptNet (48.9) > GPT2 (48.7) > GPT2-L (48.6) > GPT2-XL (48.5) > Distil-GPT2 (48.1) > GPT2-M (48.1)
Social IQa: COMET (41.4) > GPT2-XL (40.9) > GPT2-L (40.6) > Distil-GPT2 (40.5) > XLNet-L (40.4) > GPT2-M (40.4) > XLNet-base (40.4)
PIQA: Google Ngrams (60.5) > XLNet-L (60.2) > ConceptNet (60.2) > GPT (60.1) > GPT2-XL (60.1) > GPT2-M (60.0) > GPT2-L (60.0)
WinoGrande: GPT (51.3) > GPT2-XL (51.3) > GPT2-L (51.2) > COMET (51.2) > ConceptNet (51.2) > GPT2 (51.2) > GPT2-M (51.2)
Table 4: Ranking of knowledge sources according to their dev accuracy averaged across LMs for each dataset (for the sake of space, only the top 7 are listed).

As expected, the zero-shot models perform worse overall than the state-of-the-art supervised models, but they perform substantially better than the majority baselines on most tasks, with the exception of WinoGrande, where they only slightly outperform it. Among the LM-based models, self-talk performs on par with, or within a few points of, the external knowledge model.

Best LM.

Table 3 shows the ranking of the LMs according to their development accuracy averaged across the different knowledge sources. In general, GPT-2 is preferred, in particular the larger models, except on COPA, where the distilled version works best. A possible explanation is that language model distillation reduces the likelihood of rare words Tang and Lin (2018), which works well for the simple sentences in COPA. The XLNet models perform poorly, perhaps due to their smaller training corpus (16GB vs. 40GB for GPT-2, both using web text).

Best Knowledge Source.

Among the knowledge-informed models, COMET achieves the best performance across tasks. This is likely because, first, COMET can dynamically generate predictions for any context, while the other two knowledge sources are static and lack coverage. Second, as expected, COMET improves the predictions for Social IQa, which was built on the ATOMIC resource on which COMET is trained.

Table 4 sorts the knowledge sources by their average development accuracy across LMs. PIQA and MC-TACO, tasks that require types of knowledge other than social commonsense, perform well with ConceptNet and Google N-grams. With respect to self-talk models, there is a rather small difference in performance between the different LMs used as knowledge sources, with a slight preference for GPT-2 in most datasets.

We also experimented with combining the clarifications from all the knowledge sources, which did not prove beneficial except for MC-TACO (where it added 7.9 points to the dev accuracy, bringing it to 66.7). We assume that some resources added noise, making the whole smaller than the sum of its parts.

5 Human Evaluation of the Clarifications

Figure 6: Human evaluation of the clarifications, for each combination of task and knowledge source. Top: ratio of grammatical, not entirely grammatical but understandable, and completely not understandable clarifications. Bottom: percent of clarifications considered relevant, correct, and helpful. Answers in Social IQa were only evaluated for helpfulness when the clarification question was different from the main question (e.g. in ConceptNet).

While the performance on the end task serves as an extrinsic evaluation of the quality of the generated clarifications, we are also interested in evaluating them intrinsically. From preliminary experiments we know that a high ratio of the clarifications are noisy. Thus, we analyze the clarifications that helped predict the correct answer, i.e., clarifications with the best LM score in their instance whose existence changed an incorrect prediction by the baseline into a correct prediction by the model.

We sampled up to 50 such clarifications for each combination of task and knowledge source, using the best performing LM. (We omitted COPA from the analysis due to its small size; see the appendix for examples.) We showed crowdsourcing workers an instance along with a clarification question and its answer, and asked them: 1) whether the question is grammatical, not entirely grammatical but understandable, or completely not understandable; and, if the answer to 1) was anything but “completely not understandable”, 2) whether the question is relevant, i.e. on topic with the instance. We asked the same questions about the answer, in addition to: 3) whether the answer is factually correct or likely true; and 4) whether the answer adds helpful information for solving the instance.

The annotation task was carried out on Amazon Mechanical Turk. To ensure annotation quality, we required that workers be located in the US, UK, or Canada, and have a 99% approval rate for at least 5,000 prior tasks. We aggregated annotations from 3 workers using majority vote. The annotations yielded moderate agreement, with Fleiss’ Kappa κ = 0.43 Landis and Koch (1977). Among the different annotation categories, pairwise accuracy ranged from 60.41% (the answer is factually correct) to 92.26% (the question is completely not understandable).
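The aggregation and agreement computations can be sketched as follows (the helper names are ours, and the pairwise-accuracy definition here, the fraction of agreeing worker pairs over all items, is an illustrative approximation of the measure reported above):

```python
from collections import Counter

def majority_vote(labels):
    """Aggregate one item's worker annotations (3 workers) by majority."""
    return Counter(labels).most_common(1)[0][0]

def pairwise_accuracy(annotations):
    """Fraction of agreeing worker pairs over all items; an illustrative
    approximation of the pairwise agreement measure in the text."""
    agree = total = 0
    for labels in annotations:
        for i in range(len(labels)):
            for j in range(i + 1, len(labels)):
                agree += labels[i] == labels[j]
                total += 1
    return agree / total
```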

For the sake of brevity, we focus on the analysis of the answers to the clarification questions. Figure 6 shows the human evaluation results for each combination of task and knowledge source. The top part of the figure shows that across tasks and resources, most clarifications are grammatical or at least understandable, with the exception of XLNet. The bottom part shows the percentage of clarifications considered relevant, correct, and helpful. (If a worker considered an answer “completely not understandable”, we marked it as not relevant, correct, or helpful.) Most clarifications were considered relevant to the context, around half were considered factually correct, and some 20-40% were considered helpful. Considering that these are all clarifications that indeed helped the model, this is an interesting though not entirely unexpected finding: the model utilizes knowledge that humans would not consider helpful, and likely also vice versa.

Breaking down by knowledge source, we observe that when a dataset was created using a particular knowledge source (ConceptNet for CommonSenseQA, and ATOMIC, on which COMET is trained, for Social IQa), clarifications from that resource are considered correct. We also note that, somewhat surprisingly, relatively few ConceptNet clarifications were considered correct, despite limiting the relation paths to at most 2 edges.

6 Related Work

6.1 External Knowledge in Neural Models

Approaches for incorporating external knowledge into a neural model consist of several components: (1) the task addressed; (2) neural model; (3) knowledge sources; and (4) incorporation method. Most models target tasks that require commonsense knowledge, such as the story cloze test (RocStories; Mostafazadeh et al., 2016) and machine comprehension tasks Kočiskỳ et al. (2018); Ostermann et al. (2018); Clark et al. (2018); Talmor et al. (2019). The neural component has recently shifted from biLSTM to transformer-based representations, specifically pre-trained LMs such as BERT Devlin et al. (2019) and RoBERTa Liu et al. (2019).

With respect to the knowledge source, the vast majority of papers rely on ConceptNet to extract relation paths between concepts and entities identified in the input (Speer and Havasi, 2012, see an example in Figure 3). Additional resources include WordNet Lin et al. (2017); Wang and Jiang (2019), mining scripts from corpora Lin et al. (2017), knowledge base embeddings Chen et al. (2019); Xiong et al. (2019), hand-crafted rules Lin et al. (2017); Tandon et al. (2018), and tools such as sentiment analyzers Chen et al. (2019) and knowledge-informed LMs Bosselut and Choi (2019).

The external knowledge is typically incorporated into the neural model by learning a vector representation of the symbolic knowledge (e.g. subgraphs from ConceptNet), and attending to it via an attention mechanism when representing the inputs Bauer et al. (2018); Paul and Frank (2019); Lin et al. (2019). Alternative approaches include using the knowledge to score answer candidates and prune implausible ones Lin et al. (2017); Tandon et al. (2018), and training in a multi-task setup with auxiliary tasks pertaining to knowledge Xia et al. (2019).

6.2 Extracting Knowledge from LMs

Pre-trained LMs such as GPT2 Radford et al. (2019) and BERT Devlin et al. (2019) capture various types of world knowledge. Petroni et al. (2019) showed that such LMs can be used for a KB completion task over ConceptNet and Wikidata Vrandečić and Krötzsch (2014) by converting KB relations into natural language templates and querying the LM for the missing part of the triplet (concept, relation, concept). For instance, querying BERT for suitable substitutes for the mask in “Dante was born in [MASK]” assigns the highest probability to Florence. Davison et al. (2019) similarly showed that BERT assigns higher scores to natural language fragments of true rather than fictitious ConceptNet triplets, and semi-automated the template creation by using GPT2 to score hand-crafted templates.

While both works showed somewhat promising results, other work showed that knowledge extracted from LMs is, expectedly, not always accurate. Specifically, Kassner and Schütze (2019) showed that negated facts are also considered likely by the LM, while Logan et al. (2019) pointed out that LMs may over-generalize and produce incorrect facts such as “Barack Obama’s wife is Hillary”.

6.3 Generating Questions and Explanations

There are numerous research directions investigating automatic question generation Vanderwende (2008). Motivations vary from data augmentation for QA tasks Du et al. (2017); Dhingra et al. (2018); Du and Cardie (2018); Sachan and Xing (2018), through conversational machine reading Saeidi et al. (2018); Pan et al. (2019) and simplifying questions to make them more easily answerable Buck et al. (2018); Talmor and Berant (2018); Perez et al. (2020), to using questions as a means for other purposes such as sentence representation and summarization Guo et al. (2018); Potash and Suleman (2019).

In particular, our work pertains to previous work on producing clarification questions and explanations. Rao and Daumé III (2019) worked on questions from forums (e.g. Stack Exchange). They proposed a model that generates clarification questions and corresponding answers for a given question, using the question’s comments (clarification questions and answers) as supervision. Question-answer pairs were scored based on how much relevant information they add to the context.

Shen et al. (2019) developed an active learning framework for image captioning that learns to detect uncertainty about generated words and asks natural language questions to reduce its uncertainty. A visual question answering (VQA) model provides an answer, which is then used to revise the caption. The framework is trained with reinforcement learning, but the gold-standard captions are used during warmup steps and the VQA model is supervised.

Klein and Nabi (2019) proposed a joint question generation and question answering framework. They fine-tuned GPT2 on a question answering dataset to generate a question and an answer span for a given passage, and trained BERT to answer the generated question given the passage. Finally, Rajani et al. (2019) proposed a model for CommonSenseQA that generates explanations for its predictions. They collected human explanations and used them to fine-tune LMs to automatically generate explanations, which were then added as additional inputs. The shortcoming of this approach is that it requires collecting specific human explanations for each new dataset.

7 Discussion and Conclusion

We presented an unsupervised framework for multiple-choice commonsense tasks that generates and integrates background knowledge from pre-trained LMs. On most tasks, it performs substantially better than the baseline and similarly to a model with access to external knowledge resources.

By design, our model makes a single additional reasoning step explicit. A preliminary experiment in which we incorporated pairs of clarifications to facilitate two reasoning hops yielded mixed results. An interesting future direction is to generate each clarification in response to the previous ones, in a dialogue setup Saeidi et al. (2018). Another challenge is the “needle in a haystack” problem of finding useful clarifications; one way to address it is to develop a model capable of “introspection”, i.e., knowing what it doesn’t know. More structured knowledge generation might also make the combination of various knowledge sources more successful.

Filling in knowledge gaps and making implicit intermediate reasoning steps explicit is imperative going forward. We hope that our framework will facilitate future research in this area. Our code and data are available at github.com/vered1986/self_talk.


Acknowledgments

This research was supported in part by NSF (IIS-1524371, IIS-1714566), DARPA under the CwC program through the ARO (W911NF-15-1-0543), and DARPA under the MCS program through NIWC Pacific (N66001-19-2-4031).


References

  • L. Bauer, Y. Wang, and M. Bansal (2018) Commonsense for generative multi-hop question answering tasks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 4220–4230. External Links: Link, Document Cited by: §6.1.
  • Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi (2020) PIQA: reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence, Cited by: §2.
  • A. Bosselut and Y. Choi (2019) Dynamic knowledge graph construction for zero-shot commonsense question answering. ArXiv abs/1911.03876. Cited by: §1, §6.1.
  • A. Bosselut, H. Rashkin, M. Sap, C. Malaviya, A. Celikyilmaz, and Y. Choi (2019) COMET: commonsense transformers for automatic knowledge graph construction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4762–4779. External Links: Link, Document Cited by: §3.2.
  • T. Brants and A. Franz (2006) Web 1T 5-gram version 1. Linguistic Data Consortium, Philadelphia. Cited by: §3.2.
  • J. S. Bruner (1961) The act of discovery. Harvard educational review 31, pp. 21–32. Cited by: Unsupervised Commonsense Question Answering with Self-Talk, §1.
  • C. Buck, J. Bulian, M. Ciaramita, W. Gajewski, A. Gesmundo, N. Houlsby, and W. Wang (2018) Ask the right questions: active question reformulation with reinforcement learning. In International Conference on Learning Representations, External Links: Link Cited by: §6.3.
  • J. Chen, J. Chen, and Z. Yu (2019) Incorporating structured commonsense knowledge in story completion. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6244–6251. Cited by: §1, §6.1.
  • P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord (2018) Think you have solved question answering? try ARC, the AI2 reasoning challenge. External Links: Link Cited by: §6.1.
  • I. Dagan, D. Roth, M. Sammons, and F. M. Zanzotto (2013) Recognizing textual entailment: models and applications. Synthesis Lectures on Human Language Technologies 6 (4), pp. 1–220. Cited by: §1.
  • J. Davison, J. Feldman, and A. Rush (2019) Commonsense knowledge mining from pretrained models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 1173–1178. External Links: Link, Document Cited by: §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, Minnesota. Cited by: §1, §6.1, §6.2.
  • B. Dhingra, D. Danish, and D. Rajagopal (2018) Simple and effective semi-supervised question answering. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, pp. 582–587. External Links: Link, Document Cited by: §6.3.
  • X. Du and C. Cardie (2018) Harvesting paragraph-level question-answer pairs from Wikipedia. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 1907–1917. External Links: Link, Document Cited by: §6.3.
  • X. Du, J. Shao, and C. Cardie (2017) Learning to ask: neural question generation for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1342–1352. External Links: Link, Document Cited by: §6.3.
  • A. Gordon, Z. Kozareva, and M. Roemmele (2012) SemEval-2012 task 7: choice of plausible alternatives: an evaluation of commonsense causal reasoning. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), Montréal, Canada, pp. 394–398. External Links: Link Cited by: §2.
  • J. Gordon and B. Van Durme (2013) Reporting bias and knowledge acquisition. In Proceedings of the 2013 workshop on Automated knowledge base construction, pp. 25–30. Cited by: §1.
  • H. Guo, R. Pasunuru, and M. Bansal (2018) Soft layer-specific multi-task summarization with entailment and question generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 687–697. External Links: Link, Document Cited by: §6.3.
  • S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. Bowman, and N. A. Smith (2018) Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 107–112. Cited by: footnote 2.
  • L. Hirschman, M. Light, E. Breck, and J. D. Burger (1999) Deep read: a reading comprehension system. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, pp. 325–332. Cited by: §1.
  • A. Holtzman, J. Buys, M. Forbes, and Y. Choi (2019) The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751. Cited by: §3.3.
  • N. Kassner and H. Schütze (2019) Negated lama: birds cannot fly. arXiv preprint arXiv:1911.03343. Cited by: §1.
  • T. Kočiskỳ, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, and E. Grefenstette (2018) The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics 6, pp. 317–328. Cited by: §6.1.
  • J. R. Landis and G. G. Koch (1977) The measurement of observer agreement for categorical data. Biometrics, pp. 159–174. Cited by: §5.
  • H. Levesque, E. Davis, and L. Morgenstern (2012) The winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning, Cited by: §1.
  • B. Y. Lin, X. Chen, J. Chen, and X. Ren (2019) KagNet: knowledge-aware graph networks for commonsense reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 2829–2839. External Links: Link, Document Cited by: §6.1.
  • H. Lin, L. Sun, and X. Han (2017) Reasoning with heterogeneous knowledge for commonsense machine comprehension. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 2032–2043. External Links: Link, Document Cited by: §6.1, §6.1.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §1, §6.1.
  • N. Mostafazadeh, N. Chambers, X. He, D. Parikh, D. Batra, L. Vanderwende, P. Kohli, and J. Allen (2016) A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 839–849. External Links: Link, Document Cited by: §6.1.
  • S. Ostermann, M. Roth, A. Modi, S. Thater, and M. Pinkal (2018) SemEval-2018 task 11: machine comprehension using commonsense knowledge. In Proceedings of the 12th International Workshop on Semantic Evaluation, pp. 747–757. Cited by: §6.1.
  • B. Pan, H. Li, Z. Yao, D. Cai, and H. Sun (2019) Reinforced dynamic reasoning for conversational question generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 2114–2124. External Links: Link, Document Cited by: §6.3.
  • D. Paul and A. Frank (2019) Ranking and selecting multi-hop knowledge paths to better predict human needs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 3671–3681. External Links: Link, Document Cited by: §6.1.
  • E. Perez, P. Lewis, W. Yih, K. Cho, and D. Kiela (2020) Unsupervised question decomposition for question answering. In RCQA workshop @ AAAI 2020, Cited by: §6.3.
  • F. Petroni, T. Rocktäschel, S. Riedel, P. Lewis, A. Bakhtin, Y. Wu, and A. Miller (2019) Language models as knowledge bases?. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 2463–2473. External Links: Link, Document Cited by: §1.
  • P. Potash and K. Suleman (2019) Playing log (n)-questions over sentences. In EmeCom workshop @ NeurIPS 2019, Cited by: §6.3.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. Technical report, OpenAI. Cited by: §3.1.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. Technical report, OpenAI. Cited by: §1, §3.1, §6.2.
  • M. Sachan and E. Xing (2018) Self-training for jointly learning to ask and answer questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 629–640. Cited by: §6.3.
  • M. Saeidi, M. Bartolo, P. Lewis, S. Singh, T. Rocktäschel, M. Sheldon, G. Bouchard, and S. Riedel (2018) Interpretation of natural language rules in conversational machine reading. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 2087–2097. External Links: Link, Document Cited by: §6.3, §7.
  • K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi (2020) WINOGRANDE: an adversarial winograd schema challenge at scale. In AAAI, Cited by: §1, §2.
  • V. Sanh, L. Debut, J. Chaumond, and T. Wolf (2019) DistilBERT, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108. Cited by: §3.1.
  • M. Sap, R. Le Bras, E. Allaway, C. Bhagavatula, N. Lourie, H. Rashkin, B. Roof, N. A. Smith, and Y. Choi (2019) ATOMIC: an atlas of machine commonsense for if-then reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3027–3035. Cited by: §2, §3.2.
  • M. Sap, H. Rashkin, D. Chen, R. Le Bras, and Y. Choi (2019) Social IQa: commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 4463–4473. External Links: Link, Document Cited by: §2.
  • V. Shwartz and I. Dagan (2018) Paraphrase to explicate: revealing implicit noun-compound relations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 1200–1211. External Links: Link, Document Cited by: §3.2.
  • R. Speer and C. Havasi (2012) Representing general relational knowledge in ConceptNet 5. In LREC, pp. 3679–3686. Cited by: §1, §6.1.
  • A. Talmor and J. Berant (2018) The web as a knowledge-base for answering complex questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 641–651. External Links: Link, Document Cited by: §6.3.
  • A. Talmor, J. Herzig, N. Lourie, and J. Berant (2019) CommonsenseQA: a question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4149–4158. External Links: Link, Document Cited by: §2, §6.1.
  • N. Tandon, B. Dalvi, J. Grus, W. Yih, A. Bosselut, and P. Clark (2018) Reasoning about actions and state changes by injecting commonsense knowledge. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 57–66. External Links: Link, Document Cited by: §6.1, §6.1.
  • R. Tang and J. Lin (2018) Adaptive pruning of neural language models for mobile devices. arXiv preprint arXiv:1809.10282. Cited by: §4.
  • L. Vanderwende (2008) The Importance of Being Important: Question Generation. In Proceedings of the Workshop on the Question Generation Shared Task and Evaluation Challenge, Cited by: §6.3.
  • D. Vrandečić and M. Krötzsch (2014) Wikidata: a free collaborative knowledgebase. Communications of the ACM 57 (10), pp. 78–85. Cited by: §6.2.
  • C. Wang and H. Jiang (2019) Explicit utilization of general knowledge in machine reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 2263–2272. External Links: Link, Document Cited by: §6.1.
  • C. Wang, S. Liang, Y. Zhang, X. Li, and T. Gao (2019) Does it make sense? and why? a pilot study for sense making and explanation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4020–4026. External Links: Link, Document Cited by: §1.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew (2019) HuggingFace’s transformers: state-of-the-art natural language processing. ArXiv abs/1910.03771. Cited by: §3.1.
  • J. Xia, C. Wu, and M. Yan (2019) Incorporating relation knowledge into commonsense reading comprehension with multi-task learning. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 2393–2396. Cited by: §1, §6.1.
  • W. Xiong, M. Yu, S. Chang, X. Guo, and W. Y. Wang (2019) Improving question answering over incomplete KBs with knowledge-aware reader. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4258–4264. External Links: Link, Document Cited by: §6.1.
  • Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. CoRR abs/1906.08237. External Links: Link, 1906.08237 Cited by: §3.1.
  • R. Zellers, Y. Bisk, R. Schwartz, and Y. Choi (2018) SWAG: a large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 93–104. External Links: Link, Document Cited by: footnote 2.
  • B. Zhou, D. Khashabi, Q. Ning, and D. Roth (2019) “Going on a vacation” takes longer than “going for a walk”: a study of temporal commonsense understanding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3354–3360. Cited by: §2.