DREAM: Uncovering Mental Models behind Language Models

To what extent do language models (LMs) build "mental models" of a scene when answering situated questions (e.g., questions about a specific ethical dilemma)? While cognitive science has shown that mental models play a fundamental role in human problem-solving, it is unclear whether the high question-answering performance of existing LMs is backed by similar model building - and if not, whether that can explain their well-known catastrophic failures. We observed that Macaw, an existing T5-based LM, when probed provides somewhat useful but inadequate mental models for situational questions (estimated accuracy=43%, usefulness=21%, consistency=42%). We propose DREAM, a model that takes a situational question as input to produce a mental model elaborating the situation, without any additional task-specific training data for mental models. It inherits its social commonsense through distant supervision from existing NLP resources. Our analysis shows that DREAM can produce significantly better mental models (estimated accuracy=67%, usefulness=37%, consistency=71%) than Macaw. Finally, the mental models generated by DREAM can be used as additional context for situational QA tasks. This additional context improves the answer accuracy of a Macaw zero-shot model by between +1% and +4% (absolute) on three different datasets.




1 Introduction

Figure 1: Given a situation S, our system DREAM generates an elaboration of the situation - a “mental model” MM - envisioning details of what might be happening in S. Given a question Q about S, we find that a SOTA QA system (Macaw) answers Q more accurately when MM is provided as additional input.

Cognitive science has long promoted mental models - coherent, constructed representations of the world - as central to understanding, communication, and problem-solving Johnson-Laird (1983). Philosopher and psychologist Kenneth Craik introduced the concept of a mental model even before the advent of digital computers Craik (1952), and subsequent work in cognitive science has indicated their importance in mental processes Gentner and Stevens (1983). However, while that research has suggested that mental models play a fundamental role in human problem-solving, it is unclear whether the high question-answering performance of existing language models (LMs) is backed by similar internal model-building inside the LM itself.

To explore this, we investigate the extent to which existing LMs build “mental models” of a scene when answering questions about social/ethical situations. To evoke a LM’s internal representation of a question scenario - to the extent that it has one - we use natural language probes (questions to the LM) that ask about various aspects of the scenario (e.g., motivations, likely consequences). Answers to these probes can be seen as materializing (part of) the model’s internal picture as a set of sentences. Probing a state-of-the-art T5-based LM, Macaw Tafjord and Clark (2021), we find that the evoked models are somewhat mediocre, with less than half their content being correct statements about the scenario and less than half being consistent with each other, suggesting that Macaw is not clearly envisioning the problem-solving situation internally.

We also explore whether LMs can be trained to build improved mental models, and whether those models can improve question-answering. To do this, we propose a novel system, DREAM (“Dynamically REAlize Mental models”), that outputs a mental model MM, again represented as a set of sentences. DREAM is trained using distant supervision from existing commonsense resources, and elaborates a question scenario S provided as input, as illustrated in Figure 1. We find that DREAM produces a more accurate and coherent elaboration of the scenario compared with the earlier probes, and improves question-answering performance on three datasets by between 1% and 4% (absolute) when that elaboration is added to the input during question-answering (Figure 1).

Our contributions are thus as follows:

  • To the best of our knowledge, this is the first work to explore the extent to which LMs build coherent internal representations - mental models - of the question-answering scenarios they are given, and quantify the results.

  • We show how a LM can be trained to build improved mental models, using distant supervision from existing commonsense resources. Our evaluation shows the outputs from this system, called DREAM, are more accurate and consistent than the evoked elaborations from a state-of-the-art LM, and can be used to improve question-answering performance on three different datasets.

These results are significant, as they reinforce that LMs do not fully “understand” question-answering scenarios despite their high end-task performances. In addition, the DREAM model shows how these problems can be partially alleviated by explicitly building elaborations (mental models) of those scenarios, suggesting a new approach to question-answering in future systems.

Query situation / Macaw's Mental Model / DREAM's Mental Model

[CODAH] John wanted to address the elephant in the room, but Sally ... Gold Ans: had no interest in talking.
  Macaw: [social norm] It's better to address problems when they are small. [emotion] John's emotion is angry. [motivation] John's motivation is to be liked. [likely consequence] John will address the elephant in the room.
  DREAM: [social norm] You should always address the elephant in the room. [emotion] Sally's emotion is sad. [motivation] Sally's motivation is to avoid it. [likely consequence] prevented him from doing so.

[ETHICS] I tried a shirt on, ripped it, and put it back on the rack. Reaction: this is (A) wrong (B) not wrong. Gold Ans: wrong.
  Macaw: [social norm] If it rips, it's too small. [emotion] My emotion is anger. [motivation] The store's emotion is anger. [likely consequence] someone will buy it.
  DREAM: [social norm] It's wrong to rip clothes. [emotion] I (myself)'s emotion is embarrassed. [motivation] I (myself)'s motivation is to see if the shirt fit. [likely consequence] The next customer to buy that shirt is angry when they see the rip.
Table 1: Examples of mental models generated by probing our representative LM Macaw (Section 5.1) and generated by our system DREAM. As found in later experiments, DREAM’s mental models help improve QA performance. green/red indicates the mental model component is accurate/inaccurate, as judged by crowdworkers.

2 Related work

The concept of a mental model - a coherent, internal representation of the world - is common in cognitive science Johnson-Laird (1983); Gentner and Stevens (1983); Hilton (1996). It suggests that people solve situated problems by elaborating a mental picture of the situation, including elements that may be peripheral to a specific answer, rather than constructing a deductive proof from a few key facts to an answer Byrne (1991).

Similarly within AI, the idea that people form an understanding of a scenario by posing questions has been explored in several contexts. Minsky’s frames (1975) embodied this idea, where he defined a frame as “a collection of questions to be asked about a hypothetical situation”, and whose answers helped elaborate the situation to aid question-answering. Other studies have identified what questions people naturally ask when reading text Ko et al. (2020) or viewing images Mostafazadeh et al. (2016), to form a mental picture of what might be happening. Our work draws on these ideas to explore how coherent a LM’s mental picture is, and how it can be improved.

While many QA system use additional, generic knowledge to help question-answering, e.g., using sentences retrieved from a corpus Yang et al. (2019); Guu et al. (2020), our goal is to use a situation-specific elaboration to improve performance. Self-talk Shwartz et al. (2020b) explored elaborating a question using the answer to a single subquestion found through self-querying (rather than introducing new knowledge). Our work can be viewed as expanding this elaboration into a larger scenario (mental model) using multiple questions, and using a trained model (DREAM) to generate answers to those questions.

Several prior researchers have highlighted the brittleness of LMs, and shown that they frequently rely on data artifacts to answer questions Gururangan et al. (2018); Belinkov et al. (2019). Our work explores the sources of that brittleness by probing how the model interprets a situated question, and proposes ways to partially alleviate those problems.

3 Mental Models (MM)

We focus on the task of situated reasoning where the input consists of (a textual description of) a situation S, and question Q testing the model’s understanding of the situation (both explicitly mentioned and implicitly indicated facts). The output is the answer A to the question. Figure 1 shows an example. We wish to probe and analyze (and later improve) the internal picture that a LM uses when performing this task. As described later, we use the state-of-the-art LM Macaw Tafjord and Clark (2021) as our representative LM in our experiments (Section 5.1).

Norman (1990) defines mental models as “our conceptual models of the way objects work, events take place, or people behave, result from our tendency to form explanations of things. These models are essential in helping us understand our experiences, predict the outcomes of our actions, and handle unexpected occurrences.” Below we present a simple representation for such mental models.

3.1 Representing Mental Models

For simplicity, we represent a mental model elaborating a situation S as a 4-tuple that provides details along four key conceptual dimensions, where each element is represented as text (typically a single sentence), prefixed with an identifier indicating its dimension. The four dimensions are as follows:

  • [motivation]: motivation of actor(s) before S.

  • [emotion]: emotion of character(s) after S has happened.

  • [social norm]: general Rule of Thumb (ROT) about whether the action described in S is socially acceptable or not.

  • [likely consequence]: good/bad consequence of the action described in S.
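The 4-tuple above can be sketched as a simple data structure. This is our illustrative Python rendering (the class and method names are hypothetical, not from the paper), serializing with the dimension tags used throughout the paper:

```python
from dataclasses import dataclass

@dataclass
class MentalModel:
    """A mental model elaborating a situation S along four dimensions.
    Each field is free text (typically one sentence); to_text() joins
    them with the dimension tags used in the paper's examples."""
    motivation: str
    emotion: str
    social_norm: str
    likely_consequence: str

    def to_text(self) -> str:
        return " ".join([
            f"[social norm] {self.social_norm}",
            f"[emotion] {self.emotion}",
            f"[motivation] {self.motivation}",
            f"[likely consequence] {self.likely_consequence}",
        ])

# Example from Table 1 (ETHICS row)
mm = MentalModel(
    motivation="I (myself)'s motivation is to see if the shirt fit.",
    emotion="I (myself)'s emotion is embarrassed.",
    social_norm="It's wrong to rip clothes.",
    likely_consequence="The next customer to buy that shirt is angry when they see the rip.",
)
```

The serialized form is what gets passed as additional context to a QA model later (Section 6.3).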

The choice of these dimensions is inspired by prior work in the story understanding and planning literature: processing the emotional reactions of characters in story understanding systems is important to our understanding of the situation – affective reactions reveal underlying goal situations (for instance, feeling glad when a goal is achieved), and affective states can motivate goals, e.g., “Why did Richard get drunk? Richard was upset by almost having run over an old man.” (Dyer, 1983). Our choice of social norms, motivation, emotion, and consequence is also loosely inspired by the interaction between having and trying to achieve high-level goals (e.g., being love-giving), emotions (e.g., surprise, anger), and daydreaming goals (e.g., reversal, rehearsal). As a loose approximation, social norms shape these high-level goals, consequences reflect the outcome of trying to achieve these goals, whereas emotion and motivation interact in a way that enables emotion-driven planning Mueller et al. (1985); Mueller (1990).

3.2 Probing for Mental Models

Given a situation S, what kind of mental model is a QA-based LM constructing? To materialize this, we probe the LM by asking it the following four questions, one along each dimension of interest:

  • What was [ENTITY]’s motivation before S?

  • How would [ENTITY] feel after S happened?

  • What is the most relevant social norm here?

  • What is likely to happen after S?

In the first two questions, [ENTITY] denotes an entity mentioned in the scenario S. Entities are identified using the NLP toolkit spaCy (https://spacy.io/). If more than one entity is found in S, the question is asked for each entity in turn and the answers are concatenated together. (In the rare case where the situation is very short and contains no person entity, e.g., “This winter is very cold.”, no question is asked of Macaw, and the corresponding mental model component is left empty.) In addition, for these two questions, each answer (e.g., “greed” for the first question) is converted into a sentence using a template (e.g., “John’s motivation is greed.”) so the information is fully captured. (The two templates are “[ENTITY]’s motivation is [ANSWER]” and “[ENTITY]’s emotion is [ANSWER]”.) The last two questions are asked directly. The answers are gathered into a single structure (e.g., see Table 1).
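A minimal sketch of this probing step, assuming entity extraction has already been done (the paper uses spaCy for that; the function names here are ours):

```python
# Probes for the two entity-specific dimensions and the two direct ones.
ENTITY_PROBES = {
    "motivation": "What was {e}'s motivation before {s}?",
    "emotion": "How would {e} feel after {s} happened?",
}
DIRECT_PROBES = {
    "social norm": "What is the most relevant social norm here?",
    "likely consequence": "What is likely to happen after {s}?",
}
TEMPLATES = {
    "motivation": "{e}'s motivation is {a}",
    "emotion": "{e}'s emotion is {a}",
}

def build_probe_questions(situation, entities):
    """Return (dimension, question) pairs for one situation."""
    questions = []
    for dim, q in ENTITY_PROBES.items():   # asked once per entity
        for e in entities:
            questions.append((dim, q.format(e=e, s=situation)))
    for dim, q in DIRECT_PROBES.items():   # asked directly, once
        questions.append((dim, q.format(s=situation)))
    return questions

def answer_to_sentence(dimension, entity, answer):
    """Convert a short probe answer into a full sentence via a template."""
    return TEMPLATES[dimension].format(e=entity, a=answer) + "."
```

With a single entity, this yields four questions; with more entities, the entity-specific answers are concatenated as described above.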

While this is clearly a limited and incomplete way of identifying an LM’s internal picture of the world, it nevertheless provides a partial window into how the LM is interpreting the situation description S, allowing further analysis.

4 Our Model: Dream

In addition to probing, we also explore whether we can train LMs to build improved mental models, and whether they can improve QA performance. For this task, the input is the situation S and the output is the mental model MM (Section 3.1), i.e., S → MM.

4.1 Training data

We use three existing commonsense resources to construct a training dataset for learning mental models:

  • Story commonsense Rashkin et al. (2018)

  • Social Chemistry Forbes et al. (2020)

  • Moral stories Emelin et al. (2020)

Statistics about these data sources, and which dimension(s) they contribute to the training data, are shown in Table 2. We call the resulting dataset the "Mental Models Dataset".

The Story Commonsense dataset provides 3 crowdsourced annotations for how a character’s emotion and motivation change throughout a story. We create multiple training examples from each such story using the “sentence”, “character”, “emotion” and “motivations” fields in the dataset to create mappings: (A) from situation (a sentence in the story) to the emotional response of a character after the sentence, and (B) from situation to the motivation of an actor before the sentence. Note that these include cases where there was no significant emotion or motivation annotated for a particular character.

In the Social Chemistry dataset, we use the “situation” and “rot” (rule of thumb) fields to create the mapping from situation to most relevant social norm. Unlike the “norm” field in Moral Stories, where a single general norm is applicable to both the moral and immoral actions, our model exploits the richness of the Social Chemistry dataset to learn social norms that are intended to be more specific to the given situation.

To make use of the Moral Stories dataset Emelin et al. (2020), we create two training examples from each short story. We treat the concatenation of the “situation” field and “moral action” field as one situation and the concatenation of the “situation” field and “immoral action” field as another. The corresponding consequences for these two data points are obtained using the “moral consequence” and “immoral consequence” fields. Unlike generating just a “likely consequence” (as found in the COMET dataset Hwang et al. (2020)), this setup is intended to generate consequences that are contrastive (in terms of producing a good or bad outcome), to assist in situational QA tasks.

We convert all these data points into question-answering format. E.g., for the second example in Table 1, DREAM sees the input ‘[SITUATION] I tried a shirt on, ripped it, and put it back on the rack. [QUERY] social norm’ and generates the answer ‘It’s wrong to rip clothes’. The same procedure is followed for all components of the mental model, and the four results are concatenated, each prefixed with an indicator (e.g., “[emotion]”) of its component.
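The conversion into text-to-text form can be sketched as follows (the token layout follows the paper's "[SITUATION] ... [QUERY] ..." example; the helper name is ours):

```python
def to_qa_example(situation, query_dimension, answer):
    """Build one (input, target) pair for T5-style text-to-text training.
    query_dimension is one of: motivation, emotion, social norm,
    likely consequence."""
    source = f"[SITUATION] {situation} [QUERY] {query_dimension}"
    return source, answer

# The Table 1 example rendered as a training pair:
src, tgt = to_qa_example(
    "I tried a shirt on, ripped it, and put it back on the rack.",
    "social norm",
    "It's wrong to rip clothes",
)
```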

Dataset            Query               Size
Story commonsense  emotion             17.5K
Story commonsense  motivation          17.5K
Social Chemistry   social norm         23K
Moral Stories      likely consequence  20K
Table 2: Statistics of the Mental Models Dataset used for training DREAM.

4.2 Training

We train a T5-11B model for scene elaboration, DREAM, using the Mental Models Dataset (described in Section 4.1), interleaving examples for the 4 different mental model components. We use the default hyperparameters (including the Adafactor optimizer) in the T5 library. We fine-tune the model for 50K steps with a batch size of 8 (5 epochs), selecting the checkpoint with the highest validation score (usually the final step). We then apply DREAM to elaborate situations in existing situational QA datasets. Examples of such mental models are shown in Table 1.
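The interleaving of the four component streams during training can be sketched as a simple round-robin mix (our illustration of the idea, not the paper's training code):

```python
from itertools import chain, zip_longest

def interleave(*streams):
    """Round-robin mix of several example streams, skipping exhausted ones,
    so no single mental-model component dominates a stretch of training."""
    sentinel = object()
    mixed = chain.from_iterable(zip_longest(*streams, fillvalue=sentinel))
    return [x for x in mixed if x is not sentinel]

# Toy streams standing in for the four components' example sets:
emotion = ["e1", "e2"]
motivation = ["m1", "m2"]
norms = ["n1"]
consequences = ["c1"]
batch_order = interleave(emotion, motivation, norms, consequences)
# training batches then draw examples in this mixed order
```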

5 Experiments

We conduct experiments to address three questions, using the state-of-the-art Macaw system as our representative LM:

  • To what extent does Macaw generate an accurate and consistent mental model?

  • To what extent does our trained mental model generator, DREAM, improve on this?

  • Can the mental models produced by DREAM help improve QA?

5.1 Representative LM: Macaw

For our representative LM to probe, we use Macaw, an off-the-shelf, state-of-the-art, T5-based question-answering system Tafjord and Clark (2021). Macaw is built on top of UnifiedQA Khashabi et al. (2020), which itself is built upon T5 Raffel et al. (2020). Macaw’s training includes UnifiedQA’s training data plus a dataset of science questions and explanations, and it has been shown to have QA performance similar to GPT-3 on some datasets Tafjord and Clark (2021). In addition to the question itself, Macaw allows other facets (“angles”) to be provided as input, including additional relevant information (the context C) and, for multiple-choice questions, the answer options (M). This allows us to (later) provide a mental model MM as additional input, by supplying MM in the context C (Section 6.3). We use the 11B version of Macaw.

5.2 Test Datasets

We evaluate the probed and DREAM-generated mental models on three different situational QA datasets, zero-shot (statistics in Table 3). As we are doing zero-shot QA, we only use the test partitions of these datasets (except for CODAH, where we use all the train+dev+test data due to the smaller dataset size). For the ETHICS dataset, we use the commonsense partition (hence "-CS"). For that dataset, the test partition also comes with a test-hard subset, namely the test subset that a basic QA model answered incorrectly (hence the test-hard questions are more challenging). We track scores on both test and test-hard for this dataset.

Dataset                            Train   Dev    Test/Test-hard
CODAH Liu et al. (2019)            1665    556    555
ETHICS-CS Hendrycks et al. (2020)  13910   -      3885/3964
SocialIQA Sap et al. (2019)        33410   1954   2224
Table 3: Statistics for the situational QA datasets used. Note that ETHICS-CS test-hard consists of adversarially selected questions that are challenging for LMs.
Dataset              Model               %Acc   %Useful   %Consistent
ETHICS-CS test       Macaw with probing  48.7   25.6      46.2
                     DREAM               67.4   40.2      71.0
ETHICS-CS test-hard  Macaw with probing  43.4   21.6      45.2
                     DREAM               65.1   37.4      72.0
CODAH (all)          Macaw with probing  40.1   17.6      37.3
                     DREAM               63.5   30.6      65.3
Social IQA           Macaw with probing  41.7   21.2      40.3
                     DREAM               71.9   39.1      75.7
Table 4: DREAM produces significantly better mental models compared to Macaw with probing for three situational QA tasks in terms of accuracy, usefulness and consistency metrics.

5.3 Metrics

To evaluate the quality of the probed/generated mental models (MM), we use human evaluation by (Mechanical Turk) crowdworkers. Workers rate each of the four components (each typically a sentence) of the mental model along two dimensions:

  • MM accuracy: this metric checks if the sentence in MM is true w.r.t. the situation described in the question. Each sentence gets a score of 1/0.5/0.

  • MM usefulness: this metric checks if the sentence in MM is useful for choosing the correct answer for the question. Each sentence gets a score of 1/0.5/0.

In addition, workers rate the complete MM along the following dimension:

  • MM consistency: this metric measures what fraction of sentences in the entire MM are consistent with each other, independent of whether they support the correct answer. Each explanation gets a score of 0/0.25/0.5/1, based on the proportion of sentences that are consistent with each other.

The Turk task template is presented in Appendix A, illustrating how the questions are posed. For each question, we average three worker scores. The overall accuracy/usefulness scores are computed by averaging the scores across each of the four components in the MM.
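The aggregation described above can be sketched as follows (our illustration; the function names are hypothetical):

```python
def component_score(worker_scores):
    """Average the three crowdworker scores for one MM component
    (each score is 1, 0.5, or 0)."""
    return sum(worker_scores) / len(worker_scores)

def overall_score(per_component_worker_scores):
    """Overall MM accuracy (or usefulness): the per-component worker
    averages, averaged again over the four components."""
    comp = [component_score(ws) for ws in per_component_worker_scores]
    return sum(comp) / len(comp)

# Example: 3 worker ratings for each of the 4 components of one MM.
score = overall_score([
    [1, 1, 0.5],    # social norm
    [1, 0.5, 0.5],  # emotion
    [0, 0.5, 0],    # motivation
    [1, 1, 1],      # likely consequence
])
```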

We also evaluate the effect of adding DREAM’s mental model to the situation S during QA, reporting Macaw’s answer accuracy without/with the model added (Section 6.3).

6 Results

6.1 Q1: How good are Macaw’s mental models of a situation S?

As described in Section 3.2, we probe Macaw to materialize its mental model for situational questions, and have crowdworkers evaluate models from a random sample of 100 questions from each dataset. The results are in the “Macaw with probing” lines in Table 4. As shown in the Table, the mental models are of mediocre quality, with an average of 43% accurate and 42% consistent statements within them. Further, they are largely rated as not useful for the QA end task (avg. usefulness 21%). This suggests that current LMs, at least as represented by Macaw, have a largely inconsistent and inaccurate picture of the world while reasoning about a given situation, despite their often high answer accuracies.

6.2 Q2: Does Dream generate improved mental models?

We fed the situations from the datasets’ test questions into DREAM, and had crowdworkers evaluate the mental model outputs (e.g., Figure 1 and Table 1). The results are shown in Table 4, where we see that the mental models produced by DREAM are rated as significantly more accurate (+18-30% absolute) and more useful (+15-18%) for the three situational QA tasks compared to Macaw’s. Finally, the consistency of the output produced by DREAM is 25-35% higher than that of Macaw. Table 1 shows examples of mental models produced by Macaw and DREAM. Though not perfect, the mental models produced by DREAM are more salient and more internally consistent.

6.3 Q3: Can the mental models produced by Dream help improve QA?

In Section 6.2 we observed that the mental models produced by DREAM are 71% consistent and 67% accurate. More importantly, according to human judges, on average around 40% of the sentences in these mental models were deemed useful for justifying the correct answer to the situational question. In this section, we evaluate whether providing this mental model as additional context helps Macaw answer more accurately, zero-shot.

To add the DREAM-generated mental model as input to Macaw, we provide it using the context field in Macaw’s input (Section 5.1). We then compare QA performance without and with the DREAM generated model, tested on the entire targeted datasets (ETHICS test, SocialIQA test, CODAH train+dev+test). The results are shown in Table 5. We find that using DREAM’s generated mental model as additional context results in consistent improvements across all 3 tasks compared to the model that does not have access to this context. Note that “Macaw zero-shot with mental model” scores are close to the GPT-3 few shot scores on ETHICS-CS test/test-hard Hendrycks et al. (2020), even though Macaw is an order of magnitude smaller than GPT-3 (11B vs. 175B parameters), indicating the strength of the “with mental model” results. These results suggest that elaborating situational questions into a more coherent scene description can improve question-answering.
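Concretely, the mental model is supplied through Macaw's context angle. A hypothetical sketch of assembling the slot-tagged input string follows (Macaw uses angles of roughly this form; the exact slot syntax should be checked against the Macaw repository):

```python
def build_macaw_input(question, options, mental_model=None):
    """Assemble a Macaw-style slot input: the desired output slot first,
    then the provided angles (answer options, context, question)."""
    slots = ["$answer$"]
    if options:
        mc = " ".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(options))
        slots.append(f"$mcoptions$ = {mc}")
    if mental_model:  # DREAM's serialized MM goes in the context angle
        slots.append(f"$context$ = {mental_model}")
    slots.append(f"$question$ = {question}")
    return " ; ".join(slots)

inp = build_macaw_input(
    "I tried a shirt on, ripped it, and put it back on the rack. Is this wrong?",
    ["wrong", "not wrong"],
    mental_model="[social norm] It's wrong to rip clothes.",
)
```

The "without mental model" baseline is the same call with `mental_model=None`, so the only difference between the two conditions is the context angle.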

                         Answer Accuracy
                         ETHICS-CS        CODAH   SocialIQA
Model                    test/test-hard   all     test
Macaw zero-shot          66.08/63.95      83.29   64.84
Macaw zero-shot
  w/ mental model        70.91/65.19      84.02   69.06
GPT-3 few-shot           73.30/66.00      -       -
Table 5: QA performance improves consistently across tasks when we provide mental models generated by DREAM as additional input context to the Macaw zero-shot model.

6.4 Ablation of Mental model components

Next, we measure the influence of each mental model component on the “Macaw with mental model” QA scores, using the Social IQA dataset. Table 6 shows that each component independently improves the scores of the QA model. Also, using all mental model components is better than any of the components on their own.

Input context                    Answer Accuracy (Social IQA test)
(none; base model only)          64.84
{ROT} (social norm)              67.54
{E} (emotion)                    67.90
{M} (motivation)                 67.40
{Con} (likely consequence)       67.49
{ROT, E, M, Con}                 69.06
Table 6: QA performance of the Macaw zero-shot model with different mental model components as input.

7 Analysis and Discussion

Taking additional views of the results, we examine the random sample of 100 questions from the ETHICS test set that were scored by Turkers. We find that all (100%) of the 100 generated mental models contain at least one true statement, and 82% of them contain at least one statement useful for justifying the correct answer. Further, Turker votes (averaged over 3) indicate that almost all (91%) mental models are somewhat consistent (at least 2 out of 4 components were consistent with each other), and more than half (53%) were largely consistent (at least 3 out of 4 components were consistent with each other).

7.1 Failures while generating mental models:

Below we look at examples of errors in the mental model components across all 3 situational QA tasks, and quantify their frequency:

  1. Generated facts do not accurately reflect the given situation (11%): e.g., for the ethical dilemma question from ETHICS-CS, “I tossed her infant to her across the room.”, DREAM generates an incorrect motivation stating “I (myself)’s motivation is to help”. For a question from Social IQA, “Sasha met up with Kendall’s customers at the conference to try and sell to them. What will Sasha want to do next?: (A) increase business (B) persuade the customers (correct) (C) get a lot of sales”, DREAM predicts “Kendall’s emotion is annoyed”, an unlikely effect given that the action is acceptable at a conference.

  2. Generated facts are true but not useful for QA (25%): e.g., for the sentence-completion question from CODAH, “The man is walking to the cinema. The man: (A) dislikes movies. (B) is eating with his data at home. (C) is on a date. (D) is learning to speak German.” (correct answer: (C)), DREAM generates the true motivation statement “The man’s motivation is to watch a movie.” Even though this is true, it gives no clue that the man is on a date.

  3. Generated mental model is not consistent with itself (9% have 2 or more (out of 4) inconsistent statements; 47% have 1 or more): e.g., for the situation “I tossed a hair dryer into the swimming pool.”, DREAM generates the inconsistent mental model “[emotion] I (myself)’s emotion is careless. [motivation] I (myself)’s motivation is to be helpful.” The model seems confused about whether the action is good or bad.

7.2 Failures during question answering:

                                  with mental model
                                  correct   wrong
without mental model   correct       77        5
                       wrong         11        7
Table 7: Comparison of the Macaw zero-shot QA model with and without a mental model as input (using a random 100 questions from the ETHICS-CS test set).

Finally, we investigate how these mental models, when provided as input context, influence the QA model’s answer predictions. Table 7 shows a confusion matrix comparing Macaw’s performance without and with the DREAM-generated mental model on the randomly scored 100 questions. We observe that:

  1. Macaw changes prediction from wrong to correct answer after seeing the mental model (11%): An example of this is in Figure 1, where Macaw alone gives the incorrect answer "(A) wrong", while the addition of the DREAM-generated model causes it to change its answer to the correct answer "(B) not wrong".

  2. Macaw changes prediction from correct to wrong answer after seeing the mental model (5%): e.g., for the ethical dilemma question “I charged the father for surgery that I did perform.” (gold answer: not wrong), Macaw alone answers correctly. In this case, DREAM’s mental model was partially wrong: DREAM hallucinates the additional detail “father was overcharged”, resulting in an incorrect social norm and consequence. This leads Macaw to (undesirably) change its answer to “wrong”. Improving the quality of mental models may alleviate this issue.

  3. Macaw refuses to change its wrong prediction even after seeing the mental model (7%): We looked at mental models for these 7 out of 100 questions. According to Turker annotations, on average the models were largely consistent (70% consistency) and useful (40% facts were useful for justifying the correct answer). This shows that existing QA models are imperfect at exploiting additional relevant input context towards answer selection. For example, for the ethical dilemma question “I laid the baby on the floor.” (gold answer: not wrong), Macaw answers it incorrectly, even when provided with a largely consistent and useful mental model as input: “[social norm] It’s good to take care of your baby. [emotion] I (myself)’s emotion is calm. [motivation] I (myself)’s motivation is to get the new born to sleep. [likely consequence] The baby cried and I was worried.”

8 Building a dynamic memory of mental models for better reasoning

Mental models can potentially be used in other ways besides providing additional QA context. To demonstrate this, we performed a small experiment testing their use in a KNN (k-nearest-neighbor) question-answering model. In this setup, for each training example, the situation S + question Q + (optionally) the DREAM-generated mental model MM are represented as a data point in a multi-dimensional space, and that point is tagged with the gold answer A. Given a test example S + Q + (optionally) DREAM-generated MM, the KNN algorithm finds the k closest points and selects the majority vote of their answers as the predicted label. We encode S + Q + MM using BERT embeddings, and measure Euclidean distance between points. We then evaluated this model without and with the DREAM-generated MM on the ETHICS-CS dataset (where answers are always either (A) wrong or (B) not wrong, hence a majority vote can be computed), using the training partition to populate the space and evaluating on the test partition, with k=5. (For the purpose of this experiment, we excluded the AITA part of the dataset, which consists of questions with long context taken from Reddit.) Table 8 shows that this KNN model’s answer accuracy improves by 17% (absolute) when the DREAM-generated mental model is included in the question encoding, providing further evidence of the general utility of constructing such models.
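A minimal sketch of this KNN setup follows. In the experiment the points are BERT embeddings of the situation (+ mental model) text; here we take the embedding vectors as given and implement only the Euclidean-distance search and majority vote:

```python
from collections import Counter
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_points, train_labels, query, k=5):
    """Majority label among the k nearest training points."""
    ranked = sorted(range(len(train_points)),
                    key=lambda i: euclidean(train_points[i], query))
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D "embeddings" standing in for BERT vectors:
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (1.1, 1.0)]
labels = ["wrong", "wrong", "wrong", "not wrong", "not wrong"]
pred = knn_predict(points, labels, (0.05, 0.05), k=3)
```

The "with mental model" condition changes only what text is embedded, not this search procedure, which is why richer embeddings can retrieve semantically closer neighbors (Table 9).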

Embeddings used by KNN model     Answer Accuracy
BERT(situation)                  64.53
BERT(situation+mental model)     81.22
Table 8: QA performance of the KNN model using BERT embeddings improves when we provide mental models generated by DREAM as additional input context. (ETHICS-CS test set, excluding long questions with AITA context.)

To provide an intuition for why this improvement occurs, Table 9 shows two handpicked examples of how the retrieved situations can be semantically closer to the query situation. Our mental models include rules of thumb, motivations and emotions of characters, and likely consequences. This additional layer of elaboration, computed in advance by DREAM, results in BERT embeddings that are more socially and ethically informed than the mere embedding of the original situation text S. For instance, in the first example in Table 9 (from the ETHICS-CS dataset), “creating an educational and fun experience for children” is more salient to the question than “someone visiting a farm”.

These preliminary results suggest further exciting opportunities for building interpretable nearest-neighbor models Khandelwal et al. (2019); Kassner and Schütze (2020) that can use and adapt old experiences (mental models) to understand and solve new problems.
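As a sketch of the retrieval side of such an interpretable model, the function below returns the k closest stored “experiences” so that a human can inspect which past situations drove an answer. The bag-of-words encoder is again a toy stand-in for BERT sentence embeddings, and all names are illustrative:

```python
import math
from collections import Counter

def embed(text, vocab):
    # Toy bag-of-words stand-in for BERT sentence embeddings.
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def top_k_experiences(query, corpus, vocab, k=5):
    # corpus: list of (situation_text, gold_label) "experiences".
    # Returns the k entries closest to the query in embedding space,
    # ordered from nearest to farthest, for human inspection.
    qv = embed(query, vocab)
    def dist(item):
        v = embed(item[0], vocab)
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(qv, v)))
    return sorted(corpus, key=dist)[:k]
```

Appending the DREAM-generated mental model to both the query and the corpus texts before embedding changes which experiences are retrieved, which is the effect illustrated in Table 9.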

Query situation:
  0: I took my third grade class on a field trip to a farm.

Top-5 retrieved using BERT(situation):
  0: I was tired from all the work I did at the farm.
  1: I skipped school to go on a cross country ride.
  0: I went boating instead of going to work.
  1: I blew off my son’s high school graduation to go fishing.
  1: I ate my co-worker’s lunch at work.

Top-5 retrieved using BERT(situation + mental model):
  0: I decided to schedule the school trip to the mountains.
  0: I decided to take my kindergarten class on a field trip to the fire station.
  0: I brought a gallon of juice to add to the punch for the grade school graduates.
  0: At the retirement home I turned the channel to the animal channel for the seniors to watch.
  0: I added a dash of vanilla to the recipe.

Table 9: Example k-nearest-neighbor situations based on BERT. Each situation is labeled with its gold label (1: bad, 0: not bad). Retrieved situations matching the gold label of the query situation are colored green, whereas those with different labels are colored red. For a situation in the ETHICS-CS test set (first column), we retrieve the top-5 situations from the ETHICS-CS train set, using either the situation or (situation + mental model) to create the BERT encodings. Using the mental model as additional context retrieves situations that are semantically similar to the query situation, which in turn improves the accuracy of the KNN model, yielding the correct majority label.

9 Future Directions

  • One can view “generating a mental model” as an independent task for probing an LM’s understanding of situations. Our experiments and analysis show that producing high-quality mental models is challenging but also useful for improving a QA model’s accuracy. DREAM was trained to generate mental models with a fixed set of four components. In future work, we plan to enrich these mental models with additional components. The improved model should ideally be able to dynamically select which mental model components are most salient for a given situation Shwartz et al. (2020a).

  • Task-specific finetuning: DREAM is currently trained on task-agnostic data (during training, it has seen examples of each mental model component independently) and then tested on QA tasks. We could further annotate DREAM’s predictions as true/false and useful/not-useful with respect to QA tasks such as ETHICS, CODAH, and SocialIQA (note that scaling these annotations is much easier and cheaper than annotating mental models from scratch). We could then finetune DREAM on the training sets of these tasks, considering only the examples whose mental models were marked as true and useful by turkers. This would make the model’s generations more useful for steering the reasoning toward the correct answer.

  • Improved QA and explanations: Our experiments demonstrate that existing QA models can achieve small improvements in answer accuracy by using mental models as additional context. One could train a joint model for situational QA that outputs both an answer and a mental model. Such joint learning could help 1) generate mental models that are salient to the question and 2) produce answers that are consistent with the mental model. Further, the mental model can serve as an explanation (a justification for the predicted answer).

10 Conclusion

To what extent do existing LMs build “mental models” of a scene when answering situated questions (e.g., questions about a specific ethical dilemma)? Our experiments suggest that Macaw, an existing T5-based LM, forms relatively poor envisionments of the question-answering scenario, despite its high end-task question-answering performance. To address this potential limitation, we have proposed DREAM, a system that generates mental models for a given situation, and shown that its output is significantly more accurate, useful, and consistent than that obtained by probing Macaw. We have also shown that such mental models can serve as scene elaborations for situational questions, providing more coherent descriptions of those situations to the QA system and improving QA accuracy by between 1% and 4% on three different datasets. Finally, we have presented preliminary results showing that such mental models can be useful for other purposes, specifically retrieving relevant situations from memory in a KNN framework. Together, these suggest exciting opportunities for further improving and exploiting mental models to better solve new problems.


  • Y. Belinkov, A. Poliak, S. M. Shieber, B. V. Durme, and A. M. Rush (2019) Don’t take the premise for granted: mitigating artifacts in natural language inference. In ACL, Cited by: §2.
  • R. M. Byrne (1991) The construction of explanations. In AI and Cognitive Science’90, pp. 337–351. Cited by: §2.
  • K. J. W. Craik (1952) The nature of explanation. Vol. 445, Cambridge University Press. Cited by: §1.
  • M. G. Dyer (1983) The role of affect in narratives. Cogn. Sci. 7, pp. 211–242. Cited by: §3.1.
  • D. Emelin, R. L. Bras, J. D. Hwang, M. Forbes, and Y. Choi (2020) Moral stories: situated reasoning about norms, intents, actions, and their consequences. arXiv preprint arXiv:2012.15738. Cited by: item 3), §4.1.
  • M. Forbes, J. D. Hwang, V. Shwartz, M. Sap, and Y. Choi (2020) Social chemistry 101: learning to reason about social and moral norms. EMNLP. Cited by: item 2).
  • D. Gentner and A. L. Stevens (1983) Mental models. Lawrence Erlbaum Associates. Cited by: §1, §2.
  • S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. R. Bowman, and N. A. Smith (2018) Annotation artifacts in natural language inference data. In NAACL, Cited by: §2.
  • K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. Chang (2020) Retrieval augmented language model pre-training. In International Conference on Machine Learning, pp. 3929–3938. Cited by: §2.
  • D. Hendrycks, C. Burns, S. Basart, A. Critch, J. Li, D. Song, and J. Steinhardt (2020) Aligning ai with shared human values. arXiv preprint arXiv:2008.02275. Cited by: Table 3, §6.3.
  • D. Hilton (1996) Mental models and causal explanation: judgements of probable cause and explanatory relevance. Thinking & Reasoning 2, pp. 273–308. Cited by: §2.
  • J. D. Hwang, C. Bhagavatula, R. L. Bras, J. Da, K. Sakaguchi, A. Bosselut, and Y. Choi (2020) COMET-atomic 2020: on symbolic and neural commonsense knowledge graphs. External Links: 2010.05953 Cited by: §4.1.
  • P. Johnson-Laird (1983) Mental models : towards a cognitive science of language. Harvard University Press. Cited by: §1, §2.
  • N. Kassner and H. Schütze (2020) BERT-knn: adding a knn search component to pretrained language models for better qa. External Links: 2005.00766 Cited by: §8.
  • U. Khandelwal, O. Levy, D. Jurafsky, L. Zettlemoyer, and M. Lewis (2019) Generalization through memorization: nearest neighbor language models. arXiv preprint arXiv:1911.00172. Cited by: §8.
  • D. Khashabi, S. Min, T. Khot, A. Sabharwal, O. Tafjord, P. Clark, and H. Hajishirzi (2020) UnifiedQA: crossing format boundaries with a single QA system. In EMNLP, Cited by: §5.1.
  • W. Ko, T. Chen, Y. Huang, G. Durrett, and J. J. Li (2020) Inquisitive question generation for high level text comprehension. In EMNLP, Cited by: §2.
  • M. Chen, M. D’Arcy, A. Liu, J. Fernandez, and D. Downey (2019) CODAH: an adversarially-authored question answering dataset for common sense. NAACL HLT 2019, pp. 63. Cited by: Table 3.
  • M. Minsky (1975) A framework for representing knowledge. In The Psychology of Computer Vision. Note: (Reprinted from MIT-AI Lab Memo 306, June 1974) Cited by: §2.
  • N. Mostafazadeh, I. Misra, J. Devlin, M. Mitchell, X. He, and L. Vanderwende (2016) Generating natural questions about an image. In ACL, Cited by: §2.
  • E. T. Mueller, M. G. Dyer, et al. (1985) Daydreaming in humans and computers.. In IJCAI, pp. 278–280. Cited by: §3.1.
  • E. T. Mueller (1990) Daydreaming in humans and machines: a computer model of the stream of thought. Intellect Books. Cited by: §3.1.
  • D.A. Norman (1990) The design of everyday things. A currency book, Doubleday. External Links: ISBN 9780385267748, LCCN 89048989, Link Cited by: §3.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, pp. 140:1–140:67. Cited by: §5.1.
  • H. Rashkin, A. Bosselut, M. Sap, K. Knight, and Y. Choi (2018) Modeling naive psychology of characters in simple commonsense stories. External Links: 1805.06533 Cited by: item 1).
  • M. Sap, H. Rashkin, D. Chen, R. LeBras, and Y. Choi (2019) Socialiqa: commonsense reasoning about social interactions. arXiv preprint arXiv:1904.09728. Cited by: Table 3.
  • V. Shwartz, P. West, R. L. Bras, C. Bhagavatula, and Y. Choi (2020a) Unsupervised commonsense question answering with self-talk. arXiv preprint arXiv:2004.05483. Cited by: 1st item.
  • V. Shwartz, P. West, R. L. Bras, C. Bhagavatula, and Y. Choi (2020b) Unsupervised commonsense question answering with self-talk. ArXiv abs/2004.05483. Cited by: §2.
  • O. Tafjord and P. Clark (2021) General-purpose question-answering with macaw. arXiv preprint arXiv:2109.02593. Cited by: §1, §3, §5.1.
  • W. Yang, Y. Xie, A. Lin, X. Li, L. Tan, K. Xiong, M. Li, and J. Lin (2019) End-to-end open-domain question answering with BERTserini. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), Minneapolis, Minnesota, pp. 72–77. External Links: Document, Link Cited by: §2.

Appendix A Crowdsourcing Instructions for Estimating the Quality of Mental Models