SocialIQA: Commonsense Reasoning about Social Interactions

04/22/2019 ∙ by Maarten Sap, et al. ∙ University of Washington, Allen Institute for Artificial Intelligence

We introduce SocialIQa, the first large-scale benchmark for commonsense reasoning about social situations. This resource contains 45,000 multiple-choice questions for probing *emotional* and *social* intelligence in a variety of everyday situations (e.g., Q: "Skylar went to Jan's birthday party and gave her a gift. What does Skylar need to do before this?" A: "Go shopping"). Through crowdsourcing, we collect commonsense questions along with correct and incorrect answers about social interactions, using a new framework that mitigates stylistic artifacts in incorrect answers by asking workers to provide the right answer to the wrong question. While humans can easily solve these questions (90%), question-answering (QA) models, such as those based on pretrained language models, still struggle (77%). We further show that SocialIQa enables transfer learning of commonsense knowledge, achieving state-of-the-art performance on several commonsense reasoning tasks (Winograd Schemas, COPA).


1 Introduction

Social and emotional intelligence enables humans to reason about others’ mental states and predict their behavior Ganaie and Mudasir (2015). For example, when someone spills food all over the floor, we can infer that they will likely want to clean up the mess, rather than taste the food off the floor or run around in the mess (Figure 1, middle). This example illustrates how Theory of Mind, i.e., the ability to reason about the emotions and behavior of others, enables humans to navigate social situations ranging from simple conversations with friends to complex negotiations in courtroom settings Apperly (2010).

Figure 1: Three context-question-answers triples from SocialIQa, along with the type of reasoning required to answer them. In the top example, humans can trivially infer that Tracy pressed upon Austin because there was no room in the elevator. Similarly, in the bottom example, commonsense knowledge tells us that people typically root for the hero, not the villain.

While humans trivially acquire and develop such social reasoning skills Moore (2013), this is still a challenge for machine learning models, in part due to the lack of large-scale resources to train and evaluate modern AI systems’ social and emotional intelligence. Although recent advances in pretraining large language models have yielded promising improvements on several commonsense tasks, these models still struggle to reason about social situations, as shown in this and previous work Davis and Marcus (2015); Nematzadeh et al. (2018); Talmor et al. (2019). This is partly due to language models being trained on written text corpora, where reporting bias limits the scope of commonsense knowledge that can be learned Gordon and Van Durme (2013); Lucy and Gauthier (2017).

In this work, we introduce Social Intelligence QA (SocialIQa), the first large-scale resource to learn and measure social and emotional intelligence in computational models (available at http://tinyurl.com/socialiqa). SocialIQa contains 45k multiple choice questions regarding the commonsense implications of everyday social events (see Figure 1). To collect this data, we design a crowdsourcing framework to gather contexts and questions that explicitly address social commonsense reasoning. Additionally, by combining handwritten negative answers with adversarial question-switched answers (Section 3.3), we minimize annotation artifacts that can arise from crowdsourcing incorrect answers Schwartz et al. (2017); Gururangan et al. (2018).

Human performance on SocialIQa is high (90%). However, this dataset remains challenging for AI systems, with our best performing baseline reaching 77.0% (BERT-large). We further establish SocialIQa as a resource that enables transfer learning for other commonsense challenges, through sequential finetuning of a pretrained language model on SocialIQa before other tasks. Specifically, we use SocialIQa to set a new state of the art on three commonsense challenge datasets: COPA Roemmele et al. (2011) (84.4%), the original Winograd Schema Challenge (Levesque, 2011) (72.9%), and the extended Winograd dataset from Rahman and Ng (2012) (86.1%).

Our contributions are as follows: (1) We create SocialIQa, the first large-scale QA dataset aimed at testing social and emotional intelligence, containing over 45k QA pairs. (2) We introduce question-switching, a technique to collect incorrect answers that minimizes stylistic artifacts due to annotator cognitive biases. (3) We establish baseline performance on our dataset, with BERT-large performing at 77.0%, well below human performance. (4) We achieve new state-of-the-art accuracies on COPA and Winograd through sequential finetuning on SocialIQa, which implicitly endows models with social commonsense knowledge.

SocialIQa # QA tuples:   train 33,871 | dev 5,369 | test 5,571 | total 44,811

Train statistics:
                          context   question   answers (all)   answers (correct)   answers (incorrect)
Average # tokens           14.02      6.12          3.63              3.67                 3.61
Unique # tokens           16,088     1,165        12,993             7,620               11,111

Average frequency of answers:   correct 1.37 | incorrect 1.47

Table 1: Data statistics for SocialIQa.

2 Task description

SocialIQa aims to measure the social and emotional intelligence of computational models through multiple choice question answering (QA). In our setup, models are confronted with a question explicitly pertaining to an observed context, where the correct answer can be found among three competing options.
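
To make this setup concrete, a single instance can be represented as a context, a question, and three candidate answers, only one of which is correct. The minimal sketch below reuses the Skylar example from the abstract; the two distractor answers are invented here for illustration and are not taken from the dataset.

```python
# A SocialIQa-style instance: one context, one question, three answer
# candidates, and the index of the correct one.
# The two distractors below are illustrative only, not from the dataset.
example = {
    "context": "Skylar went to Jan's birthday party and gave her a gift.",
    "question": "What does Skylar need to do before this?",
    "answers": ["Go shopping", "Thank Jan for the gift", "Leave the party"],
    "label": 0,  # index of the correct answer
}
```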

By design, the questions require inferential reasoning about the social causes and effects of situations, in line with the type of intelligence required for an AI assistant to interact with human users (e.g., knowing to call for help when an elderly person falls; Pollack, 2005). As seen in Figure 1, correctly answering questions requires reasoning about motivations, emotional reactions, or likely preceding and following actions. Performing these inferences is what makes us experts at navigating social situations, and is closely related to Theory of Mind, i.e., the ability to reason about the beliefs, motivations, and needs of others Baron-Cohen et al. (1985). Theory of Mind is well developed in most neurotypical adults Ganaie and Mudasir (2015), but can be influenced by age, culture, or developmental disorders Korkmaz (2011). Endowing machines with this type of intelligence has been a longstanding but elusive goal of AI Gunning (2018).

ATOMIC

As a starting point for our task creation, we draw upon social commonsense knowledge from ATOMIC Sap et al. (2019) to seed our contexts and question types. ATOMIC is a large knowledge graph that contains inferential knowledge about the causes and effects of 24k short events. Each triple in ATOMIC consists of an event phrase with person-centric variables, one of nine inference dimensions, and an inference object (e.g., “PersonX pays for PersonY ___”, “xAttr”, “generous”).

Given this base, we generate natural language contexts that represent specific instantiations of the event phrases found in the knowledge graph. The questions we create probe the commonsense reasoning required to navigate such contexts. Critically, since these contexts are based on ATOMIC, they explore a diverse range of motivations and reactions, as well as likely preceding and following actions.

3 Dataset creation

SocialIQa contains 44,811 multiple choice questions with three answer choices per question. Questions and answers are gathered through three phases of crowdsourcing aimed at collecting the context, the question, and a set of positive and negative answers. We run crowdsourcing tasks on Amazon Mechanical Turk (MTurk) to create each of these three components, as described below.

3.1 Event Rewriting

In order to cover a variety of social situations, we use the base events from ATOMIC as prompts for context creation. As a pre-processing step, we run an MTurk task that asks workers to turn an ATOMIC event (e.g., “PersonX spills ___ all over the floor”) into a sentence by adding names, fixing potential grammar errors, and filling in placeholders (e.g., “Alex spilled food all over the floor.”). This task paid $0.35 per event.

Figure 2: Question-Switching Answers (QSA) are collected as the correct answers to the wrong question that targets a different type of inference (here, reasoning about what happens before instead of after an event).

3.2 Context, Question, & Answer Creation

Next, we run a task where annotators create full context-question-answers triples. We automatically generate question templates covering the nine commonsense inference dimensions in ATOMIC (no template is generated if the ATOMIC dimension is annotated as “none”). Crowdsourcers are prompted with an event sentence and an inference question, which they turn into a more detailed context (e.g., “Alex spilled food all over the floor and it made a huge mess.”; workers were asked to contribute a context 7-25 words longer than the event sentence) and an edited version of the question if needed for improved specificity (e.g., “What will Alex want to do next?”). Workers are also asked to contribute two potential correct answers.
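
As an illustration of this templating step, the sketch below maps the nine ATOMIC inference dimensions to question templates. The wordings are paraphrased from examples appearing in this paper, not copied from the actual annotation interface, so treat them as hypothetical.

```python
# Hypothetical question templates keyed by the nine ATOMIC inference
# dimensions; workers could then edit the generated question for specificity.
QUESTION_TEMPLATES = {
    "xIntent": "Why did {X} do this?",
    "xNeed":   "What does {X} need to do before this?",
    "xAttr":   "How would you describe {X}?",
    "xEffect": "What will happen to {X}?",
    "oEffect": "What will happen to others?",
    "xReact":  "How would {X} feel afterwards?",
    "oReact":  "How would others feel as a result?",
    "xWant":   "What will {X} want to do next?",
    "oWant":   "What will others want to do next?",
}

def make_question(dimension: str, subject: str) -> str:
    """Instantiate a question template for a given dimension and subject."""
    return QUESTION_TEMPLATES[dimension].format(X=subject)

print(make_question("xWant", "Alex"))  # -> "What will Alex want to do next?"
```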

Figure 3: SocialIQa contains several question types which cover different types of inferential reasoning. Question types are based on ATOMIC inference dimensions.

3.3 Negative Answers

In addition to correct answers, we collect four incorrect answer options, of which we filter out two. To create incorrect options that are adversarial for models but easy for humans, we use two different approaches to the collection process. These two methods are specifically designed to avoid different types of annotation artifacts, making it more difficult for models to rely on data biases. We integrate and filter the answer options and validate the final QA tuples with human rating tasks.

Handwritten Incorrect Answers (HIA)

The first method involves eliciting handwritten incorrect answers that require reasoning about the context. These answers are handwritten to be similar to the correct answers in terms of topic, length, and style but are subtly incorrect. Two of these answers are collected during the same MTurk task as the original context, questions, and correct answers. We will refer to these negative responses as handwritten incorrect answers (HIA).

Question-Switching Answers (QSA)

We collect a second set of negative (incorrect) answer candidates by switching the questions asked about the context, as shown in Figure 2. We do this to avoid cognitive biases and annotation artifacts in the answer candidates, such as those caused by writing incorrect answers or negations Schwartz et al. (2017); Gururangan et al. (2018). In this crowdsourcing task, we provide the same context as the original question, along with a question automatically generated from a different but similar ATOMIC dimension (using the following three groupings of dimensions: {xWant, oWant, xNeed, xIntent}, {xReact, oReact, xAttr}, and {xEffect, oEffect}), and ask workers to write two correct answers. We refer to these negative responses as question-switching answers (QSA).

By including correct answers to a different question about the same context, we ensure that these adversarial responses have the stylistic qualities of correct answers and strongly relate to the context topic, while still being incorrect, making it difficult for models to simply perform pattern-matching. To verify this, we compare sentiment distributions across answer types, computed using NRC Canada’s EmoLex (Mohammad and Turney, 2013). We show effect sizes (Cohen’s d) of the differences in sentiment averages in Table 2. Indeed, QSA and correct answers differ less than HIA answers, as evidenced by small effect sizes (a Cohen’s d of 0.2 or below is conventionally considered small; Sawilowsky, 2009).

                 Positive sentiment        Negative sentiment
                   HIA        QSA            HIA        QSA
correct           0.305      0.113         -0.428     -0.045
HIA                n/a      -0.203           n/a       0.390

Table 2: Effect sizes (Cohen’s d) when comparing average positive and negative sentiment values of different answer types (d > 0 indicates the mean sentiment of the row label was higher than that of the column label). Effect sizes are much smaller when comparing QSA and correct answers, indicating that those answer types are more similar tonally.
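
The effect sizes in Table 2 compare mean sentiment scores of two answer pools. A minimal sketch of Cohen's d using the standard pooled-standard-deviation formulation (assumed here; the paper does not spell out the exact variant) is:

```python
import math

def cohens_d(xs, ys):
    """Cohen's d between two samples, using the pooled standard deviation."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    pooled_sd = math.sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
    return (mx - my) / pooled_sd

# e.g., d between per-answer positive-sentiment scores of correct vs. QSA answers:
# d = cohens_d(pos_sent_correct, pos_sent_qsa)
```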

3.4 QA Tuple Creation

As the final step of the pipeline, we aggregate the data into three-way multiple choice questions. For each context-question pair contributed by crowd workers, we select a random correct answer and the incorrect answers that are least entailed by the correct one, drawing inspiration from Zellers et al. (2019). We then validate our QA tuples through a multiple-choice crowdsourcing task in which three workers are asked to select the right answer to the question provided; agreement on this task was high (Cohen's κ = .70). After discarding tuples where the three workers each selected a different answer (2% of tuples), our final dataset contains questions for which the correct answer was determined by human majority voting, with which individual workers had 90% agreement. By design, this human-majority vote is 100% correct.
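
The paper does not specify which entailment scorer was used to pick the retained negatives, so the sketch below assumes a generic, hypothetical `entailment_score(premise, hypothesis)` function and simply keeps the candidates least entailed by the chosen correct answer.

```python
def select_negatives(correct_answer, candidates, entailment_score, k=2):
    """Keep the k candidate answers least entailed by the correct answer.

    `entailment_score` is an assumed helper returning a higher value when
    `hypothesis` is more strongly entailed by `premise`; its implementation
    is not specified in the paper.
    """
    scored = [(entailment_score(correct_answer, cand), cand) for cand in candidates]
    scored.sort(key=lambda pair: pair[0])  # ascending: least entailed first
    return [cand for _, cand in scored[:k]]
```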

3.5 Data Statistics

To keep contexts separate across train/dev/test sets, we assign each SocialIQa context to the same partition as the ATOMIC event it was based on. As shown in Table 1 (top), this yields a total of 34k training, 5.4k dev, and 5.6k test tuples. We additionally include statistics on word counts and vocabulary of the training data, reporting, for correct and incorrect answers: average token length, number of unique tokens, and the number of times a unique answer appears in the dataset. Note that, due to our three-way multiple choice setup, there are twice as many incorrect answers as correct answers, which influences these statistics.

We also include a breakdown across question types (Figure 3), which we derive from ATOMIC inference dimensions; we group agent and theme dimensions together (e.g., “xReact” and “oReact” become the “reactions” question type). In general, questions relating to what someone will feel afterwards or what they will likely do next are more common in SocialIQa. Conversely, questions pertaining to (potentially involuntary) effects of situations on people are less frequent.

4 Methods

We establish baseline performance on SocialIQa using large pretrained language models based on the Transformer architecture Vaswani et al. (2017). Specifically, we finetune OpenAI-GPT Radford et al. (2018) and BERT Devlin et al. (2019), which have both shown remarkable improvements on a variety of tasks. OpenAI-GPT is a uni-directional language model trained on the BookCorpus Zhu et al. (2015), whereas BERT is a bidirectional language model trained on both the BookCorpus and English Wikipedia. As in previous work, we finetune the language model representations but fully learn the classifier-specific parameters described below.

Multiple choice classification

To classify sequences using these language models, we follow the multiple-choice setup of the respective authors, as described below. First, we concatenate the context, question, and answer, using the model-specific separator tokens. For OpenAI-GPT, the format is _start_ <context> _delimiter_ <question> _delimiter_ <answer> _classify_, where _start_, _delimiter_, and _classify_ are special function tokens. For BERT, the format is similar, but the classifier token comes before the context: [CLS] <context> [SEP] <question> [SEP] <answer> [SEP].
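
A minimal sketch of the two input formats described above is given below. The special tokens are taken verbatim from the description; actual preprocessing would go through each model's tokenizer rather than raw string concatenation.

```python
def format_gpt(context: str, question: str, answer: str) -> str:
    # OpenAI-GPT: _start_ <context> _delimiter_ <question> _delimiter_ <answer> _classify_
    return f"_start_ {context} _delimiter_ {question} _delimiter_ {answer} _classify_"

def format_bert(context: str, question: str, answer: str) -> str:
    # BERT: [CLS] <context> [SEP] <question> [SEP] <answer> [SEP]
    return f"[CLS] {context} [SEP] {question} [SEP] {answer} [SEP]"
```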

For each triple, we then compute a score $\ell$ by passing the hidden representation of the classifier token, $\mathbf{h} \in \mathbb{R}^{H}$, through an MLP:

$\ell = W_2 \tanh(W_1 \mathbf{h} + b_1)$

where $W_1 \in \mathbb{R}^{H \times H}$, $b_1 \in \mathbb{R}^{H}$, and $W_2 \in \mathbb{R}^{1 \times H}$. Finally, we normalize scores across all triples for a given context-question pair using a softmax layer. The model’s predicted answer corresponds to the triple with the highest probability.
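
A minimal PyTorch sketch of such a scoring head is shown below: a one-hidden-layer MLP over the classifier-token representation, followed by a softmax over the three candidate answers. The exact layer shapes and activation in the authors' released code may differ.

```python
import torch
import torch.nn as nn

class MultipleChoiceHead(nn.Module):
    """Scores each (context, question, answer) triple from its classifier-token
    hidden state, then normalizes across the answer candidates."""

    def __init__(self, hidden_size: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, cls_hidden_states: torch.Tensor) -> torch.Tensor:
        # cls_hidden_states: (batch, num_choices, hidden_size)
        scores = self.mlp(cls_hidden_states).squeeze(-1)  # (batch, num_choices)
        return torch.softmax(scores, dim=-1)              # probability per answer

# predicted answer index = probs.argmax(dim=-1)
```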

Model         Dev accuracy   Test accuracy
GPT               72.9            72.5
BERT-base         75.1            74.9
BERT-large        77.2            77.0
Human            100             100

Table 3: Experimental results.
(1) Context: Casey achieved their dream and was made the first female president of the class.
    Question: What will Casey want to do next?
    Answers: (a) convince the student body to vote for her  (b) be power-hungry  (c) serve the student’s interests

(2) Context: Sydney took part in the team ritual and rubbed Austin’s head for good luck before the game.
    Question: What will Sydney want to do next?
    Answers: (a) win the game  (b) go back home  (c) lost the contest

(3) Context: Taylor asked the teacher for an extension on the paper and subsequently received one.
    Question: How would others feel as a result?
    Answers: (a) thankful to the teacher  (b) happy to have more time  (c) that they missed out on an opportunity

(4) Context: Although Aubrey was older and stronger, they lost to Alex in arm wrestling.
    Question: How would Alex feel as a result?
    Answers: (a) they need to practice more  (b) ashamed  (c) boastful

Table 4: Example context-question-answer triples from the SocialIQa dev set for which BERT-large made the wrong prediction. Examples (1) and (2) illustrate the model choosing answers that might have happened before, or much later after, the context, as opposed to right after the context situation. In examples (3) and (4), the model chooses answers that may apply to people other than the ones being asked about.
Figure 4: Dev accuracy when training BERT-large with varying numbers of training examples (multiple runs per training-set size). The model plateaus around 20k examples, yielding performance comparable to training on all 34k examples.
Figure 5: Average dev accuracy of BERT-large on different question types. While questions about needs and motivations are easier, the model still finds effects and descriptions more challenging.

5 Experiments

5.1 Experimental Set-up

We train our models on the 34k SocialIQa training instances, selecting hyperparameters based on the best performing model on our dev set, for which we then report test results. Specifically, we perform finetuning through a grid search over hyper-parameter settings (learning rate, batch size, and number of epochs) and report the maximum performance.
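
The grid values below are illustrative only (the specific settings searched are not reproduced here), and `finetune` and `evaluate` are assumed helpers; the sketch simply mirrors the procedure described above of training one model per configuration and keeping the best on dev.

```python
import itertools

# Illustrative grid only; the actual values searched are not listed here.
learning_rates = [1e-5, 2e-5, 3e-5]
batch_sizes = [4, 8]
num_epochs = [3, 4]

best = {"dev_acc": 0.0, "config": None}
for lr, bs, epochs in itertools.product(learning_rates, batch_sizes, num_epochs):
    model = finetune(train_data, lr=lr, batch_size=bs, epochs=epochs)  # assumed helper
    acc = evaluate(model, dev_data)                                    # assumed helper
    if acc > best["dev_acc"]:
        best = {"dev_acc": acc, "config": (lr, bs, epochs)}
```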

Models used in our experiments vary in size: OpenAI-GPT (117M parameters) has a hidden size of 768, BERT-base (110M parameters) a hidden size of 768, and BERT-large (340M parameters) a hidden size of 1,024.

We train using the PyTorch Paszke et al. (2017) implementation by HuggingFace (https://github.com/huggingface/pytorch-pretrained-BERT).

5.2 Results

Our results (Table 3) suggest SocialIQa is still a challenging benchmark for existing computational models, compared to human performance. Our best performing model, BERT-large, outperforms other models by several points, both on the dev and test sets.

Learning Curve

To better understand the effect of dataset scale on model performance on our social commonsense task, we simulate training situations with limited knowledge. We present the learning curve of BERT-large’s performance on the dev set as it is trained on more training set examples (Figure 4). Although the model does significantly improve over a random baseline of 33% with only a few hundred examples, the performance only starts to converge after around 20k examples, providing evidence that large scale benchmarks are required for this type of reasoning.
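
A sketch of this learning-curve procedure is below, under the assumption that each point retrains from the pretrained checkpoint on a random subset of the training data; the subset sizes shown are illustrative, and `finetune` and `evaluate` are the assumed helpers from the previous sketch.

```python
import random

# Train on increasingly large random subsets and record dev accuracy,
# averaging over multiple runs per training-set size (as in Figure 4).
sizes = [500, 1000, 5000, 10000, 20000, len(train_data)]  # illustrative sizes
curve = {}
for n in sizes:
    accs = []
    for seed in range(3):
        random.seed(seed)
        subset = random.sample(train_data, n)
        model = finetune(subset)              # assumed helper
        accs.append(evaluate(model, dev_data))  # assumed helper
    curve[n] = sum(accs) / len(accs)
```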

Error Analysis

We include a breakdown of our best model’s performance on various question types in Figure 5, and show specific examples that it gets wrong in Table 4. Overall, questions related to pre-conditions of the context (people’s motivations, actions needed before the context) are less challenging for the model. Conversely, the model seems to struggle more with questions relating to (potentially involuntary) effects, stative descriptions, and what people will want to do next.

In examples (1) and (2) of Table 4, the model selects answers that are incorrectly timed with respect to the context and question (e.g., “convincing the student body to vote for her” is something Casey likely did before being elected, not afterwards). Additionally, the model tends to choose answers relating to a person other than the one asked about. In (4), after the arm wrestling, it is likely that Aubrey will feel ashamed, but the question asks what Alex might feel, not Aubrey.

This illustrates how challenging social situations can be for models to make nuanced inferences about, compared to humans who can trivially reason about the causes and effects for multiple participants.

Task   Model                      Acc. (%)
                                  best    mean    std
COPA   Sasaki et al. (2017)       71.2
       BERT-large                 80.8    75.0    3.0
       BERT-large+SocialIQa       84.4    81.3    1.9
WSC    Radford et al. (2019)      70.7*
       BERT-large                 67.0    65.5    1.0
       BERT-large+SocialIQa       72.9    68.9    1.6
DPR    Peng et al. (2015)         76.4
       BERT-large                 79.4    71.2    3.8
       BERT-large+SocialIQa       86.1    83.4    1.3

Table 5: Sequential finetuning of BERT-large on SocialIQa before the target task yields state-of-the-art results on COPA and the Winograd Schema Challenge (Roemmele et al., 2011; Levesque, 2011; Rahman and Ng, 2012). For comparison, we include the previously published state-of-the-art performance (* denotes unpublished work).

6 SocialIQa for Transfer Learning

In addition to being the first large-scale benchmark for social commonsense, SocialIQa can also improve performance on downstream tasks that require commonsense, namely the Winograd Schema Challenge and the Choice of Plausible Alternatives task. We improve the state of the art on both tasks by sequentially finetuning on SocialIQa before finetuning on the task itself.

COPA

The Choice of Plausible Alternatives task (COPA; Roemmele et al., 2011) is a two-way multiple choice task which aims to measure commonsense reasoning abilities of models. The dataset contains 1,000 questions (500 dev, 500 test) that ask about the causes and effects of a premise. This has been a challenging task for computational systems, partially due to the limited amount of training data available. As done previously (Goodwin et al., 2012; Luo et al., 2016), we finetune our models on the dev set, and report performance only on the test set.

Winograd Schema

The Winograd Schema Challenge (WSC; Levesque, 2011) is a well-known commonsense knowledge challenge framed as a coreference resolution task. It contains a collection of short sentences in which a pronoun must be resolved to one of two antecedents (e.g., in “The city councilmen refused the demonstrators a permit because they feared violence”, they refers to the councilmen). Because of the limited amount of data in WSC, Rahman and Ng (2012) created a larger corpus of Winograd-style sentence pairs, henceforth referred to as DPR, which has been shown to be slightly less challenging than WSC for computational models.

We evaluate on these two benchmarks. While the DPR dataset is split into train and test sets, the WSC dataset contains only 273 instances, with no train/test split. Therefore, we use the DPR dataset as the training set when evaluating on WSC.

6.1 Sequential Finetuning

We first finetune BERT-large on SocialIQa, which reaches 77% on our dev set (see Table 3). We then finetune that model further on the task-specific datasets, considering the same set of hyperparameters as in §5.1. On each of the test sets, we report best, mean, and standard deviation of all models. We compare results of sequential finetuning (BERT-large+SocialIQa) to a BERT-large baseline.
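
A sketch of this sequential-finetuning recipe, using the same assumed helpers as above, is below; the point is simply the order of the two finetuning stages.

```python
# Stage 1: finetune pretrained BERT-large on SocialIQa (multiple-choice head, Section 4).
model = load_pretrained("bert-large")        # assumed helper
model = finetune(model, socialiqa_train)     # assumed helper

# Stage 2: finetune the resulting model on the small target task
# (COPA dev set, or DPR when evaluating on WSC), searching the same
# hyper-parameter grid as in Section 5.1, then evaluate on its test set.
model = finetune(model, target_task_train)
test_acc = evaluate(model, target_task_test)  # assumed helper
```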

Results

Our results are outlined in Table 5, along with the previously known state-of-the-art performance on each task. Sequential finetuning shows substantial improvements over the BERT-only baseline (maximum-performance increases of between 3.6% and 6.7%), as well as a general increase in performance stability (i.e., lower standard deviations). This suggests that BERT-large benefits from both the large scale and the QA format of the commonsense knowledge in SocialIQa, which it struggles to learn from the small benchmarks alone.

Finally, we find that sequentially finetuned BERT-large+SocialIQa achieves state-of-the-art results on all three tasks, showing improvements over the previous best performing models. Note that, unlike our model, Radford et al. (2019) obtained 70.7% on WSC in a zero-shot setting. Also note that OpenAI-GPT was reported to achieve 78.6% on COPA, but that result was neither published nor discussed in the OpenAI-GPT white paper Radford et al. (2018).

7 Related Work

Commonsense Benchmarks:

Commonsense benchmark creation has been well studied in previous work. Notably, the Winograd Schema Challenge (WSC; Levesque, 2011) and the Choice Of Plausible Alternatives dataset (COPA; Roemmele et al., 2011) are expert-curated collections of commonsense QA pairs that are trivial for humans to solve. Whereas WSC requires physical and social commonsense knowledge to solve, COPA targets knowledge of the causes and effects surrounding social situations. While both benchmarks are high-quality and created by experts, their small scale (150 and 1,000 examples, respectively) poses a challenge for modern modeling techniques, which require many training instances.

More recently, Talmor et al. (2019) introduce CommonsenseQA, containing 12k multiple-choice questions. Crowdsourced using ConceptNet Speer and Havasi (2012), these questions mostly probe knowledge related to factual and physical commonsense (e.g., “Where would I not want a fox?”). In contrast, SocialIQa explicitly separates contexts from questions, and focuses on the types of commonsense inferences humans perform when navigating social situations.

Commonsense Knowledge Bases:

In addition to large-scale benchmarks, there is a wealth of work aimed at creating commonsense knowledge repositories (Speer and Havasi, 2012; Sap et al., 2019; Zhang et al., 2017; Lenat, 1995; Espinosa and Lieberman, 2005; Gordon and Hobbs, 2017) that can be used as resources in downstream reasoning tasks. While SocialIQa is formatted as a natural language QA benchmark, rather than a taxonomic knowledge base, it also can be used as a resource for external tasks, as we have demonstrated experimentally.

Constrained or Adversarial Data Collection:

Various work has investigated ways to circumvent annotation artifacts that result from crowdsourcing. Sharma et al. (2018) extend the Story Cloze data by severely restricting the incorrect story ending generation task, reducing some of the sentiment and negation artifacts. Rajpurkar et al. (2018) create an adversarial version of the extractive question-answering challenge, SQuAD Rajpurkar et al. (2016), by creating 50k unanswerable questions. Instead of using human-generated incorrect answers, Zellers et al. (2018) use adversarial filtering of machine generated incorrect answers to minimize their surface patterns. Our dataset also aims to reduce annotation artifacts by using a multi-stage annotation pipeline in which we collect negative responses from multiple methods including a unique adversarial question-switching technique.

8 Conclusion

We present SocialIQa, the first large-scale benchmark for social commonsense. Consisting of 45k multiple-choice questions, SocialIQa covers various types of inference about people’s actions being described in situational contexts. We design a crowdsourcing framework for collecting QA pairs that reduces stylistic artifacts of negative answers through an adversarial question-switching method. Despite human performance of 90%, computational approaches based on large pretrained language models only achieve accuracies up to 77%, suggesting that these social inferences are still a challenge for AI systems. In addition to providing a new benchmark, we demonstrate how transfer learning from SocialIQa to other commonsense challenges can yield significant improvements, achieving new state-of-the-art performance on both COPA and Winograd Schema Challenge datasets.

References

  • Apperly (2010) Ian Apperly. 2010. Mindreaders: The Cognitive Basis of “Theory of Mind”. Psychology Press.
  • Baron-Cohen et al. (1985) Simon Baron-Cohen, Alan M Leslie, and Uta Frith. 1985. Does the autistic child have a “theory of mind”? Cognition, 21(1):37–46.
  • Davis and Marcus (2015) Ernest Davis and Gary Marcus. 2015. Commonsense reasoning and commonsense knowledge in artificial intelligence. Commun. ACM, 58:92–103.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
  • Espinosa and Lieberman (2005) José H. Espinosa and Henry Lieberman. 2005. Eventnet: Inferring temporal relations between commonsense events. In MICAI.
  • Ganaie and Mudasir (2015) MY Ganaie and Hafiz Mudasir. 2015. A study of social intelligence & academic achievement of college students of district srinagar, j&k, india. Journal of American Science, 11(3):23–27.
  • Goodwin et al. (2012) Travis Goodwin, Bryan Rink, Kirk Roberts, and Sanda M Harabagiu. 2012. Utdhlt: Copacetic system for choosing plausible alternatives. In Proceedings of the First Joint Conference on Lexical and Computational Semantics-Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, pages 461–466. Association for Computational Linguistics.
  • Gordon and Hobbs (2017) Andrew S Gordon and Jerry R Hobbs. 2017. A Formal Theory of Commonsense Psychology: How People Think People Think. Cambridge University Press.
  • Gordon and Van Durme (2013) Jonathan Gordon and Benjamin Van Durme. 2013. Reporting bias and knowledge acquisition. In Proceedings of the 2013 Workshop on Automated Knowledge Base Construction, AKBC ’13, pages 25–30, New York, NY, USA. ACM.
  • Gunning (2018) David Gunning. 2018. Machine common sense concept paper.
  • Gururangan et al. (2018) Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In NAACL-HLT.
  • Korkmaz (2011) Baris Korkmaz. 2011. Theory of mind and neurodevelopmental disorders of childhood. Pediatr Res, 69(5 Pt 2):101R–8R.
  • Lenat (1995) Douglas B Lenat. 1995. Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38.
  • Levesque (2011) Hector J. Levesque. 2011. The winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning.
  • Lucy and Gauthier (2017) Li Lucy and Jon Gauthier. 2017. Are distributional representations ready for the real world? Evaluating word vectors for grounded perceptual meaning. In RoboNLP@ACL.
  • Luo et al. (2016) Zhiyi Luo, Yuchen Sha, Kenny Q Zhu, Seung-won Hwang, and Zhongyuan Wang. 2016. Commonsense causal reasoning between short texts. In Fifteenth International Conference on the Principles of Knowledge Representation and Reasoning.
  • Mohammad and Turney (2013) Saif M. Mohammad and Peter D. Turney. 2013. Crowdsourcing a word-emotion association lexicon. Computational Intelligence, 29(3):436–465.
  • Moore (2013) Chris Moore. 2013. The development of commonsense psychology. Psychology Press.
  • Nematzadeh et al. (2018) Aida Nematzadeh, Kaylee Burns, Erin Grant, Alison Gopnik, and Thomas L. Griffiths. 2018. Evaluating theory of mind in question answering. In EMNLP.
  • Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in pytorch. In NIPS-W.
  • Peng et al. (2015) Haoruo Peng, Daniel Khashabi, and Dan Roth. 2015. Solving hard coreference problems. In HLT-NAACL.
  • Pollack (2005) Martha E. Pollack. 2005. Intelligent technology for an aging population: The use of ai to assist elders with cognitive impairment. AI Magazine, 26:9–24.
  • Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative Pre-Training.
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
  • Rahman and Ng (2012) Altaf Rahman and Vincent Ng. 2012. Resolving complex cases of definite pronouns: The winograd schema challenge. In EMNLP, EMNLP-CoNLL ’12, pages 777–789, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Rajpurkar et al. (2018) Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for squad. arXiv preprint arXiv:1806.03822.
  • Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP.
  • Roemmele et al. (2011) Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning.
  • Sap et al. (2019) Maarten Sap, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A Smith, and Yejin Choi. 2019. Atomic: An atlas of machine commonsense for if-then reasoning. In AAAI.
  • Sasaki et al. (2017) Shota Sasaki, Sho Takase, Naoya Inoue, Naoaki Okazaki, and Kentaro Inui. 2017. Handling multiword expressions in causality estimation. In IWCS.
  • Sawilowsky (2009) Shlomo S. Sawilowsky. 2009. New effect size rules of thumb. Journal of Modern Applied Statistical Methods, 8(2):597–599.
  • Schwartz et al. (2017) Roy Schwartz, Maarten Sap, Ioannis Konstas, Li Zilles, Yejin Choi, and Noah A Smith. 2017. The effect of different writing tasks on linguistic style: A case study of the ROC story cloze task. In CoNLL.
  • Sharma et al. (2018) Rishi Kant Sharma, James Allen, Omid Bakhshandeh, and Nasrin Mostafazadeh. 2018. Tackling the story ending biases in the story cloze test. In ACL.
  • Speer and Havasi (2012) Robert Speer and Catherine Havasi. 2012. Representing general relational knowledge in conceptnet 5. In LREC.
  • Talmor et al. (2019) Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In NAACL.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
  • Zellers et al. (2019) Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. From recognition to cognition: Visual commonsense reasoning. In CVPR.
  • Zellers et al. (2018) Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In EMNLP.
  • Zhang et al. (2017) Sheng Zhang, Rachel Rudinger, Kevin Duh, and Benjamin Van Durme. 2017. Ordinal common-sense inference. Transactions of the Association of Computational Linguistics, 5(1):379–395.
  • Zhu et al. (2015) Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan R. Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 19–27.