Commonsense Reasoning for Natural Language Understanding: A Survey of Benchmarks, Resources, and Approaches

04/02/2019 · Shane Storks et al.

Commonsense knowledge and commonsense reasoning are some of the main bottlenecks in machine intelligence. In the NLP community, many benchmark datasets and tasks have been created to address commonsense reasoning for language understanding. These tasks are designed to assess machines' ability to acquire and learn commonsense knowledge in order to reason and understand natural language text. As these tasks become instrumental and a driving force for commonsense research, this paper aims to provide an overview of existing tasks and benchmarks, knowledge resources, and learning and inference approaches toward commonsense reasoning for natural language understanding. Through this, our goal is to support a better understanding of the state of the art, its limitations, and future challenges.

1 Introduction

Commonsense knowledge and commonsense reasoning play a vital role in all aspects of machine intelligence, from language understanding to computer vision and robotics. A detailed account of the challenges of commonsense reasoning is provided by Davis and Marcus (2015), which spans from difficulties in understanding and formulating commonsense knowledge for specific or general domains to complexities in various forms of reasoning and their integration for problem solving. To move the field forward, as pointed out by Davis and Marcus (2015), there is a critical need for methods that can integrate different modes of reasoning (e.g., symbolic reasoning through deduction and statistical reasoning based on large amounts of data), as well as benchmarks and evaluation metrics that can quantitatively measure research progress.

In the NLP community, recent years have seen a surge of research activities that aim to tackle commonsense reasoning through ever-growing benchmark tasks. These range from earlier textual entailment tasks, e.g., the RTE Challenges [Dagan et al., 2005], to more recent tasks that require a comprehensive understanding of everyday physical and social commonsense, e.g., the Story Cloze Test [Mostafazadeh et al., 2016] or SWAG [Zellers et al., 2018]. An increasing effort has been devoted to extracting commonsense knowledge from existing data (e.g., Wikipedia) or acquiring it directly from crowd workers. Many learning and inference approaches have been developed for these benchmark tasks, ranging from earlier symbolic and statistical approaches to more recent neural approaches. To facilitate quantitative evaluation and encourage broader participation from the community, various leaderboards for these benchmarks have been set up and maintained.

As more and more resources become available for commonsense reasoning in NLP, it is useful for researchers to have a quick grasp of this fast-evolving space. As such, this paper aims to provide an overview of existing benchmarks, knowledge resources, and approaches, and to discuss current limitations and future opportunities. We hope this paper will provide an entry point for those who are not yet familiar with this area but are interested in pursuing research in it.

Figure 1: Main research efforts in commonsense knowledge and reasoning from the NLP community occur in three areas: benchmarks and tasks, knowledge resources, and learning and inference approaches. (Although videos and images are often used for creating various benchmarks, they are not the focus of this paper.)

As shown in Figure 1, ongoing research effort has focused on three main areas. The first, discussed further in Section 2, is the creation of benchmark datasets for various NLP tasks. Different from earlier benchmark tasks that mainly focus on linguistic processing, these tasks are designed so that the solutions may not be obvious from the linguistic context but rather require commonsense knowledge and reasoning. These benchmarks vary in terms of the scope of reasoning, for example, from the specific and focused coreference resolution task in the Winograd Schema Challenge [Levesque, 2011] to broader tasks such as textual entailment, e.g., the RTE Challenges [Dagan et al., 2005]. While some benchmarks focus on a single task, others comprise a variety of tasks, e.g., GLUE [Wang et al., 2018]. Some benchmarks target a specific type of commonsense knowledge, e.g., social psychology in Event2Mind [Rashkin et al., 2018b], while others intend to address a variety of types of knowledge, e.g., bAbI [Weston et al., 2016]. How benchmark tasks are formulated also differs among existing benchmarks. Some are in the form of multiple-choice questions, requiring a binary decision or a selection from a candidate list, while others are more open-ended. Different characteristics of benchmarks serve different goals, and a critical question is what criteria to consider in order to create a benchmark that can support technology development and measurable outcomes of research progress on commonsense reasoning abilities. Section 2 gives a detailed account of existing benchmarks and their common characteristics. It also summarizes important criteria to consider in building these benchmarks.

A parallel ongoing research effort is the population of commonsense knowledge bases. Apart from earlier efforts where knowledge bases were manually created by domain experts, e.g., Cyc [Lenat and Guha, 1989] or WordNet [Fellbaum, 1999], or by crowdsourcing, e.g., ConceptNet [Liu and Singh, 2004], recent work has focused on applying NLP techniques to automatically extract information (e.g., facts and relations) and build knowledge representations such as knowledge graphs. Section 3 gives an introduction to existing commonsense knowledge resources and several recent efforts in building such resources.

To tackle these benchmark tasks, many computational models have been developed, from earlier symbolic and statistical approaches to more recent approaches that learn deep neural models from training data. These models are often augmented with external data or knowledge resources, or with pre-trained language representations, e.g., BERT [Devlin et al., 2018]. Section 4 summarizes the state-of-the-art approaches and discusses their performance and limitations.

It is worth pointing out that, as commonsense reasoning plays an important role in almost every aspect of machine intelligence, recent years have also seen an increasing effort in related disciplines such as computer vision, robotics, and the intersection of language, vision, and robotics. This paper focuses only on developments in the NLP field without covering these related areas. Nevertheless, the current practice, progress, and limitations in NLP may also provide insight to these related disciplines, and vice versa.

2 Benchmarks and Tasks

The NLP community has a long history of creating benchmarks to facilitate algorithm development and evaluation for language processing tasks, e.g., named entity recognition, coreference resolution, and question answering. Although it is often the case that some type of commonsense reasoning may be required to reach oracle performance on these tasks, earlier benchmarks mainly targeted approaches that apply linguistic context to solve them. As significant progress has been made on these earlier benchmarks, recent years have seen a shift toward benchmark tasks that go beyond the use of linguistic context and instead require commonsense knowledge and reasoning. For instance, consider this question from the Winograd Schema Challenge [Levesque, 2011]: "The trophy would not fit in the brown suitcase because it was too big. What was too big?" To answer this question, linguistic constraints alone cannot resolve whether it refers to the trophy or the brown suitcase. Only with commonsense knowledge (i.e., an object must be bigger than another object in order to contain it) is it possible to resolve the pronoun it and answer the question correctly. It is the goal of this paper to survey these kinds of recent benchmarks, which are geared toward the use of commonsense reasoning for NLP tasks beyond linguistic discourse.

In this section, we first give an overview of recent benchmarks that address commonsense reasoning in NLP. We then summarize important considerations in creating these benchmarks and lessons learned from ongoing research.

2.1 An Overview of Existing Benchmarks

Since the Recognizing Textual Entailment (RTE) Challenges introduced by Dagan et al. (2005) in the early 2000s, there has been an explosion of commonsense-directed benchmarks. Figure 2 shows this growth among the benchmarks introduced in this section. The RTE Challenges, which were inspired by difficulties encountered in semantic processing for machine translation, question answering, and information extraction [Dagan et al., 2005], had been the dominant reasoning task for many years. Since 2013, however, there has been a surge of new and varied benchmarks. A majority of them (e.g., those released in 2018) provide more than 10,000 instances, aiming to facilitate the development of machine learning approaches.

Figure 2: Since the early 2000s, there has been a surge of benchmark tasks geared towards commonsense reasoning for language understanding. In 2018, we saw the creation of more benchmarks of larger sizes than ever before.

Many commonsense benchmarks are based upon classic language processing problems. The scope of these benchmark tasks ranges from more focused tasks, such as coreference resolution and named entity recognition, to more comprehensive tasks and applications, such as question answering and textual entailment. More focused tasks tend to be useful in creating component technology for NLP systems, building upon each other toward more comprehensive tasks.

Meanwhile, rather than restricting tasks by the types of language processing skills required to perform them (a common characteristic of earlier benchmarks), recent benchmarks are more commonly geared toward particular types of commonsense knowledge and reasoning. Some benchmark tasks focus on a single commonsense reasoning process, e.g., temporal reasoning, requiring a small amount of commonsense knowledge, while others focus on entire domains of knowledge, e.g., social psychology, thus requiring a larger set of related reasoning skills. Further, some benchmarks include a more comprehensive mixture of everyday commonsense knowledge, demanding a more complete commonsense reasoning skill set.

Next, we give a review of widely used benchmarks, introduced by the following groupings: coreference resolution, question answering, textual entailment, plausible inference, psychological reasoning, and multiple tasks. These groupings are not necessarily exclusive, but they highlight recent trends in commonsense benchmarking. While all tasks’ training and/or development data are available for free download, it is important to note that test data are often not distributed publicly so that testing of systems can occur privately in a standard, unbiased way.

2.1.1 Coreference Resolution

One particularly fundamental focused task is coreference resolution, i.e., determining which entity or event in a text a particular pronoun refers to. The task becomes challenging when a sentence contains multiple entities alongside pronouns; such ambiguities significantly complicate resolution and create a need for commonsense knowledge to inform decisions [Levesque, 2011]. Even when informed by a corpus or knowledge graph, current coreference resolution systems are plagued by data bias, e.g., gender bias in nouns for occupations [Rudinger et al., 2018a]. Since challenging examples in commonsense-informed coreference resolution are typically handcrafted or handpicked [Morgenstern et al., 2016, Rahman and Ng, 2012, Rudinger et al., 2018a], relatively little data exists for this skill, and it remains an unsolved problem [Rahman and Ng, 2012, Davis et al., 2017]. Consequently, there is a need for more benchmarks here. In the following paragraphs, we introduce the few benchmarks and tasks that currently exist for coreference resolution.

Winograd Schema Challenge.

The classic coreference resolution benchmark is the Winograd Schema Challenge (WSC), inspired by Winograd (1972), originally proposed by Levesque (2011), later developed by Levesque et al. (2012), and actualized by Morgenstern and Ortiz (2015) and Morgenstern et al. (2016). In this challenge, systems are presented with questions about sentences known as Winograd schemas. To answer a question, a system must disambiguate a pronoun whose coreferent may be one of two entities, and whose resolution can be changed by replacing a single word in the sentence. The first Challenge dataset consisted of just 60 test examples [Morgenstern, 2016], but more are available online (see http://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS.html, http://commonsensereasoning.org/winograd.html, and http://commonsensereasoning.org/disambiguation.html). Additionally, Rahman and Ng (2012) curated nearly 1,000 similar pronoun resolution problems to aid in training systems for the WSC; this set of problems is included within a later-mentioned dataset, and thus is not introduced separately. Further information about the Winograd Schema Challenge is available at http://commonsensereasoning.org/winograd.html.
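
To make the structure of a Winograd schema concrete, the following sketch (in Python; the field names are illustrative rather than any official release format) encodes the trophy/suitcase example as a pair of twin sentences that differ in a single special word, which flips the correct antecedent.

    from dataclasses import dataclass

    @dataclass
    class WinogradSchema:
        """One Winograd schema: twin sentences differing in a single word.

        Field names are illustrative only, not an official data format.
        """
        sentence: str          # sentence containing the ambiguous pronoun
        pronoun: str           # the pronoun to resolve
        candidates: tuple      # the two candidate antecedents
        special_word: str      # word whose replacement flips the answer
        alternate_word: str    # replacement yielding the twin sentence
        answer: str            # correct antecedent for `sentence`
        alternate_answer: str  # correct antecedent for the twin sentence

    schema = WinogradSchema(
        sentence="The trophy would not fit in the brown suitcase because it was too big.",
        pronoun="it",
        candidates=("the trophy", "the suitcase"),
        special_word="big",
        alternate_word="small",
        answer="the trophy",
        alternate_answer="the suitcase",
    )

    # The twin sentence is obtained by swapping the special word; every surface
    # cue stays identical, so resolving the pronoun requires commonsense about
    # relative sizes rather than linguistic context.
    twin = schema.sentence.replace(schema.special_word, schema.alternate_word)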

Other coreference resolution tasks.

Other coreference resolution tasks include the basic and compound coreference tasks within bAbI [Weston et al., 2016], the definite pronoun resolution problems by Rahman and Ng (2012) within Inference is Everything [White et al., 2017], the Winograd NLI task within GLUE [Wang et al., 2018], and the gendered anaphora-based Winogender task, originally by Rudinger et al. (2018a), within DNC [Poliak et al., 2018a]. The full benchmarks to which these tasks belong are introduced in Section 2.1.6, while in Figure 3, we list several examples of challenging coreference resolution problems across existing benchmarks, where commonsense knowledge, such as the fact that predators eat their prey, is useful in disambiguating pronouns.

  (a) Winograd Schema Challenge [Levesque, 2011]
    The trophy would not fit in the brown suitcase because it was too big. What was too big?

    a. The trophy *
    b. The suitcase

    The trophy would not fit in the brown suitcase because it was too small. What was too small?

    a. The trophy
    b. The suitcase *

  (b) Winogender [Rudinger et al., 2018a]
    The paramedic performed CPR on the passenger even though she knew it was too late. Who knew it was too late?

    a. The paramedic *
    b. The passenger

  (c) [Rahman and Ng, 2012]
    Lions eat zebras because they are predators. Who are predators?

    a. Lions *
    b. Zebras

Figure 3: Examples from existing coreference resolution benchmark tasks. Correct answers are marked with an asterisk.

2.1.2 Question Answering

Instead of providing one or more focused language processing or reasoning tasks, many benchmarks provide a more comprehensive mix of language processing and reasoning skills within a single task. Question answering (QA) is one such comprehensive task. QA is a fairly well-investigated area of artificial intelligence, and there are many existing benchmarks to support work here; however, not all of them require commonsense to solve [Ostermann et al., 2018]. Some QA benchmarks which directly address commonsense are MCScript [Ostermann et al., 2018], CoQA [Reddy et al., 2018], OpenBookQA [Mihaylov et al., 2018], and ReCoRD [Zhang et al., 2018], all of which contain questions requiring commonsense knowledge alongside questions requiring comprehension of a given text. Examples of such questions are listed in Figure 4; here, the commonsense facts that diapers are typically thrown away and that steel is a metal are particularly useful in answering the questions. Other QA benchmarks indirectly address commonsense by demanding advanced reasoning processes best informed by commonsense. SQuAD 2.0 is one such example, as it includes unanswerable questions about passages [Rajpurkar et al., 2018], which may take outside knowledge to identify. In the following paragraphs, we introduce all surveyed QA benchmarks.

  (a) MCScript [Ostermann et al., 2018]
    Did they throw away the old diaper?

    a. Yes, they put it into the bin. *
    b. No, they kept it for a while.

  (b) OpenBookQA [Mihaylov et al., 2018]
    Which of these would let the most heat travel through?

    a. a new pair of jeans.
    b. a steel spoon in a cafeteria. *
    c. a cotton candy at a store.
    d. a calvin klein cotton hat.

    Evidence: Metal is a thermal conductor.

  (c) CoQA [Reddy et al., 2018]
    The Virginia governor’s race, billed as the marquee battle of an otherwise anticlimactic 2013 election cycle, is shaping up to be a foregone conclusion. Democrat Terry McAuliffe, the longtime political fixer and moneyman, hasn’t trailed in a poll since May. Barring a political miracle, Republican Ken Cuccinelli will be delivering a concession speech on Tuesday evening in Richmond. In recent …

    Who is the democratic candidate?
    Terry McAuliffe
    Evidence: Democrat Terry McAuliffe

    Who is his opponent?
    Ken Cuccinelli
    Evidence: Republican Ken Cuccinelli

Figure 4: Examples from QA benchmarks which require common and/or commonsense knowledge. Correct answers are marked with an asterisk; CoQA answers follow each question.

ARC.

The AI2 Reasoning Challenge (ARC) from Clark et al. (2018) provides a dataset of almost 8,000 four-way multiple-choice science questions and answers, as well as a corpus of 14 million science-related sentences which is claimed to contain most of the information needed to answer the questions. As many of the questions require systems to draw information from multiple sentences in the corpus to answer correctly, it is not possible to accurately solve the task by simply searching the corpus for keywords. As such, this task encourages advanced reasoning which may be useful for commonsense reasoning. ARC can be downloaded at http://data.allenai.org/arc/.

MCScript.

The MCScript benchmark by Ostermann et al. (2018) is one of few QA benchmarks which emphasize commonsense. The dataset consists of about 14,000 two-way multiple-choice questions based on short passages, with a large proportion of its questions requiring pure commonsense knowledge and thus answerable without the passage. Questions are conveniently labeled according to whether they are answered from the provided text or from commonsense. A download link to MCScript can be found at http://www.sfb1102.uni-saarland.de/?page_id=2582.

ProPara.

ProPara by Mishra et al. (2018) consists of 488 annotated paragraphs of procedural text. These paragraphs describe various processes, such as photosynthesis and hydroelectric power generation, so that systems can learn to track objects in processes that involve changes of state. The authors assert that recognizing these state changes can require world knowledge, so proficiency in commonsense reasoning is necessary to perform well. Annotations of the paragraphs take the form of a grid which describes the state of each participant after each sentence of the paragraph. If a system understands a paragraph in this dataset, it should be able to answer, for each entity mentioned, any question about whether the entity is created, destroyed, or moved, and when and where this happens. To answer all such questions perfectly, a system must produce a grid identical to the annotations for the paragraph; systems are therefore evaluated by their performance on this task. ProPara data is linked from http://data.allenai.org/propara/.
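
The following is a minimal sketch of the kind of participant-state grid that ProPara annotations describe; the representation and helper below are our own simplification rather than the official file format, but they show how creation, destruction, and movement events can be read off a row of the grid.

    # Simplified participant-state grid for a two-sentence photosynthesis-style
    # paragraph: one row per participant, one column per point in time.
    # "-" marks non-existence; this layout is illustrative only.
    grid = {
        #            before S1  after S1  after S2
        "water":    ["soil",    "root",   "leaf"],
        "sugar":    ["-",       "-",      "leaf"],
    }

    def events(states):
        """Read creation/destruction/movement events off one row of the grid."""
        out = []
        for step, (prev, curr) in enumerate(zip(states, states[1:]), start=1):
            if prev == "-" and curr != "-":
                out.append((step, "created at", curr))
            elif prev != "-" and curr == "-":
                out.append((step, "destroyed at", prev))
            elif prev != curr:
                out.append((step, "moved", f"{prev} -> {curr}"))
        return out

    for participant, states in grid.items():
        print(participant, events(states))
    # water: moved soil -> root (sentence 1), moved root -> leaf (sentence 2)
    # sugar: created at leaf (sentence 2)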

MultiRC.

MultiRC by Khashabi et al. (2018) is a reading comprehension dataset consisting of about 10,000 questions posed on over 800 paragraphs across a variety of topic domains. It differs from a traditional reading comprehension dataset in that most questions can only be answered by reasoning over multiple sentences in the accompanying paragraph, answers are not spans of text from the paragraph, and the number of answer choices as well as the number of correct answers for each question is variable. All of these features make it difficult for shallow and artificial approaches to perform well on the benchmark, encouraging a deeper understanding of the passage. Further, the benchmark includes a variety of nontrivial semantic phenomena in passages, such as coreference and causal relationships, which often require commonsense to recognize and parse. MultiRC can be downloaded at http://cogcomp.org/multirc/.
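
Because the number of correct answers per question is variable, plain accuracy is not well defined for MultiRC. The sketch below shows one way such questions can be scored, with per-question exact match plus precision, recall, and F1 over the selected answer options; it is an illustration of the idea, not the benchmark's official evaluation script.

    def score_question(predicted, gold):
        """Score one multi-answer question.

        `predicted` and `gold` are sets of answer-option indices.
        Returns (exact_match, precision, recall, f1). Illustrative sketch only.
        """
        predicted, gold = set(predicted), set(gold)
        exact = predicted == gold
        tp = len(predicted & gold)
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(gold) if gold else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return exact, precision, recall, f1

    # A question with three options, of which options 0 and 1 are correct:
    print(score_question(predicted={0, 2}, gold={0, 1}))
    # (False, 0.5, 0.5, 0.5)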

SQuAD.

The Stanford Question Answering Dataset (SQuAD) from Rajpurkar et al. (2016) was originally a set of about 100,000 open-ended questions posed on passages from Wikipedia articles, which are provided with the questions. The initial dataset did not require commonsense to solve; questions required little reasoning, and answers were spans of text taken directly from the passage. To make the dataset more challenging, SQuAD 2.0 [Rajpurkar et al., 2018] was later released, adding about 50,000 questions which are unanswerable given the passage. Determining whether a question is answerable may require outside knowledge, or at least some more advanced reasoning. All SQuAD data can be downloaded at http://rajpurkar.github.io/SQuAD-explorer/.

CoQA.

The Conversational Question Answering (CoQA) dataset by Reddy et al. (2018) contains passages, each accompanied by a set of questions in conversational form, along with their answers and evidence for the answers. There are about 127,000 questions in the dataset in total, but as they are conversational, the questions pertaining to a passage must be answered together, in order. While the conversational element of CoQA is not new, e.g., QuAC by Choi et al. (2018), CoQA is unique in including questions that pertain directly to commonsense reasoning, following Ostermann et al. (2018) in addressing a growing need for reading comprehension datasets which require various forms of reasoning. CoQA also includes out-of-domain question types appearing only in the test set, and unanswerable questions throughout. CoQA data can be downloaded at http://stanfordnlp.github.io/coqa/.

OpenBookQA.

OpenBookQA by Mihaylov et al. (2018) intends to address the shortcomings of previous QA datasets: earlier datasets often do not require commonsense or any advanced reasoning to solve, and those that do require vast domains of knowledge which are difficult to capture. OpenBookQA contains about 6,000 four-way multiple-choice science questions which may require science facts or other common and commonsense knowledge. Instead of providing no knowledge resource, like MCScript [Ostermann et al., 2018], or a prohibitively large corpus of facts to support answering the questions, like ARC [Clark et al., 2018], OpenBookQA provides an "open book" of about 1,300 science facts, each associated directly with the question(s) it applies to. For required common and commonsense knowledge, the authors expect that outside resources will be used in answering the questions. Information for downloading OpenBookQA can be found at http://github.com/allenai/OpenBookQA.

CommonsenseQA.

CommonsenseQA by Talmor et al. (2019) is a QA benchmark which directly targets commonsense, like CoQA [Reddy et al., 2018] and ReCoRD [Zhang et al., 2018], consisting of 9,500 three-way multiple-choice questions. To ensure an emphasis on commonsense, each question requires one to disambiguate a target concept from three connected concepts in ConceptNet, a commonsense knowledge graph [Liu and Singh, 2004]. Utilizing a large knowledge graph like ConceptNet ensures not only that questions target commonsense relations directly, but also that the types of commonsense knowledge and reasoning required are highly varied. CommonsenseQA data can be downloaded at http://www.tau-nlp.org/commonsenseqa.

2.1.3 Textual Entailment

Recognizing textual entailment is another comprehensive task. Textual entailment is defined by Dagan et al. (2005) as a directional relationship between a text and a hypothesis, where the text entails the hypothesis if a typical person would infer that the hypothesis is true given the text. Some tasks expand this by also requiring the recognition of contradiction, e.g., the fourth and fifth RTE Challenges [Giampiccolo et al., 2008, Bentivogli et al., 2009]. Performing such a task requires several simpler language processing skills, such as paraphrase recognition, object tracking, and causal reasoning, but since it also requires a sense of what a typical person would infer, commonsense knowledge is often essential to textual entailment tasks. The RTE Challenges [Dagan et al., 2005] are the classic benchmarks for entailment, but there are now several larger benchmarks inspired by them. Examples from these benchmarks are listed in Figure 5; they require commonsense knowledge such as the process of making snow angels and the relationship between the presence of a crowd and loneliness. In the following paragraphs, we introduce all textual entailment benchmarks in detail.

  (a) RTE Challenge [Dagan et al., 2005]
    Text: American Airlines began laying off hundreds of flight attendants on Tuesday, after a federal judge turned aside a union’s bid to block the job losses.
    Hypothesis: American Airlines will recall hundreds of flight attendants as it steps up the number of flights it operates.

    Label: not entailment

  (b) SICK [Marelli et al., 2014a] (example extracted from the SICK data available at http://clic.cimec.unitn.it/composes/sick.html)
    Sentence 1: Two children are lying in the snow and are drawing angels.
    Sentence 2: Two children are lying in the snow and are making snow angels.

    Label: entailment

  (c) SNLI [Bowman et al., 2015]
    Text: A black race car starts up in front of a crowd of people.
    Hypothesis: A man is driving down a lonely road.

    Label: contradiction

  (d) MultiNLI, Telephone [Williams et al., 2017]
    Context: that doesn’t seem fair does it
    Hypothesis: There’s no doubt that it’s fair.

    Label: contradiction

  (e) SciTail [Khot et al., 2018]
    Premise: During periods of drought, trees died and prairie plants took over previously forested regions.
    Hypothesis: Because trees add water vapor to air, cutting down forests leads to longer periods of drought.

    Label: neutral

Figure 5: Examples from RTE benchmarks. Gold labels are shown with each example.

RTE Challenges.

An early attempt at an evaluation scheme for commonsense reasoning was the Recognizing Textual Entailment (RTE) Challenge [Dagan et al., 2005]. The inaugural challenge provided a task where, given a text and a hypothesis, systems were expected to predict whether the text entailed the hypothesis. Similar challenges followed in subsequent years [Bar-Haim et al., 2006, Giampiccolo et al., 2007]. The fourth and fifth Challenges added a new three-way decision task where systems were additionally expected to recognize contradiction relationships between texts and hypotheses [Giampiccolo et al., 2008, Bentivogli et al., 2009]. The main task for the sixth and seventh Challenges instead provided one hypothesis and several potential entailing sentences in a corpus [Bentivogli et al., 2010, Bentivogli et al., 2011]. The eighth Challenge [Dzikovska et al., 2013] addressed a somewhat different problem, focusing on classifying student responses as an effort toward providing automatic feedback in an educational setting. The first five RTE Challenge datasets consisted of around 1,000 examples each [Dagan et al., 2005, Bar-Haim et al., 2006, Giampiccolo et al., 2007, Giampiccolo et al., 2008, Bentivogli et al., 2009], while the sixth and seventh consisted of about 33,000 and 49,000 examples, respectively. Data from all RTE Challenges can be downloaded at http://tac.nist.gov/.

SICK.

The Sentences Involving Compositional Knowledge (SICK) benchmark by Marelli et al. (2014) is a collection of close to 10,000 pairs of sentences. Two tasks are presented with this dataset, one for sentence relatedness and one for entailment. The entailment task, which is more relevant to this survey, is a three-way decision task in the style of RTE-4 [Giampiccolo et al., 2008] and RTE-5 [Bentivogli et al., 2009]. SICK can be downloaded at http://clic.cimec.unitn.it/composes/sick.html.

SNLI.

The Stanford Natural Language Inference (SNLI) benchmark from Bowman et al. (2015) contains nearly 600,000 sentence pairs and provides a three-way decision task similar to the fourth and fifth RTE Challenges [Giampiccolo et al., 2008, Bentivogli et al., 2009]. In addition to gold labels for entailment, contradiction, or neutral, the SNLI data includes five crowd judgments for each label, which may indicate a level of confidence or agreement. This benchmark was later expanded into MultiNLI [Williams et al., 2017], which follows the same format but includes sentences of various genres, such as telephone conversations. MultiNLI is included within the GLUE benchmark [Wang et al., 2018] introduced in Section 2.1.6, while SNLI can be downloaded at http://nlp.stanford.edu/projects/snli/.

SciTail.

SciTail by Khot et al. (2018) consists of about 27,000 premise-hypothesis sentence pairs adapted from science questions into a two-way entailment task similar to the first RTE Challenge [Dagan et al., 2005]. Unlike other entailment tasks, this one is primarily science-based, which may require knowledge more advanced than everyday commonsense. SciTail can be downloaded at http://data.allenai.org/scitail/.

2.1.4 Plausible Inference

While textual entailment benchmarks require one to draw concrete conclusions, other benchmarks require hypothetical, intermediate, or uncertain conclusions, referred to as plausible inference [Davis and Marcus, 2015]. Such benchmarks often focus on everyday events, which contain a wide variety of practical commonsense relations. Examples from these benchmarks are listed in Figure 6; they require commonsense knowledge of everyday interactions, e.g., answering a door, and activities, e.g., cooking. In the following paragraphs, we introduce all plausible inference commonsense benchmarks.

  (a) COPA [Roemmele et al., 2011]
    I knocked on my neighbor’s door. What happened as a result?

    a. My neighbor invited me in. *
    b. My neighbor left his house.

  (b) ROCStories [Mostafazadeh et al., 2016]
    Tom and Sheryl have been together for two years. One day, they went to a carnival together. He won her several stuffed bears, and bought her funnel cakes. When they reached the Ferris wheel, he got down on one knee.

    Ending:

    a. Tom asked Sheryl to marry him. *
    b. He wiped mud off of his boot.

  (c) SWAG [Zellers et al., 2018]
    He pours the raw egg batter into the pan. He

    a. drops the tiny pan onto a plate
    b. lifts the pan and moves it around to shuffle the eggs. *
    c. stirs the dough into a kite.
    d. swirls the stir under the adhesive.

  (d) JOCI [Zhang et al., 2016]
    Context: John was excited to go to the fair
    Hypothesis: The fair opens.
    Label: 5 (very likely)

    Context: Today my water heater broke
    Hypothesis: A person looks for a heater.
    Label: 4 (likely)

    Context: John’s goal was to learn how to draw well
    Hypothesis: A person accomplishes the goal.
    Label: 3 (plausible)

    Context: Kelly was playing a soccer match for her University
    Hypothesis: The University is dismantled.
    Label: 2 (technically possible)

    Context: A brown-haired lady dressed all in blue denim sits in a group of pigeons.
    Hypothesis: People are made of the denim.
    Label: 1 (impossible)

Figure 6: Examples from commonsense benchmarks requiring plausible inference. Correct answers are marked with an asterisk; for JOCI, gold labels are shown.

COPA.

The Choice of Plausible Alternatives (COPA) task by Roemmele et al. (2011) provides premises, each paired with two alternatives which are possible causes or effects of the premise. Examples require both forward and backward causal reasoning, meaning that the premise could be either a cause or an effect of the correct alternative. COPA data, which consist of 1,000 examples in total, can be downloaded at http://people.ict.usc.edu/~gordon/copa.html.

CBT.

The Children’s Book Test (CBT) from Hill et al. (2015) consists of about 687,000 cloze-style questions over 20-sentence passages mined from publicly available children’s books. These questions require a system to fill in a blank in a line of a story passage given a set of 10 candidate words. Questions are classified into four tasks based on the type of the missing word, which can be a named entity, common noun, verb, or preposition. CBT can be downloaded at http://research.fb.com/downloads/babi/.

ROCStories.

ROCStories by Mostafazadeh et al. (2016) is a corpus of about 50,000 five-sentence everyday life stories containing a host of causal and temporal relationships between events, ideal for learning commonsense knowledge rules. Of these 50,000 stories, about 3,700 are designated as test cases, which include a plausible and an implausible alternate story ending for trained systems to choose between. The task of solving the ROCStories test cases is called the Story Cloze Test, a more challenging alternative to the narrative cloze task proposed by Chambers and Jurafsky (2008). The most recent release of ROCStories, which can be requested at http://cs.rochester.edu/nlp/rocstories/, adds about 50,000 stories to the dataset.

JOCI.

The JHU Ordinal Commonsense Inference (JOCI) benchmark by Zhang et al. (2016) consists of about 39,000 sentence pairs, each consisting of a context and a hypothesis. Given these, systems must rate how likely the hypothesis is on a scale from 1 to 5, where 1 corresponds to impossible, 2 to technically possible, 3 to plausible, 4 to likely, and 5 to very likely. This task is similar to SNLI [Bowman et al., 2015] and other three-way entailment tasks, but provides more options which essentially range between entailment and contradiction. This is fitting, considering the fuzzier nature of plausible inference tasks compared to textual entailment tasks. JOCI can be downloaded from http://github.com/sheng-z/JOCI.

CLOTH.

The Cloze Test by Teachers (CLOTH) benchmark by Xie et al. (2017) is a collection of nearly 100,000 four-way multiple-choice cloze-style questions from middle- and high-school-level English language exams, where the answer fills in a blank in a given text. Each question is labeled with the type of reasoning it involves, where the four possible types are grammar, short-term reasoning, matching/paraphrasing, and long-term reasoning. CLOTH data can be downloaded at http://www.cs.cmu.edu/~glai1/data/cloth/.

SWAG.

Situations with Adversarial Generations (SWAG) from Zellers et al. (2018) is a benchmark dataset of about 113,000 short text beginnings, each with four possible endings. Given the context each text provides, systems decide which of the four endings is most plausible. Examples also include labels for the source of the correct ending, and ordinal labels for the likelihood of each possible ending and of the correct ending. SWAG data can be downloaded at http://rowanzellers.com/swag/.

ReCoRD.

The Reading Comprehension with Commonsense Reasoning (ReCoRD) benchmark by Zhang et al. (2018), similar to SQuAD [Rajpurkar et al., 2016], consists of questions posed on passages, particularly news articles. However, questions in ReCoRD are in cloze format, requiring more hypothetical reasoning, and many questions explicitly require commonsense reasoning to answer. Named entities are identified in the data and are used to fill the blanks for the cloze task. The benchmark consists of over 120,000 examples, most of which are claimed to require commonsense reasoning. ReCoRD can be downloaded at https://sheng-z.github.io/ReCoRD-explorer/.

2.1.5 Psychological Reasoning

An especially significant domain of knowledge in plausible inference tasks is human sociopsychology, as inferring emotions and intentions from behavior is a fundamental human capability [Gordon, 2016]. Several benchmarks touch on social psychology in some examples, e.g., the marriage proposal example in ROCStories [Mostafazadeh et al., 2016] from Figure 6, but some are entirely focused here. Examples from each benchmark are listed in Figure 7; they require sociopsychological commonsense knowledge such as plausible reactions to being punched or yelled at. In the following paragraphs, we introduce these benchmarks in detail.

  (a) Triangle-COPA [Gordon, 2016]
    A circle and a triangle are in the house and are arguing. The circle punches the triangle. The triangle runs out of the house. Why does the triangle leave the house?

    a. The triangle leaves the house because it wants the circle to come fight it outside.
    b. The triangle leaves the house because it is afraid of being further assaulted by the circle. *

  (b) SC [Rashkin et al., 2018a] (example extracted from Story Commonsense development data available at http://uwnlp.github.io/storycommonsense/)
    Jervis has been single for a long time.
    He wants to have a girlfriend.
    One day he meets a nice girl at the grocery store.

    Motivation: to be loved, companionship
    Emotions: shocked, excited, hope, shy, fine, wanted

    Maslow: love
    Reiss: contact, romance
    Plutchik: joy, surprise, anticipation, trust, fear

  (c) Event2Mind [Rashkin et al., 2018b]
    PersonX starts to yell at PersonY

    PersonX’s intent: to express anger, to vent their frustration, to get PersonY’s full attention
    PersonX’s reaction: mad, frustrated, annoyed
    Other’s reactions: shocked, humiliated, mad at PersonX

Figure 7: Examples from sociopsychological inference benchmarks. The correct Triangle-COPA answer is marked with an asterisk; for the other benchmarks, gold annotations are shown.

Triangle-COPA.

Triangle-COPA by Gordon (2016) is a variation of COPA [Roemmele et al., 2011] based on a popular social psychology experiment. It contains 100 examples in the format of COPA, along with accompanying videos. Questions focus specifically on emotions, intentions, and other aspects of social psychology. The data also include logical forms of the questions and alternatives, as the paper focuses on logical formalisms for psychological commonsense. Triangle-COPA can be downloaded at http://github.com/asgordon/TriangleCOPA.

Story Commonsense.

As mentioned earlier, the stories from ROCStories by Mostafazadeh et al. (2016) are rich in sociological and psychological instances of commonsense. Motivated by classical theories of motivation and emotion from psychology, Rashkin et al. (2018a) created the Story Commonsense (SC) benchmark containing about 160,000 annotations of the motivations and emotions of characters in ROCStories to enable more concrete reasoning in this area. In addition to the tasks of generating motivational and emotional annotations, the dataset introduces three classification tasks: one for inferring the basic human needs theorized by Maslow (1943), one for inferring the human motives theorized by Reiss (2004), and one for inferring the human emotions theorized by Plutchik (1980). SC can be downloaded at http://uwnlp.github.io/storycommonsense/.

Event2Mind.

In addition to motivations and emotions, systems may need to infer the intentions and reactions surrounding events. To support this, Rashkin et al. (2018b) introduce Event2Mind, a benchmark dataset of about 57,000 annotations of intentions and reactions for about 25,000 unique events extracted from other corpora, including ROCStories [Mostafazadeh et al., 2016]. Each event involves one or two participants, and the benchmark presents three tasks: predicting the primary participant’s intentions, predicting the primary participant’s reactions, and predicting the reactions of others. Event2Mind can be downloaded at http://uwnlp.github.io/event2mind/.

2.1.6 Multiple Tasks

Some benchmarks consist of several focused language processing or reasoning tasks so that reading comprehension skills can be learned one by one in a consistent format. While bAbI includes some tasks for coreference resolution, it also covers other prerequisite skills, e.g., relation extraction [Weston et al., 2016]. Similarly, the recognition of sentiment, paraphrase, grammaticality, and even puns are each the focus of different tasks within the DNC benchmark [Poliak et al., 2018a]. Table 1 compares these multi-task commonsense benchmarks by the types of language processing tasks they provide. In the following paragraphs, we introduce these benchmarks in detail.

bAbI.

The bAbI benchmark from Weston et al. (2016) consists of 20 prerequisite tasks, each with 1,000 examples for training and 1,000 for testing. Each task presents systems with a passage and then asks a reading comprehension question, but each task focuses on a different type of reasoning or language processing, allowing systems to learn basic skills one at a time. The tasks are as follows:

  1. Single supporting fact
  2. Two supporting facts
  3. Three supporting facts
  4. Two argument relations
  5. Three argument relations
  6. Yes/no questions
  7. Counting
  8. Lists/sets
  9. Simple negation
  10. Indefinite knowledge
  11. Basic coreference
  12. Conjunction
  13. Compound coreference
  14. Time reasoning
  15. Basic deduction
  16. Basic induction
  17. Positional reasoning
  18. Size reasoning
  19. Path finding
  20. Agent’s motivations

In addition to providing focused language processing tasks as previously discussed, bAbI also provides focused commonsense reasoning tasks requiring particular kinds of logical and physical commonsense knowledge, such as its tasks for deduction and induction, and time, positional, and size reasoning [Weston et al., 2016]. Selected examples of these tasks are given in Figure 8, and they demand commonsense knowledge such as the fact that members of an animal species are typically all the same color, and the relationship between objects’ sizes and their ability to contain each other. bAbI can be downloaded at http://research.fb.com/downloads/babi/.

Reasoning Type bAbI IIE GLUE DNC
Semantic Role Labeling
Relation Extraction
Event Factuality
Named Entity Recognition
Coreference Resolution
Grammaticality
Lexicosyntactic Inference
Sentiment Analysis
Figurative Language
Paraphrase
Textual Entailment
Table 1: Comparison of language processing tasks present in the bAbI [Weston et al., 2016], IIE [White et al., 2017], GLUE [Wang et al., 2018], and DNC [Poliak et al., 2018a] benchmarks. Recent multi-task benchmarks focus on the recognition of an increasing variety of linguistic and semantic phenomena.

  (a) Task 15: Basic Deduction
    Sheep are afraid of wolves.
    Cats are afraid of dogs.
    Mice are afraid of cats.
    Gertrude is a sheep.

    What is Gertrude afraid of?
    wolves

  (b) Task 16: Basic Induction
    Lily is a swan.
    Lily is white.
    Bernhard is green.
    Greg is a swan.

    What color is Greg?
    white

  (c) Task 17: Positional Reasoning
    The triangle is to the right of the blue square.
    The red square is on top of the blue square.
    The red sphere is to the right of the blue square.

    Is the red square to the left of the triangle?
    yes

  (d) Task 18: Size Reasoning
    The football fits in the suitcase.
    The suitcase fits in the cupboard.
    The box is smaller than the football.

    Will the box fit in the suitcase?
    yes

Figure 8: Examples of selected logically and physically grounded commonsense reasoning tasks from bAbI by Weston et al. (2016). Answers follow each question.
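
The basic deduction task in Figure 8 can be illustrated with a single chaining step over two kinds of facts. The sketch below is a toy illustration of our own (not a bAbI baseline system): it answers the Gertrude question by composing an is-a fact with an afraid-of fact.

    # Facts from the Task 15 example in Figure 8.
    afraid_of = {"sheep": "wolves", "cats": "dogs", "mice": "cats"}
    is_a = {"Gertrude": "sheep"}

    def what_is_afraid_of(name):
        """One deduction step: X is a K, and Ks are afraid of Y => X is afraid of Y."""
        kind = is_a[name]
        return afraid_of[kind]

    print(what_is_afraid_of("Gertrude"))  # wolves
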
Inference is Everything.

Inference is Everything (IIE) by White et al. (2017) follows bAbI [Weston et al., 2016] in creating a suite of tasks, where each task is deliberately geared toward a different language processing problem: semantic proto-role labeling, paraphrase, and pronoun resolution. Each task is in the classic RTE Challenge format [Dagan et al., 2005], i.e., given context and hypothesis texts, one must determine whether the context entails the hypothesis. Across the tasks, IIE includes about 300,000 examples, all of which are recast from previously existing datasets. IIE can be downloaded along with another multi-task suite at http://github.com/decompositional-semantics-initiative/DNC.

GLUE.

The General Language Understanding Evaluation (GLUE) benchmark from Wang et al. (2018) consists of 9 tasks, ranging from focused to more comprehensive, in various forms, including single-sentence binary classification and two- or three-way entailment comparable to the dual tasks in RTE-4 and RTE-5 [Giampiccolo et al., 2008, Bentivogli et al., 2009]. Most of these tasks either directly relate to commonsense or may be useful in creating systems which utilize traditional linguistic processes, like paraphrase, in performing commonsense reasoning. The GLUE tasks are recast or included directly from other benchmark data and corpora, including the Corpus of Linguistic Acceptability (CoLA), the Stanford Sentiment Treebank (SST-2), the Microsoft Research Paraphrase Corpus (MRPC), Quora Question Pairs (QQP), the Semantic Textual Similarity Benchmark (STS-B), MultiNLI, Question answering NLI (QNLI, recast from SQuAD), RTE, and Winograd NLI (WNLI, recast from the Winograd Schema Challenge).

GLUE also includes a small analysis set for diagnostic purposes, which has manual annotations of the fine-grained categories that sentence pairs fall into (e.g., commonsense), as well as labeled reversed versions of examples. Overall, GLUE has over 1 million examples, which can be downloaded at http://gluebenchmark.com/tasks.

DNC.

The Diverse Natural Language Inference Collection (DNC) by Poliak et al. (2018a) consists of 9 textual entailment tasks requiring 7 different types of reasoning. Like IIE by White et al. (2017), the data for each task follow the form of the original RTE Challenge [Dagan et al., 2005]. Some of the tasks within DNC cover fundamental reasoning skills required by any reasoning system, while others cover more challenging skills which require commonsense. Each task is recast from a previously existing dataset:

  1. Named Entity Recognition, recast from the Groningen Meaning Bank [Bos et al., 2017] and the CoNLL-2003 shared task [Tjong Kim Sang and De Meulder, 2003]
  2. Gendered Anaphora Resolution, recast from the Winogender dataset [Rudinger et al., 2018a]
  3. Lexicosyntactic Inference, recast from MegaVeridicality [White and Rawlins, 2018], VerbNet [Schuler, 2005], and VerbCorner [Hartshorne et al., 2013]
  4. Figurative Language, recast from the pun datasets of Yang et al. (2015) and Miller et al. (2017)
  5. Relation Extraction, partially from FACC1 [Gabrilovich et al., 2013]
  6. Subjectivity, recast from Kotzias et al. (2015)

The DNC benchmark consists of about 570,000 examples total, and can be downloaded at http://github.com/decompositional-semantics-initiative/DNC.

2.2 Criteria and Considerations for Creating Benchmarks

The goal of benchmarks is to support technology development and provide a platform to measure research progress in commonsense reasoning. Whether this goal can be achieved depends on the nature of the benchmark. This section identifies the successes and lessons learned from the existing benchmarks and summarizes key considerations and criteria that should guide the creation of the benchmarks, particularly in the areas of task format, evaluation schemes, balance of data, and data collection methods.

2.2.1 Task Format

In creating benchmarks, determining the formulation of the problem is an important step. Among existing benchmarks, there exist a few common task formats, and while some task formats are interchangeable, others are only suited for particular tasks. We provide a review of these formats, indicating the types of tasks they are suitable for.

Classification tasks.

Most benchmark tasks are classification problems, where each response is a single choice from a finite number of options. These include textual entailment tasks, which most commonly require a binary or ternary decision about a pair of sentences, cloze tasks, which require a multiple-choice decision to fill in a blank, and traditional multiple-choice question answering tasks.

Textual entailment tasks. A highly popular format was originally introduced by the RTE Challenges, where given a pair of texts, i.e., a context and hypothesis, one must determine whether the context entails the hypothesis [Dagan et al., 2005]. In the fourth and fifth RTE Challenges, this format was extended to a three-way decision problem where the hypothesis may contradict the context [Giampiccolo et al., 2008, Bentivogli et al., 2009]. The JOCI benchmark further extends the problem to a five-way decision task, where the hypothesis text may range from impossible to very likely given the context [Zhang et al., 2016].

While this format is typically used for textual entailment problems like the RTE Challenges, it can be used for nearly any type of inference problem. Some multi-task benchmarks have adopted the format for several different reasoning tasks; for example, the Inference is Everything [White et al., 2017], GLUE [Wang et al., 2018], and DNC [Poliak et al., 2018a] benchmarks use the classic RTE format for most of their tasks by automatically recasting previously existing datasets into it. Many of these problems deal either with reasoning processes more focused than the RTE Challenges, or with more advanced reasoning than the RTE Challenges demand, showing the flexibility of the format. Such tasks include coreference resolution, recognition of puns, and question answering. Examples from these recast tasks are listed in Figure 9.

  (a) GLUE, Question Answering NLI
    Context: Who was the main performer at this year’s halftime show?

    Hypothesis: The Super Bowl 50 halftime show was headlined by the British rock group Coldplay with special guest performers Beyoncé and Bruno Mars, who headlined the Super Bowl XLVII and Super Bowl XLVIII halftime shows, respectively.

    Label: entailed

  (b) IIE, Definite Pronoun Resolution
    Context: The bird ate the pie and it died.

    Hypothesis: The bird ate the pie and the bird died.

    Label: entailed

  (c) DNC, Figurative Language
    Context: Carter heard that a gardener who moved back to his home town rediscovered his roots.

    Hypothesis: Carter heard a pun.

    Label: entailed

Figure 9: Examples from tasks within the General Language Understanding Evaluation (GLUE) benchmark by Wang et al. (2018), Inference is Everything (IIE) by White et al. (2017), and the Diverse NLI Collection (DNC) by Poliak et al. (2018a). Each task is recast from preexisting data into the classic RTE format.
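
As a concrete illustration of recasting, the following sketch (a hypothetical helper of our own, not the actual recasting pipeline used by IIE, GLUE, or DNC) turns a definite pronoun resolution example like the one in Figure 9 into RTE-style context-hypothesis-label triples, one per candidate antecedent.

    def recast_pronoun_resolution(sentence, pronoun, candidates, answer):
        """Recast a pronoun resolution example into RTE-style triples.

        Returns one (context, hypothesis, label) triple per candidate.
        Naive string replacement is used purely for illustration.
        """
        examples = []
        for candidate in candidates:
            hypothesis = sentence.replace(pronoun, candidate, 1)
            label = "entailed" if candidate == answer else "not-entailed"
            examples.append((sentence, hypothesis, label))
        return examples

    for triple in recast_pronoun_resolution(
            sentence="The bird ate the pie and it died.",
            pronoun="it",
            candidates=["the bird", "the pie"],
            answer="the bird"):
        print(triple)
    # Prints one (context, hypothesis, label) triple per candidate; the
    # hypothesis built from "the bird" is labeled entailed, the other is not.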

Cloze tasks. Another popular format is the cloze task, originally conceived by Taylor (1953). Such a task typically involves the deletion of one or more words in a text, essentially requiring one to fill in the blank from a set of choices, a format well suited for language modeling problems. This format has been used in several commonsense benchmarks, including CBT [Hill et al., 2015], the Story Cloze Test for the ROCStories benchmark [Mostafazadeh et al., 2016], CLOTH [Xie et al., 2017], SWAG [Zellers et al., 2018], and ReCoRD [Zhang et al., 2018]. These benchmarks provide anywhere from two to ten options to fill in the blank, and range from requiring the prediction of a single word to parts of sentences and entire sentences. Examples of cloze tasks are listed in Figure 10, and a minimal sketch of how such an item can be constructed follows the figure.

  (a) CBT [Hill et al., 2015]
    1. Mr. Cropper was opposed to our hiring you .
    2. Not , of course , that he had any personal objection to you , but he is set against female teachers , and when a Cropper is set there is nothing on earth can change him .
    3. He says female teachers ca n’t keep order .
    4. He ’s started in with a spite at you on general principles , and the boys know it .
    5. They know he ’ll back them up in secret , no matter what they do , just to prove his opinions .
    6. Cropper is sly and slippery , and it is hard to corner him . ”
    7. “ Are the boys big ? ”
    8. queried Esther anxiously .
    9. “ Yes .
    10. Thirteen and fourteen and big for their age .
    11. You ca n’t whip ’em – that is the trouble .
    12. A man might , but they ’d twist you around their fingers .
    13. You ’ll have your hands full , I ’m afraid .
    14. But maybe they ’ll behave all right after all . ”
    15. Mr. Baxter privately had no hope that they would , but Esther hoped for the best .
    16. She could not believe that Mr. Cropper would carry his prejudices into a personal application .
    17. This conviction was strengthened when he overtook her walking from school the next day and drove her home .
    18. He was a big , handsome man with a very suave , polite manner .
    19. He asked interestedly about her school and her work , hoped she was getting on well , and said he had two young rascals of his own to send soon .
    20. Esther felt relieved .
    21. She thought that Mr. _____ had exaggerated matters a little .

    Blank: Baxter *, Cropper, Esther, course, fingers, manner, objection, opinions, right, spite

  (b) ROCStories [Mostafazadeh et al., 2016]
    Karen was assigned a roommate her first year of college. Her roommate asked her to go to a nearby city for a concert. Karen agreed happily. The show was absolutely exhilarating.

    Ending:

    a. Karen became good friends with her roommate. *
    b. Karen hated her roommate.

  (c) CLOTH [Xie et al., 2017]
    She pushed the door open and found nobody there. "I am the _____ to arrive." She thought and came to her desk.

    a. last
    b. second
    c. third
    d. first *

  (d) SWAG [Zellers et al., 2018]
    On stage, a woman takes a seat at the piano. She

    a. sits on a bench as her sister plays with the doll.
    b. smiles with someone as the music plays.
    c. is in the crowd, watching the dancers.
    d. nervously sets her fingers on the keys. *

  (e) ReCoRD [Zhang et al., 2018]
    … Daniela Hantuchova knocks Venus Williams out of Eastbourne 6-2 5-7 6-2 …

    Query:
    Hantuchova breezed through the first set in just under 40 minutes after breaking Williams’ serve twice to take it 6-2 and led the second 4-2 before _____ hit her stride.

    Answer: Venus Williams

Figure 10: Examples from cloze tasks. Correct answers are marked with an asterisk; the ReCoRD answer follows the query.

Traditional multiple-choice tasks. If not in entailment or cloze form, benchmark classification tasks tend to use traditional multiple-choice questions. Benchmarks which use this format include COPA [Roemmele, Bejan,  GordonRoemmele et al.2011], Triangle-COPA [GordonGordon2016], the Winograd Schema Challenge [Davis, Morgenstern,  OrtizDavis et al.2018], ARC [Clark, Cowhey, Etzioni, Khot, Sabharwal, Schoenick,  TafjordClark et al.2018], MCScript [Ostermann, Modi, Roth, Thater,  PinkalOstermann et al.2018], and OpenBookQA [Mihaylov, Clark, Khot,  SabharwalMihaylov et al.2018]. Two-way and four-way decision questions are the most common among these benchmarks.

Open-ended tasks.

On the other hand, some benchmarks require open-ended responses rather than providing a small list of alternatives to choose from. Answers may be restricted to spans of a given text, e.g., SQuAD [Rajpurkar, Zhang, Lopyrev,  LiangRajpurkar et al.2016, Rajpurkar, Jia,  LiangRajpurkar et al.2018] or CoQA [Reddy, Chen,  ManningReddy et al.2018]. They may be less restricted, drawn from a large set of category labels, e.g., the Maslow, Reiss, and Plutchik tasks in Story Commonsense [Rashkin, Bosselut, Sap, Knight,  ChoiRashkin et al.2018a]. Or they may be purely open-ended, e.g., the motivation and emotion tasks in Story Commonsense, Event2Mind [Rashkin, Sap, Allaway, Smith,  ChoiRashkin et al.2018b], or bAbI [Weston, Bordes, Chopra, Rush, van Merriënboer, Joulin,  MikolovWeston et al.2016]. Examples of these open-ended formats are listed in Figure 11.

  (a) SQuAD 2.0 [Rajpurkar, Jia,  LiangRajpurkar et al.2018] (example taken from the SQuAD training data available at http://rajpurkar.github.io/SQuAD-explorer/)
    In February 2016, over a hundred thousand people signed a petition in just twenty-four hours, calling for a boycott of Sony Music and all other Sony-affiliated businesses after rape allegations against music producer Dr. Luke were made by musical artist Kesha. Kesha asked a New York City Supreme Court to free her from her contract with Sony Music but the court denied the request, prompting widespread public and media response.

    How many people signed a petition to boycott Sony Music in 2016?
    over a hundred thousand

  (b) SC [Rashkin, Bosselut, Sap, Knight,  ChoiRashkin et al.2018a] (example taken from the Story Commonsense test data available at http://uwnlp.github.io/storycommonsense/)
    Valerie was getting ready for a formal dance. She had been preparing for hours. As she was ready to leave, her acrylic nail broke. She snapped off all of her faux nails.

    Maslow: esteem, stability
    Reiss: status, approval, order
    Plutchik: surprise, sadness, disgust, anger

  (c) bAbI [Weston, Bordes, Chopra, Rush, van Merriënboer, Joulin,  MikolovWeston et al.2016]
    The kitchen is north of the hallway.
    The bathroom is west of the bedroom.
    The den is east of the hallway.
    The office is south of the bedroom.

    How do you go from den to kitchen?
    west, north

Figure 11: Examples from open-ended response tasks. Answers in bold.

2.2.2 Evaluation Schemes

The Turing Test has long been criticized by AI researchers because it does not truly evaluate machine intelligence. There is a critical need for new intelligence benchmarks that support incremental development and evaluation of AI techniques, as described by ortizWhyWeNeed2016. These benchmarks should not merely provide a pass-or-fail grade; rather, they should provide feedback on a continuous scale that enables both incremental development and comparison of approaches. One key consideration for these benchmarks is informative evaluation metrics that are objective and easy to calculate. Such metrics can be used to compare different approaches and to compare machine performance against human performance.

Evaluation metrics.

Choice of evaluation metrics is highly dependent on the type of task, and so is the difficulty of calculating them. Multiple-choice tasks often use exact-match accuracy if correct answers or class labels are evenly distributed through the benchmark data. If this is not the case, common practice is to additionally report the F-measure [Wang, Singh, Michael, Hill, Levy,  BowmanWang et al.2018]. Precision and recall may also be reported, but the F-measure (which combines both) is much more common among the benchmarks surveyed here. Multiple-choice and classification task formats such as RTE, cloze, and traditional multiple choice can all use these metrics.
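For concreteness, the following sketch computes exact-match accuracy and macro-averaged F-measure for a small set of hypothetical three-way entailment predictions; it assumes the scikit-learn library, and the labels and predictions shown are invented purely for illustration.

```python
# A minimal sketch (not from any surveyed benchmark) of computing accuracy and F-measure
# for a three-way entailment task, assuming scikit-learn is installed.
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical gold labels and system predictions for six examples.
gold = ["entailment", "neutral", "contradiction", "entailment", "neutral", "contradiction"]
pred = ["entailment", "entailment", "contradiction", "entailment", "neutral", "neutral"]

print("accuracy:", accuracy_score(gold, pred))              # exact-match accuracy
print("macro F1:", f1_score(gold, pred, average="macro"))   # balances the three classes
```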

Open-ended tasks are by nature more difficult to evaluate, but evaluation can still be objective and informative. Open-ended tasks like SQuAD [Rajpurkar, Zhang, Lopyrev,  LiangRajpurkar et al.2016] or CoQA [Reddy, Chen,  ManningReddy et al.2018], where answers can only be spans of a provided text, can be evaluated similarly to multiple-choice tasks. Exact-match accuracy and F-measure are used as evaluation metrics on both of these tasks, where the collections of tokens in the predicted and true spans (excluding punctuation and articles) are compared. Where answers are a subset of a large group of category labels, e.g., the Maslow, Reiss, and Plutchik tasks in Story Commonsense, evaluation is similar, but for these tasks in particular, precision and recall are additionally reported [Rashkin, Bosselut, Sap, Knight,  ChoiRashkin et al.2018a].
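The token-level comparison used for span-based tasks can be sketched as follows. This is a simplified approximation of SQuAD-style evaluation (the normalization rules and function names here are our own), not the exact script used by those benchmarks.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation and articles, and collapse whitespace (simplified)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """1 if the normalized prediction equals the normalized gold span, else 0."""
    return int(normalize(prediction) == normalize(gold))

def span_f1(prediction, gold):
    """Token-level F-measure between a predicted span and a gold span."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("over a hundred thousand", "Over a hundred thousand"))           # 1
print(round(span_f1("a hundred thousand people", "over a hundred thousand"), 2))   # 0.67
```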

Where multiple purely open-ended responses are given, as in Event2Mind [Rashkin, Sap, Allaway, Smith,  ChoiRashkin et al.2018b], evaluation is more difficult. Event2Mind in particular uses the average cross-entropy and "recall @ 10," i.e., the percentage of times human-produced ground-truth labels fall within the top 10 predictions from a system. In bAbI, where a purely open-ended response is compared to a single ground-truth answer, exact-match accuracy is used [Weston, Bordes, Chopra, Rush, van Merriënboer, Joulin,  MikolovWeston et al.2016]. Both Event2Mind and bAbI are able to use such exact evaluation measures because responses are short; in bAbI in particular, correct responses are limited to one word or lists of words. Such restrictions are essential for simple and accurate evaluation. For longer generated responses, evaluation metrics like BLEU for machine translation [Papineni, Roukos, Ward,  ZhuPapineni et al.2002] or the modified ROUGE for text summarization [LinLin2004] are useful.
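The "recall @ 10" measure used by Event2Mind can be sketched as below; the prediction and label lists are hypothetical and serve only to illustrate the computation.

```python
def recall_at_k(ranked_predictions, gold_labels, k=10):
    """Fraction of gold labels that appear among a system's top-k ranked predictions.

    ranked_predictions: predictions for one example, ordered from most to least likely.
    gold_labels: human-produced ground-truth labels for the same example.
    """
    top_k = set(ranked_predictions[:k])
    hits = sum(1 for label in gold_labels if label in top_k)
    return hits / len(gold_labels)

# Hypothetical intents predicted for the event "PersonX brings coffee to PersonY".
ranked = ["to be helpful", "to be polite", "to wake up", "to get paid", "to relax"]
gold = ["to be helpful", "to be nice"]
print(recall_at_k(ranked, gold, k=10))  # 0.5
```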

Comparison of approaches.

Benchmarks provide common datasets and experimental setups for researchers to compare different approaches. When a new benchmark is first released, it often reports results from simple baseline approaches. Ideal baselines should yield relatively low performance, leaving room for improvement from more advanced approaches. For multiple-choice problems, baselines are often computed by choosing randomly, choosing the class appearing most frequently in the test data, or choosing the alternative with the highest n-gram overlap with the question or provided text [Richardson, Burges,  RenshawRichardson et al.2013]. Examples of these baseline approaches can be found in the Story Cloze Test baselines, which include most of these approaches and more [Mostafazadeh, Chambers, He, Parikh, Batra, Vanderwende, Kohli,  AllenMostafazadeh et al.2016]. For open-ended problems, shallow lexical approaches (e.g., using language models) may again be used, as in the bAbI benchmark [Weston, Bordes, Chopra, Rush, van Merriënboer, Joulin,  MikolovWeston et al.2016]. When competitive approaches are developed for existing benchmarks, they are often used as baselines in new benchmarks, raising baseline performance over time and encouraging the development of new or improved models.
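A lexical-overlap baseline of the kind described above might be sketched as follows, picking the answer choice that shares the most word types with the provided context; the story and candidate endings are invented for illustration, and real baselines are typically more elaborate.

```python
import re

def tokenize(text):
    """Lowercased word types in a text (a deliberately crude tokenizer)."""
    return set(re.findall(r"[a-z']+", text.lower()))

def overlap_baseline(context, choices):
    """Pick the choice sharing the most word types with the context (a naive lexical baseline)."""
    context_words = tokenize(context)
    scores = [len(context_words & tokenize(choice)) for choice in choices]
    return scores.index(max(scores))

story = "Karen was assigned a roommate her first year of college. The show was exhilarating."
endings = ["Karen became good friends with her roommate.", "Karen hated her roommate."]
print(overlap_baseline(story, endings))  # index of the ending with the most shared words
```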

Human performance measurement.

To evaluate the progress of machine intelligence, human performance on benchmark tasks is often measured to provide a reference point, e.g., through crowdsourcing techniques. ROCStories [Mostafazadeh, Chambers, He, Parikh, Batra, Vanderwende, Kohli,  AllenMostafazadeh et al.2016], SQuAD [Rajpurkar, Zhang, Lopyrev,  LiangRajpurkar et al.2016], and CoQA [Reddy, Chen,  ManningReddy et al.2018] all provide metrics for human performance. The goal of computational models is to come close to or exceed human performance.

2.2.3 Data Biases

When creating benchmarks, one challenge is bias unintentionally introduced into the data. For example, in the first release of the Visual Question Answering (VQA) benchmark [Agrawal, Lu, Antol, Mitchell, Zitnick, Parikh,  BatraAgrawal et al.2017], researchers found that machine learning models were learning several statistical biases in the data, and could answer up to 48% of questions in the validation set without seeing the image [Manjunatha, Saini,  DavisManjunatha et al.2018]. Such artificially high system performance is problematic, as it cannot be credited to the underlying technology. Here we summarize several key dimensions of bias encountered in previous commonsense research. Some of these (e.g., class label distribution biases) are easier to avoid, while others (e.g., hidden correlation biases) are more difficult to address.

Label distribution bias.

Class label distribution bias is the easiest to avoid. In multiple-choice problems, correct answers or class labels should be entirely randomized so that each possible choice appears in the benchmark data with uniform frequency. This way, a majority-class baseline will score as low as possible on the task. While binary-choice tasks should have a 50% majority-class baseline, the MegaVeridicality task within DNC [Poliak, Haldar, Rudinger, Hu, Pavlick, White,  Van DurmePoliak et al.2018a] has a 67% majority-class baseline due to unevenly distributed class labels, leaving significantly less room for incremental improvement than tasks with lower-performing baselines.
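Checking a dataset's label distribution and the corresponding majority-class baseline is straightforward, as in this sketch; the label counts are hypothetical and chosen to mirror the 67% case above.

```python
from collections import Counter

def majority_class_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Hypothetical binary entailment labels; a perfectly balanced dataset would give 0.5 here.
labels = ["entailed"] * 670 + ["not-entailed"] * 330
print(majority_class_baseline(labels))  # ('entailed', 0.67)
```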

Question type bias.

For benchmarks involving question answering tasks, previous work has made an effort to balance the types of questions, especially when questions are generated through crowdsourcing. This helps ensure that a broad range of knowledge and reasoning is required to solve the task. A fairly simple method to keep a balance of question types is to calculate the distribution of the first words of each question, as the creators of CoQA [Reddy, Chen,  ManningReddy et al.2018] and CommonsenseQA [Talmor, Herzig, Lourie,  BerantTalmor et al.2019] did; a minimal sketch of this check is given below. One could also manually label a random sample of questions with categories relating to the types of knowledge or reasoning required, or have crowd workers perform this task if an expert is not essential. Examples of this are shown by the creators of SQuAD 2.0 [Rajpurkar, Jia,  LiangRajpurkar et al.2018] and ARC [Clark, Cowhey, Etzioni, Khot, Sabharwal, Schoenick,  TafjordClark et al.2018]. To avoid question type biases entirely, implementing a standard set of questions for all provided texts may be beneficial. ProPara does this for all participants in its procedural paragraphs [Mishra, Huang, Tandon, Yih,  ClarkMishra et al.2018], limiting questions about each entity to whether it is created, destroyed, or moved during the paragraph, and when and where this happens. manjunathaExplicitBiasDiscovery2018 suggest that biases can further be avoided in VQA benchmarks by forcing questions to require a particular skill (e.g., telling time) to be answered, and this rule of thumb can apply to textual benchmarks as well.
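A minimal version of the first-word check might look like the following sketch; the questions are invented, and a real analysis would of course run over the full benchmark.

```python
from collections import Counter

def first_word_distribution(questions):
    """Relative frequency of each question's first word, a rough proxy for question type."""
    first_words = [q.split()[0].lower() for q in questions if q.strip()]
    counts = Counter(first_words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.most_common()}

# Hypothetical questions; a heavily skewed distribution suggests question type bias.
questions = [
    "What is the capital of France?",
    "What year did the war end?",
    "Who wrote the novel?",
    "How many people signed the petition?",
]
print(first_word_distribution(questions))  # {'what': 0.5, 'who': 0.25, 'how': 0.25}
```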

Superficial correlation bias.

Perhaps the most difficult biases to discern and avoid are those caused by accidental correlations between features of answers and questions. One example is gender bias, to which commonsense reasoning systems are particularly vulnerable when trained on biased data. gender-bias-in-coreference-resolution highlight this problem in coreference resolution, showing that systems trained on gender-biased data perform worse on gender pronoun disambiguation tasks. For example, consider the problem from their Winogender dataset in Figure 3: "The paramedic performed CPR on the passenger even though she knew it was too late." In determining who she refers to, systems trained on gender-biased data may be more likely, for example, to incorrectly choose the passenger rather than the paramedic, because male gender pronouns appear more commonly than female gender pronouns in the context of this occupation. To avoid this, gender pronouns should appear equally frequently among other words, especially those related to occupations and activities. Similar gender biases have been identified in the Event2Mind data, which are derived from movie scripts [Rashkin, Sap, Allaway, Smith,  ChoiRashkin et al.2018b].

When authoring natural language data (e.g., generating questions or hypotheses), human stylistic artifacts such as predictable sentence structure, the presence of certain linguistic phenomena, and vocabulary choice can also cause superficial correlation biases. This is particularly the case when data are authored by crowd workers. In the Story Cloze Test [Mostafazadeh, Chambers, He, Parikh, Batra, Vanderwende, Kohli,  AllenMostafazadeh et al.2016], systems are presented with a plausible and an implausible ending to a story and must choose the plausible one. However, schwartzEffectDifferentWriting2017 have shown that the Story Cloze Test can be solved with up to 75.2% accuracy by looking only at the two possible endings. They do this by exploiting human writing style biases in the possible endings rather than performing actual commonsense reasoning. For example, they find that negative language is used more commonly in the wrong ending (e.g., "hates"), while the correct ending is more likely to use enthusiastic language (e.g., "!"). An example of a biased negative ending can be seen in Figure 10. sharmaTacklingStoryEnding2018 have begun work to update the benchmark data and remove these biases.

While generating the Story Cloze Test data was not a fast or simple task for crowd workers, gururanganAnnotationArtifactsNatural2018 suggest that such biases can come from crowd workers' adoption of predictable annotation strategies and heuristics to generate data quickly. These strategies have been revealed for several textual entailment benchmarks which consist of pairs of short sentences. For example, on the entailment task in SemEval 2014 [Marelli, Bentivogli, Baroni, Bernardi, Menini,  ZamparelliMarelli et al.2014b], part of the SICK benchmark [Marelli, Menini, Baroni, Bentivogli, Bernardi,  ZamparelliMarelli et al.2014a], laiIllinoisLHDenotationalDistributional2014 found that the presence of negation in an example was strongly associated with the contradiction class label; a classifier trained on this feature alone achieved 61% accuracy. Later, poliakHypothesisOnlyBaselines2018 and gururanganAnnotationArtifactsNatural2018 found that the presence of particular words in the hypothesis sentence can bias the entailment prediction in several entailment benchmarks. For example, "nobody" was found to be an indicator of contradiction in SNLI [Bowman, Angeli, Potts,  ManningBowman et al.2015], while generic words like "animal" and "instrument", as well as gender-neutral pronouns, were found to be indicators of entailment. gururanganAnnotationArtifactsNatural2018 further find that high sentence length is an indicator of the neutral class, and suggest that crowd workers often remove words from the context sentences to create entailed hypothesis sentences. Using biases like these, a baseline approach by poliakHypothesisOnlyBaselines2018 which only used the hypothesis sentence was able to outperform a majority-class baseline on SNLI, JOCI [Zhang, Rudinger, Duh,  Van DurmeZhang et al.2016], SciTail [Khot, Sabharwal,  ClarkKhot et al.2018], two tasks within Inference is Everything [White, Rastogi, Duh,  Van DurmeWhite et al.2017], and the MultiNLI task within GLUE [Williams, Nangia,  BowmanWilliams et al.2017, Wang, Singh, Michael, Hill, Levy,  BowmanWang et al.2018].
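A hypothesis-only baseline in the spirit of poliakHypothesisOnlyBaselines2018 can be sketched with a simple bag-of-words classifier, as below; the tiny training set is invented, the setup is deliberately simplified, and scikit-learn is assumed to be available.

```python
# A minimal sketch of a hypothesis-only baseline: the premise/context is ignored entirely,
# so any accuracy above the majority class reflects annotation artifacts in the hypotheses.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical hypothesis sentences and labels.
hypotheses = [
    "Nobody is playing outside.",
    "A person is holding an instrument.",
    "Nobody is in the room.",
    "An animal is running in the park.",
]
labels = ["contradiction", "entailment", "contradiction", "entailment"]

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(hypotheses, labels)
print(model.predict(["Nobody is eating lunch."]))  # likely 'contradiction', from the word cue alone
```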

These results demonstrate a serious need for greater attention to such biases in commonsense benchmarks. To recognize biases, a simple technique is to calculate the mutual information between words and class labels within benchmark examples. This was done in discovering stylistic biases in entailment benchmarks [Gururangan, Swayamdipta, Levy, Schwartz, Bowman,  SmithGururangan et al.2018], and such analysis should be performed on any new benchmark data when it is created. Avoiding the biases may require more advanced techniques. For example, in creating the SWAG benchmark [Zellers, Bisk, Schwartz,  ChoiZellers et al.2018], a novel adversarial filtering process was introduced to ensure writing styles are consistent among ending choices, so that the correct answer cannot be identified by exploitative stylistic classifiers. A continued effort in finding techniques to avoid these kinds of biases will be important for developing future benchmarks.
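The word-class association check can be sketched as a pointwise mutual information (PMI) computation over labeled examples; the add-one smoothing and the toy data below are our own simplifications rather than the exact procedure of gururanganAnnotationArtifactsNatural2018.

```python
import math
from collections import Counter

def word_label_pmi(examples):
    """Pointwise mutual information between words and class labels, with add-one smoothing."""
    word_counts, label_counts, joint_counts = Counter(), Counter(), Counter()
    n = len(examples)
    for text, label in examples:
        label_counts[label] += 1
        for word in set(text.lower().split()):
            word_counts[word] += 1
            joint_counts[(word, label)] += 1
    pmi = {}
    for (word, label), joint in joint_counts.items():
        p_joint = (joint + 1) / (n + 1)
        p_word = (word_counts[word] + 1) / (n + 1)
        p_label = (label_counts[label] + 1) / (n + 1)
        pmi[(word, label)] = math.log(p_joint / (p_word * p_label))
    return pmi

# Hypothetical hypothesis/label pairs; high-PMI pairs flag potential annotation artifacts.
data = [("nobody is outside", "contradiction"), ("an animal is outside", "entailment"),
        ("nobody is here", "contradiction"), ("an instrument is played", "entailment")]
scores = word_label_pmi(data)
print(sorted(scores.items(), key=lambda kv: -kv[1])[:5])  # e.g., ('nobody', 'contradiction') ranks high
```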

2.2.4 Collection Methods

Methods to collect benchmark data should ideally be cost-efficient and result in high-quality, unbiased data. Both manual and automatic approaches have been applied. Manual curation can be done by experts and researchers or through crowd workers, each of which comes with its own set of considerations. Automatic approaches often involve generating data with language models or extracting and mining data from existing resources. As shown in Table 2, a benchmark is often created through a combination of these approaches. In the rest of this section, we summarize the pros and cons of these different approaches.

Manual versus automatic generation.

Until recently, many existing benchmarks were created manually by groups of experts. This may involve tedious processes like manually collecting data from other corpora or the Internet, e.g., the first RTE Challenge [Dagan, Glickman,  MagniniDagan et al.2005], or authoring most data from scratch, e.g., the Winograd Schema Challenge [LevesqueLevesque2011, Levesque, Davis,  MorgensternLevesque et al.2012, Morgenstern  OrtizMorgenstern  Ortiz2015, Morgenstern, Davis,  OrtizMorgenstern et al.2016]. This approach ensures data is of high quality and thus typically requires little validation; however, it is not scalable, and these datasets are often small compared to those created with other approaches.

Recent advances in NLP make it possible to automatically generate textual data (e.g., natural language statements, questions, etc.) for benchmark tasks. Although this approach is scalable and efficient, the quality of the data varies and often depends directly on the language model used. For example, in bAbI [Weston, Bordes, Chopra, Rush, van Merriënboer, Joulin,  MikolovWeston et al.2016], examples were automatically generated as agents interacted with objects in a virtual world and with each other. This method ensures that the produced data respect the constraints of the physical world. However, as the questions and answers are written with simple structure, the data is easily understood by machines: most of the dataset is solved with 100% accuracy even by baseline systems. Consequently, the bAbI tasks are often considered toy tasks. Further, models trained on bAbI currently cannot generalize well to real-world, naturally generated data [Das, Munkhdalai, Yuan, Trischler,  McCallumDas et al.2019]. A more sophisticated rule-based method for probabilistically generating text data, which encourages more diverse language without added biases, is presented by manningRealworldVisualReasoning2018. Though automatic natural language generation methods are improving, such approaches will likely still require some manual validation.

Automatic generation versus text mining.

As millions of natural language texts are publicly available on the Internet and in existing datasets, it is possible to build commonsense benchmarks by automatically mining these texts and extracting sentences. This process is most successful when the information source is created by experts and highly accurate. For example, in the CLOTH [Xie, Lai, Dai,  HovyXie et al.2017] benchmark, data instances were mined from fill-in-the-blank English tests created by human teachers. While other automatically generated cloze tasks like CBT [Hill, Bordes, Chopra,  WestonHill et al.2015] choose missing words mostly at random, CLOTH’s cloze task is more challenging because the missing word in each example was chosen by an expert. For many other benchmarks built by mining less reliable or consistent sources, there is often a need for automatic or human validation or filtering, e.g., in creating SWAG [Zellers, Bisk, Schwartz,  ChoiZellers et al.2018], which was mined in part from other corpora such as the Large Scale Movie Description Challenge [Rohrbach, Torabi, Rohrbach, Tandon, Pal, Larochelle, Courville,  SchieleRohrbach et al.2017] and the captions in ActivityNet [Heilbron, Escorcia, Ghanem,  NieblesHeilbron et al.2015].
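The contrast between randomly chosen and expert-chosen blanks can be made concrete with a small sketch of automatic cloze generation in which a word is simply removed at random from a sentence; the actual CBT construction is more involved (e.g., restricting the removed word to certain word classes), so this is only illustrative.

```python
import random

def make_cloze(sentence, seed=0):
    """Turn a sentence into a (question, answer) pair by blanking one randomly chosen word."""
    rng = random.Random(seed)
    tokens = sentence.split()
    idx = rng.randrange(len(tokens))
    answer = tokens[idx]
    question = " ".join(tokens[:idx] + ["_____"] + tokens[idx + 1:])
    return question, answer

print(make_cloze("She thought that Mr. Cropper had exaggerated matters a little."))
```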

Crowdsourcing considerations.

Acquiring language data directly from crowd workers has become more feasible in recent years due to the growth of crowdsourcing platforms such as Amazon Mechanical Turk. Crowdsourcing has enabled researchers to create larger datasets than ever before; however, it comes with a set of considerations relating to task complexity, worker qualification, data validation, and cost optimization.

Task complexity. When creating a crowdsourcing task, it is important to consider the difficulty of what crowd workers will be expected to do. When given overly complicated instructions, non-expert crowd workers may find it difficult to understand the instructions and keep them in mind while working on the task. Easy crowdsourcing tasks typically involve quick pass/fail validation or relabeling of data, e.g., in validating SNLI [Bowman, Angeli, Potts,  ManningBowman et al.2015]. Difficult crowdsourcing tasks usually require crowd workers to write longer texts, e.g., in creating ROCStories [Mostafazadeh, Chambers, He, Parikh, Batra, Vanderwende, Kohli,  AllenMostafazadeh et al.2016], where workers wrote five-sentence stories following a fairly elaborate set of restrictions on story content to ensure stories were high-quality, focused, and well-organized. Such restrictions must be explained briefly and clearly, and it may take several pilot studies to ensure workers understand and follow the instructions correctly [Mostafazadeh, Chambers, He, Parikh, Batra, Vanderwende, Kohli,  AllenMostafazadeh et al.2016].

Worker qualification. Regardless of task difficulty, efforts should be made to prevent workers from submitting invalid data, whether because they are trolls or simply unable to follow directions. This may be done through a qualification task, perhaps requiring a prospective worker to recognize examples of acceptable submissions [Mostafazadeh, Chambers, He, Parikh, Batra, Vanderwende, Kohli,  AllenMostafazadeh et al.2016] or testing a prospective worker's grammar [Richardson, Burges,  RenshawRichardson et al.2013]. It may also be worthwhile to identify excellent workers and reward them and/or recruit them for more work [Mostafazadeh, Chambers, He, Parikh, Batra, Vanderwende, Kohli,  AllenMostafazadeh et al.2016]. In our own experience with crowdsourcing, we have found that if a worker produces one invalid submission, all of his or her submissions are likely to be invalid, and thus the worker should be rejected and potentially banned from the task. Conversely, if a worker produces an excellent submission, the rest of his or her submissions are likely to be excellent as well.

Data validation. Even though crowdsourced data is produced by non-experts, it can just as easily be validated by non-experts. In creating ROCStories, mostafazadehCorpusClozeEvaluation2016 employ several novel methods of crowdsourced validation of crowdsourced data. Such involved validation is especially necessary for difficult crowdsourcing tasks like the authoring of ROCStories, which required crowd workers to write long texts following strict guidelines. Crowdsourced data validation typically just requires a separate group of workers to review the generated data and identify any bad examples, e.g., the validation of DNC benchmark data [Poliak, Haldar, Rudinger, Hu, Pavlick, White,  Van DurmePoliak et al.2018a]. For labeling tasks, multiple crowd workers can label the same examples, and inter-annotator agreement can be measured to estimate data quality, e.g., in creating the JOCI benchmark [Zhang, Rudinger, Duh,  Van DurmeZhang et al.2016].

For complex writing tasks where both data authoring and validation are highly involved, it may be advantageous to present crowdsourcing tasks to two interacting workers at once. reddyCoQAConversationalQuestion2018 do this to create CoQA from actual human conversations about provided passages on Amazon Mechanical Turk, achieving high data quality with minimal validation. In creating the data, the two interacting workers validate each other's work and can even report workers who do not follow instructions, reducing the burden of worker qualification.

Dataset (Reference) Data Size Text/Question Answer Alternatives Annotation Validation
RTE-1 [Dagan, Glickman,  MagniniDagan et al.2005] 1.37K M M M
RTE-2 [Bar-Haim, Dagan, Dolan, Ferro, Giampiccolo, Magnini,  SzpektorBar-Haim et al.2006] 1.60K M M M
RTE-3 [Giampiccolo, Magnini, Dagan,  DolanGiampiccolo et al.2007] 1.60K M M M
RTE-4 [Giampiccolo, Dang, Magnini, Dagan,  DolanGiampiccolo et al.2008] 1.00K M M M
RTE-5 [Bentivogli, Dagan, Dang, Giampiccolo,  MagniniBentivogli et al.2009] 1.20K M M M
RTE-6 [Bentivogli, Clark, Dagan,  GiampiccoloBentivogli et al.2010] 32.7K T M M
RTE-7 [Bentivogli, Clark, Dagan,  GiampiccoloBentivogli et al.2011] 48.8K T M, T M
COPA [Roemmele, Bejan,  GordonRoemmele et al.2011] 1.00K M M M M
SICK [Marelli, Menini, Baroni, Bentivogli, Bernardi,  ZamparelliMarelli et al.2014a] 9.84K M, T C C
SNLI [Bowman, Angeli, Potts,  ManningBowman et al.2015] 570K T, C C C C
CBT [Hill, Bordes, Chopra,  WestonHill et al.2015] 687K T, A A A
Triangle-COPA [GordonGordon2016] 100 M M M M M
ROCStories [Mostafazadeh, Chambers, He, Parikh, Batra, Vanderwende, Kohli,  AllenMostafazadeh et al.2016] 98.2K C C C C
WSC [Morgenstern, Davis,  OrtizMorgenstern et al.2016] 60 M M M
bAbI [Weston, Bordes, Chopra, Rush, van Merriënboer, Joulin,  MikolovWeston et al.2016] 40.0K A A
SQuAD 1.1 [Rajpurkar, Zhang, Lopyrev,  LiangRajpurkar et al.2016] 108K T, C C C
JOCI [Zhang, Rudinger, Duh,  Van DurmeZhang et al.2016] 39.1K A, T C C
CLOTH [Xie, Lai, Dai,  HovyXie et al.2017] 99.4K T T T C
IIE [White, Rastogi, Duh,  Van DurmeWhite et al.2017] 313K T, A T, A M
SciTail [Khot, Sabharwal,  ClarkKhot et al.2018] 27.0K T, C C C C
ARC [Clark, Cowhey, Etzioni, Khot, Sabharwal, Schoenick,  TafjordClark et al.2018] 7.79K T T T
MCScript [Ostermann, Modi, Roth, Thater,  PinkalOstermann et al.2018] 13.9K M, T, C C C C C
SC [Rashkin, Bosselut, Sap, Knight,  ChoiRashkin et al.2018a] 161K T C C C
Event2Mind [Rashkin, Sap, Allaway, Smith,  ChoiRashkin et al.2018b] 57.1K T C C
ProPara [Mishra, Huang, Tandon, Yih,  ClarkMishra et al.2018] 488 C C C C
MultiRC [Khashabi, Chaturvedi, Roth, Upadhyay,  RothKhashabi et al.2018] 9.87K T, C C C C
SQuAD 2.0 [Rajpurkar, Jia,  LiangRajpurkar et al.2018] 151K T, C C C
CoQA [Reddy, Chen,  ManningReddy et al.2018] 8.40K T, C C C
GLUE [Wang, Singh, Michael, Hill, Levy,  BowmanWang et al.2018] 1.44M T, A T, A M
DNC [Poliak, Haldar, Rudinger, Hu, Pavlick, White,  Van DurmePoliak et al.2018a] 570K M, A, T M, A, T C
OpenBookQA [Mihaylov, Clark, Khot,  SabharwalMihaylov et al.2018] 5.96K C C C C C
SWAG [Zellers, Bisk, Schwartz,  ChoiZellers et al.2018] 114K T T, A A C C
ReCoRD [Zhang, Liu, Liu, Gao, Duh,  Van DurmeZhang et al.2018] 121K A, T A A A A, C
CommonsenseQA [Talmor, Herzig, Lourie,  BerantTalmor et al.2019] 9.40K C T T C
Table 2: Chronological summary of methods used in creating, annotating (i.e., providing extra useful information beyond the answer which the task evaluates upon), and validating selected commonsense benchmarks and tasks, where M refers to manual approaches by experts, A to automatic approaches through language generation, T to text mining, and C to crowdsourcing. Data size is included for comparison of methods used.

Cost optimization. Though hiring crowd workers is typically cheaper than hiring permanent workers, the cost may still be limiting, especially if the creators wish to properly evaluate the quality of the generated data through verification by additional crowd workers. For example, ROCStories [Mostafazadeh, Chambers, He, Parikh, Batra, Vanderwende, Kohli,  AllenMostafazadeh et al.2016], consisting of about 50,000 well-evaluated five-sentence stories and 13,500 test cases, cost an average of 26 cents per story and an extra 10 cents per test case, for a total cost close to 15,000 USD. If the cost of such thorough validation is an issue, validating a random sample of the produced data, e.g., in validating SNLI [Bowman, Angeli, Potts,  ManningBowman et al.2015], can serve as an indicator of the overall quality of the benchmark data.

Ultimately, each method of data collection has its own advantages and drawbacks. While manual authoring results in high-quality, expert-verified data, it is slow and unscalable. The quality of automatically generated data depends heavily on the language model used, and though generation is faster, it may require manual verification. If text is mined rather than generated from scratch, the data is more likely to be representative of human language, though manual verification may still be necessary depending on the source from which the data are extracted. Lastly, crowdsourcing is a quick and convenient way to collect human-authored data following any set of criteria or restrictions, but it comes with special considerations around task difficulty, worker qualification, data validation, and cost optimization. When developing a new benchmark, these trade-offs need to be carefully considered.

3 Knowledge Resources

It is estimated that a typical human has accrued several million different axioms of commonsense by adulthood [ChklovskiChklovski2003]. The lack of this commonsense knowledge is one of the major bottlenecks in machine intelligence. To remove this bottleneck, decades of effort have been devoted to developing various knowledge resources in the field of AI. The acquired knowledge is often represented in various forms such as propositions, taxonomies, ontologies, and semantic networks. In this section, we start with an introduction to several existing knowledge resources, and then discuss the main issues involved in building these resources.

3.1 An Overview of Knowledge Resources for NLU

To understand human language, it is important to have linguistic knowledge resources that allow computers to identify syntactic and semantic structures from language. These structures often need to be augmented with common knowledge and commonsense knowledge in order to reach a full understanding.

3.1.1 Linguistic Knowledge Resources

Linguistic resources have been pivotal in pushing the NLP field forward in the last thirty years. Resources have been developed where annotations for syntactic, semantic, and discourse structures are provided for training machine learning models. Several knowledge bases, particularly for lexical semantics, have also been made available to facilitate semantic processing.

Annotated linguistic corpora.

Widely used linguistic resources include the Penn Treebank [Marcus, Santorini,  MarcinkiewiczMarcus et al.1993] and several of its derivatives. The Penn Treebank is perhaps the first annotated corpus that drove the development of machine learning approaches in the 1990s. It started with POS tags and syntactic structures based on context-free grammar. The Wall Street Journal portion of it was further augmented into PropBank [Kingsbury, Palmer,  MarcusKingsbury et al.2002], which provides annotation of predicate-argument structures [Taylor, Marcus,  SantoriniTaylor et al.2003]. The Penn Discourse Treebank (PDTB) builds upon these [Miltsakaki, Prasad, Joshi,  WebberMiltsakaki et al.2004], adding annotated discourse structures. OntoNotes revises the information in the Wall Street Journal portion of the Penn Treebank and PropBank, integrating it with word sense, proper name, coreference, and ontological annotations, as well as including some Chinese linguistic annotations [Pradhan, Hovy, Marcus, Palmer, Ramshaw,  WeischedelPradhan et al.2007]. The Abstract Meaning Representation (AMR) corpus extends PropBank into a sentence-level semantic formalism [Banarescu, Bonial, Cai, Georgescu, Griffitt, Hermjakob, Knight, Koehn, Palmer,  SchneiderBanarescu et al.2013]. All of these linguistic corpora can be downloaded by members of the Linguistic Data Consortium at http://www.ldc.upenn.edu/.

Lexical resources.

A widely used lexical resource for commonly used nouns, verbs, adjectives, and adverbs is WordNet (http://wordnet.princeton.edu/) [MillerMiller1995]. Different from a traditional online dictionary, WordNet organizes words in terms of concepts (i.e., lists of synonyms) and their semantic relations to other words (e.g., antonymy, hyponymy/hypernymy, entailment, etc.). WordNet has been applied in many NLP applications involving, for example, query expansion and similarity measures. There are also resources specifically for verbs. VerbNet (http://verbs.colorado.edu/~mpalmer/projects/verbnet.html) by schuler2005verbnet is a hierarchical English verb lexicon created based on the verb classes from the English Verb Classes and Alternations (EVCA) resource by levinEnglishVerbClasses1993. VerbNet contains 280 classes of verbs, and each class is described by argument structures, selectional restrictions on the arguments, and syntactic descriptions. FrameNet (http://framenet.icsi.berkeley.edu/fndrupal/framenet_data) [Fillmore, Baker,  SatoFillmore et al.2002] provides a database of frame semantics for a set of verbs, and comes with sentences annotated with frame semantics. Other resources for verbs include VerbOcean (http://demo.patrickpantel.com/demos/verbocean/) [Chklovski  PantelChklovski  Pantel2004], which captures a network of 3,500 unique verbs and 22,000 fine-grained relations between them, and VerbCorner (http://archive.gameswithwords.org/VerbCorner/about.php) [Hartshorne, Bonial,  PalmerHartshorne et al.2013], which provides crowdsourced validation for VerbNet.

3.1.2 Common Knowledge Resources

Common knowledge refers to specific facts about the world that are often explicitly stated, for example, "canine distemper is a domestic animal disease." [Cambria, Song, Wang,  HussainCambria et al.2011]. Though this is not the same as commonsense knowledge, it is often required to achieve a deep understanding of both the low- and high-level concepts found in language [Cambria, Song, Wang,  HussainCambria et al.2011]. In this section, we summarize several knowledge resources for common knowledge.

YAGO.

Wikipedia is a large and open source of common knowledge. Yet Another Great Ontology (YAGO) by suchanekYAGOCoreSemantic2007 augments WordNet [MillerMiller1995] with common knowledge facts extracted from Wikipedia, converting WordNet from a primarily linguistic resource into a common knowledge base. YAGO originally consisted of more than 1 million entities and 5 million facts describing relationships between these entities. YAGO2 grounded entities, facts, and events in time and space, and contained 446 million facts about 9.8 million entities [Hoffart, Suchanek, Berberich,  WeikumHoffart et al.2012], while YAGO3 added about 1 million more entities from non-English Wikipedia articles [Mahdisoltani, Biega,  SuchanekMahdisoltani et al.2013]. YAGO is available for free download at http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/downloads/.

DBpedia.

DBpedia by auerDBpediaNucleusWeb2007 is another Wikipedia-based knowledge base, originally consisting of structured knowledge from more than 1.95 million Wikipedia articles. At its creation, DBpedia included around 103 million Resource Description Framework (RDF) triples (https://www.w3.org/TR/rdf-concepts/#section-triples), i.e., subject-predicate-object triples that describe semantic relationships. These triples included descriptions of concepts within articles, information about people, links between articles, and category labels from YAGO. The latest version, available for free at http://wiki.dbpedia.org/develop/datasets, consists of 6.6 million entities, 5.5 million resources classified in the DBpedia ontology, and over 23 billion RDF triples.
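RDF triples can be thought of simply as (subject, predicate, object) statements. The toy sketch below stores a few such triples in plain Python and answers simple pattern queries; the facts shown are illustrative and are not taken from DBpedia, nor is this an interface to it.

```python
# A toy illustration of subject-predicate-object triples and pattern matching,
# not a client for DBpedia or any other knowledge base.
triples = [
    ("Berlin", "isCapitalOf", "Germany"),
    ("Germany", "locatedIn", "Europe"),
    ("Berlin", "population", "3600000"),
]

def query(subject=None, predicate=None, obj=None):
    """Return all triples matching the given (possibly partial) pattern."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]

print(query(subject="Berlin"))         # all facts about Berlin
print(query(predicate="isCapitalOf"))  # all capital-of relations
```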

WikiTaxonomy.

Yet another Wikipedia-based resource, WikiTaxonomy by ponzettoDerivingLargeScale2007, consists of about 105,000 well-evaluated semantic links between categories of Wikipedia articles. Categories and relationships are labeled using the connectivity of the conceptual network formed by the categories. The authors demonstrate that this resource can be used to calculate the semantic similarity of words, which may be useful in textual entailment or inference tasks. WikiTaxonomy is available for free download at http://www.h-its.org/en/research/nlp/wikitaxonomy/.

Freebase.

Freebase by bollackerFreebaseCollaborativelyCreated2008 was a knowledge graph which originally contained 125 million RDF triples of general human knowledge covering about 4,000 types of entities and 7,000 properties of entities. The resource was later absorbed into the Google Knowledge Graph for intelligent web search; however, the last release, which contains more than 1.9 billion triples, is still available for free download at http://developers.google.com/freebase/.

NELL.

As human knowledge is not static, it is advantageous for knowledge bases to grow over time. This is typically done through new releases; however, the Never-Ending Language Learner (NELL) by carlsonArchitectureNeverEndingLanguage2010 grows continually by automatically mining structured beliefs of varying confidence from the web every day. It originally contained 242,000 beliefs about properties of entities, but now contains over 50 million beliefs, almost 3 million of which have high confidence. The current version can be downloaded for free at http://rtw.ml.cmu.edu/rtw/.

Probase.

Probase by wuProbabilisticTaxonomyMany2011 differs from previous common knowledge taxonomies in that its relationships are probabilistic rather than concrete. Probase consists of 2.7 million concepts extracted from 1.6 billion web pages. Relationships between concepts are described by 20.8 million is-a and is-instance-of pairs, and probabilistic interpretation is possible through similarity values between 0 and 1 provided for each pair of concepts in the knowledge base. Though the original resource is no longer available, the Microsoft Concept Graph, which was built upon Probase, can be downloaded for free at http://concept.research.microsoft.com/Home/Download.

3.1.3 Commonsense Knowledge Resources

Commonsense knowledge, on the other hand, is considered obvious to most humans, and not so likely to be explicitly stated [Cambria, Song, Wang,  HussainCambria et al.2011]. davisCommonsenseReasoningCommonsense2015 demonstrate this fact: "if you see a six-foot-tall person holding a two-foot-tall person in his arms, and you are told they are father and son, you do not have to ask which is which." There has been a long effort in capturing and encoding commonsense knowledge. Various knowledge bases have been developed. Here, we give a brief introduction to some of the well-known commonsense knowledge bases.

Note that as there is a fine line between commonsense knowledge and common knowledge, the knowledge bases we describe here may also contain common knowledge. Learning commonsense knowledge requires generalizations over common knowledge, so it is not uncommon for these types of knowledge to appear together in knowledge bases. We present several knowledge resources which focus on commonsense knowledge, but may also include common knowledge.

Cyc.

A well-known project toward encoding commonsense knowledge is Cyc by lenatBuildingLargeKnowledgeBased1989, a knowledge base of rules expressing ontological relationships between objects encoded in the CycL language. The types of objects in Cyc include entities, collections, functions, and truth functions. Cyc also includes a powerful inference engine. ResearchCyc, a release of Cyc for the research community, can be licensed for free at http://www.cyc.com/researchcyc/. According to this site, the latest release of ResearchCyc contains over 7 million commonsense assertions. More recently, there have been efforts to map Cyc to Wikipedia articles in an attempt to connect it to other resources such as DBpedia and Freebase [Medelyan  LeggMedelyan  Legg2008, PohlPohl2012].

ConceptNet.

Another popular knowledge base for commonsense reasoning is ConceptNet by liuConceptNetPracticalCommonsense2004, a product of the Open Mind Common Sense project by singhPublicAcquisitionCommonsense2002, which collected free-text commonsense assertions from online users. This semantic network originally contained over 1.6 million assertions of commonsense knowledge represented as links between 300,000 nodes representing entities, and subsequent releases have expanded it and added more features. The latest release, ConceptNet 5.5 [Speer, Chin,  HavasiSpeer et al.2017], contains over 21 million links between over 8 million nodes, having been augmented by several additional resources including Cyc [Lenat  GuhaLenat  Guha1989] and DBpedia [Auer, Bizer, Kobilarov, Lehmann, Cyganiak,  IvesAuer et al.2007]. It includes knowledge from multilingual resources and links to knowledge from other knowledge graphs. ConceptNet has been applied in several commonsense reasoning systems, some of which are described in Section 4.3. ConceptNet is open; information for using or downloading it can be found at http://conceptnet.io/.
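ConceptNet can also be queried programmatically. The sketch below assumes the public REST endpoint at http://api.conceptnet.io and the JSON edge structure of ConceptNet 5.5; it should be treated as illustrative rather than as a definitive client.

```python
# A minimal sketch of querying ConceptNet's public API (assuming the endpoint and JSON
# structure of ConceptNet 5.5 at the time of writing); requires the requests library.
import requests

def conceptnet_edges(term, limit=5):
    """Fetch a few edges (relations) for an English term from the ConceptNet API."""
    url = "http://api.conceptnet.io/c/en/" + term
    data = requests.get(url, params={"limit": limit}).json()
    edges = []
    for edge in data.get("edges", []):
        rel = edge.get("rel", {}).get("label", "?")
        start = edge.get("start", {}).get("label", "?")
        end = edge.get("end", {}).get("label", "?")
        edges.append((start, rel, end))
    return edges

for start, rel, end in conceptnet_edges("piano"):
    print(start, "--", rel, "->", end)
```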

AnalogySpace.

Though not technically a knowledge base itself, AnalogySpace [Speer, Havasi,  LiebermanSpeer et al.2008], another product of the Open Mind Common Sense project, is an algorithm for reducing the dimensionality of commonsense knowledge so that knowledge bases can be reasoned over more efficiently and accurately. Since knowledge in large knowledge bases can be noisy or subjective, it provides a way to draw conclusions by analogy, i.e., by recognizing similarities and tendencies through simple vector operations. This projects concepts onto dimensions such as goodness and difficulty, which may help models generalize on sparse benchmark data. AnalogySpace was originally applied to ConceptNet [Liu  SinghLiu  Singh2004], and is included with the latest release of ConceptNet.

SenticNet.

SenticNet by cambriaSenticNetPubliclyAvailable2010 was originally only a commonsense knowledge base, but later versions incorporated common knowledge as well [Cambria, Olsher,  RajagopalCambria et al.2014a]. Though the knowledge base is intended for sentiment analysis, it may be useful in commonsense reasoning tasks which require inference about sentiment. SenticNet is available for free at http://sentic.net/downloads/.

IsaCore.

Isanette by cambriaIsanetteCommonCommon2011a was a semantic network of both common and commonsense knowledge created by combining Probase [Wu, Li, Wang,  ZhuWu et al.2011] and ConceptNet 3 [Havasi, Speer,  AlonsoHavasi et al.2007] into a set of "is a" relationships and confidences. This work was later cleaned and optimized into IsaCore [Cambria, Song, Wang,  HowardCambria et al.2014b], which was demonstrated to be effective for sentiment analysis. IsaCore may also be a useful resource for commonsense reasoning. It is available for free download at http://sentic.net/downloads/.

COGBASE.

COGBASE by olsherSemanticallybasedPriorsNuanced2014a uses a novel formalism to represent 2.7 million concepts and 10 million commonsense facts about them. It makes up the core of SenticNet 3 [Cambria, Olsher,  RajagopalCambria et al.2014a]. Data from COGBASE can currently be accessed via an online interface and a demo API available at http://cogview.com/cogbase/.

WebChild.

WebChild by tandonWebChildHarvestingOrganizing2014 was originally a commonsense knowledge base of general noun-adjective relations extracted from Web content and other resources, consisting of about 78,000 distinct noun senses, 5,600 distinct adjective senses, and 4.6 million assertions between them. These assertions captured fine-grained relations among the noun and adjective senses. Unlike other resources collected from the Web, WebChild consists primarily of commonsense knowledge, as it consists of generalized, fine-grained relationships between nouns and adjectives collected from various corpora rather than structured common knowledge taken directly from a single open source like Wikipedia. WebChild 2.0 [Tandon, de Melo,  WeikumTandon et al.2017] was later released to include over 2 million concepts and activities, and over 18 million such assertions. Data from WebChild can be browsed and downloaded online at http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/webchild/.

LocatedNear.

xuAutomaticExtractionCommonsense2018 claim that knowledge of which objects tend to be near each other (e.g., silverware, a plate, and a glass) is a type of commonsense knowledge lacking in previous knowledge bases like ConceptNet 5.5 [Speer, Chin,  HavasiSpeer et al.2017]. They refer to this property as LocatedNear, and to address the gap, they create two datasets which we refer to jointly as LocatedNear. The first consists of 5,000 sentences describing scenes of two objects, labeled for whether the objects tend to occur near each other, which can serve as a commonsense task similar to those introduced in Section 2. The second consists of 500 pairs of objects with human-produced confidence scores for how likely the objects are to appear near each other. These resources can be downloaded from https://github.com/adapt-sjtu/commonsense-locatednear.

ATOMIC.

The Atlas of Machine Commonsense (ATOMIC) by sapATOMICAtlasMachine2019 is a knowledge graph consisting of about 300,000 nodes corresponding to short textual descriptions of events, and about 877,000 "if-event-then" triples representing if-then relationships between everyday events. Rather than taxonomic or ontological knowledge, this graph contains easily-accessed inferential knowledge. sapATOMICAtlasMachine2019 demonstrate that neural models can learn simple commonsense reasoning skills from ATOMIC which can be used to make inferences about previously unseen events. ATOMIC can be browsed and downloaded for free at http://homes.cs.washington.edu/~msap/atomic/.

3.2 Approaches to Creating Knowledge Resources

Similar to the creation of the commonsense reasoning benchmarks described in Section 2, various approaches have been applied to create knowledge resources. These approaches range from manual encoding by experts, to text mining from web documents, to collection through crowdsourcing. A detailed description of these approaches is provided by davisCommonsenseReasoningCommonsense2015. Here, we give a brief discussion of the pros and cons of each.

Manual encoding.

Early knowledge bases were often manually created. The classic example is Cyc, which is produced by knowledge engineers who hand-code commonsense knowledge into the CycL formalism [Lenat  GuhaLenat  Guha1989]. Since its first release in 1984, Cyc has undergone continuous development over the last 35 years. The cost of this manual encoding is high, with a total estimated cost of $120M [PaulheimPaulheim2018]. As a consequence, Cyc is small relative to other resources and growing very slowly. On the other hand, this expert-based approach ensures high data quality.

Text mining.

Text mining and information extraction tools such as TextRunner [Etzioni, Banko, Soderland,  WeldEtzioni et al.2008] and KnowItAll [Etzioni, Cafarella, Downey, Popescu, Shaked, Soderland, Weld,  YatesEtzioni et al.2005] are applied to automatically generate knowledge graphs and taxonomies from information sources. One popular information source is Wikipedia, which was drawn from in creating common knowledge bases such as YAGO [Suchanek, Kasneci,  WeikumSuchanek et al.2007], DBpedia [Auer, Bizer, Kobilarov, Lehmann, Cyganiak,  IvesAuer et al.2007], and WikiTaxonomy [Ponzetto  StrubePonzetto  Strube2007]. Other knowledge bases are generated by crawling the Web, e.g., NELL [Carlson, Betteridge, Kisiel, Settles, Jr,  MitchellCarlson et al.2010], or even from other knowledge bases, e.g., IsaCore [Cambria, Song, Wang,  HowardCambria et al.2014b]. One key advantage of text mining approaches is cost efficiency. According to paulheimHowMuchTriple2018, creating a statement in the Wikipedia-extracted DBpedia and YAGO costs 1.85¢ and 0.83¢ (USD), respectively, hundreds of times less than the cost of manually encoding a statement in Cyc (estimated at about $5.71 per statement). This makes it easy to scale text mining approaches up to create large knowledge bases. However, the drawback is that the acquired knowledge can be noisy and inconsistent, and often needs human validation.

Crowdsourcing.

Another highly popular approach to creating knowledge bases is crowdsourcing. The Open Mind Common Sense project responsible for producing ConceptNet [Liu  SinghLiu  Singh2004] used a competitive online game to accept statements from humans in free text [SinghSingh2002]. Researchers later converted the knowledge within the collected statements into a knowledge graph through automatic processes. This method of using games to attract users to perform human intelligence tasks for free has been applied in creating other knowledge resources such as VerbCorner [Hartshorne, Bonial,  PalmerHartshorne et al.2013] and the Robot Trainer knowledge base by rodosthenousHybridApproachCommonsense2016, where players must teach a virtual robot human knowledge. The cost of crowdsourcing is difficult to assess: it ranges from effectively free (e.g., the Open Mind Common Sense platform, where users who submitted data were unpaid) to an estimated $2.25 per statement [PaulheimPaulheim2018]. Though the gaming approach may be cheaper in the long run, developing such a game platform is inevitably more time-consuming. Another challenge of the crowdsourcing approach, as pointed out by davisCommonsenseReasoningCommonsense2015, is that naive crowd workers may not be able to follow the theories and representations of knowledge that engineers have worked out. As a result, knowledge acquired by crowdsourcing can be somewhat messy, which again often needs expert validation.

Each of these methods has its own advantages and drawbacks in terms of the trade-off between cost and the quality of the acquired knowledge. Most of these knowledge resources are developed in a bottom-up fashion, with the goal of creating general knowledge bases that provide inductive bias for a variety of learning and reasoning tasks. Nevertheless, it is not clear whether such a goal is met, nor to what extent these knowledge resources are applied to commonsense reasoning in practice. A systematic study, as suggested by davisCommonsenseReasoningCommonsense2015, of Cyc and other resources would be useful.

4 Learning and Inference Approaches

To solve the benchmark tasks described in Section 2, a variety of approaches have been developed. These range from earlier symbolic and statistical approaches to recent approaches that apply deep learning and neural networks. This section gives a brief overview of some representative approaches.

4.1 Symbolic and Statistical Approaches

Manually authored logic rules and formalisms have been demonstrated to perform well for various reasoning tasks. davisLogicalFormalizationsCommonsense2017 provides a more detailed review of these approaches in commonsense reasoning, but we introduce a few which have been applied to the surveyed commonsense benchmarks. Manually authored logical rules were applied, for example, in systems for the earlier RTE Challenges [Raina, Ng,  ManningRaina et al.2005]. They were also applied in more recent work such as the baseline approach to the Triangle-COPA benchmark which achieved 91% accuracy [GordonGordon2016]. The highest-performing system [IfteneIftene2008] in the 3-way task of the fourth RTE challenge [Giampiccolo, Dang, Magnini, Dagan,  DolanGiampiccolo et al.2008] used both manually authored logical rules and outside knowledge from Wikipedia, WordNet [MillerMiller1995], and VerbOcean [Chklovski  PantelChklovski  Pantel2004]. While manually authored logical rules have been demonstrated to be highly effective in some tasks [GordonGordon2016], this approach is not scalable for more complex tasks and reasoning.

Statistical approaches often rely on engineered features to train statistical models for various tasks. Lexical features, for example, based on bags of words and word matching, were commonly used in the earlier RTE Challenges [Dagan, Glickman,  MagniniDagan et al.2005, Bar-Haim, Dagan, Dolan, Ferro, Giampiccolo, Magnini,  SzpektorBar-Haim et al.2006], but often achieved results only slightly better than random guessing [Bar-Haim, Dagan, Dolan, Ferro, Giampiccolo, Magnini,  SzpektorBar-Haim et al.2006]. More competitive systems have used richer linguistic features to make predictions, such as semantic dependencies and paraphrases [Hickl, Bensley, Williams, Roberts, Rink,  ShiHickl et al.2006], synonym, antonym, and hypernym relationships derived from training data, and even hidden correlation biases in benchmark data, as exploited by laiIllinoisLHDenotationalDistributional2014.

External knowledge and the Web are often used to complement features derived from the training data. For example, the best system in the first RTE Challenge [Dagan, Glickman,  MagniniDagan et al.2005] used a naïve Bayes classifier with features from the co-occurrences of words obtained from an online search engine [GlickmanGlickman2006]. A similar approach was also applied in the top system from the seventh RTE Challenge [Bentivogli, Clark, Dagan,  GiampiccoloBentivogli et al.2011], which utilized knowledge resources from Section 3, acronyms extracted from the training data, and linguistic knowledge to calculate a statistical measure of entailment between sentences [Tsuchida  IshikawaTsuchida  Ishikawa2011]. While the use of some external knowledge provides an advantage over models which only use linguistic features extracted from training data, statistical models have still not been competitive on recent benchmarks with large data sizes. Nonetheless, such models may serve as useful baselines for new benchmarks, as demonstrated for JOCI [Zhang, Rudinger, Duh,  Van DurmeZhang et al.2016].

4.2 Neural Approaches

Figure 12: Common components in neural approaches to language-based commonsense tasks.

The increasingly large amount of data available for recent benchmarks makes it possible to train neural models, and these approaches often top various leaderboards. Figure 12 shows some common components in neural models. First of all, distributional representations of words are fundamental: word vectors or embeddings are usually trained using neural networks on large-scale text corpora. In traditional word embedding models like word2vec [Mikolov, Chen, Corrado,  DeanMikolov et al.2013] or GloVe [Pennington, Socher,  ManningPennington et al.2014], the embedding vectors are context independent: no matter what context the target word appears in, once trained, its embedding vector is always the same. Consequently, these embeddings cannot model different word senses in different contexts, although this phenomenon is prevalent in language. To address this problem, recent work has developed contextual word representation models, e.g., Embeddings from Language Models (ELMo) by petersDeepContextualizedWord2018 and Bidirectional Encoder Representations from Transformers (BERT) by devlinBERTPretrainingDeep2018. These models give words different embedding vectors based on the context in which they appear. These pre-trained word representations can be used as features or fine-tuned for downstream tasks. For example, the Generative Pre-trained Transformer (GPT) by radfordImprovingLanguageUnderstanding2018a and BERT [Devlin, Chang, Lee,  ToutanovaDevlin et al.2018] introduce minimal task-specific parameters and can be easily fine-tuned on downstream tasks with modified final layers and loss functions.
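
The contrast between static and contextual embeddings can be illustrated with the sketch below. It uses the HuggingFace transformers library and the bert-base-uncased checkpoint, which are assumptions about the reader's environment rather than part of the surveyed systems; the point is simply that the same word receives different vectors in different sentences.

```python
# Sketch: contextual embeddings give the same word different vectors per context.
# Assumes the HuggingFace `transformers` package and the `bert-base-uncased` checkpoint.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence, word):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]       # (seq_len, hidden_dim)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]

v1 = embedding_of("he sat on the river bank", "bank")
v2 = embedding_of("she deposited cash at the bank", "bank")
# A static embedding (word2vec/GloVe) would give identical vectors for both uses.
print(torch.cosine_similarity(v1, v2, dim=0))  # < 1.0: the two "bank" vectors differ
```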

On top of the word embedding layers, task-specific network architectures are designed for different downstream applications. These architectures often adopt recurrent neural networks (RNNs, e.g., LSTMs and GRUs), convolutional neural networks (CNNs), or, more recently, transformers, and the output layers are chosen based on the task formulation, e.g., a linear layer plus softmax for classification, or a language decoder for generation. Because of the sequential nature of language, RNN-based architectures are widely applied and often appear in both baseline approaches [Bowman, Angeli, Potts,  ManningBowman et al.2015, Rashkin, Sap, Allaway, Smith,  ChoiRashkin et al.2018b] and current state-of-the-art approaches [Kim, Hong, Kang,  KwakKim et al.2019, Chen, Cui, Ma, Wang, Liu,  HuChen et al.2018, Henaff, Weston, Szlam, Bordes,  LeCunHenaff et al.2017]. Beyond the choice of architecture, neural models also benefit from techniques such as memory augmentation and attention mechanisms. For tasks that require reasoning over multiple supporting facts, e.g., bAbI [Weston, Bordes, Chopra, Rush, van Merriënboer, Joulin,  MikolovWeston et al.2016], memory-augmented networks such as memory networks [Weston, Chopra,  BordesWeston et al.2015] and recurrent entity networks [Henaff, Weston, Szlam, Bordes,  LeCunHenaff et al.2017] have been shown to be effective. For tasks that require alignment between input and output, e.g., textual entailment tasks like SNLI [Bowman, Angeli, Potts,  ManningBowman et al.2015], or that require capturing long-term dependencies, it is often beneficial to add attention mechanisms to the models.
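
As a concrete, if simplified, picture of such a task-specific architecture, the PyTorch sketch below stacks a bidirectional LSTM encoder on top of an embedding layer and ends in a linear-plus-softmax classification head; the dimensions and the three-way label set are arbitrary illustrative choices, not those of any particular cited system.

```python
# Minimal task-specific architecture: embeddings -> BiLSTM encoder -> linear + softmax.
import torch
import torch.nn as nn

class SentenceClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, num_classes=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)          # could be initialized with GloVe
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)  # output layer for classification

    def forward(self, token_ids):
        embedded = self.embed(token_ids)                 # (batch, seq, embed_dim)
        _, (h_n, _) = self.encoder(embedded)             # final hidden states of both directions
        sentence_repr = torch.cat([h_n[0], h_n[1]], dim=-1)
        return torch.log_softmax(self.classifier(sentence_repr), dim=-1)

model = SentenceClassifier(vocab_size=10000)
dummy_batch = torch.randint(0, 10000, (4, 20))           # 4 sequences of 20 token ids
print(model(dummy_batch).shape)                           # torch.Size([4, 3])
```

For a generation task, the linear-plus-softmax head would be replaced by a language decoder, while the encoder could remain the same.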

Next, we give examples of current state-of-the-art systems, particularly focusing on three aspects: memory augmentation, attention mechanism, and pre-trained models and representations.

4.2.1 Memory Augmentation

As mentioned earlier, a popular approach for tasks which require comprehending passages with several state changes or supporting facts, such as bAbI [Weston, Bordes, Chopra, Rush, van Merriënboer, Joulin,  MikolovWeston et al.2016] or ProPara [Mishra, Huang, Tandon, Yih,  ClarkMishra et al.2018], is to augment systems with a dynamic memory which can be maintained over time to represent the changing state of the world. We discuss memory networks [Weston, Chopra,  BordesWeston et al.2015], recurrent entity networks [Henaff, Weston, Szlam, Bordes,  LeCunHenaff et al.2017], and the recent Knowledge Graph-Machine Reading Comprehension (KG-MRC) system [Das, Munkhdalai, Yuan, Trischler,  McCallumDas et al.2019] to highlight key characteristics of such approaches.

Memory networks.

Memory networks by westonMemoryNetworks2015, introduced as high-performing baseline approaches to both bAbI [Weston, Bordes, Chopra, Rush, van Merriënboer, Joulin,  MikolovWeston et al.2016] and CBT [Hill, Bordes, Chopra,  WestonHill et al.2015], track the world state by adding a long-term memory component to the typical network architecture. A memory network consists of a memory array, an input feature map, a generalization module which updates the memory array given new input, an output feature map, and a response module which converts output to the appropriate response or action. The networks can take characters, words, or sentences as input. Each component of the network can take different forms, but a common implementation is for them to be neural networks, in which case the network is called a MemNN.
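
The sketch below is a schematic and heavily simplified rendering of the input/generalization/output/response decomposition just described, with each component reduced to a toy function; in a real MemNN, each component is a learned neural module rather than the hand-coded stand-ins used here.

```python
# Schematic memory network skeleton (toy functions stand in for learned modules).
import numpy as np

class ToyMemoryNetwork:
    def __init__(self, embed):
        self.embed = embed        # I: input feature map (here, a fixed embedding function)
        self.memory = []          # the memory array

    def store(self, sentence):
        # G: generalization module -- here it simply appends the new memory slot.
        self.memory.append((sentence, self.embed(sentence)))

    def answer(self, question):
        # O: output feature map -- score each memory against the question, pick the best.
        q = self.embed(question)
        scores = [float(np.dot(q, m)) for _, m in self.memory]
        best_sentence, _ = self.memory[int(np.argmax(scores))]
        # R: response module -- here, naively return the last word of the supporting sentence.
        return best_sentence.split()[-1]

def bag_of_words_embed(text, dim=1000):
    v = np.zeros(dim)
    for w in text.lower().split():
        v[hash(w) % dim] += 1.0
    return v

net = ToyMemoryNetwork(bag_of_words_embed)
net.store("mary went to the kitchen")
net.store("john moved to the garden")
print(net.answer("where is john"))   # -> "garden"
```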

The ability to maintain a long-term memory enables richer tracking of the world state and context. On the CBT cloze task [Hill, Bordes, Chopra,  WestonHill et al.2015], memory networks have been demonstrated to outperform approaches based primarily on RNNs and LSTMs in predicting missing named entities and common nouns, because they can efficiently leverage a wider context than these approaches when making inferences. When tested on bAbI, MemNNs also achieved high performance and outperformed LSTM baselines, and on some tasks they reached high performance with fewer training examples than provided [Weston, Bordes, Chopra, Rush, van Merriënboer, Joulin,  MikolovWeston et al.2016].

Recurrent entity networks.

The recurrent entity network (EntNet) by henaffTrackingWorldState2017 is composed of several dynamic memory cells, where each cell learns to represent the state or properties concerning entities mentioned in the input. Each cell is a gated RNN which only updates its content when new information relevant to the particular entity is received. Further, EntNet’s memory cells run in parallel, allowing multiple locations of memory to be updated at the same time.
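
The core of each EntNet cell is a gated update. The sketch below is our paraphrase of the update equations in henaffTrackingWorldState2017 for one time step, with tanh standing in for the paper's activation function and randomly initialized matrices standing in for learned parameters; it shows how a slot with key w_j and content h_j is modified by a new input encoding s_t.

```python
# Sketch of a recurrent entity network memory update for one time step,
# roughly following the gated update of Henaff et al. (2017); tanh replaces
# the paper's activation, and U, V, W stand in for learned parameters.
import torch
import torch.nn.functional as F

def entnet_update(h, w, s, U, V, W):
    """h, w: (num_slots, dim) memory contents and keys; s: (dim,) input encoding."""
    gate = torch.sigmoid(h @ s + w @ s)                  # (num_slots,) relevance of input to each slot
    candidate = torch.tanh(h @ U.T + w @ V.T + s @ W.T)  # proposed new content per slot
    h = h + gate.unsqueeze(1) * candidate                # only relevant slots change much
    return F.normalize(h, dim=1)                         # normalization acts as a forget mechanism

dim, num_slots = 32, 5
U, V, W = (torch.randn(dim, dim) for _ in range(3))
h, w = torch.randn(num_slots, dim), torch.randn(num_slots, dim)
s = torch.randn(dim)                                      # encoding of the current sentence
h = entnet_update(h, w, s, U, V, W)
print(h.shape)                                            # torch.Size([5, 32])
```

Because the gate is computed per slot, all cells can be updated in parallel as each new sentence arrives, which is the property emphasized above.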

EntNet is, to our knowledge, the first model to pass all twenty tasks in bAbI [Weston, Bordes, Chopra, Rush, van Merriënboer, Joulin,  MikolovWeston et al.2016], and achieves impressive results on CBT [Hill, Bordes, Chopra,  WestonHill et al.2015], outperforming memory network baselines on both benchmarks. EntNet is also used as a baseline in the Story Commonsense benchmark [Rashkin, Bosselut, Sap, Knight,  ChoiRashkin et al.2018a] in an attempt to track the motivations and emotions of characters in stories from ROCStories [Mostafazadeh, Chambers, He, Parikh, Batra, Vanderwende, Kohli,  AllenMostafazadeh et al.2016], with some success. An advantage of EntNet is that it maintains and updates the state of the world as it reads the text, unlike memory networks, which can only perform reasoning once the entire supporting text and the question have been processed and loaded into memory. For example, given a supporting text with multiple questions, EntNet does not need to process the input text multiple times to answer these questions, while memory networks need to re-process the whole input for each question.

While EntNet achieves state-of-the-art performance on bAbI, it does not perform as well on ProPara [Mishra, Huang, Tandon, Yih,  ClarkMishra et al.2018], another benchmark which requires tracking the world state. According to dasBuildingDynamicKnowledge2019, a drawback of EntNet is that while it maintains memory registers for entities, it has no separate embedding for the individual states of entities over time. They further explain that EntNet does not explicitly update coreferences in memory, which can cause errors when reading human-authored text that is rich in coreference, as opposed to the simply-structured, automatically generated bAbI data.

KG-MRC.

A more recent model, the Knowledge Graph-Machine Reading Comprehension (KG-MRC) system from dasBuildingDynamicKnowledge2019, maintains a dynamic memory similar to memory networks. However, this memory takes the form of knowledge graphs generated after every sentence of procedural text, leveraging research efforts from the area of information extraction. The generated knowledge graphs are bipartite, connecting entities in the paragraph with their locations (currently, only the location relation is captured). Connections between entities and locations are updated to generate a new graph after each sentence.
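
The data structure itself is simple to picture. The sketch below is purely illustrative: a dictionary-based bipartite entity-to-location mapping is rebuilt after each sentence, with a hypothetical pattern-matching placeholder standing in for KG-MRC's learned reading comprehension module.

```python
# Illustrative bipartite entity->location graph, rebuilt after every sentence.
# In KG-MRC the `locate` step is a learned reading comprehension model; here it is
# a hypothetical placeholder that just looks for an "in the <location>" pattern.
import re

def locate(entity, sentences_so_far):
    """Toy stand-in for a span-prediction model: last stated location of the entity."""
    location = "unknown"
    for sent in sentences_so_far:
        if entity in sent:
            match = re.search(r"in the (\w+)", sent)
            if match:
                location = match.group(1)
    return location

paragraph = [
    "the seed is planted in the soil",
    "the seed grows into a plant in the garden",
]
entities = ["seed"]

graph_history = []
for i in range(1, len(paragraph) + 1):
    graph = {e: locate(e, paragraph[:i]) for e in entities}   # one bipartite graph per sentence
    graph_history.append(graph)

print(graph_history)   # [{'seed': 'soil'}, {'seed': 'garden'}]
```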

According to the official ProPara leaderboard (https://leaderboard.allenai.org/propara/submissions/public), KG-MRC achieves the highest accuracy on the benchmark, reported as 47.0% in the paper. It provides advantages over ProStruct, the previous state of the art for ProPara [Tandon, Mishra, Grus, Yih, Bosselut,  ClarkTandon et al.2018]. While ProStruct manually enforces hard and soft commonsense constraints, further investigation shows that KG-MRC learns these constraints automatically, violating them less often than ProStruct dasBuildingDynamicKnowledge2019. This suggests that the recurrent graph representation helps the model learn such constraints, perhaps better than they can be manually enforced. Further, since KG-MRC includes a trained reading comprehension model, it can likely better track the changes in coreference which often occur in these texts.

4.2.2 Attention Mechanism

Since the first application of attention mechanisms to neural machine translation [Bahdanau, Cho,  BengioBahdanau et al.2015], attention has been used widely in NLP tasks, especially to capture the alignment between an input (encoder) and an output (decoder). Modeling attention has several advantages. It allows the decoder to focus directly on relevant parts of the input. It alleviates the vanishing gradient problem by providing a path to states far away in the input sequence. Another advantage is that the attention distribution learned by the model automatically provides an alignment between inputs and outputs, which allows some understanding of their relations. Because of these advantages, attention mechanisms have been successfully applied to commonsense benchmark tasks.
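
A minimal sketch of one such encoder-decoder attention step is given below. It uses simple dot-product scoring for brevity rather than the additive scoring function of the original neural machine translation work, and all tensors are randomly generated placeholders.

```python
# Minimal dot-product attention from a decoder state over encoder states
# (illustrative; the original NMT work used an additive scoring function).
import torch

def attend(decoder_state, encoder_states):
    """decoder_state: (dim,); encoder_states: (src_len, dim)."""
    scores = encoder_states @ decoder_state          # one score per source position
    weights = torch.softmax(scores, dim=0)           # alignment distribution over the input
    context = weights @ encoder_states               # weighted sum of encoder states
    return context, weights

src_len, dim = 6, 16
encoder_states = torch.randn(src_len, dim)
decoder_state = torch.randn(dim)
context, weights = attend(decoder_state, encoder_states)
print(weights)          # shows which input positions the decoder focuses on
print(context.shape)    # torch.Size([16])
```

The `weights` vector is exactly the learned alignment referred to above: inspecting it reveals which input positions the model relied on for a given output step.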

Attention in RNN/CNN.

Adding an attention mechanism to RNNs, LSTMs, CNNs, and other models has been shown to improve performance on various tasks compared to the vanilla models [Kim, Hong, Kang,  KwakKim et al.2019]. It is particularly successful for tasks which require alignment between input and output, such as RTE tasks, which require modeling a context and a hypothesis, and reading comprehension questions which refer directly to an accompanying passage, as in MCScript [Ostermann, Modi, Roth, Thater,  PinkalOstermann et al.2018].

For example, the official leaderboard for the SNLI task [Bowman, Angeli, Potts,  ManningBowman et al.2015] (http://nlp.stanford.edu/projects/snli/) reports that the best performing system [Kim, Hong, Kang,  KwakKim et al.2019], partly inspired by DenseNet [Huang, Liu, van der Maaten,  WeinbergerHuang et al.2016], uses a densely connected RNN while concatenating features from an attention mechanism to the recurrent features in the network. As discussed by kimSemanticSentenceMatching2019, the attentive weights resulting from this alignment help the system make accurate entailment and contradiction decisions for highly similar pairs of sentences. One such example is the context sentence "Several men in front of a white building" compared to the hypothesis sentence "Several people in front of a gray building." For the MCScript task by ostermannMCScriptNovelDataset2018 used in SemEval 2018, online results (http://competitions.codalab.org/competitions/17184#results) indicate that the best performer [Chen, Cui, Ma, Wang, Liu,  HuChen et al.2018] achieved an accuracy of 84.13% using a bidirectional LSTM-based approach with an attention layer.

RNNs with attention also have limitations, particularly when the alignment between inputs and outputs is not straightforward. For example, chenHFLRCSystemSemEval20182018 found that yes/no questions were particularly challenging in MCScript, as they require special handling of negation and a deeper understanding of the question. Further, since the answer choices in MCScript are human-authored rather than extracted directly from the accompanying passage as in other QA benchmarks, systems have difficulty connecting answer words back to the passage, which stemming alone could not fix.

Self-attention in transformers.

In lieu of adding attention mechanisms to a typical neural model such as an RNN, LSTM, or CNN, the recently proposed transformer architecture is composed entirely of attention mechanisms [Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser,  PolosukhinVaswani et al.2017] (an excellent post on the implementation of the transformer can be found at http://nlp.seas.harvard.edu/2018/04/03/attention.html). One key difference is the self-attention layer in both the encoder and the decoder. Self-attention allows each word position in an input sequence to attend to all positions in the sequence to better encode that word. It provides a way to potentially capture long-range dependencies between words, such as syntactic, semantic, and coreference relations. Furthermore, instead of performing a single attention function, the transformer performs multi-head attention: it applies the attention function multiple times with different linear projections, which allows the model to jointly capture attentions from different subspaces, e.g., to jointly attend to information that might indicate both coreference and syntactic relations.
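
The following sketch spells out multi-head scaled dot-product self-attention in a few lines of PyTorch. It is a bare-bones rendering for illustration: it omits the residual connections, layer normalization, masking, and output projection of the full transformer block, and the projection matrices are random placeholders rather than learned parameters.

```python
# Bare-bones multi-head scaled dot-product self-attention (no residuals/layer norm).
import math
import torch

def multi_head_self_attention(x, w_q, w_k, w_v, num_heads):
    """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_model) projection matrices."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    # Project and split into heads: (num_heads, seq_len, d_head).
    q = (x @ w_q).view(seq_len, num_heads, d_head).transpose(0, 1)
    k = (x @ w_k).view(seq_len, num_heads, d_head).transpose(0, 1)
    v = (x @ w_v).view(seq_len, num_heads, d_head).transpose(0, 1)
    # Each position attends to every position, separately in each head's subspace.
    scores = q @ k.transpose(1, 2) / math.sqrt(d_head)        # (num_heads, seq_len, seq_len)
    weights = torch.softmax(scores, dim=-1)
    heads = weights @ v                                        # (num_heads, seq_len, d_head)
    return heads.transpose(0, 1).reshape(seq_len, d_model)     # concatenate the heads

d_model, seq_len, num_heads = 64, 10, 8
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
print(multi_head_self_attention(x, w_q, w_k, w_v, num_heads).shape)  # torch.Size([10, 64])
```

Because the attention scores for all positions are computed as a single matrix product, every position of the sequence is processed at once, which is what makes the parallelism discussed next possible.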

Another big benefit of the transformer is its suitability for parallel computing. Sequence models such as RNNs and LSTMs are, by their sequential nature, difficult to parallelize. The transformer, which uses attention to capture global dependencies between inputs and outputs, maximizes the amount of parallelizable computation. Empirical results on NLP tasks such as machine translation and constituency parsing have shown impressive performance gains with a significant reduction in training cost [Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser,  PolosukhinVaswani et al.2017]. Transformers have recently been used in pre-trained contextual models like GPT [Radford, Narasimhan, Salimans,  SutskeverRadford et al.2018] and BERT [Devlin, Chang, Lee,  ToutanovaDevlin et al.2018] to achieve state-of-the-art performance on many commonsense benchmarks.

4.2.3 Pre-Trained Models and Representations

One of the most exciting recent advances in NLP is the development of pre-trained models and embeddings that can be used as features or further fine-tuned for downstream tasks. These models are typically trained on large amounts of unlabeled text. The earlier pre-trained word embedding models such as word2vec [Mikolov, Chen, Corrado,  DeanMikolov et al.2013] and GloVe [Pennington, Socher,  ManningPennington et al.2014] have been widely applied. However, these models are context independent, meaning the same embedding is used in different contexts, and they therefore cannot capture different word senses. More recent work has addressed this problem by pre-training models that provide word embeddings based on context. The most representative models are ELMo, GPT, and BERT. Next, we give a brief overview of these models and summarize their performance on the selected benchmark tasks.

ELMo.

The characteristic contribution of ELMo by petersDeepContextualizedWord2018 is its contextual word embeddings, each of which depends on the entire input sentence in which the word appears. These embeddings are calculated from learned weights in a bidirectional LSTM which is pre-trained on the One Billion Word language modeling benchmark [Chelba, Mikolov, Schuster, Ge, Brants, Koehn,  RobinsonChelba et al.2014]. Simply adding these embeddings to the input features of previous state-of-the-art systems improved performance, suggesting that they are indeed successful in representing word context. An investigation by the authors shows that ELMo embeddings make it possible to identify word sense and part of speech, further supporting this.

When the ELMo embedding system was originally released, it helped exceed the state of the art on several benchmarks in question answering, textual entailment, and sentiment analysis, including SQuAD [Rajpurkar, Jia,  LiangRajpurkar et al.2018] and SNLI [Bowman, Angeli, Potts,  ManningBowman et al.2015]. All of these results have since been surpassed, but ELMo still commonly appears in baseline approaches to benchmarks, e.g., SWAG [Zellers, Bisk, Schwartz,  ChoiZellers et al.2018] and CommonsenseQA [Talmor, Herzig, Lourie,  BerantTalmor et al.2019]. It is often combined with enhanced LSTM-based models such as the ESIM model [Chen, Zhu, Ling, Wei, Jiang,  InkpenChen et al.2017], or CNN- and bidirectional GRU-based models such as the DocQA model [Clark  GardnerClark  Gardner2018].

GPT.

GPT by radfordImprovingLanguageUnderstanding2018a uses the transformer architecture originally proposed by vaswaniAttentionAllYou2017, in particular the decoder. The system is pre-trained without supervision on a large amount of open online data and then fine-tuned on various benchmark datasets. Unlike ELMo, GPT learns its contextual embeddings in an unsupervised setting, which allows it to learn features of language without restrictions. The creators found that this technique produced more discriminative features than supervised pre-training when applied to a large amount of clean data. The transformer architecture itself can then be easily fine-tuned in a supervised setting for downstream tasks; ELMo is less suitable for this kind of fine-tuning and is instead typically used to provide input features to a separate task-specific model.

When GPT was first released, it pushed the state of the art forward on 12 benchmarks in textual entailment, semantic similarity, sentiment analysis, commonsense reasoning, and more. These included SNLI [Bowman, Angeli, Potts,  ManningBowman et al.2015], MultiNLI [Williams, Nangia,  BowmanWilliams et al.2017], SciTail [Khot, Sabharwal,  ClarkKhot et al.2018], the Story Cloze Test [Mostafazadeh, Chambers, He, Parikh, Batra, Vanderwende, Kohli,  AllenMostafazadeh et al.2016], COPA [Roemmele, Bejan,  GordonRoemmele et al.2011], and GLUE [Wang, Singh, Michael, Hill, Levy,  BowmanWang et al.2018]. To our knowledge, GPT is still the highest-performing documented system for the Story Cloze Test and COPA, achieving 86.5% and 78.6% accuracy, respectively. GPT holds high positions on several other commonsense benchmark leaderboards as well, and it is commonly used as a baseline for new benchmarks, e.g., CommonsenseQA [Talmor, Herzig, Lourie,  BerantTalmor et al.2019].

radfordImprovingLanguageUnderstanding2018a identify several limitations of the model. First, the model has high computational requirements for pre-training and fine-tuning. Second, the data from the Internet on which the model is pre-trained are incomplete and sometimes inaccurate. Lastly, like many deep learning NLP models, GPT has some difficulty generalizing over data with high lexical variation.

To improve generalization ability and build upon the use of unsupervised training, the larger GPT 2.0 was recently released [Radford, Wu, Child, Luan, Amodei,  SutskeverRadford et al.2019]. It is highly similar to the original implementation, but with significantly more parameters, and it is formulated as a language model. The expanded model has achieved new state-of-the-art results on several language modeling tasks, including CBT [Hill, Bordes, Chopra,  WestonHill et al.2015] and the 2016 Winograd Schema Challenge [Davis, Morgenstern,  OrtizDavis et al.2017]. Further, it exceeds three out of four baseline approaches to CoQA [Reddy, Chen,  ManningReddy et al.2018] in an unsupervised setting, i.e., trained only on documents and questions, not answers. In a supervised training setting, the model would be fed the answers directly with the questions so that model parameters could be updated based on correlations between them. Instead, the model learns to perform tasks by observing natural language demonstrations of them, without being told where the questions and answers are. This ensures that the model does not overfit to superficial correlations between questions and answers. A qualitative investigation into model predictions suggests that some heuristics are indeed being learned to answer questions; for example, when asked a "who" question, the model has learned to return the name of a person mentioned in the passage that the question is posed on. This provides some evidence of the model performing reasoning processes similar to what the benchmarks intend.

BERT.

Recently, the BERT model [Devlin, Chang, Lee,  ToutanovaDevlin et al.2018] exceeded the state-of-the-art accuracy on several benchmarks, including GLUE [Wang, Singh, Michael, Hill, Levy,  BowmanWang et al.2018], SQuAD 1.1 [Rajpurkar, Zhang, Lopyrev,  LiangRajpurkar et al.2016], and SWAG [Zellers, Bisk, Schwartz,  ChoiZellers et al.2018]. According to the GLUE leaderboard (https://gluebenchmark.com/leaderboard), BERT originally achieved an overall accuracy of 80.4% on the multi-task benchmark. Meanwhile, according to the SQuAD leaderboard (http://rajpurkar.github.io/SQuAD-explorer/), BERT solved SQuAD 1.1 with 87.433% exact-match accuracy, exceeding human performance by 5.13%, and it solved SWAG with 86.28% accuracy according to the SWAG leaderboard (http://leaderboard.allenai.org/swag/submissions/public), greatly exceeding the previous state of the art set by GPT [Radford, Narasimhan, Salimans,  SutskeverRadford et al.2018].

After its initial release, BERT further topped several new leaderboards such as OpenBookQA [Mihaylov, Clark, Khot,  SabharwalMihaylov et al.2018], CLOTH [Xie, Lai, Dai,  HovyXie et al.2017], SQuAD 2.0 [Rajpurkar, Jia,  LiangRajpurkar et al.2018], CoQA [Reddy, Chen,  ManningReddy et al.2018], ReCoRD [Zhang, Liu, Liu, Gao, Duh,  Van DurmeZhang et al.2018], and SciTail [Khot, Sabharwal,  ClarkKhot et al.2018]. The deeper model introduced alongside the base model topped the leaderboards of OpenBookQA (http://leaderboard.allenai.org/open_book_qa/submissions/public) [Mihaylov, Clark, Khot,  SabharwalMihaylov et al.2018] and CLOTH (http://www.qizhexie.com/data/CLOTH_leaderboard) [Xie, Lai, Dai,  HovyXie et al.2017], achieving accuracies of 60.40% and 86.0%, respectively. According to the SQuAD leaderboard, various implementations of BERT had beaten the state-of-the-art performance on SQuAD 2.0 more than a dozen times at the time of writing, with the highest performance coming from an updated implementation which achieved an F-measure of 89.147. A modified ensemble implementation of BERT tops the CoQA leaderboard (http://stanfordnlp.github.io/coqa/) with an accuracy of 86.8%, and a single-model implementation tops the ReCoRD leaderboard (http://sheng-z.github.io/ReCoRD-explorer/) with an accuracy of 74.76%. Most recently, a new implementation of BERT with updated loss functions tops the GLUE leaderboard with 83.3% accuracy, beating the original implementation. Most of this progress has come in a span of just a few months.

Benchmark Simple Baseline Best Baseline ELMo GPT BERT BigBird Human
SQuAD 1.1 1.3 40.4 81.0 87.4 82.3
SWAG 25.0 59.2 78.0 86.3 88.0
SciTail 60.4 79.6 88.3 94.1
ReCoRD 18.6 45.4 73.0 91.3
OpenBookQA 25.0 50.2 60.4 91.7
CLOTH 25.0 70.7 86.0 85.9
GLUE 60.3 72.8 83.3 83.1 87.1
SQuAD 2.0 48.9 63.4 86.7 86.9
CoQA 1.3 65.1 86.8 88.8
SNLI 33.3 77.8 89.3 89.9 91.1
Table 3: Comparison of exact-match accuracy achieved on various benchmarks by a random or majority-choice baseline, the best-performing baseline presented in the original paper for each benchmark, ELMo, GPT, BERT, BigBird, and humans. ELMo refers to the highest-performing listed approach using ELMo embeddings. Best system performance on each benchmark in bold. Information extracted from leaderboards linked to in Section 4.2 at time of writing (March 2019), and original papers for benchmarks introduced in Section 2.

BERT provides several advantages over past state-of-the-art systems with pre-trained contextual embeddings. First, it is pre-trained on larger data than previous competitive systems like GPT [Radford, Narasimhan, Salimans,  SutskeverRadford et al.2018]. Where GPT is pre-trained on a large text corpus, BERT is trained on two larger corpora of passages with two objectives: a cloze task where input tokens are randomly masked, and a sentence-ordering task where, given two sentences, the system must predict whether the second sentence could come directly after the first. This sort of transfer learning from large-scale pre-training tasks has been demonstrated several times to be effective for NLP problems. Similar to GPT, BERT was further fine-tuned for each of the 11 benchmark tasks it originally attempted.
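
To make the cloze-style pre-training objective concrete, the sketch below builds masked-language-model training pairs by randomly hiding a fraction of tokens. It is a simplified rendition: BERT's actual procedure also sometimes keeps or randomly replaces the selected tokens and operates on WordPiece subwords rather than whole words.

```python
# Simplified masked-language-model example construction (BERT's real recipe also
# keeps/replaces some selected tokens and operates on WordPiece subwords).
import random

MASK = "[MASK]"

def make_mlm_example(tokens, mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(tok)      # the model must recover this token
        else:
            inputs.append(tok)
            targets.append(None)     # no loss is computed at unmasked positions
    return inputs, targets

sentence = "the man went to the store to buy a gallon of milk".split()
inputs, targets = make_mlm_example(sentence)
print(inputs)
print(targets)
```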

Second, BERT uses a bidirectional form of the transformer architecture [Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser,  PolosukhinVaswani et al.2017] for pre-training contextual embeddings. This better captures context, an advantage that previous competitive approaches like GPT [Radford, Narasimhan, Salimans,  SutskeverRadford et al.2018] and ELMo [Peters, Neumann, Iyyer, Gardner, Clark, Lee,  ZettlemoyerPeters et al.2018] do not have: GPT uses a left-to-right transformer, while ELMo uses a concatenation of left-to-right and right-to-left LSTMs.

Lastly, BERT’s input embeddings are superior at representing context: besides a traditional token embedding, they also include a learned segment embedding applied to each token (indicating which sentence the token belongs to when the input is a sentence pair) and learned positional embeddings. Consequently, the input representation captures each unique word and its context, perhaps in a more sophisticated way than previous systems. Further, it can uniquely represent a sentence or a pair of sentences, which is advantageous for solving a wide variety of language processing tasks in question answering, textual entailment, and more.
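
The sketch below illustrates how such an input representation can be assembled by summing token, segment, and position embeddings; the dimensions and vocabulary size are arbitrary placeholders rather than BERT's actual configuration.

```python
# Sketch of a BERT-style input representation: token + segment + position embeddings.
import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    def __init__(self, vocab_size=30000, max_len=512, dim=768):
        super().__init__()
        self.token = nn.Embedding(vocab_size, dim)
        self.segment = nn.Embedding(2, dim)       # sentence A vs. sentence B
        self.position = nn.Embedding(max_len, dim)

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token(token_ids)
                + self.segment(segment_ids)
                + self.position(positions))        # broadcast over the batch dimension

embed = InputEmbedding()
token_ids = torch.randint(0, 30000, (2, 12))       # batch of 2 sequences, 12 tokens each
segment_ids = torch.cat([torch.zeros(2, 6), torch.ones(2, 6)], dim=1).long()
print(embed(token_ids, segment_ids).shape)         # torch.Size([2, 12, 768])
```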

Recently, another BERT-based system called BigBird, or the Multi-Task Deep Neural Network (MT-DNN), by liuMultiTaskDeepNeural2019 has achieved competitive performance on several leaderboards, such as SciTail (http://leaderboard.allenai.org/scitail/submissions/public) with an accuracy of 94.07%, SNLI [Bowman, Angeli, Potts,  ManningBowman et al.2015] (https://nlp.stanford.edu/projects/snli/) with an accuracy of 91.1%, and GLUE with an accuracy of 83.1%, beating the original implementation of BERT. The performance gain from the BigBird model can mostly be attributed to adding multi-task learning during fine-tuning. This is done through task-specific layers which generate representations for specific tasks, e.g., text similarity and sentence-pair classification. While pre-training the bidirectional transformer helps BERT learn universal word representations that are applicable across several tasks, multi-task learning prevents the model from overfitting to particular tasks during fine-tuning, thus allowing it to leverage more cross-task data.
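
The multi-task setup can be pictured as a shared encoder feeding several task-specific heads. The sketch below is our schematic illustration of that idea only: a toy bag-of-embeddings encoder stands in for the pre-trained BERT encoder used in MT-DNN, and the task names and label counts are invented.

```python
# Schematic multi-task setup: one shared encoder, one lightweight head per task.
# (A toy bag-of-embeddings encoder stands in for the pre-trained BERT used in MT-DNN.)
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    def __init__(self, task_classes, vocab_size=10000, dim=128):
        super().__init__()
        self.shared_encoder = nn.Embedding(vocab_size, dim)    # shared across all tasks
        self.heads = nn.ModuleDict({
            task: nn.Linear(dim, n_classes) for task, n_classes in task_classes.items()
        })

    def forward(self, token_ids, task):
        encoded = self.shared_encoder(token_ids).mean(dim=1)   # crude sentence representation
        return self.heads[task](encoded)                       # task-specific output layer

model = MultiTaskModel(task_classes={"nli": 3, "paraphrase": 2})
batch = torch.randint(0, 10000, (4, 16))
print(model(batch, task="nli").shape)         # torch.Size([4, 3])
print(model(batch, task="paraphrase").shape)  # torch.Size([4, 2])
# During fine-tuning, batches from different tasks update the shared encoder jointly,
# which is what lets the model leverage cross-task data.
```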

BERT and its variations are currently the state of the art on nearly all commonsense benchmark tasks, even exceeding human performance in some cases. According to devlinBERTPretrainingDeep2018, a goal of future work will be to determine whether BERT truly captures the intended semantic phenomena in benchmark datasets. Further, as the BigBird model was able to improve performance on several benchmarks by adding task-specific layers and enabling multi-task learning, BERT may be a bit too task-invariant, potentially missing helpful information that comes from differentiating between specific tasks. It could likely benefit from further investigation into multi-task learning approaches such as BigBird. Table 3 compares the performance of ELMo, GPT, BERT, and BigBird on various benchmarks.

When to fine-tune.

These new pre-trained contextual models are applied to benchmark tasks in different ways. In particular, while ELMo has traditionally been used to generate input features for a separate task-specific model, BERT-based models are typically fine-tuned on various tasks and applied to them directly. Understanding why these choices were made is important for the further development of these models. petersDeepContextualizedWord2018 investigate this difference in training the two models and compare their performance when their outputs are extracted as features for another model versus when they are fine-tuned and used directly on various tasks. Their results show that ELMo's LSTM architecture can, like BERT, be fine-tuned and applied directly to downstream tasks with some success, although this fine-tuning is more difficult to perform on ELMo. Further, performance on sentence-pair classification tasks like MultiNLI [Williams, Nangia,  BowmanWilliams et al.2017] and SICK [Marelli, Menini, Baroni, Bentivogli, Bernardi,  ZamparelliMarelli et al.2014a] is shown to be better when the contextual embeddings generated by ELMo are instead used as input features to a separate task-specific architecture. They infer that this may be because ELMo's LSTM must consider tokens sequentially, rather than comparing all tokens to each other across sentence pairs as BERT's transformer architecture can. BERT's output can also be used as features for a task-specific model with some success, and when used in this way it actually outperforms ELMo on most of the studied tasks, likely for the same reason. It is important to note, however, that performance on sentence similarity tasks like the Microsoft Research Paraphrase Corpus [Dolan  BrockettDolan  Brockett2005] is significantly better when the model is fine-tuned to the task.

4.3 Incorporating External Knowledge

Most recent approaches have relied on the benchmark datasets (e.g., training data) to build models for reasoning and inference. Despite the availability of the knowledge resources discussed in Section 3, few of them have actually been applied to solve the benchmark tasks. WordNet [MillerMiller1995] is perhaps the most widely applied lexical resource, and its word relations are particularly useful for textual entailment problems. WordNet has appeared in earlier approaches throughout the RTE Challenges [Dagan, Glickman,  MagniniDagan et al.2005, Hickl, Bensley, Williams, Roberts, Rink,  ShiHickl et al.2006, Giampiccolo, Dang, Magnini, Dagan,  DolanGiampiccolo et al.2008, IfteneIftene2008, Bentivogli, Clark, Dagan,  GiampiccoloBentivogli et al.2011, Tsuchida  IshikawaTsuchida  Ishikawa2011], and more recently was used by a competitive approach to the 2016 Winograd Schema Challenge [Davis, Morgenstern,  OrtizDavis et al.2018, Trinh  LeTrinh  Le2018]. Well-known and popular common knowledge resources such as DBpedia [Auer, Bizer, Kobilarov, Lehmann, Cyganiak,  IvesAuer et al.2007] and YAGO [Suchanek, Kasneci,  WeikumSuchanek et al.2007] have been used in creating benchmarks [Morgenstern, Davis,  OrtizMorgenstern et al.2016, Choi, He, Iyyer, Yatskar, Yih, Choi, Liang,  ZettlemoyerChoi et al.2018]; however, they have not been directly applied to solve the benchmark tasks.

ConceptNet [Liu  SinghLiu  Singh2004] and Cyc [Lenat  GuhaLenat  Guha1989] are by far the most talked-about commonsense knowledge resources available. However, Cyc does not appear in any of our surveyed approaches, while ConceptNet has been occasionally applied. For example, a neural baseline approach to OpenBookQA [Mihaylov, Clark, Khot,  SabharwalMihaylov et al.2018] uses ConceptNet to retrieve additional common and commonsense knowledge facts not included in the benchmark data. ConceptNet is most often used to create knowledge-enhanced word embeddings. A neural model applied to COPA leverages commonsense knowledge from ConceptNet [Roemmele  GordonRoemmele  Gordon2018] through ConceptNet-based embeddings, generated by applying the word2vec skip-gram model [Mikolov, Chen, Corrado,  DeanMikolov et al.2013] to commonsense knowledge tuples in ConceptNet [Li, Lee-Urban, Johnston,  RiedlLi et al.2013]. A recent approach to the Winograd Schema Challenge by [Liu, Jiang, Ling, Zhu, Wei,  HuLiu et al.2017] uses a similar technique, as does a baseline approach to SWAG [Zellers, Bisk, Schwartz,  ChoiZellers et al.2018]. Though ConceptNet has been demonstrated to be useful for such purposes, still very few state-of-the-art approaches actually use it; rather, they acquire commonsense knowledge relations from benchmark training data only. This raises important questions about how to incorporate external knowledge into modern neural approaches and how to acquire relevant external knowledge for the tasks at hand.
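
A rough sketch of this embedding strategy is shown below: knowledge tuples are linearized into short pseudo-sentences and fed to a skip-gram model, here via the gensim library (assumed to be version 4 or later). The triples are made-up examples for illustration, not actual ConceptNet content, and the tiny corpus is far too small for meaningful embeddings.

```python
# Sketch: knowledge-enhanced embeddings by running skip-gram over linearized triples.
# Assumes gensim >= 4; the triples below are made-up examples, not real ConceptNet data.
from gensim.models import Word2Vec

triples = [
    ("umbrella", "UsedFor", "staying_dry"),
    ("rain", "Causes", "wet_ground"),
    ("umbrella", "AtLocation", "closet"),
    ("coat", "UsedFor", "staying_warm"),
]
# Linearize each (head, relation, tail) tuple into a pseudo-sentence.
pseudo_sentences = [[h, r, t] for h, r, t in triples]

model = Word2Vec(pseudo_sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)
print(model.wv["umbrella"][:5])                     # embedding informed by its relations
print(model.wv.similarity("umbrella", "coat"))      # items sharing relations drift closer
```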

5 Other Related Benchmarks

While this paper intends to cover language understanding tasks for which some external knowledge or advanced reasoning beyond linguistic context is required, many related benchmarks have not been covered. First of all, nearly all language understanding benchmarks developed throughout the last couple decades could benefit from commonsense knowledge and reasoning. Second, as language communication is integral to other perception and reasoning systems, recent years have also seen an increasing number of benchmark tasks that combine language and vision.

Language-related tasks.

Many early corpora for classical NLP tasks such as semantic role labeling, relation extraction, and paraphrase detection may also require commonsense knowledge and reasoning, though this was not emphasized or investigated at the time. For example, in creating the Microsoft Research Paraphrase Corpus, dolanAutomaticallyConstructingCorpus2005 found that the task of annotating text pairs was difficult to streamline because it often required commonsense, suggesting that the paraphrases within the corpus require commonsense knowledge and reasoning to identify. Some such tasks are included in multi-task benchmarks like Inference is Everything [White, Rastogi, Duh,  Van DurmeWhite et al.2017], GLUE [Wang, Singh, Michael, Hill, Levy,  BowmanWang et al.2018], and DNC [Poliak, Haldar, Rudinger, Hu, Pavlick, White,  Van DurmePoliak et al.2018a]. Other examples of related textual benchmarks include QuAC [Choi, He, Iyyer, Yatskar, Yih, Choi, Liang,  ZettlemoyerChoi et al.2018], which relies on conversation discourse for robust contextual question answering but does not require commonsense in the way that the similar CoQA benchmark does [Reddy, Chen,  ManningReddy et al.2018], and a conversation dataset created by xuCommonsenseKnowledgeAware2018 which explores the use of commonsense knowledge from ConceptNet in producing higher-quality and more relevant responses for chatbots.

Besides English, there are also benchmarks in other languages. For example, RTE datasets have been created in Italian (see http://www.evalita.it/2009/tasks/te) and Portuguese (see http://nilc.icmc.usp.br/assin/), and cross-lingual RTE datasets have appeared in several SemEval shared tasks over the years [Negri, Marchetti, Mehdad, Bentivogli,  GiampiccoloNegri et al.2012, Negri, Marchetti, Mehdad, Bentivogli,  GiampiccoloNegri et al.2013, Cer, Diab, Agirre, Lopez-Gazpio,  SpeciaCer et al.2017] to encourage progress in machine translation and content synchronization. There also exist various cross-lingual knowledge resources, including the latest version of ConceptNet [Speer, Chin,  HavasiSpeer et al.2017], which contains relations from several multilingual resources.

Visual benchmarks.

Commonsense plays an important role in integrating language and vision, for example, in grounding language to perception [Gao, Doering, Yang,  ChaiGao et al.2016], language-based justification for action recognition [Yang, Gao, Sadiya,  ChaiYang et al.2018], and visual question answering [Kafle  KananKafle  Kanan2017]. Visual commonsense benchmarks include VQA benchmarks like the original VQA [Agrawal, Lu, Antol, Mitchell, Zitnick, Batra,  ParikhAgrawal et al.2015], similar VQA datasets like Visual7W [Zhu, Groth, Bernstein,  Fei-FeiZhu et al.2016], and datasets with synthetic images, such as CLEVR [Johnson, Hariharan, van der Maaten, Fei-Fei, Zitnick,  GirshickJohnson et al.2017] and the work by suhrCorpusNaturalLanguage2017. They also include the tasks of commonsense action recognition and justification, which are found in the dataset by Fouhey18 and in Visual Commonsense Reasoning (VCR) by zellersRecognitionCognitionVisual2019. These are all image-based, but we are also beginning to see similar video-based datasets, such as Something Something [Goyal, Kahou, Michalski, Materzyńska, Westphal, Kim, Haenel, Fruend, Yianilos, Mueller-Freitag, Hoppe, Thurau, Bax,  MemisevicGoyal et al.2017], which aims to evaluate visual commonsense through over 100,000 videos portraying everyday actions. We have also seen vision-and-language navigation (VLN) tasks such as Room-to-Room (R2R) by andersonVisionandLanguageNavigationInterpreting2018. Such benchmarks are important for promoting progress in physically grounded commonsense knowledge and reasoning.

6 Discussion and Conclusion

The availability of data and computing resources and the rise of new learning and inference methods make this an unprecedentedly exciting time for research on commonsense reasoning and natural language understanding. As the research field moves forward, here are a few things we think are important to pursue in the future.

Among different kinds of knowledge, two types of commonsense knowledge are considered fundamental for human reasoning and decision making: intuitive psychology and intuitive physics. Several benchmarks are geared towards intuitive psychology, e.g., Triangle-COPA [GordonGordon2016], Story Commonsense [Rashkin, Bosselut, Sap, Knight,  ChoiRashkin et al.2018a], and Event2Mind [Rashkin, Sap, Allaway, Smith,  ChoiRashkin et al.2018b]. Reasoning with intuitive physics is scattered across different benchmarks such as bAbI [Weston, Bordes, Chopra, Rush, van Merriënboer, Joulin,  MikolovWeston et al.2016] and SWAG [Zellers, Bisk, Schwartz,  ChoiZellers et al.2018]. Understanding how such commonsense knowledge is developed and acquired in humans, and how it relates to human language production and comprehension, may shed light on computational models for language processing.

In addition, besides benchmark tasks in a written language form as discussed in this paper, it may be worthwhile to also explore new tasks that involve artificial agents (in either a simulated world or the real physical world) which can use language to communicate, perceive, and act. Some examples are interactive task learning [Chai, Gao, She, Yang, Saba-Sadiya,  XuChai et al.2018] and embodied question answering [Das, Datta, Gkioxari, Lee, Parikh,  BatraDas et al.2018]. As commonsense knowledge is so intuitive for humans, it is difficult even for researchers to identify and formalize such knowledge. Working with agents and observing their abilities and limitations in understanding language and grounding language to their own sensorimotor skills will allow researchers to better understand the space of commonsense knowledge and tackle the problem of knowledge acquisition accordingly.

One challenge of the current trend of work is the disconnect between commonsense knowledge resources and the approaches taken to tackle the benchmark tasks. Most approaches, particularly neural approaches, only accrue knowledge or learn models from training data, a method that critics doubt can result in reasoning ability comparable to that of humans [Cambria, Song, Wang,  HussainCambria et al.2011]. Although there exist many knowledge bases designed for commonsense reasoning, very few of them are used directly to solve the benchmark tasks. One likely reason is that these knowledge bases do not cover the kind of knowledge that is required to solve those tasks. This was discovered, for example, of ConceptNet [Liu  SinghLiu  Singh2004] in creating the Event2Mind benchmark [Rashkin, Sap, Allaway, Smith,  ChoiRashkin et al.2018b]. To address this problem, several methods have been proposed for leveraging incomplete knowledge bases. One method mentioned in Section 3 is AnalogySpace [Speer, Havasi,  LiebermanSpeer et al.2008], which uses principal component analysis to smooth over missing commonsense axioms by analogy. Another example is memory comparison networks [Andrade, Bai, Rajendran,  WatanabeAndrade et al.2018], which allow machines to generalize over existing temporal relations in knowledge resources in order to acquire new relations. Future work will need to provide more solutions to handle the long-tail phenomenon [Davis  MarcusDavis  Marcus2015].
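
The smoothing idea behind AnalogySpace can be sketched with a truncated SVD over a concept-by-feature assertion matrix: a low-rank reconstruction assigns nonzero scores to plausible assertions that are missing from the knowledge base. The tiny matrix, concept labels, and rank below are fabricated for illustration and are not drawn from ConceptNet.

```python
# Sketch of AnalogySpace-style smoothing: a low-rank reconstruction of a sparse
# concept-by-feature assertion matrix predicts plausible missing assertions.
# The matrix and labels are fabricated for illustration.
import numpy as np

concepts = ["dog", "cat", "car"]
features = ["IsA/animal", "HasA/tail", "UsedFor/transport"]
A = np.array([
    [1.0, 1.0, 0.0],   # dog
    [1.0, 0.0, 0.0],   # cat -- "HasA/tail" is missing from the knowledge base
    [0.0, 0.0, 1.0],   # car
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                         # keep only the top-k principal components
A_smooth = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

i, j = concepts.index("cat"), features.index("HasA/tail")
print(A_smooth[i, j])   # > 0: the missing assertion is predicted plausible by analogy to "dog"
```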

Another potential avenue to address the disconnection is to jointly develop benchmark tasks and construct knowledge bases. Recently, we are seeing the creation of knowledge graphs geared toward particular tasks, like the ATOMIC knowledge graph by sapATOMICAtlasMachine2019 which expands upon data in the Event2Mind benchmark, and ideally provides the required relations that ConceptNet does not. We are also seeing the creation of benchmarks geared toward particular knowledge graphs, like CommonsenseQA [Talmor, Herzig, Lourie,  BerantTalmor et al.2019], where questions were drawn from subgraphs of ConceptNet, thus encouraging the use of ConceptNet in approaching the benchmark tasks. A tighter coupling of benchmark tasks and knowledge resources will help understand and formalize the scope of knowledge needed, and facilitate the development and evaluation of approaches that can incorporate external knowledge.

As more benchmark tasks become available and performance on these tasks keeps growing, one central question is whether the technologies developed are in fact pushing the state of the art or only learning superficial artifacts from the datasets. A better understanding of the behaviors of these models, especially deep learning models that achieve high performance, is critical. For example, when applied to DNC [Poliak, Haldar, Rudinger, Hu, Pavlick, White,  Van DurmePoliak et al.2018a], the multi-task two-way entailment benchmark, InferSent achieves high accuracy on many of the benchmark's tasks (sometimes over 90%) by training and testing on only the hypothesis texts from the benchmark rather than both the context and hypothesis sentences. Since humans require both the context and hypothesis to perform textual entailment, this suggests that the model is learning obscure statistical biases in the data rather than performing actual reasoning. As another example, jiaAdversarialExamplesEvaluating2017 propose an adversarial evaluation scheme for SQuAD [Rajpurkar, Zhang, Lopyrev,  LiangRajpurkar et al.2016] which randomly inserts distractor sentences into passages without changing their meaning, and show that the performance of high-scoring models on the benchmark drops significantly. marasovicNLPGeneralizationProblem2018 highlights several recent works showing that high-performing modern NLP systems can break down due to small, inconsequential changes in inputs. BERT may suffer from a similar issue: its deep, bidirectional architecture makes its reasoning processes difficult to interpret, so it is often unclear why the model reaches certain conclusions, and whether it is capturing semantic phenomena or again learning statistical biases. According to the creators of BERT, this will be a subject of future research.

Lastly, as in most deep learning applications, a significant amount of effort has been spent on tuning parameters to improve performance. We have also seen in other AI subfields that more sophisticated models may lead to better performance on a particular benchmark, but that simpler models with better parameter tuning may later achieve comparable results. For example, in image classification, a study by brendelApproximatingCNNsBagoflocalFeatures2019 shows that nearly all of the improvement of recent deep neural networks over earlier bag-of-features classifiers comes from better fine-tuning rather than from improvements in decision processes. NLP models may be similarly vulnerable. More effort on theoretical understanding and motivation for model design and parameter tuning would be beneficial.

References

  • [Agrawal, Lu, Antol, Mitchell, Zitnick, Batra,  ParikhAgrawal et al.2015] Agrawal, A., Lu, J., Antol, S., Mitchell, M., Zitnick, C. L., Batra, D.,  Parikh, D. 2015. VQA: Visual Question Answering  In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV 2015), Santiago, Chile. IEEE.
  • [Agrawal, Lu, Antol, Mitchell, Zitnick, Parikh,  BatraAgrawal et al.2017] Agrawal, A., Lu, J., Antol, S., Mitchell, M., Zitnick, C. L., Parikh, D.,  Batra, D. 2017. VQA: Visual Question Answering  Int. J. Comput. Vision, 123(1), 4–31.
  • [Anderson, Wu, Teney, Bruce, Johnson, Sünderhauf, Reid, Gould,  van den HengelAnderson et al.2018] Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., Reid, I., Gould, S.,  van den Hengel, A. 2018. Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments  In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA. IEEE.
  • [Andrade, Bai, Rajendran,  WatanabeAndrade et al.2018] Andrade, D., Bai, B., Rajendran, R.,  Watanabe, Y. 2018. Leveraging knowledge bases for future prediction with memory comparison networks  AI Communications.
  • [Auer, Bizer, Kobilarov, Lehmann, Cyganiak,  IvesAuer et al.2007] Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R.,  Ives, Z. 2007. DBpedia: A Nucleus for a Web of Open Data  In Aberer, K., Choi, K.-S., Noy, N., Allemang, D., Lee, K.-I., Nixon, L., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G.,  Cudré-Mauroux, P., Proceedings of the Semantic Web Challenge 2007 Co-Located with ISWC 2007 + ASWC 2007, Lecture Notes in Computer Science,  722–735, Busan, Korea. Springer Berlin Heidelberg.
  • [Bahdanau, Cho,  BengioBahdanau et al.2015] Bahdanau, D., Cho, K.,  Bengio, Y. 2015. Neural Machine Translation by Jointly Learning to Align and Translate  In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015).
  • [Banarescu, Bonial, Cai, Georgescu, Griffitt, Hermjakob, Knight, Koehn, Palmer,  SchneiderBanarescu et al.2013] Banarescu, L., Bonial, C., Cai, S., Georgescu, M., Griffitt, K., Hermjakob, U., Knight, K., Koehn, P., Palmer, M.,  Schneider, N. 2013. Abstract Meaning Representation for Sembanking  In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse,  178–186, Sofia, Bulgaria. Association for Computational Linguistics.
  • [Bar-Haim, Dagan, Dolan, Ferro, Giampiccolo, Magnini,  SzpektorBar-Haim et al.2006] Bar-Haim, R., Dagan, I., Dolan, B., Ferro, L., Giampiccolo, D., Magnini, B.,  Szpektor, I. 2006. The Second PASCAL Recognising Textual Entailment Challenge  In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment, Venice, Italy.
  • [Bentivogli, Clark, Dagan,  GiampiccoloBentivogli et al.2010] Bentivogli, L., Clark, P., Dagan, I.,  Giampiccolo, D. 2010. The Sixth PASCAL Recognizing Textual Entailment Challenge  In Proceedings of the Third Text Analysis Conference (TAC 2010), Gaithersburg, MD, USA. National Institute of Standards and Technology.
  • [Bentivogli, Clark, Dagan,  GiampiccoloBentivogli et al.2011] Bentivogli, L., Clark, P., Dagan, I.,  Giampiccolo, D. 2011. The Seventh PASCAL Recognizing Textual Entailment Challenge  In Proceedings of the Fourth Text Analysis Conference (TAC 2011), Gaithersburg, MD, USA. National Institute of Standards and Technology.
  • [Bentivogli, Dagan, Dang, Giampiccolo,  MagniniBentivogli et al.2009] Bentivogli, L., Dagan, I., Dang, H. T., Giampiccolo, D.,  Magnini, B. 2009. The Fifth PASCAL Recognizing Textual Entailment Challenge  In Proceedings of the Second Text Analysis Conference (TAC 2009),  15, Gaithersburg, MD, USA. National Institute of Standards and Technology.
  • [Bollacker, Evans, Paritosh, Sturge,  TaylorBollacker et al.2008] Bollacker, K., Evans, C., Paritosh, P., Sturge, T.,  Taylor, J. 2008. Freebase: A Collaboratively Created Graph Database for Structuring Human Knowledge  In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08,  1247–1250, New York, NY, USA. ACM.
  • [Bos, Basile, Evang, Venhuizen,  BjervaBos et al.2017] Bos, J., Basile, V., Evang, K., Venhuizen, N. J.,  Bjerva, J. 2017. The Groningen Meaning Bank  In Handbook of Linguistic Annotation,  463–496. Springer.
  • [Bowman, Angeli, Potts,  ManningBowman et al.2015] Bowman, S. R., Angeli, G., Potts, C.,  Manning, C. D. 2015. A large annotated corpus for learning natural language inference  In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015),  632–642, Lisbon, Portugal. Association for Computational Linguistics.
  • [Brendel  BethgeBrendel  Bethge2019] Brendel, W.  Bethge, M. 2019. Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet  In Proceedings of the 7th International Conference on Learning Representations (ICLR 2019), New Orleans, LA, USA.
  • [Cambria, Olsher,  RajagopalCambria et al.2014a] Cambria, E., Olsher, D.,  Rajagopal, D. 2014a. SenticNet 3: A Common and Common-Sense Knowledge Base for Cognition-Driven Sentiment Analysis  In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence (AAAI-14), Québec City, QC, Canada. AAAI Press.
  • [Cambria, Song, Wang,  HowardCambria et al.2014b] Cambria, E., Song, Y., Wang, H.,  Howard, N. 2014b. Semantic Multidimensional Scaling for Open- Domain Sentiment Analysis  IEEE Intelligent Systems, 29(2), 44–51.
  • [Cambria, Song, Wang,  HussainCambria et al.2011] Cambria, E., Song, Y., Wang, H.,  Hussain, A. 2011. Isanette: A Common and Common Sense Knowledge Base for Opinion Mining  In 2011 IEEE 11th International Conference on Data Mining Workshops,  315–322, Vancouver, BC, Canada. IEEE.
  • [Cambria, Speer, Havasi,  HussainCambria et al.2010] Cambria, E., Speer, R., Havasi, C.,  Hussain, A. 2010. SenticNet: A Publicly Available Semantic Resource for Opinion Mining  In AAAI Fall Symposium on Commonsense Knowledge, Menlo Park, CA, USA. AAAI Press.
  • [Carlson, Betteridge, Kisiel, Settles, Jr,  MitchellCarlson et al.2010] Carlson, A., Betteridge, J., Kisiel, B., Settles, B., Jr, E. R. H.,  Mitchell, T. M. 2010. Toward an Architecture for Never-Ending Language Learning  In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10), Atlanta, GA, USA. AAAI Press.
  • [Cer, Diab, Agirre, Lopez-Gazpio,  SpeciaCer et al.2017] Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I.,  Specia, L. 2017. SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation  In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017),  1–14, Vancouver, BC, Canada. Association for Computational Linguistics.
  • [Chai, Gao, She, Yang, Saba-Sadiya,  XuChai et al.2018] Chai, J. Y., Gao, Q., She, L., Yang, S., Saba-Sadiya, S.,  Xu, G. 2018. Language to Action: Towards Interactive Task Learning with Physical Agents  In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI 2018),  2–9, Stockholm, Sweden. International Joint Conferences on Artificial Intelligence Organization.
  • [Chambers  JurafskyChambers  Jurafsky2008] Chambers, N.  Jurafsky, D. 2008. Unsupervised Learning of Narrative Event Chains  In Proceedings of ACL-08: HLT, Columbus, OH, USA. Association for Computational Linguistics.
  • [Chelba, Mikolov, Schuster, Ge, Brants, Koehn,  RobinsonChelba et al.2014] Chelba, C., Mikolov, T., Schuster, M., Ge, Q., Brants, T., Koehn, P.,  Robinson, T. 2014. One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling  In 15th Annual Conference of the International Speech Communication Association (INTERSPEECH 2014), Singapore, Singapore. ISCA Archive.
  • [Chen, Zhu, Ling, Wei, Jiang,  InkpenChen et al.2017] Chen, Q., Zhu, X., Ling, Z., Wei, S., Jiang, H.,  Inkpen, D. 2017. Enhanced LSTM for Natural Language Inference  In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), Vancouver, BC, Canada. Association for Computational Linguistics.
  • [Chen, Cui, Ma, Wang, Liu,  HuChen et al.2018] Chen, Z., Cui, Y., Ma, W., Wang, S., Liu, T.,  Hu, G. 2018. HFL-RC System at SemEval-2018 Task 11: Hybrid Multi-Aspects Model for Commonsense Reading Comprehension  arXiv: 1803.05655.
  • [ChklovskiChklovski2003] Chklovski, T. 2003. Learner: A System for Acquiring Commonsense Knowledge by Analogy  In Proceedings of the 2nd International Conference on Knowledge Capture (K-CAP ’03), K-CAP ’03,  4–12, New York, NY, USA. ACM.
  • [Chklovski  PantelChklovski  Pantel2004] Chklovski, T.  Pantel, P. 2004. VerbOcean: Mining the Web for Fine-Grained Semantic Verb Relations  In Lin, D.  Wu, D., Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP 2004),  33–40, Barcelona, Spain. Association for Computational Linguistics.
  • [Choi, He, Iyyer, Yatskar, Yih, Choi, Liang,  ZettlemoyerChoi et al.2018] Choi, E., He, H., Iyyer, M., Yatskar, M., Yih, W.-t., Choi, Y., Liang, P.,  Zettlemoyer, L. 2018. QuAC : Question Answering in Context  In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium. Association for Computational Linguistics.
  • [Clark  GardnerClark  Gardner2018] Clark, C.  Gardner, M. 2018. Simple and Effective Multi-Paragraph Reading Comprehension  In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018),  845–855, Melbourne, Australia. Association for Computational Linguistics.
  • [Clark, Cowhey, Etzioni, Khot, Sabharwal, Schoenick,  TafjordClark et al.2018] Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C.,  Tafjord, O. 2018. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge  arXiv: 1803.05457.
  • [Dagan, Glickman,  MagniniDagan et al.2005] Dagan, I., Glickman, O.,  Magnini, B. 2005. The PASCAL Recognising Textual Entailment Challenge  In Quiñonero-Candela, J., Dagan, I., Magnini, B.,  d’Alché-Buc, F., Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Tectual Entailment,  3944,  177–190. Springer Berlin Heidelberg, Berlin, Heidelberg.
  • [Das, Datta, Gkioxari, Lee, Parikh,  BatraDas et al.2018] Das, A., Datta, S., Gkioxari, G., Lee, S., Parikh, D.,  Batra, D. 2018. Embodied Question Answering  In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA. IEEE.
  • [Das, Munkhdalai, Yuan, Trischler,  McCallumDas et al.2019] Das, R., Munkhdalai, T., Yuan, X., Trischler, A.,  McCallum, A. 2019. Building Dynamic Knowledge Graphs from Text using Machine Reading Comprehension  In Proceedings of the 7th International Conference on Learning Representations (ICLR 2019), New Orleans, LA, USA.
  • [DavisDavis2017] Davis, E. 2017. Logical Formalizations of Commonsense Reasoning: A Survey  Journal of Artificial Intelligence Research, 59, 651–723.
  • [Davis  MarcusDavis  Marcus2015] Davis, E.  Marcus, G. 2015. Commonsense Reasoning and Commonsense Knowledge in Artificial Intelligence  Communications of the ACM, 58(9), 92–103.
  • [Davis, Morgenstern,  OrtizDavis et al.2018] Davis, E., Morgenstern, L.,  Ortiz, C. 2018. The Winograd Schema Challenge  https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS.html.
  • [Davis, Morgenstern,  OrtizDavis et al.2017] Davis, E., Morgenstern, L.,  Ortiz, C. L. 2017. The First Winograd Schema Challenge at IJCAI-16  AI Magazine, 38(3), 97–98.
  • [Devlin, Chang, Lee,  ToutanovaDevlin et al.2018] Devlin, J., Chang, M.-W., Lee, K.,  Toutanova, K. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding  arXiv: 1810.04805.
  • [Dolan  BrockettDolan  Brockett2005] Dolan, W. B.  Brockett, C. 2005. Automatically Constructing a Corpus of Sentential Paraphrases  In Proceedings of the Third International Workshop on Paraphrasing (IWP2005), Jeju Island, Korea.
  • [Dzikovska, Nielsen, Brew, Leacock, Giampiccolo, Bentivogli, Clark, Dagan,  DangDzikovska et al.2013] Dzikovska, M. O., Nielsen, R. D., Brew, C., Leacock, C., Giampiccolo, D., Bentivogli, L., Clark, P., Dagan, I.,  Dang, H. T. 2013. SemEval-2013 Task 7: The Joint Student Response Analysis and 8th Recognizing Textual Entailment Challenge  In Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval-2013),  13, Atlanta, GA, USA. Association for Computational Linguistics.
  • [Etzioni, Banko, Soderland,  WeldEtzioni et al.2008] Etzioni, O., Banko, M., Soderland, S.,  Weld, D. S. 2008. Open information extraction from the web  Communications of the ACM, 51(12), 68.
  • [Etzioni, Cafarella, Downey, Popescu, Shaked, Soderland, Weld,  YatesEtzioni et al.2005] Etzioni, O., Cafarella, M., Downey, D., Popescu, A.-M., Shaked, T., Soderland, S., Weld, D. S.,  Yates, A. 2005. Unsupervised named-entity extraction from the Web: An experimental study  Artificial Intelligence, 165(1), 91–134.
  • [FellbaumFellbaum1999] Fellbaum, C. 1999. WordNet: An Electronic Lexical Database (2nd printing). Language, Speech, and Communication. A Bradford Book, Cambridge, MA.
  • [Fillmore, Baker,  SatoFillmore et al.2002] Fillmore, C. J., Baker, C. F.,  Sato, H. 2002. The FrameNet Database and Software Tools.  In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02), Las Palmas, Canary Islands - Spain. European Language Resources Association (ELRA).
  • [Fouhey, Kuo, Efros,  MalikFouhey et al.2018] Fouhey, D. F., Kuo, W., Efros, A. A.,  Malik, J. 2018. From Lifestyle VLOGs to Everyday Interactions  In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA. IEEE.
  • [Gabrilovich, Ringgaard,  SubramanyaGabrilovich et al.2013] Gabrilovich, E., Ringgaard, M.,  Subramanya, A. 2013. FACC1: Freebase annotation of ClueWeb corpora, Version 1 (Release date 2013-06-26, Format version 1, Correction level 0)  http://lemurproject.org/clueweb09/FACC1/.
  • [Gao, Doering, Yang,  ChaiGao et al.2016] Gao, Q., Doering, M., Yang, S.,  Chai, J. 2016. Physical Causality of Action Verbs in Grounded Language Understanding  In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016),  1814–1824, Berlin, Germany. Association for Computational Linguistics.
  • [Giampiccolo, Dang, Magnini, Dagan,  DolanGiampiccolo et al.2008] Giampiccolo, D., Dang, H. T., Magnini, B., Dagan, I.,  Dolan, B. 2008. The Fourth PASCAL Recognizing Textual Entailment Challenge  In Proceedings of the First Text Analysis Conference (TAC 2008),  9, Gaithersburg, MD, USA. National Institute of Standards and Technology.
  • [Giampiccolo, Magnini, Dagan,  DolanGiampiccolo et al.2007] Giampiccolo, D., Magnini, B., Dagan, I.,  Dolan, B. 2007. The Third PASCAL Recognizing Textual Entailment Challenge  In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, RTE ’07,  1–9, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • [GlickmanGlickman2006] Glickman, O. 2006. Applied Textual Entailment. Ph.D. Thesis, Bar Ilan University.
  • [GordonGordon2016] Gordon, A. S. 2016. Commonsense Interpretation of Triangle Behavior  In Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA. AAAI Press.
  • [Goyal, Kahou, Michalski, Materzyńska, Westphal, Kim, Haenel, Fruend, Yianilos, Mueller-Freitag, Hoppe, Thurau, Bax,  MemisevicGoyal et al.2017] Goyal, R., Kahou, S. E., Michalski, V., Materzyńska, J., Westphal, S., Kim, H., Haenel, V., Fruend, I., Yianilos, P., Mueller-Freitag, M., Hoppe, F., Thurau, C., Bax, I.,  Memisevic, R. 2017. The "something something" video database for learning and evaluating visual common sense  In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV 2017).
  • [Gururangan, Swayamdipta, Levy, Schwartz, Bowman,  SmithGururangan et al.2018] Gururangan, S., Swayamdipta, S., Levy, O., Schwartz, R., Bowman, S.,  Smith, N. A. 2018. Annotation Artifacts in Natural Language Inference Data  In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2018),  107–112, New Orleans, LA, USA. Association for Computational Linguistics.
  • [Hartshorne, Bonial,  PalmerHartshorne et al.2013] Hartshorne, J. K., Bonial, C.,  Palmer, M. 2013. The VerbCorner project: Toward an empirically-based semantic decomposition of verbs  In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013),  1438–1442, Seattle, WA, USA. Association for Computational Linguistics.
  • [Havasi, Speer,  AlonsoHavasi et al.2007] Havasi, C., Speer, R.,  Alonso, J. B. 2007. ConceptNet 3: A flexible, multilingual semantic network for common sense knowledge  In Recent Advances in Natural Language Processing (RANLP-07), Borovets, Bulgaria. Association for Computational Linguistics.
  • [Heilbron, Escorcia, Ghanem,  NieblesHeilbron et al.2015] Heilbron, F. C., Escorcia, V., Ghanem, B.,  Niebles, J. C. 2015. ActivityNet: A large-scale video benchmark for human activity understanding  In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015),  961–970, Boston, MA, USA. IEEE.
  • [Henaff, Weston, Szlam, Bordes,  LeCunHenaff et al.2017] Henaff, M., Weston, J., Szlam, A., Bordes, A.,  LeCun, Y. 2017. Tracking the World State with Recurrent Entity Networks  In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017),  14, Palais des Congrès Neptune, Toulon, France.
  • [Hickl, Bensley, Williams, Roberts, Rink,  ShiHickl et al.2006] Hickl, A., Bensley, J., Williams, J., Roberts, K., Rink, B.,  Shi, Y. 2006. Recognizing textual entailment with LCC’s GROUNDHOG system  In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment, Venice, Italy.
  • [Hill, Bordes, Chopra,  WestonHill et al.2015] Hill, F., Bordes, A., Chopra, S.,  Weston, J. 2015. The Goldilocks Principle: Reading Children’s Books with Explicit Memory Representations  arXiv: 1511.02301.
  • [Hoffart, Suchanek, Berberich,  WeikumHoffart et al.2012] Hoffart, J., Suchanek, F. M., Berberich, K.,  Weikum, G. 2012. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia  Artificial Intelligence, 194, 28–61.
  • [Huang, Liu, van der Maaten,  WeinbergerHuang et al.2016] Huang, G., Liu, Z., van der Maaten, L.,  Weinberger, K. Q. 2016. Densely Connected Convolutional Networks  In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA. IEEE.
  • [IfteneIftene2008] Iftene, A. 2008. UAIC Participation at RTE4  In Proceedings of the First Text Analysis Conference (TAC 2008), Gaithersburg, MD, USA. National Institute of Standards and Technology.
  • [Iyer, Dandekar,  CsernaiIyer et al.2017] Iyer, S., Dandekar, N.,  Csernai, K. 2017. First Quora Dataset Release: Question Pairs  https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs.
  • [Jia  LiangJia  Liang2017] Jia, R.  Liang, P. 2017. Adversarial Examples for Evaluating Reading Comprehension Systems  In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017),  2021–2031, Copenhagen, Denmark. Association for Computational Linguistics.
  • [Johnson, Hariharan, van der Maaten, Fei-Fei, Zitnick,  GirshickJohnson et al.2017] Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Zitnick, C. L.,  Girshick, R. 2017. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning  In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017). IEEE.
  • [Kafle  KananKafle  Kanan2017] Kafle, K.  Kanan, C. 2017. Visual Question Answering: Datasets, Algorithms, and Future Challenges  Computer Vision and Image Understanding, 163, 3–20.
  • [Khashabi, Chaturvedi, Roth, Upadhyay,  RothKhashabi et al.2018] Khashabi, D., Chaturvedi, S., Roth, M., Upadhyay, S.,  Roth, D. 2018. Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences  In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2018), New Orleans, LA, USA. Association for Computational Linguistics.
  • [Khot, Sabharwal,  ClarkKhot et al.2018] Khot, T., Sabharwal, A.,  Clark, P. 2018. SciTail: A Textual Entailment Dataset from Science Question Answering  In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18),  9, New Orleans, LA, USA. AAAI Press.
  • [Kim, Hong, Kang,  KwakKim et al.2019] Kim, S., Hong, J.-H., Kang, I.,  Kwak, N. 2019. Semantic Sentence Matching with Densely-connected Recurrent and Co-attentive Information  In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19), Honolulu, HI, USA. AAAI Press.
  • [Kingsbury, Palmer,  MarcusKingsbury et al.2002] Kingsbury, P., Palmer, M.,  Marcus, M. 2002. Adding Semantic Annotation to the Penn TreeBank  In Proceedings of the Second International Conference on Human Language Technology Research (HLT ’02),  5, San Diego, CA, USA. Morgan Kaufmann Publishers Inc.
  • [Kotzias, Denil, De Freitas,  SmythKotzias et al.2015] Kotzias, D., Denil, M., De Freitas, N.,  Smyth, P. 2015. From Group to Individual Labels Using Deep Features  In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,  597–606, Sydney, Australia. ACM.
  • [Lai  HockenmaierLai  Hockenmaier2014] Lai, A.  Hockenmaier, J. 2014. Illinois-LH: A Denotational and Distributional Approach to Semantics  In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval-2014),  329–334, Dublin, Ireland. Association for Computational Linguistics and Dublin City University.
  • [Lee, Artzi, Choi,  ZettlemoyerLee et al.2015] Lee, K., Artzi, Y., Choi, Y.,  Zettlemoyer, L. 2015. Event detection and factuality assessment with non-expert supervision  In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015),  1643–1648, Lisbon, Portugal. Association for Computational Linguistics.
  • [Lenat  GuhaLenat  Guha1989] Lenat, D. B.  Guha, R. V. 1989. Building Large Knowledge-Based Systems; Representation and Inference in the Cyc Project (1st ). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.
  • [LevesqueLevesque2011] Levesque, H. J. 2011. The Winograd Schema Challenge  In AAAI Spring Symposium on Logical Formalizations of Commonsense Reasoning, Stanford, CA, USA. AAAI Press.
  • [Levesque, Davis,  MorgensternLevesque et al.2012] Levesque, H. J., Davis, E.,  Morgenstern, L. 2012. The Winograd Schema Challenge  In Proceedings of the Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning (KR2012), Rome, Italy. AAAI Press.
  • [LevinLevin1993] Levin, B. 1993. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press, Chicago.
  • [Li, Lee-Urban, Johnston,  RiedlLi et al.2013] Li, B., Lee-Urban, S., Johnston, G.,  Riedl, M. O. 2013. Story Generation with Crowdsourced Plot Graphs  In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence (AAAI-13), AAAI’13,  598–604, Bellevue, Washington. AAAI Press.
  • [LinLin2004] Lin, C.-Y. 2004. ROUGE: A Package for Automatic Evaluation of Summaries  In Moens, M.-F.  Szpakowicz, S., Text Summarization Branches Out: Proceedings of the ACL-04 Workshop,  74–81, Barcelona, Spain. Association for Computational Linguistics.
  • [Liu  SinghLiu  Singh2004] Liu, H.  Singh, P. 2004. ConceptNet — A Practical Commonsense Reasoning Tool-Kit  BT Technology Journal, 22(4), 211–226.
  • [Liu, Jiang, Ling, Zhu, Wei,  HuLiu et al.2017] Liu, Q., Jiang, H., Ling, Z.-H., Zhu, X., Wei, S.,  Hu, Y. 2017. Combing Context and Commonsense Knowledge Through Neural Networks for Solving Winograd Schema Problems  In 2017 AAAI Spring Symposium Series.
  • [Liu, He, Chen,  GaoLiu et al.2019] Liu, X., He, P., Chen, W.,  Gao, J. 2019. Multi-Task Deep Neural Networks for Natural Language Understanding  arXiv: 1901.11504.
  • [Mahdisoltani, Biega,  SuchanekMahdisoltani et al.2013] Mahdisoltani, F., Biega, J.,  Suchanek, F. M. 2013. YAGO3: A Knowledge Base from Multilingual Wikipedias  In Proceedings of the 6th Biennial Conference on Innovative Data Systems Research (CIDR 2013), Asilomar, CA, USA.
  • [Manjunatha, Saini,  DavisManjunatha et al.2018] Manjunatha, V., Saini, N.,  Davis, L. S. 2018. Explicit Bias Discovery in Visual Question Answering Models  arXiv: 1811.07789.
  • [Manning  HudsonManning  Hudson2018] Manning, C.  Hudson, D. 2018. Towards real-world visual reasoning.
  • [MarasovićMarasović2018] Marasović, A. 2018. NLP’s generalization problem, and how researchers are tackling it  The Gradient.
  • [Marcus, Santorini,  MarcinkiewiczMarcus et al.1993] Marcus, M. P., Santorini, B.,  Marcinkiewicz, M. A. 1993. Building a Large Annotated Corpus of English: The Penn Treebank  Computational Linguistics, 19(2), 313–330.
  • [Marelli, Menini, Baroni, Bentivogli, Bernardi,  ZamparelliMarelli et al.2014a] Marelli, M., Menini, S., Baroni, M., Bentivogli, L., Bernardi, R.,  Zamparelli, R. 2014a. A SICK cure for the evaluation of compositional distributional semantic models  In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), Reykjavik, Iceland. European Language Resources Association (ELRA).
  • [Marelli, Bentivogli, Baroni, Bernardi, Menini,  ZamparelliMarelli et al.2014b] Marelli, M., Bentivogli, L., Baroni, M., Bernardi, R., Menini, S.,  Zamparelli, R. 2014b. SemEval-2014 Task 1: Evaluation of Compositional Distributional Semantic Models on Full Sentences through Semantic Relatedness and Textual Entailment  In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval-2014), Dublin, Ireland. Association for Computational Linguistics.
  • [MaslowMaslow1943] Maslow, A. H. 1943. A theory of human motivation  Psychological Review, 50(4), 370–396.
  • [Medelyan  LeggMedelyan  Legg2008] Medelyan, O.  Legg, C. 2008. Integrating Cyc and Wikipedia: Folksonomy meets rigorously defined common-sense  In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (AAAI-08), Chicago, IL, USA. AAAI Press.
  • [Mihaylov, Clark, Khot,  SabharwalMihaylov et al.2018] Mihaylov, T., Clark, P., Khot, T.,  Sabharwal, A. 2018. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering  In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018),  2381–2391, Brussels, Belgium. Association for Computational Linguistics.
  • [Mikolov, Chen, Corrado,  DeanMikolov et al.2013] Mikolov, T., Chen, K., Corrado, G.,  Dean, J. 2013. Efficient Estimation of Word Representations in Vector Space  In Proceedings of the 1st International Conference on Learning Representations (ICLR 2013), Scottsdale, AZ, USA.
  • [MillerMiller1995] Miller, G. A. 1995. WordNet: A Lexical Database for English  Communications of the ACM, 38(11), 39–41.
  • [Miller, Hempelmann,  GurevychMiller et al.2017] Miller, T., Hempelmann, C.,  Gurevych, I. 2017. SemEval-2017 Task 7: Detection and Interpretation of English Puns  In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017),  58–68, Vancouver, Canada. Association for Computational Linguistics.
  • [Miltsakaki, Prasad, Joshi,  WebberMiltsakaki et al.2004] Miltsakaki, E., Prasad, R., Joshi, A.,  Webber, B. 2004. The Penn Discourse Treebank  In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC-2004),  4, Lisbon, Portugal. Evaluation and Language Resources Distribution Agency.
  • [Minard, Speranza, Urizar, Altuna, van Erp, Schoen,  van SonMinard et al.2016] Minard, A.-L., Speranza, M., Urizar, R., Altuna, B., van Erp, M., Schoen, A.,  van Son, C. 2016. MEANTIME, the NewsReader Multilingual Event and Time Corpus  In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC-2016), Portorož, Slovenia. European Language Resources Association (ELRA).
  • [Mishra, Huang, Tandon, Yih,  ClarkMishra et al.2018] Mishra, B. D., Huang, L., Tandon, N., Yih, W.-t.,  Clark, P. 2018. Tracking State Changes in Procedural Text: A Challenge Dataset and Models for Process Paragraph Comprehension  In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2018), New Orleans, LA, USA. Association for Computational Linguistics.
  • [MorgensternMorgenstern2016] Morgenstern, L. 2016. Pronoun Disambiguation Problems  http://commonsensereasoning.org/disambiguation.html.
  • [Morgenstern, Davis,  OrtizMorgenstern et al.2016] Morgenstern, L., Davis, E.,  Ortiz, C. L. 2016. Planning, Executing, and Evaluating the Winograd Schema Challenge  AI Magazine, 37(1), 50–54.
  • [Morgenstern  OrtizMorgenstern  Ortiz2015] Morgenstern, L.  Ortiz, C. 2015. The Winograd Schema Challenge  In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI-15), Austin, TX, USA. AAAI Press.
  • [Mostafazadeh, Chambers, He, Parikh, Batra, Vanderwende, Kohli,  AllenMostafazadeh et al.2016] Mostafazadeh, N., Chambers, N., He, X., Parikh, D., Batra, D., Vanderwende, L., Kohli, P.,  Allen, J. 2016. A Corpus and Cloze Evaluation Framework for Deeper Understanding of Commonsense Stories  In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2016), San Diego, CA, USA. Association for Computational Linguistics.
  • [Negri, Marchetti, Mehdad, Bentivogli,  GiampiccoloNegri et al.2012] Negri, M., Marchetti, A., Mehdad, Y., Bentivogli, L.,  Giampiccolo, D. 2012. Semeval-2012 Task 8: Cross-lingual Textual Entailment for Content Synchronization  In Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval-2012),  399–407, Montréal, Canada. Association for Computational Linguistics.
  • [Negri, Marchetti, Mehdad, Bentivogli,  GiampiccoloNegri et al.2013] Negri, M., Marchetti, A., Mehdad, Y., Bentivogli, L.,  Giampiccolo, D. 2013. Semeval-2013 Task 8: Cross-lingual Textual Entailment for Content Synchronization  In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval-2013),  25–33, Atlanta, Georgia, USA. Association for Computational Linguistics.
  • [OlsherOlsher2014] Olsher, D. 2014. Semantically-based priors and nuanced knowledge core for Big Data, Social AI, and language understanding  Neural Networks, 58, 131–147.
  • [OrtizOrtiz2016] Ortiz, C. 2016. Why We Need a Physically Embodied Turing Test and What It Might Look Like  AI Magazine, 37(1), 55–62.
  • [Ostermann, Modi, Roth, Thater,  PinkalOstermann et al.2018] Ostermann, S., Modi, A., Roth, M., Thater, S.,  Pinkal, M. 2018. MCScript: A Novel Dataset for Assessing Machine Comprehension Using Script Knowledge  In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), Miyazaki, Japan. European Language Resources Association (ELRA).
  • [Papineni, Roukos, Ward,  ZhuPapineni et al.2002] Papineni, K., Roukos, S., Ward, T.,  Zhu, W.-J. 2002. Bleu: A Method for Automatic Evaluation of Machine Translation  In Proceedings of 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002),  311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
  • [PaulheimPaulheim2018] Paulheim, H. 2018. How much is a Triple? Estimating the Cost of Knowledge Graph Creation  In Proceedings of the 17th International Semantic Web Conference (ISWC 2018), Monterey, CA, USA. Springer.
  • [Pennington, Socher,  ManningPennington et al.2014] Pennington, J., Socher, R.,  Manning, C. 2014. Glove: Global Vectors for Word Representation  In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014),  1532–1543, Doha, Qatar. Association for Computational Linguistics.
  • [Peters, Neumann, Iyyer, Gardner, Clark, Lee,  ZettlemoyerPeters et al.2018] Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K.,  Zettlemoyer, L. 2018. Deep Contextualized Word Representations  In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2018),  2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
  • [PlutchikPlutchik1980] Plutchik, R. 1980. A general psychoevolutionary theory of emotion  In Theories of Emotion,  3–31. Academic Press.
  • [PohlPohl2012] Pohl, A. 2012. Classifying the Wikipedia Articles into the OpenCyc Taxonomy  In Proceedings of the Web of Linked Entities Workshop in Conjunction with the 11th International Semantic Web Conference (ISWC 2012), Boston, MA, USA.
  • [Poliak, Haldar, Rudinger, Hu, Pavlick, White,  Van DurmePoliak et al.2018a] Poliak, A., Haldar, A., Rudinger, R., Hu, J. E., Pavlick, E., White, A. S.,  Van Durme, B. 2018a. Collecting Diverse Natural Language Inference Problems for Sentence Representation Evaluation  In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), Brussels, Belgium. Association for Computational Linguistics.
  • [Poliak, Naradowsky, Haldar, Rudinger,  Van DurmePoliak et al.2018b] Poliak, A., Naradowsky, J., Haldar, A., Rudinger, R.,  Van Durme, B. 2018b. Hypothesis Only Baselines in Natural Language Inference  In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics,  180–191, New Orleans, LA, USA. Association for Computational Linguistics.
  • [Ponzetto  StrubePonzetto  Strube2007] Ponzetto, S. P.  Strube, M. 2007. Deriving a Large Scale Taxonomy from Wikipedia  In Proceedings of the 22nd National Conference on Artificial Intelligence, AAAI’07,  1440–1445, Vancouver, BC, Canada. AAAI Press.
  • [Pradhan, Hovy, Marcus, Palmer, Ramshaw,  WeischedelPradhan et al.2007] Pradhan, S. S., Hovy, E., Marcus, M., Palmer, M., Ramshaw, L.,  Weischedel, R. 2007. OntoNotes: A Unified Relational Semantic Representation  In International Conference on Semantic Computing (ICSC 2007),  517–526.
  • [Radford, Narasimhan, Salimans,  SutskeverRadford et al.2018] Radford, A., Narasimhan, K., Salimans, T.,  Sutskever, I. 2018. Improving Language Understanding with Unsupervised Learning.
  • [Radford, Wu, Child, Luan, Amodei,  SutskeverRadford et al.2019] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D.,  Sutskever, I. 2019. Language Models are Unsupervised Multitask Learners  https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf.
  • [Rahman  NgRahman  Ng2012] Rahman, A.  Ng, V. 2012. Resolving Complex Cases of Definite Pronouns: The Winograd Schema Challenge  In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2012),  777–789, Jeju Island, Korea. Association for Computational Linguistics.
  • [Raina, Ng,  ManningRaina et al.2005] Raina, R., Ng, A. Y.,  Manning, C. D. 2005. Robust textual inference via learning and abductive reasoning  In Proceedings of the Twentieth National Conference on Artificial Intelligence (AAAI-05),  1099–1105, Pittsburgh, PA, USA. AAAI Press.
  • [Rajpurkar, Jia,  LiangRajpurkar et al.2018] Rajpurkar, P., Jia, R.,  Liang, P. 2018. Know What You Don’t Know: Unanswerable Questions for SQuAD  In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), Melbourne, Australia. Association for Computational Linguistics.
  • [Rajpurkar, Zhang, Lopyrev,  LiangRajpurkar et al.2016] Rajpurkar, P., Zhang, J., Lopyrev, K.,  Liang, P. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text  In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP 2016), Austin, TX, USA. Association for Computational Linguistics.
  • [Rashkin, Bosselut, Sap, Knight,  ChoiRashkin et al.2018a] Rashkin, H., Bosselut, A., Sap, M., Knight, K.,  Choi, Y. 2018a. Modeling Naive Psychology of Characters in Simple Commonsense Stories  In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), Melbourne, Australia. Association for Computational Linguistics.
  • [Rashkin, Sap, Allaway, Smith,  ChoiRashkin et al.2018b] Rashkin, H., Sap, M., Allaway, E., Smith, N. A.,  Choi, Y. 2018b. Event2Mind: Commonsense Inference on Events, Intents, and Reactions  In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), Melbourne, Australia. Association for Computational Linguistics.
  • [Reddy, Chen,  ManningReddy et al.2018] Reddy, S., Chen, D.,  Manning, C. D. 2018. CoQA: A Conversational Question Answering Challenge  arXiv: 1808.07042.
  • [ReissReiss2004] Reiss, S. 2004. Multifaceted Nature of Intrinsic Motivation: The Theory of 16 Basic Desires  Review of General Psychology, 8(3), 179–193.
  • [Richardson, Burges,  RenshawRichardson et al.2013] Richardson, M., Burges, C. J. C.,  Renshaw, E. 2013. MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text  In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013),  11, Seattle, WA, USA. Association for Computational Linguistics.
  • [Rodosthenous  MichaelRodosthenous  Michael2016] Rodosthenous, C.  Michael, L. 2016. A Hybrid Approach to Commonsense Knowledge Acquisition  In Proceedings of the 8th European Starting AI Researcher Symposium, The Hague, the Netherlands.
  • [Roemmele, Bejan,  GordonRoemmele et al.2011] Roemmele, M., Bejan, C. A.,  Gordon, A. S. 2011. Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning  In AAAI Spring Symposium on Logical Formalizations of Commonsense Reasoning,  6, Stanford, CA, USA.
  • [Roemmele  GordonRoemmele  Gordon2018] Roemmele, M.  Gordon, A. 2018. An Encoder-decoder Approach to Predicting Causal Relations in Stories  In Proceedings of the First Workshop on Storytelling,  50–59, New Orleans, LA, USA. Association for Computational Linguistics.
  • [Rohrbach, Torabi, Rohrbach, Tandon, Pal, Larochelle, Courville,  SchieleRohrbach et al.2017] Rohrbach, A., Torabi, A., Rohrbach, M., Tandon, N., Pal, C., Larochelle, H., Courville, A.,  Schiele, B. 2017. Movie Description  Int. J. Comput. Vision, 123(1), 94–120.
  • [Rudinger, Naradowsky, Leonard,  Van DurmeRudinger et al.2018a] Rudinger, R., Naradowsky, J., Leonard, B.,  Van Durme, B. 2018a. Gender Bias in Coreference Resolution  In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2018),  8–14, New Orleans, LA, USA. Association for Computational Linguistics.
  • [Rudinger, White,  Van DurmeRudinger et al.2018b] Rudinger, R., White, A. S.,  Van Durme, B. 2018b. Neural Models of Factuality  In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2018),  731–744, New Orleans, LA, USA. Association for Computational Linguistics.
  • [Sap, LeBras, Allaway, Bhagavatula, Lourie, Rashkin, Roof, Smith,  ChoiSap et al.2019] Sap, M., LeBras, R., Allaway, E., Bhagavatula, C., Lourie, N., Rashkin, H., Roof, B., Smith, N. A.,  Choi, Y. 2019. ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning  In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19), Honolulu, HI, USA. AAAI Press.
  • [SchulerSchuler2005] Schuler, K. K. 2005. VerbNet: A Broad-Coverage, Comprehensive Verb Lexicon. Dissertation, University of Pennsylvania.
  • [Schwartz, Sap, Konstas, Zilles, Choi,  SmithSchwartz et al.2017] Schwartz, R., Sap, M., Konstas, I., Zilles, L., Choi, Y.,  Smith, N. A. 2017. The Effect of Different Writing Tasks on Linguistic Style: A Case Study of the ROC Story Cloze Task  In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), Vancouver, BC, Canada. Association for Computational Linguistics.
  • [Sharma, Allen, Bakhshandeh,  MostafazadehSharma et al.2018] Sharma, R., Allen, J., Bakhshandeh, O.,  Mostafazadeh, N. 2018. Tackling the Story Ending Biases in The Story Cloze Test  In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018),  752–757, Melbourne, Australia. Association for Computational Linguistics.
  • [SinghSingh2002] Singh, P. 2002. The Public Acquisition of Commonsense Knowledge  In AAAI Spring Symposium Series, Palo Alto, CA, USA.
  • [Socher, Perelygin, Wu, Chuang, Manning, Ng,  PottsSocher et al.2013] Socher, R., Perelygin, A., Wu, J. Y., Chuang, J., Manning, C. D., Ng, A. Y.,  Potts, C. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank  In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013),  12, Seattle, WA, USA. Association for Computational Linguistics.
  • [Speer, Chin,  HavasiSpeer et al.2017] Speer, R., Chin, J.,  Havasi, C. 2017. ConceptNet 5.5: An Open Multilingual Graph of General Knowledge.  In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17) and the Twenty-Ninth Innovative Applications of Artificial Intelligence Conference (IAAI-17), San Francisco, CA, USA. AAAI Press.
  • [Speer, Havasi,  LiebermanSpeer et al.2008] Speer, R., Havasi, C.,  Lieberman, H. 2008. AnalogySpace: Reducing the Dimensionality of Common Sense Knowledge  In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (AAAI-08), Chicago, IL, USA.
  • [Suchanek, Kasneci,  WeikumSuchanek et al.2007] Suchanek, F. M., Kasneci, G.,  Weikum, G. 2007. YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia  In Proceedings of the 16th International Conference on World Wide Web,  10, Banff, AB, Canada. ACM.
  • [Suhr, Lewis, Yeh,  ArtziSuhr et al.2017] Suhr, A., Lewis, M., Yeh, J.,  Artzi, Y. 2017. A Corpus of Natural Language for Visual Reasoning  In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017),  217–223, Vancouver, BC, Canada. Association for Computational Linguistics.
  • [Talmor, Herzig, Lourie,  BerantTalmor et al.2019] Talmor, A., Herzig, J., Lourie, N.,  Berant, J. 2019. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge  In Proceedings of the 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Minneapolis, MN, USA. Association for Computational Linguistics.
  • [Tandon, de Melo, Suchanek,  WeikumTandon et al.2014] Tandon, N., de Melo, G., Suchanek, F.,  Weikum, G. 2014. WebChild: Harvesting and organizing commonsense knowledge from the web  In Proceedings of the 7th ACM International Conference on Web Search and Data Mining (WSDM ’14),  523–532, New York, NY, USA. ACM.
  • [Tandon, de Melo,  WeikumTandon et al.2017] Tandon, N., de Melo, G.,  Weikum, G. 2017. WebChild 2.0 : Fine-Grained Commonsense Knowledge Distillation  In Proceedings of ACL 2017, System Demonstrations,  115–120, Vancouver, BC, Canada. Association for Computational Linguistics.
  • [Tandon, Mishra, Grus, Yih, Bosselut,  ClarkTandon et al.2018] Tandon, N., Mishra, B. D., Grus, J., Yih, W.-t., Bosselut, A.,  Clark, P. 2018. Reasoning about Actions and State Changes by Injecting Commonsense Knowledge  In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), Brussels, Belgium. Association for Computational Linguistics.
  • [Taylor, Marcus,  SantoriniTaylor et al.2003] Taylor, A., Marcus, M.,  Santorini, B. 2003. The Penn Treebank: An Overview  In Treebanks, Vol. 20 of Text, Speech and Language Technology. Springer, Dordrecht.
  • [TaylorTaylor1953] Taylor, W. L. 1953. “Cloze Procedure”: A New Tool for Measuring Readability  Journalism Bulletin, 30(4), 415–433.
  • [Tjong Kim Sang  De MeulderTjong Kim Sang  De Meulder2003] Tjong Kim Sang, E. F.  De Meulder, F. 2003. Introduction to the CoNLL-2003 Shared Task: Language-independent Named Entity Recognition  In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, CONLL ’03,  142–147, Edmonton, AB, Canada. Association for Computational Linguistics.
  • [Trinh  LeTrinh  Le2018] Trinh, T. H.  Le, Q. V. 2018. A Simple Method for Commonsense Reasoning  arXiv: 1806.02847.
  • [Tsuchida  IshikawaTsuchida  Ishikawa2011] Tsuchida, M.  Ishikawa, K. 2011. IKOMA at TAC2011: A Method for Recognizing Textual Entailment using Lexical-level and Sentence Structure-level features  In Proceedings of the Fourth Text Analysis Conference (TAC 2011), Gaithersburg, MD, USA. National Institute of Standards and Technology.
  • [Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser,  PolosukhinVaswani et al.2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L.,  Polosukhin, I. 2017. Attention is All you Need  In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S.,  Garnett, R., Advances in Neural Information Processing Systems 30,  5998–6008. Curran Associates, Inc.
  • [Wang, Singh, Michael, Hill, Levy,  BowmanWang et al.2018] Wang, A., Singh, A., Michael, J., Hill, F., Levy, O.,  Bowman, S. R. 2018. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding  In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium. Association for Computational Linguistics.
  • [Warstadt, Singh,  BowmanWarstadt et al.2018] Warstadt, A., Singh, A.,  Bowman, S. R. 2018. Neural Network Acceptability Judgments  arXiv: 1805.12471.
  • [Weston, Bordes, Chopra, Rush, van Merriënboer, Joulin,  MikolovWeston et al.2016] Weston, J., Bordes, A., Chopra, S., Rush, A. M., van Merriënboer, B., Joulin, A.,  Mikolov, T. 2016. Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks  In Proceedings of the 4th International Conference on Learning Representations (ICLR 2016), San Juan, Puerto Rico.
  • [Weston, Chopra,  BordesWeston et al.2015] Weston, J., Chopra, S.,  Bordes, A. 2015. Memory Networks  In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA.
  • [White, Rastogi, Duh,  Van DurmeWhite et al.2017] White, A. S., Rastogi, P., Duh, K.,  Van Durme, B. 2017. Inference is Everything: Recasting Semantic Resources into a Unified Evaluation Framework  In Proceedings of the Eighth International Joint Conference on Natural Language Processing (IJCNLP 2017),  996–1005, Taipei, Taiwan. Asian Federation of Natural Language Processing.
  • [White  RawlinsWhite  Rawlins2018] White, A. S.  Rawlins, K. 2018. The role of veridicality and factivity in clause selection  In Proceedings of the 48th Annual Meeting of the North East Linguistic Society, Amherst, MA, USA. GLSA Publications.
  • [Williams, Nangia,  BowmanWilliams et al.2017] Williams, A., Nangia, N.,  Bowman, S. R. 2017. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference  In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT 2018), New Orleans, LA, USA. Association for Computational Linguistics.
  • [WinogradWinograd1972] Winograd, T. 1972. Understanding Natural Language. Academic Press, New York.
  • [Wu, Li, Wang,  ZhuWu et al.2011] Wu, W., Li, H., Wang, H.,  Zhu, K. Q. 2011. Towards a Probabilistic Taxonomy of Many Concepts  https://www.microsoft.com/en-us/research/wp-content/uploads/2011/03/paper-2.pdf.
  • [Xie, Lai, Dai,  HovyXie et al.2017] Xie, Q., Lai, G., Dai, Z.,  Hovy, E. 2017. Large-scale Cloze Test Dataset Created by Teachers  In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), Copenhagen, Denmark. Association for Computational Linguistics.
  • [Xu, Lin,  ZhuXu et al.2018a] Xu, F. F., Lin, B. Y.,  Zhu, K. 2018a. Automatic Extraction of Commonsense LocatedNear Knowledge  In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018),  96–101, Melbourne, Australia. Association for Computational Linguistics.
  • [Xu, Zhou, Young, Zhao, Huang,  ZhuXu et al.2018b] Xu, J., Zhou, H., Young, T., Zhao, H., Huang, M.,  Zhu, X. 2018b. Commonsense Knowledge Aware Conversation Generation with Graph Attention  In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI 2018),  4623–4629, Stockholm, Sweden. IJCAI.
  • [Yang, Lavie, Dyer,  HovyYang et al.2015] Yang, D., Lavie, A., Dyer, C.,  Hovy, E. 2015. Humor Recognition and Humor Anchor Extraction  In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing,  2367–2376, Lisbon, Portugal. Association for Computational Linguistics.
  • [Yang, Gao, Sadiya,  ChaiYang et al.2018] Yang, S., Gao, Q., Sadiya, S.,  Chai, J. 2018. Commonsense Justification for Action Explanation  In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018),  2627–2637, Brussels, Belgium. Association for Computational Linguistics.
  • [Zellers, Bisk, Farhadi,  ChoiZellers et al.2019] Zellers, R., Bisk, Y., Farhadi, A.,  Choi, Y. 2019. From Recognition to Cognition: Visual Commonsense Reasoning  In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA. IEEE.
  • [Zellers, Bisk, Schwartz,  ChoiZellers et al.2018] Zellers, R., Bisk, Y., Schwartz, R.,  Choi, Y. 2018. SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference  In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), Brussels, Belgium. Association for Computational Linguistics.
  • [Zhang, Liu, Liu, Gao, Duh,  Van DurmeZhang et al.2018] Zhang, S., Liu, X., Liu, J., Gao, J., Duh, K.,  Van Durme, B. 2018. ReCoRD: Bridging the Gap between Human and Machine Commonsense Reading Comprehension  arXiv: 1810.12885.
  • [Zhang, Rudinger, Duh,  Van DurmeZhang et al.2016] Zhang, S., Rudinger, R., Duh, K.,  Van Durme, B. 2016. Ordinal Common-sense Inference  Transactions of the Association for Computational Linguistics, 5, 379–395.
  • [Zhu, Groth, Bernstein,  Fei-FeiZhu et al.2016] Zhu, Y., Groth, O., Bernstein, M.,  Fei-Fei, L. 2016. Visual7W: Grounded Question Answering in Images  In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA. IEEE.