CommonGen: A Constrained Text Generation Dataset Towards Generative Commonsense Reasoning

11/09/2019 ∙ by Bill Yuchen Lin, et al. ∙ University of Southern California

Rational humans can generate sentences that cover a certain set of concepts while describing natural and common scenes. For example, given apple (noun), tree (noun), pick (verb), humans can easily come up with scenes like "a boy is picking an apple from a tree" via their generative commonsense reasoning ability. However, we find this capacity has not been well learned by machines. Most prior work on machine commonsense focuses on discriminative reasoning tasks in a multi-choice question answering setting. Herein, we present CommonGen: a challenging dataset for testing generative commonsense reasoning with a constrained text generation task. We collect 37k concept-sets as inputs and 90k human-written sentences as associated outputs. We also provide high-quality rationales behind the reasoning process for the development and test sets, written by human annotators. We demonstrate the difficulty of the task by examining a wide range of sequence generation methods with both automatic metrics and human evaluation. The state-of-the-art pre-trained generation model, UniLM, is still far from human performance on this task. Our data and code are publicly available.




1 Introduction

Commonsense reasoning has long been acknowledged as a critical bottleneck of artificial intelligence, especially in natural language processing. It is the ability to combine commonsense facts and logical rules to make new presumptions about ordinary scenes in our daily life. A distinct property of commonsense reasoning problems is that they are generally trivial for human beings while challenging for machine reasoners.


Figure 1: A motivating example for generative commonsense reasoning and the CommonGen task. A reasoner gets a concept-set as the input and should generate a sentence that covers all given concepts while describing a common scene (in the green box) out of less plausible ones (in the red box).

There have been a few recent tasks and datasets for testing machine commonsense, but most of them frame their problems as multi-choice question answering, such as CSQA Talmor et al. (2019) and SWAG Zellers et al. (2018). We refer to this kind of task as discriminative commonsense reasoning because it focuses on modeling the plausibility of given complete scenes. Systems for these tasks have to work with a biased selection of distractors, and are thus less practical and less challenging: simply fine-tuning large pre-trained language encoders can yield near- or above-human performance Liu et al. (2019). On the other hand, little work has been done so far on testing machine commonsense in a generative reasoning setting, where a reasoner is expected to complete scenes from several given concepts.

Specifically, we would like to investigate whether machine-reasoning models can generate a sentence that contains a required set of concepts (i.e., nouns or verbs) while describing a common scene in our daily life. For example, as shown in Figure 1, given an unordered collection of concepts "{apple (noun), bag (noun), pick (verb), place (verb), tree (noun)}", a rational reasoner should be able to generate a sentence like "A boy picks some apples from a tree and places them into a bag.", which describes a natural scene and contains all given concepts. Creating this sentence is easy for humans but non-trivial for even state-of-the-art conditional language generation models. We argue that such an ability to recover natural scenes of daily life can benefit a wide range of natural language generation (NLG) tasks, including image/video captioning Qiao et al. (2019); Wang et al. (2019), scene-based visual reasoning and VQA Hudson and Manning (2019), storytelling Guan et al. (2018), and dialogue systems Zhou et al. (2018a, b).

Towards empowering machines with generative commonsense reasoning ability, we create a large-scale dataset, named CommonGen, for this constrained text generation task. We collect concept-sets as the inputs, each of which contains three to five common concepts. These concept-sets are sampled from several large corpora of image/video captions, such that the concepts inside them are more likely to co-occur in natural scenes. Through crowd-sourcing via Amazon Mechanical Turk (AMT), we obtain human-written sentences as the expected outputs. We investigate the performance of sophisticated sequence generation methods on the proposed task with both automatic metrics and human evaluation. The experiments show that all methods are far from human performance in generative commonsense reasoning. Our main contributions are as follows: 1) we introduce the first large-scale constrained text generation dataset targeting generative commonsense reasoning; 2) we systematically compare methods for this (lexically) constrained text generation with extensive experiments and evaluation; 3) our code and data are publicly available, so future research in this direction can be directly developed in a unified framework.

2 Problem Formulation

In this section, we formulate our task with mathematical notation and discuss its inherent challenges. The input to the task is an unordered set of concepts x = {c_1, c_2, …, c_k}, where each c_i ∈ C is a common noun or verb; X denotes the space of concept-sets and C stands for the concept vocabulary. The expected output of this task is a simple, grammatical sentence y ∈ Y that describes a natural scene in our daily life and covers all given concepts in x. Note that other surface forms of the given concepts are also accepted, such as plural forms of nouns and inflected forms of verbs. In addition, we also provide rationales as an optional resource for modeling the generation process. For each pair (x, y), a rationale r is a list of sentences that explains the background commonsense knowledge used in the scene-recovering process.

The task is to learn a structured predictive function f : X → Y, which maps a concept-set to a sentence. Thus, it can be seen as a special case of constrained text generation Hu et al. (2017). The unique challenges of our proposed task come from two main aspects, as follows.

Constrained Decoding. Lexically constrained decoding for sentence generation has been an important and challenging research topic in the machine translation community Hokamp and Liu (2017), where the focus is on how to decode sentences when some words/phrases (e.g., terminology) must be present in target sentences (Section 6). However, it is still an open problem how to efficiently generate sentences given an unordered set of multiple keywords with potential morphological changes (e.g., "pick" → "picks" in the previous case). Apart from that, part-of-speech constraints bring even more difficulty (e.g., "place" can be a verb or a noun).
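To make the morphological-variant issue concrete, the following is a minimal sketch (our own illustration, not part of the dataset toolkit) of checking whether a candidate sentence covers every concept while allowing a few naive suffix variants; a real checker would use a proper lemmatizer.

```python
def simple_variants(concept):
    """Naive surface variants of a concept; a crude stand-in for real
    lemmatization (e.g., with spaCy or NLTK)."""
    forms = {concept, concept + "s", concept + "es", concept + "ed"}
    if concept.endswith("e"):
        forms.add(concept[:-1] + "ing")   # place -> placing
    else:
        forms.add(concept + "ing")        # pick -> picking
    return forms

def covers_all_concepts(sentence, concepts):
    """True iff each concept (or a simple variant of it) appears."""
    tokens = set(sentence.lower().split())
    return all(tokens & simple_variants(c) for c in concepts)

print(covers_all_concepts(
    "a boy picks some apples from a tree and places them into a bag",
    ["apple", "bag", "pick", "place", "tree"]))  # True
```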

Commonsense Reasoning. Apart from the challenge of constrained decoding, a generative commonsense reasoner also has to compositionally use (latent) commonsense knowledge to generate the most plausible scenes. Recalling the illustrative example in Figure 1, even such a simple scene generation process requires considerable commonsense knowledge, such as: 1) "apples grow on trees"; 2) "bags are containers that you can put something in"; 3) "you usually pick something and then place it in a container". Expected reasoners have to prioritize target scenes over an infinite number of less plausible scenes like "A boy picks an apple tree and places it into bags." or "A boy places some bags on a tree and picks an apple.".

3 The CommonGen Dataset

In this section, we present how we build the CommonGen dataset for testing machine commonsense with generative reasoning. The overall data collection process is as follows: 1) we first collect a large number of high-quality image/video caption sentences from several existing corpora; 2) we then compute co-occurrence statistics for concept-sets of different sizes (3, 4, and 5), so that we can find the concept-sets that are more likely to be present in the same scene; 3) finally, we ask human crowd-workers on AMT to write scenes with rationales for every given concept-set, which serve as our development and test sets. The training set consists of carefully post-processed human-written caption sentences, which have little overlap with the dev/test sets. We present the statistics and show the dataset's inherent challenges at the end of this section.

3.1 Collecting Concept-Sets with Captions

Following the general definition in the largest commonsense knowledge graph, ConceptNet Speer et al. (2016), we understand a concept as a common noun or verb. We aim to test the ability to generate natural scenes from a given set of concepts. The expected concept-sets in our task should be likely to co-occur in natural, daily-life scenes. The concepts in image/video captions, which usually describe scenes in our daily life, possess this desired property. We therefore collect a large number of caption sentences from a variety of datasets, including VATEX Wang et al. (2019), LSMDC Rohrbach et al. (2017), ActivityNet Krishna et al. (2017), and SNLI Bowman et al. (2015) (most of the sentences in SNLI are from Flickr30k Young et al. (2014), a large-scale image caption corpus, while SNLI contains many more human-written scenes), forming 1,040,330 sentences in total.

We assume that if a set of concepts is mentioned together in more caption sentences, then this concept-set is more likely to co-occur. Thus, we compute the co-occurrence frequency of all possible concept-sets that have 3, 4, or 5 concepts, named three/four/five-concept-sets respectively. (There are too many two-concept-sets, and they can generate over-flexible scenes that are hard to evaluate; also, there are too few concept-sets larger than five.) Each concept-set is associated with at least one caption sentence. We carefully post-process them and take the shortest ones with minimal overlaps as the final data. These initial concept-sets are further divided into three parts: train/dev/test. We then iterate over all training concept-sets and remove the ones that have more than two overlapping concepts with any concept-set in the dev or test set. Thus, the dev/test sets can better measure the generalization ability of models on unseen combinations of concepts.
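The counting step above can be sketched as follows; the caption list and the toy concept extractor are placeholders for the real 1M-sentence corpus and a proper POS-tagging/lemmatization pipeline.

```python
from collections import Counter
from itertools import combinations

# Toy stand-ins for the ~1M caption corpus and the concept vocabulary.
captions = [
    "a boy picks an apple from a tree",
    "a girl picks apples and places them in a bag",
    "a boy picks an apple from a tall tree",
]
VOCAB = {"boy", "pick", "apple", "tree", "girl", "place", "bag"}

def extract_concepts(sentence):
    # Crude lemmatization by stripping a trailing "s"; illustration only.
    lemmas = {w.rstrip("s") for w in sentence.split()}
    return sorted(lemmas & VOCAB)

co_occurrence = Counter()
for caption in captions:
    concepts = extract_concepts(caption)
    for size in (3, 4, 5):                      # three/four/five-concept-sets
        for cset in combinations(concepts, size):
            co_occurrence[cset] += 1

# ("apple", "boy", "pick", "tree") appears in two captions.
print(co_occurrence[("apple", "boy", "pick", "tree")])  # 2
```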

3.2 Crowd-Sourcing via AMT

The above-mentioned caption sentences associated with each concept-set are human-written and do describe scenes that cover all given concepts. However, they were created under specific contexts (i.e., an image or a video) and thus might be less representative of common sense. To better measure the quality and interpretability of generative reasoners, we need to evaluate them with scenes and rationales created by annotators who see only the concept-sets.

We collect more human-written scenes for each concept-set in the dev and test sets through crowd-sourcing on the Amazon Mechanical Turk platform. Each input concept-set is annotated by at least three different humans. The annotators are also required to provide sentences as rationales, which further encourages them to use common sense in creating their scenes. The crowd-sourced sentences correlate well with the associated captions, suggesting that it is reasonable to use caption sentences as training data even though they can be partly noisy. Additionally, we utilize a search engine over the OMCS corpus Singh et al. (2002) to retrieve relevant propositions as distant rationales for the training data.


Figure 2: The frequency of top 50 single concepts (upper) and co-occurred concept-pairs (lower) in the test data.

3.3 Statistics

We present the statistical information of our final dataset. First, we summarize the basic statistics in Table 1, such as the number of unique concept-sets, scene sentences, and average sentence lengths. In total, there are 3,706 unique concepts among all concept-sets, with 3,614/1,018/1,207 in the train/dev/test parts respectively. Note that 4% of the dev and 6% of the test concepts never appear in the training data, so we can better understand how well trained models perform on unseen concepts.

We also analyze the overlap between training concept-sets and dev/test concept-sets. On average, we find that 98.8% of the training instances share no common concept at all with the dev/test data, such that the dev/test sets can help us analyze model performance on new combinations of concepts.

We also visualize the frequency distribution of our test concept-sets in Figure 2 by showing the frequency of top 50 single concepts and co-occurred concept pairs.

Split   #Concept-Sets (size 3/4/5)      #Sent.   Avg. Len.
Train   34,773 (27,588 / 6,485 / 633)   81,458   12.93
Dev      1,245 (246 / 499 / 500)         4,039   13.15
Test     1,245 (247 / 498 / 500)         3,531   13.10

Table 1: The basic statistics of CommonGen.

4 Methods

In this section, we introduce the methods that we adopt for the proposed constrained text generation task, grouped into several types as follows. First, we consider different kinds of encoder-decoder architectures with a copy attention mechanism, including both classic and recently proposed methods. Second, we utilize a state-of-the-art pre-trained sentence generation model for our task. Finally, we include typical models for abstractive summarization and story generation, as well as keyword-based decoding of language models.

4.1 Seq-to-Seq Learning

A straightforward approach is to cast this problem as a "sequence"-to-sequence task, where the input sequences are randomly ordered sets of the given concepts. In this way, encoder-decoder seq2seq architectures based on bidirectional RNNs (bRNN) Sutskever et al. (2014) or the Transformer (Trans.) Vaswani et al. (2017) can be directly adapted to the task, just like many other conditional sequence generation problems (translation, summarization, etc.).

Order-insensitive processing.

However, these encoders may degrade because our inputs are actually order-insensitive. We thus try using a multi-layer perceptron (MLP) with mean-pooling as the encoder ("mean encoder") over the sequence of word vectors to completely eliminate order sensitivity. Similarly, we consider removing the positional embeddings in the Transformer (Trans. w/o Pos).
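As a quick illustration of why mean-pooling removes order sensitivity, consider this sketch (our own toy example with random word vectors, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy word vectors (dimension 8) for the five input concepts.
vectors = {c: rng.normal(size=8)
           for c in ["apple", "bag", "pick", "place", "tree"]}

def mean_encode(concepts):
    """Order-insensitive set encoding: mean-pool the word vectors."""
    return np.mean([vectors[c] for c in concepts], axis=0)

enc1 = mean_encode(["apple", "bag", "pick", "place", "tree"])
enc2 = mean_encode(["tree", "pick", "bag", "apple", "place"])
print(np.allclose(enc1, enc2))  # True: shuffling the input changes nothing
```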

Copying mechanism. The above-mentioned architectures with vanilla attention can miss words in the input sequences and thus produce either unknown tokens or synonyms. To force the decoder to produce target sentences constrained by the input concepts, we utilize the copying mechanism Gu et al. (2016) for all these models. We follow the implementation of these methods in OpenNMT-py Klein et al. (2017).
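Schematically, the copying mechanism mixes the decoder's vocabulary distribution with its attention over the source tokens. The sketch below is our simplification of Gu et al.'s formulation, with a hand-set generation probability p_gen; the real model learns p_gen from the decoder state.

```python
import numpy as np

def copy_mix(p_vocab, attention, src_vocab_ids, p_gen):
    """Final distribution = p_gen * P_vocab + (1 - p_gen) * copy-attention,
    scattering attention mass onto the source tokens' vocabulary ids."""
    p_final = p_gen * np.asarray(p_vocab, dtype=float)
    for attn, vid in zip(attention, src_vocab_ids):
        p_final[vid] += (1.0 - p_gen) * attn
    return p_final

p_vocab = [0.7, 0.1, 0.1, 0.1]   # decoder softmax over a 4-word vocab
attention = [0.9, 0.1]           # attention over two source concepts
src_vocab_ids = [2, 3]           # the source concepts' vocabulary ids
out = copy_mix(p_vocab, attention, src_vocab_ids, p_gen=0.5)
print(out)        # source words gain probability mass
print(out.sum())  # still a valid distribution (sums to 1)
```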

Non-autoregressive generation. Recent advances in conditional sentence generation have a focus on edit-based models, which iteratively refine generated sequences (usually bounded by a fixed length). These models potentially get better performance than auto-regressive methods because of their explicit modeling on iterative refinements. We study typical models including iNAT Lee et al. (2018), Insertion Transformer (InsertTrans) Stern et al. (2019), and Levenshtein Transformer (LevenTrans) Gu et al. (2019).

4.2 A BERT-based Method: UniLM

We employ a recently proposed unified pre-trained language model, UniLM Dong et al. (2019), which uses BERT Devlin et al. (2019b) as the encoder and then fine-tunes the whole architecture with different generation-based objectives. To the best of our knowledge, UniLM is the state-of-the-art method for a wide range of conditional text generation tasks, including summarization, question generation, and dialogue response generation.

4.3 Other methods

Based on the similarity between our task and abstractive summarization and story generation (with given topic words), we also apply the Pointer Generator Networks ("PointerGen") See et al. (2017) and Multi-scale Fusion Attention ("Fusion Attn.") Fan et al. (2018) models, respectively, to our task.

4.4 Incorporating Commonsense Rationales

We explore how to utilize additional commonsense knowledge (i.e., rationales) as input to the task. As mentioned in Section 3.2, we search for relevant sentences in the OMCS corpus as additional distant rationales for the training data, and use the ground-truth rationale sentences for the dev/test data. The inputs are no longer the concept-sets themselves, but take the form "[rationale; concept-set]" (i.e., the rationale sentences concatenated with the original concept-set strings).
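The knowledge-aware input format can be sketched as follows; the separator token here is our own choice for illustration and not necessarily the one used in the released code.

```python
def build_input(concepts, rationales=None):
    """Return the model input string: the concept-set alone
    (knowledge-agnostic) or with rationales prepended (knowledge-aware)."""
    concept_str = " ".join(concepts)
    if rationales:
        return " ".join(rationales) + " ; " + concept_str
    return concept_str

print(build_input(["apple", "bag", "pick"]))
# -> "apple bag pick"
print(build_input(["apple", "bag", "pick"], ["apples grow on trees ."]))
# -> "apples grow on trees . ; apple bag pick"
```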

Dev set:

Model               BLEU-3/4        ROUGE-2/L      CIDEr   SPICE   ΔBERTS.
PointerGen           5.60 /  2.50    4.00 / 21.70   3.42   14.60   -2.39
FusionAttn           4.90 /  2.40    2.60 / 16.60   1.88    7.10   -2.56
mean encoder        14.30 /  8.80    9.20 / 28.50   6.86   17.90    1.94
Trans. w/o Pos      10.90 /  6.90    7.20 / 24.20   4.78   14.50    1.48
bRNN                14.00 /  8.70    8.80 / 27.70   6.63   17.70    1.69
bRNN + rationale    12.80 /  7.70    7.70 / 26.70   6.16   17.10    1.39
Trans.              11.80 /  7.10    7.60 / 25.00   5.53   15.80    1.58
Trans. + rationale  11.60 /  6.30    6.30 / 23.40   5.36   14.60    0.89
LevenTrans          19.80 / 13.10   11.40 / 31.60   8.69   22.70    2.55
iNAT                10.10 /  4.90    5.90 / 23.00   3.80   16.00   -1.43
InsertTrans         13.80 /  8.70    8.40 / 26.70   5.86   16.80    1.85
UniLM               27.60 / 19.30   17.40 / 38.20  13.01   32.60    4.37
UniLM + rationale   28.10 / 20.10   17.60 / 38.60  12.36   30.60    4.21
Human Bound         44.50 / 40.80   45.60 / 59.80  38.71   59.50    7.83

Test set:

Model               BLEU-3/4        ROUGE-2/L      CIDEr   SPICE   ΔBERTS.
PointerGen           6.60 /  2.70    4.60 / 22.60   3.74   15.50   -1.92
FusionAttn           4.40 /  2.00    2.60 / 16.40   1.75    6.90   -2.62
mean encoder        12.10 /  6.90    8.20 / 27.30   6.41   17.70    1.62
Trans. w/o Pos       8.60 /  4.90    6.10 / 22.40   4.20   13.30    1.09
bRNN                11.10 /  6.50    7.30 / 25.90   5.86   16.30    1.17
bRNN + rationale    11.20 /  6.10    7.20 / 26.10   6.16   16.90    1.14
Trans.               9.50 /  5.30    6.60 / 23.40   4.72   14.00    1.29
Trans. + rationale  10.40 /  5.50    6.10 / 23.40   5.34   15.40    0.67
LevenTrans          17.00 / 10.70   10.10 / 30.40   8.04   21.30    2.29
iNAT                 8.60 /  4.20    5.30 / 22.20   3.54   14.70   -1.72
InsertTrans         12.30 /  7.60    8.00 / 26.10   5.50   16.30    1.59
UniLM               24.40 / 16.40   16.10 / 36.40  12.25   31.10    4.18
UniLM + rationale   25.10 / 17.40   16.70 / 37.50  11.86   30.00    4.13
Human Bound         47.60 / 44.30   48.10 / 61.70  42.39   61.20    8.21

Table 2: Experimental results of different baseline methods on CommonGen.

5 Evaluation

Herein, we present the experimental results for comparing different baseline methods in the proposed setting. We first introduce the setup and automatic metrics, and then we present the results and analysis. Finally, we show human evaluation results and qualitative analysis.

5.1 Setup

We use the proposed CommonGen dataset in two settings: knowledge-agnostic and knowledge-aware. For the knowledge-agnostic setting, we simply apply the methods in Section 4; for the knowledge-aware setting, we concatenate the rationales and the input concept-sets together as the inputs.

5.2 Automatic Metrics

To automatically evaluate our methods, we propose to use the widely used metrics for image/video captioning. This is because the proposed CommonGen task can also be regarded as a captioning task where the contexts are incomplete scenes specified by the given concept-sets. Therefore, we choose BLEU-3/4 Papineni et al. (2001), ROUGE-2/L Lin (2004), CIDEr Vedantam et al. (2014), and SPICE Anderson et al. (2016) as the main metrics. Apart from these classic metrics, we also include a recent embedding-based metric named BERTScore Zhang et al. (2019). To make the comparisons clearer, we report the delta of the BERTScore results, obtained by subtracting the score of merely using the input concept-sets as the target sentences; we denote this ΔBERTS.

To estimate human performance under each metric, we iteratively treat every reference sentence in the dev/test data as a prediction to be compared with all references (including itself). That is, if a model has the same reasoning ability as the average of our crowd workers, its results should exceed this "human bound".
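A minimal sketch of this leave-each-reference-in estimation, using a toy token-overlap F1 in place of the real captioning metrics (which operate on n-grams and multiple references):

```python
from statistics import mean

def overlap_f1(pred, ref):
    """Toy stand-in for BLEU/CIDEr/etc.: token-set overlap F1."""
    p, r = set(pred.lower().split()), set(ref.lower().split())
    common = len(p & r)
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(r)
    return 2 * precision * recall / (precision + recall)

def human_bound(references):
    """Treat each reference in turn as the prediction, score it against
    all references (including itself), and average over predictions."""
    return mean(mean(overlap_f1(pred, ref) for ref in references)
                for pred in references)

refs = ["a girl smiles while holding a bowl of food",
        "she holds a bowl of food and smiles"]
print(human_bound(refs))  # 0.8125
```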

5.3 Experimental Results

We present the experimental results of the five groups of methods introduced in Section 4. We find that UniLM outperforms all other baseline methods by a large margin, which is expected, since it is initialized with the pre-trained BERT encoder and further trained towards generation objectives. However, its performance is still far from the human bound, and this margin is even larger on the test data.

We notice that the most recent edit-based model, LevenTrans, achieves the best performance among models without any pre-training. This shows that edit-based sequence generation models can better deal with cases where target sentences share similar vocabulary with source ones. Nonetheless, the other two models within the same sequence modeling framework (i.e., fairseq) are much worse, which might be because of their designs specialized for machine translation.

The order-insensitive sequence/set encoder, "mean encoder", outperforms order-sensitive counterparts like "bRNN". However, such a marginal improvement is not seen in the comparison between "Trans." and "Trans. w/o Pos". We assume that for short sequences, order sensitivity does not harm sequential encoders, while positional embeddings in the Transformer can better support its self-attention mechanism. We also find that Transformer-based seq2seq architectures do not outperform simpler models like bRNN.

As for the use of additional retrieved sentences from the OMCS corpus and the human-written associated rationales, we find that they are not generally helpful for the investigated architectures. Although they increase the BLEU and ROUGE scores, the metrics specially designed for captioning, such as CIDEr and SPICE, drop. We argue that this might be because the OMCS sentences are not well aligned with the training data, and that more sophisticated methods are needed to encode such non-sequential facts in a more compositional way.

5.4 Human Evaluation

From the automatic evaluation results with multiple metrics, we get a rough idea of the performance of all models. However, no automatic metric is perfect, especially for a newly proposed generation task like CommonGen. We thus ask human judges (five native English speakers who are college students in the US) to rank 100 outputs of 6 selected typical models as well as one randomly picked reference sentence, forming seven systems in total. Annotators are instructed to rank results by their coverage, fluency, and plausibility in daily life. Then, we compute the cumulative gains of each system over all 100 cases:

s_i^a = Σ_{n=1}^{N} (K + 1 − r_{i,n}^a) / K

where s_i^a is the final score of the i-th system by the a-th annotator, and r_{i,n}^a is the rank position of the i-th system's output on the n-th example. In our case, N = 100, K = 7, and A = 5.
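Under this scoring scheme (our reading of the formula: each rank r contributes (K + 1 − r)/K, which is consistent with the reported lower bound of 14.29 in Table 3 for a system always ranked last), the cumulative gain can be computed as:

```python
def cumulative_gain(ranks, K=7):
    """Sum over examples of (K + 1 - rank) / K: rank 1 contributes 1.0,
    rank K (last place) contributes 1/K."""
    return sum((K + 1 - r) / K for r in ranks)

print(round(cumulative_gain([7] * 100), 2))  # 14.29 -- always ranked last
print(cumulative_gain([1] * 100))            # 100.0 -- always ranked first
```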

As shown in Table 3, we compare the different systems, including the human bound, in terms of both the above-introduced cumulative ranking scores and the average hit@top3 rates with standard deviations. We find that the correlation between the human evaluation and CIDEr/SPICE is higher than with the other metrics.


Model               avg. rank score (std.)   avg. Hit@Top3 (std.)
Lower Bound              14.29 (0.00)             0.00 (0.00)
bRNN                     19.46 (0.45)             4.00 (1.73)
mean encoder             19.70 (0.73)             9.80 (8.53)
bRNN + rationale         20.61 (0.71)            10.60 (1.82)
LevenTrans               26.64 (1.61)            23.20 (3.49)
UniLM + rationale        47.13 (1.96)            73.40 (2.07)
UniLM                    50.18 (3.31)            86.60 (2.61)
Human Bound              75.57 (3.44)            96.40 (2.07)

Table 3: The average human evaluation ranking scores and hit@top3 rates for each tested system.

5.5 Qualitative Analysis

To observe the performance of the models of interest more clearly, we present several real system outputs on the test set in Table 4. We find that models usually cannot cover all given concepts and may also produce repetitions of given concepts (e.g., "a dog catches a dog", "a couple of couples", and "at an object and an object ."). Moreover, we find that the order of actions may not be natural. For example, the output "a man pulls a sword out of his mouth and swallows it" makes less sense, because a performer usually swallows a sword first and then pulls it out.

Concept-Sets: {bowl (noun), food (noun), hold (verb), smile (verb)}
  bRNN: A close up of a partly smile .
  mean encoder: He puts food into a toothbrush then follows it around .
  bRNN + rationale: She puts food into a bowl .
  LevenTrans: A man puts food in a gentle smile .
  UniLM + rationale: A woman holds a large bowl of food in her hand as she prepares another .
  UniLM: She grabs a bowl of food and holds it in her hands .
  Reference: A girl smiles while holding a bowl of food .

Concept-Sets: {catch (verb), dog (noun), rabbit (noun), run (verb)}
  bRNN: Girl getting ready to catch rabbit on a note .
  mean encoder: A rabbit is on the catch of a run in an empty stadium .
  bRNN + rationale: A rabbit up a run to the neck .
  LevenTrans: A woman is making a catch while getting cheese on a metal shop .
  UniLM + rationale: A dog chasing a rabbit .
  UniLM: A dog catches a dog and rabbit while they run .
  Reference: The dog ran away from its owner to try and catch a rabbit . / The dog ran to catch the rabbit .

Concept-Sets: {mouth (noun), pull (verb), swallow (verb), sword (noun)}
  bRNN: A man swallows a sword and then release it .
  mean encoder: A man pulls a sword into his mouth while it plays out .
  bRNN + rationale: At a gym , people take turns and pull the mouth out .
  LevenTrans: A man swallows a sword out of his mouth .
  UniLM + rationale: A man swallows a sword in his mouth as he tries to chop it up .
  UniLM: A man pulls a sword out of his mouth and swallows it .
  Reference: We had to pull the sword out to cut an apple to put in my mouth to swallow . / The sideshow performer pulled a sword he had swallowed from his mouth .

Concept-Sets: {couple (noun), dance (verb), outfit (noun), stage (noun)}
  bRNN: A couple wears red and orange .
  mean encoder: A couple wears matching outfits .
  bRNN + rationale: The couple are performing a dance .
  LevenTrans: A couple wears a dance .
  UniLM + rationale: A couple in a dance on a stage .
  UniLM: A couple of couples dance in western outfits on a stage .
  Reference: The couple on the stage danced in elaborate outfits . / The couples wearing Halloween outfits danced on the stage .

Concept-Sets: {dive (verb), object (noun), pool (noun), retrieve (verb), throw (verb)}
  bRNN: A man is holding a retrieve and then flings the object .
  mean encoder: A man stands on stage picks up an object from his hands while he watch .
  bRNN + rationale: A pool of people on floor with their mustache clearing an object .
  LevenTrans: A man is diving down at an object and an object .
  UniLM + rationale: A man retrieves an object and throws it into a pool .
  UniLM: A man dives into an object in a pool and retrieves it .
  Reference: The woman threw an object in the pool and the dog dove to retrieve it . / A man dived into the pool to retrieve the object he threw in it .

Table 4: The example outputs from different models and human references for qualitative analysis.

6 Related Work

Machine Common Sense

Machine common sense (MCS) has long been considered one of the most significant areas in artificial intelligence. Recently, various datasets have emerged for testing machine commonsense from different angles, such as commonsense extraction Xu et al. (2018); Li et al. (2016), next situation prediction (SWAG Zellers et al. (2018), CODAH Chen et al. (2019), HellaSWAG Zellers et al. (2019b)), cultural/social understanding Lin et al. (2018); Sap et al. (2019), visual scene comprehension Zellers et al. (2019a), and general commonsense question answering Talmor et al. (2019); Huang et al. (2019). Most of them use a multi-choice QA setting for discriminative commonsense reasoning, among which CSQA Talmor et al. (2019) and SWAG Zellers et al. (2018) are two typical examples. The input of the CSQA task is a question that requires commonsense reasoning, with five candidate answers (words/phrases). The SWAG task asks models to select the most plausible next situation, given a sentence describing an event.

The two tasks share very similar objectives with large pre-trained language encoders like BERT Devlin et al. (2019a): Masked-LM can predict the missing words in an incomplete sentence, which is similar to the CSQA setting; NextSentPrediction classifies whether one sentence follows another in the corpora, which can be seen as distant supervision for the SWAG task. Thus, simply fine-tuning such large pre-trained language encoders can yield near- or above-human performance Lin et al. (2019); Liu et al. (2019), but this does not necessarily mean that machine reasoners can really produce new presumptions in an open, generative setting. The proposed CommonGen is, to the best of our knowledge, the first dataset and task for generative commonsense reasoning.

Constrained Text Generation

Constrained or controllable text generation aims to decode realistic sentences that have expected attributes such as sentiment Luo et al. (2019a); Hu et al. (2017), tense Hu et al. (2017), template Zhu et al. (2019), style Fu et al. (2018); Luo et al. (2019b); Li et al. (2018), etc. The most similar scenario with our task is lexically constrained sentence encoding, which has been studied mainly in the machine translation community Hasler et al. (2018); Dinu et al. (2019) for dealing with terminology and additional bilingual dictionaries.

Classic methods usually modify the (beam) search algorithm to accommodate lexical constraints, as in Grid Beam Search Hokamp and Liu (2017). The most recent work in this line is the CGMH Miao et al. (2018) model, which works at inference time to sample sentences containing a sequence of given keywords from a language model. However, our task brings more challenges: 1) we do not assume a fixed order of the keywords in target sentences; 2) we allow morphological changes of the keywords; 3) the decoded sentences must describe highly plausible scenes in our daily life. Current methods cannot fully address these issues and are also extremely slow at generating grammatical sentences. We instead mainly investigate sequence-to-sequence architectures, especially models based on editing operations and non-autoregressive decoding. Pre-trained seq2seq generation models like UniLM Dong et al. (2019) and BART Lewis et al. (2019) are usually initialized with a pre-trained language encoder and then further fine-tuned on multiple NLG tasks. UniLM achieves the best performance on our proposed CommonGen task, while remaining far from human-level performance and hardly interpretable.

7 Conclusion

In this paper, we propose a novel constrained text generation task for generative commonsense reasoning. We introduce a new large-scale dataset named CommonGen and investigate various methods on it. Through extensive experiments and human evaluation, we demonstrate that the inherent difficulties of the new task cannot be addressed even by a state-of-the-art pre-trained language generation model.

For future research, we believe the following directions are highly valuable to explore: 1) specially designed metrics for automatic evaluation that focus on commonsense plausibility; 2) better mechanisms for retrieving and imposing useful commonsense knowledge in sentence generation processes; 3) explicit modeling of keyword-centric edits (e.g., insertion, deletion, morphological changes) such that relevant commonsense knowledge can be well utilized. We also believe that models performing well on CommonGen can be easily transferred to other commonsense-related reasoning tasks with few annotations, including image/video captioning, visual question answering, and discriminative multi-choice commonsense question answering.


  • P. Anderson, B. Fernando, M. Johnson, and S. Gould (2016) SPICE: semantic propositional image caption evaluation. In ECCV, Cited by: §5.2.
  • S. R. Bowman, G. Angeli, and C. D. Manning (2015) A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §3.1.
  • M. Chen, M. D’Arcy, A. Liu, J. Fernandez, and D. Downey (2019) CODAH: an adversarially authored question-answer dataset for common sense. ArXiv abs/1904.04365. Cited by: §6.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019a) BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. of NAACL-HLT, Cited by: §6.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019b) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, Cited by: §4.2.
  • G. Dinu, P. Mathur, M. Federico, and Y. Al-Onaizan (2019) Training neural machine translation to apply terminology constraints. In ACL, Cited by: §6.
  • L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, and H. Hon (2019) Unified language model pre-training for natural language understanding and generation. ArXiv abs/1905.03197. Cited by: §4.2, §6.
  • A. Fan, M. Lewis, and Y. Dauphin (2018) Hierarchical neural story generation. In ACL, Cited by: §4.3.
  • Z. Fu, X. Tan, N. Peng, D. Zhao, and R. Yan (2018) Style transfer in text: exploration and evaluation. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §6.
  • J. Gu, Z. Lu, H. Li, and V. O.K. Li (2016) Incorporating copying mechanism in sequence-to-sequence learning. In Proc. of ACL, Cited by: §4.1.
  • J. Gu, C. Wang, and J. Zhao (2019) Levenshtein transformer. ArXiv abs/1905.11006. Cited by: §4.1.
  • J. Guan, Y. Wang, and M. Huang (2018) Story ending generation with incremental encoding and commonsense knowledge. In AAAI, Cited by: §1.
  • E. Hasler, A. de Gispert, G. Iglesias, and B. Byrne (2018) Neural machine translation decoding with terminology constraints. In NAACL-HLT, Cited by: §6.
  • C. Hokamp and Q. Liu (2017) Lexically constrained decoding for sequence generation using grid beam search. In Proc. of ACL, Cited by: §2, §6.
  • Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing (2017) Toward controlled generation of text. In Proc. of ICML, Cited by: §2, §6.
  • L. Huang, R. Le Bras, C. Bhagavatula, and Y. Choi (2019) Cosmos QA: machine reading comprehension with contextual commonsense reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Cited by: §6.
  • D. A. Hudson and C. D. Manning (2019) GQA: a new dataset for real-world visual reasoning and compositional question answering. In CVPR, Cited by: §1.
  • G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush (2017) OpenNMT: open-source toolkit for neural machine translation. In ACL, Cited by: §4.1.
  • R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. C. Niebles (2017) Dense-captioning events in videos. In International Conference on Computer Vision (ICCV), Cited by: §3.1.
  • J. D. Lee, E. Mansimov, and K. Cho (2018) Deterministic non-autoregressive neural sequence modeling by iterative refinement. In EMNLP, Cited by: §4.1.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. S. Zettlemoyer (2019) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. Cited by: §6.
  • J. Li, R. Jia, H. He, and P. Liang (2018) Delete, retrieve, generate: a simple approach to sentiment and style transfer. In NAACL-HLT, Cited by: §6.
  • X. Li, A. Taheri, L. Tu, and K. Gimpel (2016) Commonsense knowledge base completion. In ACL, Cited by: §6.
  • B. Y. Lin, X. Chen, J. Chen, and X. Ren (2019) KagNet: knowledge-aware graph networks for commonsense reasoning.. In Proceedings of EMNLP-IJCNLP, Cited by: §6.
  • B. Y. Lin, F. F. Xu, K. Q. Zhu, and S. Hwang (2018) Mining cross-cultural differences and similarities in social media. In ACL, Cited by: §6.
  • C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. In ACL 2004, Cited by: §5.2.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. S. Joshi, D. Chen, O. Levy, M. Lewis, L. S. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized bert pretraining approach. ArXiv abs/1907.11692. Cited by: §1, §6.
  • F. Luo, P. Li, P. Yang, J. Zhou, Y. Tan, B. Chang, Z. Sui, and X. Sun (2019a) Towards fine-grained text sentiment transfer. In ACL, Cited by: §6.
  • F. Luo, P. Li, J. Zhou, P. Yang, B. Chang, Z. Sui, and X. Sun (2019b) A dual reinforcement learning framework for unsupervised text style transfer. In IJCAI, Cited by: §6.
  • N. Miao, H. Zhou, L. Mou, R. Yan, and L. Li (2018) CGMH: constrained sentence generation by metropolis-hastings sampling. In AAAI, Cited by: §6.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2001) Bleu: a method for automatic evaluation of machine translation. In ACL, Cited by: §5.2.
  • T. Qiao, J. Zhang, D. Xu, and D. Tao (2019) MirrorGAN: learning text-to-image generation by redescription. ArXiv abs/1903.05854. Cited by: §1.
  • A. Rohrbach, A. Torabi, M. Rohrbach, N. Tandon, C. Pal, H. Larochelle, A. Courville, and B. Schiele (2017) Movie description. International Journal of Computer Vision. Cited by: §3.1.
  • M. Sap, R. LeBras, E. Allaway, C. Bhagavatula, N. Lourie, H. Rashkin, B. Roof, N. A. Smith, and Y. Choi (2019) ATOMIC: an atlas of machine commonsense for if-then reasoning. In Proc. of AAAI, Cited by: §6.
  • M. Sap, H. Rashkin, D. Chen, R. Le Bras, and Y. Choi (2019) Social IQa: commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Cited by: §6.
  • A. See, P. J. Liu, and C. D. Manning (2017) Get to the point: summarization with pointer-generator networks. In ACL, Cited by: §4.3.
  • P. Singh, T. Lin, E. T. Mueller, G. Lim, T. Perkins, and W. L. Zhu (2002) Open mind common sense: knowledge acquisition from the general public. In CoopIS/DOA/ODBASE, Cited by: §3.2.
  • R. Speer, J. Chin, and C. Havasi (2016) ConceptNet 5.5: an open multilingual graph of general knowledge. In AAAI, Cited by: §3.1.
  • M. Stern, W. Chan, J. R. Kiros, and J. Uszkoreit (2019) Insertion transformer: flexible sequence generation via insertion operations. In ICML, Cited by: §4.1.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In NIPS, Cited by: §4.1.
  • A. Talmor, J. Herzig, N. Lourie, and J. Berant (2019) CommonsenseQA: a question answering challenge targeting commonsense knowledge. In Proc. of NAACL-HLT, Cited by: §1, §6.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, Cited by: §4.1.
  • R. Vedantam, C. L. Zitnick, and D. Parikh (2014) CIDEr: consensus-based image description evaluation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4566–4575. Cited by: §5.2.
  • X. Wang, J. Wu, J. Chen, L. Li, Y. Wang, and W. Y. Wang (2019) VaTeX: a large-scale, high-quality multilingual dataset for video-and-language research. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §3.1.
  • F. F. Xu, B. Y. Lin, and K. Q. Zhu (2018) Automatic extraction of commonsense locatednear knowledge. In ACL, Cited by: §6.
  • P. Young, A. Lai, M. Hodosh, and J. Hockenmaier (2014) From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2. Cited by: footnote 2.
  • R. Zellers, Y. Bisk, A. Farhadi, and Y. Choi (2019a) From recognition to cognition: visual commonsense reasoning. In Proc. of CVPR, Cited by: §6.
  • R. Zellers, Y. Bisk, R. Schwartz, and Y. Choi (2018) SWAG: a large-scale adversarial dataset for grounded commonsense inference. In Proc. of EMNLP, Cited by: §1, §6.
  • R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi (2019b) HellaSwag: can a machine really finish your sentence?. In ACL, Cited by: §6.
  • T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi (2019) BERTScore: evaluating text generation with bert. ArXiv abs/1904.09675. Cited by: §5.2.
  • H. Zhou, T. Young, M. Huang, H. Zhao, J. Xu, and X. Zhu (2018a) Commonsense knowledge aware conversation generation with graph attention. In IJCAI, Cited by: §1.
  • J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, and M. Sun (2018b) Graph neural networks: a review of methods and applications. arXiv preprint arXiv:1812.08434. Cited by: §1.
  • W. Zhu, Z. Hu, and E. P. Xing (2019) Text infilling. ArXiv abs/1901.00158. Cited by: §6.