SemEval-2020 Task 4: Commonsense Validation and Explanation

In this paper, we present SemEval-2020 Task 4, Commonsense Validation and Explanation (ComVE), which includes three subtasks, aiming to evaluate whether a system can distinguish a natural language statement that makes sense to humans from one that does not, and provide the reasons. Specifically, in our first subtask, the participating systems are required to determine, given two natural language statements with similar wording, which one makes sense and which one does not. The second subtask additionally asks a system to select, from three options, the key reason why a given statement does not make sense. In the third subtask, a participating system needs to generate the reason automatically.






1 Introduction

Although computers’ ability to process natural language has improved significantly over the past decades, their capacity to understand the common sense expressed in language is still limited. It is straightforward for humans to judge that the sentence “John put a turkey into a fridge” is plausible and makes sense, while “John put an elephant into the fridge” is not; yet it is non-trivial for a computer to tell the difference. Arguably, commonsense reasoning plays a central role in natural language understanding systems [logical]. Evaluating how well computers can judge whether given statements make sense is therefore an important problem. In our task, we take an operational definition of making sense by asking human subjects to generate natural language statements that obey or violate their commonsense knowledge about the world.

Many existing tasks embed the evaluation of commonsense understanding in other problems such as co-reference resolution [WSC2012, WSC2015], subsequent event prediction [COPA], ordinal common-sense inference [JOCI], situations with adversarial generations [SWAG], event validation [wang2018modeling], reading comprehension [RocStories2016, SemEval-2018-Task-11, MCScript], dialogue [mutual], and QA problems [SQUABU, COMMONSENSEQA, OpenBookQA]. They verify whether a system is equipped with common sense by testing whether it can give a correct answer when the input does not contain such knowledge. However, these tasks do not directly evaluate commonsense validation, and they do not explicitly identify the key factor required in a commonsense validation process.

Our SemEval-2020 Task 4 includes three subtasks that test whether a system can distinguish natural language statements that make sense from those that do not, and probe the reasons. In the first subtask, a system needs to determine, given two natural language statements with similar wording, which one makes sense and which one does not, e.g., “John put a turkey into a fridge” versus “John put an elephant into the fridge”. The second subtask aims to find, from three provided options, the key reason why a given nonsensical statement does not make sense. For example, for the nonsensical statement “John put an elephant into the fridge”, the three options are “An elephant is much bigger than a fridge”, “Elephants are usually gray while fridges are usually white”, and “An elephant cannot eat a fridge”, and a system needs to identify the correct reason. In addition, the third subtask requires participating systems to generate the reason automatically. We hope our task helps facilitate further research on commonsense validation and its interpretability, other commonsense reasoning problems, and related natural language understanding and generation tasks.

2 Task Setup

2.1 Task Definition

Formally, each instance in our dataset is composed of 8 sentences: {s1, s2, o1, o2, o3, r1, r2, r3}. s1 and s2 are two similar statements that differ by only a few words; one of them makes sense while the other does not. They are used in our Subtask A, the Validation subtask, which requires a model to identify which one makes sense. For the statement that does not make sense, we have three candidate reasons, i.e., the three options o1, o2, and o3; one of them explains why the statement does not make sense. So, in our Subtask B, the Explanation (Multi-Choice) subtask, a model is required to find the correct reason among the three options. For the same nonsensical statement, in Subtask C, the Explanation (Generation) subtask, a participating system needs to generate the reason why it does not make sense. Three references, r1, r2, and r3, are used for evaluating Subtask C. Below we give an example for each subtask, in which we introduce the notation we will use in the paper.
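The 8-sentence layout above can be sketched as a small record type (an illustrative structure of our own; the released files use their own column format, and these field names are not part of the official data):

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class ComVEInstance:
    """One 8-sentence instance: two statements, three reason options,
    and three reference reasons (field names are illustrative)."""
    s1: str
    s2: str
    s1_makes_sense: bool              # gold label for Subtask A
    options: Tuple[str, str, str]     # o1, o2, o3 for Subtask B
    correct_option: int               # index (0-2) of the true reason
    references: Tuple[str, str, str]  # r1, r2, r3 for Subtask C

    @property
    def nonsensical(self) -> str:
        """The nonsensical statement fed to Subtasks B and C."""
        return self.s2 if self.s1_makes_sense else self.s1
```

Note that the correct option also doubles as one of the three references, mirroring the data-construction procedure described in Section 3.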

  • Subtask A: Validation

Question: Which statement of the two does not make sense?

s1: John put a turkey into a fridge.

s2: John put an elephant into the fridge.

In this example, s1 is the sensical statement, denoted s_t, while s2 is the nonsensical statement, denoted s_f.
  • Subtask B: Explanation (Multi-Choice)

Task: Select the best reason that explains why the given statement does not make sense.

Nonsensical statement (s_f): John put an elephant into the fridge.

o1: An elephant is much bigger than a fridge.

o2: Elephants are usually gray while fridges are usually white.

o3: An elephant cannot eat a fridge.

In this example, option o1 is the correct reason, also denoted o_t, while o2 and o3 are the confusing options, also denoted o_f1 and o_f2.
  • Subtask C: Explanation (Generation)

Task: Generate the reason why this statement does not make sense.

Nonsensical statement (s_f): John put an elephant into the fridge.

Reference reasons (used for calculating the BLEU score):

r1: An elephant is much bigger than a fridge.

r2: A fridge is much smaller than an elephant.

r3: Most of the fridges aren’t large enough to contain an elephant.

2.2 Evaluation Metrics

Subtasks A and B are evaluated using accuracy. Subtask C is evaluated with the BLEU score [papineni-etal-2002-bleu]. In addition, for Subtask C, we further perform human evaluation: we randomly select 100 instances from the test set and evaluate system outputs on Amazon Mechanical Turk. We ask three different crowd-sourcing workers to score each generated reason on a Likert scale from 0 to 3 inclusive, according to the rubrics listed in Table 1.

Score Description
0 The reason is not grammatically correct, or not comprehensible at all, or not related to the statement at all.
1 The reason is just the negation of the statement or a simple paraphrase. Obviously, a better explanation can be made.
2 The reason is relevant and appropriate, though it may contain a few grammatical errors or unnecessary parts; or it is like case 1, but a proper reason is genuinely hard to write.
3 The reason is appropriate and is a solid explanation of why the statement does not make sense.
Table 1: Rubrics used in human evaluation in Subtask C.

Then we take the average of the three annotators’ scores as the final human evaluation score. Formally, the human evaluation score of system m is

    score_m = (1 / (3N)) * sum_{i=1..N} sum_{j=1..3} c_{i,j,m},  with N = 100,

where c_{i,j,m} is the score given by annotator j for system m on instance i.
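The human evaluation score is thus a plain mean over the 100 sampled instances and 3 annotators; a minimal sketch:

```python
def human_eval_score(scores):
    """scores[i][j]: the 0-3 rating from annotator j (of 3) for one
    system's output on sampled instance i; returns the final average."""
    n_instances = len(scores)
    total = sum(sum(per_instance) for per_instance in scores)
    return total / (3 * n_instances)
```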

3 Data Construction

Our data construction was performed mainly on Amazon Mechanical Turk and consists of two steps:

  • Step 1: In this step, we construct datasets for Subtask A and Subtask B. Specifically, we ask a crowd-sourcing worker to write a sensical statement s_t and a nonsensical statement s_f. For the nonsensical statement s_f, the worker further writes three sentences, o1, o2, and o3; one of them, denoted o_t, explains why the nonsensical statement does not make sense, while the other two, denoted o_f1 and o_f2, serve as the confusing choices. (Refer to Section 3.1 for details.)

  • Step 2: We then collect three reference reasons, r1, r2, and r3, for Subtask C. We use o_t as one of the three references and collect two more in this step, each written by a different crowd-sourcing worker. Note that instead of having the worker from Step 1 write these two references, we ask two additional workers, to encourage diversity among the references. (Refer to Section 3.2 for details.)

Finally, each instance of the dataset has 8 sentences: {s1, s2, o1, o2, o3, r1, r2, r3}. Note that one sentence among o1, o2, o3 (namely o_t) is repeated among r1, r2, r3, but for convenience of description we denote it separately.

3.1 Step 1: Collecting Data for Subtask A and B

Annotation Guidelines.

When writing instances, workers were asked to follow several principles: (1) Try to avoid complex knowledge and focus on daily common sense; make the questions as understandable as possible, so that any literate person can give the right answers. (2) The confusing reason options, o_f1 and o_f2, should preferably contain the content words of the nonsensical statement s_f, such as its entities and activities. For example, the confusing reasons for “John put an elephant into the fridge” should preferably contain both “elephant” and “fridge”. (3) The confusing reasons should be related to the statements and the correct reason, and should not deviate from the context; otherwise they may be easily identified by pretrained models like BERT [COMMONSENSEQA]. (4) The three reason options, o1, o2, and o3, should relate only to the nonsensical statement s_f rather than the sensical statement s_t, because we want further studies to be able to assess nonsensical statements without access to the correct statement. (5) The confusing reasons should make sense by themselves; otherwise, models may simply rule out the incorrect options o_f1 and o_f2 without considering the causal semantics. This concern is motivated by the fact that models can achieve high performance on the ROC Story Cloze Task while looking only at the alternative endings and ignoring the story content [Schwartz2017]. (6) We control the length of each sentence, making the nonsensical statement s_f nearly as long as the sensical statement s_t, and the correct reason o_t neither too long nor too short among the three reason options.

Use of Inspirational Materials.

It is not easy for all crowd-sourcing workers to write instances from scratch. To address this issue, we also provide them with external reading materials to stimulate inspiration, such as the sentences of the OMCS (Open Mind Common Sense) project [OMCS]. For example, “he was sent to a (restaurant)/(hospital) for treatment after a car crash” can be inspired by the two sentences “restaurants provide food” and “hospitals provide medical care”. In addition, we include a small number of existing commonsense reasoning questions such as WSC [WSC2012, WSC2015], COPA [COPA], and SQUABU [SQUABU].

Quality Control.

After the annotators write the instances, the first two authors of this paper check them; if an instance contains sentences that significantly violate the principles, we reject the instance and ask the crowd-sourcing worker to rewrite it. If a worker writes too many low-quality instances, we remove her or him from our annotator pool. With this quality control process, we accept 30% to 40% of submitted instances.

3.2 Step 2: Collecting Data for Subtask C

Annotation Guidelines.

To collect data for Subtask C, each worker is given a nonsensical statement s_f and a sensical statement s_t and asked to write a reason explaining why the nonsensical statement does not make sense. They must follow these rules: (1) Do not explain why the sensical statement makes sense. (2) Avoid mentioning the sensical statement. (3) Write an actual reason, rather than simply adding “not” or “can’t” to the nonsensical statement. (4) Do not use patterns like “XXX is not for YYY” to create an explanation. (5) Do not try to justify why the nonsensical statement makes sense. (6) Write only one sentence, and do not be overly formal. (7) Refrain from starting the sentence with “because”. (8) Do not try to correct the statement; just give the reason.
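Two of the rules above, (3) and (7), lend themselves to a simple automatic pre-filter (an illustrative sketch of our own; the actual checks in this task were performed manually by the paper’s authors):

```python
import re

def violates_obvious_rules(statement: str, reason: str) -> bool:
    """Illustrative pre-filter for two easily checkable rules:
    rule (7): the reason must not start with "because";
    rule (3): the reason must not be just the statement plus negation."""
    norm = lambda s: re.sub(r"[^a-z ]", "", s.lower()).split()
    if reason.lower().lstrip().startswith("because"):
        return True  # rule (7)
    # rule (3): stripping negation words leaves exactly the statement
    negations = {"not", "cant", "cannot", "dont", "doesnt"}
    reason_words = [w for w in norm(reason) if w not in negations]
    return reason_words == norm(statement)
```

A reason such as “John cannot put an elephant into the fridge.” is flagged, while a genuine explanation like “An elephant is much bigger than a fridge.” passes.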

Quality Control. As in Step 1, after the annotators write the reasons, the first two authors of the paper perform the checking process again. We reject low-quality reasons (those that significantly violate the rules) and remove annotators who write more low-quality reasons than a threshold allows.

Dataset Nonsens_first Nonsens_second Total
Table 2: Nonsensical label distribution in Subtask A. Nonsens_first means that in an (s1, s2) pair the first sentence is the nonsensical statement (s1 = s_f and s2 = s_t); Nonsens_second means that the second sentence is nonsensical (s1 = s_t and s2 = s_f).
Dataset Option 1 correct Option 2 correct Option 3 correct Label Number
Training 3,195 3,362 3,443 10,000
Dev 344 327 336 997
Test 320 355 325 1,000
Table 3: Correct label distribution in Subtask B.

3.3 Data Summary and Analysis

For SemEval-2020, we created 11,997 instances (i.e., 11,997 8-sentence tuples), split into 10,000 for the training set, 997 for the development set, and 1,000 for the test set. We conduct four further analyses of the data, examining data quality, label distribution, sentence length, and common words.

Type of Sentences Training Set Dev Set Test Set
Sensical Statements 7.67 7.12 7.25
Nonsensical Statements 7.69 7.16 7.36
Correct Reasons 8.13 7.96 8.09
Confusing Reasons 7.80 7.14 7.29
Referential Reasons 8.08 7.92 8.06
Table 4: Average length of different types of sentences in the training/dev/test sets.

Average Length. In Table 4, we present the average length of each type of sentence in the training/dev/test sets. Sentences in the development and test sets are shorter than those in the training set: we checked the development and test sets more carefully and strictly, removing more long and incomprehensible instances, which lowers their average lengths. The sensical statements s_t and nonsensical statements s_f have almost the same average lengths in all three sets (differences of 1% or less), so they are balanced. However, there is a noticeable gap between the correct reasons and the confusing reasons in average length (roughly 4% in the training set and 10% in the dev/test sets).
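The length statistics above are straightforward to reproduce; a minimal sketch using whitespace tokenization (the paper does not specify its tokenizer, so counts here are approximate):

```python
def average_length(sentences):
    """Mean whitespace-token count over a list of sentences."""
    counts = [len(s.split()) for s in sentences]
    return sum(counts) / len(counts)

def relative_gap(a, b):
    """Relative difference between two average lengths, as used to
    compare correct vs. confusing reasons (e.g., ~4% on training)."""
    return abs(a - b) / max(a, b)
```

Plugging in Table 4’s training-set numbers for correct (8.13) and confusing (7.80) reasons gives the roughly 4% gap noted above.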

Types of Sentences (Word, Word Frequency(‰))
Sensical Statements (’in’, 2.928)(’he’, 2.754)(’i’, 1.88)(’of’, 1.408)(’on’, 1.403)
Nonsensical Statements (’in’, 2.941)(’he’, 2.714)(’i’, 1.827)(’on’, 1.432)(’of’, 1.427)
Correct Reasons (’not’, 4.126)(’in’, 2.218)(’and’, 1.758)(’cannot’, 1.572)(’of’, 1.523)
Confusing Reasons (’in’, 2.456)(’and’, 2.038)(’can’, 1.883)(’of’, 1.799)(’people’, 1.502)
Referential Reasons (’not’, 5.164)(’in’, 2.2)(’and’, 1.691)(’cannot’, 1.492)(’for’, 1.457)
(a) Training set
Types of Sentences (Word, Word Frequency(‰))
Sensical Statements (’in’, 3.031)(’he’, 2.75)(’on’, 1.713)(’i’, 1.625)(’she’, 1.485)
Nonsensical Statements (’in’, 3.295)(’he’, 2.598)(’on’, 1.543)(’you’, 1.482)(’i’, 1.456)
Correct Reasons (’in’, 2.429)(’not’, 2.107)(’and’, 1.785)(’can’, 1.549)(’no’, 1.462)
Confusing Reasons (’in’, 2.536)(’can’, 2.098)(’and’, 2.02)(’of’, 1.56)(’people’, 1.456)
Referential Reasons (’not’, 3.828)(’in’, 2.387)(’and’, 2.193)(’for’, 1.49)(’of’, 1.278)
(b) Dev+Test set
Table 5: Top-5 common words and their frequencies in different types of sentences in the training and dev+test sets. A frequency of 1.000‰ means the word appears once in every 1,000 words. (We skip the most uninformative words: ’a’, ’an’, ’the’, ’to’, ’is’, ’are’, and ’be’.)

Common Word Analysis. The most common words are useful for showing the differences between sentence types. Instead of using a standard stopword list, we manually created one for our task, of words we call uninformative; they are listed in the caption of Table 5. After removing those words, we list the top-5 common words in each type of sentence in the training and dev+test sets. For the sensical statements s_t and nonsensical statements s_f, there are no significant differences between the training, dev, and test sets. However, there is an obvious gap between the correct reasons and the confusing reasons in negative words such as “not”, “no”, and “cannot”. In the training data, negative words are about 3 times more common in the correct option o_t than in the confusing options o_f1 and o_f2; in the dev+test data, the gap is about 40%, which indicates that the dev+test data has higher quality than the training data. However, as discussed in [Probing], spurious statistical cues can affect BERT’s results; we suspect the negative words act as such spurious cues, potentially making Subtask B easier.
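The per-mille frequencies of Table 5 can be reproduced in a few lines (a sketch; here frequencies are computed over the tokens that remain after removing the uninformative words, which may differ slightly from the paper’s normalization):

```python
from collections import Counter

# The hand-picked uninformative-word list from the Table 5 caption.
UNINFORMATIVE = {"a", "an", "the", "to", "is", "are", "be"}

def top_words_per_mille(sentences, k=5):
    """Top-k words with frequency in per mille (occurrences per
    1,000 kept tokens), skipping the uninformative words."""
    tokens = [w for s in sentences for w in s.lower().split()
              if w not in UNINFORMATIVE]
    counts = Counter(tokens)
    total = len(tokens)
    return [(w, round(1000 * c / total, 3))
            for w, c in counts.most_common(k)]
```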

Repetition. The dev+test set has 12 instances (0.6%) whose nonsensical statements also appear in the training data and 36 instances (1.8%) whose correct reasons also appear in the training data.

3.4 Cautions of using the data

The following advice is given to participants: (1) Feel free to use whatever additional data you deem appropriate to train your models. (2) Do not use the input of Subtask B/C to help Subtask A, and do not use the options of Subtask B to help Subtask C; otherwise the tasks become artificially easy. There are two reasons for this: (a) the nonsensical statement in Subtasks B and C is exactly the nonsensical statement of Subtask A, so participants could use the input of Subtask B/C to directly obtain the answer to Subtask A, and the answer options of Subtask B would also reduce the difficulty of Subtask A; (b) the correct reason in Subtask B is also one of the reference reasons in Subtask C.

4 Systems and Results

In this section, we show the evaluation results of all the submitted systems for the three subtasks. Since most systems share similar model architecture for subtasks A and B, we discuss the two subtasks together.

4.1 Subtask A and Subtask B

Figure 1: The most commonly used model architectures in the three subtasks, based mostly on Solomon’s system. For Subtasks B and C, the connector can simply be “No, ”, which helps constrain the model to learn a choice that explains the implausibility of the statement. For Subtasks A and B, the pretrained models are finetuned on the task-specific data with an MLM objective, then trained as a binary classifier to score each input. For Subtask C, the cross-entropy loss of next-token prediction is used for training, and beam search is used at inference.
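The Subtask A/B scoring pipeline in the caption reduces to a generic skeleton (a minimal sketch of our own; in the real systems the scoring function is a finetuned pretrained LM such as RoBERTa, while here it is just any plausibility-scoring callable):

```python
def build_input(statement, option=None, connector="No, "):
    """Subtask A scores a statement alone; Subtask B concatenates the
    nonsensical statement, the "No, " connector, and a candidate reason."""
    if option is None:
        return statement
    return f"{statement} {connector}{option}"

def pick_best(candidates, score_fn):
    """Return the index of the highest-scoring candidate input; the
    binary classifier in Figure 1 plays the role of score_fn."""
    scores = [score_fn(c) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)
```

For Subtask A the same skeleton applies with the two statements as candidates, flagging the lower-scoring one as nonsensical.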
Team Name Accuracy Team Name Accuracy Team Name Accuracy
CN-HIT-IT.NLP 97.0 panaali* 92.5 Lijunyi 83.0
ECNU-SenseMaker 96.7 ZhengxianFan* 92.4 ehsantaher* 82.5
IIE-NLP-NUT 96.4 LMVE 90.4 TakeLab* 81.2
nlpx* 96.4 Warren* 90.4 Vicki* 79.8
Solomon 96.0 TMLab* 89.2 TR 79.7
Qiaoning 95.9 UAICS 89.1 KDE SenseForce 79.6
BUT-FIT 95.8 JUST 89.1 Hitachi* 78.4
olenet* 95.5 eggy* 89.0 CUHK 72.4
KaLM 95.3 UI 88.2 paramitamirza* 69.2
CS-NET 94.8 Armins* 87.1 UoR 67.6
fkerem* 94.4 DEEPYANG 85.1 chenggguang* 62.3
JUSTers 92.9 WUY* 84.2 praveenjoshi007* 55.9
CS-NLP 92.7 YNU-oxz 83.6 dania* 21.6
Table 6: Subtask A results of all the submitted systems. Those marked with * did not submit system description paper.
Team Name Accuracy Team Name Accuracy Team Name Accuracy
ECNU-SenseMaker 95.0 JBNU 91.4 Masked Reasoner 73.5
CN-HIT-IT.NLP 94.8 Qiaoning 90.8 KDE SenseForce 72.8
IIE-NLP-NUT 94.3 CS-NET 89.0 SSN-NLP 68.3
Solomon 94.0 WUY* 85.3 TakeLab* 66.8
LMVE 93.8 SWAGex 84.6 UoR 65.9
CS-NLP 93.7 TMLab* 82.0 dania* 55.5
KaLM 93.2 UI 80.5 CUHK 51.2
BUT-FIT 93.1 ehsantaher* 79.3 bhu* 36.4
JUSTers 92.3 uzh* 75.8 praveenjoshi007* 32.6
Table 7: Subtask B results of all the submitted systems. Those marked with * did not submit system description paper.

The formal evaluation results of Subtasks A and B are shown in Tables 6 and 7. There are in total 39 valid submissions for Subtask A and 27 for Subtask B. Most top-performing submissions adopt pretrained language models such as BERT [BERT], RoBERTa [RoBERTa], XLNet [yang2019xlnet], and ALBERT [lan2019albert] as the encoder, finetuned on the task’s training set. See Figure 1 for the most commonly used model architectures for Subtasks A and B. The top-performing systems also take advantage of external knowledge graphs such as ConceptNet [ConceptNet5.5], or of unstructured text containing commonsense knowledge. Below we introduce in detail several top-performing systems and their main features.

  • CN-HIT-IT.NLP ranks first in Subtask A. They use a variant of K-BERT [liu2019k] as the encoder to enhance language representations with knowledge graphs. K-BERT is a Transformer-based model that enhances the representation of a text by injecting relevant triples from a knowledge graph to form a knowledge-rich sentence tree, then uses a mask-Transformer to make each triple visible only to its corresponding entity. They use ConceptNet as the commonsense repository from which to extract triples for the statements.

  • ECNU-SenseMaker ranks first in Subtask B. It uses a Knowledge-enhanced Graph Attention Network to leverage heterogeneous knowledge from both a structured knowledge base (i.e., ConceptNet) and unstructured text to improve commonsense understanding. Like CN-HIT-IT.NLP, their model is based on K-BERT. In addition, they use unstructured text from ConceptNet and Subtask C to pretrain the language model.

  • IIE-NLP-NUT uses RoBERTa as the encoder and conducts a second pretraining of the original RoBERTa model on the textual corpus from Open Mind Common Sense [singh2002open]. They also explore several prompt templates for constructing the inputs to the model.

  • Solomon, KaLM, CS-NET, JUSTers, CS-NLP, UI, TR, UoR, and Masked Reasoner have similar model architectures, with RoBERTa as the encoder. In addition, UoR finetunes the pretrained language model on NLI and STS datasets, and UI finetunes on MNLI data. TR combines RoBERTa features with additional features from text-to-image generation using Gradient Boosted Decision Trees, and reports better results in post-evaluation.

  • Qiaoning and JUST use several ensembles of BERT, ALBERT, XLNet and RoBERTa.

  • BUT-FIT, LMVE, and Lijunyi use ALBERT as the encoder. BUT-FIT uses back-translation from Czech for data augmentation, and LMVE uses hint sentences, back-translation from French, and transfer learning between Subtasks A and B to enhance their system.

  • UAICS, DEEPYANG, YNU-oxz, KDE-SenseForce, CUHK, JBNU, and SWAGex are BERT-based. JBNU puts a BiLSTM on top of BERT, and SWAGex finetunes BERT with SWAG data. CUHK uses the multi-task learning framework MT-DNN [liu2019multi], adopting an “Explain, Reason and Predict” system.

It can be seen from the results that pretrained language models such as RoBERTa can achieve rather high performance (e.g., the team Solomon achieves 96.0% and 94.0% on Subtasks A and B, respectively, without using further resources). This suggests that large-scale pretrained language models contain enough commonsense knowledge to handle Subtasks A and B in this challenge. Additionally finetuning the pretrained language models on commonsense-related text such as OMCS, which we used as inspirational material, can push the results even higher. The best-performing teams on Subtasks A and B both adopt K-BERT, which complements pretrained language models with knowledge triples from an external knowledge base (i.e., ConceptNet). This shows that KG-enhanced approaches such as K-BERT can effectively incorporate external knowledge. However, the high scores may also indicate some degree of data leakage, since both ConceptNet and OMCS were used as references for annotators when writing the data instances.

4.2 Subtask C

Team BLEU Human Eval Team BLEU Human Eval
BUT-FIT 22.4 1.84 CN-HIT-IT.NLP+ 9.7 1.74
Solomon 19.3 1.84 SWAGex 7.1 1.75
KaLM 18.5 2.08 UI 5.5 0.73
panaali* 17.2 1.22 TMLab* 5.4 1.05
JUSTers 16.1 1.94 CUHK 4.3 0.58
cdjhz* 16.0 1.75 SSN-NLP 2.2 0.59
JBNU 15.9 1.80 UoR+ 0.9 0.53
ANA 15.7 2.10 Masked Reasoner+ 0.6 0.81
LMVE+ 12.9 1.78
Table 8: Subtask C results of all the submitted systems. Those marked with * did not submit system description paper, and those marked with + means they do not include Subtask C in their system description paper.

The results for Subtask C are shown in Table 8. There are in total 17 valid submissions. There are generally two approaches: (1) the sequence-to-sequence approach, where the source side is the nonsensical statement and the reason is the target sequence; and (2) the language-model generation approach, which uses large-scale pretrained autoregressive language models such as GPT-2 [radford2019language] for reason generation, with the nonsensical sentence acting as the prompt. The language-model generation approach, shown in Figure 1, is the most commonly used and achieves relatively good results. Below we describe in detail the systems and their main features.
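For the language-model generation approach, training sequences are typically assembled by simple concatenation (a sketch; the exact separator and end-of-sequence tokens vary by system, and `<|endoftext|>` is GPT-2’s convention rather than something this task mandates):

```python
def make_training_sequence(statement, explanation, eos="<|endoftext|>"):
    """Concatenate statement and explanation into one sequence for
    next-token-prediction training; at inference the model is prompted
    with the statement alone and decodes until eos."""
    return f"{statement} {explanation}{eos}"

def make_prompt(statement):
    """The inference-time prompt: just the nonsensical statement."""
    return statement + " "
```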

  • BUT-FIT experiments with both the sequence-to-sequence approach and the language-generation approach. For the former, they use BART [lewis2019bart] with beam-search decoding, achieving the highest BLEU among all teams. For the latter, the nonsensical statement is used as a prompt: at training time, the statement and the explanation are concatenated, and GPT-2 is trained on these sequences with a next-token-prediction objective; at test time, conditioned on the statement, the model generates reason tokens until the end-of-sentence token is produced.

  • KaLM uses the sequence-to-sequence architecture BART. To enhance the source-side statement, they extract keywords from the statement and search for evidence in Wiktionary (version enwiktionary-20200220). After that, they concatenate the evidence with the original statement as the source sentence for generation. This approach proves effective and makes their system second-best in human evaluation.

  • ANA achieves the highest human evaluation score with a multi-task learning framework. Specifically, they use a decoder-only Transformer based on GPT-2 as the backbone, training the model with two heads: one for language modeling and one for classification. They use data from both Subtask B and Subtask C to compute the language-model and classification losses. Furthermore, they use OMCS at the pretraining stage, and CoS-E [rajani2019explain] and OpenBookQA [OpenBookQA] at the task-specific training stage.

  • Solomon, JUSTers, SWAGex, UI, and CUHK use GPT or GPT-2 finetuned on the task training data. JBNU uses UniLM, which incorporates three LM tasks (unidirectional LM, bidirectional LM, and sequence-to-sequence prediction LM), and uses only one of the reference correct reasons. UI does not use the training data and treats generation as a cloze task. SSN-NLP uses a seq2seq NMT framework without a pretrained LM.

Large-scale pretrained language models such as BART and GPT-2 dominate the submissions. The two systems with the highest human evaluation scores, ANA and KaLM, use additional resources such as Wiktionary, OMCS, and other commonsense datasets, which again shows that additional knowledge from structured resources can help with reason generation. From Table 8 we can see that BLEU does not correlate well with human evaluation, especially for the top-performing systems. According to a further experiment by BUT-FIT, the naive baseline of copying the source sentence as the reason achieves a BLEU of 17.23, which would rank 4th among all submissions. This indicates that BLEU, which focuses on surface token overlap, has difficulty evaluating the generated text reliably. The top-performing system achieves a human evaluation score of 2.1, showing the power of pretrained language models; but considering that the full score is 3.0, we still have a long way to go before generating human-acceptable reasons.
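To see why a copy-the-source baseline can score non-trivially, here is a simplified sentence-level BLEU (clipped n-gram precision up to bigrams with a brevity penalty; the official metric is corpus-level BLEU as in [papineni-etal-2002-bleu], so absolute numbers will differ):

```python
import math
from collections import Counter

def simple_bleu(candidate, references, max_n=2):
    """Simplified sentence-level BLEU: clipped n-gram precision up to
    max_n, geometric mean, and a brevity penalty. Illustrative only."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n])
                              for i in range(len(cand) - n + 1))
        # Clip each candidate n-gram count by its max count in any reference.
        max_ref = Counter()
        for ref in refs:
            ref_ngrams = Counter(tuple(ref[i:i + n])
                                 for i in range(len(ref) - n + 1))
            for g, c in ref_ngrams.items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_ngrams.items())
        if clipped == 0:
            return 0.0
        precisions.append(clipped / sum(cand_ngrams.values()))
    # Brevity penalty against the closest-length reference.
    closest = min(refs, key=lambda r: abs(len(r) - len(cand)))
    bp = 1.0 if len(cand) >= len(closest) else math.exp(1 - len(closest) / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

Because the nonsensical statement shares most content words with the reference reasons, even a verbatim copy accrues substantial unigram overlap, which is exactly the failure mode the human evaluation exposes.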

5 Related Work

Commonsense reasoning in natural language has been studied in different forms of tasks and has recently attracted extensive attention. In the Winograd Schema Challenge (WSC) [WSC2012, WSC2015], a model needs to solve hard co-reference resolution problems based on commonsense knowledge. For example, “The trophy would not fit in the brown suitcase because it was too big. What was too big (trophy or suitcase)?” The Choice of Plausible Alternatives (COPA) [COPA] emphasizes events and consequences: each question aims to find the suitable cause or result of the premise from two given alternatives, where all premises and alternatives are simple sentences. For example, the premise can be “The man broke his toe. What was the CAUSE of this?” with the candidate answers “(1) He got a hole in his sock.” and “(2) He dropped a hammer on his foot.” Several subsequent datasets were inspired by COPA. The JHU Ordinal Common-sense Inference (JOCI) [JOCI] aims to label the plausibility of a human response after a particular situation, from 5 (very likely) to 1 (impossible). Situations with Adversarial Generations (SWAG) [SWAG] requires a system to choose the most likely-to-happen alternative after a specific situation. These datasets emphasize the pre- and/or post-situations of certain situations, but not the reasons why they occur or are caused.

Some datasets are inspired by reading comprehension. The Story Cloze Test and ROCStories Corpora [RocStories2016, RocStories2018] aim to figure out the right ending from two candidate sentences after a four-sentence story. For a narrative text, MCScript [MCScript] gives various types of questions and pairs of answer candidates for each question. Most questions require knowledge beyond the facts mentioned in the text. Compared to those reading comprehension tasks, our benchmark encourages people to use any external resources they want.

Some other datasets evolved from QA problems and care more about factual commonsense knowledge. SQUABU [SQUABU] provides a small hand-constructed test of commonsense and scientific questions. CommonsenseQA [COMMONSENSEQA] asks crowd workers to create questions from ConceptNet [ConceptNet5.5], a large graph of commonsense knowledge, where each question discriminates between three answer candidates that all share the same relationship to a single source concept drawn from ConceptNet. OpenBookQA [OpenBookQA] provides questions and answer candidates, as well as thousands of diverse facts about elementary-level science related to the questions. The AI2 Reasoning Challenge (ARC) [ARC] gives thousands of questions of different knowledge types, together with a relevant 14M-sentence corpus mixing science facts with other narrative sentences. MuTual [mutual] provides a dataset for multi-turn dialogue reasoning in the commonsense area. Such questions are not easy to answer without specialized knowledge, whereas our questions are based on common sense.

Some datasets focus on physical knowledge validation [wang2018modeling, porada2019can], or only on limited attributes or actions of world knowledge [verbphysics]. In contrast, our dataset concerns general commonsense knowledge beyond just the physical world. For example, the sentence in our task “Tom’s mom becomes (happy)/(upset) when Tom gets high grades in the exam” is about social and emotional common sense. Besides, our dataset is based on statements that include events, descriptions, assertions, etc., not merely events.

More importantly, compared with our work, the above tasks do not directly estimate general common sense or ask for the logical reasons behind the correct answers. In recent years, some large-scale commonsense inference knowledge resources have been developed, which may be helpful in commonsense reasoning tasks. Atomic [ATOMIC] presents a large-scale everyday commonsense knowledge graph with nine if-then relations over variables, covering causes, effects, and so on. Event2Mind [Event2Mind] proposes a new corpus and task that aim to find out the mentioned/unmentioned people’s intents and reactions under various daily circumstances. These datasets are not directly useful for our benchmark since they each focus on a narrow domain. ConceptNet is a seminal knowledge graph that has been upgraded over time [ConceptNet, ConceptNet3, ConceptNet5, ConceptNet5.5]. ConceptNet constructs triples using labeled edges as relations and various words and/or phrases as entities, and it also includes sentences describing the corresponding triples. In contrast to these resources, we investigate the evaluation of common sense rather than building a resource.

A pilot study [wang-etal-2019-make] has shown that there is still a significant gap between human and machine performance when no training data is provided, even though the models had already been pretrained on over 100 million natural language sentences. In our task, we additionally provide training data with human annotations.

6 Summary

This paper summarizes SemEval-2020 Task 4: Commonsense Validation and Explanation. For this task, we constructed a dataset that consists of 11,997 instances and 83,986 sentences. The task attracted over forty participating teams, of which 31 submitted system papers. The results show that pretrained models are very effective on Subtask A and Subtask B, but there is still significant room to improve model performance on Subtask C.

We attribute the high performance on Subtasks A and B to several main reasons: 1) Subtask A is a relatively easy question by design: a model only needs to detect the less plausible of two sentences. 2) Pretrained models are trained on billion-word corpora such as Wikipedia, which seem to contain adequate commonsense knowledge [zhou2019evaluating]. 3) As described in the annotation process, we used sentences from OMCS to inspire crowd-sourcing workers; the top-3 systems also use OMCS, which possibly leads to overfitting on our task. 4) For Subtask B, as discussed in our data analysis section, the data has some flaws in average length and common words, which reduce the difficulty. 5) Some instances have obvious patterns; for example, there are tens of instances that contain ‘put XXX into YYY’ or ‘XXX is bigger than YYY’, making the problems simpler. 6) Hundreds of crowd-sourcing workers wrote instances, and workers are likely to think of the same commonsense knowledge, such as ‘XXX is bigger/shorter/quicker/slower than YYY’. These reasons fall into three categories: A. task design (reason 1); B. overfitting due to external resources (reasons 2 and 3); C. overfitting due to data collection and quality control (reasons 4, 5, and 6).

We consider future work in four directions: 1) There is still a gap between machine and human performance on Subtask C, and the reason-generation task needs further investigation. 2) The dataset can be improved by controlling sentence length, removing instances with repeated commonsense knowledge, balancing common words, and removing obvious patterns. 3) Subtask A can be made more difficult: instead of comparing which of two statements makes more sense, it can be reformulated as a classification task that directly validates whether a single statement makes sense. 4) We notice that the BLEU score does not align with human evaluation for high-performing systems, so an automatic metric that compares the semantic correlation between two reasons needs to be developed.
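The misalignment between n-gram overlap metrics and human judgment can be illustrated with a small self-contained sketch. The sentences below are hypothetical examples in the style of Subtask C (not drawn from our dataset), and the minimal BLEU implementation is a simplified stand-in for standard toolkit implementations:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(reference, hypothesis, max_n=4, eps=1e-9):
    """Minimal sentence-level BLEU: clipped n-gram precisions,
    geometric mean, and brevity penalty (eps avoids log(0))."""
    precisions = []
    for n in range(1, max_n + 1):
        ref_counts = Counter(ngrams(reference, n))
        hyp_counts = Counter(ngrams(hypothesis, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        precisions.append(max(overlap, eps) / total)
    bp = 1.0 if len(hypothesis) >= len(reference) else \
        math.exp(1 - len(reference) / max(len(hypothesis), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

reference = "the sun is too hot to live on".split()
# A correct paraphrase of the reason, but with little n-gram overlap.
paraphrase = "no one can survive on the sun because of the heat".split()
# Lexically close to the reference but semantically wrong.
lexical = "the sun is too hot to eat".split()

score_paraphrase = bleu(reference, paraphrase)
score_lexical = bleu(reference, lexical)
# The wrong-but-overlapping reason outscores the correct paraphrase.
print(score_lexical > score_paraphrase)  # True
```

Because BLEU rewards surface overlap, a semantically wrong reason that copies the reference wording can score far higher than a valid paraphrase, which is why semantic-similarity metrics are needed for evaluating generated explanations.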


This work is supported by the National Science Foundation of China (Grant No. 61976180), the Westlake University, and the Bright Dream Joint Institute for Intelligent Robotics. Yue Zhang is the corresponding author.