CUHK at SemEval-2020 Task 4: CommonSense Explanation, Reasoning and Prediction with Multi-task Learning

by Hongru Wang, et al.

This paper describes our system submitted to Task 4 of SemEval-2020: Commonsense Validation and Explanation (ComVE), which consists of three subtasks. The task is to directly validate whether a given sentence makes sense and to require the model to explain it. Based on the BERT architecture with a multi-task setting, we propose an effective and interpretable "Explain, Reason and Predict" (ERP) system to solve the three commonsense subtasks: (a) Validation, (b) Reasoning, and (c) Explanation. Inspired by cognitive studies of common sense, our system first generates a reason for, or an understanding of, the sentences and then chooses which statement makes sense; this is achieved by multi-task learning. During the post-evaluation period, our system reached 92.9% accuracy in subtask A (rank 11), 89.7% accuracy in subtask B (rank 9), and a BLEU score of 12.9 in subtask C (rank 8).




1 Introduction

Introducing common sense to natural language understanding systems is attracting more and more attention. Common sense, as ordinarily conceived, presents itself as the aspect of the grammar of expressions and sentences on which their semantic properties and relations depend [1]. One important difference between human and machine text understanding is that humans can access commonsense knowledge while processing text, which helps them draw inferences about facts that are not mentioned in the text. It is therefore a fundamental question how to validate whether a system has commonsense capability and, more importantly, how to let the system explain the hidden facts it uses for inference. Existing benchmarks measure commonsense knowledge indirectly and without explanation: existing datasets test common sense through tasks that require extra knowledge, such as co-reference resolution or reading comprehension. They verify whether a system is equipped with common sense by testing whether it can give a correct answer when the input does not contain such knowledge. However, such benchmarks have some limitations. First, they do not give a direct quantitative standard for measuring sense-making capability. Second, they do not explicitly identify the key factors required in a sense-making process. Third, they do not require the model to explain why it makes a given prediction.

Commonsense reasoning tasks are intended to require the model to go beyond pattern recognition: the model should use “common sense” or world knowledge to make inferences. Previous empirical analyses of commonsense reasoning mainly take the form of question answering (QA) [15], but question answering makes it hard to directly evaluate the common sense in contextualized representations. There has also been little work investigating common sense in pre-trained language models [21] such as ELMo [10] and BERT [3]. Introduced by [17], sense-making is a task that tests whether a model can differentiate sense-making statements from non-sense-making ones. Specifically, the statements typically differ in only one keyword, which may be a noun, verb, adjective, or adverb. There are two existing approaches to this problem. One simple way is to learn more commonsense knowledge from larger training sets [18]. On the other hand, some works [6] focus on effectively utilizing external, structured commonsense knowledge graphs such as ConceptNet [14] and COMET [2]. Inspired by these works, more researchers are trying to fuse commonsense knowledge with language models [4] and apply them to downstream tasks [20]. Recently, a new hybrid approach has been proposed for commonsense reasoning [5]. The core idea behind it is multi-task learning [8], which has been widely applied in natural language tasks [7].

However, progress in this field has on the whole been frustratingly slow, and much of the work is purely theoretical. There may be no single perfect set of benchmark problems, but as yet there is essentially none at all, nor anything like an agreed-upon evaluation metric; benchmarks and evaluation metrics would serve to move the field forward. The field might well benefit if commonsense reasoning were systematically described and evaluated. To this end, our system focuses on a benchmark that directly tests whether a system can differentiate natural language statements that make sense from those that do not. Our results indicate that pre-trained models alone do not perform well on the benchmark, and the remaining error cases show that human-level performance has not been achieved yet. Thus, inspired by human cognition, we design a new procedure to handle the commonsense challenge. It first explains its understanding of the given sentences with a language model and induces the hidden commonsense fact; the explanation is then used as a supplementary input to the prediction module. We believe that our approach can also be applied to more challenging datasets.

The organization of this paper is as follows. In Section 2, we introduce the task definition and basic information about pre-trained language models. We then describe the framework of our model in Section 3. Empirical results are given and discussed in Section 4. In Section 5 we provide a more detailed analysis of the bad cases that appeared in our experiments. Finally, we conclude in Section 6.

2 Preliminaries

2.1 Task Definition

Formally, each instance in the dataset is composed of eight sentences: two statements $\{s_1, s_2\}$, three options $\{o_1, o_2, o_3\}$, and three references $\{r_1, r_2, r_3\}$. $s_1$ and $s_2$ are two similar statements that share the same syntactic structure and differ by only a few words, but only one of them makes sense. They are used in subtask A, called Validation, which requires the model to identify which one makes sense. For the against-common-sense statement $s_1$ or $s_2$, there are three optional sentences $o_1$, $o_2$, and $o_3$ to explain why the statement does not make sense. Subtask B, named Explanation (Multi-Choice), requires that the only correct reason be identified from the distractors. For the same against-common-sense statement $s_1$ or $s_2$, subtask C, named Explanation (Generation), asks the participants to generate the reason why it does not make sense. The three reference reasons $r_1$, $r_2$, and $r_3$ are used for evaluating subtask C.

Subtask A: Unlike other classification problems, subtask A gives us two statements with similar wording. Their dependency trees or semantic structures are extremely similar, which requires us to build a model that can recognize these subtle differences and reason about whether each sentence makes sense.

Subtask B: Subtask B gives us one false sentence (either $s_1$ or $s_2$), that is, a sentence that does not make sense, and three options $o_1$, $o_2$, and $o_3$. We need to choose the one option that explains why the given sentence does not make sense.

Subtask C: Subtask C provides the same false sentence as subtask B and three references $r_1$, $r_2$, and $r_3$. All three references explain why the false sentence does not make sense. This task requires us to build a model that automatically generates a correct reason given one false sentence.
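To make the task definition concrete, the following is a minimal sketch of how one dataset instance could be represented. The class name, field names, and the placeholder reference strings are illustrative assumptions and do not come from the task data format; the statements and options reuse the umbrella example discussed in Section 4.3.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ComVEInstance:
    """One instance; all field names are hypothetical, chosen for illustration."""
    s1: str                # first statement
    s2: str                # second statement
    false_index: int       # 0 if s1 is against common sense, 1 if s2 is
    options: List[str]     # three candidate reasons (subtask B)
    correct_option: int    # index of the correct reason in `options`
    references: List[str]  # three reference reasons (subtask C)

    def false_statement(self) -> str:
        """Return the statement that does not make sense."""
        return self.s1 if self.false_index == 0 else self.s2

    def num_sentences(self) -> int:
        """Count the sentences in the instance: statements, options, references."""
        return 2 + len(self.options) + len(self.references)


example = ComVEInstance(
    s1="a thicker cloth can help you keep warm in snowy days",
    s2="an umbrella can help you keep warm in snowy days",
    false_index=1,
    options=[
        "we don't wear umbrellas",
        "umbrellas can keep you dry in snowy days",
        "going outside is very crazy in snowy days",
    ],
    correct_option=0,
    # Placeholders only: the reference reasons for this example are not given here.
    references=["<reference 1>", "<reference 2>", "<reference 3>"],
)

print(example.false_statement())
```

Subtask A would use `s1` and `s2`, subtask B the false statement plus `options`, and subtask C the false statement plus `references` for scoring.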

2.2 Pretrained Language Model

BERT is a state-of-the-art bidirectional pre-trained language model that has recently shown excellent performance in a wide range of NLP tasks [3]. It is a Transformer encoder built on multi-head self-attention and fully connected layers. The input representation of BERT is constructed by summing the corresponding token, segment, and position embeddings. As an autoencoding (AE) model, it can see the context in both the forward and backward directions. BERT is pre-trained with two unsupervised tasks: 1) masked language modeling (Masked LM) and 2) Next Sentence Prediction (NSP). By optimizing for both tasks, BERT learns not only semantic and syntactic knowledge but also world knowledge [13]. This helps explain BERT's strong performance.
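The input representation described above, an element-wise sum of token, segment, and position embeddings, can be illustrated with a small sketch. The function name and the toy two-dimensional vectors are assumptions made for illustration; real BERT embeddings are learned and much larger.

```python
from typing import List

Matrix = List[List[float]]


def input_representation(token_emb: Matrix, segment_emb: Matrix,
                         position_emb: Matrix) -> Matrix:
    """Element-wise sum of three [seq_len][dim] embedding matrices."""
    return [
        [t + s + p for t, s, p in zip(tok, seg, pos)]
        for tok, seg, pos in zip(token_emb, segment_emb, position_emb)
    ]


# Toy embeddings for a sequence of two tokens with dimension 2.
token_emb = [[0.5, 1.0], [2.0, -1.0]]
segment_emb = [[0.25, 0.25], [0.25, 0.25]]   # both tokens in the same segment
position_emb = [[0.0, 1.0], [1.0, 0.0]]      # one vector per position

summed = input_representation(token_emb, segment_emb, position_emb)
print(summed)
```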

RoBERTa is a replication study of BERT which showed that carefully tuning hyper-parameters and increasing the training data size lead to significantly improved results on language understanding. More specifically, [9] proposed four modifications to improve BERT: 1) training the model longer, with bigger batches, over more data; 2) removing the next sentence prediction objective; 3) training on longer sequences; and 4) dynamically changing the masking pattern applied to the training data. As in other NLP tasks, RoBERTa achieves higher accuracy than BERT on this task.

3 Models

Our proposed ERP system first generates (explains) its understanding of the given sentences with a language model, and the explanation is then used as a supplementary input to the prediction module. For subtask A the input is the sentence pair $s_1$ and $s_2$, and for subtask B the input is the against-common-sense statement. Subtask C is an explanation generation task, so we can explore commonsense reasoning in two settings, 1) explain-and-then-predict and 2) predict-and-then-explain, to evaluate the effectiveness of our ERP system. We describe the ERP system for the different subtasks in Sections 3.1 and 3.2.

The architecture of our model is shown in Figure 1. The input x represents a sequence (either one sentence or several stacked sentences), and the representation of each token in the sequence is constructed by summing the corresponding token, segment, and position embeddings. The semantic encoder then maps each input token to a vector at the word (token) level, and the transformer encoder captures contextual information at the sentence level via the self-attention mechanism. Once we have the contextual embedding vectors, we apply a task-specific layer for each downstream task; here we use a text classification layer for both subtask A and subtask B. For ease of understanding we introduce subtask A and subtask B first, followed by subtask C, although during the competition the organizers released the datasets of subtask A, subtask C, and subtask B in that order, which supports our ERP system.

Figure 1: The Explain, Reason and Predict (ERP) system. Subtask B uses different inputs during training and validation than during testing.

3.1 Sub-task A and Sub-task B

For both subtask A and subtask B, we cast the problem as text classification. First, the training set of subtask A is $\{(s_1^{(i)}, s_2^{(i)}, y^{(i)})\}_{i=1}^{N}$, in which $(s_1^{(i)}, s_2^{(i)})$ stands for two similar sentences and $y^{(i)}$ is the label. To fine-tune our model, we format the input sequence x as [CLS] $s_1$ [SEP] $s_2$ [SEP], where the [CLS] token is used for the final classification and the [SEP] token separates different sentences.

Second, for subtask B, the organizers had already released the correct explanations for each sentence during training and validation, so we use these data to generate explanations for the test data of subtask B; Section 3.2 gives more details. The generated explanations can then be used to improve the performance of our model. As shown in Figure 1, the input sequence consists of one false sentence, three options, and some explanations (either ground-truth or generated, depending on the phase). A training or validation sample can be cast as $(s, o_1, o_2, o_3, r, y)$ with ground-truth explanations $r$, and a test sample as $(s, o_1, o_2, o_3, \hat{e})$ with generated explanations $\hat{e}$. We keep the same structure to arrange the input but use additional special tokens to disambiguate the functions of the sentences: [OPTION] marks each of the three options and [EXP] marks explanations. The objective of both subtask A and subtask B is to maximize:

$$\sum_{(x, y)} \log P(y \mid x; \theta)$$
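As a rough illustration of the input layouts in this subsection, the sketch below assembles the classification inputs as plain strings. The function names and the exact ordering and placement of [OPTION] and [EXP] are assumptions; the text only states which special tokens are used, not the precise template.

```python
from typing import List, Optional

CLS, SEP, OPTION, EXP = "[CLS]", "[SEP]", "[OPTION]", "[EXP]"


def build_subtask_a_input(s1: str, s2: str) -> str:
    """Sentence-pair input: [CLS] s1 [SEP] s2 [SEP]."""
    return f"{CLS} {s1} {SEP} {s2} {SEP}"


def build_subtask_b_input(false_sent: str, options: List[str],
                          explanations: Optional[List[str]] = None) -> str:
    """False sentence, then each option after [OPTION], then each
    explanation (ground-truth or generated) after [EXP].
    The token placement is an assumed template, not the paper's exact one."""
    parts = [CLS, false_sent, SEP]
    for opt in options:
        parts += [OPTION, opt]
    for exp in explanations or []:
        parts += [EXP, exp]
    parts.append(SEP)
    return " ".join(parts)


a_input = build_subtask_a_input("The moon sets at night", "The sun sets at night")
b_input = build_subtask_b_input(
    "shoes can fly",
    ["There are many creatures that can fly",
     "Shoes do not have wings",
     "People cannot fly"],
    explanations=["<generated explanation>"],  # placeholder text
)
print(a_input)
print(b_input)
```

Passing `explanations=None` gives the layout without any explanation, which corresponds to the setting where no explanation is available at test time.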


3.2 Sub-task C

Here, we employ Commonsense Auto-Generated Explanations as in [17], generated by a language model. Subtask C provides one incorrect sentence and three reference explanations, all of which explain why the incorrect sentence does not make sense. Our LM is the large, pre-trained OpenAI GPT [11], which is a multi-layer transformer [16] decoder. GPT is fine-tuned on the task datasets. The input sequence during fine-tuning can be described as follows:

$$x = s \;\; \text{CUZ} \;\; e$$

where $s$ is the false sentence, $e$ is an explanation, and the special token CUZ means “is wrong may because”. The input context during testing is defined as follows:

$$C = s \;\; \text{CUZ}$$

The model is trained to generate the explanation with a conditional language modeling objective, which is to maximize:

$$\sum_{i} \log P(e_i \mid e_{i-k}, \ldots, e_{i-1}, C; \Theta)$$

where $k$ is the size of the context window (in our case $k$ is always greater than the length of $e$, so that the entire explanation is within the context). The conditional probability $P$ is modeled by a neural network with parameters $\Theta$, conditioned on $C$ and the previous explanation tokens.
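The CUZ-joined input and the conditional language modeling objective described in this subsection can be sketched as follows. The helper names are hypothetical, the per-token probabilities are made-up toy values standing in for the output of a language model, and treating CUZ as a literal string is a simplification of a special vocabulary token.

```python
import math
from typing import List

CUZ = "CUZ"  # special token meaning "is wrong may because"


def finetune_sequence(false_sent: str, explanation: str) -> str:
    """Sequence seen during fine-tuning: false sentence, CUZ, explanation."""
    return f"{false_sent} {CUZ} {explanation}"


def inference_context(false_sent: str) -> str:
    """Context given at test time; the model continues it with an explanation."""
    return f"{false_sent} {CUZ}"


def explanation_log_likelihood(token_probs: List[float]) -> float:
    """Sum of log P(e_i | previous explanation tokens, context).

    `token_probs[i]` is assumed to be the probability the model assigns to
    the i-th explanation token given the context and the preceding tokens.
    """
    return sum(math.log(p) for p in token_probs)


seq = finetune_sequence("shoes can fly", "Shoes do not have wings")
ctx = inference_context("shoes can fly")
# Toy probabilities for the five explanation tokens (illustrative only).
objective = explanation_log_likelihood([0.5, 0.25, 0.5, 0.5, 0.25])
print(seq)
print(ctx)
print(objective)
```

Training maximizes this quantity summed over all training explanations, which is equivalent to minimizing the usual token-level cross-entropy restricted to the explanation tokens.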

4 Experiment

It is important to make clear that all our experiments meet the requirements of the competition. We cannot use a dataset before it was released in the formal competition; for example, we cannot use subtask B data for subtask A or subtask C, because the subtask B data was released last.

4.1 Baseline of Sub-task A and Subtask B

As described before, the project consists of three subtasks. Subtask A is to choose, from two natural language statements with similar wording, which one makes sense and which one does not. Subtask B is to find the key reason why a given statement does not make sense. Subtask C asks the machine to generate the reasons. Subtasks A and B are evaluated by accuracy, and subtask C is evaluated using BLEU. To improve the reliability of the evaluation of subtask C, a random subset of the test set is used for a human evaluation of the systems with a relatively high BLEU score (this was not conducted in the post-evaluation period).
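The two kinds of metric named here can be illustrated with a short sketch. The code below computes plain accuracy and the clipped (modified) n-gram precision that BLEU is built from, using several references per candidate. It is a simplified illustration, not the official scorer: full BLEU additionally combines several n-gram orders and applies a brevity penalty, and the function names are assumptions.

```python
from collections import Counter
from typing import List, Sequence


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that equal the gold labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def ngrams(tokens: List[str], n: int) -> Counter:
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate: List[str], references: List[List[str]],
                       n: int) -> float:
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


# Classic degenerate candidate: clipping prevents a perfect unigram score.
candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
p1 = modified_precision(candidate, references, 1)
acc = accuracy([0, 1, 2, 1], [0, 1, 1, 1])
print(p1, acc)
```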

First, we use BERT and RoBERTa as our baselines, since both show impressive performance on many downstream NLP tasks. Table 1 compares BERT with RoBERTa using different corpora for each task. For subtask A, the RoBERTa model reaches the highest accuracy: 86.2% on the test set and 88.5% on the dev set. For subtask B, performance peaks at 82.3% test accuracy when we add data from subtask A. It caught our attention, however, that when we additionally use subtask C data the dev accuracy is extremely high while the test accuracy is clearly lower. We hypothesize that 1) the data from subtask C have great potential for solving subtask B, and 2) the model relies too heavily on subtask C data, resulting in very low performance without it. After we use the generated explanations at test time, the model improves considerably, which supports our hypothesis.

Model Task Test Dev Label
BERT Task A 85.3% 85.9% 2
Task B 79.1% 79.7% 3
 + Task A 73.4% 82.4% 3
 + Task C 82.5% 3
RoBERTa Task A 86.2% 88.5% 2
Task B 81.4% 84.6% 3
 + Task A 82.3% 84.5% 3
 + Task C 99.9% 3
Table 1: Task A and B: Baseline Results, represents we do not have explanations during test, but means we use our generated explanations, etc.

4.2 Explain and Predict

To better understand this anomaly, Table 2 presents results for different sampling ratios when we randomly choose whether or not to use the subtask C data. Specifically, under a 7:3 sampling ratio, each time we draw a sample from subtask B we inject additional explanations with 30% probability and leave the sample unchanged with 70% probability. If we decide to inject additional knowledge, we sample one, two, or three explanations with equal probability. We observe that the dev accuracy becomes a little lower, while the test accuracy improves to a comparable level. This approach forces the model to learn more from limited external knowledge, and the result further supports our earlier hypothesis. This observation motivates our “Explain, Reason and Predict” (ERP) system and provides an interpretable foundation for it. What remains is to improve our baseline, for which we try different approaches such as knowledge injection and multi-task learning.
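The stochastic injection procedure described above can be sketched in a few lines. The function name and signature are assumptions; `inject_prob=0.3` corresponds to the 7:3 setting as described in the text, and the explanation strings are placeholders.

```python
import random
from typing import List


def sample_explanations(explanations: List[str], inject_prob: float,
                        rng: random.Random) -> List[str]:
    """With probability `inject_prob`, return one, two, or three explanations
    (each count equally likely); otherwise return an empty list."""
    if rng.random() >= inject_prob:
        return []                              # train this sample without explanations
    k = rng.randint(1, len(explanations))      # uniform over 1..len(explanations)
    return rng.sample(explanations, k)         # k distinct explanations


rng = random.Random(0)  # fixed seed so the sketch is reproducible
pool = ["<explanation 1>", "<explanation 2>", "<explanation 3>"]
draws = [sample_explanations(pool, 0.3, rng) for _ in range(10000)]
injected_fraction = sum(1 for d in draws if d) / len(draws)
print(round(injected_fraction, 2))
```

Because the model sees many training samples with no explanation or only some of them, it cannot rely exclusively on the explanation text, which is the effect the experiment is designed to test.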

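| Sample Percent | Corpus | Test | Dev |
|---|---|---|---|
| 5:5 | Single Exp | 80.6% | 89.9% |
| 5:5 | + Task A | 80.5% | 86.6% |
| 5:5 | All Exp + Task A | 80.0% | 88.2% |
| 7:3 | Single Exp | 82.3% | 87.1% |
| 7:3 | + Task A | 81.1% | 85.5% |
| 7:3 | All Exp + Task A | 81.9% | 86.6% |

Table 2: Rational experiment on Task B. 5:5 means that 50% of the samples are trained with explanations and 50% without; for 7:3, explanations are injected with 30% probability, as described in the text.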

4.3 Multi-task

During the experiments, we found an interesting case illustrating that multi-task learning may help considerably on subtasks A and B. Consider a false sentence s: “an umbrella can help you keep warm in snowy days.” and three options: A: “we don’t wear umbrellas”, B: “umbrellas can keep you dry in snowy days”, C: “going outside is very crazy in snowy days”. The ground truth is A, but the model outputs B, which arguably explains the sentence better. Only after checking the corresponding true sentence in the subtask A dataset, “a thicker cloth can help you keep warm in snowy days”, do we see why the ground truth is A. Clearly, we need some knowledge from subtask A to help solve subtask B, so we think multi-task learning is a direction worth trying [8].

Rather than enriching semantic embeddings with a knowledge graph, we leverage existing datasets from different domains that also require commonsense reasoning, such as ARC and CommonsenseQA. We believe multi-task learning can learn more robust and universal embeddings, which improves our model over the baseline. In the following experiments, we test whether including additional datasets as external input can boost the performance of our ERP system.

Task Model Test Dev
Task A MT-SAN on Task A (single task) 91.7% 94.8%
MT-SAN on Task A+MNLI+SciTail+MRPC 92.8% 94.8%
MT-SAN (ensemble) 92.9% 95.1%
Task B MT-SAN on Task B (single task) 87.3% 88.1%
MT-SAN on Task B+ARC+CommonseQA 91.0%,
MT-SAN (ensemble) 89.7% 92.2%
Table 3: Multi-Task Result, means Explain and Predict we said before.

Table 3 shows the results obtained by our final multi-task ERP model. We report our two best models, which are ensembles of models trained with different dropout rates; see the following section for details. On subtask A, our ensemble model reaches 92.9% accuracy on the test set and 95.1% on the dev set, while on subtask B it reaches 89.7% accuracy on the test set and 93.5% on the dev set. The highest dev accuracy on subtask B indicates the strong potential of our ERP system. Training jointly with additional datasets provides a marginal improvement over single-task training, which we attribute to better model generalization under the multi-task setting.
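How the ensemble members are combined is not spelled out here. One common choice, shown below purely as an assumption, is to average the class probabilities produced by models trained with different dropout rates and pick the class with the highest mean. The function name and the probability values are illustrative.

```python
from typing import List, Tuple


def ensemble_predict(prob_lists: List[List[float]]) -> Tuple[int, List[float]]:
    """Average per-class probabilities across models and return
    (predicted class index, averaged probabilities)."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    pred = max(range(n_classes), key=avg.__getitem__)
    return pred, avg


# Toy outputs of three models (e.g. trained with different dropout rates)
# for one subtask B sample with options A, B, C.
member_probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.7, 0.2],
]
pred, avg = ensemble_predict(member_probs)
print("ABC"[pred], avg)
```

In this toy case the first model alone would pick A, but the averaged distribution favors B, which shows how disagreeing members can be reconciled.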

4.4 Implementation Details

Our implementation of MT-DNN is based on [8]. We used Adamax as our optimizer with a learning rate of 5e-5 and a batch size of 4. The maximum number of epochs was set to 5. A linear learning rate decay schedule with warm-up over 0.1 was used unless stated otherwise. We set the dropout rate of all the task-specific layers to 0.1, except for the ensemble models, for which we varied the dropout rate following [7] to obtain different models. To avoid the exploding gradient problem, we clipped the gradient norm to 1. All texts were tokenized using wordpieces and truncated to spans no longer than 512 tokens. We set the mixture ratio to 0.4 to re-weight the different tasks [19].
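Two of the training details above, the linear schedule with warm-up and gradient-norm clipping, can be written out as a small sketch. This is a pure-Python illustration rather than the MT-DNN code: interpreting "warm-up over 0.1" as the first 10% of training steps is an assumption, and the function names are hypothetical.

```python
import math
from typing import List


def linear_warmup_decay(step: int, total_steps: int,
                        base_lr: float = 5e-5, warmup: float = 0.1) -> float:
    """Learning rate that rises linearly during warm-up, then decays linearly to 0.
    `warmup` is assumed to be the fraction of steps spent warming up."""
    warmup_steps = max(1, int(total_steps * warmup))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


def clip_grad_norm(grads: List[float], max_norm: float = 1.0) -> List[float]:
    """Rescale the gradient vector so that its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


schedule = [linear_warmup_decay(s, 100) for s in (0, 5, 10, 55, 100)]
clipped = clip_grad_norm([3.0, 4.0])   # original norm is 5
print(schedule)
print(clipped)
```

In a framework such as PyTorch, the same effects are usually obtained with a learning-rate scheduler and a built-in gradient clipping utility rather than hand-written functions.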


4.5 Subtask C

Since this is a text generation problem, we use the GPT model as our baseline. Because some of the samples require knowledge from subtask A, we conducted contrast experiments using data from subtask A and from CoS-E [12]. We observed that adding the CoS-E explanations led to a very small decrease in test performance compared with the baseline, whereas adding data from subtask A improved the test BLEU score by about 0.3.

| Model | Corpus | Test | Dev |
|---|---|---|---|
| GPT | Task C | 12.65 | 5.96 |
| GPT | + Task A | 12.94 | 5.99 |
| GPT | + Aug | 12.31 | 6.54 |

Table 4: Task C: Commonsense Explanation

Compared with the original paper [17], our model achieves much higher accuracy on both subtask A and subtask B. Our system ranked 10th on 29 April 2020, with 92.9% accuracy on subtask A (rank 11), 89.7% accuracy on subtask B (rank 9), and a BLEU score of 12.9 on subtask C (rank 8). All results are as of 29 April 2020.

5 Analysis

Despite the strong performance of our model, it still fails on some samples in subtask A and subtask B, and a few sentences generated by our model cannot adequately explain why the given sentence does not make sense. An in-depth analysis of these samples shows that they can be clustered into several classes.

5.1 Error Analysis at Subtask A

  • Basic commonsense knowledge, which could be addressed by introducing an external knowledge graph such as ConceptNet (e.g., s1: The moon sets at night, s2: The sun sets at night, label: 1, prediction: 2).

  • Implicit commonsense knowledge. Current knowledge graphs do not contain all commonsense knowledge, because of memory limitations and the huge volume of common sense. Better solutions are still needed, such as more comprehensive knowledge representations, transfer learning, or other methods (e.g., s1: Cats have got seven lives, s2: Cats have got one life, label: 1, prediction: 2).

  • Specific domain knowledge required to make a correct judgment (e.g., s1: Hair is already dead, s2: Hair screams when you cut it, label: 2, prediction: 1). Since even humans may not know that hair consists of dead protein cells, it is a big challenge for the model to learn this rare domain knowledge from large corpora and datasets.

  • Others (e.g., s1: coffee takes sleep, s2: coffee depresses people, label: 2, prediction: 1; s1: The sun is black, s2: The sun is white, label: 1, prediction: 2).

5.2 Error Analysis at Subtask B

For subtask B, there are two different ways to address the problem, the conventional method and the explain-and-predict method, which leads to three different cases: 1) both methods are wrong; 2) the conventional method is wrong but the ERP system is correct; 3) the conventional method is correct but the ERP system is wrong. Based on the analysis below, we find that our ERP system can reason and make a more persuasive decision than the conventional method.

  1. Both methods output the wrong judgment

    1. Explanations from different perspectives or levels (e.g., s: Everyone loves reading horror novels. A: Horror novels are scary. B: Reading novels can be a good way to relax. C: Not everyone likes to read horror novels. label: C, prediction: A). There can be multiple explanations, at different levels, of why the given s does not make sense. Here, to explain why “Everyone loves reading horror novels” does not make sense, we consider both A and C correct if we assume the given s is already known to be false, since both fit the pattern “s is wrong because A or C”. The two simply explain the statement at different levels, one subtler and deeper than the other.

    2. Implicit commonsense knowledge, as in subtask A (e.g., s: drama plays are often performed before cows, A: this rural drama tells the story of a cow, B: the cow is a kind of animal while drama isn’t, C: a cow is unable to appreciate and understand the drama, label: C, prediction: B). In this example, the model needs to know that drama plays are appreciated and understood by people.

  2. Examples classified wrongly by the conventional method but correctly by our ERP system

    1. Lack of the reasoning capability that our ERP system is equipped with (e.g., s: shoes can fly, A: There are many creatures that can fly, B: Shoes do not have wings, C: People cannot fly, label: B, prediction: C). The conventional method cannot reason that wings are needed to fly.

    2. Basic commonsense knowledge. The model still needs external knowledge to support the right classification.

  3. About 1.10% of the samples are classified correctly by the conventional method but not by our model. We attribute this mostly to noise introduced by the multi-task setting.

    1. Capturing plausible but wrong knowledge (e.g., s: the lava was warm and soft, A: lava can destroy the warm and soft cake, B: lava is too hard to be soft, C: lava is too hot to be warm or soft, label: C, prediction: B). The model captured that something can be too hard to be soft, but it ignored the attributes of lava.

    2. Others (e.g., s: it is said that Santa comes on Thanksgiving Days, A: Santa comes on Christmas day, not Thanksgiving Day, B: Santa is a figure in legend, not reality, C: Santa is a figure in western culture, not eastern culture, label: A, prediction: B). We first thought this was caused by explanations from different perspectives or levels, as described above, but after checking the whole dataset we found a related example in the training data of subtask B (s: Santa Claus sent Jim a Christmas present, A: There aren’t Santa Claus in the world, B: Santa Claus is very busy, C: Santa Claus is old, label: A). This suggests that our model learns more robust and universal embeddings than the conventional method.

5.3 Error Analysis at Subtask C

Although most of the results make sense, some generated reasons still cannot adequately explain why the given sentence does not make sense. Most of the cases we found fall into two classes:

  • Wrong direction of explanation (e.g., s: The inverter was able to power the continent, e: inverter is not a living thing).

  • Repetition (e.g., s: sugar is used to make coffee sour, e: sugar is used to make coffee). As in this example, some outputs repeat words from the input.

6 Conclusion

In this paper we present our model for the Commonsense Validation and Explanation (ComVE) task in SemEval-2020. We explore multi-task learning to jointly learn how to infer the hidden commonsense fact and how to perform commonsense reasoning with the RoBERTa model, and we achieve competitive results. Our analysis indicates that the “Explain, Reason and Predict” approach helps improve the performance of RoBERTa and has strong reasoning capability. Our biggest regret in this competition is that we did not incorporate world knowledge by introducing knowledge graphs. Because of implicit commonsense knowledge and differing explanation perspectives, more effort is still needed on the model architecture and on finding more elegant ways to inject knowledge. Our positive results point to future work on extending the ERP approach to a variety of other commonsense reasoning tasks.


  • [1] N. Asher and L. Vieu (1995) Toward a geometry of common sense: a semantics and a complete axiomatization of mereotopology. In IJCAI (1), pp. 846–852. Cited by: §1.
  • [2] A. Bosselut, H. Rashkin, M. Sap, C. Malaviya, A. Çelikyilmaz, and Y. Choi (2019) COMET: commonsense transformers for automatic knowledge graph construction. In ACL, Cited by: §1.
  • [3] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. ArXiv abs/1810.04805. Cited by: §1, §2.2.
  • [4] M. Forbes and Y. Choi (2017) Verb physics: relative physical knowledge of actions and objects. In ACL, Cited by: §1.
  • [5] P. He, X. Liu, W. Chen, and J. Gao (2019) A hybrid neural network model for commonsense reasoning. ArXiv abs/1907.11983. Cited by: §1.
  • [6] B. Y. Lin, X. Chen, J. Chen, and X. Ren (2019) KagNet: knowledge-aware graph networks for commonsense reasoning. In EMNLP/IJCNLP, Cited by: §1.
  • [7] X. Liu, P. He, W. Chen, and J. Gao (2019) Improving multi-task deep neural networks via knowledge distillation for natural language understanding. ArXiv abs/1904.09482. Cited by: §1, §4.4.
  • [8] X. Liu, P. He, W. Chen, and J. Gao (2019) Multi-task deep neural networks for natural language understanding. In ACL, Cited by: §1, §4.3, §4.4.
  • [9] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized bert pretraining approach. ArXiv abs/1907.11692. Cited by: §2.2.
  • [10] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proc. of NAACL, Cited by: §1.
  • [11] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf. Cited by: §3.2.
  • [12] N. F. Rajani, B. McCann, C. Xiong, and R. Socher (2019) Explain yourself! leveraging language models for commonsense reasoning. In ACL, Cited by: §4.5.
  • [13] A. Rogers, O. Kovaleva, and A. Rumshisky (2020) A primer in bertology: what we know about how bert works. ArXiv abs/2002.12327. Cited by: §2.2.
  • [14] R. Speer, J. Chin, and C. Havasi (2016) ConceptNet 5.5: an open multilingual graph of general knowledge. In AAAI, Cited by: §1.
  • [15] A. Talmor, J. Herzig, N. Lourie, and J. Berant (2019) CommonsenseQA: a question answering challenge targeting commonsense knowledge. ArXiv abs/1811.00937. Cited by: §1.
  • [16] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §3.2.
  • [17] C. Wang, S. Liang, Y. Zhang, X. Li, and T. Gao (2019) Does it make sense? and why? a pilot study for sense making and explanation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4020–4026. Cited by: §1, §3.2, §4.5.
  • [18] X. Wang, P. Kapanipathi, R. Musa, M. Yu, K. Talamadupula, I. Abdelaziz, M. Chang, A. Fokoue, B. Makni, N. Mattei, and M. J. Witbrock (2019) Improving natural language inference using external knowledge in the science questions domain. In AAAI, Cited by: §1.
  • [19] Y. Xu, X. Liu, Y. Shen, J. Liu, and J. Gao (2018) Multi-task learning with sample re-weighting for machine reading comprehension. In NAACL-HLT, Cited by: §4.4.
  • [20] P. Zhong, Di. Wang, and C. Miao (2019) Knowledge-enriched transformer for emotion detection in textual conversations. In EMNLP/IJCNLP, Cited by: §1.
  • [21] X. Zhou, Y. Zhang, L. Cui, and D. Huang (2019) Evaluating commonsense in pre-trained language models. arXiv preprint arXiv:1911.11931. Cited by: §1.