LMVE at SemEval-2020 Task 4: Commonsense Validation and Explanation using Pretraining Language Model

by   Shilei Liu, et al.
Northeastern University

This paper describes our submission to subtask a and b of SemEval-2020 Task 4. For subtask a, we use an ALBERT-based model with an improved input form to pick out the common sense statement from two candidate statements. For subtask b, we use a multiple choice model enhanced by a hint sentence mechanism to select, from the given options, the reason why a statement is against common sense. In addition, we propose a novel transfer learning strategy between subtasks which helps improve the performance. The accuracy scores of our system are 95.6 / 94.9 on the official test set, ranking 7th / 2nd on the Post-Evaluation leaderboard.




1 Introduction

This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/

Common sense verification and explanation is an important and challenging task in artificial intelligence and natural language processing. It is a simple task for human beings, because they can make full use of the external knowledge accumulated in their daily lives. However, common sense verification and reasoning is difficult for machines. According to Wang et al.[12], even some state-of-the-art language models such as ELMo[9] and BERT[3] perform very poorly on it. It is therefore crucial to integrate commonsense awareness into natural language understanding models[2].

SemEval-2020 Task 4[11] aims to improve the common sense judgment ability of models, and we participated in two of its subtasks. The dataset of SemEval-2020 Task 4 is named ComVE. Each instance in ComVE is composed of 5 sentences: two statements $s_1$ and $s_2$ and three options $o_1$, $o_2$ and $o_3$. $s_1$ and $s_2$ are used for subtask a, and $s_1$ or $s_2$ together with $o_1$, $o_2$ and $o_3$ is used for subtask b.

Subtask a (also known as the Sen-Making task) aims to test a model's ability of commonsense validation. Specifically, given two statements with similar wording and syntax, the objective of a Sen-Making model is to determine which statement makes more sense. For example, if $s_1$ is "put the elephant in the refrigerator" and $s_2$ is "put the turkey in the refrigerator", a good model needs to judge that the latter makes more sense.

Subtask b (also known as the Explanation task) is a multiple choice task that aims to find the key reason why a given statement does not make sense. For example, given a sentence $s$ that violates common sense and three options $o_1$, $o_2$ and $o_3$, where $s$ is "he put an elephant into the fridge", $o_1$ is "an elephant is much bigger than a fridge", $o_2$ is "elephants are usually gray while fridges are usually white", and $o_3$ is "an elephant cannot eat a fridge", the model needs to judge that $o_1$ is the correct option.

The official baseline for Sen-Making uses a pretraining language model (PLM) to dynamically encode the two input sentences, uses a simple fully connected neural network to calculate their perplexities separately, and chooses the sentence with the lower score as the correct one. We believe that the baseline method treats the two sentences independently and ignores the relationship between them, so we propose a novel model structure that fully considers the interaction between statements.

The official baseline for Explanation treats the task as a BERT-like multiple choice task[3]. We think that the baseline model does not make full use of the input data, so we design a structure that incorporates the other, common sense statement into the existing model.

In addition, we believe that fine-tuning on a similar subtask can improve the performance on the current subtask because the two subtasks have much in common, so we propose a novel transfer learning mechanism between Sen-Making and Explanation.

The proposed system is named LMVE. It is a neural network model based on a large-scale pretraining language model, and it includes two sub-modules to solve subtask a and subtask b.

Our contributions are as follows:

  • First, we propose subtask level transfer learning, which helps share information between subtasks.

  • Second, we propose a novel structure to calculate the perplexity of a sentence, which takes into account the interaction between the sentences in a pair.

  • Third, we propose the hint sentence mechanism, which helps improve the performance on the multiple choice task (subtask b).

2 System Description

We consider our model for both Sen-Making and Explanation as two parts: an encoder and a decoder. The encoder is mainly used to obtain the contextual representation of the input sentence tokens. In recent years, pretraining language models including BERT[3], RoBERTa[7] and ALBERT[6] have proven beneficial for many natural language processing (NLP) tasks[10, 1]. These pretrained models have learned general-purpose language representations from a large amount of unlabeled data, so adapting them to downstream tasks provides a good parameter initialization and avoids training from scratch[14]. We therefore tried several popular PLMs as encoders. The decoder consists of several simple linear layers with far fewer parameters than the encoder, and its role is to fuse the output of the encoder and predict the answer.

2.1 LMVE for Sen-Making Task

Figure 1: The model architecture for the Sen-Making task; (a) is the official baseline and (b) is ours. The yellow point denotes the vector representation of the output sequence (same as below).

Figure 1(a) is the official baseline model[12], which treats the two sentences as independent. In ComVE, however, the two statements share certain lexical and grammatical similarities, so we think that modeling the interaction between them helps improve the performance of the model. Figure 1(b) gives an overview of our model for the Sen-Making task, which is mainly composed of three modules: token encoding, feature fusion and answer prediction.

Encoding: Let $s_1$ and $s_2$ represent the one-hot vectors of the first and the second sentence in an instance. We first concatenate them and add some special tokens as in Figure 1(b), which gives two sequences $x_1$ and $x_2$. The two sequences are fed into ALBERT separately. We use $H_1^{(l)}$ and $H_2^{(l)}$ to denote the outputs of the $l$-th transformer block in ALBERT for the two sequences, each of shape $n \times d$, where $d$ is the hidden size of the model and $n$ is the sequence length.
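As an illustration of the encoding step, the sketch below builds two input sequences from a statement pair. The exact token layout is given in Figure 1(b), which is not reproduced here, so the ordering used (the candidate statement first, the other statement second) and the function name `build_sen_making_inputs` are assumptions made for this example.

```python
from typing import Tuple

CLS, SEP = "[CLS]", "[SEP]"


def build_sen_making_inputs(s1: str, s2: str) -> Tuple[str, str]:
    """Return one encoder input per candidate statement.

    Assumed layout: the candidate being scored comes first and the
    other statement of the pair follows as context.
    """
    x1 = f"{CLS} {s1} {SEP} {s2} {SEP}"
    x2 = f"{CLS} {s2} {SEP} {s1} {SEP}"
    return x1, x2


x1, x2 = build_sen_making_inputs(
    "put the elephant in the refrigerator",
    "put the turkey in the refrigerator",
)
print(x1)
print(x2)
```

In practice a tokenizer would insert the special tokens itself when given a sentence pair; the strings are built by hand here only to make the pairing visible.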

Fusion: Pretraining language models such as BERT usually take the first token (corresponding to [CLS]) of the output of the last transformer block as the representation of a sequence. Instead, we use the weighted sum of the first-token representations from all transformer block outputs as the final representation (subsequent experimental results show that using the last 4 layers performs best). The following equations describe the fusion process:


where $w$ is a trainable parameter. We can regard $h_i$ as the representation of the $i$-th statement ($i \in \{1, 2\}$).
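The weighted-sum fusion can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: normalizing the layer weights with a softmax is an assumption, and `fuse_cls` and its arguments are hypothetical names.

```python
import math
from typing import List


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_cls(cls_per_layer: List[List[float]], layer_weights: List[float]) -> List[float]:
    """Weighted sum of the first-token vector taken from each transformer block.

    cls_per_layer[l] is the [CLS] vector output by block l;
    layer_weights holds one trainable scalar per block.
    """
    if len(cls_per_layer) != len(layer_weights):
        raise ValueError("need exactly one weight per layer")
    alphas = softmax(layer_weights)  # assumption: weights are softmax-normalized
    dim = len(cls_per_layer[0])
    return [
        sum(a * layer[k] for a, layer in zip(alphas, cls_per_layer))
        for k in range(dim)
    ]


# With equal weights the fusion reduces to a plain average over layers.
print(fuse_cls([[1.0, 0.0], [3.0, 2.0]], [0.0, 0.0]))
```

Restricting the sum to the last 4 layers, as the text reports works best, amounts to passing only those layers' vectors and weights.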

Answer Prediction: This module maps the output of the fusion layer to a probability distribution over answers. Given $W$ and $b$ as learnable parameters, it calculates the answer probability as


We define the training loss as the cross-entropy loss:


where $N$ is the number of samples in the dataset.
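For orientation, one form of the fusion, prediction and loss steps that is consistent with the descriptions in this section is sketched below. It is not taken from the paper: the softmax normalization of the layer weights, the per-statement score $z_i$, the number of blocks $L$, the label symbol $y^{(j)}$ and the averaging over samples are all assumptions.

```latex
\begin{aligned}
\alpha &= \operatorname{softmax}(w), \qquad w \in \mathbb{R}^{L} \\
h_i &= \sum_{l=1}^{L} \alpha_l \, H_i^{(l)}[0], \qquad i \in \{1, 2\} \\
z_i &= W h_i + b, \qquad p = \operatorname{softmax}\big([z_1, z_2]\big) \\
\mathcal{L} &= -\frac{1}{N} \sum_{j=1}^{N} \log p^{(j)}_{y^{(j)}}
\end{aligned}
```

Here $H_i^{(l)}[0]$ is the first-token ([CLS]) row of the $l$-th block output for sequence $i$, and $p^{(j)}_{y^{(j)}}$ is the probability the model assigns to the true label of the $j$-th sample.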

2.2 Hint Sentence mechanism

Before formally introducing our model for Explanation task, let’s first introduce the hint sentence mechanism.

In the official baseline[12], the statement that is against common sense is concatenated with each of the three options and fed into the model. We believe that this form of input does not make full use of the data in ComVE. Specifically, for an instance in ComVE, assume that $s_1$ does not conform to common sense; then $s_1$ together with $o_1$, $o_2$ and $o_3$ is used to train the baseline model or to predict the answer. In this process, however, $s_2$ is discarded. We believe that the other, common sense statement ($s_2$) in the statement pair contains useful information and should be incorporated into our model.

We therefore propose the hint sentence mechanism. A hint sentence is a common sense statement whose wording and syntax are similar to the given statement that is against common sense, differing from it by only a few words. In other words, we call the other sentence in the sentence pair the hint sentence.

How the hint sentence is integrated into the existing model is described in the next section and Figure 2. The results of the ablation experiment (Sec 3.6) show that the hint sentence mechanism can greatly improve the performance of our model on the Explanation task.

2.3 LMVE for Explanation Task

Figure 2: The model architecture for Explanation task.

Figure 2 gives an overview of our model for the Explanation task. It also has three modules.

Let $s$, $h$ and $o_i$ represent the one-hot vectors of the input statement, the hint sentence and the $i$-th option in an instance, where $i \in \{1, 2, 3\}$. We first concatenate them and add some special tokens as in Figure 2, which gives three sequences, one per option. The three sequences are then fed into ALBERT separately.
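The sketch below shows how a hint sentence could be added to each multiple choice input. The precise layout is defined by Figure 2, which is not reproduced here, so the segment order (statement, hint, option) and the name `build_explanation_inputs` are assumptions for illustration only.

```python
from typing import List, Sequence

CLS, SEP = "[CLS]", "[SEP]"


def build_explanation_inputs(statement: str, hint: str, options: Sequence[str]) -> List[str]:
    """Return one encoder input per option, each carrying the hint sentence.

    Assumed layout: nonsensical statement, then the common sense hint,
    then the candidate reason.
    """
    return [f"{CLS} {statement} {SEP} {hint} {SEP} {option} {SEP}" for option in options]


inputs = build_explanation_inputs(
    "he put an elephant into the fridge",
    "he put a turkey into the fridge",  # hypothetical hint sentence
    [
        "an elephant is much bigger than a fridge",
        "elephants are usually gray while fridges are usually white",
        "an elephant cannot eat a fridge",
    ],
)
for x in inputs:
    print(x)
```

Dropping the hint segment from each string gives the baseline input form, which is the setting labelled "w/o hint sentence" in the ablation study.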

As in the previous sub-section, each sequence yields a representation vector after fusion, and the three representation vectors are then passed through a linear layer as in Equation 4 to calculate the probability distribution over answers.

We define training loss as


where $N$ is the number of samples in the dataset and $y$ is the true label.

2.4 Subtask Level Transfer Learning

Figure 3: The process of subtask level transfer learning.

Transfer learning is a research problem in machine learning that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem. Using a PLM is a typical example of transfer learning, and we call it task level transfer learning.

The Sen-Making task and the Explanation task are both generalized multiple choice tasks, and their input data are related, so we believe that in SemEval-2020 Task 4, fine-tuning on a similar subtask can improve the performance on the current subtask.

Subtask level transfer learning refers to using the encoder fine-tuned on subtask a (Sen-Making) to train subtask b (Explanation), and vice versa. The process of subtask level transfer learning is shown in Figure 3.
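The idea can be sketched with plain Python containers standing in for model weights. This is a toy illustration under stated assumptions: the `Model` class, `new_decoder` and `transfer` are hypothetical names, and re-initializing the decoder for the target subtask is an assumption, since the text only states that the fine-tuned encoder is reused.

```python
import copy
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Model:
    encoder: Dict[str, List[float]]  # stands in for the PLM weights
    decoder: Dict[str, List[float]]  # task-specific linear head


def new_decoder(hidden: int, seed: int) -> Dict[str, List[float]]:
    """Create a freshly initialized linear head."""
    rng = random.Random(seed)
    return {"weight": [rng.uniform(-0.1, 0.1) for _ in range(hidden)], "bias": [0.0]}


def transfer(source: Model, hidden: int, seed: int = 0) -> Model:
    """Start the target subtask from the source subtask's fine-tuned encoder."""
    return Model(
        encoder=copy.deepcopy(source.encoder),  # reuse fine-tuned encoder weights
        decoder=new_decoder(hidden, seed),      # assumption: the head is re-initialized
    )


# Pretend this model was fine-tuned on Sen-Making (subtask a).
sen_making = Model(encoder={"layer0": [0.5, -0.2, 0.1]}, decoder=new_decoder(3, seed=1))
# Initialize the Explanation (subtask b) model from it, then fine-tune as usual.
explanation = transfer(sen_making, hidden=3, seed=2)
print(explanation.encoder == sen_making.encoder)
```

The same call with the roles swapped covers the "vice versa" direction, from Explanation to Sen-Making.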

3 Experiments and Analysis

Figure 4: Learning curve on the training dataset.

3.1 Dataset

ComVE includes 10000 samples in the train set and 1000 samples each in the dev and test sets for both the Sen-Making and Explanation tasks. The two statements in the Sen-Making task have the same average length, 8.26. The average length of the true reasons in the Explanation task is 7.63.

It should be noted that in SemEval-2020 Task 4, the test set of the Explanation task is released only after the Sen-Making task is completed, so it is impossible to use the Explanation test set to infer the answers of the subtask a test set.

3.2 Baseline

To verify the effectiveness of our model, we used ALBERT to replace the BERT in the official baseline, leaving the rest unchanged. We do not perform subtask level transfer learning (Sec 2.4) on them.

3.3 Preprocessing

Data Augmentation: To enhance the robustness of our model, we use Google Sheets (https://www.google.com/sheets/about) to apply back translation to the original texts and obtain augmented texts. Specifically, given a training sample, we first translate the original statements to French and then translate them back to English. The back-translated sample is added to the training dataset as a new sample, so the size of the dataset doubles after augmentation.

Tokenization: We employ the tokenizer that comes with the HuggingFace[13] PyTorch implementation of ALBERT. The tokenizer lowercases the input and applies SentencePiece encoding[5] to split input words into the most frequent subwords present in the pre-training corpus. Non-English characters are removed.
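The augmentation loop can be sketched as follows. The translation backend is deliberately left abstract: `translate` is a hypothetical callable taking `(text, source_language, target_language)`, since the paper performs this step in Google Sheets rather than in code, and the `(statement_1, statement_2, label)` sample layout is likewise an assumption.

```python
from typing import Callable, List, Tuple

Sample = Tuple[str, str, int]  # (statement 1, statement 2, label), assumed layout
Translator = Callable[[str, str, str], str]  # (text, src, tgt) -> translated text


def back_translate(text: str, translate: Translator, pivot: str = "fr") -> str:
    """Translate English text to the pivot language and back to English."""
    return translate(translate(text, "en", pivot), pivot, "en")


def augment(samples: List[Sample], translate: Translator) -> List[Sample]:
    """Append one back-translated copy of every sample; labels are unchanged."""
    augmented = list(samples)
    for s1, s2, label in samples:
        augmented.append(
            (back_translate(s1, translate), back_translate(s2, translate), label)
        )
    return augmented


def fake_translate(text: str, src: str, tgt: str) -> str:
    """Stand-in translator used only to make the example runnable."""
    return text.upper() if tgt == "fr" else text.lower()


data = [("Put the elephant in the refrigerator", "Put the turkey in the refrigerator", 1)]
print(len(augment(data, fake_translate)))
```

A real translator would return a paraphrase rather than a case change; the stub only demonstrates that the dataset size doubles and the labels carry over.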

3.4 Implementation Details

We use the Transformers toolkit (https://github.com/huggingface/transformers) to implement our model and tune the hyper-parameters according to performance on the development set. The hidden size is equal to that of the corresponding PLM. To train our model, we employ the AdamW algorithm[8] with an initial learning rate of 2e-5 and a mini-batch size of 48.

We also prepared an ensemble model consisting of 7 models for the Sen-Making task and 19 for the Explanation task, trained with different hyperparameter settings and random seeds. We used a majority voting strategy to fuse the candidate predictions of the different models.
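A majority voting ensemble of this kind can be sketched as below. The tie-breaking rule (choose the smallest label among the tied ones) is an assumption; the paper does not say how ties are resolved, and they can only occur in the three-option Explanation task.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(votes: Sequence[int]) -> int:
    """Return the most frequent label; ties go to the smallest label (assumption)."""
    counts = Counter(votes)
    best = max(counts.values())
    return min(label for label, count in counts.items() if count == best)


def ensemble(per_model_predictions: Sequence[Sequence[int]]) -> List[int]:
    """Fuse predictions; per_model_predictions[m][i] is model m's label for sample i."""
    return [majority_vote(votes) for votes in zip(*per_model_predictions)]


predictions = [
    [0, 1, 2],  # model 1
    [0, 1, 1],  # model 2
    [1, 1, 2],  # model 3
]
print(ensemble(predictions))
```

With an odd number of models, as used for Sen-Making, a binary task never produces a tie.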

Model           Params   Sen-Making   Explanation
Random          -        49.52        32.77
BERT[3]         117M     88.56        85.32
BERT[3]         340M     86.55        90.12
XLNet[15]       340M     90.33        91.07
SpanBERT[4]     340M     89.46        90.47
RoBERTa[7]      355M     93.56        92.37
ALBERT[6]       12M      86.63        84.37
ALBERT[6]       18M      88.01        89.72
ALBERT[6]       60M      92.03        92.45
Ours(ALBERT)    235M     95.68        95.48
Our-ensemble    -        95.91        96.39
Table 1: Performance (accuracy) with different encoders.

3.5 Main Result

The results of our model for subtask a and subtask b are summarized in Table 1. We tried different pretraining language models as our encoder and found that the ALBERT-based model achieves the best performance on both subtasks.

Figure 4 shows a learning curve computed over the provided training data and evaluated on the development set. In the low-resource case (using only 10%-20% of the target task's training data), the model with subtask level transfer learning performs significantly better than the original implementation.

Sen-Making
Model                            Acc     Δ
Our-single                       95.68   -
w/o method b                     94.88   0.80
w/o data augmentation            95.43   0.25
w/o weighted sum fusion          95.32   0.36
w/o subtask level transfer       94.85   0.83
Baseline                         93.24   2.44

Explanation
Model                            Acc     Δ
Our-single                       95.48   -
w/o hint sentence                93.47   2.01
w/o data augmentation            95.39   0.09
w/o weighted sum fusion          95.11   0.37
w/o subtask level transfer       94.98   0.50
Baseline                         93.12   2.36
Table 2: Ablation study on model components. "w/o method b" means we use the model structure of Figure 1(a), and Baseline means the model in Sec 3.2. Δ is the drop in accuracy relative to Our-single.
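The last column of Table 2 is the accuracy lost when a component is removed. The short check below recomputes it from the accuracy column; the dictionary and the helper name `drops` are introduced only for this example.

```python
from typing import Dict

sen_making = {
    "Our-single": 95.68,
    "w/o method b": 94.88,
    "w/o data augmentation": 95.43,
    "w/o weighted sum fusion": 95.32,
    "w/o subtask level transfer": 94.85,
    "Baseline": 93.24,
}

explanation = {
    "Our-single": 95.48,
    "w/o hint sentence": 93.47,
    "w/o data augmentation": 95.39,
    "w/o weighted sum fusion": 95.11,
    "w/o subtask level transfer": 94.98,
    "Baseline": 93.12,
}


def drops(results: Dict[str, float], full: str = "Our-single") -> Dict[str, float]:
    """Accuracy lost by each variant relative to the full single model."""
    return {
        name: round(results[full] - acc, 2)
        for name, acc in results.items()
        if name != full
    }


for name, delta in drops(sen_making).items():
    print(f"{name}: {delta:.2f}")
for name, delta in drops(explanation).items():
    print(f"{name}: {delta:.2f}")
```

Read this way, removing the hint sentence costs the Explanation model 2.01 points, the largest single-component drop in either half of the table.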

3.6 Ablation Study

To gain better insight into our model architecture, we conduct an ablation study on the dev set of ComVE; the results are shown in Table 2.

From the results we can see that subtask level transfer learning makes a relatively large contribution for both subtask a and b, which confirms our hypothesis that fine-tuning on a similar task can improve the performance on the current task. Data augmentation and weighted sum fusion make smaller contributions, which we attribute to a more robust dataset and a more robust model, respectively.

For subtask a, we can see that, compared with the baseline method (Figure 1(a)), concatenating the other sentence to the input yields higher performance. We speculate that this is because the traditional method treats the two statements as independent, whereas our method takes into account the inherent connection between them.

For subtask b, we can see from Table 2 that the hint sentence makes a large contribution to the overall improvement. We think the reason is that a common sense statement with similar grammar and syntax helps the model determine why the input sentence is against common sense.

4 Conclusions

This paper introduces our system for commonsense validation and explanation. For the Sen-Making task, we use a novel architecture based on a pretraining language model to pick out the one of the two given statements that is against common sense. For the Explanation task, we use a hint sentence mechanism that greatly improves performance. In addition, we propose subtask level transfer learning to share information between subtasks.

As future work, we plan to integrate an external knowledge base (such as ConceptNet, http://www.conceptnet.io/) into commonsense inference.


We thank the reviewers for their helpful comments. This work is supported by the National Natural Science Foundation of China (No. 61572120) and the Fundamental Research Funds for the Central Universities (No. N181602013).


  • [1] S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning (2015) A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, L. Màrquez, C. Callison-Burch, J. Su, D. Pighin, and Y. Marton (Eds.), pp. 632–642.
  • [2] E. Davis (2017) Logical formalizations of commonsense reasoning: A survey. J. Artif. Intell. Res. 59, pp. 651–723.
  • [3] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
  • [4] M. Joshi, D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, and O. Levy (2019) SpanBERT: improving pre-training by representing and predicting spans. CoRR abs/1907.10529.
  • [5] T. Kudo (2018) Subword regularization: improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, I. Gurevych and Y. Miyao (Eds.), pp. 66–75.
  • [6] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2020) ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In International Conference on Learning Representations.
  • [7] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692.
  • [8] I. Loshchilov and F. Hutter (2019) Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.
  • [9] M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 2227–2237.
  • [10] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, J. Su, X. Carreras, and K. Duh (Eds.), pp. 2383–2392.
  • [11] C. Wang, S. Liang, Y. Jin, Y. Wang, X. Zhu, and Y. Zhang (2020) SemEval-2020 task 4: commonsense validation and explanation. In Proceedings of The 14th International Workshop on Semantic Evaluation.
  • [12] C. Wang, S. Liang, Y. Zhang, X. Li, and T. Gao (2019) Does it make sense? and why? A pilot study for sense making and explanation. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, A. Korhonen, D. R. Traum, and L. Màrquez (Eds.), pp. 4020–4026.
  • [13] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew (2019) HuggingFace's transformers: state-of-the-art natural language processing. ArXiv abs/1910.03771.
  • [14] Y. Xu, X. Qiu, L. Zhou, and X. Huang (2020) Improving BERT Fine-Tuning via Self-Ensemble and Self-Distillation. arXiv:2002.10345 [cs].
  • [15] Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: Generalized Autoregressive Pretraining for Language Understanding. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, pp. 5754–5764.