Commonsense reasoning has been seen as one of the key abilities that intelligent machines need to perform various activities. SemEval 2020 Task 4 is a commonsense validation and explanation task inspired by Wang et al. (arXiv:1906.00363). The task consists of three subtasks. The first subtask is to choose, from two natural language statements with similar wording, the one that makes sense. The second subtask is to decide, among three options, the most crucial reason why a given statement does not make sense. The third subtask requires the machine to generate the reasons.
To make predictions or generate reasons, background knowledge is essential. A simple way to supplement that knowledge is to use plain text from natural language databases, e.g. Wikipedia. Intuitively, for a given statement, plain text can be retrieved by searching the database for similar sentences. In other words, each evidence sentence contains the keywords of the given statement. This method is used in some state-of-the-art models. However, for the purpose of explaining the reason, evidence whose wording is similar to the given statement may lack information, and can sometimes be misleading when the given statement does not make sense.
In this work, we propose a novel way of searching for evidence in plain text. We obtain evidence by searching for the meanings of the keywords in the given statement. In other words, we use evidence that contains the meanings of the keywords rather than the keywords themselves. The motivation is that these meanings may provide important information for explaining why a statement does or does not make sense. For example, in Figure 1, the definition of "aircraft carrier" ("A warship designed to carry aircraft") is given by the evidence obtained from the database. Such a definition explains well why the given statement is wrong and can be used when generating the reason. In contrast, evidence that merely contains both "aircraft carrier" and "human" will not be helpful.
We conduct experiments on subtasks A, B, and C. Results show that our evidence-searching method boosts performance on subtask C. In subtasks A and B, our team is in the top 10. In subtask C, our approach achieves a BLEU score of 18.5 (3rd place) and a human evaluation score of 2.08 (2nd place). Moreover, the best BLEU score in our experiments (20.4) outperforms the score we obtained in the competition.
2 Related Work
A task closely related to SemEval 2020 Task 4 is CommonsenseQA, in which commonsense knowledge is required to make the correct prediction. In the CommonsenseQA task, large-scale pre-trained models have brought significant performance gains. These gains are obtained by developing training strategies and enlarging the training data, or by improving parameter efficiency.
While some of the improvement is achieved by developing the pre-trained model itself, other approaches resort to external modules, e.g. knowledge extraction and graph-based reasoning. In our work, we focus on knowledge extraction, because developing the pre-trained model itself is computationally expensive.
3 System Description
We first describe our evidence-searching approach, which can be used in downstream tasks, in Section 3.1. Then we describe our systems for subtasks A and B in Section 3.2 and our system for subtask C in Section 3.3.
3.1 Evidence-Searching Approach
As shown in Figure 1, we first extract more than 1M two-element tuples (word, gloss) from Wiktionary (version: enwiktionary-20200220) with the help of the Wiktextract package (https://github.com/tatuylonen/wiktextract), and use Elasticsearch (https://www.elastic.co/) to index these tuples. Note that the same word can have many meanings, so in practice the tuples consist of 3 elements, with an additional element that indicates the importance of the meaning. Then, for each given statement, we extract its keywords with the help of spaCy (https://spacy.io/). For each keyword, we search for the tuples whose "word" field matches the keyword. The Elasticsearch engine ranks the tuples by matching score, and we select the top K tuples for each keyword. Thus the number of evidence tuples for a given statement is K*M (M denotes the number of keywords). In short, we search for the meanings of the keywords. Finally, the input sentences are produced by concatenating the original statement with the corresponding evidence. The detailed format varies across subtasks and is discussed later.
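The retrieval procedure can be illustrated with a minimal in-memory sketch. It is not the system's actual implementation: a plain dictionary stands in for the Elasticsearch index, a stopword filter stands in for spaCy keyword extraction, and the glosses are hand-written toy entries rather than real Wiktionary text. The names `TOY_INDEX`, `extract_keywords`, `search_evidence`, and `concatenate`, as well as the exact joining format, are assumptions made for illustration.

```python
import re

# Hand-written toy entries standing in for the indexed Wiktionary tuples.
# Each sense is (gloss, rank); a lower rank marks a more important sense.
TOY_INDEX = {
    "elephant": [
        ("Anything huge and ponderous.", 1),
        ("A large mammal with a trunk.", 0),
    ],
    "fridge": [("An appliance that keeps food cold.", 0)],
}

# Assumed stopword list; the paper relies on spaCy for keyword extraction.
STOPWORDS = {"a", "an", "the", "he", "she", "it", "into", "in", "on", "of", "to"}


def extract_keywords(statement):
    """Crude keyword extraction: lowercase tokens minus stopwords, in order."""
    keywords = []
    for token in re.findall(r"[a-z]+", statement.lower()):
        if token not in STOPWORDS and token not in keywords:
            keywords.append(token)
    return keywords


def search_evidence(statement, index, k=1):
    """Return up to k (word, gloss) tuples per keyword, best-ranked first."""
    evidence = []
    for keyword in extract_keywords(statement):
        senses = sorted(index.get(keyword, []), key=lambda sense: sense[1])
        evidence.extend((keyword, gloss) for gloss, _ in senses[:k])
    return evidence


def concatenate(statement, evidence):
    """Join the statement and its evidence into one input string (assumed format)."""
    context = " ".join(f"{word}: {gloss}" for word, gloss in evidence)
    return f"{statement} Context: {context}"


statement = "He put an elephant into the fridge."
evidence = search_evidence(statement, TOY_INDEX, k=1)
print(concatenate(statement, evidence))
```

In this sketch a keyword with no matching entry (here "put") contributes no evidence, so K*M is an upper bound on the number of tuples actually returned.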
3.2 Systems for Subtask A&B
Both subtask A and subtask B require selecting one among several choices, so we use the same model architecture for both. In a nutshell, our model is RoBERTa (A Robustly Optimized BERT Pretraining Approach), since large-scale pre-trained contextualized representations have been found to capture a certain degree of commonsense knowledge. This is also supported by our experimental results.
For subtask A, we adapt pre-trained RoBERTa to the subtask A dataset. We denote the hidden size of each layer (transformer block) as $H$, and the aggregate sequence representation as $C \in \mathbb{R}^{H}$ (the final hidden state corresponding to the special [CLS] token). The only task-specific parameter we introduce is a vector $V \in \mathbb{R}^{H}$. For each example, we input the two choices separately and obtain the final aggregate representation $C_i$ of each choice, whose dot product with $V$ gives a score for choice $i$. The probability distribution is then the softmax over the choices:

$$P_i = \frac{\exp(V \cdot C_i)}{\sum_{j=1}^{n} \exp(V \cdot C_j)} \tag{1}$$

where $n$ denotes the number of choices, which is 2 in subtask A. At test time, the model's prediction is the choice with the highest probability:

$$\hat{y} = \arg\max_{i \in \{1, \dots, n\}} P_i$$

The model is trained with backpropagation, using the negative log-likelihood as the loss function.
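The scoring, softmax, prediction, and loss steps described above can be sketched in plain Python. This is only an illustration under stated assumptions: the vectors are tiny hand-picked lists rather than RoBERTa outputs, and the names `task_vector`, `cls_reps`, `choice_probabilities`, `predict`, and `nll_loss` are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def choice_probabilities(task_vector, cls_reps):
    """Softmax over the per-choice scores (task vector dotted with each [CLS] vector)."""
    scores = [dot(task_vector, rep) for rep in cls_reps]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def predict(probs):
    """Index of the choice with the highest probability."""
    return max(range(len(probs)), key=probs.__getitem__)


def nll_loss(probs, gold_index):
    """Negative log-likelihood of the gold choice."""
    return -math.log(probs[gold_index])


# Toy values: hidden size 2 and two choices, as in subtask A.
task_vector = [1.0, -1.0]
cls_reps = [[2.0, 0.0], [0.0, 1.0]]
probs = choice_probabilities(task_vector, cls_reps)
print(probs, predict(probs), nll_loss(probs, 0))
```

The same functions cover subtask B unchanged by passing three representations instead of two.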
In subtask B, given a statement that does not make sense, the task is to select, from three options, the key reason why it does not make sense. Adapting RoBERTa to the subtask B dataset is similar to the adaptation for subtask A. For each example, we construct three input sentences by concatenating the statement with each of the three choices, and input these sentences separately. We compute the score of each choice according to Equation 1, where n is 3 in subtask B, and then follow the procedure described for subtask A.
3.3 System for Subtask C
Subtask C is an NLG (Natural Language Generation) task, which is quite different from subtasks A and B. This subtask also requires commonsense knowledge and reasoning ability, which makes it more challenging. Our model for subtask C is BART (Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension). BART is a sequence-to-sequence model using a standard Transformer-based neural machine translation architecture. It is pre-trained by learning to reconstruct the original text from corrupted text. It uses the same pre-training data as RoBERTa, consisting of 160GB of news, books, stories, and web text.
Specifically, we adapt pre-trained BART to the subtask C dataset. For each example, we follow the evidence-searching approach of Section 3.1 to obtain evidence for the given statement. In the subtask C dataset, each statement has 3 reference reasons. During training, we construct 3 new examples for each example in the original dataset, pairing the same input with each of its three reference reasons in turn. Thus the total number of training examples is 3*N (N denotes the original number of training examples). We denote this method as Multi-target training, since the same input has multiple different targets. For each new example, the input sentence is the concatenation of the statement and the evidence. Because BART has an autoregressive decoder, it can be directly fine-tuned for such a sequence generation task and can generate outputs autoregressively.
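The expansion of each original example into one training pair per reference reason can be sketched as follows. The function name `expand_multi_target` and the toy example are illustrative assumptions; the reasons shown are hand-written placeholders, not dataset entries.

```python
def expand_multi_target(examples):
    """Turn (source, [reference, ...]) examples into one (source, reference) pair each."""
    expanded = []
    for source, references in examples:
        for reference in references:
            expanded.append((source, reference))
    return expanded


# Toy data: one input with three hand-written reference reasons.
original = [
    (
        "He put an elephant into the fridge.",
        [
            "An elephant is much bigger than a fridge.",
            "A fridge is too small to hold an elephant.",
            "Elephants do not fit in fridges.",
        ],
    ),
]
training_pairs = expand_multi_target(original)
print(len(training_pairs))  # 3 * N, with N = 1 here
```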
4 Experiments
We use the officially released dataset and the standard train/trial/dev split of SemEval 2020 Task 4 for our experiments. We report the performance of the best settings on the test split. Note that we compare different settings by training the model on the train and trial splits and testing it on the dev split, since testing on the test split is inconvenient. We also give the configuration of our final submissions for subtasks A, B, and C in Sections 4.1, 4.2, and 4.3, respectively.
4.1 Experiments for Subtask A
We implement RoBERTa in FAIRSEQ. RoBERTa is optimized with Adam with a weight decay of 0.01. The learning rate is warmed up over the first 800 steps to a peak value of 1e-5, and then polynomially decayed. The gradient clipping threshold is 0.1. RoBERTa is fine-tuned with a dropout of 0.1 on all layers and attention weights. It is fine-tuned for S=8,000 updates, with mini-batches containing B=8 sequences of maximum length T=512 tokens.
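The warmup-then-decay schedule can be sketched as a small function. The peak value, warmup length, and total number of updates come from the text; the decay power of 1.0 and the final learning rate of 0 are assumptions of this sketch, since the text does not state them, and `learning_rate` is a hypothetical helper rather than a FAIRSEQ function.

```python
def learning_rate(step, peak_lr=1e-5, warmup_steps=800, total_steps=8000,
                  power=1.0, end_lr=0.0):
    """Linear warmup to peak_lr, then polynomial decay to end_lr at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step >= total_steps:
        return end_lr
    remaining = 1.0 - (step - warmup_steps) / (total_steps - warmup_steps)
    return (peak_lr - end_lr) * remaining ** power + end_lr


for step in (0, 400, 800, 4400, 8000):
    print(step, learning_rate(step))
```

With these assumed defaults the rate rises linearly to 1e-5 at step 800 and falls linearly to 0 at step 8,000; a power other than 1.0 would bend the decay curve.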
In experiments where we add evidence, each statement has several (word, gloss) tuples. The evidence for a statement is formatted as "⟨word 1⟩: ⟨gloss 1⟩ ⟨word 2⟩: ⟨gloss 2⟩ ...". We construct two input sentences for each example, one per statement, in the format "⟨statement 1⟩ Context: ⟨evidence 1⟩" and "⟨statement 2⟩ Context: ⟨evidence 2⟩". Note that, because of memory limitations, we use the memory-efficient floating-point option provided by FAIRSEQ when we add evidence.
We train the model on a single 11GB GeForce RTX 2080 GPU for around 15 minutes when we input statements without evidence, and on 2 GPUs for around 100 minutes when we add evidence.
We notice that in the subtask A dataset there are some statements whose letters are all capitalized (e.g., A GIRL WON THE RACE WITH HER FRIEND). For these statements, we capitalize the first letter and convert the other letters to lowercase. We denote this operation as Lowercase in Table 1.
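This normalization can be sketched in a few lines. The helper name `normalize_case` is hypothetical, and the sketch assumes that only fully capitalized statements are modified while all others are left untouched.

```python
def normalize_case(statement):
    """Capitalize the first letter and lowercase the rest of an all-caps statement."""
    if statement.isupper():
        return statement.capitalize()
    return statement


print(normalize_case("A GIRL WON THE RACE WITH HER FRIEND"))
print(normalize_case("A girl won the race with her friend"))
```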
As shown in Table 1, performance improves slightly after lowercasing, since otherwise different forms of a word can be mapped to different embeddings even though they have the same meaning. We then explore the effect of evidence. By adding evidence to the input statement, we obtain a slightly better result on the development set, while performance degrades on the test set. We hypothesize that this discrepancy arises because the amount of noise in the evidence is unstable.
The configuration of our final submission for subtask A uses the settings described in the first paragraph of this section, with the lowercase operation and without additional evidence. Our approach achieves an accuracy of 95.3% on the subtask A test set.
Table 1: Results for subtask A.
|Setting|Dev Acc|Test Acc|
|+ Lowercase + Evidence|96.3|94.1|
4.2 Experiments for Subtask B
We primarily follow the optimization hyperparameters given in Section 4.1, except for the batch size, the number of warmup steps, and the total number of updates, which are 4, 500, and 10,000, respectively. In the following experiments, we also apply the lowercase operation of Section 4.1 as part of the default setting.
In subtask A, two similar statements are given, one of which makes sense while the other does not. In subtask B, the statement that does not make sense is given, so the other one (the statement that makes sense) can be used as a kind of evidence, since the words that differ between the two statements may be the keywords for explaining why the given statement does not make sense. We denote this evidence as Reasonable Statement, and the evidence from Wiktionary as Wiktionary.
In experiments where we do not add any evidence, we construct input sentences in the format "The statement '⟨statement⟩' is absurd. Because ⟨choice⟩", in which we add some additional words ("The statement '...' is absurd. Because") to the sentence; we denote this technique as Extra Words. In experiments where we add the reasonable statement, the input format for each choice is "Reasonable statement: ⟨reasonable statement⟩ The statement '⟨statement⟩' is absurd. Because ⟨choice⟩". If the Wiktionary evidence is also added, the format is "Context: ⟨evidence⟩ Reasonable statement: ⟨reasonable statement⟩ The statement '⟨statement⟩' is absurd. Because ⟨choice⟩".
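The input templates for subtask B can be sketched with one helper that adds each optional part in turn. The function name `build_subtask_b_input` and its parameters are hypothetical, and the positions at which the statement, choice, reasonable statement, and evidence are filled in are an assumption based on the fixed words of the templates.

```python
def build_subtask_b_input(statement, choice, reasonable=None, evidence=None):
    """Assemble one input string for a (statement, choice) pair."""
    parts = []
    if evidence:
        parts.append(f"Context: {evidence}")
    if reasonable:
        parts.append(f"Reasonable statement: {reasonable}")
    # The Extra Words template wraps the nonsensical statement and the choice.
    parts.append(f"The statement '{statement}' is absurd. Because {choice}")
    return " ".join(parts)


print(build_subtask_b_input(
    "He put an elephant into the fridge.",
    "an elephant is much bigger than a fridge.",
    reasonable="He put a turkey into the fridge.",
))
```

Calling the helper once per candidate reason yields the three inputs that are scored for each example.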
In Table 2 we present the results of the different settings. The extra words bring a 1.6% absolute improvement, since they indicate that the given statement does not make sense and that the choice is the reason for that. Using the corresponding reasonable statement and the Wiktionary evidence achieves comparable performance while incurring extra computational cost. Therefore, we only use the extra words in the final submission for subtask B and achieve an accuracy of 93.2%.
Table 2: Results for subtask B.
|Setting|Dev Acc|Test Acc|
|+ Extra Words|93.2|93.2|
|+ Extra Words + Reasonable Statement|93.2|-|
|+ Extra Words + Wiktionary|92.4|-|
|+ Extra Words + Reasonable Statement + Wiktionary|93.1|-|
4.3 Experiments for Subtask C
BART is also implemented in FAIRSEQ. It is optimized with Adam with a weight decay of 0.01. The learning rate is warmed up over the first 500 steps to a peak value of 3e-5, and then polynomially decayed. The gradient clipping threshold is 0.1. BART is fine-tuned with a dropout of 0.1 on all layers and attention weights. It is fine-tuned for S=1,200 updates, with mini-batches containing B=32 sequences of maximum length T=512 tokens. We keep the same settings in all of the following experiments. We train the model on four 11GB GeForce RTX 2080 GPUs, for 15 minutes to 1 hour depending on the setting.
In the following experiments, we follow the input format described in Section 4.2, removing only the choice. We conduct experiments on different combinations of the methods described above (Multi-target: Section 3.3; Extra Words, Reasonable Statement, and Wiktionary: Section 4.2) and explore their effects.
As shown in Table 3, by using Multi-target training we obtain a 4.51 improvement in BLEU score. Compared to the baseline, where we simply use the first reference answer of each example in the training data as the target of the model output, the Multi-target method provides more training data and thus helps the model perform better. Table 3 also shows that performance degrades when we add some extra material but improves again as we add more. We hypothesize that the degradation occurs because the complexity of the input sentence increases as we add extra material. When Extra Words, Reasonable Statement, and Wiktionary are all added, the benefits outweigh the disadvantages. Therefore, we use all of them in the final submission for subtask C and achieve a BLEU score of 18.5 (3rd place) and a human evaluation score of 2.08 (2nd place), which is 0.14 higher than 3rd place and only 0.02 lower than 1st place.
The best score (20.39) on the test set in Table 3 outperforms the score we achieved during the competition (18.5), so we might achieve a better human evaluation score accordingly. There are two reasons for this improvement. First, we optimized our evidence-searching approach after the competition, removing a large proportion of noisy data. Second, we observe that during training the model performs well at the beginning but degrades later. It is therefore difficult to choose the best checkpoint during training when we cannot evaluate it on the test set. After the competition ended, however, we could evaluate our checkpoints on the test set and choose the best one.
Table 3: Results for subtask C.
|Setting|Dev BLEU|Test BLEU|
|+ Multi-target + Extra Words|18.10|-|
|+ Multi-target + Wiktionary|18.98|-|
|+ Multi-target + Extra Words + Wiktionary|19.50|-|
|+ Multi-target + Extra Words + Reasonable Statement|18.98|-|
|+ Multi-target + Extra Words + Reasonable Statement + Wiktionary|20.03|20.39|
5 Conclusion
In this work, we choose different large-scale pre-trained models as the backbones for the three subtasks and propose a novel way to search for evidence, which aims to obtain the meanings of the keywords in the given statement. Our experiments demonstrate the importance of additional knowledge for language models to understand the content. The results show that our evidence-searching approach is helpful for the commonsense explanation task.
References
- Commonsense reasoning and commonsense knowledge in artificial intelligence.
- (2014) Adam: a method for stochastic optimization.
- ALBERT: a lite BERT for self-supervised learning of language representations.
- (2019) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension.
- (2019) KagNet: knowledge-aware graph networks for commonsense reasoning.
- (2019) RoBERTa: a robustly optimized BERT pretraining approach. CoRR abs/1907.11692.
- (2019) Graph-based reasoning over heterogeneous external knowledge for commonsense question answering.
- (2019) Towards generalizable neuro-symbolic systems for commonsense question answering.
- (2019-06) Fairseq: a fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), Minneapolis, Minnesota, pp. 48–53.
- (2016) ConceptNet 5.5: an open multilingual graph of general knowledge. CoRR abs/1612.03975.
- (2018) CommonsenseQA: a question answering challenge targeting commonsense knowledge. CoRR abs/1811.00937.
- (2020) SemEval-2020 task 4: commonsense validation and explanation. In Proceedings of the 14th International Workshop on Semantic Evaluation.
- (2019) Evaluating commonsense in pre-trained language models.