Align, Mask and Select: A Simple Method for Incorporating Commonsense Knowledge into Language Representation Models

08/19/2019 · Zhi-Xiu Ye, et al. · USTC

Neural language representation models such as Bidirectional Encoder Representations from Transformers (BERT), pre-trained on large-scale corpora, can capture rich semantics from plain text and can be fine-tuned to consistently improve performance on various natural language processing (NLP) tasks. However, existing pre-trained language representation models rarely consider explicitly incorporating commonsense knowledge or other kinds of knowledge. In this paper, we develop a pre-training approach for incorporating commonsense knowledge into language representation models. We construct a commonsense-related multi-choice question answering dataset for pre-training a neural language representation model. The dataset is created automatically by our proposed "align, mask, and select" (AMS) method. We also investigate different pre-training tasks. Experimental results demonstrate that pre-training models using the proposed approach, followed by fine-tuning, achieves significant improvements over the original BERT models on various commonsense-related tasks, such as CommonsenseQA and the Winograd Schema Challenge, while maintaining comparable performance on other NLP tasks, such as sentence classification and natural language inference (NLI).


1 Introduction

Pre-trained language representation models, including feature-based methods [Pennington, Socher, and Manning2014, Peters et al.2017] and fine-tuning methods [Howard and Ruder2018, Radford et al.2018, Devlin et al.2018], can capture rich language information from text and benefit many NLP tasks. Bidirectional Encoder Representations from Transformers (BERT) [Devlin et al.2018], as one of the most recently developed models, has produced state-of-the-art results by simple fine-tuning on various NLP tasks, including named entity recognition (NER) [Sang and De Meulder2003], text classification [Wang et al.2018], natural language inference (NLI) [Bowman et al.2015], and question answering (QA) [Rajpurkar et al.2016, Zellers et al.2018], and has achieved human-level performance on several datasets [Rajpurkar et al.2016, Zellers et al.2018].

However, commonsense reasoning is still a challenging task for modern machine learning methods. For example, [Talmor et al.2018] recently proposed a commonsense-related task, CommonsenseQA, and showed that BERT accuracy remains dozens of points lower than human accuracy on questions requiring commonsense knowledge. Some examples from CommonsenseQA are shown in part A of Table 1. As can be seen from the examples, although it is easy for humans to answer these questions based on their knowledge about the world, they pose a great challenge for machines when there is limited training data.

A) Some examples from the CommonsenseQA dataset
What can eating lunch cause that is painful?
headache, gain weight, farts, bad breath, heartburn*
What is the main purpose of having a bath?
cleanness*, water, exfoliation, hygiene, wetness
Where could you find a shark before it was caught?
business, marine museum, pool hall, tomales bay*, desert
B) Some triples from ConceptNet
(eating dinner, Causes, heartburn)
(eating dinner, MotivatedByGoal, not get headache)
(lunch, Antonym, dinner)
(have bath, HasSubevent, cleaning)
(shark, AtLocation, tomales bay)
Table 1: Some examples from the CommonsenseQA dataset (part A) and some related triples from ConceptNet (part B). The correct answer for each question in part A is marked with an asterisk (*).

We hypothesize that exploiting commonsense knowledge graphs in QA modeling can help the model choose correct answers. For example, as shown in part B of Table 1, some triples from ConceptNet [Speer, Chin, and Havasi2017] are closely related to the questions above. Exploiting these triples in QA modeling may help the QA models make correct decisions.

In this paper, we propose a pre-training approach that can leverage commonsense knowledge graphs, such as ConceptNet [Speer, Chin, and Havasi2017], to improve the commonsense reasoning capability of language representation models, such as BERT. At the same time, the proposed approach aims to maintain performance comparable to the original BERT models on other NLP tasks. Incorporating commonsense knowledge into language representation models is challenging because commonsense knowledge is represented in a structured format, such as (concept1, relation, concept2) in ConceptNet, which is inconsistent with the data used for pre-training language representation models. For example, BERT is pre-trained on the BooksCorpus and English Wikipedia, which are composed of unstructured natural language sentences.

To tackle the challenge mentioned above, inspired by the distant supervision approach [Mintz et al.2009], we propose the "align, mask and select" (AMS) method, which aligns a commonsense knowledge graph with a large text corpus to construct a dataset of sentences with labeled concepts. Different from the pre-training tasks used for BERT, the masked language model (MLM) and next sentence prediction (NSP) tasks, we use the generated dataset for a multi-choice question answering task. We then pre-train the BERT model on this dataset with the multi-choice question answering task and fine-tune it on various commonsense-related tasks, such as CommonsenseQA [Talmor et al.2018] and the Winograd Schema Challenge (WSC) [Levesque, Davis, and Morgenstern2012], achieving significant improvements. We also fine-tune and evaluate the pre-trained models on other NLP tasks, such as the sentence classification and NLI tasks in GLUE [Wang et al.2018], and achieve performance comparable to the original BERT models.

In summary, the contributions of this paper are threefold. First, we propose a pre-training approach for incorporating commonsense knowledge into language representation models to improve their commonsense reasoning capabilities. Second, we propose an "align, mask and select" (AMS) method, inspired by the distant supervision approaches, to automatically construct a multi-choice question answering dataset. Third, experiments demonstrate that the model pre-trained with the proposed approach, after fine-tuning, achieves significant performance improvements on several commonsense-related tasks, such as CommonsenseQA [Talmor et al.2018] and the Winograd Schema Challenge [Levesque, Davis, and Morgenstern2012], while maintaining comparable performance on several sentence classification and NLI tasks in GLUE [Wang et al.2018].

2 Related Work

2.1 Language Representation Model

Language representation models have demonstrated their effectiveness in improving many NLP tasks. These approaches can be categorized into feature-based approaches and fine-tuning approaches. The early Word2Vec [Mikolov et al.2013] and GloVe [Pennington, Socher, and Manning2014] models focused on feature-based approaches that transform words into distributed representations. However, these methods cannot disambiguate word senses in context. [Peters et al.2018] further proposed Embeddings from Language Models (ELMo), which derives context-aware word vectors from a bidirectional LSTM trained with a coupled language model (LM) objective on a large text corpus.

The fine-tuning approaches differ from the above-mentioned feature-based approaches, which only use the pre-trained language representations as input features. [Howard and Ruder2018] pre-trained sentence encoders from unlabeled text and fine-tuned them for supervised downstream tasks. [Radford et al.2018] proposed the Generative Pre-trained Transformer (GPT), built on the Transformer architecture [Vaswani et al.2017], to learn language representations. [Devlin et al.2018] proposed a deep bidirectional model with multi-layer Transformers (BERT), which achieved state-of-the-art performance on a wide variety of NLP tasks. The advantage of these approaches is that few parameters need to be learned from scratch.

Though both feature-based and fine-tuning language representation models have achieved great success, they do not incorporate commonsense knowledge. In this paper, we focus on incorporating commonsense knowledge into the pre-training of language representation models.

2.2 Commonsense Reasoning

Commonsense reasoning is a challenging task for modern machine learning methods. As demonstrated in recent work [Zhong et al.2018], incorporating commonsense knowledge into question answering models in a model-integration fashion helps improve commonsense reasoning ability. Instead of ensembling two independent models as in [Zhong et al.2018], an alternative direction is to directly incorporate commonsense knowledge into a unified language representation model. [Sun et al.2019] proposed directly pre-training BERT on commonsense knowledge triples. For any triple (concept1, relation, concept2), they took the concatenation of concept1 and relation as the question and concept2 as the correct answer. Distractors were formed by randomly picking words or phrases in ConceptNet. In this work, we also investigate directly incorporating commonsense knowledge into a unified language representation model. However, we hypothesize that the language representations learned in [Sun et al.2019] may be degraded, since the inputs to the model constructed this way are not natural language sentences. To address this issue, we propose a pre-training approach for incorporating commonsense knowledge that includes a method to construct large-scale natural language sentences. [Rajani et al.2019] collected the Common Sense Explanations (CoS-E) dataset using Amazon Mechanical Turk and applied a Commonsense Auto-Generated Explanations (CAGE) framework to language representation models, such as GPT and BERT. However, collecting this dataset required a large amount of human effort. In contrast, in this paper, we propose an "align, mask and select" (AMS) method, inspired by the distant supervision approaches, to automatically construct a multi-choice question answering dataset.

2.3 Distant Supervision

The distant supervision approach was originally proposed for generating training data for the relation classification task. The distant supervision approach [Mintz et al.2009] assumes that if two entities/concepts participate in a relation, all sentences that mention these two entities/concepts express that relation. Note that it is inevitable that there exists noise in the data labeled by distant supervision [Riedel, Yao, and McCallum2010]. In this paper, instead of employing the relation labels produced by distant supervision, we focus on the aligned entities/concepts. We propose the AMS method to construct a multi-choice QA dataset: we align sentences with commonsense knowledge triples, mask the aligned words (entities/concepts) in the sentences and treat the masked sentences as questions, and select several entities/concepts from the knowledge graph as candidate choices.

(1) A triple from ConceptNet
(population, AtLocation, city)
(2) Align with the English Wikipedia dataset to obtain a sentence containing "population" and "city"
The largest city by population is Birmingham, which has long been the most industrialized city.
(3) Mask "city" with a special token "[QW]"
The largest [QW] by population is Birmingham, which has long been the most industrialized city?
(4) Select distractors by searching (population, AtLocation, *) in ConceptNet
(population, AtLocation, Michigan)
(population, AtLocation, Petrie dish)
(population, AtLocation, area with people inhabiting)
(population, AtLocation, country)
(5) Generate a multi-choice question answering sample
question: The largest [QW] by population is Birmingham, which has long been the most industrialized city?
candidates: city (correct answer), Michigan, Petrie dish, area with people inhabiting, country
Table 2: The detailed procedure of constructing one multi-choice question answering sample. The * in the fourth step is a wildcard character. The correct answer for the question is marked explicitly in the fifth step.

3 Commonsense Knowledge Base

This section describes the commonsense knowledge base investigated in our experiments. We use ConceptNet (https://github.com/commonsense/conceptnet5/wiki) [Speer, Chin, and Havasi2017], one of the most widely used commonsense knowledge bases. ConceptNet is a semantic network that represents a large set of words and phrases and the commonsense relationships between them. It contains over 21 million edges and over 8 million nodes. Its English vocabulary contains approximately 1,500,000 nodes, and it contains at least 10,000 nodes for each of 83 languages. ConceptNet contains a core set of 36 relations.

Each instance in ConceptNet can be generally represented as a triple (concept1, relation, concept2), indicating a relation between the two concepts concept1 and concept2. For example, the triple (semicarbazide, IsA, chemical compound) means that "semicarbazide is a kind of chemical compound"; the triple (cooking dinner, Causes, cooked food) means that "the effect of cooking dinner is cooked food", etc.

4 Constructing Pre-training Dataset

In this section, we describe the details of constructing the commonsense-related multi-choice question answering dataset. Firstly, we filter the triples in ConceptNet with the following steps: (1) filter out triples in which one of the concepts is not an English word; (2) filter out triples with the general relations "RelatedTo" and "IsA", which account for a large proportion of ConceptNet; (3) filter out triples in which one of the concepts has more than four words or the edit distance between the two concepts is less than four. After filtering, we obtain 606,564 triples.
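For concreteness, the three filtering rules can be sketched in a few lines of Python. This is a minimal sketch of our own, not the authors' released code; the is_english predicate and the triple representation are illustrative assumptions.

def edit_distance(a, b):
    """Standard Levenshtein distance via a single-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                        # deletion
                        dp[j - 1] + 1,                    # insertion
                        prev + (a[i - 1] != b[j - 1]))    # substitution
            prev = cur
    return dp[-1]

def keep_triple(concept1, relation, concept2, is_english):
    """Return True if a ConceptNet triple survives the three filtering rules of Section 4."""
    if not (is_english(concept1) and is_english(concept2)):
        return False                                      # rule (1): non-English concept
    if relation in ("RelatedTo", "IsA"):
        return False                                      # rule (2): overly general relations
    if max(len(concept1.split()), len(concept2.split())) > 4:
        return False                                      # rule (3): concept longer than four words
    if edit_distance(concept1, concept2) < 4:
        return False                                      # rule (3): concepts too similar
    return True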

Each training sample is generated by three steps: align, mask and select, which we call the AMS method. Each sample in the dataset consists of a question and several candidate answers, in the same form as the CommonsenseQA dataset. An example of constructing one training sample by masking concept2 is shown in Table 2.

Firstly, we align each triple (concept1, relation, concept2) from ConceptNet with the English Wikipedia dataset to extract sentences in which both concepts appear. Secondly, we mask concept1 or concept2 in a sentence with a special token [QW] and treat this sentence as a question, where [QW] is a replacement for question words such as "what" and "where". The masked concept1/concept2 is the correct answer for this question. Thirdly, for generating the distractors, [Sun et al.2019] proposed forming distractors by randomly picking words or phrases in ConceptNet. In this paper, in order to generate more confusing distractors than this random selection approach, we require that the distractors and the correct answer share the same concept1 or concept2 and the same relation. That is, we search (*, relation, concept2) and (concept1, relation, *) in ConceptNet to select the distractors instead of selecting them randomly, where * is a wildcard character that can match any word or phrase. For each question, we retain four distractors and one correct answer. If there are fewer than four matched distractors, we discard the question instead of complementing it with randomly selected distractors. If there are more than four matched distractors, we randomly select four of them. After applying the AMS method, we create 16,324,846 multi-choice question answering samples.
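The AMS procedure itself can be summarized in code. The sketch below is our illustration only: the in-memory triples list stands in for ConceptNet, sentences stands in for English Wikipedia, and for brevity only concept2 is masked, whereas the full procedure can mask either concept.

import random

def align_mask_select(triple, sentences, triples, num_distractors=4):
    """Build one multi-choice QA sample from a ConceptNet triple (illustrative sketch)."""
    concept1, relation, concept2 = triple
    # Align: find a Wikipedia sentence that mentions both concepts.
    matched = [s for s in sentences if concept1 in s and concept2 in s]
    if not matched:
        return None
    sentence = matched[0]
    # Mask: replace the first occurrence of the concept with [QW] and form a question.
    question = sentence.replace(concept2, "[QW]", 1).rstrip(".") + "?"
    # Select: distractors must share concept1 and the relation with the correct answer,
    # i.e., they match the pattern (concept1, relation, *).
    distractors = [c2 for (c1, r, c2) in triples
                   if c1 == concept1 and r == relation and c2 != concept2]
    if len(distractors) < num_distractors:
        return None                 # discard rather than pad with random picks
    candidates = random.sample(distractors, num_distractors) + [concept2]
    random.shuffle(candidates)
    return {"question": question, "candidates": candidates, "answer": concept2}

# Reproducing the Table 2 example with toy data:
triples = [("population", "AtLocation", c) for c in
           ["city", "Michigan", "Petrie dish", "area with people inhabiting", "country"]]
sentences = ["The largest city by population is Birmingham, "
             "which has long been the most industrialized city."]
print(align_mask_select(("population", "AtLocation", "city"), sentences, triples))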

Dataset Train Develop Test Candidates
CommonsenseQA 9741 1221 1140 5
WSC Task 1322 564 273 2
Table 3: The statistics of CommonsenseQA and Winograd Schema Challenge datasets.
Model Accuracy
BERT_base 53.0
BERT_large 56.7
CoS-E [Rajani et al.2019] 58.2
BERT_CS_base 56.2
BERT_CS_large 62.2
Table 4: Accuracy (%) of different models on the CommonsenseQA test set.

5 Pre-training BERT_CS

We investigate a multi-choice question answering task for pre-training the English BERT_base and BERT_large models released by Google (https://github.com/google-research/bert) on our constructed dataset. The resulting models are denoted BERT_CS_base and BERT_CS_large, respectively. We then investigate the performance of fine-tuning the BERT_CS models on several NLP tasks, including commonsense-related tasks and common NLP tasks, presented in Section 6.

To reduce the large cost of training BERT_CS models from scratch, we initialize the BERT_CS models (both BERT_CS_base and BERT_CS_large) with the parameter weights released by Google. We concatenate the question with each candidate answer to construct a standard input sequence for BERT_CS (i.e., "[CLS] the largest [QW] by ... ? [SEP] city [SEP]", where [CLS] and [SEP] are two special tokens), and the hidden representation over the [CLS] token is fed into a softmax layer to create the predictions.

The objective function is defined as follows:

\mathcal{L} = -\log p(c \mid q),  (1)

p(c_i \mid q) = \frac{\exp(\mathbf{w}^\top \mathbf{h}_i)}{\sum_{k=1}^{N} \exp(\mathbf{w}^\top \mathbf{h}_k)},  (2)

where c is the correct answer, \mathbf{w} denotes the parameters of the softmax layer, N is the total number of candidates, and \mathbf{h}_i is the vector representation of the special token [CLS] for the i-th question-candidate sequence. We pre-train the BERT_CS models with a batch size of 160 and a maximum sequence length of 128 for 1 epoch. The pre-training is conducted on 16 NVIDIA V100 GPU cards with 32 GB memory, taking about 3 days for the BERT_CS_large model and 1 day for the BERT_CS_base model.
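To make the objective concrete, the following PyTorch sketch scores the N question-candidate sequences with a shared linear layer over the [CLS] representation and applies the cross-entropy loss of Equations 1 and 2. The wrapper class and its names are ours; encoder is assumed to be any BERT-like module that returns per-token hidden states, and this is not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiChoiceQAHead(nn.Module):
    """Softmax over candidates, scored from the [CLS] vector (Equations 1 and 2, sketch)."""

    def __init__(self, encoder, hidden_size):
        super().__init__()
        self.encoder = encoder                     # BERT-like encoder (assumption)
        self.scorer = nn.Linear(hidden_size, 1)    # the softmax-layer parameters w

    def forward(self, input_ids, attention_mask, labels=None):
        # input_ids, attention_mask: (batch, num_candidates, seq_len), one sequence per candidate.
        b, n, t = input_ids.shape
        hidden = self.encoder(input_ids.view(b * n, t),
                              attention_mask.view(b * n, t))    # (b * n, t, hidden_size)
        cls_vec = hidden[:, 0]                     # representation h_i of the [CLS] token
        logits = self.scorer(cls_vec).view(b, n)   # w^T h_i for each candidate
        loss = F.cross_entropy(logits, labels) if labels is not None else None   # -log p(c | q)
        return logits, loss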

Model WSC non-assoc. assoc. unswitched switched consist. WNLI
LM ensemble 63.7 60.6 83.8 63.4 53.4 44.3 -
Knowledge Hunter 57.1 58.3 50.0 58.8 58.8 90.1 -
BERT + MTP 70.3 70.8 67.6 73.3 70.1 59.5 70.5
[Ruan et al.2019] 71.1 69.5 81.1 74.1 72.5 66.4 -
[Kocijan et al.2019] 72.2 71.6 75.7 74.8 72.5 61.1 71.9
BERT + MCQA 71.4 69.9 81.1 71.8 64.9 82.4 78.5
BERT_CS + MCQA 75.5 73.7 86.5 74.8 73.3 86.3 83.6
Table 5: Accuracy (%) of different models on the Winograd Schema Challenge dataset together with its subsets and the WNLI test set. MTP denotes masked token prediction, which is employed in [Kocijan et al.2019]. MCQA denotes multi-choice question-answering format, which is employed in this paper.
Model MNLI-(m/mm) QQP QNLI SST-2 CoLA STS-B MRPC RTE
BERT_base 84.6/83.4 71.2 90.5 93.5 52.1 85.8 88.9 66.4
BERT_triple 83.8/82.6 70.5 89.9 92.9 49.6 85.3 88.7 65.1
BERT_CS_base 84.7/83.9 72.1 91.2 93.6 54.3 86.4 85.9 69.5
BERT_large 86.7/85.9 72.1 92.7 94.9 60.5 86.5 89.3 70.1
BERT_CS_large 86.7/85.8 72.1 92.6 94.1 60.7 86.6 89.0 70.7
Table 6: Results (%) of different models on the GLUE test sets. We report Matthews correlation on CoLA, Spearman correlation on STS-B, accuracy on MNLI, QNLI, SST-2 and RTE, and F1-score on QQP and MRPC, the same metrics as [Devlin et al.2018].

6 Experiments

In this section, we investigate the performance of fine-tuning the BERT_CS models on several NLP tasks. Note that when fine-tuning on multi-choice QA tasks, i.e., CommonsenseQA and the Winograd Schema Challenge (see Sections 6.1 and 6.2), we fine-tune all parameters in BERT_CS, including the final softmax layer over the token [CLS]; for other tasks, we randomly initialize the classifier layer and train it from scratch.

Additionally, as described in [Devlin et al.2018], fine-tuning BERT is sometimes observed to be unstable on small datasets, so we run experiments with 5 different random seeds and select the best model based on the development set for all of the fine-tuning experiments in this section.

6.1 CommonsenseQA

In this subsection, we conduct experiments on a commonsense-related multi-choice question answering benchmark, the CommonsenseQA dataset [Talmor et al.2018]. The CommonsenseQA dataset consists of 12,247 questions with one correct answer and four distractor answers. This dataset consists of two splits – the question token split and the random split. Our experiments are conducted on the more challenging random split, which is the main evaluation split according to [Talmor et al.2018]. The statistics of the CommonsenseQA dataset are shown in Table 3.

As in the pre-training stage, the input data for fine-tuning the BERT_CS models are formed by concatenating each question-answer pair into a sequence. The hidden representations over the [CLS] token are fed into a softmax layer to create the predictions. The objective function is the same as in Equations 1 and 2. We fine-tune the BERT_CS models on CommonsenseQA for 2 epochs with a learning rate of 1e-5 and a batch size of 16.
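For illustration, forming the five input sequences for one CommonsenseQA question could look as follows; this is a sketch with a schematic tokenize callable, and the actual WordPiece tokenization and padding are omitted.

def build_choice_inputs(question, candidates, tokenize):
    """Form one "[CLS] question [SEP] candidate [SEP]" sequence per candidate (sketch)."""
    return [tokenize("[CLS] " + question + " [SEP] " + cand + " [SEP]")
            for cand in candidates]

question = "Where could you find a shark before it was caught?"
candidates = ["business", "marine museum", "pool hall", "tomales bay", "desert"]
sequences = build_choice_inputs(question, candidates, tokenize=str.split)
# The five tokenized sequences are then stacked into the (num_candidates, seq_len) input
# of the scoring sketch in Section 5; the label index of the correct answer here is 3.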

Table 4 shows the accuracies on the CommonsenseQA test set for the baseline BERT models released by Google, the previous state-of-the-art model CoS-E [Rajani et al.2019], and our BERT_CS models. Note that the CoS-E model requires a large amount of human effort to collect the Common Sense Explanations (CoS-E) dataset, whereas we construct our multi-choice question answering dataset automatically. The BERT_CS models significantly outperform their baseline BERT counterparts. BERT_CS_large achieves a 5.5% absolute improvement on the CommonsenseQA test set over the baseline BERT_large model and a 4% absolute improvement over the previous SOTA CoS-E model.

6.2 Winograd Schema Challenge

The Winograd Schema Challenge (WSC) [Levesque, Davis, and Morgenstern2012] was introduced for testing AI agents for commonsense knowledge. The WSC consists of 273 instances of the pronoun disambiguation problem (PDP). For example, given the sentence "The delivery truck zoomed by the school bus because it was going so fast." and the corresponding question "What does the word 'it' refer to?", the machine is expected to answer "delivery truck" instead of "school bus". In this task, we follow [Kocijan et al.2019] and employ the WSCR dataset [Rahman and Ng2012] as extra training data. The WSCR dataset is split into a training set of 1322 examples and a test set of 564 examples. We use these two sets for fine-tuning and validating the BERT_CS models, respectively, and test the fine-tuned BERT_CS models on the WSC dataset.

We transform the pronoun disambiguation problem into a multi-choice question answering problem. We mask the pronoun with the special token [QW] to construct a question and use the two candidate phrases as candidate answers. The remaining procedure is the same as for the QA tasks above. We use the same loss function as [Kocijan et al.2019]; that is, if c_1 is correct and c_2 is not, the loss is

\mathcal{L} = -\log p(c_1 \mid q) + \alpha \cdot \max\big(0, \log p(c_2 \mid q) - \log p(c_1 \mid q) + \beta\big),  (3)

where p(\cdot \mid q) follows Equation 2 with N = 2, and \alpha and \beta are two hyper-parameters. Similar to [Kocijan et al.2019], we search \alpha and \beta by comparing the accuracy on the WSCR test set (i.e., the development set for the WSC dataset). We set the batch size to 16. We evaluate our models on the WSC dataset, as well as on the various partitions of the WSC dataset described in [Trichelair et al.2018]. We also evaluate the fine-tuned BERT_CS model (without using the WNLI training data for further fine-tuning) on the WNLI test set, one of the GLUE tasks. We first transform the examples in WNLI from the premise-hypothesis format into the pronoun disambiguation problem format and then into the multi-choice QA format [Kocijan et al.2019].
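A sketch of the loss in Equation 3 follows; this is our reconstruction of the margin-style loss reported in [Kocijan et al.2019], with alpha and beta as the two hyper-parameters searched on the WSCR test set.

import torch

def wsc_loss(logits, alpha, beta):
    """Equation 3 (sketch): logits has shape (batch, 2), ordered as [correct c1, incorrect c2]."""
    log_p = torch.log_softmax(logits, dim=-1)           # p(. | q) with N = 2 candidates
    log_p1, log_p2 = log_p[:, 0], log_p[:, 1]
    margin = torch.clamp(log_p2 - log_p1 + beta, min=0.0)
    return (-log_p1 + alpha * margin).mean()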

The results on the WSC dataset, its various partitions, and the WNLI test set are shown in Table 5. Note that the results for [Ruan et al.2019] are fine-tuned on the whole WSCR dataset, including both the training and test sets. Results for the LM ensemble [Trinh and Le2018] and Knowledge Hunter [Emami et al.2018] are taken from [Trichelair et al.2018]. Results for "BERT + MTP" are taken from [Kocijan et al.2019] as the baseline of applying BERT to the WSC task.

No. Model Source Data Tasks Accuracy
1 BERT - - 58.2
2 BERT_triple ConceptNet MCQA 59.1
3 BERT_CS_random Wikipedia and ConceptNet MCQA 59.4
4 BERT_CS_MLM Wikipedia and ConceptNet MCQA+MLM 59.9
5 BERT_MLM Wikipedia and ConceptNet MLM 58.8
6 BERT_CS Wikipedia and ConceptNet MCQA 60.8
Table 7: Accuracy (%) of different models on the CommonsenseQA development set. The source data and tasks are those employed to pre-train each model. MCQA stands for the multi-choice question answering task and MLM for the masked language modeling task.

As can be seen from Table 5, "BERT + MCQA" achieves better performance than "BERT + MTP" on four of the seven evaluation criteria, with significant improvements on the assoc. and consist. partitions, which demonstrates that MCQA is a better pre-processing method than MTP for the WSC task. Moreover, "BERT_CS + MCQA" achieves the best performance on all of the evaluation criteria except consist. (Knowledge Hunter achieves better performance on consist. through a rule-based method), and achieves a 3.3% absolute improvement on the WSC dataset over the previous SOTA result from [Kocijan et al.2019].

6.3 GLUE

The General Language Understanding Evaluation (GLUE) benchmark [Wang et al.2018] is a collection of diverse natural language understanding tasks, including MNLI, QQP, QNLI, SST-2, CoLA, STS-B, MRPC, RTE and WNLI, of which CoLA and SST-2 are single-sentence tasks; MRPC, STS-B and QQP are similarity and paraphrase tasks; and MNLI, QNLI, RTE and WNLI are natural language inference tasks. To investigate whether our multi-choice QA based pre-training approach degrades performance on common sentence classification tasks, we evaluate the BERT_CS_base and BERT_CS_large models on 8 GLUE datasets and compare their performance with that of the baseline BERT models.

Following [Devlin et al.2018], we use a batch size of 32 and fine-tune for 3 epochs for all GLUE tasks, and we select the fine-tuning learning rate (among 1e-5, 2e-5, and 3e-5) based on performance on the development set. Results are presented in Table 6. We observe that the BERT_CS_large model achieves performance comparable to the BERT_large model and the BERT_CS_base model achieves slightly better performance than the BERT_base model. We hypothesize that commonsense knowledge may not be required for the GLUE tasks. On the other hand, these results demonstrate that our proposed multi-choice QA pre-training task does not degrade the sentence representation capabilities of the BERT models.

1) Dan had to stop Bill from toying with the injured bird. [He] is very compassionate.
Candidates: A) Dan*, B) Bill.  BERT: B.  BERT_CS: A.
2) Dan had to stop Bill from toying with the injured bird. [He] is very cruel.
Candidates: A) Dan, B) Bill*.  BERT: B.  BERT_CS: B.
3) The trophy doesn't fit into the brown suitcase because [it] is too large.
Candidates: A) the trophy*, B) the suitcase.  BERT: B.  BERT_CS: B.
4) The trophy doesn't fit into the brown suitcase because [it] is too small.
Candidates: A) the trophy, B) the suitcase*.  BERT: A.  BERT_CS: A.
Table 8: Several cases from the Winograd Schema Challenge dataset, with the predictions of BERT and BERT_CS. The pronoun in each question is in square brackets. The correct candidate for each question is marked with an asterisk (*).

7 Analysis

7.1 Pre-training Strategy

In this subsection, we conduct several comparison experiments using different source data and different pre-training tasks on the BERT model. For simplicity, we omit the model-size subscript in this subsection.

The first set of experiments compares the efficacy of our data creation approach with that of [Sun et al.2019]. First, as in [Sun et al.2019], we collect 606,564 triples from ConceptNet and construct 1,213,128 questions, each with one correct answer and four distractors. This dataset is denoted the TRIPLES dataset. We pre-train BERT models on the TRIPLES dataset with the same hyper-parameters as the BERT_CS models, and the resulting model is denoted BERT_triple. We also create several model counterparts based on our constructed dataset:

  • Distractors are formed by randomly picking concept1/concept2 entries in ConceptNet instead of those sharing the same concept1/concept2 and relation with the correct answer (a sketch of this random selection is given after this list). We denote the resulting model from this dataset BERT_CS_random.

  • Instead of pre-training BERT with a multi-choice QA task that chooses the correct answer from several candidate answers, we mask concept1/concept2 and pre-train BERT with a masked language model (MLM) task. We denote the resulting model from this pre-training task BERT_MLM.

  • We randomly mask 15% of the WordPiece tokens [Wu et al.2016] in the question, as in [Devlin et al.2018], and then conduct the multi-choice QA task and the MLM task simultaneously. The resulting model is denoted BERT_CS_MLM.
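As a contrast to the constrained selection used for BERT_CS, the random distractor selection behind BERT_CS_random (first bullet above) can be sketched as follows; the triples list is the same illustrative stand-in for ConceptNet used in the Section 4 sketch, not the released pipeline.

import random

def random_distractors(correct_answer, triples, k=4):
    """BERT_CS_random variant (sketch): draw distractors from all concepts in ConceptNet."""
    pool = {c for (c1, _, c2) in triples for c in (c1, c2) if c != correct_answer}
    return random.sample(sorted(pool), k)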

All these BERT models are fine-tuned on the CommonsenseQA dataset with the same hyper-parameters as described in Section 6.1 and the results are shown in Table 7. We observe the following from Table 7.

Comparing model 1 and model 2, we find that pre-training on ConceptNet benefits the CommonsenseQA task even with triples as input instead of sentences. Further comparing model 2 and model 6, we find that constructing sentences as input for pre-training BERT performs better on the CommonsenseQA task than using triples. We also conduct more detailed comparisons between fine-tuning model 1 and model 2 on the GLUE tasks; the results are shown in Table 6. BERT_triple yields much worse results than BERT_base and BERT_CS_base, which demonstrates that pre-training directly on triples may hurt the sentence representation capabilities of BERT.

Comparing model 3 and model 6, we find that pre-training BERT benefits from a more difficult dataset. In our selection method, all candidate answers share the same (concept1, relation) or (relation, concept2); that is, the candidates have closely related meanings. These more confusing candidates force BERT_CS to distinguish between close meanings, resulting in a more powerful BERT_CS model.

Comparing model 5 and model 6, we find that the multi-choice QA task works better than the masked LM task as the pre-training task for the target multi-choice QA task. We argue that, for the masked LM task, BERT_CS is required to predict each masked wordpiece (in the concepts) independently, whereas for the multi-choice QA task, BERT is required to model the whole candidate phrase. In this way, BERT is able to model whole concepts instead of paying attention mainly to individual wordpieces in the sentences. Comparing model 4 and model 6, we observe that adding the masked LM task may hurt the performance of BERT_CS. This is probably because the masked words in the questions have a negative influence on the multi-choice QA task. Finally, our proposed model BERT_CS achieves the best performance on the CommonsenseQA development set among these model counterparts.

Figure 1: Model performance curves on the CommonsenseQA development set over the pre-training steps.

7.2 Performance Curve

In this subsection, we plot the performance curves of the BERT_CS models on the CommonsenseQA development set over the pre-training steps. Every 10,000 training steps, we save the model as the initial model for fine-tuning. For each of these checkpoints, we run the fine-tuning experiment 10 times with random restarts, that is, we use the same pre-trained checkpoint but perform different fine-tuning data shuffling. Due to the instability of fine-tuning BERT [Devlin et al.2018], we remove results that are significantly lower than the mean; in our experiments, we remove accuracies lower than 0.57 for BERT_CS_base and 0.60 for BERT_CS_large. We plot the mean and standard deviation values in Figure 1. We observe that the performance of BERT_CS_base converges after around 50,000 training steps, whereas BERT_CS_large converges around the end of the pre-training stage or may not yet have converged, which suggests that BERT_CS_large is more powerful at incorporating commonsense knowledge. We also compare with pre-training the BERT_CS models for 2 epochs; however, this produces worse performance, probably due to over-fitting. Pre-training on a larger corpus (with more QA samples) may benefit the BERT_CS models, and we leave this to future work.

Model CLOSE FAR
BERT 82.4 60.6
BERT_CS 82.4 68.6
Table 9: Accuracy (%) of different models on the two partitions of the WSC dataset.

7.3 Error Analysis

Table 8 shows several cases from the Winograd Schema Challenge dataset. Questions 1 and 2 differ only in the words "compassionate" and "cruel". Our model BERT_CS chooses the correct answer for both questions, while BERT chooses the same candidate "Bill" for both. We speculate that BERT tends to choose the candidate closer to the pronoun. We split the WSC test set into two parts, CLOSE and FAR, according to whether the correct candidate is closer to or farther from the pronoun in the sentence than the other candidate. As shown in Table 9, BERT_CS achieves the same performance as BERT on the CLOSE set and better performance on the FAR set. That is to say, BERT_CS is more robust to the positions of the candidates and focuses more on the semantics of the sentence.
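The CLOSE/FAR partition can be stated precisely with a small sketch. This is our illustration; measuring token-level distance from each candidate mention to the pronoun is an assumption about the exact distance measure.

def close_or_far(pronoun_idx, correct_span, other_span):
    """Label a WSC instance CLOSE if the correct candidate is nearer to the pronoun (sketch).

    Spans are (start, end) token indices of the candidate mentions; ties are treated
    as FAR here, which is an arbitrary choice of this sketch.
    """
    def distance(span):
        return min(abs(pronoun_idx - span[0]), abs(pronoun_idx - span[1]))
    return "CLOSE" if distance(correct_span) < distance(other_span) else "FAR"

# Question 1 of Table 8: "Dan(0) had to stop Bill(4) ... bird.(10) [He](11) is very compassionate."
print(close_or_far(pronoun_idx=11, correct_span=(0, 0), other_span=(4, 4)))   # -> FAR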

Questions 3 and 4 differ only in the words "large" and "small". However, neither BERT_CS nor BERT chooses the correct answer. We hypothesize that, because phrases like "the suitcase is large" and "the trophy is small" are probably quite frequent in language model training data, both BERT and BERT_CS make mistakes. In future work, we will investigate other approaches to overcoming this sensitivity of language models and improving commonsense reasoning.

8 Conclusion

In this paper, we develop a pre-training approach for incorporating commonsense knowledge into language representation models such as BERT. We construct a commonsense-related multi-choice question answering dataset for pre-training BERT. The dataset is created automatically by our proposed “align, mask, and select” (AMS) method. Experimental results demonstrate that pre-training models using the proposed approach followed by fine-tuning achieves significant improvements on various commonsense-related tasks, such as CommonsenseQA and Winograd Schema Challenge, while maintaining comparable performance on other NLP tasks, such as sentence classification and natural language inference (NLI) tasks, compared to the original BERT models. In future work, we will incorporate the relationship information between two concepts into language representation models. We will also explore other structured knowledge graphs, such as Freebase, to incorporate entity information into language representation models. We also plan to incorporate commonsense knowledge information into other language representation models such as XLNet [Yang et al.2019].

Acknowledgments

The authors would like to thank Lingling Jin, Pengfei Fan, and Xiaowei Lu for providing the 16 NVIDIA V100 GPU cards.

References