
Incorporating Relation Knowledge into Commonsense Reading Comprehension with Multi-task Learning

by   Jiangnan Xia, et al.

This paper focuses on how to take advantage of external relational knowledge to improve machine reading comprehension (MRC) with multi-task learning. Most traditional MRC methods assume that the knowledge needed to find the correct answer exists in the given documents. However, in real-world tasks, part of the knowledge may not be mentioned, and machines should be equipped with the ability to leverage external knowledge. In this paper, we integrate relational knowledge into an MRC model for commonsense reasoning. Specifically, based on a pre-trained language model (LM), we design two auxiliary relation-aware tasks that predict whether any commonsense relation exists between two words and what the relation type is, in order to better model the interactions between the document and each candidate answer option. We conduct experiments on two multi-choice benchmark datasets: SemEval-2018 Task 11 and the Story Cloze Test. The experimental results demonstrate the effectiveness of the proposed method, which outperforms comparable baselines on both datasets.



1. Introduction

Machine reading comprehension (MRC) equips machines with the ability to answer questions about given documents. Recent years have witnessed a surge of well-designed MRC models (Wang et al., 2017; Yu et al., 2018), which achieve promising performance when provided with adequate manually labeled instances (Rajpurkar et al., 2016). However, most of these models assume that the knowledge required to answer the questions already exists in the given documents, which does not always hold. How to leverage commonsense knowledge for better reading comprehension remains largely unexplored.

Recently, some preliminary studies have begun to incorporate side information (e.g., triplets from an external knowledge base) into the model design of various NLP tasks, such as question answering (Bauer et al., 2018) and conversation generation (Zhou et al., 2018). Generally, there are two lines of work. The first line focuses on designing task-specific model structures (Yang and Mitchell, 2017; Bauer et al., 2018), which exploit concepts retrieved from an external knowledge base to enhance the representation. The second, more recent line pre-trains a language model over a large corpus to learn inherent word-level knowledge in an unsupervised way (Radford et al., 2018; Devlin et al., 2018), which achieves very promising performance.

The first line of work is usually carefully designed for the target task and is therefore not widely applicable. The second line can only learn the co-occurrence of words or entities in the context, and it may not be robust in complex scenarios such as reasoning tasks. For example, to answer the question “Was the light bulb still hot?” when the document is given as “I went into my bedroom and flipped the light switch. Oh, I see that the ceiling lamp is not turning on…”, machines need the commonsense knowledge that “a bulb is not hot when it is turned off”. Explicit relation information can act as a bridge that connects scattered pieces of context, a connection that may otherwise not be easily captured. Therefore, the aim of this paper is to take advantage of both the pre-trained language model and the explicit relation knowledge from an external knowledge base for commonsense reading comprehension.

Specifically, we first extract triplets from the popular ConceptNet knowledge base (Speer et al., 2017) and design two auxiliary relation-aware tasks that predict whether any relation exists between two concepts and what the relation type is. To make the MRC model aware of the commonsense relations between concepts, we propose a multi-task learning framework that jointly learns the target MRC task and the two relation-aware tasks in a unified model. We conduct experiments on two multi-choice commonsense reading comprehension datasets: Story Cloze Test (Mostafazadeh et al., 2017) and SemEval-2018 Task 11 (Ostermann et al., 2018). Experimental results demonstrate the effectiveness of our method, which outperforms comparable baselines on both datasets.

2. Related Work

Previous studies mainly focused on developing effective model structures to improve the reading ability of the systems (Wang et al., 2017; Yu et al., 2018), and they have achieved promising performance. However, success on these tasks says little about a model's ability to perform commonsense reasoning. Recently, a number of efforts have been invested in developing datasets for commonsense reading comprehension, such as Story Cloze Test and SemEval-2018 Task 11 (Ostermann et al., 2018; Mostafazadeh et al., 2017). In these datasets, part of the required knowledge may not be mentioned in the document, and machines should be equipped with commonsense knowledge to make correct predictions. There is increasing interest in incorporating commonsense knowledge into commonsense reading comprehension tasks. Most previous studies focused on developing special model structures to introduce external knowledge into neural network models (Yang and Mitchell, 2017; Bauer et al., 2018), and they have achieved promising results. For example, Yang and Mitchell (2017) use concepts from WordNet and weighted average vectors of the retrieved concepts to calculate a new LSTM state. These methods rely on task-specific model structures, which are difficult to adapt to other tasks. Pre-trained language models such as BERT and GPT (Radford et al., 2018; Devlin et al., 2018) are also used as a source of commonsense knowledge. However, the LM approach mainly captures the co-occurrence of words and phrases and cannot address more complex problems that require reasoning ability.

Unlike previous work, we incorporate external knowledge by jointly training the MRC model with two auxiliary tasks that are relevant to commonsense knowledge. The model can learn to fill the knowledge gap without any change to the original model structure.

3. Knowledge-enriched MRC model

3.1. Task Definition

Here we formally define the task of multi-choice commonsense reading comprehension. Given a reference document $D$ (and a question $Q$, if one is provided), a set of answer options $O = \{O_1, \dots, O_N\}$ and an external knowledge base, the goal is to choose the correct answer option according to the probabilities $\{p_1, \dots, p_N\}$ given by the MRC model, where $N$ is the total number of options.

In this paper, we use the ConceptNet knowledge base (Speer et al., 2017), which is a large semantic network of commonsense knowledge with a total of about 630k facts. Each fact is represented as a triplet (subject, relation, object), where the subject and object can each be a word or phrase and the relation is a relation type. An example, also listed in Table 5, is [kettle, UsedFor, boil water].
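The triplet format can be illustrated with a small sketch that stores facts and looks up the relations between two concepts. The three facts are copied from Table 5; the names `Triplet`, `build_index` and `lookup` are hypothetical helpers introduced here for illustration and are not part of ConceptNet or of the paper's implementation.

```python
from collections import defaultdict
from typing import NamedTuple


class Triplet(NamedTuple):
    subject: str
    relation: str
    obj: str


# Facts copied from Table 5 of this paper; a real setup would load ConceptNet.
FACTS = [
    Triplet("paper", "RelatedTo", "write"),
    Triplet("kettle", "UsedFor", "boil water"),
    Triplet("kettle", "RelatedTo", "drinking"),
]


def build_index(facts):
    """Map each (subject, object) pair to the list of relations linking them."""
    index = defaultdict(list)
    for fact in facts:
        index[(fact.subject, fact.obj)].append(fact.relation)
    return index


def lookup(index, subject, obj):
    """Return the relation types between two concepts (empty list if none)."""
    return index.get((subject, obj), [])


index = build_index(FACTS)
print(lookup(index, "kettle", "boil water"))  # ['UsedFor']
print(lookup(index, "kettle", "fridge"))      # []
```

A non-empty lookup result would correspond to a positive relation-existence label, and the returned relation name to a relation-type label.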

3.2. Overall Framework

The proposed method can be roughly divided into three parts: a pre-trained LM encoder, a task-specific prediction layer for multi-choice MRC and two relation-aware auxiliary tasks. The overall framework is shown in Figure 1.

The pre-trained LM encoder acts as the foundation of the model and captures the relationship between the question, document and answer options. Here we utilize the bidirectional Transformer network BERT (Devlin et al., 2018) as the pre-trained encoder for its superior performance on a range of natural language understanding tasks. Specifically, we concatenate the given document, question and each option with special delimiters as one segment, which is then fed into the BERT encoder. The input sequence is packed as “[CLS]D(Q)[SEP]O[SEP]”, where [CLS] and [SEP] are the special delimiters (footnote 1: if a question is given, we directly concatenate it after the document to form the BERT input). After the BERT encoder, we obtain the contextualized word representation $h_i \in \mathbb{R}^{H}$ for the $i$-th input token from the final layer of BERT, where $H$ is the dimension of the hidden state.
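The packing scheme described above can be sketched at the string level as follows. This is only an illustration: the function name `pack_inputs` and the whitespace joining are assumptions, and a real BERT pipeline would insert the delimiters as token ids through its tokenizer rather than as text.

```python
def pack_inputs(document, options, question=None):
    """Build one "[CLS]D(Q)[SEP]O[SEP]" string per answer option."""
    # The question, when present, is appended directly after the document.
    context = document if question is None else f"{document} {question}"
    return [f"[CLS] {context} [SEP] {option} [SEP]" for option in options]


sequences = pack_inputs(
    document="I filled the kettle with water and placed it on the stove.",
    question="Why did they use a kettle?",
    options=["To drink from.", "To boil water."],
)
for seq in sequences:
    print(seq)
```

Each option yields its own input sequence, so a question with two options produces two encoder passes whose [CLS] states are later compared.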

Next, on top of the BERT encoder, we add a task-specific output layer and view multi-choice MRC as a multi-class classification task. Specifically, we apply a linear head layer plus a softmax layer on the final contextualized representation $h_0$ of the [CLS] token. We minimize the negative log-likelihood (NLL) loss with respect to the correct class label:

$$L_{AP} = -\log \frac{\exp(v^{\top} h_0^{c})}{\sum_{k=1}^{N} \exp(v^{\top} h_0^{k})}$$

where $h_0^{c}$ is the final hidden state of the [CLS] token for the correct option $O_c$, $N$ is the number of options and $v \in \mathbb{R}^{H}$ is a learnable vector, with $H$ the hidden-state dimension.
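A minimal sketch of this answer-prediction loss, assuming that a single learnable vector scores the [CLS] state of each option and that a softmax is taken over the options. The function names and the toy numbers are illustrative and not taken from the paper.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def answer_nll(cls_states, v, correct):
    """NLL of the correct option under a softmax over per-option scores.

    cls_states: one [CLS] vector per option; v: learnable scoring vector.
    """
    scores = [dot(v, h) for h in cls_states]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return -(scores[correct] - log_z)


# Toy numbers: two options, hidden size 3 (illustrative values only).
cls_states = [[0.2, -0.1, 0.4], [0.9, 0.3, -0.2]]
v = [1.0, 0.5, -1.0]
loss = answer_nll(cls_states, v, correct=1)
print(round(loss, 4))
```

The loss is small when the correct option has the highest score and grows as probability mass shifts to the other options.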

Figure 1. Reading comprehension model structure with two relation-aware tasks

Finally, to make the model aware of implicit commonsense relations between concepts, we further introduce two auxiliary relation-aware prediction tasks for joint learning. Since it may be difficult to directly predict the actual relation type between two concepts without adequate training data, we split the relation prediction problem into two related tasks: the relation-existence task and the relation-type task. In the relation-existence task, we predict whether any relation exists between two concepts, which is a relatively easy task. In the relation-type task, we go one step further and decide the type of a relation that exists. The basic premise is that by guiding MRC model training with extra relation information, the proposed model can learn to capture some underlying commonsense relationships. The two auxiliary relation-aware tasks are trained jointly with the multi-choice MRC answer prediction, and all three update the shared parameters. In the following, we describe the two auxiliary tasks in detail.

3.3. Incorporating Relation Knowledge

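Task 1 is the relation-existence task. Following Devlin et al. (2018), we first convert each concept to a set of BPE tokens, giving tokens A and tokens B with beginning indexes $i$ and $j$ in the input sequence, respectively. The probability that a relation exists for each pair (tokens A, tokens B) is computed as:

$$p^{RE}_{ij} = \mathrm{sigmoid}(h_i^{\top} W_1 h_j)$$

where $h_i$ and $h_j$ are the final-layer BERT representations at positions $i$ and $j$, and $W_1 \in \mathbb{R}^{H \times H}$ is a trainable matrix.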
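The scoring step for one concept pair might look like the sketch below. It assumes that the trainable matrix enters as a bilinear form between the two token vectors, followed by a sigmoid; this form and the function names are assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relation_existence_prob(h_i, h_j, W):
    """sigmoid(h_i^T W h_j) for token vectors h_i, h_j and an H x H matrix W."""
    # First compute W h_j, then take the inner product with h_i.
    W_hj = [sum(w * x for w, x in zip(row, h_j)) for row in W]
    return sigmoid(sum(a * b for a, b in zip(h_i, W_hj)))


# Toy 2-dimensional example; the identity matrix reduces the score to h_i . h_j.
W = [[1.0, 0.0], [0.0, 1.0]]
p = relation_existence_prob([1.0, 2.0], [0.5, -0.5], W)
print(p)
```

A bilinear form lets the model learn which combinations of hidden dimensions in the two tokens signal a relation, while adding only one matrix on top of the shared encoder.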

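We define a pair (tokens A, tokens B) that has a relation in ConceptNet as a positive example and all other pairs as negative examples. We down-sample the negative examples and keep the ratio of positive to negative examples at $1:\gamma$. We define the relation-existence loss as the average binary cross-entropy (BCE) loss:

$$L_{RE} = \frac{1}{N_A N_B} \sum_{i=1}^{N_A} \sum_{j=1}^{N_B} \mathrm{BCE}(y^{RE}_{ij}, p^{RE}_{ij})$$

where $N_A$ and $N_B$ are the numbers of sampled concepts in sentence A and sentence B respectively, $p^{RE}_{ij}$ is the predicted relation-existence probability and $y^{RE}_{ij}$ is the label of whether there is a relation between the concepts.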
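The down-sampling and the averaged BCE loss can be sketched as follows. The sketch assumes the ratio means "negatives kept per positive" (4.0 in Section 4.2) and uses a fixed random seed for reproducibility; both function names are hypothetical.

```python
import math
import random


def downsample_negatives(pairs, labels, ratio, seed=0):
    """Keep all positives and at most ratio * (#positives) negatives."""
    positives = [p for p, y in zip(pairs, labels) if y == 1]
    negatives = [p for p, y in zip(pairs, labels) if y == 0]
    keep = min(len(negatives), int(ratio * len(positives)))
    sampled = random.Random(seed).sample(negatives, keep)
    return [(p, 1) for p in positives] + [(p, 0) for p in sampled]


def mean_bce(probs, labels, eps=1e-12):
    """Average binary cross-entropy over all sampled concept pairs."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


pairs = [("paper", "write")] + [("paper", f"w{k}") for k in range(10)]
labels = [1] + [0] * 10
sampled = downsample_negatives(pairs, labels, ratio=4.0)
print(len(sampled))  # 5: one positive plus four negatives
print(mean_bce([0.9, 0.2], [1, 0]))
```

Without down-sampling, the many unrelated pairs would dominate the average and the model could minimize the loss by predicting "no relation" everywhere.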

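Task 2 is the relation-type task. We predict the relation type between tokens A and tokens B. The relation-type probabilities are computed as:

$$p^{RT}_{ij} = \mathrm{softmax}(W_3\,\mathrm{ReLU}(W_2 [h_i; h_j]))$$

where $[h_i; h_j]$ is the concatenation of the representations at the beginning indexes of tokens A and tokens B, $W_2$ and $W_3$ are new trainable matrices and $R$ is the number of selected relation types, so that $p^{RT}_{ij} \in \mathbb{R}^{R}$ (footnote 2: we do not use the entire set of relation types because some types, such as “RelatedTo”, are not clearly defined. We choose 34 kinds of relations in ConceptNet 5, excluding “RelatedTo”, “ExternalURL” and “dbpedia”).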

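The relation-type loss is computed as:

$$L_{RT} = -\frac{1}{S} \sum_{i} \sum_{j} y^{RE}_{ij} \log p^{RT}_{ij,k}$$

where the sums run over the sampled concept pairs in sentence A and sentence B, $y^{RE}_{ij}$ is the label of whether there is a relation from sentence A to sentence B, $k$ is the index of the ground-truth relation in ConceptNet and $S$ is the number of relations among the tokens in the two sentences.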
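One possible reading of the relation-type head and its loss is sketched below. It assumes that the two trainable matrices form a two-layer classifier over the concatenated token vectors with a ReLU in between, and that the loss is averaged over pairs that have a relation; the layer layout, the function names and the toy sizes are assumptions.

```python
import math


def matvec(W, x):
    return [sum(w * a for w, a in zip(row, x)) for row in W]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def relation_type_probs(h_i, h_j, W2, W3):
    """softmax(W3 . relu(W2 . [h_i; h_j])) over the relation types."""
    # List concatenation h_i + h_j plays the role of [h_i; h_j].
    hidden = [max(0.0, a) for a in matvec(W2, h_i + h_j)]
    return softmax(matvec(W3, hidden))


def relation_type_loss(prob_rows, gold_types):
    """Mean negative log-probability of the gold type over related pairs only."""
    losses = [-math.log(p[k]) for p, k in zip(prob_rows, gold_types)]
    return sum(losses) / len(losses)


# Toy sizes: hidden size 2, three relation types (illustrative values only).
W2 = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
W3 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
probs = relation_type_probs([1.0, 0.0], [0.0, 2.0], W2, W3)
print(probs, relation_type_loss([probs], [0]))
```

Only pairs that actually have a relation contribute to this loss, which is why the relation-existence label acts as a mask in the formula.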

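By design, the three tasks share the same BERT architecture and differ only in their linear head layers. Therefore, we train them together with multi-task learning. The joint objective function is formulated as follows:

$$L = L_{AP} + \lambda_1 \sum_{n=1}^{N} L^{n}_{RE} + \lambda_2 \sum_{n=1}^{N} L^{n}_{RT}$$

where $L_{AP}$, $L_{RE}$ and $L_{RT}$ are the answer-prediction, relation-existence and relation-type losses, $\lambda_1$ and $\lambda_2$ are two hyper-parameters that control the weight of the tasks, and $N$ is the number of candidates.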
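The weighted combination of the three losses can be sketched in a few lines. The sketch assumes the auxiliary losses are summed over the candidate options, since each option forms its own input sequence; the default weights of 0.5 are the values reported in Section 4.2, and the function name is hypothetical.

```python
def joint_loss(ap_loss, re_losses, rt_losses, lambda_1=0.5, lambda_2=0.5):
    """Answer loss plus weighted auxiliary losses summed over candidates."""
    return ap_loss + lambda_1 * sum(re_losses) + lambda_2 * sum(rt_losses)


# Two candidate options, so two auxiliary loss values per task (toy numbers).
total = joint_loss(ap_loss=0.25, re_losses=[0.5, 0.25], rt_losses=[1.0, 0.5])
print(total)  # 1.375
```

Setting either weight to zero recovers a model trained with only one auxiliary task, which is the setting compared in the ablation of Table 4.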

4. Experiments

4.1. Dataset

We conduct experiments on two commonsense reading comprehension tasks: SemEval-2018 Task 11 (Ostermann et al., 2018) and Story Cloze Test (Mostafazadeh et al., 2017). The statistics of the datasets are shown in Table 1 (footnote 3: Story Cloze Test consists of 98,161 unlabeled examples and 1,871 labeled examples. Following Radford et al. (2018), we divide the labeled examples into a new training set and development set with 1,433 and 347 examples respectively).

| Dataset | Train | Dev | Test |
|---|---|---|---|
| SemEval-2018 Task 11 | 9,731 | 1,411 | 2,797 |
| Story Cloze Test | 1,433 | 347 | 1,871 |

Table 1. Number of examples in datasets.

| Model | ACC. (Test) % |
|---|---|
| NN-T (Merkhofer et al., 2018) | 80.23 |
| HMA (Chen et al., 2018) | 80.94 |
| TriAN (Wang, 2018) | 81.94 |
| TriAN + + (Zhong et al., 2018) | 81.80 |
| TriAN + relation-aware tasks | 82.84 |
| BERT(base) | 87.53 |
| BERT(base) + relation-aware tasks | 88.23 |

Table 2. Performance on SemEval-2018 Task 11.

| Model | ACC. (Test) % |
|---|---|
| Style+RNNLM (Schwartz et al., 2017) | 75.2 |
| HCM (Chaturvedi et al., 2017) | 77.6 |
| GPT (Radford et al., 2018) | 86.5 |
| BERT(base) | 86.7 |
| BERT(base) + relation-aware tasks | 87.4 |

Table 3. Performance on Story Cloze Test.

| Model | ACC. (Dev) % | Gain |
|---|---|---|
| Basic Model | 87.51 | - |
| + relation-existence task | 88.02 | +0.51 |
| + relation-type task | 87.87 | +0.36 |
| + relation-existence task + relation-type task | 88.25 | +0.74 |
| + relation-type task + “No Relation” type | 87.78 | +0.41 |

Table 4. Performance with different tasks.

| Document | Question | Options | Commonsense Facts |
|---|---|---|---|
| [SemEval] … We organized it from the fruit to the dairy in an organized order. We put the shopping list on our fridge so that we wouldn’t forget it the next day when we went to go buy the food. | What did they write the list on? | (A) Paper. (B) Fridge. Correct: A | [paper, RelatedTo, write] |
| [Story Cloze Test] My kitchen had too much trash in it. I cleaned it up and put it into bags. I took the bags outside of my house. I then carried the bags down my driveway to the trash can. | - | (A) I missed the trash in the kitchen. (B) I was glad to get rid of the trash. Correct: B | [get rid of, RelatedTo, clean up] |
| [SemEval] … I settled on Earl Gray, which is a black tea flavored with bergamot orange. I filled the kettle with water and placed it on the stove, turning on the burner… | Why did they use a kettle? | (A) To drink from. (B) To boil water. Correct: B | [kettle, UsedFor, boil water]; [kettle, RelatedTo, drinking] |

Table 5. Examples that require commonsense relations between concepts.

4.2. Implementation Details

We use the uncased BERT(base) (Devlin et al., 2018) as the pre-trained language model. We set the batch size to 24 and the learning rate to 2e-5. The maximum sequence length is 384 for SemEval-2018 Task 11 and 512 for Story Cloze Test. We fine-tune for 3 epochs on each dataset. The task-specific hyper-parameters $\lambda_1$ and $\lambda_2$, which weight the relation-existence and relation-type losses, are set to 0.5, and the negative-sampling ratio $\gamma$ is set to 4.0.
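For reference, the settings reported in this section can be collected into a small configuration object. The values are those stated above; the class name `FineTuneConfig` and its field names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyper-parameters reported in Section 4.2 (field names are hypothetical)."""
    max_seq_length: int
    batch_size: int = 24
    learning_rate: float = 2e-5
    num_epochs: int = 3
    lambda_re: float = 0.5       # weight of the relation-existence loss
    lambda_rt: float = 0.5       # weight of the relation-type loss
    negative_ratio: float = 4.0  # down-sampling ratio for negative pairs


SEMEVAL = FineTuneConfig(max_seq_length=384)
STORY_CLOZE = FineTuneConfig(max_seq_length=512)
print(SEMEVAL)
print(STORY_CLOZE)
```

Only the maximum sequence length differs between the two datasets; all other settings are shared.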

4.3. Experimental Results and Analysis

Model Comparison. The performance of our model on the two datasets is shown in Table 2 and Table 3. We compare our single model with other existing single-model systems. We also apply the relation-aware tasks to the co-attention layer of the TriAN model (Wang, 2018) for comparison. From the results, we can observe that: (1) our method achieves better performance on both datasets than previous methods and the BERT(base) model; (2) applying the relation-aware tasks to the attention layer of the TriAN model (Wang, 2018) on SemEval also improves performance. The results show that the relation-aware tasks help incorporate commonsense knowledge into the model and better align sentences that are separated by a knowledge gap.

Effectiveness of Relation-aware Tasks. To gain better insight into our model, we analyze the benefit of the relation-aware tasks on the development set of SemEval-2018 Task 11. The performance of jointly training the basic answer prediction model with different tasks is shown in Table 4. Incorporating either the auxiliary relation-existence task or the relation-type task in the joint learning framework improves performance, which shows the advantage of the auxiliary tasks. The gain from adding the relation-existence task is larger, which suggests that it incorporates more knowledge into the model. We also attempt to merge the two relation-aware tasks into one by simply treating ”No-Relation” as a special relation type. We find that the resulting performance is only slightly higher than using the relation-type task alone and lower than using the two tasks separately. This is because ”No-Relation” labels far outnumber the other relation types, which makes the merged task hard to train during the fine-tuning stage.

Analysis. Table 5 shows examples that are incorrectly predicted by BERT(base) but correctly solved after incorporating the relation-aware tasks. The first two examples benefit from relation-existence knowledge. In the first example, the relation retrieved from ConceptNet provides useful evidence that connects the question to the correct option (A). The second example, from the Story Cloze Test dataset, shows that the retrieved relation is also helpful in making the correct prediction. The third example, from SemEval, requires relation-type knowledge. Both options are related to the concept “kettle” in ConceptNet, but the relation type for option (B), UsedFor, is more relevant to the question than the RelatedTo relation for option (A), which shows that relation-type knowledge can be used as side information for the prediction.

5. Conclusion

In this paper, we aim to enrich the neural model with external knowledge to improve commonsense reading comprehension. We use two auxiliary relation-aware tasks to incorporate ConceptNet knowledge into an MRC model based on a pre-trained language model. Experimental results demonstrate the effectiveness of our method for commonsense reading comprehension, with improvements over the pre-trained language model baselines on both datasets.