Exploring Machine Reading Comprehension with Explicit Knowledge

09/10/2018 · by Chao Wang, et al.

To apply general knowledge to machine reading comprehension (MRC), we propose an innovative MRC approach that consists of a WordNet-based data enrichment method and an MRC model named Knowledge Aided Reader (KAR). The data enrichment method uses the semantic relations of WordNet to extract semantic level inter-word connections from each passage-question pair in the MRC dataset, and allows us to control the amount of the extraction results by setting a hyper-parameter. KAR uses the extraction results of the data enrichment method as explicit knowledge to assist the prediction of answer spans. According to the experimental results, the single model of KAR achieves an Exact Match (EM) of 72.4 and an F1 Score of 81.1 on the development set of SQuAD; more importantly, by applying different settings in the data enrichment method to change the amount of the extraction results, we observe a 2% variation in the resulting performance of KAR, which implies that the explicit knowledge provided by the data enrichment method plays an effective role in the training of KAR.

1 Introduction

Machine reading comprehension (MRC) is a challenging task in artificial intelligence. As the name suggests, MRC requires a machine to read a passage and answer a relevant question. Since the answer to each question is supposed to stem from the corresponding passage, a common solution for MRC is to train an MRC model that predicts, for each given passage-question pair, an answer span (i.e. the answer start position and the answer end position) in the passage. To encourage the exploration of MRC models, many MRC datasets have been published, such as SQuAD Rajpurkar et al. (2016) and MS-MARCO Nguyen et al. (2016). In this paper, we focus on SQuAD.
Many MRC models have been proposed for the SQuAD challenge. Although the top models on the leaderboard have achieved almost the same performance as human beings, we are convinced that the way human beings conduct reading comprehension is still worth studying for making further innovations in MRC. Therefore, let us briefly review human reading comprehension before diving into MRC. Given a passage and a relevant question, we may wish to match the passage words with the question words, so that we can find the answer around the matched passage words. However, due to the complexity and diversity of natural languages, this naive method is often useless in practice. Instead, we must rely on our reasoning skills to deal with reading comprehension, which makes it necessary for us to obtain enough inter-word connections from each given passage-question pair. Inter-word connections have a wide coverage: they exist not only on the syntactic level (e.g. dependency), but also on the semantic level (e.g. synonymy). The examples provided in Table 1 demonstrate how human reading comprehension can benefit from semantic level inter-word connections.

Passage | Question | Answer
Teachers may use a lesson plan to facilitate student learning, providing a course of study which is called the curriculum. | What can a teacher use to help students learn? | lesson plan
Manufacturing accounts for a significant but declining share of employment, although the city’s garment industry is showing a resurgence in Brooklyn. | In what borough is the garment business prominent? | Brooklyn
Table 1: Two examples of the effects of semantic level inter-word connections on human reading comprehension. In the first example, we can find the answer because we know “facilitate” and “help” are synonyms. Similarly, in the second example, we can find the answer because we know “borough” is a hypernym of “Brooklyn”, or “Brooklyn” is a hyponym of “borough”. Both examples are selected from SQuAD.

By roughly analyzing the MRC models proposed for SQuAD, we find that leveraging neural attention mechanisms Bahdanau et al. (2014) based on recurrent neural networks, such as LSTM Hochreiter and Schmidhuber (1997) and GRU Cho et al. (2014), is currently the dominant approach. Since neural network models are usually regarded as simulations of human brains, we may as well interpret the training of an MRC model as a process of teaching knowledge to it, where the knowledge comes from the training samples and is absorbed into the model parameters through gradient descent. However, neural network models are also known as black boxes: by just updating model parameters according to training samples, we can neither understand the meaning of the knowledge taught to an MRC model nor control its amount. We therefore refer to such knowledge as implicit knowledge.
So far, human beings have accumulated a tremendous amount of general knowledge. This general knowledge, despite being an essential component of human intelligence, has never been effectively applied to MRC, which we believe is the biggest gap between MRC and human reading comprehension. We intend to bridge this gap with the help of knowledge bases, which store general knowledge in structured forms. In recent years, many knowledge bases have been established, such as WordNet Fellbaum (1998) and Freebase Bollacker et al. (2008), and they have made it convenient for machines to access and process the general knowledge of human beings. Therefore, it is both meaningful and feasible to integrate the general knowledge in a knowledge base with the training of an MRC model. However, rather than leveraging knowledge base embeddings Bordes et al. (2011, 2013); Yang et al. (2014); Yang and Mitchell (2017), we would prefer our MRC model to use general knowledge in an understandable and controllable way, and we refer to the general knowledge used in this way as explicit knowledge.
In this paper, using WordNet as our knowledge base, we propose an innovative MRC approach that consists of two components: a WordNet-based data enrichment method, which uses WordNet to extract semantic level inter-word connections from each passage-question pair in the MRC dataset, and an MRC model named Knowledge Aided Reader (KAR), which uses the extraction results of the data enrichment method as explicit knowledge to assist the prediction of answer spans. Our MRC approach has two important features: on the one hand, the data enrichment method allows us to control the amount of the extraction results; on the other hand, this amount in turn affects the performance of KAR. According to the experimental results, by applying different settings in the data enrichment method to change the amount of the extraction results, we observe an evident variation in the resulting performance of KAR, which implies that the explicit knowledge provided by the data enrichment method plays an effective role in the training of KAR.

2 Task Description

The MRC task considered in this paper is defined as the following prediction problem: given a passage P = (p_1, p_2, …, p_n), which is a sequence of n words, and a relevant question Q = (q_1, q_2, …, q_m), which is a sequence of m words, predict an answer start position a_s and an answer end position a_e, where 1 ≤ a_s ≤ a_e ≤ n, so that the fragment (p_{a_s}, …, p_{a_e}) in P is the answer to Q.

3 WordNet-based Data Enrichment

To provide our MRC model with explicit knowledge, we enrich the content of the MRC dataset by extracting semantic level inter-word connections from each passage-question pair in it; to this end, we propose a WordNet-based data enrichment method.

3.1 What and how to extract from each passage-question pair

WordNet is a lexical database for English. Words in WordNet are organized into synsets, which in turn are related to each other through semantic relations, such as “hypernym” and “hyponym”. In our data enrichment method, we use the semantic relations of WordNet to extract semantic level inter-word connections from each passage-question pair in the MRC dataset. Considering the requirements of our MRC model, we need to represent the extraction results as positional information. Specifically, for each word w in a passage-question pair, we need to obtain a set E_w, which contains the positions of the passage words that w is semantically connected to. Besides, when w itself is a passage word, we also need to ensure that its own position is excluded from E_w.
The key problem in obtaining the above extraction results is to determine whether a subject word is semantically connected to an object word. To solve this problem, we introduce two concepts: the directly-involved synsets and the indirectly-involved synsets of a word. Given a word w, its directly-involved synsets S_w represent the synsets that w belongs to, and its indirectly-involved synsets S*_w represent the synsets that the synsets in S_w are related to through semantic relations. Based on these two concepts, we propose the following hypothesis: given a subject word w_s and an object word w_o, w_s is semantically connected to w_o if and only if (S_{w_s} ∪ S*_{w_s}) ∩ S_{w_o} ≠ ∅. According to this hypothesis, Algorithm 1 describes the process of extracting semantic level inter-word connections from each passage-question pair.

procedure Extract(P, Q)  ▷ given a passage P and a relevant question Q
     for p_i in P do  ▷ for each passage word
          E_{p_i} ← {j | p_j ∈ P, j ≠ i, p_i is semantically connected to p_j}  ▷ obtain the extraction results on P
     end for
     for q_i in Q do  ▷ for each question word
          E_{q_i} ← {j | p_j ∈ P, q_i is semantically connected to p_j}  ▷ obtain the extraction results on Q
     end for
     return {E_{p_1}, …, E_{p_n}} and {E_{q_1}, …, E_{q_m}}  ▷ return the extraction results on P and Q
end procedure
Algorithm 1: Extract semantic level inter-word connections from each passage-question pair

3.2 How to obtain the indirectly-involved synsets of each word

The above hypothesis and process can work only if we know how to obtain the directly-involved synsets and indirectly-involved synsets of each word. Given a word w, we can easily obtain its directly-involved synsets S_w from WordNet, but obtaining its indirectly-involved synsets S*_w is much more complicated, because in WordNet, the way synsets are related to each other is flexible and extensible. In some cases, a synset is related to another synset through a single semantic relation. For example, the synset “cold.a.01” is related to the synset “temperature.n.01” through the semantic relation “attribute”. However, in more cases, a synset is related to another synset through a semantic relation chain. For example, first the synset “keratin.n.01” is related to the synset “feather.n.01” through the semantic relation “substance holonym”, then the synset “feather.n.01” is related to the synset “bird.n.01” through the semantic relation “part holonym”, and finally the synset “bird.n.01” is related to the synset “parrot.n.01” through the semantic relation “hyponym”; thus we can say that the synset “keratin.n.01” is related to the synset “parrot.n.01” through the semantic relation chain “substance holonym → part holonym → hyponym”. We name each semantic relation in a semantic relation chain a hop, so that a semantic relation chain having k semantic relations is a k-hop semantic relation chain. Besides, each single semantic relation is a 1-hop semantic relation chain.
Let R represent the set of semantic relations of WordNet, and for a synset s and a semantic relation r ∈ R, let r(s) represent the synsets that s is related to through r. Since the synsets that s is related to through 1-hop semantic relation chains, Σ_1(s) = ∪_{r∈R} r(s), are easy to obtain from WordNet, we can further obtain the synsets that s is related to through 2-hop semantic relation chains, Σ_2(s) = ∪_{s′∈Σ_1(s)} Σ_1(s′), the synsets that s is related to through 3-hop semantic relation chains, Σ_3(s) = ∪_{s′∈Σ_2(s)} Σ_1(s′), and by induction, the synsets that s is related to through k-hop semantic relation chains, Σ_k(s) = ∪_{s′∈Σ_{k−1}(s)} Σ_1(s′). In theory, if we do not limit the hop counts of semantic relation chains, s can be related to all other synsets in WordNet, which is meaningless in many cases. Therefore, we use a hyper-parameter χ to represent the maximum hop count of semantic relation chains, and only consider the semantic relation chains having no more than χ hops. Based on the above descriptions, given a word w and its directly-involved synsets S_w, we can obtain its indirectly-involved synsets: S*_w = ∪_{s∈S_w} ∪_{k=1}^{χ} Σ_k(s).
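For illustration, the whole extraction procedure can be sketched in Python with the WordNet interface of NLTK, which is also what our implementation uses (Section 6.2). This is a minimal sketch, not the exact configuration of our experiments: the subset of semantic relations traversed, the helper names, and the assumption that input words are already lemmatized are all illustrative.

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

# Semantic relations to traverse; this subset is an illustrative
# assumption -- WordNet defines more relations than listed here.
RELATIONS = ("hypernyms", "hyponyms", "part_holonyms", "substance_holonyms",
             "member_holonyms", "part_meronyms", "substance_meronyms",
             "member_meronyms", "attributes", "entailments", "causes")

def directly_involved_synsets(word):
    # S_w: the synsets that the word belongs to.
    return set(wn.synsets(word))

def one_hop(synset):
    # Sigma_1(s): synsets related to s through a single semantic relation.
    return {t for rel in RELATIONS for t in getattr(synset, rel)()}

def indirectly_involved_synsets(word, chi):
    # S*_w: synsets reachable from S_w through 1..chi-hop relation chains.
    frontier, reached = directly_involved_synsets(word), set()
    for _ in range(chi):
        frontier = {t for s in frontier for t in one_hop(s)} - reached
        reached |= frontier
    return reached

def semantically_connected(subject, obj, chi):
    # Hypothesis: subject is semantically connected to obj iff
    # (S_subject ∪ S*_subject) ∩ S_obj ≠ ∅.
    reachable = (directly_involved_synsets(subject)
                 | indirectly_involved_synsets(subject, chi))
    return bool(reachable & directly_involved_synsets(obj))

def extract(passage, question, chi):
    # Algorithm 1: the position sets E_w for each passage and question word.
    e_p = {i: {j for j, p_j in enumerate(passage)
               if j != i and semantically_connected(p_i, p_j, chi)}
           for i, p_i in enumerate(passage)}
    e_q = {i: {j for j, p_j in enumerate(passage)
               if semantically_connected(q_i, p_j, chi)}
           for i, q_i in enumerate(question)}
    return e_p, e_q
```

Note that wn.synsets expects lemmas, which is why each word is lemmatized during preprocessing (we use CoreNLP for this, as described in Section 6.2).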

3.3 About controlling the amount of the extraction results

The hyper-parameter χ is crucial for controlling the amount of the extraction results. When we set χ to 0, the indirectly-involved synsets S*_w of each word w contain no synsets, so that semantic level inter-word connections exist only between synonyms. As we increase χ, the indirectly-involved synsets of each word usually contain more synsets, so that semantic level inter-word connections are likely to exist between more words. As a result, by increasing χ within a certain range, we can extract more semantic level inter-word connections from the MRC dataset, and thus provide our MRC model with more explicit knowledge. However, due to the limitations of WordNet, only a part of the extraction results is useful explicit knowledge, while the rest is useless for the prediction of answer spans. According to our observation, the proportion of the useless explicit knowledge increases as χ gets larger. Therefore, there exists an optimal setting for χ, which results in the best performance of our MRC model; a sketch for measuring how the amount of the extraction results grows with χ is given below.
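As a usage example, the average number of connections per word (the quantity reported later in Table 2) can be measured for each setting of χ by reusing the extract() sketch from Section 3.2. The variable names here (e.g. dev_pairs) are hypothetical.

```python
# Reuses extract() from the sketch in Section 3.2; `pairs` is assumed
# to be an iterable of (passage_words, question_words) lemma lists.
def connections_per_word(pairs, chi):
    total_connections = total_words = 0
    for passage, question in pairs:
        e_p, e_q = extract(passage, question, chi)
        total_connections += sum(len(s) for s in e_p.values())
        total_connections += sum(len(s) for s in e_q.values())
        total_words += len(passage) + len(question)
    return total_connections / total_words

# Sweeping chi over 0..5 traces the monotone growth reported in Table 2:
# for chi in range(6):
#     print(chi, connections_per_word(dev_pairs, chi))
```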

4 Knowledge Aided Reader

As depicted in Figure 1, our MRC model, Knowledge Aided Reader (KAR), consists of five layers. Given a passage-question pair: the lexical embedding layer encodes the lexical features of each word to generate the passage lexical embeddings and the question lexical embeddings; based on the lexical embeddings, the contextual embedding layer encodes the contextual clues about each word to generate the passage contextual embeddings and the question contextual embeddings; based on the contextual embeddings, the memory generation layer performs passage-to-question attention and question-to-passage attention to generate the preliminary memories over the passage-question pair; based on the preliminary memories, the memory refining layer performs self-matching attention to generate the refined memories over the passage-question pair; and based on the refined memories and the question contextual embeddings, the answer span prediction layer generates the answer start position distribution and the answer end position distribution. KAR differs from existing MRC models in that it uses the semantic level inter-word connections, which are pre-extracted from the MRC dataset by the WordNet-based data enrichment method, as explicit knowledge to assist the prediction of answer spans. On the one hand, the memory generation layer uses the explicit knowledge to assist the passage-to-question attention and the question-to-passage attention; on the other hand, the memory refining layer uses the explicit knowledge to assist the self-matching attention. Besides, to better utilize the explicit knowledge, the lexical embedding layer encodes dependency and synonymy information into the lexical embedding of each word.

Figure 1: Our MRC model: Knowledge Aided Reader (KAR)

4.1 Lexical Embedding Layer

For each word, the lexical embedding layer generates its lexical embedding by merging the following four basic embeddings:
1. Word-level Embedding. We define our vocabulary as the intersection between the words in all training samples and those in the pre-trained GloVe word vectors Pennington et al. (2014). Given a word w, if it is in the vocabulary, we set its word-level embedding e_w^{word} to its GloVe word vector, which is fixed during the training; otherwise, we set e_w^{word} to a trainable parameter serving as the shared word vector of all out-of-vocabulary (OOV) words.
2. Character-level Embedding. We represent each character as a separate trainable vector. Given a word w consisting of a sequence of l characters, whose character vectors are represented as C_w, we use a bidirectional FOFE Zhang et al. (2015) to process the sequence, concatenate the forward FOFE output and the backward FOFE output across rows to obtain H_w, and perform self attention on H_w to obtain the character-level embedding of w:

e_w^{char} = H_w · softmax(v^⊤ tanh(W H_w))^⊤

where W and v are trainable parameters. Applying character-level embedding is helpful in representing OOV words. (A sketch of the bidirectional FOFE and of this self attention pattern is given at the end of this subsection.)
3. Dependency Embedding. Inspired by Liu et al. (2017a), we use a dependency parser to obtain the dependent words of each word. Given a word w having k dependent words, whose word-level embeddings are represented as D_w, we perform self attention on D_w to obtain the dependency embedding of w:

e_w^{dep} = D_w · softmax(v^⊤ tanh(W D_w))^⊤

where W and v are trainable parameters. By applying dependency embedding, we make use of syntactic level inter-word connections, which serve as a supplement to the pre-extracted semantic level inter-word connections.
4. Synonymy Embedding. In the scope of the vocabulary, we use WordNet to obtain the synonyms of each word. Given a word w having k synonyms, whose word-level embeddings are represented as Y_w, we perform self attention on Y_w to obtain the synonymy embedding of w:

e_w^{syn} = Y_w · softmax(v^⊤ tanh(W Y_w))^⊤

where W and v are trainable parameters. By applying synonymy embedding, we improve the vector-space similarity between synonyms, and thus promote the effects of the pre-extracted semantic level inter-word connections.
Based on the above descriptions, given a word w, we obtain e_w^{word}, e_w^{char}, e_w^{dep}, and e_w^{syn}, and concatenate them across rows. In this way, for all passage words, we obtain L_P, and for all question words, we obtain L_Q. We put L_P through a highway network Srivastava et al. (2015) to obtain the passage lexical embeddings X_P, and put L_Q through the same highway network to obtain the question lexical embeddings X_Q.
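The self attention used in items 2-4 follows one shared pattern, and the FOFE encoding and highway layer are simple to state in code. The following NumPy sketch shows all three; the parameter shapes, the forgetting factor 0.5, and the toy dimensions are illustrative assumptions, not the values used in our experiments.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def self_attention(X, W, v):
    # Shared pattern of items 2-4: attend over the columns of X,
    # e = X * softmax(v^T tanh(W X))^T.
    return X @ softmax(v @ np.tanh(W @ X))

def fofe(X, alpha):
    # FOFE (Zhang et al., 2015): z_t = alpha * z_{t-1} + x_t.
    Z = np.zeros_like(X)
    z = np.zeros(X.shape[0])
    for t in range(X.shape[1]):
        z = alpha * z + X[:, t]
        Z[:, t] = z
    return Z

def bidirectional_fofe(C, alpha):
    # Concatenate the forward and backward FOFE outputs across rows.
    backward = fofe(C[:, ::-1], alpha)[:, ::-1]
    return np.concatenate([fofe(C, alpha), backward], axis=0)

def highway_layer(x, W_h, b_h, W_g, b_g):
    # Highway layer (Srivastava et al., 2015):
    # y = g * tanh(W_h x + b_h) + (1 - g) * x, g = sigmoid(W_g x + b_g).
    g = 1.0 / (1.0 + np.exp(-(W_g @ x + b_g)))
    return g * np.tanh(W_h @ x + b_h) + (1.0 - g) * x

# Toy usage: character-level embedding of a 4-character word with
# 8-dimensional character vectors and forgetting factor 0.5.
rng = np.random.default_rng(0)
C = rng.standard_normal((8, 4))                 # character vectors as columns
H = bidirectional_fofe(C, 0.5)                  # 16 x 4
W, v = rng.standard_normal((16, 16)), rng.standard_normal(16)
char_emb = self_attention(H, W, v)              # 16-dimensional embedding
```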

4.2 Contextual Embedding Layer

For each word, the contextual embedding layer fuses its lexical embedding with those of its surrounding words to generate its contextual embedding. Specifically, we use a bidirectional LSTM (BiLSTM) to process X_P and X_Q separately. For X_P, we concatenate the forward LSTM output and the backward LSTM output across rows to obtain the passage contextual embeddings C_P. For X_Q, we concatenate the forward LSTM output and the backward LSTM output across rows to obtain the question contextual embeddings C_Q.

4.3 Memory Generation Layer

For each passage word, the memory generation layer fuses its contextual embedding with both the passage contextual embeddings and the question contextual embeddings to generate its preliminary memory over the passage-question pair. Specifically, the task of this layer is decomposed into the following four steps:
1. Generating enhanced contextual embeddings. We enhance the contextual embedding of each word according to the pre-extracted semantic level inter-word connections. Given a word w, whose contextual embedding is c_w, suppose we have obtained E_w through Algorithm 1; then we gather the columns in C_P whose positions are contained in E_w, represent these columns as Z_w, and perform attention on Z_w to obtain the E_w-attended contextual embedding (as sketched after this list):

z_w = Z_w · softmax(v^⊤ tanh(W (c_w ∥ Z_w)))^⊤

where W and v are trainable parameters, and ∥ represents concatenating a vector with each column in a matrix across rows. Based on the above descriptions, given a word w, we concatenate c_w and z_w across rows. In this way, for all passage words, we obtain G_P, and for all question words, we obtain G_Q. We put G_P through a highway network to obtain the enhanced passage contextual embeddings C̃_P, and put G_Q through the same highway network to obtain the enhanced question contextual embeddings C̃_Q.
2. Constructing the knowledge aided alignment matrix. Based on the enhanced contextual embeddings, we construct an alignment matrix A, where each element A_{i,j} (i.e. the element in the i-th row and j-th column of A) represents the similarity between the enhanced contextual embedding c̃_{p_i} of the passage word p_i and the enhanced contextual embedding c̃_{q_j} of the question word q_j. We use the similarity function proposed by Seo et al. (2016) to obtain each element of A:

A_{i,j} = v^⊤ [c̃_{p_i}; c̃_{q_j}; c̃_{p_i} ∘ c̃_{q_j}]

where v is a trainable parameter, [;] represents concatenation across rows, and ∘ represents element-wise multiplication. Since the enhanced contextual embeddings are generated according to the pre-extracted semantic level inter-word connections, A is named the knowledge aided alignment matrix.
3. Performing passage-to-question attention and question-to-passage attention. On the one hand, following Seo et al. (2016), we perform passage-to-question attention to obtain the passage-attended question representations:

R_Q = C̃_Q · softmax_r(A)^⊤

where softmax_r represents normalizing each row in a matrix by softmax. On the other hand, following Xiong et al. (2016), we perform question-to-passage attention to obtain the question-attended passage representations:

R_P = C̃_P · softmax_c(A) · softmax_r(A)^⊤

where softmax_c represents normalizing each column in a matrix by softmax.
4. Generating preliminary memories. We concatenate C̃_P, R_Q, C̃_P ∘ R_Q, and C̃_P ∘ R_P across rows, put this concatenation through a highway network, use a BiLSTM to process the output of the highway network, and concatenate the forward LSTM output and the backward LSTM output across rows to obtain the preliminary memories M over the passage-question pair.
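The two core operations of this layer can be sketched in NumPy as follows: the gather-and-attend operation of step 1 (which the memory refining layer reuses), and the knowledge aided coattention of steps 2-4. The shapes, parameter names, and the zero-vector fallback for a word with an empty E_w are illustrative assumptions; the question-to-passage formula follows the DCN-style reading of Xiong et al. (2016) given above.

```python
import numpy as np

def softmax_vec(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gather_and_attend(C, c_w, E_w, W, v):
    # Step 1 (and the memory refining layer): gather the columns of C
    # at the positions in E_w, then attend to them with c_w as context:
    # z_w = Z * softmax(v^T tanh(W [c_w; Z]))^T.
    if not E_w:
        return np.zeros_like(c_w)  # fallback for empty E_w (an assumption)
    Z = C[:, sorted(E_w)]          # d x |E_w|
    paired = np.vstack([np.tile(c_w[:, None], (1, Z.shape[1])), Z])
    return Z @ softmax_vec(v @ np.tanh(W @ paired))

def softmax_r(A):
    # Normalize each row of A by softmax.
    e = np.exp(A - A.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def softmax_c(A):
    # Normalize each column of A by softmax.
    e = np.exp(A - A.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def coattention(Cp, Cq, w):
    # Steps 2-4: alignment matrix, both attentions, and the concatenation
    # that is fed to the highway network and BiLSTM.
    n, m = Cp.shape[1], Cq.shape[1]
    A = np.empty((n, m))
    for i in range(n):
        for j in range(m):
            A[i, j] = w @ np.concatenate(
                [Cp[:, i], Cq[:, j], Cp[:, i] * Cq[:, j]])
    Rq = Cq @ softmax_r(A).T                    # d x n
    Rp = Cp @ softmax_c(A) @ softmax_r(A).T     # d x n
    return np.vstack([Cp, Rq, Cp * Rq, Cp * Rp])  # 4d x n
```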

4.4 Memory Refining Layer

For each passage word, the memory refining layer fuses its preliminary memory with those of some other passage words to generate its refined memory over the passage-question pair. Inspired by Wang et al. (2017), we perform self-matching attention on the preliminary memories. However, we differ from Wang et al. (2017) in that for each passage word, we only match its preliminary memory with those of a corresponding subset of other passage words, which are selected according to the pre-extracted semantic level inter-word connections; therefore, our self-matching attention is named knowledge aided self-matching attention. Specifically, given a passage word p_i, whose preliminary memory is m_i, suppose we have obtained E_{p_i} through Algorithm 1; then we gather the columns in M whose positions are contained in E_{p_i}, represent these columns as Z_i, and perform attention on Z_i to obtain the E_{p_i}-attended preliminary memory:

z_i = Z_i · softmax(v^⊤ tanh(W (m_i ∥ Z_i)))^⊤

where W and v are trainable parameters. Based on the above descriptions, given a passage word p_i, we concatenate m_i and z_i across rows. In this way, for all passage words, we obtain G. We put G through a highway network, use a BiLSTM to process the output of the highway network, and concatenate the forward LSTM output and the backward LSTM output across rows to obtain the refined memories M̃ over the passage-question pair.

4.5 Answer Span Prediction Layer

In the answer span prediction layer, we first perform self attention on C_Q to obtain a summary r_Q of the question:

r_Q = C_Q · softmax(v^⊤ tanh(W C_Q))^⊤

where W and v are trainable parameters. Then with r_Q as the query, we perform attention on M̃ to obtain the answer start position distribution o_s:

o_s = softmax(v_s^⊤ tanh(W_s (r_Q ∥ M̃)))

where W_s and v_s are trainable parameters. Next we concatenate r_Q and M̃ · o_s across rows to obtain the query t. Finally, with t as the query, we perform attention on M̃ again to obtain the answer end position distribution o_e:

o_e = softmax(v_e^⊤ tanh(W_e (t ∥ M̃)))

where W_e and v_e are trainable parameters.

Based on the above descriptions, for the training, we minimize the sum of the negative log probabilities of the ground-truth answer start position and the ground-truth answer end position under the predicted distributions; for the inference, the answer start position a_s and the answer end position a_e are chosen such that the product of o_s(a_s) and o_e(a_e) is maximized subject to a_s ≤ a_e.
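At inference time, this constrained argmax can be computed with an outer product and an upper-triangular mask; a minimal sketch:

```python
import numpy as np

def best_span(o_start, o_end):
    # Maximize o_start[a_s] * o_end[a_e] subject to a_s <= a_e.
    scores = np.triu(np.outer(o_start, o_end))  # zero out a_e < a_s
    a_s, a_e = np.unravel_index(np.argmax(scores), scores.shape)
    return int(a_s), int(a_e)

# Toy usage with a 5-word passage:
print(best_span(np.array([0.1, 0.5, 0.2, 0.1, 0.1]),
                np.array([0.2, 0.1, 0.4, 0.2, 0.1])))  # -> (1, 2)
```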

5 Related Works

Data enrichment has been widely used in existing MRC models. For example, Yu et al. (2018) use translation models to paraphrase the original passages so as to generate extra training samples; Yang et al. (2017) generate extra training samples by training a generative model that generates questions based on unlabeled text; Chen et al. (2017), Liu et al. (2017b), and Pan et al. (2017) append linguistic tags, such as POS tags and NER tags, to each word in the original passages; and Liu et al. (2017a) generate a syntactic tree for each sentence in the original passage-question pairs. However, the above works just enrich the original MRC dataset with the outputs of certain external models or systems; therefore, their MRC models can only make use of machine-generated data, but cannot utilize human knowledge explicitly.
Attention mechanisms have also been widely used in existing MRC models. For example, Xiong et al. (2016) use a coattention encoder and a dynamic pointer decoder to address the local maximum problem; Seo et al. (2016) use a bidirectional attention flow mechanism to obtain the question-aware passage representation; and Wang et al. (2017) use a self-matching attention mechanism to refine the question-aware passage representation. The passage-to-question attention, question-to-passage attention, and self-matching attention in KAR draw on the ideas of the above works, but differ from them in that we integrate explicit knowledge with these attentions.

6 Experiments

6.1 MRC Dataset

The MRC dataset used in this paper is SQuAD, which contains over 100,000 passage-question pairs and their answers. All questions and answers in SQuAD are human-generated, and the answer to each question is a fragment of the corresponding passage. SQuAD has been randomly partitioned into three parts: the training set (80%), the development set (10%), and the test set (10%). Both the training set and the development set are publicly available, while the test set is confidential. Besides, SQuAD adopts both Exact Match (EM) and F1 Score as its evaluation metrics.

6.2 Implementation Details

To implement KAR, we first preprocess SQuAD. Specifically, we put each passage and question in SQuAD through a Stanford CoreNLP Manning et al. (2014) pipeline, which performs tokenization, sentence splitting, POS tagging, lemmatization, and dependency parsing in order. With the outputs of the pipeline, we use the WordNet interface provided by NLTK Bird and Loper (2004) to perform the WordNet-based data enrichment method, and thus obtain an enriched MRC dataset. Based on this preprocessing, we implement KAR using TensorFlow Abadi et al. (2016). For the training, we use ADAM Kingma and Ba (2014) as our optimizer. To avoid overfitting, we apply dropout Srivastava et al. (2014) to the word vectors, the character vectors, the input to each BiLSTM, and the linear transformation before each softmax in the answer span prediction layer, and we apply early stopping. To avoid the exploding gradient problem, we apply gradient clipping Pascanu et al. (2013). Besides, we also apply an exponential moving average to the trainable parameters.
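For reference, the training-step wiring described above can be sketched in TensorFlow 1.x as follows. The dummy loss and all hyper-parameter values (learning rate, clipping threshold, decay rate) are placeholders for illustration, not the values used in our experiments.

```python
import tensorflow as tf  # TensorFlow 1.x API

# A dummy scalar loss standing in for the span prediction loss.
w = tf.Variable(1.0)
loss = tf.square(w - 2.0)

optimizer = tf.train.AdamOptimizer(learning_rate=1e-3)
grads, varis = zip(*optimizer.compute_gradients(loss))
grads, _ = tf.clip_by_global_norm(grads, clip_norm=5.0)   # gradient clipping
step = optimizer.apply_gradients(zip(grads, varis))
ema = tf.train.ExponentialMovingAverage(decay=0.999)      # EMA of parameters
with tf.control_dependencies([step]):
    train_op = ema.apply(tf.trainable_variables())
```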

6.3 Experimental Process and Results

In this paper, we only consider the single model performance of MRC models on the development set of SQuAD. On this premise, we perform the following two experiments:
1. Verifying the effects of explicit knowledge. We obtain six enriched MRC datasets by setting χ to 0, 1, 2, 3, 4, and 5 separately, and train a separate KAR on each enriched MRC dataset. As shown in Table 2, the amount of the extraction results increases monotonically as we increase χ from 0 to 5, but during this process, the performance of KAR first rises until χ reaches 3, and then begins to drop gradually. Thus it can be seen that the explicit knowledge provided by the WordNet-based data enrichment method plays an effective role in the training of KAR.

χ | Amount of the Extraction Results (connections per word) | Performance of KAR (EM / F1)
0 | 0.39 | 70.8 / 79.8
1 | 0.63 | 71.1 / 80.0
2 | 1.24 | 71.6 / 80.4
3 | 2.21 | 72.4 / 81.1
4 | 3.68 | 71.9 / 80.7
5 | 5.58 | 71.8 / 80.5
Table 2: The amount of the extraction results and the performance of KAR under each setting for χ.

2. Verifying the effects of dependency embedding and synonymy embedding. Applying the optimal setting for χ (i.e. χ = 3), we perform an ablation analysis on the dependency embedding and the synonymy embedding. As shown in Table 3, both basic embeddings contribute to the performance of KAR, but the synonymy embedding appears to be more important.

Ablation Part | Performance (EM / F1)
Dependency Embedding | 71.6 / 80.2
Synonymy Embedding | 70.9 / 79.5
No Ablation | 72.4 / 81.1
Table 3: The ablation analysis on the dependency embedding and the synonymy embedding.

Besides, we also compare the best performance of KAR with the published performance of the MRC models mentioned in the related works. As shown in Table 4, although KAR achieves fairly good performance, there is still some way to go to catch up with the cutting-edge MRC models. We attribute this to the limited scope of the general knowledge in WordNet, which prevents KAR from obtaining enough useful explicit knowledge.

MRC Model | Performance (EM / F1)
GDAN Yang et al. (2017) | – / 67.2
DCN Xiong et al. (2016) | 65.4 / 75.6
BiDAF Seo et al. (2016) | 67.7 / 77.3
SEDT Liu et al. (2017a) | 68.1 / 77.5
DrQA Chen et al. (2017) | 69.5 / 78.8
MEMEN Pan et al. (2017) | 70.9 / 80.3
R-NET Wang et al. (2017) | 72.3 / 80.6
KAR (ours) | 72.4 / 81.1
QANet Yu et al. (2018) | 75.1 / 83.8
SAN Liu et al. (2017b) | 76.2 / 84.0
Table 4: The comparison of different MRC models (published single model performance on the development set of SQuAD).

7 Conclusion

In this paper, we explore how to apply the general knowledge in WordNet as explicit knowledge to the training of an MRC model, and thereby propose the WordNet-based data enrichment method and KAR. Based on the explicit knowledge provided by the data enrichment method, KAR achieves fairly good performance on SQuAD, and more importantly, the performance of KAR varies with the amount of the explicit knowledge. In future work, we will use larger knowledge bases, such as Freebase, to improve the quality of the explicit knowledge provided to KAR.
