MEMEN: Multi-layer Embedding with Memory Networks for Machine Comprehension

07/28/2017 ∙ by Boyuan Pan, et al. ∙ 0

Machine comprehension(MC) style question answering is a representative problem in natural language processing. Previous methods rarely spend time on the improvement of encoding layer, especially the embedding of syntactic information and name entity of the words, which are very crucial to the quality of encoding. Moreover, existing attention methods represent each query word as a vector or use a single vector to represent the whole query sentence, neither of them can handle the proper weight of the key words in query sentence. In this paper, we introduce a novel neural network architecture called Multi-layer Embedding with Memory Network(MEMEN) for machine reading task. In the encoding layer, we employ classic skip-gram model to the syntactic and semantic information of the words to train a new kind of embedding layer. We also propose a memory network of full-orientation matching of the query and passage to catch more pivotal information. Experiments show that our model has competitive results both from the perspectives of precision and efficiency in Stanford Question Answering Dataset(SQuAD) among all published results and achieves the state-of-the-art results on TriviaQA dataset.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.


Machine comprehension(MC) has gained significant popularity over the past few years and it is a coveted goal in the field of natural language processing and artificial intelligence 

(Shen et al., 2016; Seo et al., 2016). Its task is to teach machine to understand the content of a given passage and then answer the question related to it. Figure 1 shows a simple example from the popular dataset SQuAD (Rajpurkar et al., 2016).

Figure 1: An example from the SQuAD dataset. The answer is a segment text of context.

Many significant works are based on this task, and most of them focus on the improvement of a sequence model that is augmented with an attention mechanism. However, the encoding of the words is also crucial and a better encoding layer can lead to substantial difference to the final performance. Many powerful methods (Hu et al., 2017; Wang et al., 2017; Seo et al., 2016) only represent their words in two ways, word-level embeddings and character-level embeddings. They use pre-train vectors, like GloVe(Pennington et al., 2014), to do the word-level embeddings, which ignore syntactic information and name entity of the words. Liu et al. (2017) construct a sequence of syntactic nodes for the words and encodes the sequence into a vector representation. However, they neglected the optimization of the initial embedding and didn’t take the semantic information of the words into account, which are very important parts in the vector representations of the words. For example, the word “Apple” is a fixed vector in GloVe and noun in syntactics whatever it represents the fruit or the company, but name entity tags can help recognize.

Moreover, the attention mechanism can be divided into two categories: one dimensional attention (Chen et al., 2016; Dhingra et al., 2016; Kadlec et al., 2016) and two dimensional attention (Cui et al., 2016; Wang et al., 2016; Seo et al., 2016). In one dimensional attention, the whole query is represented by one embedding vector, which is usually the last hidden state in the neural network. However, using only one vector to represent the whole query will attenuate the attention of key words. On the contrary, every word in the query has its own embedding vector in the situation of two dimensional attention, but many words in the question sentence are useless even if disturbing, such as the stopwords.

Figure 2: MEMEN structure overview. In the figure above, there are two hops in the memory network.

In this paper, we introduce the Multi-layer Embedding with Memory Networks(MEMEN), an end-to-end neural network for machine comprehension task. Our model consists of three parts: 1) the encoding of context and query, in which we add useful syntactic and semantic information in the embedding of every word, 2) the high-efficiency multi-layer memory network of full-orientation matching to match the question and context, 3) the pointer-network based answer boundary prediction layer to get the location of the answer in the passage. The contributions of this paper can be summarized as follows.

  • First, we propose a novel multi-layer embedding of the words in the passage and query. We use skip-gram model to train the part-of-speech(POS) tags and name-entity recognition(NER) tags embedding that represent the syntactic and semantic information of the words respectively. The analogy inference provided by skip-gram model can make the similar attributes close in their embedding space such that more adept at helping find the answer.

  • Second, we introduce a memory networks of full-orientation matching.To combines the advantages of one dimensional attention and two dimensional attention, our novel hierarchical attention vectors contain both of them. Because key words in query often appear at ends of the sentence, one-dimensional attention, in which the bi-directional last hidden states are regarded as representation, is able to capture more useful information compared to only applying two dimensional attention. In order to deepen the memory and better understand the passage according to the query, we employ the structure of multi-hops to repeatedly read the passage. Moreover, we add a gate to the end of each memory to improve the speed of convergence.

  • Finally, the proposed method yields competitive results on the large machine comprehension bench marks SQuAD and the state-of-the-art results on TriviaQA dataset. On SQuAD, our model achieves 75.37% exact match and 82.66% F1 score. Moreover, our model avoids the high computation complexity self-matching mechanism which is popular in many previous works, thus we spend much less time and memory when training the model.

Figure 3: The passage and its according transformed “passages”. The first row(green) is the original sentence from the passage, the second row(red) is the name-entity recognition(NER) tags, and the last row(blue) is the part-of-speech(POS) tags.

Model Structure

As Figure 2 shows, our machine reading model consists of three parts. First, we concatenate several layers of embedding of questions and contexts and pass them into a bi-directional RNN(Mikolov et al., 2010). Then we obtain the relationship between query and context through a novel full-orientation matching and apply memory networks in order to deeply understand. In the end, the output layer helps locate the answer in the passage.

Encoding of Context and Query

In the encoding layer, we represent all tokens in the context and question as a sequence of embeddings and pass them as the input to a recurrent neural network.

Word-level embeddings and character-level embeddings are first applied.We use pre-trained word vectors GloVe(Pennington et al., 2014)

to obtain the fixed word embedding of each word.The character-level embeddings are generated by using Convolutional Neural Networks(CNN) which is applied to the characters of each word(Kim, et al., 2014). This layer maps each token to a high dimensional vector space and is proved to be helpful in handling out-of-vocab(OOV) words.

We also use skip-gram model to train the embeddings of part-of-speech(POS) tags and named-entity recognition(NER) tags. We first transform all of the given training set into their part-of-speech(POS) tags and named-entity recognition(NER) tags, which can be showed in Figure 3. Then we employ skip-sram model, which is one of the core algorithms in the popular off-the-shelf embedding word2vec(Mikolov et al., 2013), to the transformed “passage” just like it works in word2vec for the normal passage. Given a sequence of training words in the transformed passage:

, the objective of the skip-gram model is to maximize the average log probability:

where is the size of the context which can be set manually, a large means more accurate results and more training time. The is defined by:

where and are the input and output vector of , and is the vocabulary size.

We finally get the fixed length embedding of each tag. Although the number of tags limits the effect of word analogy inference, it still be very helpful compared to simple one hot embedding since similar tags have similar surroundings.

In the end, we use a BiLSTM to encode both the context and query embeddings and obtain their representations and and the last hidden state of both directions of query representation .

where , , represent word-level embedding, character-level embedding and tags embedding respectively. is the concatenation of both directions’ last hidden state.

Memory Network of Full-Orientation Matching

Attention mechanism is a common way to link and blend the content between the context and query. Unlike previous methods that are either two dimensional matching or one dimensional matching, we propose a full-orientation matching layer that synthesizes both of them and thus combine the advantages of both side and hedge the weakness. After concatenating all the attention vectors, we will pass them into a bi-directional LSTM. We start by describing our model in the single layer case, which implements a single memory hop operation. We then show it can be stacked to give multiple hops in memory.

Integral Query Matching

The input of this step is the representations , and . At first we obtain the importance of each word in passage according to the integral query by means of computing the match between and each representation by taking the inner product followed by a softmax:

Subsequently the first matching module is the sum of the inputs weighted by attention :

Query-Based Similarity Matching

We then obtain an alignment matrix between the query and context by , is the weight parameter, is elementwise multiplication. Like Seo et al. (2017), we use this alignment matrix to compute whether the query words are relevant to each context word. For each context word, there is an attention weight that represents how much it is relevant to every query word:

means the softmax function is performed across the row vector, and each attention vector is , which is based on the query embedding. Hence the second matching module is , where each is the column of .

Context-Based Similarity Matching

When we consider the relevance between context and query, the most representative word in the query sentence can be chosen by , and the attention is . Then we obtain the last matching module

which is based on the context embedding. We put all of the memories in a linear function to get the integrated hierarchical matching module:

where is an simple linear function, and are matrixes that are tiled n times by and .

Moreover, Wang et al. (2017) add an additional gate to the input of RNN:

The gate is based on the integration of hierarchical attention vectors, and it effectively filtrates the part of tokens that are helpful in understanding the relation between passage and query. Additionally, we add a bias to improve the estimation:

Experiments prove that this gate can also accelerate the speed of convergence. Finally, the integrated memory is passed into a bi-directional LSTM, and the output will captures the interaction among the context words and the query words:

In multiple layers, the integrated hierarchical matching module can be regarded as the input of next layer after a dimensionality reduction processing. We call this memory networks of full-orientation matching.

Output layer

In this layer, we follow Wang and Jiang (2016) to use the boundary model of pointer networks (Vinyals et al., 2015) to locate the answer in the passage. Moreover, we follow Wang et al. (2017) to initialize the hidden state of the pointer network by a query-aware representation:

where , and are parameters, is the initial hidden state of the pointer network. Then we use the passage representation along with the initialized hidden state to predict the indices that represent the answer’s location in the passage:

where is parameter, that respectively represent the start point and the end point of the answer, is the vector that represents -th word in the passage of the final output of the memory networks. To get the next layer of hidden state, we need to pass weighted by current predicted probability

to the Gated Recurrent Unit(GRU)(Chung et al., 2014):

For the loss function, we minimize the sum of the negative probabilities of the true start and end indices by the predicted distributions.


Implementation Settings

The tokenizers we use in the step of preprocessing data are from Stanford CoreNLP (Manning et al., 2014). We also use part-of-speech tagger and named-entity recognition tagger in Stanford CoreNLP utilities to transform the passage and question. For the skip-gram model, our model refers to the word2vec module in open source software library, Tensorflow, the skip window is set as 2. The dataset we use to train the embedding of POS tags and NER tags are the training set given by SQuAD, in which all the sentences are tokenized and regrouped as a list. To improve the reliability and stabllity, we screen out the sentences whose length are shorter than 9. We use 100 one dimensional filters for CNN in the character level embedding, with width of 5 for each one. We set the hidden size as 100 for all the LSTM and GRU layers and apply dropout(Srivastava et al., 2014) between layers with a dropout ratio as 0.2. We use the AdaDelta (Zeiler, 2012) optimizer with a initial learning rate as 0.001. For the memory networks, we set the number of layer as 3.

TriviaQA Results

We first evaluate our model on a large scale reading comprehension dataset TriviaQA version1.0(Joshi et al., 2017). TriviaQA contains over 650K question-answer-evidence triples, that are derived from Web search results and Wikipedia pages – with highly differing levels of information redundancy. TriviaQA is the first dataset where questions are authored by trivia enthusiasts, independently of the evidence documents.

Figure 4: The performance of our MEMEN and baselines on TriviaQA dataset.
Figure 5: The performance of our MEMEN and competing approaches on SQuAD dataset as we submitted our model (May, 22, 2017). * indicates ensemble models.

There are two different metrics to evaluate model accuracy: Exact Match(EM) and F1 Score, which measures the weighted average of the precision and recall rate at character level. Because the evidence is gathered by an automated process, the documents are not guaranteed to contain all facts needed to answer the question. In addition to distant supervision evaluation, we also evaluate models on a verified subsets. Because the test set is not released, we train our model on training set and evaluate our model on dev set. As we can see in Figure 4, our model outperforms all other baselines and achieves the state-of-the-art result on all subsets on TriviaQA.

SQuAD Results

We also use the Stanford Question Answering Dataset (SQuAD) v1.1 to conduct our experiments. Passages in the dataset are retrieved from English Wikipedia by means of Project Nayuki’s Wikipedia’s internal PageRanks. They sampled 536 articles uniformly at random with a wide range of topics, from musical celebrities to abstract concepts.The dataset is partitioned randomly into a training set(80%), a development set(10%), and a hidden test set(10%). The host of SQuAD didn’t release the test set to the public, so everybody has to submit their model and the host will run it on the test set for them.

Figure 5 shows the performance of our model and competing approaches on the SQuAD. The results of this dataset are all exhibited on a leaderboard, and top methods are almost all ensemble models, our model achieves an exact match score of 75.37% and an F1 score of 82.66%, which is competitive to state-of-the-art method.

Ensemble Details

The main current ensemble methods in the machine comprehension is simply choosing the answer with the highest sum of confidence scores among several single models which are exactly identical except the random initial seed. However, the performance of ensemble model can obviously be better if there is some diversity among single models. In our SQuAD experiment, we get the value of learning rate and dropout ratio of each model by a gaussian distribution, in which the mean value are 0.001 and 0.2 respectively.

To keep the diversity in a reasonable scope, we set the variance of gaussian distribution as

and respectively. Finally, we build an ensemble model which consists of 14 single models with different parameters.

Speed and Efficiency

Compared to (Wang et al., 2017), which achieves state-of-the-art result on the SQuAD test set, our model doesn’t contain the self-matching attention layer which is stuck with high computational complexity. Our MEMEN was trained with NVIDIA Titan X GPU, and the training process of the 3-hops model took roughly 5 hours on a single GPU. However, an one-hop model took 22 hours when we added self-matching layer in attention memory. Although the accuracy is improved a little compared to one-hop MEMEN model, it declined sharply as the number of hops increased, not to speak of the disadvantage of running time. The reason might be that multi-hops model with self-matching layer is too complex to efficiently learn the features for this dataset. As a result, our model is competitive both in accuracy and efficiency.

Hops and Ablations

Figure 6 shows the performance of our single model on SQuAD dev set with different number of hops in the memory network. As we can see, both the EM and F1 score increase as the number of hops enlarges until it arrives 3. After the model achieves the best performance with 3 hops, the performance gets worse as the number of hops gets large, which might result in overfitting.

We also run the ablations of our single model on SQuAD dev set to evaluate the individual contribution. As Figure 7 shows, both syntactic embeddings and semantic embeddings contribute towards the model’s performance and the POS tags seem to be more important. The reason may be that the number of POS tags is larger than that of NER tags so the embedding is easier to train. For the full-orientation matching, we remove each kind of attention vector respectively and the linear function can handle any two of the rest hierarchical attention vectors. For ablating integral query matching, the result drops about 2% on both metrics and it shows that the integral information of query for each word in passage is crucial. The query-based similarity matching accounts for about 10% performance degradation, which proves the effectiveness of alignment context words against query. For context-based similarity matching, we simply took out the from the linear function and it is proved to be contributory to the performance of full-orientation matching.

Figure 6: Performance comparison among different number of hops on the SQuAD dev set.
Figure 7: Ablation results on the SQuAD dev set.

Related Work

Machine Reading Comprehension Dataset.

Several benchmark datasets play an important role in progress of machine comprehension task and question answering research in recent years. MCTest(Richardson et al., 2013) is one of the famous and high quality datasets. There are 660 fictional stories and 4 multiple choice questions per story contained in it, and the labels are all made by humans. Researchers also released cloze-style datasets(Hill et al., 2015; Hermann et al., 2015; Onishi et al., 2016; Paperno et al., 2016). However, these datasets are either not large enough to support deep neural network models or too easy to challenge natural language.

Recently, Rajpurkar et al. (2016) released the Stanford Question Answering dataset (SQuAD), which is almost two orders of magnitude larger than all previous hand-annotated datasets. Moreover, this dataset consists 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding passage, rather than a limited set of multiple choices or entities. TriviaQA (Joshi et al., 2017) is also a large and high quality dataset, and the crucial difference between TriviaQA and SQuAD is that TriviaQA questions have not been crowdsourced from pre-selected passages.

Attention Based Models for Machine Reading

Many works are based on the task of machine reading comprehension, and attention mechanism have been particularly successful(Xiong et al., 2016; Cui et al., 2016; Wang et al., 2016; Seo et al., 2016; Hu et al., 2017; Shen et al., 2016; Wang et al., 2017). Xiong et al. (2016) present a coattention encoder and dynamic decoder to locate the answer. Cui et al. (2016) propose a two side attention mechanism to compute the matching between the passage and query. Wang et al. (2016)

match the passage and query from several perspectives and predict the answer by globally normalizing probability distributions.

Seo et al. (2016) propose a bi-directional attention flow to achieve a query-aware context representation. Hu et al. (2017) propose self-aware representation and multi-hop query-sensitive pointer to predict the answer span. Shen et al. (2016)

propose iterarively inferring the answer with a dynamic number of steps trained with reinforcement learning.

Wang et al. (2017) employ gated self-matching attention to obtain the relation between the question and passage. Our MEMEN construct a hierarchical orientation attention mechanism to get a wider match while applying memory network(Sukhbaatar et al., 2015) for deeper understand.


In this paper, we introduce MEMEN for Machine comprehension style question answering. We propose the multi-layer embedding to encode the document and the memory network of full-orientation matching to obtain the interaction of the context and query. The experimental evaluation shows that our model achieves the state-of-the-art result on TriviaQA dataset and competitive result in SQuAD. Moreover, the ablations and hops analysis demonstrate the importance of every part of the hierarchical attention vectors and the benefit of multi-hops in memory network. For future work, we will focus on question generative method and sentence ranking in machine reading tasks.