Machine comprehension of text is an important problem in natural language processing. A recently released dataset, the Stanford Question Answering Dataset (SQuAD), offers a large number of real questions and their answers created by humans through crowdsourcing. SQuAD provides a challenging testbed for evaluating machine comprehension algorithms, partly because compared with previous datasets, in SQuAD the answers do not come from a small set of candidate answers and they have variable lengths. We propose an end-to-end neural architecture for the task. The architecture is based on match-LSTM, a model we proposed previously for textual entailment, and Pointer Net, a sequence-to-sequence model proposed by Vinyals et al.(2015) to constrain the output tokens to be from the input sequences. We propose two ways of using Pointer Net for our task. Our experiments show that both of our two models substantially outperform the best results obtained by Rajpurkar et al.(2016) using logistic regression and manually crafted features.READ FULL TEXT VIEW PDF
Machine comprehension of text is one of the ultimate goals of natural language processing. While the ability of a machine to understand text can be assessed in many different ways, in recent years, several benchmark datasets have been created to focus on answering questions as a way to evaluate machine comprehension (Richardson et al., 2013; Hermann et al., 2015; Hill et al., 2016; Weston et al., 2016; Rajpurkar et al., 2016). In this setup, typically the machine is first presented with a piece of text such as a news article or a story. The machine is then expected to answer one or multiple questions related to the text.
In most of the benchmark datasets, a question can be treated as a multiple choice question, whose correct answer is to be chosen from a set of provided candidate answers (Richardson et al., 2013; Hill et al., 2016). Presumably, questions with more given candidate answers are more challenging. The Stanford Question Answering Dataset (SQuAD) introduced recently by Rajpurkar et al. (2016) contains such more challenging questions whose correct answers can be any sequence of tokens from the given text. Moreover, unlike some other datasets whose questions and answers were created automatically in Cloze style (Hermann et al., 2015; Hill et al., 2016), the questions and answers in SQuAD were created by humans through crowdsourcing, which makes the dataset more realistic. Given these advantages of the SQuAD dataset, in this paper, we focus on this new dataset to study machine comprehension of text. A sample piece of text and three of its associated questions are shown in Table 1.
Traditional solutions to this kind of question answering tasks rely on NLP pipelines that involve multiple steps of linguistic analyses and feature engineering, including syntactic parsing, named entity recognition, question classification, semantic parsing, etc. Recently, with the advances of applying neural network models in NLP, there has been much interest in building end-to-end neural architectures for various NLP tasks, including several pieces of work on machine comprehension(Hermann et al., 2015; Hill et al., 2016; Yin et al., 2016; Kadlec et al., 2016; Cui et al., 2016). However, given the properties of previous machine comprehension datasets, existing end-to-end neural architectures for the task either rely on the candidate answers (Hill et al., 2016; Yin et al., 2016) or assume that the answer is a single token (Hermann et al., 2015; Kadlec et al., 2016; Cui et al., 2016), which make these methods unsuitable for the SQuAD dataset. In this paper, we propose a new end-to-end neural architecture to address the machine comprehension problem as defined in the SQuAD dataset.
Specifically, observing that in the SQuAD dataset many questions are paraphrases of sentences from the original text, we adopt a match-LSTM model that we developed earlier for textual entailment (Wang & Jiang, 2016). We further adopt the Pointer Net (Ptr-Net) model developed by Vinyals et al. (2015), which enables the predictions of tokens from the input sequence only rather than from a larger fixed vocabulary and thus allows us to generate answers that consist of multiple tokens from the original text. We propose two ways to apply the Ptr-Net model for our task: a sequence model and a boundary model. We also further extend the boundary model with a search mechanism. Experiments on the SQuAD dataset show that our two models both outperform the best performance reported by Rajpurkar et al. (2016). Moreover, using an ensemble of several of our models, we can achieve very competitive performance on SQuAD.
Our contributions can be summarized as follows: (1) We propose two new end-to-end neural network models for machine comprehension, which combine match-LSTM and Ptr-Net to handle the special properties of the SQuAD dataset. (2) We have achieved the performance of an exact match score of 67.9% and an F1 score of 77.0% on the unseen test dataset, which is much better than the feature-engineered solution (Rajpurkar et al., 2016). Our performance is also close to the state of the art on SQuAD, which is 71.6% in terms of exact match and 80.4% in terms of F1 from Salesforce Research. (3) Our further analyses of the models reveal some useful insights for further improving the method. Beisdes, we also made our code available online 111 https://github.com/shuohangwang/SeqMatchSeq.
|In 1870, Tesla moved to Karlovac, to attend school at the Higher Real Gymnasium, where he was profoundly influenced by a math teacher Martin Sekulić. The classes were held in German, as it was a school within the Austro-Hungarian Military Frontier. Tesla was able to perform integral calculus in his head, which prompted his teachers to believe that he was cheating. He finished a four-year term in three years, graduating in 1873.|
|1. In what language were the classes given?||German|
|2. Who was Tesla’s main influence in Karlovac?||Martin Sekulić|
|3. Why did Tesla go to Karlovac?||attend school at the Higher Real Gymnasium|
In this section, we first briefly review match-LSTM and Pointer Net. These two pieces of existing work lay the foundation of our method. We then present our end-to-end neural architecture for machine comprehension.
In a recent work on learning natural language inference, we proposed a match-LSTM model for predicting textual entailment (Wang & Jiang, 2016)
. In textual entailment, two sentences are given where one is a premise and the other is a hypothesis. To predict whether the premise entails the hypothesis, the match-LSTM model goes through the tokens of the hypothesis sequentially. At each position of the hypothesis, attention mechanism is used to obtain a weighted vector representation of the premise. This weighted premise is then to be combined with a vector representation of the current token of the hypothesis and fed into an LSTM, which we call the match-LSTM. The match-LSTM essentially sequentially aggregates the matching of the attention-weighted premise to each token of the hypothesis and uses the aggregated matching result to make a final prediction.
Vinyals et al. (2015) proposed a Pointer Network (Ptr-Net) model to solve a special kind of problems where we want to generate an output sequence whose tokens must come from the input sequence. Instead of picking an output token from a fixed vocabulary, Ptr-Net uses attention mechanism as a pointer to select a position from the input sequence as an output symbol. The pointer mechanism has inspired some recent work on language processing (Gu et al., 2016; Kadlec et al., 2016). Here we adopt Ptr-Net in order to construct answers using tokens from the input text.
Formally, the problem we are trying to solve can be formulated as follows. We are given a piece of text, which we refer to as a passage, and a question related to the passage. The passage is represented by matrix , where is the length (number of tokens) of the passage and is the dimensionality of word embeddings. Similarly, the question is represented by matrix where is the length of the question. Our goal is to identify a subsequence from the passage as the answer to the question.
As pointed out earlier, since the output tokens are from the input, we would like to adopt the Pointer Net for this problem. A straightforward way of applying Ptr-Net here is to treat an answer as a sequence of tokens from the input passage but ignore the fact that these tokens are consecutive in the original passage, because Ptr-Net does not make the consecutivity assumption. Specifically, we represent the answer as a sequence of integers , where each is an integer between 1 and , indicating a certain position in the passage.
Alternatively, if we want to ensure consecutivity, that is, if we want to ensure that we indeed select a subsequence from the passage as an answer, we can use the Ptr-Net to predict only the start and the end of an answer. In this case, the Ptr-Net only needs to select two tokens from the input passage, and all the tokens between these two tokens in the passage are treated as the answer. Specifically, we can represent the answer to be predicted as two integers , where an are integers between 1 and .
We refer to the first setting above as a sequence model and the second setting above as a boundary model. For either model, we assume that a set of training examples in the form of triplets are given.
An overview of the two neural network models are shown in Figure 1. Both models consist of three layers: (1) An LSTM preprocessing layer that preprocesses the passage and the question using LSTMs. (2) A match-LSTM layer that tries to match the passage against the question. (3) An Answer Pointer (Ans-Ptr) layer that uses Ptr-Net to select a set of tokens from the passage as the answer. The difference between the two models only lies in the third layer.
LSTM Preprocessing Layer
The purpose for the LSTM preprocessing layer is to incorporate contextual information into the representation of each token in the passage and the question. We use a standard one-directional LSTM (Hochreiter & Schmidhuber, 1997) 222As the output gates in the preprocessing layer affect the final performance little, we remove it in our experiments. to process the passage and the question separately, as shown below:
The resulting matrices and
are hidden representations of the passage and the question, whereis the dimensionality of the hidden vectors. In other words, the column vector (or ) in (or ) represents the token in the passage (or the question) together with some contextual information from the left.
We apply the match-LSTM model (Wang & Jiang, 2016) proposed for textual entailment to our machine comprehension problem by treating the question as a premise and the passage as a hypothesis. The match-LSTM sequentially goes through the passage. At position of the passage, it first uses the standard word-by-word attention mechanism to obtain attention weight vector as follows:
where , and are parameters to be learned, is the hidden vector of the one-directional match-LSTM (to be explained below) at position , and the outer product produces a matrix or row vector by repeating the vector or scalar on the left for times.
Essentially, the resulting attention weight above indicates the degree of matching between the token in the passage with the token in the question. Next, we use the attention weight vector to obtain a weighted version of the question and combine it with the current token of the passage to form a vector :
This vector is fed into a standard one-directional LSTM to form our so-called match-LSTM:
We further build a similar match-LSTM in the reverse direction. The purpose is to obtain a representation that encodes the contexts from both directions for each token in the passage. To build this reverse match-LSTM, we first define
Note that the parameters here (, , , , and ) are the same as used in Eqn. (2). We then define in a similar way and finally define to be the hidden representation at position produced by the match-LSTM in the reverse direction.
Let represent the hidden states and represent . We define as the concatenation of the two:
Answer Pointer Layer
The top layer, the Answer Pointer (Ans-Ptr) layer, is motivated by the Pointer Net introduced by Vinyals et al. (2015). This layer uses the sequence as input. Recall that we have two different models: The sequence model produces a sequence of answer tokens but these tokens may not be consecutive in the original passage. The boundary model produces only the start token and the end token of the answer, and then all the tokens between these two in the original passage are considered to be the answer. We now explain the two models separately.
The Sequence Model: Recall that in the sequence model, the answer is represented by a sequence of integers indicating the positions of the selected tokens in the original passage. The Ans-Ptr layer models the generation of these integers in a sequential manner. Because the length of an answer is not fixed, in order to stop generating answer tokens at certain point, we allow each to take up an integer value between 1 and , where is a special value indicating the end of the answer. Once is set to be , the generation of the answer stops.
In order to generate the answer token indicated by , first, the attention mechanism is used again to obtain an attention weight vector , where (
) is the probability of selecting thetoken from the passage as the token in the answer, and is the probability of stopping the answer generation at position . is modeled as follows:
where is the concatenation of with a zero vector, defined as , , and are parameters to be learned, follows the same definition as before, and is the hidden vector at position of an answer LSTM as defined below:
We can then model the probability of generating the answer sequence as
To train the model, we minimize the following loss function based on the training examples:
The Boundary Model: The boundary model works in a way very similar to the sequence model above, except that instead of predicting a sequence of indices , we only need to predict two indices and
. So the main difference from the sequence model above is that in the boundary model we do not need to add the zero padding to, and the probability of generating an answer is simply modeled as
We further extend the boundary model by incorporating a search mechanism. Specifically, during prediction, we try to limit the length of the span and globally search the span with the highest probability computed by . Besides, as the boundary has a sequence of fixed number of values, bi-directional Ans-Ptr can be simply combined to fine-tune the correct span.
In this section, we present our experiment results and perform some analyses to better understand how our models works.
We use the Stanford Question Answering Dataset (SQuAD) v1.1 to conduct our experiments. Passages in SQuAD come from 536 articles from Wikipedia covering a wide range of topics. Each passage is a single paragraph from a Wikipedia article, and each passage has around 5 questions associated with it. In total, there are 23,215 passages and 107,785 questions. The data has been split into a training set (with 87,599 question-answer pairs), a development set (with 10,570 question-answer pairs) and a hidden test set.
We first tokenize all the passages, questions and answers. The resulting vocabulary contains 117K unique words. We use word embeddings from GloVe (Pennington et al., 2014) to initialize the model. Words not found in GloVe are initialized as zero vectors. The word embeddings are not updated during the training of the model.
The dimensionality of the hidden layers is set to be 150 or 300. We use ADAMAX (Kingma & Ba, 2015) with the coefficients and to optimize the model. Each update is computed through a minibatch of 30 instances. We do not use L2-regularization.
The performance is measured by two metrics: percentage of exact match with the ground truth answers, and word-level F1 score when comparing the tokens in the predicted answers with the tokens in the ground truth answers. Note that in the development set and the test set each question has around three ground truth answers. F1 scores with the best matching answers are used to compute the average F1 score.
|Match-LSTM with Ans-Ptr (Sequence)||150||882K||54.4||-||68.2||-|
|Match-LSTM with Ans-Ptr (Boundary)||150||882K||61.1||-||71.2||-|
|Match-LSTM with Ans-Ptr (Boundary+Search)||150||882K||63.0||-||72.7||-|
|Match-LSTM with Ans-Ptr (Boundary+Search)||300||3.2M||63.1||-||72.7||-|
|Match-LSTM with Ans-Ptr (Boundary+Search+b)||150||1.1M||63.4||-||73.0||-|
|Match-LSTM with Bi-Ans-Ptr (Boundary+Search+b)||150||1.4M||64.1||64.7||73.9||73.7|
|Match-LSTM with Ans-Ptr (Boundary+Search+en)||150||882K||67.6||67.9||76.8||77.0|
The results of our models as well as the results of the baselines given by Rajpurkar et al. (2016) and Yu et al. (2016) are shown in Table 2. We can see that both of our two models have clearly outperformed the logistic regression model by Rajpurkar et al. (2016), which relies on carefully designed features. Furthermore, our boundary model has outperformed the sequence model, achieving an exact match score of 61.1% and an F1 score of 71.2%. In particular, in terms of the exact match score, the boundary model has a clear advantage over the sequence model. The improvement of our models over the logistic regression model shows that our end-to-end neural network models without much feature engineering are very effective on this task and this dataset. Considering the effectiveness of boundary model, we further explore this model. Observing that most of the answers are the spans with relatively small sizes, we simply limit the largest predicted span to have no more than 15 tokens and conducted experiment with span searching This resulted in 1.5% improvement in F1 on the development data and that outperformed the DCR model (Yu et al., 2016), which also introduced some language features such as POS and NE into their model. Besides, we tried to increase the memory dimension in the model or add bi-directional pre-processing LSTM or add bi-directional Ans-Ptr. The improvement on the development data using the first two methods is quite small. While by adding Bi-Ans-Ptr with bi-directional pre-processing LSTM, we can get 1.2% improvement in F1. Finally, we explore the ensemble method by simply computing the product of the boundary probabilities collected from 5 boundary models and then searching the most likely span with no more than 15 tokens. This ensemble method achieved the best performance as shown in the table.
To better understand the strengths and weaknesses of our models, we perform some further analyses of the results below.
First, we suspect that longer answers are harder to predict. To verify this hypothesis, we analysed the performance in terms of both exact match and F1 score with respect to the answer length on the development set. For example, for questions whose answers contain more than 9 tokens, the F1 score of the boundary model drops to around 55% and the exact match score drops to only around 30%, compared to the F1 score and exact match score of close to 72% and 67%, respectively, for questions with single-token answers. And that supports our hypothesis.
Next, we analyze the performance of our models on different groups of questions. We use a crude way to split the questions into different groups based on a set of question words we have defined, including “what,” “how,” “who,” “when,” “which,” “where,” and “why.” These different question words roughly refer to questions with different types of answers. For example, “when” questions look for temporal expressions as answers, whereas “where” questions look for locations as answers. According to the performance on the development data set, our models work the best for “when” questions. This may be because in this dataset temporal expressions are relatively easier to recognize. Other groups of questions whose answers are noun phrases, such as “what” questions, “which” questions and “where” questions, also get relatively better results. On the other hand, “why” questions are the hardest to answer. This is not surprising because the answers to “why” questions can be very diverse, and they are not restricted to any certain type of phrases.
Finally, we would like to check whether the attention mechanism used in the match-LSTM layer is effective in helping the model locate the answer. We show the attention weights in Figure 2. In the figure the darker the color is the higher the weight is. We can see that some words have been well aligned based on the attention weights. For example, the word “German” in the passage is aligned well to the word “language” in the first question, and the model successfully predicts “German” as the answer to the question. For the question word “who” in the second question, the word “teacher” actually receives relatively higher attention weight, and the model has predicted the phrase “Martin Sekulic” after that as the answer, which is correct. For the last question that starts with “why”, the attention weights are more evenly distributed and it is not clear which words have been aligned to “why”.
Machine comprehension of text has gained much attention in recent years, and increasingly researchers are building data-drive, end-to-end neural network models for the task. We will first review the recently released datasets and then some end-to-end models on this task.
A number of datasets for studying machine comprehension were created in Cloze style by removing a single token from a sentence in the original corpus, and the task is to predict the missing word. For example, Hermann et al. (2015) created questions in Cloze style from CNN and Daily Mail highlights. Hill et al. (2016) created the Children’s Book Test dataset, which is based on children’s stories. Cui et al. (2016) released two similar datasets in Chinese, the People Daily dataset and the Children’s Fairy Tale dataset.
Instead of creating questions in Cloze style, a number of other datasets rely on human annotators to create real questions. Richardson et al. (2013) created the well-known MCTest dataset and Tapaswi et al. (2016) created the MovieQA dataset. In these datasets, candidate answers are provided for each question. Similar to these two datasets, the SQuAD dataset (Rajpurkar et al., 2016) was also created by human annotators. Different from the previous two, however, the SQuAD dataset does not provide candidate answers, and thus all possible subsequences from the given passage have to be considered as candidate answers.
There have been a number of studies proposing end-to-end neural network models for machine comprehension. A common approach is to use recurrent neural networks (RNNs) to process the given text and the question in order to predict or generate the answers(Hermann et al., 2015). Attention mechanism is also widely used on top of RNNs in order to match the question with the given passage (Hermann et al., 2015; Chen et al., 2016). Given that answers often come from the given passage, Pointer Network has been adopted in a few studies in order to copy tokens from the given passage as answers (Kadlec et al., 2016; Trischler et al., 2016). Compared with existing work, we use match-LSTM to match a question and a given passage, and we use Pointer Network in a different way such that we can generate answers that contain multiple tokens from the given passage.
In this paper, We developed two models for the machine comprehension problem defined in the Stanford Question Answering (SQuAD) dataset, both making use of match-LSTM and Pointer Network. Experiments on the SQuAD dataset showed that our second model, the boundary model, could achieve an exact match score of 67.6% and an F1 score of 77% on the test dataset, which is better than our sequence model and Rajpurkar et al. (2016)’s feature-engineered model.
In the future, we plan to look further into the different types of questions and focus on those questions which currently have low performance, such as the “why’ questions. We also plan to test how our models could be applied to other machine comprehension datasets.
We thank Pranav Rajpurkar for testing our model on the hidden test dataset and Percy Liang for helping us with the Dockerfile for Codalab.
Proceedings of the International Conference on Machine Learning, 2016.
We show the performance breakdown by answer lengths and question types for our sequence model, boundary model and the ensemble model in Figure 3.