This is an implementation of the Attention Sum Reader model as presented in "Text Comprehension with the Attention Sum Reader Network" available at http://arxiv.org/abs/1603.01547.
Several large cloze-style context-question-answer datasets have been introduced recently: the CNN and Daily Mail news data and the Children's Book Test. Thanks to the size of these datasets, the associated text comprehension task is well suited for deep-learning techniques that currently seem to outperform all alternative approaches. We present a new, simple model that uses attention to directly pick the answer from the context as opposed to computing the answer using a blended representation of words in the document as is usual in similar models. This makes the model particularly suitable for question-answering problems where the answer is a single word from the document. An ensemble of our models sets a new state of the art on all evaluated datasets.
Most of the information humanity has gathered up to this point is stored in the form of plain text. Hence the task of teaching machines how to understand this data is of utmost importance in the field of Artificial Intelligence. One way of testing the level of text understanding is simply to ask the system questions for which the answer can be inferred from the text. A well-known example of a system that could make use of a huge collection of unstructured documents to answer questions is IBM’s Watson system used for the Jeopardy challenge [Ferrucci et al.2010].
|Document: What was supposed to be a fantasy sports car ride at Walt Disney World Speedway turned deadly when a Lamborghini crashed into a guardrail. The crash took place Sunday at the Exotic Driving Experience, which bills itself as a chance to drive your dream car on a racetrack. The Lamborghini’s passenger, 36-year-old Gary Terry of Davenport, Florida, died at the scene, Florida Highway Patrol said. The driver of the Lamborghini, 24-year-old Tavon Watson of Kissimmee, Florida, lost control of the vehicle, the Highway Patrol said. (…)|
|Question: Officials say the driver, 24-year-old Tavon Watson, lost control of a _______|
|Answer candidates: Tavon Watson, Walt Disney World Speedway, Highway Patrol, Lamborghini, Florida, (…)|
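|Statistic||CNN train||CNN valid||CNN test||Daily Mail train||Daily Mail valid||Daily Mail test||CBT CN train||CBT CN valid||CBT CN test||CBT NE train||CBT NE valid||CBT NE test|
|Max # options||527||187||396||371||232||245||10||10||10||10||10||10|
|Avg # options||26.4||26.5||24.5||26.5||25.5||26.0||10||10||10||10||10||10|
|Avg # tokens||762||763||716||813||774||780||470||448||461||433||412||424|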
Cloze-style questions [Taylor1953], i.e. questions formed by removing a phrase from a sentence, are an appealing form of such questions (for example see Figure 1). While the task is easy to evaluate, one can vary the context, the question sentence or the specific phrase missing in the question to dramatically change the task structure and difficulty.
One way of altering the task difficulty is to vary the word type being replaced, as in [Hill et al.2015]. The complexity of such variation comes from the fact that the level of context understanding needed in order to correctly predict different types of words varies greatly. While predicting prepositions can easily be done using relatively simple models with very little context knowledge, predicting named entities requires a deeper understanding of the context.
Also, as opposed to selecting a random sentence from a text as in [Hill et al.2015], the question can be formed from a specific part of the document, such as a short summary or a list of tags. Since such sentences often paraphrase in a condensed form what was said in the text, they are particularly suitable for testing text comprehension [Hermann et al.2015].
An important property of cloze-style questions is that a large number of such questions can be automatically generated from real world documents. This opens the task to data-hungry techniques such as deep learning. This is an advantage compared to smaller machine understanding datasets like MCTest [Richardson et al.2013], which have only hundreds of training examples, so that the best performing systems on them usually rely on hand-crafted features [Sachan et al.2015, Narasimhan and Barzilay2015].
In the first part of this article we introduce the task at hand and the main aspects of the relevant datasets. Then we present our own model to tackle the problem. Subsequently we compare the model to previously proposed architectures and finally describe the experimental results on the performance of our model.
In this section we introduce the task that we are seeking to solve and relevant large-scale datasets that have recently been introduced for this task.
The task consists of answering a cloze-style question, the answer to which depends on the understanding of a context document provided with the question. The model is also provided with a set of possible answers from which the correct one is to be selected. This can be formalized as follows:
The training data consist of tuples $(\mathbf{q}, \mathbf{d}, a, A)$, where $\mathbf{q}$ is a question, $\mathbf{d}$ is a document that contains the answer to question $\mathbf{q}$, $A$ is a set of possible answers and $a \in A$ is the ground truth answer. Both $\mathbf{q}$ and $\mathbf{d}$ are sequences of words from vocabulary $V$. We also assume that all possible answers are words from the vocabulary, that is $A \subseteq V$, and that the ground truth answer appears in the document, that is $a \in \mathbf{d}$.
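As an illustration of this formalization, the sketch below stores one training tuple and checks the stated assumptions (all candidate answers are vocabulary words and the ground truth answer occurs in the document). The class name, field names and the toy example are hypothetical and are not taken from the released implementation.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class ClozeExample:
    """One training tuple (q, d, a, A); the field names are illustrative."""
    question: List[str]      # q: tokenized question containing a placeholder
    document: List[str]      # d: tokenized context document
    answer: str              # a: ground truth answer
    candidates: Set[str]     # A: set of possible answers

    def check(self, vocabulary: Set[str]) -> None:
        """Raise ValueError if the example violates the task assumptions."""
        if not set(self.question) | set(self.document) <= vocabulary:
            raise ValueError("q and d must consist of vocabulary words")
        if not self.candidates <= vocabulary:
            raise ValueError("all candidates must be vocabulary words")
        if self.answer not in self.candidates:
            raise ValueError("the answer must be one of the candidates")
        if self.answer not in self.document:
            raise ValueError("the answer must appear in the document")


# Toy example; the placeholder token XXXXX is an arbitrary choice.
example = ClozeExample(
    question="the driver lost control of a XXXXX".split(),
    document="the driver of the lamborghini lost control".split(),
    answer="lamborghini",
    candidates={"lamborghini", "driver"},
)
vocabulary = set(example.question) | set(example.document)
example.check(vocabulary)  # passes silently for a valid example
```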
We will now briefly summarize important features of the datasets.
The first two datasets [Hermann et al.2015] (available at https://github.com/deepmind/rc-data) were constructed from a large number of news articles from the CNN and Daily Mail websites. The main body of each article forms a context, while the cloze-style question is formed from one of the short highlight sentences appearing at the top of each article page. Specifically, the question is created by replacing a named entity from the summary sentence (e.g. “Producer X will not press charges against Jeremy Clarkson, his lawyer says.”).
Furthermore the named entities in the whole dataset were replaced by anonymous tokens which were further shuffled for each example so that the model cannot build up any world knowledge about the entities and hence has to genuinely rely on the context document to search for an answer to the question.
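The anonymization step can be sketched as follows. This is a simplified illustration that assumes single-token entities and a given entity list; the function name and the `@entityN` numbering scheme used here are illustrative, and the actual dataset construction in [Hermann et al.2015] may differ in its details.

```python
import random
from typing import Dict, List, Sequence


def anonymize(token_lists: Sequence[List[str]], entities: Sequence[str],
              rng: random.Random) -> List[List[str]]:
    """Replace entity tokens by @entityN ids that are shuffled for this example."""
    ids = list(range(len(entities)))
    rng.shuffle(ids)  # a fresh random assignment for every example
    mapping: Dict[str, str] = {e: f"@entity{i}" for e, i in zip(entities, ids)}
    # The same mapping is applied to the document and the question.
    return [[mapping.get(tok, tok) for tok in tokens] for tokens in token_lists]


document, question = anonymize(
    [["watson", "drove", "a", "lamborghini"], ["watson", "lost", "control"]],
    entities=["watson", "lamborghini"],
    rng=random.Random(0),
)
```

Because the ids are reassigned per example, an id such as `@entity1` carries no information across examples, which is what prevents the model from accumulating world knowledge about entities.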
A qualitative analysis of the reasoning patterns needed to answer questions in the CNN dataset, together with human performance on this task, is provided in [Chen et al.2016].
The third dataset, the Children’s Book Test (CBT) [Hill et al.2015] (available at http://www.thespermwhale.com/jaseweston/babi/CBTest.tgz), is built from books that are freely available thanks to Project Gutenberg (https://www.gutenberg.org/). Each context document is formed by 20 consecutive sentences taken from a children’s book story. Due to the lack of a summary, the cloze-style question is then constructed from the subsequent (21st) sentence.
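A minimal sketch of how such an example can be assembled from a tokenized book is shown below. The function name, the `placeholder` token and the way the removed word is chosen (`answer_index`) are assumptions for illustration; `context_size` defaults to 20, the context length used in CBT, and the selection of candidate answers is left out.

```python
from typing import List, Tuple


def make_cbt_example(sentences: List[List[str]], start: int, answer_index: int,
                     context_size: int = 20,
                     placeholder: str = "XXXXX") -> Tuple[List[str], List[str], str]:
    """Build (document, question, answer) from consecutive tokenized sentences."""
    context = sentences[start:start + context_size]
    query = list(sentences[start + context_size])   # the subsequent sentence
    answer = query[answer_index]
    query[answer_index] = placeholder               # remove the word to predict
    document = [tok for sent in context for tok in sent]
    return document, query, answer


toy_book = [["a", "b"], ["c", "d"], ["e", "f", "g"]]
doc, query, answer = make_cbt_example(toy_book, start=0, answer_index=1,
                                      context_size=2)
```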
One can also see how the task complexity varies with the type of the omitted word (named entity, common noun, verb, preposition). [Hill et al.2015] have shown that while standard LSTM language models reach human-level performance on predicting verbs and prepositions, they lag behind on named entities and common nouns. In this article we therefore focus only on predicting the first two word types.
Basic statistics about the CNN, Daily Mail and CBT datasets are summarized in Table 1.
Our model, called the Attention Sum Reader (AS Reader) (our implementation is available at https://github.com/rkadlec/asreader), is tailor-made to leverage the fact that the answer is a word from the context document. This is a double-edged sword. While it achieves state-of-the-art results on all of the mentioned datasets (where this assumption holds true), it cannot produce an answer which is not contained in the document. Intuitively, our model is structured as follows:
We compute a vector embedding of the query.
We compute a vector embedding of each individual word in the context of the whole document (contextual embedding).
Using a dot product between the question embedding and the contextual embedding of each occurrence of a candidate answer in the document, we select the most likely answer.
Our model uses one word embedding function and two encoder functions. The word embedding function $e$ translates words into vector representations. The first encoder function is a document encoder $f$ that encodes every word from the document $\mathbf{d}$ in the context of the whole document. We call this the contextual embedding. For convenience we will denote the contextual embedding of the $i$-th word in $\mathbf{d}$ as $f_i(\mathbf{d})$. The second encoder $g$ is used to translate the query $\mathbf{q}$ into a fixed length representation of the same dimensionality as each $f_i(\mathbf{d})$. Both encoders use word embeddings computed by $e$ as their input. Then we compute a weight for every word in the document as the dot product of its contextual embedding and the query embedding. This weight might be viewed as an attention over the document $\mathbf{d}$.
To form a proper probability distribution over the words in the document, we normalize the weights using the softmax function. This way we model the probability $s_i$ that the answer to query $\mathbf{q}$ appears at position $i$ in the document $\mathbf{d}$. In a functional form this is:
$$s_i \propto \exp\left(f_i(\mathbf{d}) \cdot g(\mathbf{q})\right) \tag{1}$$
Finally we compute the probability that word $w$ is a correct answer as:
$$P(w \mid \mathbf{q}, \mathbf{d}) \propto \sum_{i \in I(w, \mathbf{d})} s_i \tag{2}$$
where $I(w, \mathbf{d})$ is a set of positions where $w$ appears in the document $\mathbf{d}$. We call this mechanism pointer sum attention since we use attention as a pointer over discrete tokens in the context document and then we directly sum the word’s attention across all the occurrences. This differs from the usual use of attention in sequence-to-sequence models [Bahdanau et al.2015] where attention is used to blend representations of words into a new embedding vector. Our use of attention was inspired by Pointer Networks (Ptr-Nets) [Vinyals et al.2015].
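The pointer sum attention step can be sketched in a few lines of plain Python. The sketch assumes that the contextual embeddings of the document words and the query embedding have already been computed by the two encoders; the function name and the toy inputs are illustrative and are not taken from the released implementation.

```python
import math
from collections import defaultdict
from typing import Dict, Sequence


def attention_sum(doc_tokens: Sequence[str],
                  doc_embeddings: Sequence[Sequence[float]],
                  query_embedding: Sequence[float]) -> Dict[str, float]:
    """Return the answer probability of every distinct word in the document."""
    # Dot product between each contextual embedding and the query embedding.
    scores = [sum(x * y for x, y in zip(f_i, query_embedding))
              for f_i in doc_embeddings]
    # Softmax over document positions gives the attention s_i (Eq. 1).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    attention = [e / total for e in exps]
    # Sum the attention over all positions where each word occurs (Eq. 2).
    probs: Dict[str, float] = defaultdict(float)
    for token, s_i in zip(doc_tokens, attention):
        probs[token] += s_i
    return dict(probs)


# Toy input: a zero query embedding yields uniform attention over 3 positions.
probs = attention_sum(["a", "b", "a"],
                      [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
                      [0.0, 0.0])
# Selecting among given candidates; words absent from the document score 0.
best = max(["a", "b", "c"], key=lambda w: probs.get(w, 0.0))
```

In the toy input the word "a" occurs twice, so it collects two thirds of the uniform attention, which illustrates how the summation favours words that occur repeatedly in the document.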
A high level structure of our model is shown in Figure 2.
In our model the document encoder $f$ is implemented as a bidirectional Gated Recurrent Unit (GRU) network [Cho et al.2014, Chung et al.2014] whose hidden states form the contextual word embeddings, that is $f_i(\mathbf{d}) = \overrightarrow{f_i}(\mathbf{d}) \,\|\, \overleftarrow{f_i}(\mathbf{d})$, where $\|$ denotes vector concatenation and $\overrightarrow{f_i}$ and $\overleftarrow{f_i}$ denote forward and backward contextual embeddings from the respective recurrent networks. The query encoder $g$ is implemented by another bidirectional GRU network. This time the last hidden state of the forward network is concatenated with the last hidden state of the backward network to form the query embedding, that is $g(\mathbf{q}) = \overrightarrow{g_{|\mathbf{q}|}}(\mathbf{q}) \,\|\, \overleftarrow{g_1}(\mathbf{q})$. The word embedding function $e$ is implemented in a usual way as a look-up table $\mathbf{V}$. $\mathbf{V}$ is a matrix whose rows can be indexed by words from the vocabulary, that is $e(w) = \mathbf{V}_w, w \in V$. Therefore, each row of $\mathbf{V}$ contains the embedding of one word from the vocabulary. During training we jointly optimize the parameters of $f$, $g$ and $e$.
|what was supposed to be a fantasy sports car ride at @entity3 turned deadly when a @entity4 crashed into a guardrail . the crash took place sunday at the @entity8 , which bills itself as a chance to drive your dream car on a racetrack . the @entity4 ’s passenger , 36 - year - old @entity14 of @entity15 , @entity16 , died at the scene , @entity13 said . the driver of the @entity4 , 24 - year - old @entity18 of @entity19 , @entity16 , lost control of the vehicle , the @entity13 said .|
|officials say the driver , 24 - year - old @entity18 , lost control of a _____|
|@entity11 film critic @entity29 writes in his review that ”anyone nostalgic for childhood dreams of transformation will find something to enjoy in an uplifting movie that invests warm sentiment in universal themes of loss and resilience , experience and maturity . ” more : the best and worst adaptations of ”@entity” @entity43, @entity44 and @entity46 star in director @entity48’s crime film about a hit man trying to save his estranged son from a revenge plot. @entity11 chief film critic @entity52 writes in his review that the film|
|_____ stars in crime film about hit man trying to save his estranged son|
Several recent deep neural network architectures [Hermann et al.2015, Hill et al.2015, Chen et al.2016, Kobayashi et al.2016] were applied to the task of text comprehension. The last two architectures were developed independently at the same time as our work. All of these architectures use an attention mechanism that allows them to highlight places in the document that might be relevant to answering the question. We will now briefly describe these architectures and compare them to our approach.
Attentive and Impatient Readers were proposed in [Hermann et al.2015]. The simpler Attentive Reader is very similar to our architecture. It also uses bidirectional document and query encoders to compute an attention in a way similar to ours. The more complex Impatient Reader computes attention over the document after reading every word of the query. However, empirical evaluation has shown that both models perform almost identically on the CNN and Daily Mail datasets.
The key difference between the Attentive Reader and our model is that the Attentive Reader uses attention to compute a fixed length representation $\mathbf{r}$ of the document $\mathbf{d}$ that is equal to a weighted sum of contextual embeddings of words in $\mathbf{d}$, that is $\mathbf{r} = \sum_i s_i f_i(\mathbf{d})$. A joint query and document embedding $\mathbf{m}$ is then a non-linear function of $\mathbf{r}$ and the query embedding $g(\mathbf{q})$. This joint embedding $\mathbf{m}$ is in the end compared against all candidate answers $a' \in A$ using the dot product $e(a') \cdot \mathbf{m}$, and the scores are normalized by softmax. That is: $P(a' \mid \mathbf{q}, \mathbf{d}) \propto \exp\left(e(a') \cdot \mathbf{m}\right)$.
In contrast to the Attentive Reader, we select the answer from the context directly using the computed attention rather than using such attention for a weighted sum of the individual representations (see Eq. 2). The motivation for such simplification is the following.
Consider a context “A UFO was observed above our city in January and again in March.” and question “An observer has spotted a UFO in ___ .”
Since both January and March are equally good candidates, the attention mechanism might put the same attention on both these candidates in the context. The blending mechanism described above would compute a vector between the representations of these two words and propose the closest word as the answer - this may well happen to be February (it is indeed the case for Word2Vec trained on Google News). By contrast, our model would correctly propose January or March.
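The effect can be reproduced with a toy example. The two-dimensional embeddings below are made up so that "February" lies exactly between "January" and "March"; they are not Word2Vec vectors, and the nearest-neighbour lookup is only a stand-in for the answer selection of a blending model.

```python
import math

# Made-up 2-d embeddings, chosen so that "February" lies between the others.
embedding = {"January": (1.0, 0.0), "February": (2.0, 0.0), "March": (3.0, 0.0)}
# Equal attention on the two candidate occurrences in the context.
attention = {"January": 0.5, "March": 0.5}

# Blending: weighted sum of representations, then nearest word in the vocabulary.
blend = tuple(sum(attention[w] * embedding[w][k] for w in attention)
              for k in range(2))
blended_answer = min(embedding, key=lambda w: math.dist(embedding[w], blend))

# Attention sum: pick the word with the largest total attention instead.
pointer_answer = max(attention, key=attention.get)
```

With these vectors the blended representation coincides with the embedding of "February", so the blending approach returns a word that never occurs in the context, while the attention sum can only return "January" or "March".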
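A model presented in [Chen et al.2016] is inspired by the Attentive Reader. One difference is that the attention weights are computed with a bilinear term instead of a simple dot product, that is $s_i \propto \exp\left(f_i(\mathbf{d})^{\top} \mathbf{W} g(\mathbf{q})\right)$. The document embedding $\mathbf{r}$ is computed using a weighted sum as in the Attentive Reader, $\mathbf{r} = \sum_i s_i f_i(\mathbf{d})$. In the end $P(a' \mid \mathbf{q}, \mathbf{d}) \propto \exp\left(e'(a') \cdot \mathbf{r}\right)$, where $e'$ is a new embedding function.
Even though it is a simplification of the Attentive Reader, this model performs significantly better than the original.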
The best performing memory networks model setup, window memory, uses windows of fixed length (8) centered around the candidate words as memory cells. Due to this limited context window, the model is unable to capture dependencies out of scope of this window. Furthermore, the representation within such a window is computed simply as the sum of embeddings of words in that window. By contrast, in our model the representation of each individual word is computed using a recurrent network, which not only captures context from the entire document but also makes the embedding computation much more flexible than a simple sum.
To improve on the initial accuracy, a heuristic approach called self supervision is used in [Hill et al.2015] to help the network select the right supporting “memories” using an attention mechanism similar to ours. Plain MemNNs without this heuristic are not competitive on these machine reading tasks. Our model does not need any similar heuristics.
Our model architecture was inspired by Pointer Networks (Ptr-Nets) [Vinyals et al.2015] in using an attention mechanism to select the answer in the context rather than to blend words from the context into an answer representation. While a Ptr-Net consists of an encoder as well as a decoder, which uses the attention to select the output at each step, our model outputs the answer in a single step. Furthermore, the pointer networks assume that no input in the sequence appears more than once, which is not the case in our setting.
Our model combines the best features of the architectures mentioned above. We use recurrent networks to “read” the document and the query as done in [Hermann et al.2015, Chen et al.2016, Kobayashi et al.2016] and we use attention in a way similar to Ptr-Nets. We also use summation of attention weights in a way similar to MemNNs [Hill et al.2015].
From a high level perspective we simplify all the discussed text comprehension models by removing all transformations past the attention step. Instead we use the attention directly to compute the answer probability.
|Model||CNN valid||CNN test||Daily Mail valid||Daily Mail test|
|Attentive Reader †||61.6||63.0||70.5||69.0|
|Impatient Reader †||61.8||63.8||69.0||68.0|
|MemNNs (single model) ‡||63.4||66.8||NA||NA|
|MemNNs (ensemble) ‡||66.2||69.4||NA||NA|
|Dynamic Entity Repres. (max-pool)||71.2||70.7||NA||NA|
|Dynamic Entity Repres. (max-pool + byway)||70.8||72.0||NA||NA|
|Dynamic Entity Repres. + w2v||71.3||72.9||NA||NA|
|Chen et al. (2016) (single model)||72.4||72.4||76.9||75.8|
|AS Reader (single model)||68.6||69.5||75.0||73.9|
|AS Reader (avg for top 20%)||68.4||69.9||74.5||73.5|
|AS Reader (avg ensemble)||73.9||75.4||78.1||77.1|
|AS Reader (greedy ensemble)||74.5||74.8||78.7||77.7|
|Model||Named entity valid||Named entity test||Common noun valid||Common noun test|
|LSTMs (context+query) ‡||51.2||41.8||62.6||56.0|
|MemNNs (window memory + self-sup.) ‡||70.4||66.6||64.2||63.0|
|AS Reader (single model)||73.8||68.6||68.8||63.4|
|AS Reader (avg for top 20%)||73.3||68.4||67.7||63.2|
|AS Reader (avg ensemble)||74.5||70.6||71.1||68.9|
|AS Reader (greedy ensemble)||76.2||71.0||72.4||67.5|
In this section we evaluate our model on the CNN, Daily Mail and CBT datasets. We show that despite the model’s simplicity its ensembles achieve state-of-the-art performance on each of these datasets.
To train the model we used stochastic gradient descent with the ADAM update rule [Kingma and Ba2015] and a learning rate of 0.001 or 0.0005. During training we minimized the following negative log-likelihood with respect to $\theta$:
$$-\log P_{\theta}(a \mid \mathbf{q}, \mathbf{d})$$
where $a$ is the correct answer for query $\mathbf{q}$ and document $\mathbf{d}$, and $\theta$ represents the parameters of the encoder functions $f$ and $g$ and of the word embedding function $e$. The optimized probability distribution $P(a \mid \mathbf{q}, \mathbf{d})$ is defined in Eq. 2.
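For a batch of examples the objective can be computed as in the sketch below. It assumes that the model output for each example is available as a mapping from words to probabilities (as defined in Eq. 2) and that the per-example losses are summed over the batch; both the function name and the summation over the batch are assumptions of this illustration.

```python
import math
from typing import Dict, Sequence


def negative_log_likelihood(predictions: Sequence[Dict[str, float]],
                            answers: Sequence[str]) -> float:
    """Sum of -log P(a | q, d) over a batch of (prediction, answer) pairs."""
    return -sum(math.log(p[a]) for p, a in zip(predictions, answers))


batch_predictions = [{"x": 0.5, "y": 0.5}, {"x": 0.25, "y": 0.75}]
batch_answers = ["x", "x"]
loss = negative_log_likelihood(batch_predictions, batch_answers)
```

The loss is zero only when the model assigns probability 1 to every correct answer, and it grows without bound as the probability of a correct answer approaches zero.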
The initial weights in the word embedding matrix were drawn randomly uniformly from the interval $[-0.1, 0.1]$. Weights in the GRU networks were initialized by random orthogonal matrices [Saxe et al.2014] and biases were initialized to zero. We also used a gradient clipping [Pascanu et al.2012] threshold of 10 and batches of size 32.
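The text does not state whether gradients were clipped by value or by norm. The sketch below assumes clipping by the joint L2 norm of all gradients, in the spirit of [Pascanu et al.2012], with the threshold of 10 mentioned above; the function name is hypothetical.

```python
import math
from typing import List


def clip_by_global_norm(gradients: List[List[float]],
                        threshold: float = 10.0) -> List[List[float]]:
    """Rescale all gradients so that their joint L2 norm is at most threshold."""
    norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    # Gradients whose norm is already below the threshold are left unchanged.
    scale = min(1.0, threshold / norm) if norm > 0 else 1.0
    return [[g * scale for g in grad] for grad in gradients]


clipped = clip_by_global_norm([[30.0, 40.0]])   # norm 50 is rescaled to norm 10
unchanged = clip_by_global_norm([[3.0, 4.0]])   # norm 5 stays as it is
```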
During training we randomly shuffled all examples in each epoch. To speed up training, we always pre-fetched 10 batches worth of examples and sorted them according to document length. Hence each batch contained documents of roughly the same length.
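This batching scheme can be sketched as follows. The function and parameter names are illustrative; `prefetch` is the number of batches fetched and sorted at once, and the example call uses toy data.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def length_bucketed_batches(examples: Sequence[T], lengths: Sequence[int],
                            batch_size: int, prefetch: int,
                            rng: random.Random) -> List[List[T]]:
    """Shuffle, then sort each prefetched chunk by length before batching."""
    order = list(range(len(examples)))
    rng.shuffle(order)                       # new random order every epoch
    chunk = batch_size * prefetch            # number of examples fetched at once
    batches: List[List[T]] = []
    for start in range(0, len(order), chunk):
        window = sorted(order[start:start + chunk], key=lambda i: lengths[i])
        for b in range(0, len(window), batch_size):
            batches.append([examples[i] for i in window[b:b + batch_size]])
    return batches


toy_examples = list(range(100))
toy_lengths = [(i * 37) % 50 for i in toy_examples]
toy_batches = length_bucketed_batches(toy_examples, toy_lengths, batch_size=4,
                                      prefetch=5, rng=random.Random(0))
```

Sorting only within a prefetched chunk keeps most of the randomness of the shuffle while limiting the amount of padding needed inside each batch.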
For each batch of the CNN and Daily Mail datasets we randomly reshuffled the assignment of named entities to the corresponding word embedding vectors to match the procedure proposed in [Hermann et al.2015]. This guaranteed that word embeddings of named entities were used only as semantically meaningless labels not encoding any intrinsic features of the represented entities. This forced the model to truly deduce the answer from the single context document associated with the question. We also do not use pre-trained word embeddings to make our training procedure comparable to [Hermann et al.2015].
We did not perform any text pre-processing since the original datasets were already tokenized.
We do not use any regularization since in our experience it leads to longer training times of single models, while the performance of a model ensemble is usually the same. This way we can train the whole ensemble faster when using multiple GPUs for parallel training.
For additional details about the training procedure see Appendix A.
We evaluated the proposed model both as a single model and using ensemble averaging. Although the model computes attention for every word in the document we restrict the model to select an answer from a list of candidate answers associated with each question-document pair.
For single models we report results for the best model as well as the average accuracy of the 20% of models that perform best on validation data, since single models display considerable variation of results due to random weight initialization, even for identical hyperparameter values. Single model performance may consequently prove difficult to reproduce.
As for ensembles, we used simple averaging of the answer probabilities predicted by ensemble members. For ensembling we used 14, 16, 84 and 53 models for CNN, Daily Mail, CBT CN and CBT NE respectively. The ensemble models were chosen in one of two ways. In the first, which we call the avg ensemble, we took the top 70% of all trained models. In the second, we used the following algorithm: we started with the best performing model according to validation performance. Then in each step we tried adding the best performing model that had not been previously tried. We kept it in the ensemble if it improved the ensemble’s validation performance and discarded it otherwise. This way we gradually tried each model once. We call the resulting model a greedy ensemble.
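The greedy selection can be sketched as below. The sketch assumes that each model's validation output is stored as one word-to-probability mapping per example; all function names and the toy predictions are hypothetical, and accuracy on the validation answers stands in for "validation performance".

```python
from typing import Dict, List, Sequence

Prediction = Dict[str, float]          # answer -> probability for one example


def average_predictions(members: Sequence[Sequence[Prediction]]) -> List[Prediction]:
    """Average the answer probabilities of all ensemble members per example."""
    averaged = []
    for per_example in zip(*members):
        words = set().union(*per_example)
        averaged.append({w: sum(p.get(w, 0.0) for p in per_example) / len(per_example)
                         for w in words})
    return averaged


def accuracy(predictions: Sequence[Prediction], answers: Sequence[str]) -> float:
    """Share of examples whose most probable answer is the ground truth."""
    hits = sum(max(p, key=p.get) == a for p, a in zip(predictions, answers))
    return hits / len(answers)


def greedy_ensemble(models: Sequence[Sequence[Prediction]],
                    answers: Sequence[str]) -> List[int]:
    """Return the indices of the models kept by the greedy selection."""
    # Try the models in order of decreasing validation accuracy.
    ranked = sorted(range(len(models)), key=lambda i: accuracy(models[i], answers),
                    reverse=True)
    kept = [ranked[0]]
    best = accuracy(models[ranked[0]], answers)
    for i in ranked[1:]:
        trial = accuracy(average_predictions([models[j] for j in kept + [i]]), answers)
        if trial > best:                  # keep the model only if it helps
            kept.append(i)
            best = trial
    return kept


valid_answers = ["a", "b", "a"]
model_outputs = [
    [{"a": 0.9, "b": 0.1}, {"a": 0.6, "b": 0.4}, {"a": 0.9, "b": 0.1}],
    [{"a": 0.4, "b": 0.6}, {"a": 0.1, "b": 0.9}, {"a": 0.45, "b": 0.55}],
    [{"a": 0.9, "b": 0.1}, {"a": 0.6, "b": 0.4}, {"a": 0.9, "b": 0.1}],
]
selected = greedy_ensemble(model_outputs, valid_answers)
```

In the toy data the third model duplicates the first and is discarded, while the second model is weak on its own but corrects the first model's error on the second example and is therefore kept.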
Performance of our models on the CNN and Daily Mail datasets is summarized in Table 2, while Table 3 shows results on the CBT dataset. The tables also list performance of other published models that were evaluated on these datasets. Ensembles of our models set new state-of-the-art results on all evaluated datasets.
Table 4 then measures accuracy as the proportion of test cases where the ground truth was among the top $k$ answers proposed by the greedy ensemble model, for several values of $k$.
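This measure can be computed as in the following sketch, which assumes that the prediction for each test case is a mapping from candidate answers to probabilities. The function name and the toy data are illustrative.

```python
from typing import Dict, Sequence


def top_k_accuracy(predictions: Sequence[Dict[str, float]],
                   answers: Sequence[str], k: int) -> float:
    """Share of examples whose ground truth is among the k most probable answers."""
    hits = 0
    for probs, answer in zip(predictions, answers):
        top_k = sorted(probs, key=probs.get, reverse=True)[:k]
        hits += answer in top_k
    return hits / len(answers)


test_predictions = [{"a": 0.5, "b": 0.3, "c": 0.2}, {"a": 0.1, "b": 0.3, "c": 0.6}]
test_answers = ["b", "a"]
scores = [top_k_accuracy(test_predictions, test_answers, k) for k in (1, 2, 3)]
```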
CNN and Daily Mail. The CNN dataset is the most widely used dataset for evaluation of text comprehension systems published so far. Performance of our single model is slightly worse than performance of simultaneously published models [Chen et al.2016, Kobayashi et al.2016]. Compared to our work these models were trained with Dropout regularization [Srivastava et al.2014] which might improve single model performance. However, an ensemble of our models outperforms these models even though they use pre-trained word embeddings.
On the CNN dataset our single model with best validation accuracy achieves a test accuracy of 69.5%. The average performance of the top 20% of models according to validation accuracy is 69.9%, which is even 0.4% better than the single best-validation model. This shows that there were many models that performed better on the test set than the best-validation model. Fusing multiple models then gives a significant further increase in accuracy on both the CNN and Daily Mail datasets.
CBT. In named entity prediction our best single model, with an accuracy of 68.6%, performs 2% absolute better than the MemNN with self supervision, and the averaging ensemble performs 4% absolute better than the best previous result. In common noun prediction our single model is 0.4% absolute better than the MemNN; however, the ensemble improves the performance to 69%, which is 6% absolute better than the MemNN.
To further analyze the properties of our model, we examined the dependence of accuracy on the length of the context document (Figure 4), the number of candidate answers (Figure 5) and the frequency of the correct answer in the context (Figure 6).
On the CNN and Daily Mail datasets, the accuracy decreases with increasing document length (Figure 4(a)). We hypothesize this may be due to multiple factors. Firstly, long documents may make the task more complex. Secondly, such cases are quite rare in the training data (Figure 4(b)), which motivates the model to specialize on shorter contexts. Finally, the context length is correlated with the number of named entities, i.e. the number of possible answers, which is itself negatively correlated with accuracy (see Figure 5).
On the CBT dataset this negative trend seems to disappear (Fig. 4(c)). This supports the latter two explanations since the distribution of document lengths is somewhat more uniform (Figure 4(d)) and the number of candidate answers is constant (10) for all examples in this dataset.
The effect of an increasing number of candidate answers on the model’s accuracy can be seen in Figure 5(a). We can clearly see that as the number of candidate answers increases, the accuracy drops. On the other hand, the number of examples with a large number of candidate answers is quite small (Figure 5(b)).
Finally, since the summation of attention in our model inherently favours frequently occurring tokens, we also visualize how the accuracy depends on the frequency of the correct answer in the document. Figure 6(a) shows that the accuracy significantly drops as the correct answer gets less and less frequent in the document compared to other candidate answers. On the other hand, the correct answer is likely to occur frequently (Fig. 6(b)).
In this article we presented a new neural network architecture for natural language text comprehension. While our model is simpler than previously published models, it gives a new state-of-the-art accuracy on all evaluated datasets.
An analysis by [Chen et al.2016] suggests that on CNN and Daily Mail datasets a significant proportion of questions is ambiguous or too difficult to answer even for humans (partly due to entity anonymization) so the ensemble of our models may be very near to the maximal accuracy achievable on these datasets.
We would like to thank Tim Klinger for providing us with masked softmax code that we used in our implementation.
During training we evaluated the model performance after each epoch and stopped the training when the error on the validation set started increasing.
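The stopping rule can be sketched as follows. The two callbacks are placeholders for one epoch of training and one evaluation on the validation set, and `max_epochs` is a hypothetical safety limit; restoring the parameters of the best epoch is left out of this illustration.

```python
from typing import Callable, List


def train_with_early_stopping(run_epoch: Callable[[], None],
                              validation_error: Callable[[], float],
                              max_epochs: int = 20) -> List[float]:
    """Train until the validation error rises; return the error after each epoch."""
    history: List[float] = []
    for _ in range(max_epochs):
        run_epoch()
        history.append(validation_error())
        if len(history) > 1 and history[-1] > history[-2]:
            break                       # validation error started increasing
    return history


# Toy run with a fixed sequence of validation errors.
fake_errors = iter([0.5, 0.4, 0.45, 0.3])
epochs_run: List[int] = []
error_history = train_with_early_stopping(lambda: epochs_run.append(1),
                                          lambda: next(fake_errors))
```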
The models usually converged after two epochs of training. Time needed to complete a single epoch of training on each dataset on an Nvidia K40 GPU is shown in Table 5.
|Dataset||Time per epoch|
|CBT Named Entity||1h 5min|
|CBT Common Noun||0h 56min|
The hyperparameters, namely the recurrent hidden layer dimension and the source embedding dimension, were chosen by grid search. We started with a range of 128 to 384 for both parameters and subsequently kept increasing the upper bound by 128 until we started observing a consistent decrease in validation accuracy. The region of the parameter space that we explored together with the parameters of the model with best validation accuracy are summarized in Table 6.
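The search procedure can be sketched as below. The text does not define "a consistent decrease" precisely, so the sketch uses a simpler stopping rule as an assumption: the upper bound is raised by one step as long as the best configuration found so far lies on the current upper boundary. The function name, the `max_high` limit and the toy `validate` function are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, Tuple

Config = Tuple[int, int]               # (hidden layer dimension, embedding dimension)


def expanding_grid_search(validate: Callable[[int, int], float],
                          low: int = 128, high: int = 384, step: int = 128,
                          max_high: int = 2048) -> Tuple[Config, Dict[Config, float]]:
    """Return the best configuration and all measured validation accuracies."""
    results: Dict[Config, float] = {}
    values = list(range(low, high + 1, step))
    while True:
        for pair in product(values, repeat=2):
            if pair not in results:    # train only configurations not yet tried
                results[pair] = validate(*pair)
        best_pair = max(results, key=results.get)
        top = values[-1]
        # Stop once the best configuration no longer lies on the upper boundary.
        if top not in best_pair or top + step > max_high:
            return best_pair, results
        values.append(top + step)


# Toy validation function that peaks at hidden = 512, embedding = 256.
best_config, all_results = expanding_grid_search(
    lambda hidden, emb: -abs(hidden - 512) - abs(emb - 256))
```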
|Rec. Hid. Layer||Embedding|
In Section 6 we analysed how the test accuracy depends on how frequent the correct answer is compared to other answer candidates for the news datasets. The plots for the Children’s Book Test look very similar; we include them here for completeness.