Sequential Attention: A Context-Aware Alignment Function for Machine Reading

by   Sebastian Brarda, et al.
NYU college

In this paper we propose a neural network model with a novel Sequential Attention layer that extends soft attention by assigning weights to words in an input sequence in a way that takes into account not just how well that word matches a query, but how well surrounding words match. We evaluate this approach on the task of reading comprehension (on the Who did What and CNN datasets) and show that it dramatically improves a strong baseline--the Stanford Reader--and is competitive with the state of the art.




1 Introduction

Soft attention Bahdanau et al. (2014), a differentiable method for selecting the inputs for a component of a model from a set of possibilities, has been crucial to the success of artificial neural network models for natural language understanding tasks like reading comprehension that take short passages as inputs. However, standard approaches to attention in NLP select words with only very indirect consideration of their context, limiting their effectiveness. This paper presents a method to address this by adding explicit context sensitivity into the soft attention scoring function.

We demonstrate the effectiveness of this approach on the task of cloze-style reading comprehension. A problem in the cloze style consists of a passage p, a question q and an answer a drawn from among the entities mentioned in the passage. In particular, we use the CNN dataset Hermann et al. (2015), which introduced the task into widespread use in evaluating neural networks for language understanding, and the newer and more carefully quality-controlled Who did What dataset Onishi et al. (2016).

Figure 1: The Sequential Attention Model. RNNs first encode the question into a vector $j$ and the document into a sequence of vectors $H$. For each word index $i$ in the document, a scoring vector $\gamma_i$ is then computed from $j$ and $h_i$ using a function like the partial bilinear function shown here. These vectors are then used as inputs to another RNN layer, the outputs of which ($\eta_i$) are summed elementwise and used as attention scores ($\alpha_i$) in answer selection.

In standard approaches to soft attention over passages, a scoring function is first applied to every word in the source text to evaluate how closely that word matches a query vector (here, a function of the question). The resulting scores are then normalized and used as the weights in a weighted sum which produces an output or context vector summarizing the most salient words of the input, which is then used in a downstream model (here, to select an answer).
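The three steps just described (score every word, normalize, take a weighted sum) can be sketched in a few lines. This is only an illustrative sketch: it assumes a plain dot-product scoring function and uses Python lists instead of a tensor library, and the function names are hypothetical rather than taken from any of the cited models.

```python
import math

def softmax(scores):
    """Normalize raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def soft_attention(query, word_vectors):
    """Score each word against the query, normalize, and take a weighted sum."""
    # Dot-product scoring is an assumption of this sketch.
    scores = [dot(query, h) for h in word_vectors]
    weights = softmax(scores)
    dim = len(word_vectors[0])
    context = [sum(w * h[k] for w, h in zip(weights, word_vectors))
               for k in range(dim)]
    return weights, context

# Toy input: three 2-d word vectors; the query matches the second word best.
weights, context = soft_attention([0.0, 2.0],
                                  [[1.0, 0.0], [0.0, 1.0], [1.0, 0.2]])
```

Note that each score here depends on a single word vector and the query only, which is the limitation the next paragraph addresses.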

In this work we propose a novel scoring function for soft attention that we call Sequential Attention (SA), shown in Figure 1. In an SA model, a multiplicative interaction scoring function is used to produce a scoring vector for each word in the source text. A newly-added bidirectional RNN then consumes those vectors and uses them to produce a context-aware scalar score for each word. We evaluate this scoring function within the context of the Stanford Reader Chen et al. (2016), and show that it yields dramatic improvements in performance. On both datasets, it is outperformed only by the Gated Attention Reader Dhingra et al. (2016), which in some cases has access to features not explicitly seen by our model.

2 Related Work

In addition to the Stanford Reader model of Chen et al. (2016), several other modeling approaches have been developed to address these reading comprehension tasks.

Seo et al. (2016) introduced Bi-Directional Attention Flow, which consists of a multi-stage hierarchical process to represent context at different levels of granularity; it uses the concatenation of the passage word representation, the question word representation, and the element-wise product of these vectors in its attention flow layer. This is a more complex variant of the classic bilinear term: the concatenated vector is multiplied with a vector of weights, producing attention scalars. The Gated-Attention Reader of Dhingra et al. (2016) integrates a multi-hop structure with a novel attention mechanism, essentially building query-specific representations of the tokens in the document to improve prediction. This model applies a classic dot-product soft attention to weight the query representations, which are then multiplied element-wise with the context representations and fed into the next RNN layer. After several hidden layers that repeat the same process, the dot product between the context representation and the query is used to compute a classic soft attention.
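As a rough illustration of the concatenation-based score described for Bi-Directional Attention Flow, write $h$ for a passage word representation, $u$ for a question word representation, and $w$ for the weight vector. These symbol names are assumptions made for this sketch, not notation from the text above.

```latex
s(h, u) = w^\top [\, h \,;\, u \,;\, h \circ u \,]
```

Here $[\,\cdot\,;\,\cdot\,]$ denotes concatenation and $\circ$ the element-wise product, so the result is a single attention scalar per (passage word, question word) pair.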

Outside the task of reading comprehension there has been other work on soft attention over text, largely focusing on the problem of attending over single sentences. Luong et al. (2015) study several issues in the design of soft attention models in the context of translation, and introduce the bilinear scoring function. They also propose the idea of attention input-feeding, where the original attention vectors are concatenated with the hidden representations of the words and fed into the next RNN step. The goal is to make the model fully aware of the previous alignment choices.

In work largely concurrent to our own, Kim et al. (2017) explore the use of conditional random fields (CRFs) to impose a variety of constraints on attention distributions, achieving strong results on several sentence-level tasks.

3 Modeling

Given the tuple (passage, question, answer), our goal is to predict $\Pr(a \mid p, q)$, where $a$ refers to the answer, $p$ to the passage, and $q$ to the question. We define the words of each passage and question as $p = p_1, \ldots, p_m$ and $q = q_1, \ldots, q_l$, respectively, where exactly one $q_i$ contains the token @blank, representing a blank that can be correctly filled in by the answer. With calibrated probabilities $\Pr(a \mid p, q)$, we take the $\arg\max_a \Pr(a \mid p, q)$, where the possible $a$'s are restricted to the subset of anonymized entity symbols present in $p$. In this section, we present two models for this reading comprehension task: the Stanford Reader of Chen et al. (2016), and our version with a novel attention mechanism, which we call the Sequential Attention model.

3.1 Stanford Reader


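Each word or entity symbol is mapped to a $d$-dimensional vector via an embedding matrix $E \in \mathbb{R}^{d \times |V|}$. For simplicity, we denote the vectors of the passage and question as $p = p_1, \ldots, p_m$ and $q = q_1, \ldots, q_l$, respectively. The Stanford Reader Chen et al. (2016) uses bidirectional GRUs Cho et al. (2014) to encode the passage and question. For the passage, the hidden state is defined as $h_i = \mathrm{concat}(\overrightarrow{h_i}, \overleftarrow{h_i})$, where the contextual embeddings of each word in the passage are encoded in both directions:

$\overrightarrow{h_i} = \mathrm{GRU}(\overrightarrow{h_{i-1}}, p_i)$
$\overleftarrow{h_i} = \mathrm{GRU}(\overleftarrow{h_{i+1}}, p_i)$

And for the question, the last hidden representation of each direction is concatenated:

$j = \mathrm{concat}(\overrightarrow{j_l}, \overleftarrow{j_1})$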

Attention and answer selection

The Stanford Reader uses bilinear attention Luong et al. (2015):

$\alpha_i = \mathrm{softmax}_i(j^\top W_s h_i)$

where $W_s$ is a learned parameter matrix of the bilinear term that computes the similarity between the question vector $j$ and the passage hidden state $h_i$ with greater flexibility than a dot product. The output vector $o$ is then computed as a linear combination of the hidden representations of the passage, weighted by the attention coefficients:

$o = \sum_i \alpha_i h_i$

The prediction is the answer, $\hat{a}$, with highest probability from among the anonymized entities:

$\hat{a} = \arg\max_{a \in p \cap \mathrm{entities}} w_a^\top o$

Here, $W_o$ is the weight matrix that maps the output to the entities, and $w_a$ represents the column of $W_o$ for a certain entity $a$. Finally, a softmax layer is added on top of $w_a^\top o$ with a negative log-likelihood objective for training.
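The attention and answer-selection steps of this subsection can be sketched as follows. This is a minimal illustration, not the reference implementation: the names `j`, `H`, `W_s`, `entity_vectors` and `passage_entities` are assumptions chosen for this sketch, the encoders are replaced by hand-written toy vectors, and the final training softmax is omitted.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def stanford_reader_step(j, H, W_s, entity_vectors, passage_entities):
    # Bilinear scores j^T W_s h_i, normalized over passage positions.
    alpha = softmax([dot(j, matvec(W_s, h)) for h in H])
    # Output vector: attention-weighted sum of the passage hidden states.
    o = [sum(a * h[k] for a, h in zip(alpha, H)) for k in range(len(H[0]))]
    # Score only the entities that occur in the passage, then take the argmax.
    logits = {e: dot(entity_vectors[e], o) for e in passage_entities}
    return alpha, o, max(logits, key=logits.get)

# Toy inputs (hypothetical): identity bilinear matrix, two passage positions.
j = [1.0, 0.0]
H = [[1.0, 0.0], [0.0, 1.0]]
W_s = [[1.0, 0.0], [0.0, 1.0]]
entity_vectors = {"@entity0": [1.0, 0.0],
                  "@entity1": [0.0, 1.0],
                  "@entity2": [5.0, 5.0]}  # not in the passage, never scored
alpha, o, answer = stanford_reader_step(j, H, W_s, entity_vectors,
                                        ["@entity0", "@entity1"])
```

In the toy input, `@entity2` would receive the largest raw score, but it is excluded because candidate answers are restricted to entities present in the passage.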

3.2 Sequential Attention

In the Sequential Attention model, instead of producing a single scalar value $\alpha_i$ for each word in the passage by using a bilinear term, we define the vectors $\gamma_i$ with a partial-bilinear term. (Note that applying a softmax over the sum of the elements of the $\gamma_i$ vectors would yield the same $\alpha_i$ as the Stanford Reader.) Instead of taking the dot product as in the bilinear term, we conduct an element-wise multiplication between the question vector $j$ and the transformed passage hidden state to produce a vector instead of a scalar:

$\gamma_i = j \circ W_s h_i$

where $W_s$ is a matrix of learned parameters. It is also possible to use a plain element-wise multiplication, thus dispensing with the parameters $W_s$:

$\gamma_i = j \circ h_i$
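The equivalence with the Stanford Reader noted in this subsection can be checked term by term. Writing $j$ for the question vector, $h_i$ for the $i$-th passage hidden state, $W_s$ for the learned matrix, and $\mathbf{1}$ for the all-ones vector (symbol names assumed here for illustration), summing the elements of the partial-bilinear vector recovers the bilinear score:

```latex
\mathbf{1}^\top (j \circ W_s h_i) = \sum_k j_k \, (W_s h_i)_k = j^\top W_s h_i
```

So the partial-bilinear vector holds the individual summands of the bilinear score, one per dimension, before they are collapsed into a scalar.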


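We then feed the $\gamma_i$ vectors into a new bidirectional GRU layer to get the hidden attention vector representation $\eta_i$:

$\overrightarrow{\eta_i} = \mathrm{GRU}(\overrightarrow{\eta_{i-1}}, \gamma_i)$
$\overleftarrow{\eta_i} = \mathrm{GRU}(\overleftarrow{\eta_{i+1}}, \gamma_i)$

We concatenate the directional vectors to be consistent with the structure of previous layers:

$\eta_i = \mathrm{concat}(\overrightarrow{\eta_i}, \overleftarrow{\eta_i})$

Finally, we compute the weights $\alpha_i$ by summing the elements of each $\eta_i$ and normalizing, and proceed as before:

$\alpha_i = \mathrm{softmax}_i(\mathbf{1}^\top \eta_i)$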
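The Sequential Attention layer can be sketched end to end as below. This is a simplified illustration under stated assumptions, not the authors' Theano code: the GRU cell has no bias terms, weights are drawn from an arbitrary small uniform range, inputs are random toy vectors, and all names (`GRUCell`, `run`, `sequential_attention`) are hypothetical.

```python
import math
import random

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class GRUCell:
    """Minimal GRU cell without bias terms (a simplification for this sketch)."""
    def __init__(self, in_dim, hid_dim, rng):
        self.hid_dim = hid_dim
        # One input and one recurrent matrix per gate: update, reset, candidate.
        self.W = {g: rand_matrix(hid_dim, in_dim, rng) for g in "zrc"}
        self.U = {g: rand_matrix(hid_dim, hid_dim, rng) for g in "zrc"}

    def step(self, h_prev, x):
        def pre(g, h):
            return [a + b for a, b in
                    zip(matvec(self.W[g], x), matvec(self.U[g], h))]
        z = [sigmoid(v) for v in pre("z", h_prev)]
        r = [sigmoid(v) for v in pre("r", h_prev)]
        gated = [ri * hi for ri, hi in zip(r, h_prev)]
        c = [math.tanh(v) for v in pre("c", gated)]
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, c)]

def run(cell, xs):
    """Run the cell over a sequence, returning the hidden state at each step."""
    h, out = [0.0] * cell.hid_dim, []
    for x in xs:
        h = cell.step(h, x)
        out.append(h)
    return out

def sequential_attention(j, H, fwd, bwd, W_s=None):
    # Scoring vectors: j * (W_s h_i) if W_s is given, else element-wise j * h_i.
    gammas = [[a * b for a, b in zip(j, matvec(W_s, h) if W_s else h)]
              for h in H]
    eta_f = run(fwd, gammas)
    eta_b = run(bwd, gammas[::-1])[::-1]
    # Sum the elements of the concatenated hidden vectors, then normalize.
    return softmax([sum(f) + sum(b) for f, b in zip(eta_f, eta_b)])

rng = random.Random(0)
dim = 4
j = [rng.uniform(-1, 1) for _ in range(dim)]
H = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(5)]
alpha = sequential_attention(j, H, GRUCell(dim, dim, rng), GRUCell(dim, dim, rng))
```

Because each attention score is read off a recurrent state, it depends on the scoring vectors of neighbouring words as well as the word's own, which is the context sensitivity the model is designed to add.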


4 Experiments and Results

We evaluate our model on two tasks, CNN and Who did What (WDW). For CNN, we used the anonymized version of the dataset released by Hermann et al. (2015), with its training, dev, and test sets. For WDW, we used the data generation script of Onishi et al. (2016) to reproduce their WDW data, again yielding training, dev, and test sets. (In the WDW data we found 340 examples in the strict training set, 545 examples in the relaxed training set, 20 examples in the test set, and 30 examples in the validation set that were not answerable because the anonymized answer entity did not exist in the passage. We removed these examples, which reduced the size of the WDW test set by 20 examples. We believe this difference is not significant and did not bias the comparison between models.) We used the strict version of WDW.
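The filtering step described for WDW (dropping examples whose answer entity never appears in the passage) might look like the sketch below. The dictionary field names and the `@entityN` marker format are assumptions made for illustration; the actual data files may be organized differently.

```python
def is_answerable(example):
    """Keep an example only if its answer entity occurs among the passage tokens."""
    return example["answer"] in example["passage"].split()

def filter_answerable(examples):
    """Return the answerable examples and the number of examples removed."""
    kept = [ex for ex in examples if is_answerable(ex)]
    return kept, len(examples) - len(kept)

# Hypothetical toy data: the second example's answer is absent from its passage.
examples = [
    {"passage": "@entity0 met @entity1 in Paris",
     "question": "@blank met @entity1",
     "answer": "@entity0"},
    {"passage": "@entity2 spoke on Monday",
     "question": "@blank spoke",
     "answer": "@entity5"},
]
kept, removed = filter_answerable(examples)
```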


We implemented all our models in Theano Theano Development Team (2016) and Lasagne Dieleman et al. (2015) and used the Stanford Reader Chen et al. (2016) open source implementation as a reference. We largely used the same hyperparameters as Chen et al. (2016) in the Stanford Reader, including the embedding size, GloVe Pennington et al. (2014) word embeddings for initialization (the GloVe word vectors used were pretrained on 6 billion tokens from Wikipedia 2014 and Gigaword 5, with an uncased vocabulary of 400K words), and the hidden size. The size of the hidden layer of the bidirectional RNN used to encode the attention vectors is double the size of the one that encodes the words, since it receives vectors that result from the concatenation of GRUs that go in both directions. Attention and output parameters were initialized from one distribution, while GRU weights were initialized from another. Learning was carried out with SGD with a learning rate of 0.1, a batch size of 32, gradient clipping at norm 10, and dropout in all the vertical layers (including the Sequential Attention layer). (We also tried increasing the hidden size to 200, using 200d GloVe word representations, and increasing the dropout rate to 0.3. Finally, we increased the number of hidden encoding layers to two. None of these changes resulted in significant performance improvements, in accordance with Chen et al. (2016).) Also, all the anonymized entities were relabeled according to the order of occurrence, as in the Stanford Reader. We trained all models for 30 epochs.
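Relabeling anonymized entities by order of occurrence can be sketched as follows. The `@entityN` marker format and the choice to scan the passage before the question are assumptions of this sketch; the Stanford Reader implementation may differ in such details.

```python
import re

# Assumed marker format for anonymized entities.
ENTITY = re.compile(r"@entity\d+")

def relabel_entities(passage, question, answer):
    """Renumber entity markers by first occurrence (passage first, then question)."""
    mapping = {}
    for token in ENTITY.findall(passage + " " + question):
        # The first time a marker is seen, it gets the next free index.
        mapping.setdefault(token, "@entity%d" % len(mapping))

    def rewrite(text):
        # Single-pass substitution, so a new label is never rewritten again.
        return ENTITY.sub(lambda m: mapping[m.group(0)], text)

    return rewrite(passage), rewrite(question), mapping.get(answer, answer)

p, q, a = relabel_entities("@entity7 praised @entity3 after @entity7 won",
                           "@blank praised @entity3",
                           "@entity7")
```

After relabeling, the first entity mentioned is always `@entity0`, the second `@entity1`, and so on, so an entity index carries no identity across examples.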

Model                     WDW Strict   CNN
Attentive Reader          53%          63%
Stanford Reader           65.6%        73.4%
  + SA partial-bilinear   67.2%        77.1%
Gated Att. Reader         71.2%        77.9%

Table 1: Accuracy on WDW and CNN test sets

Figure 2: Representative sample output for the Stanford Reader and our model.

4.1 Results

Who did What

In our experiments, the Stanford Reader (SR) achieved an accuracy of 65.6% on the strict WDW dataset, compared to the accuracy that Onishi et al. (2016) reported for it. The Sequential Attention model (SA) with partial-bilinear scoring function got 67.2%, which is the second best performance on the leaderboard, only surpassed by the 71.2% from the Gated Attention Reader (GA) with qe-comm Li et al. (2016) features and fixed GloVe embeddings. However, the GA model without qe-comm features and fixed embeddings performs significantly worse. We did not use these features in our SA models, and it is likely that adding these features could further improve SA model performance. We also experimented with fixed embeddings in SA models, but fixed embeddings reduced SA performance.

Another experiment we conducted was to add training samples from CNN to the WDW data. This increase in the training data size boosted accuracy for both the SR and the Sequential Attention model. This improvement strongly suggests that the gap in performance/difficulty between the CNN and the WDW datasets is partially related to the difference in the training set sizes, which results in overfitting.


For a final sanity check and a fair comparison against a well known benchmark, we ran our Sequential Attention model on exactly the same CNN data used by Chen et al. (2016).

The Sequential Attention model with partial-bilinear attention scoring function took an average of 2X more time per epoch to train vs. the Stanford Reader. However, our model converged in only 17 epochs vs. 30 for the SR. The results of training the SR on CNN (73.4%) were slightly lower than those reported by Chen et al. (2016). The Sequential Attention model achieved 77.1% accuracy, a 3.7% gain with respect to SR.

4.1.1 Model comparison on CNN

After achieving good performance with SA we wanted to understand what was driving the increase in accuracy. It is clear that SA has more trainable parameters compared to SR. However, it was not clear if the additional computation required to learn those parameters should be allocated in the attention mechanism, or used to compute richer hidden representations of the passage and questions. Additionally, the bilinear parameters increase the computational requirements, but their impact on performance was not clear. To answer these questions we compared the following models: i) SR with dot-product attention; ii) SR with bilinear attention; iii) SR with two layers (to compute the hidden question and passage representations) and dot-product attention; iv) SR with two layers and bilinear attention; v) SA with elementwise multiplication scoring function; vi) SA with partial-bilinear scoring function.

Surprisingly, the element-wise version of SA performed better than the partial-bilinear version, with an accuracy of 77.3% which, to our knowledge, has only been surpassed by Dhingra et al. (2016) with their Gated-Attention Reader model.

Additionally, the 1-layer SR with dot-product attention got 0.3% lower accuracy than the 1-layer SR with bilinear attention. These results suggest that the bilinear parameters do not significantly improve performance over dot-product attention.

Adding an additional GRU layer to encode the passage and question in the SR model increased performance over the original 1-layer model. With dot-product attention the increase was 1.1%, whereas with bilinear attention the increase was 1.3%. However, these performance increases were considerably less than the lift from using an SA model (and SA has fewer parameters).

Model                          CNN      Params
SR, dot prod. att.             73.1%
SR, bilinear att.              73.4%
SR, 2-layer, dot prod. att.    74.2%
SR, 2-layer, bilinear att.     74.7%
SA, element-wise att.          77.3%
SA, partial-bilinear att.      77.1%

Table 2: Accuracy on CNN test sets and number of trainable parameters for various Stanford Reader (SR) and Sequential Attention (SA) models.
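The accuracy differences between the Table 2 configurations can be recomputed directly from the listed figures. The snippet below only does that arithmetic; the dictionary keys are shorthand labels invented for this sketch.

```python
# CNN test accuracies (%) copied from Table 2.
acc = {
    "SR dot": 73.1,
    "SR bilinear": 73.4,
    "SR 2-layer dot": 74.2,
    "SR 2-layer bilinear": 74.7,
    "SA element-wise": 77.3,
    "SA partial-bilinear": 77.1,
}

def gain(model, baseline):
    """Absolute accuracy difference in percentage points, rounded to one decimal."""
    return round(acc[model] - acc[baseline], 1)

bilinear_vs_dot = gain("SR bilinear", "SR dot")
second_layer_dot = gain("SR 2-layer dot", "SR dot")
second_layer_bilinear = gain("SR 2-layer bilinear", "SR bilinear")
sa_vs_sr = gain("SA partial-bilinear", "SR bilinear")
```

These differences are in absolute percentage points of test accuracy.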

4.2 Discussion

The difference between our Sequential Attention and standard approaches to attention is that we conserve the distributed representation of similarity for each token and use that contextual information when computing attention over other words. In other words, when the bilinear attention layer computes the attention weight $\alpha_i$ for a word, it only cares about the magnitude of the resulting $\alpha_i$ (the amount of attention that it gives to that word). If we instead keep the scoring vector $\gamma_i$, we can also know which dimensions of the distributed representation of the attention weighed in that decision. Furthermore, if we use that information to feed a new GRU, it helps the model learn how to assign attention to surrounding words.
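A tiny numeric illustration of this point: two scoring vectors can collapse to the same scalar while differing in which dimensions contributed. The vectors below are made-up values, not model outputs.

```python
def score(gamma):
    """Scalar score obtained by summing the elements of a scoring vector."""
    return sum(gamma)

gamma_a = [0.75, 0.0, 0.25]  # similarity concentrated in the first dimension
gamma_b = [0.0, 0.5, 0.5]    # similarity spread over the other dimensions

same_scalar = abs(score(gamma_a) - score(gamma_b)) < 1e-12
same_vector = gamma_a == gamma_b
```

A scalar-only attention layer cannot tell these two words apart, whereas a layer that consumes the full vectors can.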

Compared to Sequential Attention, Bi-Directional Attention Flow uses a considerably more complex architecture with a query representation for each word in the question. Unlike the Gated Attention Reader, SA does not require intermediate soft attention and it uses only one additional RNN layer. Furthermore, in SA no dot product is required to compute attention, only the sum of the elements of the vector. SA’s simpler architecture performs close to the state of the art.

Figure 2 shows some sample model behavior. In this example and elsewhere, SA results in less sparse attention vectors compared to SR, and this helps the model assign attention not only to potential target strings (anonymized entities) but also to relevant contextual words that are related to those entities. This ultimately leads to richer semantic representations of the passage.

Finally, we found: i) bilinear attention does not yield dramatically higher performance compared to dot-product attention; ii) bilinear parameters do not improve SA performance; iii) increasing the number of layers in the attention mechanism yields considerably greater performance gains with fewer parameters compared to increasing the number of layers used to compute the hidden representations of the question and passage.

5 Conclusion and Discussion

In this paper we created a novel and simple model with a Sequential Attention mechanism that performs near the state of the art on the CNN and WDW datasets by improving the bilinear and dot-product attention mechanisms with an additional bidirectional RNN layer. This additional layer allows local alignment information to be used when computing the attentional score for each token. Furthermore, it provides higher performance gains with fewer parameters compared to adding an additional layer to compute the question and passage hidden representations. For future work we would like to try other machine reading datasets such as SQuAD and MS MARCO. Also, we think that some elements of the SA model could be mixed with ideas applied in recent research from Dhingra et al. (2016) and Seo et al. (2016). We believe that the SA mechanism may benefit other tasks as well, such as machine translation.


This paper was the result of a term project for the NYU Course DS-GA 3001, Natural Language Understanding with Distributed Representations. Bowman acknowledges support from a Google Faculty Research Award and gifts from Tencent Holdings and the NVIDIA Corporation.