1 Introduction
Teaching machines to comprehend a given context paragraph and answer corresponding questions is one of the long-term goals of natural language processing and artificial intelligence. Figure 1 gives an example of the machine reading comprehension (MRC) task. Benefiting from the rapid development of deep learning techniques [Goodfellow et al.2016] and large-scale benchmark datasets [Hermann et al.2015, Hill et al.2016, Rajpurkar et al.2016], end-to-end neural networks have achieved promising results on this task [Wang et al.2017, Seo et al.2017, Xiong et al.2017a, Huang et al.2017]. Despite these advancements, we argue that two limitations remain:

To capture complex interactions between the context and the question, a variety of neural attention mechanisms [Dzmitry Bahdanau2015], such as bi-attention [Seo et al.2017] and co-attention [Xiong et al.2017b], have been proposed in single-round alignment architectures. In order to fully compose the complete information of the inputs, multi-round alignment architectures that compute attentions repeatedly have been proposed [Huang et al.2017, Xiong et al.2017a]. However, in these approaches, the current attention is unaware of which parts of the context and question have been focused on in earlier attentions, which results in two distinct but related issues: multiple attentions 1) focus on the same texts, leading to attention redundancy, and 2) fail to focus on some salient parts of the input, causing attention deficiency.

To train the model, the standard maximum-likelihood method is used for predicting exactly-matched (EM) answer spans [Wang and Jiang2017]. Recently, a reinforcement learning algorithm, which measures the reward as the word overlap between the predicted answer and the ground truth, has been introduced to optimize towards the F1 metric instead of the EM metric [Xiong et al.2017a]. Specifically, an estimated baseline is utilized to normalize the reward and reduce variance. However, convergence can be suppressed when the baseline is better than the reward. This is harmful if the inferior reward is partially overlapped with the ground truth, as the normalized objective will discourage the prediction of ground truth positions. We refer to this case as the convergence suppression problem.
To address the first problem, we present a reattention mechanism that temporally memorizes past attentions and uses them to refine current attentions in a multi-round alignment architecture. The computation is based on the fact that two words should share similar semantics if their attentions over the same texts are highly overlapped, and less similar semantics otherwise. Therefore, the reattention can be more concentrated if past attentions focus on the same parts of the input, or relatively more distracted, so as to focus on new regions, if past attentions do not overlap at all.
As for the second problem, we extend the traditional training method with a novel approach called dynamic-critical reinforcement learning. Unlike traditional reinforcement learning algorithms, where the reward and baseline are statically sampled, our approach dynamically decides the reward and the baseline according to two sampling strategies, namely random inference and greedy inference. The result with the higher score is always set to be the reward, while the other is the baseline. In this way, the normalized reward is ensured to be always positive, so that no convergence suppression occurs.
All of the above innovations are integrated into a new end-to-end neural architecture called the Reinforced Mnemonic Reader, shown in Figure 3. We conducted extensive experiments on both the SQuAD [Rajpurkar et al.2016] dataset and two adversarial SQuAD datasets [Jia and Liang2017] to evaluate the proposed model. On SQuAD, our single model obtains an exact match (EM) score of 79.5% and an F1 score of 86.6%, while our ensemble model further boosts the results to 82.3% and 88.5% respectively. On adversarial SQuAD, our model surpasses existing approaches by more than 6% on both the AddSent and AddOneSent datasets.
2 MRC with Reattention
2.1 Task Description
For MRC tasks, a question Q and a context C are given, and our goal is to predict an answer a, which takes different forms according to the specific task. In the SQuAD dataset [Rajpurkar et al.2016], the answer a is constrained to be a segment of text in the context C, and neural networks are designed to model the probability distribution p(a|C, Q).
2.2 Alignment Architecture for MRC
Among all state-of-the-art works for MRC, one of the key factors is the alignment architecture. That is, given the hidden representations of the question and context, we align each context word with the entire question using attention mechanisms, and enhance the context representation with the attentive question information. A detailed comparison of different alignment architectures is shown in Table 1.

Early work for MRC, such as Match-LSTM [Wang and Jiang2017], utilizes an attention mechanism stemming from neural machine translation [Dzmitry Bahdanau2015] serially, where the attention is computed inside the cell of a recurrent neural network. A more popular approach is to compute attentions in parallel, resulting in a similarity matrix. Concretely, given two sets of hidden vectors, Q = {q_i}_{i=1}^n and C = {c_j}_{j=1}^m, representing the question and context respectively, a similarity matrix E is computed as

E_ij = f(q_i, c_j)   (1)

where E_ij indicates the similarity between the i-th question word and the j-th context word, and f is a scalar function. Different methods have been proposed to normalize the matrix, resulting in variants of attention such as bi-attention [Seo et al.2017] and co-attention [Xiong et al.2017b]. The attention is then used to attend the question and form a question-aware context representation.
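As a concrete illustration, the parallel attention computation above can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's implementation: the scalar function f is taken to be a plain dot product, and the normalization direction (over question words, for each context word) is one common choice.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# toy hidden vectors: n = 3 question words, m = 4 context words, dim 5
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 5))
C = rng.normal(size=(4, 5))

# similarity matrix E[i, j] = f(q_i, c_j); here f is a dot product
E = Q @ C.T

# attention over question words for each context word, then the
# question-aware context representation
attn = softmax(E, axis=0)      # shape (3, 4); columns sum to 1
context_aware = attn.T @ Q     # shape (4, 5)
```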
Model        Aligning Rounds     Attention
             Interactive  Self   Type
Match-LSTM   1            -      Serial
R-net        1            1      Serial
BiDAF        1            -      Parallel
FastQAExt    1            1      Parallel
DCN+         2            2      Parallel
FusionNet    3            1      Parallel
Our Model    3            3      Parallel
Later, Wang et al. [Wang et al.2017] propose a serial self-aligning method to align the context against itself for capturing long-term dependencies among context words. Weissenborn et al. [Weissenborn et al.2017] apply the self alignment in a way similar to Eq. 1, yielding another similarity matrix B as

B_ij = 1_{(i≠j)} f(c_i, c_j)   (2)

where 1_{(i≠j)} is an indicator function ensuring that the context word is not aligned with itself. Finally, the attentive information can be integrated to form a self-aware context representation, which is used to predict the answer.
We refer to the above process as a single-round alignment architecture. Such an architecture, however, is limited in its capability to capture complex interactions between question and context. Therefore, recent works build multi-round alignment architectures by stacking several identical aligning layers [Huang et al.2017, Xiong et al.2017a]. More specifically, let q_i and c_j^t denote the hidden representations of the i-th question word and the j-th context word after the t-th layer, with c̄_j^t the corresponding question-aware context representation. Then the two similarity matrices can be computed as

E^t_ij = f(q_i, c_j^{t−1}),   B^t_ij = 1_{(i≠j)} f(c_i^{t−1}, c_j^{t−1})   (3)
However, one problem is that each alignment is not directly aware of previous alignments in such an architecture. The attentive information can only flow to the subsequent layer through the hidden representation. This can cause two problems: 1) attention redundancy, where multiple attention distributions are highly similar. Let softmax(x) denote the softmax function over a vector x. Then this problem can be formalized as D(softmax(E^t_{:j}) ∥ softmax(E^k_{:j})) < σ (t ≠ k), where σ is a small bound and D is a function measuring the distribution distance. 2) Attention deficiency, which means that the attention fails to focus on salient parts of the input: D(softmax(E^{t*}_{:j}) ∥ softmax(E^t_{:j})) > δ, where δ is another bound and softmax(E^{t*}_{:j}) is the "ground truth" attention distribution.
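Both conditions can be checked directly once a distance function D is fixed. Below is a small sketch with D taken to be the KL divergence (the measure used in the experiments later); the threshold values and toy distributions are ours, for illustration only.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q) between attention distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

# redundancy: two rounds attend to nearly the same words -> small distance
round1 = np.array([0.70, 0.20, 0.10])
round2 = np.array([0.68, 0.22, 0.10])
# a round that attends elsewhere -> large distance
round3 = np.array([0.05, 0.15, 0.80])

assert kl(round1, round2) < 0.01   # redundant attentions
assert kl(round1, round3) > 1.0    # genuinely different attentions
```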
2.3 Reattention Mechanism
To address these problems, we propose to temporally memorize past attentions and explicitly use them to refine current attentions. The intuition is that two words should be correlated if their attentions over the same texts are highly overlapped, and less related otherwise. For example, in Figure 2, suppose that we have access to previous attentions; we can then compute their dot product to obtain a "similarity of attention". In this case, the similarity of the word pair (team, Broncos) is higher than that of (team, Panthers).
Therefore, we define the computation of reattention as follows. Let E^{t−1} and B^{t−1} denote the past similarity matrices that are temporally memorized. The refined similarity matrix E^t (t > 1) is computed as

E^t_ij = f(q_i, c_j^{t−1}) + γ · softmax(E^{t−1}_{i:}) · softmax(B^{t−1}_{:j})   (4)

where γ is a trainable parameter. Here, softmax(E^{t−1}_{i:}) is the past context attention distribution for the i-th question word, and softmax(B^{t−1}_{:j}) is the self attention distribution for the j-th context word. In the extreme case, when there is no overlap between the two distributions, the dot product will be 0. On the other hand, if the two distributions are identical and focus on one single word, it will have a maximum value of 1. Therefore, the similarity of two words can be explicitly measured using their past attentions. Since the dot product is relatively small compared to the original similarity, we initialize γ with a tunable hyper-parameter and keep it trainable. The refined similarity matrix can then be normalized for attending the question. Similarly, we can compute the refined matrix B^t to get the unnormalized self reattention as

B^t_ij = 1_{(i≠j)} f(c_i^{t−1}, c_j^{t−1}) + γ · softmax(B^{t−1}_{i:}) · softmax(B^{t−1}_{:j})   (5)
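A sketch of the reattention step in NumPy, under the notational assumptions above: E_prev holds the past question-context similarities, B_prev the past context self-similarities, and the refinement adds γ times the dot products of the corresponding past attention distributions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def reattend(E_cur, E_prev, B_prev, gamma=3.0):
    """Refine current similarities with memorized past attentions (sketch of Eq. 4).

    E_cur, E_prev: (n, m) question-context similarity matrices
    B_prev:        (m, m) context self-similarity matrix
    """
    past_q = softmax(E_prev, axis=1)  # row i: past attention of question word i over context
    past_c = softmax(B_prev, axis=0)  # column j: past self attention of context word j
    # entry (i, j) of past_q @ past_c is the dot product of the two distributions
    return E_cur + gamma * (past_q @ past_c)

n, m = 3, 4
rng = np.random.default_rng(1)
E1 = rng.normal(size=(n, m))
B1 = rng.normal(size=(m, m))
E_cur = rng.normal(size=(n, m))
E2 = reattend(E_cur, E1, B1)
```

Since each dot product of two distributions lies in (0, 1], the added term is bounded by γ, matching the observation that it is small relative to the original similarity.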
3 Dynamic-critical Reinforcement Learning
In the extractive MRC task, the model distribution p(a|C, Q; θ) can be divided into two steps: first predicting the start position s and then the end position e, as

p(a|C, Q; θ) = p(s|C, Q; θ) p(e|s, C, Q; θ)   (6)

where θ represents all trainable parameters.
The standard maximum-likelihood (ML) training method is to maximize the log probabilities of the ground truth answer positions [Wang and Jiang2017]:

J_ML(θ) = −Σ_i [log p(y_i^s|C, Q; θ) + log p(y_i^e|y_i^s, C, Q; θ)]   (7)

where y_i^s and y_i^e are the start and end positions of the answer span for the i-th example, and we denote p(y_i^s|C, Q; θ) and p(y_i^e|y_i^s, C, Q; θ) as p_s(y_i^s) and p_e(y_i^e) respectively for abbreviation.
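The per-example ML objective can be written down directly. A minimal sketch with toy probability vectors; the position indices and values are made up for illustration:

```python
import math

def span_nll(p_start, p_end, ys, ye):
    """Negative log-likelihood of a gold span (y_s, y_e): one term of Eq. 7.

    p_start: distribution over start positions
    p_end:   distribution over end positions, conditioned on the gold start
    """
    return -(math.log(p_start[ys]) + math.log(p_end[ye]))

# toy distributions over 4 context positions; gold span is (1, 2)
p_start = [0.10, 0.70, 0.10, 0.10]
p_end = [0.05, 0.05, 0.80, 0.10]
loss = span_nll(p_start, p_end, 1, 2)
```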
Recently, reinforcement learning (RL), with the task reward measured as the word overlap between the predicted answer and the ground truth, has been introduced to MRC [Xiong et al.2017a]. A baseline b, obtained by running greedy inference with the current model, is used to normalize the reward and reduce variance. This approach is known as self-critical sequence training (SCST) [Rennie et al.2016], which was first used in image captioning. More specifically, let R(â, a*) denote the F1 score between a sampled answer â and the ground truth a*. The training objective is to minimize the negative expected reward:

J_RL(θ) = −E_{â∼p(a)}[R(â, a*)]   (8)

where we abbreviate the model distribution p(a|C, Q; θ) as p(a), and the reward function R(â, a*) as R(â). The baseline b = R(ā, a*), where ā is obtained by greedily maximizing the model distribution: ā = argmax_a p(a).
The expected gradient can be computed according to the REINFORCE algorithm [Sutton and Barto1998] as

∇_θ J_RL(θ) = −E_{â∼p(a)}[(R(â) − b) ∇_θ log p(â; θ)]   (9)

where the gradient can be approximated using a single Monte-Carlo sample â derived from p(a).
However, a sampled answer is discouraged by the objective when it is worse than the baseline. This is harmful if the answer is partially overlapped with the ground truth, since the normalized objective would discourage the prediction of ground truth positions. For example, in Figure 1, suppose that the sampled answer â is "champion Denver Broncos" and the baseline ā is "Denver Broncos". Although the former is an acceptable answer, the normalized reward would be negative, and the prediction for the end position would be suppressed, thus hindering convergence. We refer to this case as the convergence suppression problem.
Here, we consider both random inference and greedy inference as two different sampling strategies: the first one encourages exploration while the latter one is for exploitation. (In practice, we found that a better approximation can be made by considering a top-K answer list, where ā is the best result and â is sampled from the rest of the list.) Therefore, we approximate the expected gradient by dynamically setting the reward and baseline based on the F1 scores of both â and ā. The one with the higher score is set as the reward, while the other is the baseline. We call this approach dynamic-critical reinforcement learning (DCRL):

∇_θ J_DCRL(θ) ≈ −1{R(â) ≥ R(ā)} (R(â) − R(ā)) ∇_θ log p(â; θ) − 1{R(ā) > R(â)} (R(ā) − R(â)) ∇_θ log p(ā; θ)   (10)

Notice that the normalized reward is constantly positive, so that superior answers are always encouraged. Besides, when the score of random inference is higher than the greedy one, DCRL is equivalent to SCST. Thus, Eq. 9 is a special case of Eq. 10.
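The dynamic choice of reward and baseline amounts to a comparison of the two F1 scores. A sketch (the function name is ours, not from the paper):

```python
def dcrl_advantage(f1_sampled, f1_greedy):
    """Pick reward and baseline dynamically (sketch of the rule behind Eq. 10).

    Returns (advantage, reinforced), where advantage = reward - baseline >= 0
    and `reinforced` names the answer whose log-probability gradient is scaled.
    """
    if f1_sampled >= f1_greedy:
        return f1_sampled - f1_greedy, "sampled"  # SCST-like case
    return f1_greedy - f1_sampled, "greedy"       # greedy answer reinforced

# the advantage is never negative, so superior answers are always encouraged
adv1, who1 = dcrl_advantage(0.8, 0.5)
adv2, who2 = dcrl_advantage(0.3, 0.9)
```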
Following [Xiong et al.2017a] and [Kendall et al.2017], we combine the ML and DCRL objectives, using homoscedastic uncertainty as task-dependent weightings so as to stabilize the RL training:

J(θ) = (1 / (2σ_a²)) J_ML(θ) + (1 / (2σ_b²)) J_DCRL(θ) + log σ_a² + log σ_b²   (11)

where σ_a and σ_b are trainable parameters.
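Under a parameterization commonly used for homoscedastic uncertainty weighting (optimizing log σ² rather than σ for numerical stability; an assumption on our part, not a detail stated in the text), the combined objective can be sketched as:

```python
import math

def combined_loss(j_ml, j_dcrl, log_var_a, log_var_b):
    """Uncertainty-weighted combination of the ML and DCRL objectives (Eq. 11 sketch).

    log_var_a = log(sigma_a^2) and log_var_b = log(sigma_b^2) are trainable
    scalars in the real model; here they are plain floats.
    """
    return (0.5 * math.exp(-log_var_a) * j_ml
            + 0.5 * math.exp(-log_var_b) * j_dcrl
            + log_var_a + log_var_b)

# with both log-variances at 0 (sigma = 1), the losses are halved and summed
loss = combined_loss(2.0, 4.0, 0.0, 0.0)
```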
4 End-to-end Architecture
Based on the previous innovations, we introduce an end-to-end architecture called the Reinforced Mnemonic Reader, which is shown in Figure 3. It consists of three main components: 1) an encoder builds contextual representations for the question and context jointly; 2) an iterative aligner performs multi-round alignments between question and context with the reattention mechanism; 3) an answer pointer predicts the answer span sequentially. Below we give more details of each component.
Encoder. Let Q = {w_i^q}_{i=1}^n and C = {w_j^c}_{j=1}^m denote the word sequences of the question and context respectively. The encoder first converts each word to an input vector. We utilize the 100-dim GloVe embedding [Pennington et al.2014] and the 1024-dim ELMo embedding [Peters et al.2018]. Besides, a character-level embedding is obtained by encoding the character sequence with a bidirectional long short-term memory network (BiLSTM) [Hochreiter and Schmidhuber1997], where the two last hidden states are concatenated to form the embedding. In addition, we use a binary feature of exact match, a POS embedding and an NER embedding for both question and context, as suggested in [Chen et al.2017]. Together, the input vectors x_i^q and x_j^c are obtained. To model each word with its contextual information, a weight-shared BiLSTM is utilized to perform the encoding:

q_i = BiLSTM(x_i^q),   c_j = BiLSTM(x_j^c)   (12)

Thus, the contextual representations for both question and context words are obtained, denoted as two matrices Q and C.
Iterative Aligner. The iterative aligner contains a stack of three aligning blocks. Each block consists of three modules: 1) an interactive alignment to attend the question into the context; 2) a self alignment to attend the context against itself; 3) an evidence collection step to model the context representation with a BiLSTM. The reattention mechanism is utilized between two blocks, where past attentions are temporally memorized to help modulate current attentions. Below we first describe a single block in detail, which is shown in Figure 4, and then introduce the entire architecture.
Single Aligning Block. First, the similarity matrix E is computed using Eq. 1, where a multiplicative product with nonlinearity is applied as the attention function: f(u, v) = relu(W_u u)ᵀ relu(W_v v). The question attention for the j-th context word is then softmax(E_{:j}), which is used to compute an attended question vector q̃_j = Q · softmax(E_{:j}).
To efficiently fuse the attentive information into the context, a heuristic fusion function, denoted as o = fusion(x, y), is proposed as

x̃ = relu(W_r [x; y; x ∘ y; x − y])
g = σ(W_g [x; y; x ∘ y; x − y])
o = g ∘ x̃ + (1 − g) ∘ x   (13)

where σ denotes the sigmoid activation function, ∘ denotes element-wise multiplication, and the bias term is omitted. The computation is similar to that of highway networks [Srivastava et al.2015], where the output vector o is a linear interpolation of the input x and the intermediate vector x̃. A gate g is used to control the degree to which the intermediate vector is exposed. With this function, the question-aware context vectors can be obtained as c̄_j = fusion(c_j, q̃_j).

Similar to the above computation, a self alignment is applied to capture the long-term dependencies among context words. Again, we compute a similarity matrix B using Eq. 2. The attended context vector is then computed as c̃_j = C̄ · softmax(B_{:j}), where softmax(B_{:j}) is the self attention for the j-th context word. Using the same fusion function, we can obtain the self-aware context vectors as ĉ_j = fusion(c̄_j, c̃_j).
Finally, a BiLSTM is used to perform the evidence collection, which outputs the fully-aware context vectors R with the self-aware context vectors as its inputs.
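A NumPy sketch of the fusion function in Eq. 13; the relu/sigmoid choices follow the equation, the weight matrices are random placeholders, and bias terms are omitted as in the text.

```python
import numpy as np

def fusion(x, y, W_r, W_g):
    """Heuristic fusion o = fusion(x, y) (Eq. 13): highway-style gating.

    x, y: vectors of dimension d; W_r, W_g: matrices of shape (d, 4d).
    """
    h = np.concatenate([x, y, x * y, x - y])  # heuristic features
    x_tilde = np.maximum(W_r @ h, 0.0)        # intermediate vector (relu)
    g = 1.0 / (1.0 + np.exp(-(W_g @ h)))      # composition gate (sigmoid)
    return g * x_tilde + (1.0 - g) * x        # interpolate with the input

d = 4
rng = np.random.default_rng(2)
x, y = rng.normal(size=d), rng.normal(size=d)
W_r, W_g = rng.normal(size=(d, 4 * d)), rng.normal(size=(d, 4 * d))
o = fusion(x, y, W_r, W_g)
```

Note the highway-style design: when the gate saturates at 0 the input x passes through unchanged, which eases optimization in stacked blocks.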
Multi-round Alignments with Reattention. To enhance the ability to capture complex interactions among the inputs, we stack two more aligning blocks with the reattention mechanism as follows:

R^t = align^t(Q, R^{t−1}, E^{t−1}, B^{t−1})   (14)

where align^t denotes the t-th block and R^t its fully-aware context vectors, with R^0 = C. In the t-th block (t > 1), we fix the hidden representation of the question as Q, and set the hidden representation of the context as the previous fully-aware context vectors R^{t−1}. Then we compute the unnormalized reattentions E^t and B^t with Eq. 4 and Eq. 5
respectively. In addition, we utilize a residual connection [He et al.2016] in the last BiLSTM to form the final fully-aware context vectors R = {r_1, ..., r_m}.

Answer Pointer. We apply a variant of pointer networks [Vinyals et al.2015] as the answer pointer to make the predictions. First, the question representation Q is summarized into a fixed-size summary vector s: s = Σ_i α_i q_i, where α_i ∝ exp(wᵀ q_i). Then we compute the start probability p_s(j) by heuristically attending the context representation with the question summary as

p_s(j) ∝ exp(w_1ᵀ [r_j; s; r_j ∘ s; r_j − s])   (15)
Next, a new question summary s̃ is updated by fusing the context information of the start position, r_l where l = argmax_j p_s(j), into the old question summary: s̃ = fusion(s, r_l). Finally, the end probability p_e(j) is computed as

p_e(j) ∝ exp(w_2ᵀ [r_j; s̃; r_j ∘ s̃; r_j − s̃])   (16)
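The pointer computation can be sketched as follows. This is a simplification under our own assumptions: the summary update uses a plain average instead of the fusion function, and the scoring vectors are random placeholders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def pointer_probs(R, s, w1, w2):
    """Sketch of the start/end pointer (Eqs. 15-16).

    R: (m, d) context vectors; s: (d,) question summary;
    w1, w2: (4d,) scoring vectors for start and end.
    """
    feats = np.hstack([R, np.tile(s, (len(R), 1)), R * s, R - s])  # (m, 4d)
    p_start = softmax(feats @ w1)
    l = int(p_start.argmax())
    s_new = 0.5 * (s + R[l])  # simplified summary update; the model uses fusion
    feats_e = np.hstack([R, np.tile(s_new, (len(R), 1)), R * s_new, R - s_new])
    p_end = softmax(feats_e @ w2)
    return p_start, p_end

m, d = 5, 4
rng = np.random.default_rng(3)
R, s = rng.normal(size=(m, d)), rng.normal(size=d)
p_s, p_e = pointer_probs(R, s, rng.normal(size=4 * d), rng.normal(size=4 * d))
```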
5 Experiments
5.1 Implementation Details
We mainly focus on the SQuAD dataset [Rajpurkar et al.2016] to train and evaluate our model. SQuAD is a machine comprehension dataset containing more than 100,000 questions manually annotated by crowdsourcing workers on a set of Wikipedia articles. In addition, we also test our model on two adversarial SQuAD datasets [Jia and Liang2017], namely AddSent and AddOneSent. In both adversarial datasets, a confusing sentence with a wrong answer is appended to the end of the context in order to fool the model.
We evaluate the Reinforced Mnemonic Reader (R.MReader) in the following setting. We first train the model until convergence by optimizing Eq. 7. We then fine-tune this model with Eq. 11 until the F1 score on the development set no longer improves.
We use the Adam optimizer [Kingma and Ba2014] for both ML and DCRL training, with the initial learning rates halved whenever a bad iteration is met. Dropout [Srivastava et al.2014] is used to prevent overfitting. Word embeddings remain fixed during training. For out-of-vocabulary words, we initialize the embeddings from Gaussian distributions and keep them trainable.
5.2 Overall Results
We submitted our model to the hidden test set of SQuAD for evaluation. Two evaluation metrics are used: Exact Match (EM), which measures whether the predicted answer exactly matches the ground truth, and F1 score, which measures the degree of word overlap at the token level.
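For reference, token-level F1 (modulo the answer normalization of the official evaluation script, which we omit here) can be computed as:

```python
def f1_score(prediction, gold):
    """Token-level F1 between a predicted and a gold answer string."""
    pred_toks, gold_toks = prediction.split(), gold.split()
    counts = {}
    for t in gold_toks:
        counts[t] = counts.get(t, 0) + 1
    common = 0
    for t in pred_toks:
        if counts.get(t, 0) > 0:  # count overlapping tokens with multiplicity
            common += 1
            counts[t] -= 1
    if common == 0:
        return 0.0
    precision = common / len(pred_toks)
    recall = common / len(gold_toks)
    return 2 * precision * recall / (precision + recall)
```

An exactly matched span scores 1.0, while the partially overlapping prediction from the convergence-suppression example in Section 3 ("champion Denver Broncos" vs. "Denver Broncos") still receives partial credit.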
As shown in Table 2, R.MReader achieves an EM score of 79.5% and an F1 score of 86.6%. Since SQuAD is a competitive MRC benchmark, we also build an ensemble model that consists of single models with the same architecture but initialized with different parameters. Our ensemble model improves the metrics to 82.3% and 88.5% respectively. (The results are available at https://worksheets.codalab.org/worksheets/0xe6c23cbae5e440b8942f86641f49fd80.)
Single Model          Dev          Test
                      EM    F1    EM    F1
LR Baseline  40.0  51.0  40.4  51.0 
DCN+  74.5  83.1  75.1  83.1 
FusionNet  75.3  83.6  76.0  83.9 
SAN  76.2  84.1  76.8  84.4 
AttentionReader+      77.3  84.9 
BSE  77.9  85.6  78.6  85.8 
Rnet+      79.9  86.5 
SLQA+      80.4  87.0 
Hybrid AoA Reader+      80.0  87.3 
R.MReader  78.9  86.3  79.5  86.6 
Ensemble Model  
DCN+      78.8  86.0 
FusionNet  78.5  85.8  79.0  86.0 
SAN  78.6  85.8  79.6  86.5 
BSE  79.6  86.6  81.0  87.4 
AttentionReader+      81.8  88.2 
Rnet+      82.6  88.5 
SLQA+      82.4  88.6 
Hybrid AoA Reader+      82.5  89.3 
R.MReader  81.2  87.9  82.3  88.5 
Human  80.3  90.5  82.3  91.2 
Table 3 shows the performance comparison on two adversarial datasets, AddSent and AddOneSent. All models are trained on the original train set of SQuAD, and are tested on the two datasets. As we can see, R.MReader comfortably outperforms all previous models by more than 6% in both EM and F1 scores, indicating that our model is more robust against adversarial attacks.
5.3 Ablation Study
The contributions of each component of our model are shown in Table 4. First, ablations (1)-(4) explore the utility of the reattention mechanism and the DCRL training method. We notice that reattention has more influence on the EM score while DCRL contributes more to the F1 metric, and removing both of them results in huge drops on both metrics. Replacing DCRL with SCST also causes a marginal decline in performance on both metrics. Next, we replace the default attention function with the dot product in (5), and both metrics suffer degradations. (6)-(7) show the effectiveness of the heuristics used in the fusion function. Removing either of the two heuristics leads to some performance decline, and the heuristic subtraction is more effective than the multiplication. Ablations (8)-(9) further explore different forms of fusion, where "gate" and "MLP" denote simpler variants of the fusion function in Eq. 13. In both cases the highway-like function outperforms its simpler variants. Finally, we study the effect of different numbers of aligning blocks in (10)-(12). We notice that using 2 blocks causes a slight performance drop, while increasing to 4 blocks barely affects the SoTA result. Interestingly, a very deep alignment with 5 blocks results in a significant performance decline. We argue that this is because the model encounters the degradation problem that exists in deep networks [He et al.2016].
Model  AddSent  AddOneSent  

EM  F1  EM  F1  
LR Baseline  17.0  23.2  22.3  41.8 
MatchLSTM  24.3  34.2  34.8  41.8 
BiDAF  29.6  34.2  40.7  46.9 
SEDT  30.0  35.0  40.0  46.5 
ReasoNet  34.6  39.4  43.6  49.8 
FusionNet  46.2  51.4  54.7  60.7 
R.MReader  53.0  58.5  60.9  67.0 
Configuration               EM    F1    ΔEM   ΔF1

R.MReader                   78.9  86.3  -     -
(1)  - Reattention          78.1  85.8  -0.8  -0.5
(2)  - DCRL                 78.2  85.4  -0.7  -0.9
(3)  - Reattention, DCRL    77.1  84.8  -1.8  -1.5
(4)  - DCRL, + SCST         78.5  85.8  -0.4  -0.5
(5)  Attention: Dot         78.2  85.9  -0.7  -0.4
(6)  - Heuristic Sub        78.1  85.7  -0.8  -0.6
(7)  - Heuristic Mul        78.3  86.0  -0.6  -0.3
(8)  Fusion: Gate           77.9  85.6  -1.0  -0.7
(9)  Fusion: MLP            77.2  85.2  -1.7  -1.1
(10) Num of Blocks: 2       78.7  86.1  -0.2  -0.2
(11) Num of Blocks: 4       78.8  86.3  -0.1  0
(12) Num of Blocks: 5       77.5  85.2  -1.4  -1.1
5.4 Effectiveness of Reattention
We further present experiments to demonstrate the effectiveness of the reattention mechanism. For the attention redundancy problem, we measure the distance between attention distributions in two adjacent aligning blocks, e.g., between the attentions of the first and second blocks. A higher distance means less attention redundancy. For the attention deficiency problem, we take the arithmetic mean of multiple attention distributions from the ensemble model as the "ground truth" attention distribution E^*, and compute the distance of individual attentions from it. A lower distance means less attention deficiency. We use the Kullback-Leibler divergence as the distance function D, and report the value averaged over all examples.

KL divergence          - Reattention    + Reattention

Redundancy  
E^1 to E^2   0.695 ± 0.086   0.866 ± 0.074
B^1 to B^2   0.404 ± 0.067   0.450 ± 0.052
E^2 to E^3   0.976 ± 0.092   1.207 ± 0.121
B^2 to B^3   1.179 ± 0.118   1.193 ± 0.097
Deficiency  
E^2 to E^*   0.650 ± 0.044   0.568 ± 0.059
E^3 to E^*   0.536 ± 0.047   0.482 ± 0.035
Table 5 shows the results. We first see that reattention indeed helps alleviate the attention redundancy: the divergence between any two adjacent blocks is successfully enlarged with reattention. However, we find that the improvement between the first two blocks is larger than that between the last two blocks. We conjecture that the first reattention is more accurate at measuring the similarity of word pairs, since it uses the original encoded word representations, while the latter reattention is distracted by highly nonlinear word representations. In addition, we notice that the attention deficiency is also moderated: the divergence between the normalized attention and the "ground truth" attention is reduced.
5.5 Prediction Analysis
Figure 5 compares predictions made with dynamic-critical reinforcement learning against those made with self-critical sequence training. We first find that both approaches are able to obtain answers that match the query-sensitive category. For example, the first example shows that both "four" and "two" are retrieved when the question asks "how many". Nevertheless, we observe that DCRL consistently makes more accurate predictions of answer spans, especially when SCST already points to a rough boundary. In the second example, SCST takes the whole phrase after "Dyrrachium" as its location. The third example shows a similar phenomenon, where SCST retrieves the phrase "constantly servicing and replacing mechanical brushes" as its answer. We argue that this is because SCST encounters the convergence suppression problem, which impedes the prediction of ground truth answer boundaries. DCRL, however, successfully avoids this problem and thus finds the exactly correct entity.
6 Conclusion
We propose the Reinforced Mnemonic Reader, an enhanced attention reader with two main contributions. First, a reattention mechanism is introduced to alleviate the problems of attention redundancy and deficiency in multi-round alignment architectures. Second, a dynamic-critical reinforcement learning approach is presented to address the convergence suppression problem that exists in traditional reinforcement learning methods. Our model achieves state-of-the-art results on the SQuAD dataset, outperforming several strong competing systems. Besides, our model outperforms existing approaches by more than 6% on two adversarial SQuAD datasets. We believe that both reattention and DCRL are general approaches and can be applied to other NLP tasks such as natural language inference. Our future work is to study their compatibility.
Acknowledgments
This research work is supported by National Basic Research Program of China under Grant No. 2014CB340303. In addition, we thank Pranav Rajpurkar for help in SQuAD submissions.
References
 [Chen et al.2017] Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading wikipedia to answer opendomain questions. arXiv preprint arXiv:1704.00051, 2017.
 [Dzmitry Bahdanau2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR, 2015.
 [Goodfellow et al.2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, 2016.
 [He et al.2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of CVPR, pages 770–778, 2016.
 [Hermann et al.2015] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Proceedings of NIPS, 2015.
 [Hill et al.2016] Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. The goldilocks principle: Reading children’s books with explicit memory representations. In Proceedings of ICLR, 2016.
 [Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 [Huang et al.2017] HsinYuan Huang, Chenguang Zhu, Yelong Shen, and Weizhu Chen. Fusionnet: Fusing via fullyaware attention with application to machine comprehension. arXiv preprint arXiv:1711.07341, 2017.
 [Jia and Liang2017] Robin Jia and Percy Liang. Adversarial examples for evaluating reading comprehension systems. In Proceedings of EMNLP, 2017.
 [Kendall et al.2017] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multitask learning using uncertainty to weigh losses for scene geometry and semantics. arXiv preprint arXiv:1705.07115, 2017.
 [Kingma and Ba2014] Diederik P. Kingma and Lei Jimmy Ba. Adam: A method for stochastic optimization. In CoRR, abs/1412.6980, 2014.
 [Liu et al.2017a] Rui Liu, Junjie Hu, Wei Wei, Zi Yang, and Eric Nyberg. Structural embedding of syntactic trees for machine comprehension. arXiv preprint arXiv:1703.00572, 2017.
 [Liu et al.2017b] Xiaodong Liu, Yelong Shen, Kevin Duh, and Jianfeng Gao. Stochastic answer networks for machine reading comprehension. arXiv preprint arXiv:1712.03556, 2017.
 [Pennington et al.2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In Proceedings of EMNLP, 2014.
 [Peters et al.2018] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of NAACL, 2018.
 [Rajpurkar et al.2016] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of EMNLP, 2016.
 [Rennie et al.2016] Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. Selfcritical sequence training for image captioning. arXiv preprint arXiv:1612.00563, 2016.
 [Seo et al.2017] Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hananneh Hajishirzi. Bidirectional attention flow for machine comprehension. In Proceedings of ICLR, 2017.
 [Shen et al.2016] Yelong Shen, PoSen Huang, Jianfeng Gao, and Weizhu Chen. Reasonet: Learning to stop reading in machine comprehension. arXiv preprint arXiv:1609.05284, 2016.

 [Srivastava et al.2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, pages 1929–1958, 2014.
 [Srivastava et al.2015] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.
 [Sutton and Barto1998] Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. MIT Press, 1998.
 [Vinyals et al.2015] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In Proceedings of NIPS, 2015.
 [Wang and Jiang2017] Shuohang Wang and Jing Jiang. Machine comprehension using matchlstm and answer pointer. In Proceedings of ICLR, 2017.
 [Wang et al.2017] Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. Gated selfmatching networks for reading comprehension and question answering. In Proceedings of ACL, 2017.
 [Weissenborn et al.2017] Dirk Weissenborn, Georg Wiese, and Laura Seiffe. Making neural qa as simple as possible but not simpler. In Proceedings of CoNLL, pages 271–280, 2017.
 [Xiong et al.2017a] Caiming Xiong, Victor Zhong, and Richard Socher. Dcn+: Mixed objective and deep residual coattention for question answering. arXiv preprint arXiv:1711.00106, 2017.
 [Xiong et al.2017b] Caiming Xiong, Victor Zhong, and Richard Socher. Dynamic coattention networks for question answering. In Proceedings of ICLR, 2017.