Reinforced Mnemonic Reader for Machine Comprehension

by Minghao Hu, et al.
Fudan University

In this paper, we introduce the Reinforced Mnemonic Reader for the machine comprehension (MC) task, which aims to answer a query about a given context document. We propose several novel mechanisms that address critical problems in MC that are not adequately solved by previous works, such as enhancing the capacity of the encoder, modeling long-term dependencies of contexts, refining the predicted answer span, and directly optimizing the evaluation metric. Extensive experiments on TriviaQA and the Stanford Question Answering Dataset (SQuAD) show that our model achieves state-of-the-art results.


1 Introduction

Teaching machines to comprehend a given context paragraph and answer corresponding questions is one of the long-term goals of natural language processing and artificial intelligence. Figure 1 gives an example of the machine reading comprehension (MRC) task. Benefiting from the rapid development of deep learning techniques [Goodfellow et al.2016] and large-scale benchmark datasets [Hermann et al.2015, Hill et al.2016, Rajpurkar et al.2016], end-to-end neural networks have achieved promising results on this task [Wang et al.2017, Seo et al.2017, Xiong et al.2017a, Huang et al.2017].

Despite these advancements, we argue that two limitations remain:

  1. To capture complex interactions between the context and the question, a variety of neural attention mechanisms [Dzmitry Bahdanau2015], such as bi-attention [Seo et al.2017] and coattention [Xiong et al.2017b], have been proposed in a single-round alignment architecture. To fully compose the information of the inputs, multi-round alignment architectures that compute attentions repeatedly have been proposed [Huang et al.2017, Xiong et al.2017a]. However, in these approaches, the current attention is unaware of which parts of the context and question have been focused on in earlier attentions. This results in two distinct but related issues, where multiple attentions 1) focus on the same texts, leading to attention redundancy, and 2) fail to focus on some salient parts of the input, causing attention deficiency.

  2. To train the model, the standard maximum-likelihood method is used for predicting exactly-matched (EM) answer spans [Wang and Jiang2017]. Recently, a reinforcement learning algorithm, which measures the reward as the word overlap between the predicted answer and the ground truth, has been introduced to optimize towards the F1 metric instead of the EM metric [Xiong et al.2017a]. Specifically, an estimated baseline is utilized to normalize the reward and reduce variance. However, convergence can be suppressed when the baseline is better than the reward. This is harmful if the inferior reward comes from an answer that still partially overlaps with the ground truth, as the normalized objective will discourage the prediction of ground truth positions. We refer to this case as the convergence suppression problem.

To address the first problem, we present a reattention mechanism that temporally memorizes past attentions and uses them to refine current attentions in a multi-round alignment architecture. The computation is based on the observation that two words should share similar semantics if their attentions over the same texts overlap heavily, and be less similar otherwise. Therefore, the reattention can be more concentrated if past attentions focus on the same parts of the input, or be relatively more dispersed so as to focus on new regions if past attentions do not overlap at all.

As for the second problem, we extend the traditional training method with a novel approach called dynamic-critical reinforcement learning. Unlike the traditional reinforcement learning algorithm, where the reward and baseline are statically sampled, our approach dynamically decides the reward and the baseline according to two sampling strategies, namely random inference and greedy inference. The result with the higher score is always set to be the reward, while the other is the baseline. In this way, the normalized reward is ensured to be always positive, so that convergence suppression does not occur.

Figure 1: An example from the SQuAD dataset. Evidence needed for the answer is marked in green.

All of the above innovations are integrated into a new end-to-end neural architecture called Reinforced Mnemonic Reader, shown in Figure 3. We conducted extensive experiments on both the SQuAD [Rajpurkar et al.2016] dataset and two adversarial SQuAD datasets [Jia and Liang2017] to evaluate the proposed model. On SQuAD, our single model obtains an exact match (EM) score of 79.5% and an F1 score of 86.6%, while our ensemble model further boosts the results to 82.3% and 88.5% respectively. On adversarial SQuAD, our model surpasses existing approaches by more than 6% on both the AddSent and AddOneSent datasets.

2 MRC with Reattention

2.1 Task Description

For MRC tasks, a question and a context are given, and our goal is to predict an answer, whose form depends on the specific task. In the SQuAD dataset [Rajpurkar et al.2016], the answer is constrained to be a segment of text in the context, and neural networks are designed to model the probability distribution of the answer given the context and the question.


2.2 Alignment Architecture for MRC

Among all state-of-the-art works for MRC, one of the key factors is the alignment architecture. That is, given the hidden representations of question and context, we align each context word with the entire question using attention mechanisms, and enhance the context representation with the attentive question information. A detailed comparison of different alignment architectures is shown in Table 1.


Early work for MRC, such as Match-LSTM [Wang and Jiang2017], utilizes the attention mechanism stemming from neural machine translation [Dzmitry Bahdanau2015] serially, where the attention is computed inside the cell of recurrent neural networks. A more popular approach is to compute attentions in parallel, resulting in a similarity matrix. Concretely, given two sets of hidden vectors representing the question and the context respectively, a similarity matrix is computed (Eq. 1) in which each entry indicates the similarity between one question word and one context word, as given by a scalar function. Different methods have been proposed to normalize the matrix, resulting in variants of attention such as bi-attention [Seo et al.2017] and coattention [Xiong et al.2017b]. The attention is then used to attend the question and form a question-aware context representation.
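As a rough illustration of the parallel attention described above, the following sketch builds a similarity matrix between toy question and context vectors and then attends the question for each context word. The plain dot product used as the scalar function, the column-wise softmax normalization, and the names `similarity_matrix` and `attend_question` are assumptions made for illustration only, not the attention function of any particular model.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product, used here as an assumed scalar similarity function."""
    return sum(x * y for x, y in zip(a, b))


def similarity_matrix(question, context, f=dot):
    """Entry [i][j] is the similarity of question word i and context word j."""
    return [[f(q, c) for c in context] for q in question]


def attend_question(question, context):
    """Return the similarity matrix and one attended question vector per context word."""
    sim = similarity_matrix(question, context)
    dim = len(question[0])
    attended = []
    for j in range(len(context)):
        # Attention of context word j over all question words (column j).
        weights = softmax([sim[i][j] for i in range(len(question))])
        attended.append([sum(w * q[d] for w, q in zip(weights, question))
                         for d in range(dim)])
    return sim, attended


question = [[1.0, 0.0], [0.0, 1.0]]              # 2 toy question words
context = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]   # 3 toy context words
sim, attended = attend_question(question, context)
```

Each attended vector is a convex combination of the question vectors, so a context word that is equally similar to both question words receives their average.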

Model        Aligning Rounds        Attention Type
             Interactive   Self
Match-LSTM   1             -        Serial
Rnet         1             1        Serial
BiDAF        1             -        Parallel
FastQAExt    1             1        Parallel
DCN+         2             2        Parallel
FusionNet    3             1        Parallel
Our Model    3             3        Parallel
Table 1: Comparison of alignment architectures of competing models: Wang & Jiang [Wang and Jiang2017], Wang et al. [Wang et al.2017], Seo et al. [Seo et al.2017], Weissenborn et al. [Weissenborn et al.2017], Xiong et al. [Xiong et al.2017a] and Huang et al. [Huang et al.2017].

Later, Wang et al. [Wang et al.2017] propose a serial self-aligning method that aligns the context against itself to capture long-term dependencies among context words. Weissenborn et al. [Weissenborn et al.2017] apply self alignment in a way similar to Eq. 1, yielding another similarity matrix (Eq. 2) that includes an indicator function ensuring that a context word is not aligned with itself. Finally, the attentive information can be integrated to form a self-aware context representation, which is used to predict the answer.
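The role of the indicator function in self alignment can be sketched as follows. This is a minimal illustration under the assumption that the indicator simply zeroes the diagonal of the similarity matrix; the function name `self_similarity` and the dot-product similarity are hypothetical choices.

```python
def dot(a, b):
    """Dot product, used here as an assumed scalar similarity function."""
    return sum(x * y for x, y in zip(a, b))


def self_similarity(context, f=dot):
    """Similarity of every context word with every other context word.

    The diagonal is forced to zero so that a word is not aligned with itself.
    """
    n = len(context)
    return [[0.0 if i == j else f(context[i], context[j]) for j in range(n)]
            for i in range(n)]


context = [[1.0, 2.0], [3.0, 4.0], [0.5, 0.5]]
self_sim = self_similarity(context)
```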

We refer to the above process as a single-round alignment architecture. Such an architecture, however, is limited in its capability to capture complex interactions between question and context. Therefore, recent works build multi-round alignment architectures by stacking several identical aligning layers [Huang et al.2017, Xiong et al.2017a]. More specifically, each layer takes its own hidden representations of the question and the context, together with the corresponding question-aware context representation, and computes the two similarity matrices in the same way as Eq. 1 and Eq. 2.

However, one problem is that each alignment is not directly aware of previous alignments in such an architecture. The attentive information can only flow to the subsequent layer through the hidden representation. This can cause two problems: 1) attention redundancy, where multiple attention distributions are highly similar. This can be formalized as the distance between the softmax-normalized attention distributions of two different layers, under some function measuring distribution distance, falling below a small bound. 2) attention deficiency, where the attention fails to focus on salient parts of the input: the distance between the “ground truth” attention distribution and the softmax-normalized attention exceeds another bound.

2.3 Reattention Mechanism

To address these problems, we propose to temporally memorize past attentions and explicitly use them to refine current attentions. The intuition is that two words should be correlated if their attentions over the same texts overlap heavily, and be less related otherwise. For example, in Figure 2, suppose that we have access to previous attentions; we can then compute their dot product to obtain a “similarity of attention”. In this case, the similarity of the word pair (team, Broncos) is higher than that of (team, Panthers).

Therefore, we define the computation of reattention as follows. The interactive and self similarity matrices of the previous block are temporally memorized. The refined similarity matrix of the current block combines the current similarity with the dot product of two past attention distributions, weighted by a trainable parameter. The two distributions are the past context attention distribution for a given question word and the past self attention distribution for a given context word. In the extreme case, when there is no overlap between the two distributions, the dot product is 0. On the other hand, if the two distributions are identical and focus on one single word, it reaches its maximum value of 1. Therefore, the similarity of two words can be explicitly measured using their past attentions. Since the dot product is relatively small compared with the original similarity, we initialize the weighting parameter with a tunable hyper-parameter and keep it trainable. The refined similarity matrix can then be normalized for attending the question. Similarly, we can compute the refined self similarity matrix to get the unnormalized self reattention.

Figure 2: Illustrations of reattention for the example in Figure 1.
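The "similarity of attention" used by reattention can be sketched in a few lines. The sketch assumes that the refined similarity adds a weighted dot product of past attention distributions to the current similarity; the additive form, the choice of which axis is normalized, and the names `attention_overlap`, `refine_similarity` and `gamma` are assumptions for illustration rather than the exact formulation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_overlap(p, q):
    """Dot product of two attention distributions over the same positions."""
    return sum(a * b for a, b in zip(p, q))


def refine_similarity(current, prev_inter, prev_self, gamma):
    """Add gamma * (overlap of past attentions) to the current similarity.

    current, prev_inter: n x m matrices (question words x context words).
    prev_self: m x m matrix (context words x context words).
    """
    n, m = len(prev_inter), len(prev_self)
    # Past attention of each question word over the context words (rows).
    q_att = [softmax(row) for row in prev_inter]
    # Past self attention of each context word over the context words (columns).
    c_att = [softmax([prev_self[k][j] for k in range(m)]) for j in range(m)]
    return [[current[i][j] + gamma * attention_overlap(q_att[i], c_att[j])
             for j in range(m)] for i in range(n)]


prev_inter = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0]]
prev_self = [[0.0, 4.0, 0.0], [4.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
current = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
refined = refine_similarity(current, prev_inter, prev_self, gamma=3.0)
```

Two one-hot distributions on different positions give an overlap of 0, and two identical one-hot distributions give 1, which matches the two extreme cases discussed in the text.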

3 Dynamic-critical Reinforcement Learning

In the extractive MRC task, the model distribution can be factorized into two steps: first predicting the start position, and then predicting the end position conditioned on the start. Both steps share the same set of trainable parameters.

The standard maximum-likelihood (ML) training method is to maximize the log probabilities of the ground truth start and end positions of the answer span over all training examples [Wang and Jiang2017].
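The two-step factorization and the ML objective can be illustrated with a small sketch. It assumes the model exposes a start distribution and, for each start, a conditional end distribution; the names `span_log_prob` and `ml_loss`, the data layout, and the averaging over examples are hypothetical choices for illustration.

```python
import math


def span_log_prob(start_probs, end_probs_given_start, start, end):
    """log p(start) + log p(end | start) for one candidate answer span."""
    return (math.log(start_probs[start])
            + math.log(end_probs_given_start[start][end]))


def ml_loss(examples):
    """Negative log-likelihood of the gold spans, averaged over examples.

    Each example is (start_probs, end_probs_given_start, gold_start, gold_end).
    """
    total = sum(span_log_prob(*example) for example in examples)
    return -total / len(examples)


start_probs = [0.5, 0.5]
end_probs_given_start = [[0.5, 0.5], [0.0, 1.0]]  # an end cannot precede its start
loss = ml_loss([(start_probs, end_probs_given_start, 1, 1)])
```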

Recently, reinforcement learning (RL), with the task reward measured as the word overlap between the predicted answer and the ground truth, has been introduced to MRC [Xiong et al.2017a]. A baseline, obtained by running greedy inference with the current model, is used to normalize the reward and reduce variance. This approach is known as self-critical sequence training (SCST) [Rennie et al.2016], which was first used in image captioning. More specifically, the reward is the F1 score between a sampled answer and the ground truth, and the training objective is to minimize the negative expected reward, normalized by the reward of the baseline answer. The baseline answer is obtained by greedily maximizing the model distribution.

The expected gradient can be computed according to the REINFORCE algorithm [Sutton and Barto1998], where the gradient can be approximated using a single Monte-Carlo sample drawn from the model distribution.

However, a sampled answer is discouraged by the objective when it is worse than the baseline. This is harmful if the answer partially overlaps with the ground truth, since the normalized objective would discourage the prediction of ground truth positions. For example, in Figure 1, suppose that the sampled answer is champion Denver Broncos and the greedy baseline answer is Denver Broncos. Although the former is an acceptable answer, the normalized reward would be negative and the prediction for the end position would be suppressed, thus hindering convergence. We refer to this case as the convergence suppression problem.
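The arithmetic behind this example can be checked with a token-level F1 function. The sketch below assumes that the ground truth answer is "Denver Broncos" and uses simple lowercased whitespace tokenization; both are assumptions made for illustration, and the official SQuAD evaluation applies additional text normalization.

```python
from collections import Counter


def f1_score(prediction, ground_truth):
    """Token-level F1 between a predicted answer and a ground truth answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


ground_truth = "Denver Broncos"        # assumed ground truth for this example
sampled = "champion Denver Broncos"    # answer from random inference
greedy = "Denver Broncos"              # baseline answer from greedy inference

sampled_reward = f1_score(sampled, ground_truth)
baseline_reward = f1_score(greedy, ground_truth)
normalized_reward = sampled_reward - baseline_reward
```

Under these assumptions the sampled answer has a precision of 2/3 and a recall of 1, giving an F1 of 0.8 against 1.0 for the baseline, so the normalized reward is about -0.2 even though the sampled answer ends at the correct position.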

Here, we consider random inference and greedy inference as two different sampling strategies: the first encourages exploration while the second is for exploitation. (In practice, we found that a better approximation can be made by considering a top-ranked answer list, where the greedy answer is the best result and the sampled answer is drawn from the rest of the list.) Therefore, we approximate the expected gradient by dynamically setting the reward and the baseline based on the F1 scores of both the sampled answer and the greedy answer. The one with the higher score is set as the reward, while the other is the baseline. We call this approach dynamic-critical reinforcement learning (DCRL).


Notice that the normalized reward is always positive, so that superior answers are always encouraged. Besides, when the score of random inference is higher than the greedy one, DCRL is equivalent to SCST. Thus, SCST is a special case of DCRL.
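The difference between the two objectives can be sketched with per-example surrogate losses, whose gradients with respect to the log-probabilities correspond to single-sample REINFORCE estimates. The function names, the scalar inputs, and the tie-breaking rule (ties go to the sampled answer) are assumptions for illustration.

```python
def scst_loss(logp_sampled, reward_sampled, reward_greedy):
    """SCST surrogate: the greedy answer is always the baseline."""
    return -(reward_sampled - reward_greedy) * logp_sampled


def dcrl_loss(logp_sampled, logp_greedy, reward_sampled, reward_greedy):
    """DCRL surrogate: the higher-scoring answer is the reward, the other the baseline."""
    if reward_sampled >= reward_greedy:
        return -(reward_sampled - reward_greedy) * logp_sampled
    return -(reward_greedy - reward_sampled) * logp_greedy


# Sampled answer is worse than the greedy one (F1 of 0.8 versus 1.0).
scst = scst_loss(logp_sampled=-1.0, reward_sampled=0.8, reward_greedy=1.0)
dcrl = dcrl_loss(logp_sampled=-1.0, logp_greedy=-0.5,
                 reward_sampled=0.8, reward_greedy=1.0)
```

In this case the SCST surrogate carries a positive coefficient on the log-probability of the sampled answer, so minimizing it pushes that probability down. The DCRL surrogate instead carries a negative coefficient on the log-probability of the greedy answer, so minimizing it raises the probability of the better answer.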

Following [Xiong et al.2017a] and [Kendall et al.2017], we combine the ML and DCRL objectives using homoscedastic uncertainty as task-dependent weightings so as to stabilize the RL training, where the two uncertainty weights are trainable parameters.
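One common form of homoscedastic uncertainty weighting can be sketched as follows. The exact constants and parameterization used here are assumptions: the sketch scales each loss by half the inverse variance and adds the log-variance as a regularizer, with the log-variances `log_var_ml` and `log_var_rl` standing in for the trainable parameters.

```python
import math


def combined_loss(loss_ml, loss_rl, log_var_ml, log_var_rl):
    """Weight two losses by learned uncertainties (log-variance parameterization)."""
    weighted_ml = 0.5 * math.exp(-log_var_ml) * loss_ml
    weighted_rl = 0.5 * math.exp(-log_var_rl) * loss_rl
    # The log-variance terms keep the variances from growing without bound.
    return weighted_ml + weighted_rl + log_var_ml + log_var_rl


total = combined_loss(loss_ml=2.0, loss_rl=4.0, log_var_ml=0.0, log_var_rl=0.0)
```

Raising a log-variance shrinks the contribution of the corresponding loss but pays a penalty through the additive term, which is what lets the weighting be learned jointly with the model.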

4 End-to-end Architecture

Based on the above innovations, we introduce an end-to-end architecture called Reinforced Mnemonic Reader, which is shown in Figure 3. It consists of three main components: 1) an encoder that builds contextual representations for question and context jointly; 2) an iterative aligner that performs multi-round alignments between question and context with the reattention mechanism; 3) an answer pointer that predicts the answer span sequentially. Below we give more details of each component.

Encoder. Given the word sequences of the question and the context, the encoder first converts each word to an input vector. We utilize the 100-dim GloVe embedding [Pennington et al.2014] and the 1024-dim ELMo embedding [Peters et al.2018]. Besides, a character-level embedding is obtained by encoding the character sequence with a bi-directional long short-term memory network (BiLSTM) [Hochreiter and Schmidhuber1997], where the two last hidden states are concatenated to form the embedding. In addition, we use a binary exact-match feature, a POS embedding and an NER embedding for both question and context, as suggested in [Chen et al.2017]. Together, these form the input vectors of the question and the context.

Figure 3: The architecture overview of Reinforced Mnemonic Reader. The subfigures to the right show detailed demonstrations of the reattention mechanism: 1) the refined interactive similarity matrix, used to attend the query; 2) the refined self similarity matrix, used to attend the context.

To model each word with its contextual information, a weight-shared BiLSTM is applied to the input vectors of both the question and the context. Thus, the contextual representations for both question and context words can be obtained, each denoted as a matrix.

Iterative Aligner. The iterative aligner contains a stack of three aligning blocks. Each block consists of three modules: 1) an interactive alignment to attend the question into the context; 2) a self alignment to attend the context against itself; 3) an evidence collection to model the context representation with a BiLSTM. The reattention mechanism is utilized between two blocks, where past attentions are temporally memorized to help modulate current attentions. Below we first describe a single block in detail, which is shown in Figure 4, and then introduce the entire architecture.

Figure 4: The detailed overview of a single aligning block. Different colors in the two similarity matrices represent different degrees of similarity.

Single Aligning Block. First, the similarity matrix is computed using Eq. 1, where a multiplicative product with nonlinearity is applied as the attention function. The question attention for each context word is then obtained by normalizing its similarities over the question words, and is used to compute an attended question vector.

To efficiently fuse the attentive information into the context, a heuristic fusion function is proposed (Eq. 4). It uses a sigmoid activation function and element-wise multiplication, and the bias term is omitted. The computation is similar to highway networks [Srivastava et al.2015], where the output vector is a linear interpolation of the input and the intermediate vector. A gate is used to control the degree to which the intermediate vector is exposed. With this function, the question-aware context vectors can be obtained by fusing each context vector with its attended question vector.
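A highway-style fusion of this kind can be sketched as follows. The sketch assumes that both the intermediate vector and the gate are computed from the concatenation of the two inputs with their element-wise product and difference (the two heuristics examined in the ablation study), and that the intermediate vector uses a ReLU nonlinearity; the feature layout, the ReLU, and the names `fuse`, `w_r` and `w_g` are assumptions for illustration.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def fuse(x, y, w_r, w_g):
    """Gated interpolation between the input x and an intermediate vector.

    x, y: vectors of size d. w_r, w_g: d x 4d weight matrices (no bias).
    """
    product = [a * b for a, b in zip(x, y)]       # heuristic multiplication
    difference = [a - b for a, b in zip(x, y)]    # heuristic subtraction
    features = x + y + product + difference       # concatenation, size 4d
    intermediate = [max(0.0, v) for v in matvec(w_r, features)]
    gate = [sigmoid(v) for v in matvec(w_g, features)]
    return [g * r + (1.0 - g) * xi for g, r, xi in zip(gate, intermediate, x)]


d = 2
zeros = [[0.0] * (4 * d) for _ in range(d)]
fused = fuse([2.0, -4.0], [1.0, 1.0], w_r=zeros, w_g=zeros)
```

With all-zero weights the gate is 0.5 everywhere and the intermediate vector is zero, so the output is simply half of the input, which makes the interpolation behaviour easy to verify by hand.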

Similar to the above computation, a self alignment is applied to capture the long-term dependencies among context words. Again, we compute a similarity matrix using Eq. 2. The attended context vector for each context word is then computed from the context representation using the self attention of that word. Using the same fusion function, we can obtain the self-aware context vectors.

Finally, a BiLSTM is used to perform the evidence collection, which takes the self-aware context vectors as its inputs and outputs the fully-aware context vectors.

Multi-round Alignments with Reattention. To enhance the ability to capture complex interactions among the inputs, we stack two more aligning blocks with the reattention mechanism. In each subsequent block, we keep the hidden representation of the question fixed, and set the hidden representation of the context to the fully-aware context vectors of the previous block. Then we compute the unnormalized interactive and self reattention with the reattention equations of Section 2.3. In addition, we utilize a residual connection [He et al.2016] in the last BiLSTM to form the final fully-aware context vectors.

Answer Pointer. We apply a variant of pointer networks [Vinyals et al.2015] as the answer pointer to make the predictions. First, the question representation is summarized into a fixed-size summary vector. Then we compute the start probability by heuristically attending the context representation with the question summary.

Next, the question summary is updated by fusing the context information of the start position into the old question summary. Finally, the end probability is computed in the same manner using the updated summary.


5 Experiments

5.1 Implementation Details

We mainly focus on the SQuAD dataset [Rajpurkar et al.2016] to train and evaluate our model. SQuAD is a machine comprehension dataset containing more than 100,000 questions manually annotated by crowdsourcing workers on a set of Wikipedia articles. In addition, we also test our model on two adversarial SQuAD datasets [Jia and Liang2017], namely AddSent and AddOneSent. In both adversarial datasets, a confusing sentence with a wrong answer is appended at the end of the context in order to fool the model.

We evaluate the Reinforced Mnemonic Reader (R.M-Reader) by running the following setting. We first train the model until convergence by optimizing Eq. 7. We then finetune this model with Eq. 11, until the F1 score on the development set no longer improves.

We use the Adam optimizer [Kingma and Ba2014] for both ML and DCRL training, with separate initial learning rates that are halved whenever a bad iteration is met. Dropout [Srivastava et al.2014] is used to prevent overfitting. Word embeddings remain fixed during training. For out-of-vocabulary words, we initialize the embeddings from Gaussian distributions and keep them trainable. The remaining settings are the batch size, the size of the character embedding and corresponding LSTMs, the main hidden size, and the hyperparameter used to initialize the reattention weight.

5.2 Overall Results

We submitted our model to the hidden test set of SQuAD for evaluation. Two evaluation metrics are used: Exact Match (EM), which measures whether the predicted answer exactly matches the ground truth, and F1 score, which measures the degree of word overlap at the token level.

As shown in Table 2, R.M-Reader achieves an EM score of 79.5% and an F1 score of 86.6%. Since SQuAD is a competitive MRC benchmark, we also build an ensemble model that consists of multiple single models with the same architecture but initialized with different parameters. Our ensemble model improves the metrics to 82.3% and 88.5% respectively. (The results are on 0xe6c23cbae5e440b8942f86641f49fd80.)

Model                 Dev EM   Dev F1   Test EM   Test F1
Single Model
LR Baseline           40.0     51.0     40.4      51.0
DCN+                  74.5     83.1     75.1      83.1
FusionNet             75.3     83.6     76.0      83.9
SAN                   76.2     84.1     76.8      84.4
AttentionReader+      -        -        77.3      84.9
BSE                   77.9     85.6     78.6      85.8
R-net+                -        -        79.9      86.5
SLQA+                 -        -        80.4      87.0
Hybrid AoA Reader+    -        -        80.0      87.3
R.M-Reader            78.9     86.3     79.5      86.6
Ensemble Model
DCN+                  -        -        78.8      86.0
FusionNet             78.5     85.8     79.0      86.0
SAN                   78.6     85.8     79.6      86.5
BSE                   79.6     86.6     81.0      87.4
AttentionReader+      -        -        81.8      88.2
R-net+                -        -        82.6      88.5
SLQA+                 -        -        82.4      88.6
Hybrid AoA Reader+    -        -        82.5      89.3
R.M-Reader            81.2     87.9     82.3      88.5
Human                 80.3     90.5     82.3      91.2
Table 2: The performance of Reinforced Mnemonic Reader and other competing approaches on the SQuAD dataset. The results of the test set were extracted on Feb 2, 2018: Rajpurkar et al. [Rajpurkar et al.2016], Xiong et al. [Xiong et al.2017a], Huang et al. [Huang et al.2017], Liu et al. and Peters et al. [Peters et al.2018]. Some entries are unpublished works. BSE refers to BiDAF + Self Attention + ELMo.

Table 3 shows the performance comparison on two adversarial datasets, AddSent and AddOneSent. All models are trained on the original train set of SQuAD, and are tested on the two datasets. As we can see, R.M-Reader comfortably outperforms all previous models by more than 6% in both EM and F1 scores, indicating that our model is more robust against adversarial attacks.

5.3 Ablation Study

The contributions of each component of our model are shown in Table 4. Firstly, ablations (1-4) explore the utility of the reattention mechanism and the DCRL training method. We notice that reattention has more influence on the EM score while DCRL contributes more to the F1 metric, and removing both of them results in large drops on both metrics. Replacing DCRL with SCST also causes a marginal decline in performance on both metrics. Next, we replace the default attention function with the dot product (5), and both metrics degrade. Ablations (6-7) show the effectiveness of the heuristics used in the fusion function. Removing either of the two heuristics leads to some performance decline, and heuristic subtraction is more effective than multiplication. Ablations (8-9) further explore different forms of fusion, where Gate and MLP refer to two simplified variants of Eq. 4. In both cases the highway-like function outperforms its simpler variants. Finally, we study the effect of different numbers of aligning blocks in (10-12). We notice that using 2 blocks causes a slight performance drop, while increasing to 4 blocks barely affects the SoTA result. Interestingly, a very deep alignment with 5 blocks results in a significant performance decline. We argue that this is because the model encounters the degradation problem that exists in deep networks [He et al.2016].

Model          AddSent EM   AddSent F1   AddOneSent EM   AddOneSent F1
LR Baseline    17.0         23.2         22.3            41.8
Match-LSTM     24.3         34.2         34.8            41.8
BiDAF          29.6         34.2         40.7            46.9
SEDT           30.0         35.0         40.0            46.5
ReasoNet       34.6         39.4         43.6            49.8
FusionNet      46.2         51.4         54.7            60.7
R.M-Reader     53.0         58.5         60.9            67.0
Table 3: Performance comparison on two adversarial SQuAD datasets. Wang & Jiang [Wang and Jiang2017], Seo et al. [Seo et al.2017], Liu et al., Shen et al. and Huang et al. [Huang et al.2017]. Some entries are ensemble models.

Configuration                  EM     F1     ΔEM    ΔF1
R.M-Reader                     78.9   86.3
(1) - Reattention              78.1   85.8   -0.8   -0.5
(2) - DCRL                     78.2   85.4   -0.7   -0.9
(3) - Reattention, DCRL        77.1   84.8   -1.8   -1.5
(4) - DCRL, + SCST             78.5   85.8   -0.4   -0.5
(5) Attention: Dot             78.2   85.9   -0.7   -0.4
(6) - Heuristic Sub            78.1   85.7   -0.8   -0.6
(7) - Heuristic Mul            78.3   86.0   -0.6   -0.3
(8) Fusion: Gate               77.9   85.6   -1.0   -0.7
(9) Fusion: MLP                77.2   85.2   -1.7   -1.1
(10) Num of Blocks: 2          78.7   86.1   -0.2   -0.2
(11) Num of Blocks: 4          78.8   86.3   -0.1   0
(12) Num of Blocks: 5          77.5   85.2   -1.4   -1.1
Table 4: Ablation study on SQuAD dev set.

5.4 Effectiveness of Reattention

We further present experiments to demonstrate the effectiveness of the reattention mechanism. For the attention redundancy problem, we measure the distance between the attention distributions in two adjacent aligning blocks. A higher distance means less attention redundancy. For the attention deficiency problem, we take the arithmetic mean of multiple attention distributions from the ensemble model as the “ground truth” attention distribution, and compute the distance of each individual attention to it. A lower distance means less attention deficiency. We use Kullback–Leibler divergence as the distance function, and we report the averaged value over all examples.
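The two measurements can be sketched with a small Kullback–Leibler helper. The direction of the divergence, the smoothing constant `eps`, and the function names are assumptions made for illustration; the toy distributions below are not taken from the experiments.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same positions."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def mean_distribution(distributions):
    """Arithmetic mean of several attention distributions (a proxy 'ground truth')."""
    count = len(distributions)
    return [sum(column) / count for column in zip(*distributions)]


# Redundancy: distance between the attentions of two adjacent blocks.
block_1 = [0.7, 0.2, 0.1]
block_2 = [0.1, 0.2, 0.7]
redundancy_distance = kl_divergence(block_1, block_2)

# Deficiency: distance of one attention to the mean over several models.
proxy_truth = mean_distribution([[0.6, 0.3, 0.1], [0.8, 0.1, 0.1]])
deficiency_distance = kl_divergence(proxy_truth, block_1)
```

A larger `redundancy_distance` indicates that adjacent blocks attend to different parts of the input, while a smaller `deficiency_distance` indicates that an individual attention stays close to the averaged distribution.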

KL divergence    - Reattention      + Reattention
to               0.695  0.086       0.866  0.074
to               0.404  0.067       0.450  0.052
to               0.976  0.092       1.207  0.121
to               1.179  0.118       1.193  0.097
to               0.650  0.044       0.568  0.059
to               0.536  0.047       0.482  0.035
Table 5: Comparison of KL divergence on different attention distributions on SQuAD dev set.

Table 5 shows the results. We first see that reattention indeed helps in alleviating attention redundancy: the divergence between any two adjacent blocks is enlarged with reattention. However, we find that the improvement between the first two blocks is larger than that between the last two blocks. We conjecture that the first reattention is more accurate at measuring the similarity of word pairs because it uses the original encoded word representations, while the latter reattention is distracted by highly nonlinear word representations. In addition, we notice that attention deficiency has also been moderated: the divergence between the normalized attention and the “ground truth” attention is reduced.

5.5 Prediction Analysis

Figure 5 compares predictions made either with dynamic-critical reinforcement learning or with self-critical sequence training. We first find that both approaches are able to obtain answers that match the query-sensitive category. For example, the first example shows that both four and two are retrieved when the question asks for how many. Nevertheless, we observe that DCRL consistently makes more accurate predictions on answer spans, especially when SCST already points to a rough boundary. In the second example, SCST takes the whole phrase after Dyrrachium as its location. The third example shows a similar phenomenon, where SCST retrieves the phrase constantly servicing and replacing mechanical brushes as its answer. We argue that this is because SCST encounters the convergence suppression problem, which impedes the prediction of ground truth answer boundaries. DCRL, however, successfully avoids this problem and thus finds the exactly correct entity.

Figure 5: Predictions with DCRL (red) and with SCST (blue) on SQuAD dev set.

6 Conclusion

We propose the Reinforced Mnemonic Reader, an enhanced attention reader with two main contributions. First, a reattention mechanism is introduced to alleviate the problems of attention redundancy and deficiency in multi-round alignment architectures. Second, a dynamic-critical reinforcement learning approach is presented to address the convergence suppression problem that exists in traditional reinforcement learning methods. Our model achieves state-of-the-art results on the SQuAD dataset, outperforming several strong competing systems. Besides, our model outperforms existing approaches by more than 6% on two adversarial SQuAD datasets. We believe that both reattention and DCRL are general approaches, and can be applied to other NLP tasks such as natural language inference. Our future work is to study the compatibility of our proposed methods.


This research work is supported by National Basic Research Program of China under Grant No. 2014CB340303. In addition, we thank Pranav Rajpurkar for help in SQuAD submissions.