Episodic Memory Reader: Learning What to Remember for Question Answering from Streaming Data

03/14/2019 · by Moonsu Han, et al.

We consider a novel question answering (QA) task where the machine needs to read from large streaming data (long documents or videos) without knowing when the questions will be given, in which case existing QA methods fail due to a lack of scalability. To tackle this problem, we propose a novel end-to-end reading comprehension method, which we refer to as Episodic Memory Reader (EMR), that sequentially reads the input contexts into an external memory while replacing memories that are less important for answering unseen questions. Specifically, we train an RL agent to replace a memory entry when the memory is full, in order to maximize its QA accuracy at a future timepoint, while encoding the external memory using the Transformer architecture to learn representations that consider the relative importance between the memory entries. We validate our model on a real-world large-scale textual QA task (TriviaQA) and a video QA task (TVQA), on which it achieves significant improvements over rule-based memory scheduling policies and an RL-based baseline that learns the query-specific importance of each memory independently.


1 Introduction

The question answering (QA) problem is one of the most important challenges in Natural Language Understanding (NLU). In recent years, there has been drastic progress on the topic, owing to the success of deep learning based QA models (Sukhbaatar et al., 2015; Seo et al., 2016; Xiong et al., 2016; Hu et al., 2018; Back et al., 2018; Devlin et al., 2018). On certain tasks, such as machine reading comprehension (MRC), where the problem is to find the span of the answer within a given paragraph Rajpurkar et al. (2016), deep learning based QA models have even surpassed human-level performance.

Despite such impressive achievements, it is still challenging to model question answering with document-level context Joshi et al. (2017), where the context may be a long document with a large number of paragraphs, due to problems such as the difficulty of modeling long-term dependencies and the computational cost. To overcome these scalability problems, researchers have proposed pipelining or confidence-based selection methods that combine paragraph-level models to obtain a document-level model (Joshi et al., 2017; Chen et al., 2017; Clark and Gardner, 2018; Wang et al., 2018b). Yet such models are applicable only when the questions are given beforehand and all sentences in the document can be stored in memory.

However, in realistic settings, the amount of context may be too large to fit into the system memory. We may consider query-based context selection methods such as those proposed in Indurthi et al. (2018) and Min et al. (2018), but in many cases the question is not available while reading in the context, which makes it difficult to select the context based on the question. For example, a conversational agent may need to answer a question after numerous conversations spanning a long time period, and a video QA model may need to watch an entire movie, a sports game, or days of streaming video from security cameras before answering a question. In such cases, existing QA models will fail to solve the problem due to memory limitations.

In this paper, we target the novel problem of solving question answering with streaming data as context, where the size of the context can be significantly larger than what the memory can accommodate (see Figure 1). In such a case, the model needs to carefully manage what to remember from the streaming data so that the memory contains the most informative context instances for answering an unseen future question. We pose this memory management problem as a learning problem and train both the memory representation and the scheduling agent using reinforcement learning.

Specifically, we propose to train the memory module itself using reinforcement learning to replace the most uninformative memory entry in order to maximize its reward on a given task. However, this is a seemingly ill-posed problem since, most of the time, scheduling must be performed without knowing which question will arrive next. To tackle this challenge, we implement a policy network and a value network that learn not only the relation between the sentences and the query but also the relative importance among the sentences, in order to maximize question answering accuracy at a future timepoint. We refer to this network as Episodic Memory Reader (EMR). EMR can perform selective memorization to keep a compact set of important context that will be useful for future tasks in lifelong learning scenarios.

We validate our proposed memory network on a large-scale text QA task (TriviaQA) and a video question answering task (TVQA), where the context is too large to fit into the external memory, against rule-based scheduling policies and an RL-based method that does not consider the relative importance between memories. The results show that our model significantly outperforms the baselines, due to its ability to preserve the most important pieces of information from the streaming data.

Our contribution is threefold:

  • We consider a novel task of learning to remember important instances from streaming data for question answering, where the size of the memory is significantly smaller than the length of the data stream.

  • We propose a novel end-to-end neural architecture for QA from streaming data, where we train a scheduling agent via reinforcement learning to store the most important memory cell for solving future QA tasks in the global external memory.

  • We validate the efficacy of our model on real-world, large-scale text and video QA datasets, on which it obtains significantly improved performances over baseline methods.

2 Related Work

Question-answering

There has been rapid progress in question answering (QA) in recent years, thanks to advances in deep learning as well as the availability of large-scale datasets. One of the most popular large-scale QA datasets is the Stanford Question Answering Dataset (SQuAD, Rajpurkar et al. (2016)), which consists of 100K+ question-answer pairs. Unlike Richardson et al. (2013) and Hermann et al. (2015), which provide multiple-choice QA pairs, SQuAD requires predicting the exact location of the answer. On this span prediction task, attentional models (Pan et al., 2017; Cui et al., 2017; Hu et al., 2018) have achieved impressive performance, with Bi-Directional Attention Flow (BiDAF, Seo et al. (2016)), which uses a bi-directional attention mechanism between the context and the query, being one of the best performing models. TriviaQA Joshi et al. (2017) is another large-scale QA dataset. Since each document in TriviaQA is much longer than those in SQuAD, existing span prediction models (Joshi et al., 2017; Back et al., 2018; Yu et al., 2018) fail to work due to memory limitations and simply resort to document truncation. Video question answering (Tapaswi et al., 2016; Lei et al., 2018), where video frames are given as the context for QA, is another important topic where scalability is an issue. Several models (Kim et al., 2017, 2018; Na et al., 2017; Wang et al., 2018a) solve video QA using attention and memory augmented networks to perform composite reasoning over both videos and texts; however, they focus only on short videos. Most existing work on QA considers small-scale problems due to memory limitations. Our work, on the other hand, considers a challenging scenario where the context is orders of magnitude larger than the memory.

Context selection

A few recent models select minimal context from the given document when answering questions, rather than using the full context, for scalability. Min et al. (2018) proposed a context selector that generates attentions over the context vectors in order to achieve scalability and robustness against adversarial inputs. Choi et al. (2017) and Indurthi et al. (2018) propose similar methods, but they use REINFORCE Williams (1992) instead of linear classifiers. While these context selection methods share our motivation of achieving scalability by selecting the most informative pieces of information for the QA task, our problem setting is completely different from theirs, since we consider the much more challenging problem of learning from streaming data without knowing when the question will be given, where the size of the context is much larger than the memory and the question is unseen when training the selection module.

Memory-augmented neural networks

Our Episodic Memory Reader is essentially a memory-augmented neural network (MANN) (Sukhbaatar et al., 2015; Graves et al., 2014; Xiong et al., 2016) with an RL-based scheduler. While most existing work on MANNs assumes that the memory is sufficiently large to hold all the data instances, a few works consider memory scheduling for better scalability. Gülçehre et al. (2016) propose to train an addressing agent with reinforcement learning to dynamically decide which memory entry to overwrite based on the query. This query-specific importance is similar to our motivation, but in our case the query arrives only after all the context has been read and thus cannot guide scheduling, and we perform hard replacement instead of overwriting. The Differentiable Neural Computer (DNC) Graves et al. (2016) extends the Neural Turing Machine (NTM) by introducing a temporal link matrix, replacing the least used memory entry when the memory is full. However, this is a rule-based method that cannot maximize performance on a given task.

Figure 2: The overview of our Episodic Memory Reader (EMR). EMR learns the policy and the value network to select a memory entry to replace, in order to maximize the reward, defined as the performance on future QA tasks (F1-score, accuracy).

3 Learning What to Remember from Streaming Data

We now describe how to solve question answering tasks with streaming data as context. In a more general sense, this is a problem of learning from a long data stream that contains a large portion of unimportant, noisy data (e.g. routine greetings in dialogs, uninformative video frames) with limited memory. The data stream is episodic, where an unbounded number of data instances may arrive at one time interval and become inaccessible afterward. Additionally, the model cannot know in advance what tasks (a question, in the case of the QA problem) will be given at which timestep in the future (see Figure 2 for more details). To solve this problem, the model needs to identify important data instances from the data stream and store them into the external memory. Formally, given a data stream $X = \{x^{(1)}, \dots, x^{(T)}\}$ (e.g. sentences or images) as input, the model should learn a function $F$ that maps it to the set of memory entries $M = \{m_1, \dots, m_N\}$, where $N \ll T$. How can we learn such a function that maximizes the performance on unseen future tasks, without knowing which problems will be given at what time? We formulate this as a reinforcement learning problem and train a memory scheduling agent.

3.1 Model Overview

We now describe our model, Episodic Memory Reader (EMR), which solves the problem described above. Our model has three components: (1) an agent based on EMR, (2) an external memory $M^{(t)} = [m_1^{(t)}; \dots; m_N^{(t)}]$, and (3) a target network that solves the target task given the memory and a query. Figure 2 shows an overview of our model. Given a sequence of data instances $\{x^{(1)}, \dots, x^{(T)}\}$ that streams through the system, the agent learns to retain the most useful subset in the memory by interacting with the external memory, which encodes the relative importance of each memory entry. While the memory is not yet full, the agent simply writes the encoded instance $e^{(t)}$ into $M^{(t)}$. However, when the memory is full, it selects an existing memory entry to delete. Specifically, it outputs an action $a^{(t)}$ following its policy $\pi(a^{(t)} \mid S^{(t)})$, where $a^{(t)} = i$ selects memory entry $m_i$ to delete, and the state is the concatenation of the memory and the new data instance: $S^{(t)} = [M^{(t)}; e^{(t)}]$. Thus either $e^{(t)}$ or one of the memory entries will be deleted. To maximize the performance on the future QA task, the agent should replace the least important memory entry. When the agent encounters the task at timestep $T$, it leverages both the memory $M^{(T)}$ and the task information (e.g. the question) to solve the task. For each action, the environment (the question answering module) provides the reward $r^{(t)}$, given either as the F1-score or the accuracy.
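
For concreteness, the control flow described above can be sketched as follows. This is a minimal PyTorch-style sketch, not the authors' implementation; `encode` and `policy` stand in for the data encoder and policy network of Section 3.2, and reserving the last action index for the incoming instance is our reading of "either $e^{(t)}$ or one of the memory entries will be deleted".

```python
import torch

def read_stream(stream, encode, policy, memory_size):
    """Sequentially read a data stream, keeping at most `memory_size` entries."""
    memory = []                                  # external memory M
    for x in stream:
        e = encode(x)                            # data encoder: e_t = phi(x_t)
        if len(memory) < memory_size:            # memory not yet full:
            memory.append(e)                     #   simply write e_t
            continue
        state = torch.stack(memory + [e])        # S_t = [M; e_t], (N+1, dim)
        probs = policy(state)                    # pi(a | S_t), shape (N+1,)
        a = torch.multinomial(probs, 1).item()   # sample an action
        if a < memory_size:
            memory[a] = e                        # delete m_a, keep e_t
        # else a == memory_size: the new instance e_t itself is discarded
    return memory
```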

3.2 Episodic Memory Reader

Episodic Memory Reader (EMR) is composed of three components: (1) a Data Encoder that encodes each data instance into a memory vector representation, (2) a Memory Encoder that generates the replacement probability for each memory entry, and (3) a Value Network that estimates the value of the memory as a whole. In some cases, we may use pure policy gradient methods, in which case the value network becomes unnecessary.

3.2.1 Data Encoder

The data instance $x^{(t)}$ that arrives at time $t$ can be in any data format, so we transform it into a memory vector representation $e^{(t)}$ to be used by the agent:

$e^{(t)} = \phi(x^{(t)})$

where $\phi$ is the data encoder, which can be any neural architecture suited to the type of input data. For example, we can use an RNN if $x^{(t)}$ is sequential data (e.g. a sentence composed of words) or a CNN if $x^{(t)}$ is an image.
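
A hypothetical instantiation of the two encoders mentioned above; the layer sizes and the use of precomputed CNN features are our assumptions, chosen only to make the sketch concrete.

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """phi(x_t) for text: encode a word sequence into one memory vector."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, word_ids):              # (batch, seq_len) of token ids
        _, h = self.gru(self.embed(word_ids)) # h: (1, batch, hidden_dim)
        return h.squeeze(0)                   # e_t: (batch, hidden_dim)

class FrameEncoder(nn.Module):
    """phi(x_t) for video: project per-frame CNN features (e.g. ResNet)."""
    def __init__(self, feat_dim=2048, hidden_dim=256):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden_dim)

    def forward(self, feats):                 # (batch, feat_dim)
        return torch.relu(self.proj(feats))   # e_t: (batch, hidden_dim)
```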

3.2.2 Memory Encoder

Using the memory vector representations generated by the data encoder, the memory encoder outputs a replacement probability for each memory entry by considering the importance of the entries relative to one another, and then replaces the most unimportant entry. This component corresponds to the policy network of the actor-critic method. We now describe our EMR variants:

EMR-Independent

Since there is no existing work for our novel problem setting, as a baseline we first consider a memory encoder that captures the importance of each memory entry only relative to the new data instance, which we refer to as EMR-Independent. This scheduling mechanism is adopted from the Dynamic Least Recently Used (LRU) addressing introduced in Gülçehre et al. (2016), but differs from LRU in that it replaces the memory entry rather than overwriting it, and is trained without the query so as to maximize performance on unseen future queries. EMR-Independent outputs the importance $\alpha^{(t)}_i$ of each memory entry by comparing it with an embedding of the new data instance: $\alpha^{(t)} = \mathrm{softmax}(\{m_i^\top W e^{(t)}\}_{i=1}^N)$. To compute the overall importance of each memory entry, as done in Gülçehre et al. (2016), we compute the exponential moving average $v^{(t)} = 0.1\, v^{(t-1)} + 0.9\, \alpha^{(t)}$. Then, we compute the final replacement probability with the LRU factor $\gamma^{(t)} = \sigma(W_\gamma e^{(t)})$ as follows:

$\pi(\cdot \mid S^{(t)}) = \mathrm{softmax}(\alpha^{(t)} - \gamma^{(t)} v^{(t-1)})$

where $\sigma$ and $\mathrm{softmax}$ are the sigmoid and softmax functions respectively, and $\pi$ is the policy of the agent.
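
Under our reading of the equations above (the bilinear comparison, the 0.1/0.9 moving-average coefficients, and the gating follow the LRU addressing of Gülçehre et al. (2016); treat these details as assumptions), EMR-Independent can be sketched as:

```python
import torch
import torch.nn as nn

class IndependentScorer(nn.Module):
    """EMR-Independent: score each entry against the new instance only."""
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Parameter(torch.randn(dim, dim) * 0.02)  # bilinear weight
        self.gate = nn.Linear(dim, 1)                        # LRU factor gamma

    def forward(self, memory, e, v_prev):
        # memory: (N, dim), e: (dim,), v_prev: (N,) running average
        alpha = torch.softmax(memory @ self.W @ e, dim=0)    # importance
        v = 0.1 * v_prev + 0.9 * alpha            # exponential moving average
        gamma = torch.sigmoid(self.gate(e))       # scalar LRU factor
        pi = torch.softmax(alpha - gamma * v_prev, dim=0)    # replacement probs
        return pi, v
```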

EMR-biGRU

A major drawback of EMR-Independent is that the evaluation of each memory entry depends only on the new input $e^{(t)}$. In other words, the importance is computed between each memory entry and the new data instance regardless of the other entries in the memory, so this scheme cannot model the importance of each memory entry relative to the other entries, which matters more when deciding on the least important one. One way to consider the relative relationships between memory entries is to encode them with a bidirectional GRU (biGRU) as follows:

$h_i = [\overrightarrow{\mathrm{GRU}}_\theta(m_i, \overrightarrow{h}_{i-1}); \overleftarrow{\mathrm{GRU}}_\theta(m_i, \overleftarrow{h}_{i+1})]$

where $\mathrm{GRU}_\theta$ is a Gated Recurrent Unit parameterized by $\theta$ and $[\cdot\,;\cdot]$ denotes concatenation of features. The model thus learns the general importance of each memory entry in relation to its neighbors, rather than considering each entry independently, which is useful when selecting the most important entry among highly similar data instances (e.g. video frames). However, it may not effectively model long-range relationships between memory entries in far-away slots, due to the inherent limitations of RNNs.

EMR-Transformer

To overcome this suboptimality of RNN-based modeling, we further adopt the self-attention mechanism of Vaswani et al. (2017). The query $Q$, key $K$, and value $V$ of the self-attention, used to generate the relative importance of the entries, are computed by linear layers that take $S^{(t)}$ with the position encoding proposed in Vaswani et al. (2017) as input. With multi-headed attention, each component is projected into multiple subspaces, one per attention head, with dimensions determined by the size of the memory $N$ and the number of attention heads $h$. Using these, the retrieved output is computed with scaled dot-product self-attention:

$o_i = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d/h}}\right) V$

where $d$ is the hidden dimension and $o_i = [o_i^1; \dots; o_i^h]$ is the concatenation of the per-head outputs. The memory encoding $h_i$ is then computed by a linear layer with $o_i$ as input.
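
The same interface with self-attention; as a stand-in for the authors' exact layer, the sketch below uses PyTorch's built-in multi-head attention (note that `dim` must be divisible by `num_heads`):

```python
import torch.nn as nn

class SelfAttentionMemoryEncoder(nn.Module):
    """Encode memory slots with multi-head self-attention."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads)
        self.out = nn.Linear(dim, dim)

    def forward(self, state, pos_enc):
        # state, pos_enc: (N+1, dim); add position encodings first
        x = (state + pos_enc).unsqueeze(1)  # (seq=N+1, batch=1, dim)
        o, _ = self.attn(x, x, x)           # Q = K = V = encoded slots
        return self.out(o.squeeze(1))       # h_i, one vector per slot
```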

Figure 3: Detailed architecture of memory encoder in EMR-Independent and EMR-biGRU/Transformer.

We use a multi-layer perceptron (MLP) to generate the replacement probability for each entry as follows:

$\pi(i \mid S^{(t)}) = \mathrm{softmax}(\mathrm{MLP}(h_i))$

where the MLP consists of three linear layers with ReLU activations. The agent then selects a memory entry to replace based on the policy $\pi$.

Figure 3 illustrates the architecture of the memory encoder for EMR-Independent and EMR-biGRU/Transformer.

3.2.3 Value Network

For solving certain QA problems, we need to consider the future importance of each memory entry. Especially in textual QA tasks (e.g. TriviaQA), storing the evidence sentences that precede the answer span may be beneficial, as they provide useful context. However, using only a discrete policy gradient method, we cannot preserve such context instances. To overcome this issue, we use an actor-critic RL method (A3C, Mnih et al. (2016)) that estimates the sum of future rewards at each state using a value network. Unlike the policy, which scores individual entries, the value must be estimated anew at each time step and needs to consider the memory as a whole. To obtain a holistic representation of the memory, we use Deep Sets Zaheer et al. (2017): following Zaheer et al. (2017), we sum up all the encodings $h_i$ and feed the sum into an MLP consisting of two linear layers and a ReLU activation to obtain a set representation. Then, we further process the set representation with a GRU, using the hidden state from the previous time step. Finally, we feed the output of the GRU to another multi-layer perceptron to estimate the value for the current timestep.
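
A sketch of this value head (a sum-pooled Deep Sets summary followed by a GRU and an MLP; the dimensions are illustrative):

```python
import torch.nn as nn

class SetValueNetwork(nn.Module):
    """Estimate V(S_t) from an order-invariant summary of the memory."""
    def __init__(self, dim, hidden_dim=256):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, hidden_dim))
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)  # state across timesteps
        self.v = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                               nn.Linear(hidden_dim, 1))

    def forward(self, h, prev_hidden):
        # h: (N+1, dim) encoded slots; prev_hidden: (1, hidden_dim)
        pooled = self.phi(h.sum(dim=0, keepdim=True))  # Deep Sets: sum, then MLP
        hidden = self.gru(pooled, prev_hidden)         # carry temporal context
        return self.v(hidden), hidden                  # scalar value, new state
```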

Figure 4: Example context and QA pair from TriviaQA.

3.3 Training and test

Training

Our model learns the memory scheduling policy jointly with the model that solves the task. For training EMR, we use either A3C Mnih et al. (2016) or REINFORCE Williams (1992). At training time, since the tasks are known, we provide the question to the agent at every timestep. At each step, the agent samples its action from the multinomial distribution given by $\pi(a^{(t)} \mid S^{(t)})$ in order to explore various states, and takes that action. The QA model then provides the agent with the reward $r^{(t)}$. We use the asynchronous multiprocessing method of Mnih et al. (2016) to train several models at once.
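
For intuition, one policy-gradient update over a single episode might look like the sketch below (plain REINFORCE with a normalized-return baseline; A3C additionally uses the learned value network and asynchronous workers, and the discount factor here is our choice):

```python
import torch

def reinforce_update(log_probs, rewards, optimizer, gamma=0.99):
    """One REINFORCE step over an episode of memory-replacement actions.

    log_probs: list of log pi(a_t | S_t) tensors for the sampled actions.
    rewards:   list of per-step rewards from the QA module (e.g. F1).
    """
    returns, g = [], 0.0
    for r in reversed(rewards):            # discounted returns, back to front
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```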

Test

At test time, the agent deletes the memory entry following $\pi$. Unlike at training time, the model observes the question only at the end of the data stream. When it encounters the question, the model solves the task using the data instances kept in the external memory.

4 Experiment

We evaluate our EMR-biGRU and EMR-Transformer against several baselines (a code sketch of the rule-based policies follows the list):

1) FIFO (First-In First-Out). A rule-based memory scheduling policy that replaces the oldest memory entry.

2) Uniform. A policy that replaces all memory entries with equal probability at each time.

3) LIFO (Last-In First-Out). A policy that replaces the newest data instance. That is, it first fills in the memory and then discards all following data instances.

4) EMR-Independent. A baseline EMR which learns the importance of each memory entry only relative to the new data instance.
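
The rule-based baselines reduce to one-line slot choices; the sketch below uses our own naming, with index `capacity` denoting the incoming instance itself:

```python
import random

def choose_slot(policy_name, capacity, rng=random):
    """Pick the index of the entry to delete once the memory is full."""
    if policy_name == "FIFO":
        return 0                          # the oldest entry (arrival order)
    if policy_name == "Uniform":
        return rng.randrange(capacity)    # any stored entry, equal probability
    if policy_name == "LIFO":
        return capacity                   # the newest instance is discarded
    raise ValueError(f"unknown policy: {policy_name}")
```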

We will release the code for reproduction if the paper is accepted.

4.1 TriviaQA

Figure 5: Histogram of the number of answers for each document length in the TriviaQA dataset.

Figure 6: A visualization of how our model operates to solve the problem. Episodic Memory Reader (EMR) sequentially reads the sentences one by one while replacing the least important memories. In this example, EMR retained the sentences containing the word 'France' (bold) in order to answer the given question.
Dataset

TriviaQA Joshi et al. (2017) is a realistic text-based question answering dataset which includes 950K question-answer pairs from 662K documents collected from Wikipedia and the web. This dataset is more challenging than standard QA benchmark datasets such as the Stanford Question Answering Dataset (SQuAD, Rajpurkar et al. (2016)), as the answer to a question may not be directly obtainable by span prediction and the context is very long (Figure 4). Since conventional QA models Seo et al. (2016); Back et al. (2018); Yu et al. (2018); Devlin et al. (2018) are span prediction models, on TriviaQA they train only on QA pairs whose answers can be found in the given context. In this setting, TriviaQA becomes highly biased, with the answers mostly located in the earlier parts of the documents (Figure 5). We evaluate our work only on the Wikipedia domain, since most previous work reports similar results on both domains. While the TriviaQA dataset consists of both human-verified and machine-generated QA subsets, we use only the human-verified subset, since the machine-generated QA pairs are unreliable. We use the validation set for testing since the test set does not contain labels.

Experiment Details

We employ the pre-trained model from Deep Bidirectional Transformers (BERT, Devlin et al. (2018)), the current state-of-the-art model on the SQuAD challenge, which stacks several Transformer layers Vaswani et al. (2017) pretrained on large corpora and predicts the indices of the exact location of an answer. We embed the words of each sentence into a memory cell using a GRU and restrict the number of cells, which bounds the maximum number of words the memory can hold. This is a reasonable restriction, since BERT limits the maximum number of word tokens to 512, including both the context and the query.

Model ExactMatch F1
FIFO 24.53 27.22
Uniform 28.30 34.39
LIFO 46.23 50.10
EMR-Independent 34.91 41.15
EMR-biGRU 52.20 57.57
EMR-Transformer 51.26 56.14
Table 1: Q&A accuracy on the TriviaQA dataset.
Results and Analysis

We report the performance of our model on TriviaQA using both ExactMatch and F1-score in Table 1. We see that the EMR models which consider the relative importance between the memory entries (EMR-biGRU and EMR-Transformer) outperform both the rule-based baselines and EMR-Independent. One interesting observation is that LIFO performs quite well, unlike the other rule-based scheduling policies, but this is due to the dataset bias (see Figure 5) whereby most answers are located in the earlier parts of the documents. To further see whether the improvement comes from the ability to remember important context, we examine the sentences that remain in the memory after EMR finishes reading all the sentences, in Figure 6. We see that EMR remembered the sentences containing the key words required to answer the future question. See the supplementary file for more examples.

4.2 TVQA

Dataset

TVQA Lei et al. (2018) is a localized, compositional video question-answering dataset that contains 153K question-answer pairs from 22K clips spanning over 460 hours of video. The questions are multiple-choice questions on the video contents, where the task is to find the single correct answer out of five candidates. The questions can be answered by examining the annotated clip segments, which span only a portion of the frames in each clip (see Figure 7 (a)). In addition to the video frames, the dataset also provides subtitles for each video frame. Thus, solving the questions requires compositional reasoning capability over both a large number of images and texts.

Figure 7: An example from the TVQA dataset and a visualization of how our model operates. The answer in red is the ground truth and the underlined answer is the one predicted by our QA model.
Figure 8: QA accuracy of various memory scheduling policies on the TVQA dataset, reported as a function of the number of memory entries.
Experiment Details

As the QA module, we use the multi-stream model for multi-modal video QA, the attention-based baseline model provided in Lei et al. (2018). For efficient training, we use features extracted from a ResNet-101 pretrained on the ImageNet dataset. For embedding the subtitles and question-answer pairs, we use GloVe Pennington et al. (2014). For training, we restrict the number of memory entries of our episodic reader, where each memory entry contains the encoding of a video frame and the subtitle associated with that frame; the former is encoded using a CNN and the latter using a GRU. We train our model and the baseline models using the ADAM optimizer Kingma and Ba (2014) with a fixed initial learning rate. Unlike in the experiments on TriviaQA, we use REINFORCE Williams (1992) to train the policy. This is because TVQA is composed of consecutive image frames captured within short time intervals, which tend to contain redundant information; the value network of the actor-critic model therefore fails to estimate a good value for a given state, since deleting a good frame does not necessarily result in a loss of QA accuracy. We thus compute the reward as the difference in QA accuracy between consecutive time steps and train only the policy with non-episodic REINFORCE. With this method, if the QA module fails to solve the question after deleting a certain frame, the frame is considered important, and unimportant otherwise.

Results and Analysis

We report the accuracy on TVQA as a function of memory size in Figure 8. We observe that the EMR variants significantly outperform all baselines, including EMR-Independent. We also observe that the models perform well even when the size of the memory at test time is increased well beyond the fixed size used during training. When the size of the memory is small, the gap between the models is larger, with EMR-Transformer obtaining the best accuracy, which may be due to its ability to capture the global relative importance of each memory entry. However, the gap between EMR-Transformer and EMR-biGRU diminishes as the size of the memory increases, since the memory then becomes large enough to contain all the frames necessary to answer the question.

As qualitative analysis, we further examine which frames and subtitles were preserved in the external memory after the model has read through the entire sequence (Figure 7). To answer the question in this example, the model should consider the relationship between two frames, where the first frame shows Ross showing the paper to the others, and the second frame shows Monica entering the coffee shop. We see that our model kept both frames, although it did not know what the question would be. See the Appendix for more examples.

5 Conclusion

We proposed a novel problem of question answering from streaming data, where the model needs to answer a question given only after reading through an unbounded amount of context (e.g. documents, videos) that cannot fit into the system memory. To handle this problem, we proposed Episodic Memory Reader (EMR), a memory-augmented network with an RL-based memory scheduler that learns the relative importance among memory entries and replaces the entry with the lowest importance, in order to maximize QA performance on future tasks. We validated EMR on large-scale text and video QA datasets against rule-based memory scheduling policies as well as an RL baseline that does not model the relative importance among memory entries, both of which it significantly outperforms. Further qualitative analysis of the memory contents after learning confirms that this strong performance comes from the model's ability to retain important instances for future QA tasks.

References

  • Back et al. (2018) Seohyun Back, Seunghak Yu, Sathish Reddy Indurthi, Jihie Kim, and Jaegul Choo. 2018. Memoreader: Large-scale reading comprehension through neural memory controller. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 2131–2140.
  • Chen et al. (2017) Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1870–1879.
  • Choi et al. (2017) Eunsol Choi, Daniel Hewlett, Jakob Uszkoreit, Illia Polosukhin, Alexandre Lacoste, and Jonathan Berant. 2017. Coarse-to-fine question answering for long documents. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 209–220.
  • Clark and Gardner (2018) Christopher Clark and Matt Gardner. 2018. Simple and effective multi-paragraph reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 845–855.
  • Cui et al. (2017) Yiming Cui, Zhipeng Chen, Si Wei, Shijin Wang, Ting Liu, and Guoping Hu. 2017. Attention-over-attention neural networks for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 593–602.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
  • Graves et al. (2014) Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural turing machines. CoRR, abs/1410.5401.
  • Graves et al. (2016) Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwinska, Sergio Gomez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, Adrià Puigdomènech Badia, Karl Moritz Hermann, Yori Zwols, Georg Ostrovski, Adam Cain, Helen King, Christopher Summerfield, Phil Blunsom, Koray Kavukcuoglu, and Demis Hassabis. 2016. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476.
  • Gülçehre et al. (2016) Çaglar Gülçehre, Sarath Chandar, Kyunghyun Cho, and Yoshua Bengio. 2016. Dynamic neural turing machine with soft and hard addressing schemes. CoRR, abs/1607.00036.
  • Hermann et al. (2015) Karl Moritz Hermann, Tomás Kociský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 1693–1701.
  • Hu et al. (2018) Minghao Hu, Yuxing Peng, Zhen Huang, Xipeng Qiu, Furu Wei, and Ming Zhou. 2018. Reinforced mnemonic reader for machine reading comprehension. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, pages 4099–4106.
  • Indurthi et al. (2018) Sathish Reddy Indurthi, Seunghak Yu, Seohyun Back, and Heriberto Cuayáhuitl. 2018. Cut to the chase: A context zoom-in network for reading comprehension. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 570–575.
  • Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1601–1611.
  • Kim et al. (2018) Kyung-Min Kim, Seong-Ho Choi, Jin-Hwa Kim, and Byoung-Tak Zhang. 2018. Multimodal dual attention memory for video story question answering. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XV, pages 698–713.
  • Kim et al. (2017) Kyung-Min Kim, Min-Oh Heo, Seong-Ho Choi, and Byoung-Tak Zhang. 2017. Deepstory: Video story QA by deep embedded memory networks. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, pages 2016–2022.
  • Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
  • Lei et al. (2018) Jie Lei, Licheng Yu, Mohit Bansal, and Tamara L. Berg. 2018. TVQA: localized, compositional video question answering. CoRR, abs/1809.01696.
  • Min et al. (2018) Sewon Min, Victor Zhong, Richard Socher, and Caiming Xiong. 2018. Efficient and robust question answering from minimal context over documents. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 1725–1735.
  • Mnih et al. (2016) Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 1928–1937.
  • Na et al. (2017) Seil Na, Sangho Lee, Jisung Kim, and Gunhee Kim. 2017. A read-write memory network for movie story understanding. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 677–685.
  • Pan et al. (2017) Boyuan Pan, Hao Li, Zhou Zhao, Bin Cao, Deng Cai, and Xiaofei He. 2017. MEMEN: multi-layer embedding with memory networks for machine comprehension. CoRR, abs/1707.09098.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1532–1543.
  • Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 2383–2392.
  • Richardson et al. (2013) Matthew Richardson, Christopher J. C. Burges, and Erin Renshaw. 2013. Mctest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 193–203.
  • Seo et al. (2016) Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. CoRR, abs/1611.01603.
  • Sukhbaatar et al. (2015) Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-to-end memory networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 2440–2448.
  • Tapaswi et al. (2016) Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2016. Movieqa: Understanding stories in movies through question-answering. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 4631–4640.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 6000–6010.
  • Wang et al. (2018a) Bo Wang, Youjiang Xu, Yahong Han, and Richang Hong. 2018a. Movie question answering: Remembering the textual cues for layered visual contents. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 7380–7387.
  • Wang et al. (2018b) Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerry Tesauro, Bowen Zhou, and Jing Jiang. 2018b. R^3: Reinforced ranker-reader for open-domain question answering. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 5981–5988.
  • Williams (1992) Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256.
  • Xiong et al. (2016) Caiming Xiong, Stephen Merity, and Richard Socher. 2016. Dynamic memory networks for visual and textual question answering. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 2397–2406.
  • Yu et al. (2018) Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V. Le. 2018. Qanet: Combining local convolution with global self-attention for reading comprehension. CoRR, abs/1804.09541.
  • Zaheer et al. (2017) Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabás Póczos, Ruslan R. Salakhutdinov, and Alexander J. Smola. 2017. Deep sets. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 3394–3404.

Appendix A Appendix

A.1 TriviaQA

We provide additional examples showing what our EMR models have remembered on the TriviaQA dataset.


Figure 9: An example visualization of the memory. The answer word 'belgium' (black/thick) arrives at the first timestep, and our model retains the sentences containing it.

Figure 10: An example visualization of the memory. The answer word 'ely' (black/thick) arrives at the first timestep, and our model retains it after reading in all the context sentences.

Figure 11: An example visualization of the memory. The answer word 'alaska' (black/thick) arrives at timestep 10, and our model retains it after reading in all the context sentences.

A.2 TVQA

We provide additional examples showing what our EMR models have remembered on the TVQA dataset. The frames illustrated in each figure are those held in the external memory at the last time step. Stars of different colors denote the supporting frames for different questions.

Figure 12: An example clip from the drama 'House'. Each frame marked with a star corresponds to the question marked with a star of the same color.
Figure 13: An example clip from the drama 'Friends'. Each frame marked with a star corresponds to the question marked with a star of the same color.
Figure 14: An example clip from the drama 'Castle'. Each frame marked with a star corresponds to the question marked with a star of the same color.
Figure 15: An example clip from the drama 'How I Met Your Mother'. Each frame marked with a star corresponds to the question marked with a star of the same color.