Learning to Search in Long Documents Using Document Structure

06/09/2018 ∙ by Mor Geva, et al. ∙ Tel Aviv University

Reading comprehension models are based on recurrent neural networks that sequentially process the document tokens. As interest turns to answering more complex questions over longer documents, sequential reading of large portions of text becomes a substantial bottleneck. Inspired by how humans use document structure, we propose a novel framework for reading comprehension. We represent documents as trees, and model an agent that learns to interleave quick navigation through the document tree with more expensive answer extraction. To encourage exploration of the document tree, we propose a new algorithm, based on Deep Q-Network (DQN), which strategically samples tree nodes at training time. Empirically we find our algorithm improves question answering performance compared to DQN and a strong information-retrieval (IR) baseline, and that ensembling our model with the IR baseline results in further gains in performance.




Code Repositories


Author implementation of "Learning to Search in Long Documents Using Document Structure" (Mor Geva and Jonathan Berant, 2018)


1 Introduction

This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/

Reading comprehension (RC), the task of reading documents and answering questions about their content, has attracted immense attention recently. While early work focused on simple questions and short paragraphs [Hermann et al.2015, Rajpurkar et al.2016, Trischler et al.2017, Onishi et al.2016], current work is shifting towards more complex questions that require reasoning over long documents [Joshi et al.2017, Hewlett et al.2016, Welbl et al.2017, Kočisky et al.2017].

Long documents pose a challenge for current RC models, as they are dominated by recurrent neural networks (RNNs) [Chen et al.2016, Kadlec et al.2016, Xiong et al.2017]. RNNs process documents token-by-token, and thus using them for long documents is prohibitive. A common solution is to retrieve part of the document with an IR approach [Chen et al.2017, Clark and Gardner2017] or a cheap model [Watanabe et al.2017], and run an RNN over the retrieved excerpts. However, as documents become longer and questions become complex, two problems emerge: (a) retrieving all the necessary text with a one-shot IR method when performing complex reasoning becomes harder, and thus thousands of tokens are retrieved [Clark and Gardner2017]. (b) Running even a cheap model over the document in its entirety becomes expensive [Choi et al.2017].

Humans, in lieu of a mental inverted index, use document structure to guide their search for answers. E.g., the answer to “What high school did Leonard Cohen go to?” is likely to appear in “Early life”, while the answer to “How hot is it in Melbourne in July?” is likely to appear in “Climate”. In this work we investigate whether we can train a model to navigate through a document using its structure and find the answer, while reading only a small portion of the entire document.

Figure 1: An overview of our framework: an agent answers the question for a document by starting at the title, performing navigation actions until reaching the relevant paragraph, extracting the answer, and then stopping.

We represent documents as trees and train an agent that navigates through the document tree until returning a final answer. Figure 1 illustrates this process. Our agent reads the question “The vacation destinations of Pattaya and Phuket are in which country?” and starts navigation at the title of the document. After reading a paragraph, and skipping “History”, it drills down to “Geography” until finally halting at a paragraph that specifies the answer (“Thailand”). The agent observes at each step only a glimpse of the local text to determine its next action, which can be movement to a tree node, answering the question with a more expensive RC model, or terminating navigation. Thus, the agent consumes only a small fraction of the entire document.

Our training data is question-document-answer triplets, without gold navigation paths, and thus we train our model with the Deep Q-Network (DQN) algorithm. Because the dataset is biased towards answers appearing at the beginning of the document, the algorithm tends to stop early and does not explore the document well. To overcome this challenge, we propose DocQN: a variant of DQN for tree navigation that improves exploration by sampling nodes from multiple parts of the tree.

We evaluate the ability of our agent to navigate to paragraphs containing the answer on a variant of TriviaQA [Joshi et al.2017] and find that: (a) DocQN navigates better than DQN in documents both quantitatively and qualitatively. (b) While DocQN observes only 6% of the document tokens, it outperforms an IR method in end-to-end QA performance. (c) An ensemble of DocQN and IR substantially improves both navigation and end-to-end QA performance over the ensemble components.

To summarize, in this paper we ask: can an agent use document structure and learn to find answers for complex questions in long documents? We propose a new model and training algorithm that overcomes an inherent bias in the data, answering the aforementioned question in the affirmative. Our code and dataset are available at https://github.com/mega002/DocQN.

2 Problem Overview

We work in the traditional RC setup, where we are given question-document-answer triplets as a training set, and aim to learn a function that finds the answer for an unseen question-document pair. Unlike prior work, we assume documents are trees, where every tree node corresponds to a structural element and is labeled with text. Specifically, the root is labeled with the document title, sections and subsections are labeled by their title, and paragraphs and sentences are labeled by the text they contain. In addition, we order all non-sentence tree nodes by a pre-order traversal (which corresponds to the linear order of text in the document), and refer to the position of a node in this order as its node index. For sentence nodes, the node index is the index of their parent (a paragraph).

Figure 1 shows an example tree, where for each node we show the relevant structural element and index (sentence nodes are not shown in the figure).
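The indexing scheme can be illustrated with a minimal sketch. The `Node` class, the `add_child` and `assign_indices` helpers, the toy texts, and the choice to start counting at 0 are all assumptions made for illustration, not the released implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    kind: str  # "title", "section", "paragraph", or "sentence"
    text: str
    children: List["Node"] = field(default_factory=list, repr=False)
    parent: Optional["Node"] = field(default=None, repr=False)
    index: int = -1  # assigned by assign_indices


def add_child(parent: Node, child: Node) -> Node:
    """Attach child as the last child of parent and return it."""
    child.parent = parent
    parent.children.append(child)
    return child


def assign_indices(root: Node) -> None:
    """Number non-sentence nodes in pre-order; sentences reuse their parent's index."""
    counter = 0  # assumption: indices start at 0 at the root
    stack = [root]
    while stack:
        node = stack.pop()
        if node.kind == "sentence":
            node.index = node.parent.index
        else:
            node.index = counter
            counter += 1
        # Push children reversed so the leftmost child is visited first.
        stack.extend(reversed(node.children))


# Toy document loosely modeled on Figure 1 (paragraph text is made up).
root = Node("title", "Phuket Province")
name = add_child(root, Node("section", "Name"))
para = add_child(name, Node("paragraph", "A paragraph about the name."))
sent = add_child(para, Node("sentence", "A paragraph about the name."))
geo = add_child(root, Node("section", "Geography"))
assign_indices(root)
print([(n.text, n.index) for n in (root, name, para, sent, geo)])
```

Because a paragraph is always visited before its sentences in pre-order, the parent's index is already available when a sentence node is reached.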

With this document representation, answering questions can be viewed as a Markov Decision Process (MDP), where in each state the agent is located in a particular tree node, actions allow movement through the document tree, answering the question with an RC model, or stopping, and a reward is based on whether the agent locates a node that contains the answer.

Our agent interleaves actions that navigate in the document, with an action that runs an RC model in a certain document position and extracts an answer. Thus, the agent can decide to continue navigation after extracting a certain answer. This is strictly more expressive than existing approaches that combine IR with RC, where some text is retrieved exactly once before applying an RC model. As RC shifts to reasoning over complex questions, navigating and reading multiple parts of the document will become necessary. We show in Section 5 this approach indeed improves QA performance.

3 Data

To test our framework, we capitalize on the recently-released TriviaQA dataset, which contains question-answer pairs, along with a small set of documents that (in almost all cases) contain the answer. TriviaQA is suitable for our purposes as it is a large scale dataset, where questions are relatively complex and documents are fairly long. The dataset includes only raw text, and thus for every evidence document, we built a tree representation by extracting the html metadata from the corresponding Wikipedia page, and constructing the document structure from it.

Because our goal is to investigate whether a model can learn to search through a document, it is important that a non-negligible fraction of the questions require navigation through the document. However, in Wikipedia each document starts with a preface that summarizes the document, and thus often contains the answer. Consequently, a model that ignores the question and document and always stops in the first paragraph is likely to obtain good performance. Figure 2 shows the distribution of the first answer occurrence (FAO) in a document in TriviaQA over the node indices (x-axis), where all tree nodes except sentences are considered (see Section 2). We find that in most question-document pairs the FAO is in the first few paragraphs, and that in 60% of the cases it is in the preface section.

                           TriviaQA    TriviaQA-NoP
Train        Questions     61,888      57,220
             Documents     110,647     99,315
Development  Questions     7,993       7,336
             Documents     14,229      12,706
Test         Questions     7,701       6,507
             Documents     13,661      10,481
Table 1: Data statistics for TriviaQA vs. TriviaQA-NoP.

Average number of tokens                 5590.7
Average number of tree nodes             332.2
Average number of high-level sections    6.6
Table 2: Data statistics for the evidence documents of TriviaQA-NoP.

Figure 2: FAO node index distribution of TriviaQA and TriviaQA-NoP (median values in red).

                           TriviaQA    TriviaQA-NoP
Questions                  86.7%       83.8%
Question-document pairs    69.7%       66.9%
Table 3: Portion of answerable samples in a random subset of the training set, which contains 278 questions and 475 question-document pairs.

To alleviate this heavy bias, we derive a new dataset, termed TriviaQA-NoP, where we remove the preface section from every document. After removing the preface, 2,144 out of 77,582 (2.8%) questions and 6,124 out of 138,537 (4.4%) question-document pairs are left without an answer and are removed. To further reduce the number of cases where an answer cannot be inferred from a document, we drop question-document pairs where: (a) the answer appears only in titles; (b) the answer is a single character; (c) the FAO node index exceeds a fixed cutoff (in most cases the answer is an item in a list). The final dataset includes 91.6% of the questions and 88.4% of the question-document pairs from the original dataset. We provide full statistics on the dataset in Tables 1 and 2.

To verify that questions remain answerable in TriviaQA-NoP after removing the preface, we perform manual analysis on a random sample of 278 questions and 475 documents from the training set (Table 3). We find that the portion of answerable questions and question-document pairs remains high and is reduced by less than 3% in comparison to TriviaQA. This demonstrates that indeed, for most questions and documents, the context necessary for answering the question also appears in the document body and not only in the preface.

Figure 2 shows the FAO node index distribution in TriviaQA-NoP. We observe, compared to TriviaQA, that the first occurrence of an answer is much more spread out across the document, and that the median increases from 3 to 14, which will require more navigation from the agent. However, even in TriviaQA-NoP answers tend to appear at the beginning of the document, because document content is usually organized by importance. This bias results in an exploration challenge for our training algorithm, which we will address in Section 4.

4 Method

In this section, we describe a model for the agent and a training algorithm based on DQN [Mnih et al.2015]. Specifically, we introduce a tree-sampling strategy, which addresses the exploration challenge stemming from the bias towards answers early in the document.

4.1 Framework

We represent the MDP as a tuple $(S, A, R, T)$, where $S$ is the state space, $A$ is the action space, $R$ is a reward function, and $T$ is a deterministic transition function. Our model implements an action-value function $Q(s, a)$, which takes a state $s$ and returns a value for every action $a$. This function defines a policy $\pi(s) = \arg\max_a Q(s, a)$. We now describe the state space, actions and reward function.

Observation: “Phuket Province Name”
height: 2
depth: 1
h_dist_start: 0
h_dist_end: 2
parent.h_dist_start: 0
parent.h_dist_end: 0
navigation step: 1
Table 4: An example observation and list of navigation features for node 1 in Figure 1. The features height and depth correspond to distance from the farthest leaf and root, respectively (height is 2 since sentence nodes are omitted from the figure). h_dist_start and h_dist_end measure the horizontal distance from the first and last child of the node’s parent. navigation step is a counter for the number of performed steps.


Given a tree node, a state is a tuple of five elements: the question, an observation, an answer prediction, navigation features, and answer prediction features. An observation is a sequence of tokens produced by recursively concatenating a fixed number of the first tokens of the node's text label to the observation of the node's parent. An answer prediction is the sequence of tokens that were extracted by an RC model, if an RC model was already run on the node (and a null token otherwise). The answer prediction features provide information on the distribution over answer spans provided by the RC model, which reflects its confidence: the entropy of the distribution, the logit value for the answer prediction, and the number of tokens in the answer prediction. Navigation features provide information on the relative location of the node in the document. An example observation and full list of navigation features are given in Table 4.
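As a rough illustration of how such an observation could be assembled, the sketch below prepends the parent's observation to a truncated copy of the node's own text. The function name `observation`, the dictionary-based tree encoding, whitespace tokenization, and the cutoff `k` are assumptions for illustration only.

```python
from typing import Dict, List, Optional


def observation(
    node: str,
    text: Dict[str, str],
    parent: Dict[str, Optional[str]],
    k: int,
) -> List[str]:
    """Parent's observation followed by the first k tokens of this node's text."""
    own_tokens = text[node].split()[:k]  # assumption: whitespace tokenization
    parent_id = parent[node]
    if parent_id is None:
        return own_tokens
    return observation(parent_id, text, parent, k) + own_tokens


# Toy tree: a title with one section, as in the Table 4 example.
text = {"root": "Phuket Province", "name": "Name"}
parent = {"root": None, "name": "root"}
print(" ".join(observation("name", text, parent, k=5)))
```

With these toy inputs the observation of the section node is the title tokens followed by the section title, matching the example observation in Table 4.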

Note that the state does not depend on the history of visited tree nodes (except for navigation step, which can be approximated by the shortest path from the root to any node). While incorporating history could be beneficial, a memory-less model enables us to explore tree-sampling strategies, which is important for training (Section 4.2).


We define the following set of actions. Let $v$ be a node with an ordered list of children $u_1, \dots, u_n$, and let $u_j$ be a child of $v$. We define five movement actions (Figure 3), where Down moves from $v$ to its first child $u_1$, Right moves from $u_j$ to $u_{j+1}$, and Left moves from $u_j$ to $u_{j-1}$. Because moving upwards reaches a node we already visited, we define UpR, which moves from $u_j$ to the right sibling of $v$, and UpL, which moves from $u_j$ to the left sibling of $v$. If an illegal action is chosen (e.g., Left from $u_1$), then the agent stays in its current position. The action Answer returns an answer (and a distribution over spans) by running an RC model on the current node, unless the current node is a sentence, in which case it is run on the paragraph containing it. After Answer, the agent can resume navigation. The action Stop also returns the answer given the current node, but also terminates navigation.

Figure 3: Movement actions in our environment.
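The movement actions can be sketched as a small transition function. This is an illustrative reading rather than the released code: the `Node` class and `move` function are hypothetical, and we assume that UpR and UpL move to the right and left sibling of the current node's parent.

```python
from typing import List, Optional


class Node:
    def __init__(self, name: str, parent: Optional["Node"] = None) -> None:
        self.name = name
        self.parent = parent
        self.children: List["Node"] = []
        if parent is not None:
            parent.children.append(self)


def _sibling(node: Node, offset: int) -> Optional[Node]:
    """Return the sibling at the given offset, or None if it does not exist."""
    if node.parent is None:
        return None
    siblings = node.parent.children
    pos = next(i for i, s in enumerate(siblings) if s is node) + offset
    return siblings[pos] if 0 <= pos < len(siblings) else None


def move(node: Node, action: str) -> Node:
    """Apply a movement action; an illegal action leaves the agent in place."""
    if action == "Down":
        target = node.children[0] if node.children else None
    elif action == "Right":
        target = _sibling(node, +1)
    elif action == "Left":
        target = _sibling(node, -1)
    elif action == "UpR":  # assumption: right sibling of the parent
        target = _sibling(node.parent, +1) if node.parent is not None else None
    elif action == "UpL":  # assumption: left sibling of the parent
        target = _sibling(node.parent, -1) if node.parent is not None else None
    else:
        raise ValueError(f"unknown movement action: {action}")
    return target if target is not None else node


root = Node("root")
a, b = Node("a", root), Node("b", root)
a1, a2 = Node("a1", a), Node("a2", a)
print(move(a1, "UpR").name, move(a1, "Left").name)
```

Returning the current node for illegal moves keeps the transition function total, so a learned policy never has to be masked during navigation.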


Our goal is to develop an agent that can navigate in the document, and thus we define the reward based on whether the agent stops in a node that contains the gold answer text (this is noisy, because the answer might be there sporadically). While a simple reward would be an indicator for whether the agent stopped at a correct node, such a reward would not capture the proximity of the agent to the answer. Moreover, we would like to consider the overall document length, rewarding successful navigations in long documents. Therefore, we define the following reward:

The reward depends on the node at which the agent is located and on the closest tree node that contains the answer, which are compared through the node index defined in Section 2. Thus, when stopping, the reward is proportional to the distance to the closest answer location given the document length. An additional reward is given if navigation is successful, and a penalty is given for any other action, to encourage shorter trajectories. We further penalize the Answer action to discourage frequent usage of the RC model.

4.2 DocQN: DQN with Tree Sampling

Training the navigation model is based on DQN [Mnih et al.2015]. In DQN, at every step an agent at state $s$ selects an action $a$ using an $\epsilon$-greedy policy, given the current action-value function $Q(s, a; \theta)$ parameterized by $\theta$. The agent observes a reward $r$ and a state $s'$ and adds a transition $(s, a, r, s')$ to a replay memory buffer $D$ that holds a large number of recent transitions. The parameters $\theta$ are then optimized so that the action-value function better matches the observed reward. This is done by sampling a batch of random transitions from $D$ and minimizing the regression loss

$$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim D} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \right)^2 \right] \qquad (1)$$

where $\theta^{-}$ are the parameters of a target network, which is a periodic copy of $\theta$ that is not optimized, and $\gamma$ is a discount factor.
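For concreteness, the standard DQN regression loss over a batch of transitions can be sketched as follows. The function name `dqn_loss`, the toy Q-tables, and the convention that a terminal transition has no next state (and therefore no bootstrap term) are assumptions for illustration; the enhancements listed afterwards (Double Q-Learning, prioritized replay) are not included.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# (state, action index, reward, next state); next state is None after a terminal action.
Transition = Tuple[str, int, float, Optional[str]]
QFunction = Callable[[str], Sequence[float]]  # maps a state to one value per action


def dqn_loss(
    batch: List[Transition],
    q: QFunction,
    q_target: QFunction,
    gamma: float,
) -> float:
    """Mean squared temporal-difference error over a batch of transitions."""
    total = 0.0
    for s, a, r, s_next in batch:
        # The target network supplies the bootstrap term; it is not optimized.
        bootstrap = 0.0 if s_next is None else gamma * max(q_target(s_next))
        td_error = r + bootstrap - q(s)[a]
        total += td_error ** 2
    return total / len(batch)


# Toy tabular Q-functions over two states and two actions.
online: Dict[str, List[float]] = {"s0": [1.0, 2.0], "s1": [0.5, 0.0]}
target: Dict[str, List[float]] = {"s0": [0.0, 0.0], "s1": [3.0, 1.0]}
batch: List[Transition] = [("s0", 1, 1.0, "s1"), ("s1", 0, 2.0, None)]
loss = dqn_loss(batch, online.__getitem__, target.__getitem__, gamma=0.5)
print(loss)
```

In the toy batch the two squared errors are 0.25 and 2.25, so the mean is 1.25.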

We also add some of the recent enhancements to DQN [Hessel et al.2017], which have proved to be useful in our setup. Specifically we implement Double Q-Learning [van Hasselt et al.2016], Prioritized Experience Replay [Schaul et al.2015], and Dueling Networks [Wang et al.2016].

Reducing bias with DocQN

The DQN algorithm contains episodes, where in each episode the agent is placed at an initial state $s_0$, from which it starts taking actions. In our setup, this state corresponds to the root of the document tree. Because the data has a bias towards answers appearing at the beginning of the document, the agent learns that stopping early improves the reward and gets stuck at a local optimum, where it ceases exploration. Examining Figures 2 and 8, we observe that DQN learns to stop very early in the document compared to the FAO node index distribution of TriviaQA-NoP, amplifying the bias in the data.

To address this issue, we suggest DocQN, a variant of DQN aimed at increasing exploration when navigating in a document structure. DocQN capitalizes on two properties. First, DQN is an off-policy algorithm that trains from stored transitions. Second, our model is memory-less, and thus we can sample any node and compute the corresponding state. Therefore, we modify DQN, and instead of initializing every episode with the initial state and performing a sequence of actions, we sample states from distributions that explore the document better. By exploring transitions from across the document, the model learns from more distant parts of the document.
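The idea of mixing sampled states with ordinary trajectories can be sketched as a data-collection routine. This is a simplified illustration under stated assumptions, not the authors' code: `collect_episode`, its arguments, and the toy line-shaped environment are hypothetical, and the action-selection policy is passed in as a plain function.

```python
import random
from typing import Any, Callable, List, Tuple

Transition = Tuple[Any, str, float, Any]  # (state, action, reward, next state)


def collect_episode(
    sample_state: Callable[[random.Random], Any],
    initial_state: Any,
    choose_action: Callable[[Any, random.Random], str],
    env_step: Callable[[Any, str], Tuple[float, Any, bool]],
    p_sample: float,
    n_steps: int,
    rng: random.Random,
) -> List[Transition]:
    """Collect transitions either from sampled states or from one trajectory."""
    transitions: List[Transition] = []
    if rng.random() < p_sample:
        # Tree-sampling branch: independent one-step transitions from sampled states.
        for _ in range(n_steps):
            s = sample_state(rng)
            a = choose_action(s, rng)
            r, s_next, _ = env_step(s, a)
            transitions.append((s, a, r, s_next))
    else:
        # Standard DQN branch: follow a trajectory from the initial state.
        s = initial_state
        for _ in range(n_steps):
            a = choose_action(s, rng)
            r, s_next, done = env_step(s, a)
            transitions.append((s, a, r, s_next))
            if done:
                break
            s = s_next
    return transitions


def toy_step(s: int, a: str) -> Tuple[float, int, bool]:
    """Toy environment: states 0..9 on a line; the answer is at state 5."""
    if a == "Stop":
        return (1.0 if s == 5 else 0.0, s, True)
    return (-0.1, min(s + 1, 9), False)


rng = random.Random(0)
always_right = lambda s, r: "Right"
sampled = collect_episode(lambda r: r.randrange(10), 0, always_right, toy_step, 1.0, 4, rng)
trajectory = collect_episode(lambda r: r.randrange(10), 0, always_right, toy_step, 0.0, 4, rng)
print(len(sampled), [t[0] for t in trajectory])
```

This works only because the update is off-policy and the state is memory-less: a transition starting at an arbitrary node is as valid a training example as one reached by walking from the root.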

Figure 4: Illustration of the different state distributions for a specific document tree. Nodes with darker color have higher sampling probability, and nodes marked in red are paragraph nodes containing the answer.

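Algorithm 1 DocQN
1: Let a distribution over tree nodes be given
2: Initialize replay memory $D$ and parameters $\theta$, $\theta^{-}$
3: for each episode do
4:     if random() < the sampling probability then
5:         for each sampling step do
6:             sample a node from the distribution and generate its state $s$
7:             sample action $a$ from $Q(s, \cdot; \theta)$ ($\epsilon$-greedily)
8:             execute $a$ and observe reward $r$ and next state $s'$
9:             store $(s, a, r, s')$ in $D$
10:     else
11:         Initialize start state $s_0$
12:         for each navigation step do
13:             sample action $a$ from $Q(s, \cdot; \theta)$ ($\epsilon$-greedily)
14:             execute $a$ and observe reward $r$ and next state $s'$
15:             store $(s, a, r, s')$ in $D$
16:     update $\theta$ by minimizing the regression loss (Equation 1)

Figure 5: High-level overview of the network architecture.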

Algorithm 1 provides the details of DocQN. The algorithm initializes the replay memory buffer, the model and target network parameters, and chooses a distribution over tree nodes. At the beginning of every episode, with a sampling probability that is annealed during training (line 4), we sample state transitions using the node distribution and store them in the replay memory, and otherwise (line 10), we start at the initial state and sample a trajectory as in DQN. Parameters are updated as in DQN by sampling transitions from the replay buffer and minimizing Equation 1. If the distribution samples nodes from various locations in the document, the algorithm will explore better and will not get stuck at its beginning. We consider the following instantiations of the node distribution:

  1. Uniform sampling: We sample uniformly over nodes, except that we discourage sampling sentences: with a fixed probability we uniformly sample a leaf (sentence), and with the complementary probability we uniformly sample an inner node.

  2. Backward sampling: We uniformly sample a paragraph node that contains the gold answer, then uniformly sample a number of actions, and perform that many random movement actions from the sampled paragraph to output a node. This results in a node that is “close” to the answer, and can be viewed as similar to bi-directional search or backward search [Lao et al.2015].

Figure 4 shows the distribution over tree nodes for a specific document tree where the answer appears in two paragraph nodes (ignoring sentence nodes). We see that sequential sampling (as in DQN) puts most of the probability mass close to the root, uniform sampling is uniform (across paragraphs, as sentence nodes are omitted), and backward sampling is concentrated close to answer nodes. This illustrates how different distributions result in different exploration strategies.
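The two sampling schemes can be sketched as follows. The function names, the `p_sentence` and `max_steps` parameters, and the `random_move` callback (which stands in for applying one random movement action) are assumptions for illustration; the values used in the experiments are not reproduced here.

```python
import random
from typing import Callable, Sequence, TypeVar

N = TypeVar("N")


def sample_uniform(
    inner_nodes: Sequence[N],
    sentences: Sequence[N],
    p_sentence: float,
    rng: random.Random,
) -> N:
    """Pick a sentence with probability p_sentence, otherwise an inner node."""
    pool = sentences if rng.random() < p_sentence else inner_nodes
    return rng.choice(pool)


def sample_backward(
    answer_paragraphs: Sequence[N],
    random_move: Callable[[N, random.Random], N],
    max_steps: int,
    rng: random.Random,
) -> N:
    """Start at a paragraph containing the answer and walk a random number of steps."""
    node = rng.choice(answer_paragraphs)
    for _ in range(rng.randint(0, max_steps)):  # randint is inclusive on both ends
        node = random_move(node, rng)
    return node


# Toy usage: nodes are integers on a line and a move shifts the position by one.
rng = random.Random(0)
step = lambda n, r: n + r.choice([-1, 1])
print(sample_uniform([10, 11], [20], p_sentence=0.2, rng=rng))
print(sample_backward([7], step, max_steps=3, rng=rng))
```

Backward sampling never returns a node farther than `max_steps` moves from an answer paragraph, which is what concentrates the probability mass around answer nodes in Figure 4.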

Network Architecture

We briefly describe the neural architecture (Figure 5) and provide full details in the supplementary material. As explained in Section 4.1, the input to the network is the state, which comprises the question tokens, observation tokens, answer prediction tokens, and the navigation and answer prediction features. The question, observation and answer prediction tokens are encoded with pre-trained word embeddings and trained character embeddings, where character embeddings are followed by a convolutional layer with max pooling, yielding a single vector per token. Each token is then represented by concatenating the word embedding with the character embedding.

Question tokens and observation tokens are then fed into a BiLSTM and an LSTM [Hochreiter and Schmidhuber1997], respectively, and the LSTM outputs are compressed to a single vector through self attention [Cheng et al.2016], resulting in one vector for the question and one for the observation. Answer prediction tokens are fed into an LSTM, where the last hidden state is concatenated to the answer prediction features, thus creating a third vector for the answer prediction.

We concatenate these three vectors, and pass them through a one-layer feed-forward network that then branches into two networks according to the Dueling DQN architecture [Wang et al.2016]. In each branch we also concatenate the navigation features. One branch predicts the value of the state, and the other branch predicts the advantage of every possible action. The output of the network combines the two as in [Wang et al.2016].
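One common way to combine the two branches in a dueling architecture is to add the state value to the mean-centered advantages. The sketch below shows this variant; whether the model uses exactly this aggregation is an assumption, and `dueling_q_values` is a hypothetical helper.

```python
from typing import List, Sequence


def dueling_q_values(state_value: float, advantages: Sequence[float]) -> List[float]:
    """Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a')."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + adv - mean_advantage for adv in advantages]


# Toy example with three actions.
q_values = dueling_q_values(1.0, [0.0, 1.0, 2.0])
print(q_values)
```

Subtracting the mean advantage makes the decomposition identifiable: the average of the resulting Q-values equals the predicted state value.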

5 Experimental Evaluation

Our experimental evaluation aims to answer the following questions: (a) Can document structure be used to learn to navigate to an answer in a document? (b) How does DocQN compare to DQN? (c) How does DocQN compare to IR methods that observe the entire document?

5.1 Experimental Setup

We evaluate on TriviaQA-NoP. Because our focus is on the navigation ability of the agent, we train a single RC model and fix it in all experiments. Specifically, we download RaSoR [Lee et al.2017, Salant and Berant2018] (https://github.com/shimisalant/RaSoR) and exactly follow the procedure described by the authors of TriviaQA for training an RC model, i.e., we train RaSoR on the first tokens of each document in TriviaQA-NoP. As a sanity check for our RC model, we also train and evaluate RaSoR on the original TriviaQA dataset. Indeed, RaSoR obtains EM and F1 scores that are substantially higher than the baseline reported by [Joshi et al.2017].


We use two metrics: First, we measure navigation accuracy, i.e., for a question-document pair, whether a method returns text containing a gold answer (if the agent stops at a sentence node we evaluate the encompassing paragraph). Because questions in TriviaQA-NoP often have more than one evidence document, we also measure aggregated navigation accuracy, where we give credit if the agent navigated correctly in any of the documents. This gives performance assuming an oracle that always chooses the best document for the question. Because the test set in TriviaQA is hidden, we evaluated navigation accuracy on the development set only.
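The two navigation metrics can be computed from per-pair outcomes as in the sketch below. The function name `navigation_accuracy` and the `(question_id, document_id, hit)` record format are assumptions for illustration.

```python
from typing import Dict, Iterable, Tuple


def navigation_accuracy(results: Iterable[Tuple[str, str, bool]]) -> Tuple[float, float]:
    """Return (per-pair accuracy, aggregated accuracy over questions).

    Each record is (question_id, document_id, hit), where hit indicates that
    the returned text contains a gold answer.
    """
    records = list(results)
    per_pair = sum(hit for _, _, hit in records) / len(records)
    # A question counts as correct if navigation succeeded in any of its documents.
    by_question: Dict[str, bool] = {}
    for question_id, _, hit in records:
        by_question[question_id] = by_question.get(question_id, False) or hit
    aggregated = sum(by_question.values()) / len(by_question)
    return per_pair, aggregated


records = [("q1", "d1", False), ("q1", "d2", True), ("q2", "d3", False)]
pair_acc, agg_acc = navigation_accuracy(records)
print(pair_acc, agg_acc)
```

In the toy example one of three pairs is correct, while one of two questions has at least one correct document, so the aggregated accuracy is higher than the per-pair accuracy.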

In addition, we measure end-to-end QA performance with the official Exact Match (EM) and F1 metrics. The EM metric measures the percentage of predictions that match exactly any of the answer aliases, and the F1 metric measures the average overlap between the prediction and answer. To aggregate evidence from multiple documents we follow [Joshi et al.2017] and define the score of an answer to be the sum of probabilities for that answer across all documents.
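The aggregation of evidence across documents can be sketched as a sum of per-document probabilities for each candidate answer. The function name `aggregate_answers` is hypothetical, and the sketch assumes that answer strings are already normalized so that identical answers compare equal.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def aggregate_answers(predictions: Iterable[Tuple[str, float]]) -> Tuple[str, Dict[str, float]]:
    """Sum the probability of each answer across documents and return the best one."""
    scores: Dict[str, float] = defaultdict(float)
    for answer, probability in predictions:
        scores[answer] += probability
    best = max(scores, key=scores.get)
    return best, dict(scores)


# One (answer, probability) prediction per evidence document.
predictions = [("Thailand", 0.4), ("Vietnam", 0.5), ("Thailand", 0.3)]
best_answer, answer_scores = aggregate_answers(predictions)
print(best_answer, answer_scores)
```

Here the answer supported by two documents wins even though a different answer has the single highest probability, which is the intended effect of summing evidence.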


We compare the following models:

  • DocQN: This is our main model, trained with the tree-sampling procedure of Section 4.2.

  • DQN: DQN algorithm without state sampling.

  • {DocQN, DQN}-Coupled: A less expressive version of the model, where the actions Answer and Stop are coupled, i.e., the agent can use the RC model exactly once and then stops. This is similar to a setup where retrieval is performed once and not interleaved with navigation.

  • RandomWalk: An agent that selects an action at each step uniformly at random.

  • RandomPara: An agent that randomly selects a non-sentence tree node.

  • Doc-Tf-Idf: An IR baseline, where a paragraph is selected based on its tf-idf score. We implement the tf-idf scheme of DocQA [Clark and Gardner2017], which is a high-performing system on TriviaQA. Here, idf is computed for each paragraph in the context of the current document, and the paragraph with highest cosine similarity to the question is chosen. Note that Doc-Tf-Idf processes the entire document (up to tens of thousands of tokens), while DocQN processes only a small fraction.

  • Tf-Idf: Vanilla tf-idf, where the idf score is computed from all documents in TriviaQA-NoP.

  • Ensemble: We ensemble DocQN with Doc-Tf-Idf in two ways: (1) For finding the final answer, we aggregate scores as described above, that is, we sum the probabilities from both models over all documents and choose the answer with highest score. (2) For navigation, we simply tune a threshold on the development set, and choose between the prediction of DocQN and that of Doc-Tf-Idf by comparing the index of the node at which DocQN stopped against this threshold.

  • ReadTop: Following [Joshi et al.2017], a model that runs the RC model on the first tokens of the document. Note that running the RC model on the text retrieved by DocQN involves consuming far fewer tokens on average.

We report the value of all hyper-parameters in the supplementary material (all fixed without tuning).

5.2 Results

Figure 6: Navigation example of DocQN.
Figure 7: Model performance (top) and portion of samples in the development set (bottom) as functions of the node index of the FAO.
Figure 8: Distribution of node index at navigation stopping point (median value in red) for DQN and DocQN.

Tables 5 and 6 show the results of our experiments. Focusing on navigation accuracy, we see that randomly walking or choosing a paragraph yields low performance. Vanilla Tf-Idf performs considerably better than the random baselines, but is outperformed by all other models. Comparing DQN to DocQN, we see that DocQN outperforms DQN. Allowing DQN and DocQN to run the RC model during navigation improves their performance, but the accuracy gain is substantially higher for DocQN (2.3 points in accuracy and 3.1 points in aggregated accuracy) than for DQN (0.4 points in accuracy and 0.7 points in aggregated accuracy). Doc-Tf-Idf, which has access to the entire document, outperforms DocQN, which consumes on average 6% of the entire document. Nonetheless, the two models obtain nearly the same aggregated accuracy (43.5 vs. 43.6). This good performance of Doc-Tf-Idf shows that in TriviaQA-NoP answer paragraphs share a lot of lexical material with the question. Importantly, an ensemble of Doc-Tf-Idf with DocQN substantially improves the overall performance, reaching accuracy of 35% and aggregated accuracy of 48%. This is in sharp contrast to an ensemble with DQN, where for any value of the threshold, Doc-Tf-Idf performs better on its own.

Model                   Dev.    Dev. Agg.
RandomWalk              1.8     3.1
RandomPara              13.4    20.9
DQN-Coupled             26.8    38.8
DQN                     27.2    39.5
DocQN-Coupled           28.1    40.5
DocQN                   30.4    43.6
Tf-Idf                  25.3    35.4
Doc-Tf-Idf              32.4    43.5
Ensemble (threshold)    35.0    48.0
Table 5: Navigation accuracy for all models on the development set of TriviaQA-NoP.

                        Development      Test
Model                   EM      F1       EM      F1
DQN                     21.7    26.5     19.1    24.1
DocQN                   23.6    27.9     21.0    25.5
Doc-Tf-Idf              21.4    27.1     18.2    23.5
Ensemble (threshold)    26.8    32.0     24.2    29.4
Ensemble (answer)       28.4    33.4     25.4    30.5
ReadTop                 32.5    36.7     28.1    32.4
Table 6: End-to-end QA performance of all models on the development and test sets of TriviaQA-NoP. For the threshold ensemble, we use the best threshold found on the development set.

Examining end-to-end performance, DocQN again outperforms DQN, and DocQN is now also better than Doc-Tf-Idf. This suggests that when Doc-Tf-Idf selects a paragraph with the answer, the answer is often difficult to extract with the RC model. Again, the ensembles lead to a large increase in performance, showing that Doc-Tf-Idf and DocQN are complementary. However, ReadTop, which consumes the first tokens of each document and thus far more text than DocQN, substantially outperforms all navigation models. This shows that the bias for answers at the beginning of documents is still strong in TriviaQA-NoP.

To further elucidate the differences between navigation models, Figure 7 shows navigation accuracy of different models and the proportion of samples for different node indices of the FAO. We see that DQN outperforms DocQN when the answer is at the top of the document, but DocQN dominates DQN when the answer is further down, showing that DocQN learns to find answers deeper in the document. Doc-Tf-Idf has a more balanced navigation accuracy across the document, which explains why an ensemble of Doc-Tf-Idf with DocQN works, as they are complementary to one another.


Figure 6 shows a navigation example, which includes the navigation step, node index, observation, and the action taken. For the observation we also highlight the attention distribution from the self-attention component. In this figure the question is about culture, and we see the agent going into multiple sections, reading them, running the RC model and continuing forward, until finally stopping at a paragraph that contains the answer. We provide many more examples in the supplementary material.

Table 7 highlights some differences between DQN and DocQN. DocQN has longer trajectories, and stops at sentence nodes more frequently, suggesting it reads the document at a finer granularity. Additionally, it leverages the RC model to collect more information and confidence during navigation, by choosing the action Answer more frequently. Both models consume less than 7% of the entire document. Figure 8 illustrates the navigation stopping point, and shows that DocQN navigates deeper into the document.

                      DQN                       DocQN
Path length           avg. 7.7, range [1,36]    avg. 15.2, range [3,100]
Answer predictions    avg. 2.8                  avg. 4.1
Tokens consumed       3.4%                      6.2%
Stopping node
  title               0.5%                      0%
  headline            2.2%                      0.4%
  paragraph           89.0%                     75.1%
  sentence            8.3%                      24.5%
Table 7: Comparing navigation properties of DQN and DocQN on the development set: the average and range of navigation path length in steps, the average number of answer predictions, the relative amount of consumed tokens, and the distribution of stopping node types.

6 Related Work

Handling the challenges of reasoning over multiple long documents has recently gained momentum [Shen et al.2017]. As mentioned, some approaches use IR for reducing the amount of processed text [Chen et al.2017, Clark and Gardner2017], while others use cheap or parallelizable models to handle long documents [Hewlett et al.2017, Swayamdipta et al.2018, Wang et al.2018a]. Searching for answers while using a trained RC model as a black-box was also implemented recently in the evidence aggregation work of Wang et al. (2018), for open-domain questions and multiple short evidence texts from the Web. Another thrust has focused on skimming text in a sequential manner [Yu et al.2017], or designing recurrent architectures that can consume text quickly [Bradbury et al.2017, Seo et al.2018, Campos et al.2018, Yu et al.2018]. However, to the best of our knowledge no work has previously applied these methods to long documents such as Wikipedia pages.

In this work we use TriviaQA-NoP for evaluation of our navigation-based approach and comparison to an IR baseline. While there are various aspects to consider in such an evaluation setup, our choice of data was driven mainly by the requirement for long and structured context. Recently, several new datasets such as WikiHop and NarrativeQA were published. These datasets address the tendency of RC models to match local context patterns, and are designed for multi-step reasoning [Welbl et al.2017, Wadhwa et al.2018, Kočisky et al.2017].

Our work is also related to several papers which model an agent that navigates in an environment to find objects in an image [Ba et al.2015], relations in a knowledge-base [Das et al.2018], or documents on the web [Nogueira and Cho2016].

7 Conclusions

We investigate whether document structure can be leveraged to train an agent that finds answers to questions in long documents while reading only a small fraction of each document. We show that an agent that reads 6% of the document can improve QA performance compared to an IR method that utilizes the entire document, and that ensembling the two substantially improves performance. We also present DocQN, an algorithm that promotes better exploration of the document, and show it outperforms DQN qualitatively and quantitatively.

Our approach represents a conceptual departure from previous methods for reading long documents, as it interleaves searching for an answer in the document with extracting the answer from a particular paragraph, which we show improves both navigation and QA performance. We expect that as RC models tackle longer documents that require reasoning over text spread across multiple parts of the document, models that can efficiently navigate and collect evidence will become increasingly crucial. Our agent provides a first step in this important research direction.


We thank Eunsol Choi from Washington University for helpful discussions and comments on the paper. This research was supported by the Yandex Initiative in Machine Learning and by The Israel Science Foundation grant 942/16. This work was completed in partial fulfillment for the Ph.D. degree of the first author.


  • [Ba et al.2015] J. Ba, V. Mnih, and K. Kavukcuoglu. 2015. Multiple object recognition with visual attention. In International Conference on Learning Representations (ICLR).
  • [Bradbury et al.2017] J. Bradbury, S. Merity, C. Xiong, and R. Socher. 2017. Quasi-recurrent neural networks. In International Conference on Learning Representations (ICLR).
  • [Campos et al.2018] V. Campos, B. Jou, X. Giro i Nieto, J. Torres, and S. Chang. 2018. Skip RNN: Learning to skip state updates in recurrent neural networks. In International Conference on Learning Representations (ICLR).
  • [Chen et al.2016] D. Chen, J. Bolton, and C. D. Manning. 2016. A thorough examination of the CNN / Daily Mail reading comprehension task. In Association for Computational Linguistics (ACL).
  • [Chen et al.2017] D. Chen, A. Fisch, J. Weston, and A. Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Association for Computational Linguistics (ACL).
  • [Cheng et al.2016] J. Cheng, L. Dong, and M. Lapata. 2016. Long short-term memory-networks for machine reading. In Empirical Methods in Natural Language Processing (EMNLP).
  • [Choi et al.2017] E. Choi, D. Hewlett, A. Lacoste, I. Polosukhin, J. Uszkoreit, and J. Berant. 2017. Coarse-to-fine question answering for long documents. In Association for Computational Linguistics (ACL).
  • [Clark and Gardner2017] C. Clark and M. Gardner. 2017. Simple and effective multi-paragraph reading comprehension. arXiv preprint arXiv:1710.10723.
  • [Das et al.2018] R. Das, S. Dhuliawala, M. Zaheer, L. Vilnis, I. Durugkar, A. Krishnamurthy, A. Smola, and A. McCallum. 2018. Go for a walk and arrive at the answer: Reasoning over paths in knowledge bases using reinforcement learning. In International Conference on Learning Representations (ICLR).
  • [Hermann et al.2015] K. M. Hermann, T. Kočiský, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems (NIPS).
  • [Hessel et al.2017] M. Hessel, J. Modayil, H. V. Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver. 2017. Rainbow: Combining improvements in deep reinforcement learning. arXiv preprint arXiv:1710.02298.
  • [Hewlett et al.2016] D. Hewlett, A. Lacoste, L. Jones, I. Polosukhin, A. Fandrianto, J. Han, M. Kelcey, and D. Berthelot. 2016. Wikireading: A novel large-scale language understanding task over Wikipedia. In Association for Computational Linguistics (ACL).
  • [Hewlett et al.2017] D. Hewlett, L. Jones, A. Lacoste, et al. 2017. Accurate supervised and semi-supervised machine reading for long documents. In Empirical Methods in Natural Language Processing (EMNLP), pages 2011–2020.
  • [Hochreiter and Schmidhuber1997] S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
  • [Joshi et al.2017] M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Association for Computational Linguistics (ACL).
  • [Kadlec et al.2016] R. Kadlec, M. Schmid, O. Bajgar, and J. Kleindienst. 2016. Text understanding with the attention sum reader network. In Association for Computational Linguistics (ACL).
  • [Kočisky et al.2017] T. Kočisky, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, and E. Grefenstette. 2017. The NarrativeQA reading comprehension challenge. arXiv preprint arXiv:1712.07040.
  • [Lao et al.2015] N. Lao, E. Minkov, and W. Cohen. 2015. Learning relational features with backward random walks. In Association for Computational Linguistics (ACL).
  • [Lee et al.2017] K. Lee, S. Salant, T. Kwiatkowski, A. Parikh, D. Das, and J. Berant. 2017. Learning recurrent span representations for extractive question answering. arXiv.
  • [Mnih et al.2015] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.
  • [Nogueira and Cho2016] R. Nogueira and K. Cho. 2016. End-to-end goal-driven web navigation. In Advances in Neural Information Processing Systems (NIPS).
  • [Onishi et al.2016] T. Onishi, H. Wang, M. Bansal, K. Gimpel, and D. McAllester. 2016. Who did what: A large-scale person-centered cloze dataset. In Empirical Methods in Natural Language Processing (EMNLP).
  • [Rajpurkar et al.2016] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Empirical Methods in Natural Language Processing (EMNLP).
  • [Salant and Berant2018] S. Salant and J. Berant. 2018. Contextualized word representations for reading comprehension. In North American Association for Computational Linguistics (NAACL).
  • [Schaul et al.2015] T. Schaul, J. Quan, I. Antonoglou, and D. Silver. 2015. Prioritized experience replay. In International Conference on Learning Representations (ICLR).
  • [Seo et al.2018] M. Seo, S. Min, A. Farhadi, and H. Hajishirzi. 2018. Neural speed reading via skim-RNN. In International Conference on Learning Representations (ICLR).
  • [Shen et al.2017] Y. Shen, P. Huang, J. Gao, and W. Chen. 2017. Reasonet: Learning to stop reading in machine comprehension. In International Conference on Knowledge Discovery and Data Mining (KDD).
  • [Swayamdipta et al.2018] S. Swayamdipta, A. P. Parikh, and T. Kwiatkowski. 2018. Multi-mention learning for reading comprehension with neural cascades. In International Conference on Learning Representations (ICLR).
  • [Trischler et al.2017] A. Trischler, T. Wang, X. Yuan, J. Harris, A. Sordoni, P. Bachman, and K. Suleman. 2017. NewsQA: A machine comprehension dataset. In Workshop on Representation Learning for NLP.
  • [van Hasselt et al.2016] H. van Hasselt, A. Guez, and D. Silver. 2016. Deep reinforcement learning with double Q-learning. In Association for the Advancement of Artificial Intelligence (AAAI), volume 16, pages 2094–2100.
  • [Wadhwa et al.2018] S. Wadhwa, V. Embar, M. Grabmair, and E. Nyberg. 2018. Towards inference-oriented reading comprehension: Parallelqa. In Workshop on New Forms of Generalization in Deep Learning and Natural Language Processing at NAACL.
  • [Wang et al.2016] Z. Wang, T. Schaul, M. Hessel, H. V. Hasselt, M. Lanctot, and N. D. Freitas. 2016. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning (ICML).
  • [Wang et al.2018a] S. Wang, M. Yu, X. Guo, Z. Wang, T. Klinger, W. Zhang, S. Chang, G. Tesauro, B. Zhou, and J. Jiang. 2018a. R3: Reinforced ranker-reader for open-domain question answering. In Association for the Advancement of Artificial Intelligence (AAAI).
  • [Wang et al.2018b] S. Wang, M. Yu, J. Jiang, W. Zhang, X. Guo, S. Chang, Z. Wang, T. Klinger, G. Tesauro, and M. Campbell. 2018b. Evidence aggregation for answer re-ranking in open-domain question answering. In International Conference on Learning Representations (ICLR).
  • [Watanabe et al.2017] Y. Watanabe, B. Dhingra, and R. Salakhutdinov. 2017. Question answering from unstructured text by retrieval and comprehension. arXiv preprint arXiv:1703.08885.
  • [Welbl et al.2017] J. Welbl, P. Stenetorp, and S. Riedel. 2017. Constructing datasets for multi-hop reading comprehension across documents. arXiv preprint arXiv:1710.06481.
  • [Xiong et al.2017] C. Xiong, V. Zhong, and R. Socher. 2017. Dynamic coattention networks for question answering. In International Conference on Learning Representations (ICLR).
  • [Yu et al.2017] A. W. Yu, H. Lee, and Q. V. Le. 2017. Learning to skim text. In Association for Computational Linguistics (ACL).
  • [Yu et al.2018] K. Yu, Y. Liu, A. G. Schwing, and J. Peng. 2018. Fast and accurate text classification: Skimming, rereading and early stopping. In International Conference on Learning Representations (ICLR).

Appendix A Supplementary Material

A.1 Network Architecture Details

Here, we elaborate on the network architecture, which was briefly described in the paper. An input state consists of question tokens, observation tokens, and answer prediction tokens, and each of these three token sequences is encoded as described below.

Word-level embedding

For every question token, every observation token, and every answer prediction token, we create an embedding in two steps. First, we embed the token with a pre-trained GloVe word embedding matrix.

Next, we apply character-level embeddings with a learned embedding matrix. The embedded characters are then summarized with a convolutional neural network followed by max-pooling.

Concatenation of the two components yields the final word-level embeddings.
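The two-step word-level embedding can be illustrated with a minimal sketch in plain Python. The function names, the toy lookup tables, and the filter values below are hypothetical and chosen only for illustration. The trained model would use GloVe vectors and learned character embeddings and filters, and details such as padding and non-linearities are assumptions here.

```python
def char_cnn_max_pool(char_vecs, filters):
    """Slide each filter over the character vectors and keep its maximum response.

    char_vecs: non-empty list of character embedding vectors for one word.
    filters:   list of filters; each filter is `width` weight rows of size `dim`.
    """
    width = len(filters[0])
    dim = len(char_vecs[0])
    # Zero-pad short words so that at least one full window exists (an assumption).
    padded = char_vecs + [[0.0] * dim] * max(0, width - len(char_vecs))
    pooled = []
    for filt in filters:
        responses = [
            sum(w * x
                for w_row, x_row in zip(filt, padded[start:start + width])
                for w, x in zip(w_row, x_row))
            for start in range(len(padded) - width + 1)
        ]
        pooled.append(max(responses))  # max-pooling over window positions
    return pooled


def embed_word(word, word_table, char_table, filters):
    """Concatenate the word vector with the pooled character-level features."""
    word_vec = word_table[word]
    char_vecs = [char_table[c] for c in word]
    return word_vec + char_cnn_max_pool(char_vecs, filters)


# Toy tables standing in for GloVe vectors and learned parameters.
word_table = {"cat": [0.1, 0.2, 0.3]}
char_table = {"c": [1.0, 0.0], "a": [0.0, 1.0], "t": [1.0, 1.0]}
filters = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.5, 0.5]],
]
embedding = embed_word("cat", word_table, char_table, filters)
```

The resulting vector has one entry per word-embedding dimension plus one entry per convolution filter, so the character-level part has a fixed size regardless of word length.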

Question sequence encoding

The question is encoded with a BiLSTM, where at every time step the forward and backward outputs are concatenated.

The outputs are then summarized with self-attention, as a weighted sum of the outputs. The coefficients are obtained by feeding the outputs to a two-layer feed-forward network and normalizing the output logits with a softmax.
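The self-attention summary can be sketched as follows, assuming the encoder outputs are already available as plain lists of floats. The helper names, the tanh activation in the scorer, and the toy weights are assumptions made for illustration and are not taken from the paper.

```python
import math


def softmax(logits):
    """Normalize logits into weights that sum to 1 (shifted for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_two_layer_scorer(w1, b1, w2, b2):
    """Return a function mapping an output vector to a scalar logit.

    A tanh hidden layer followed by a linear output; the activation is an assumption.
    """
    def score(h):
        hidden = [math.tanh(sum(w * x for w, x in zip(row, h)) + b)
                  for row, b in zip(w1, b1)]
        return sum(w * x for w, x in zip(w2, hidden)) + b2
    return score


def attention_pool(outputs, score):
    """Summarize a sequence of output vectors as a softmax-weighted sum."""
    alphas = softmax([score(h) for h in outputs])
    dim = len(outputs[0])
    summary = [sum(a * h[i] for a, h in zip(alphas, outputs)) for i in range(dim)]
    return summary, alphas


# Toy usage: three 2-dimensional encoder outputs and hypothetical scorer weights.
outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scorer = make_two_layer_scorer(w1=[[0.5, -0.5], [1.0, 1.0]], b1=[0.0, 0.0],
                               w2=[1.0, 1.0], b2=0.0)
summary, alphas = attention_pool(outputs, scorer)
```

The returned coefficients are also what a visualization would use to highlight tokens, since each coefficient measures how much its position contributes to the summary.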

Observation sequence encoding

The observation sequence encoding is obtained in an analogous manner to , except that we use a rather than a .

Answer prediction encoding

The answer prediction encoding is obtained by running the answer prediction tokens through an LSTM, and concatenating the last hidden state with the input feature vector.

State representation

The final state representation is formed by concatenating the encoded question with the encoded observation.

The input node features are concatenated to an upper layer, as we describe next, to increase their weight in the final predictions. We found that incorporating these features in this way accelerates navigation learning.

Q-values prediction

We implement a Dueling DQN architecture [Wang et al.2016], where the final Q-values are composed of a state value prediction and advantage predictions for every possible action. Denoting by FFNN a single-layer feed-forward neural network, the predictions are derived from the state representation, with the navigation features concatenated as additional input. The final Q-values prediction is obtained by combining the state value with the advantage predictions, using the average advantage over actions.
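For reference, the standard dueling aggregation of [Wang et al.2016] computes Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a'). The sketch below assumes that this mean-subtraction form is the averaging meant here. The function name and the example numbers are hypothetical.

```python
def dueling_q_values(state_value, advantages):
    """Combine a scalar state value with per-action advantages.

    Subtracting the mean advantage keeps the decomposition identifiable:
    the Q-values then average exactly to the state value.
    """
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + a - mean_advantage for a in advantages]


# Toy usage with three actions.
q_values = dueling_q_values(1.0, [0.5, -0.5, 0.0])
best_action = max(range(len(q_values)), key=q_values.__getitem__)
```

Because the same constant is subtracted from every action, the ranking of actions is determined by the advantages alone, while the state value sets the overall scale.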

A.2 Hyper Parameters

Table 8 summarizes the hyper-parameters used for building and training the DocQN models.

| Group | Parameter | Value |
|---|---|---|
| Network | Maximum node observation length | 20 tokens |
| | Maximum observation length | 120 tokens |
| | Word embedding dimension | 300 |
| | Character embedding dimension | 20 |
| | Convolution filter size | 5 |
| | BiLSTM and LSTM hidden dimension | 300 |
| | First feed-forward layer dimension | 512 |
| | Second feed-forward layer dimension | 256 |
| | Dropout rate | 0.2 |
| Training | RMSprop learning rate | 0.0001 |
| | Batch size | 64 |
| | Target network period | 10K steps |
| | Initial memory size | 50K transitions |
| | Maximal memory size | 300K transitions |
| | Action sampling | |
| | Discount factor | |
| | Prioritization usage | 0.6 |
| | Prioritization importance sampling | |
| Sampling | State sampling | |
| | Annealing steps for | M steps |
| | Sampling repetitions | 5 |
| | Maximum navigation length (train) | 30 steps |
| | Maximum navigation length (test) | 100 steps |
| | Interpolation coefficient for | 0.5 |
Table 8: DocQN hyper-parameters
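To make the prioritization entry concrete: in prioritized experience replay [Schaul et al.2015], a transition with priority p_i is drawn with probability proportional to p_i raised to an exponent alpha, where alpha = 0 gives uniform sampling. The sketch below assumes that the "Prioritization usage" value of 0.6 is this exponent. The function names are hypothetical, and the importance-sampling correction is omitted.

```python
import random


def sampling_probabilities(priorities, alpha=0.6):
    """Turn positive transition priorities into replay sampling probabilities."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def sample_indices(priorities, batch_size, alpha=0.6, rng=random):
    """Draw a batch of memory indices (with replacement) according to the priorities."""
    probs = sampling_probabilities(priorities, alpha)
    return rng.choices(range(len(priorities)), weights=probs, k=batch_size)


# Toy usage: three stored transitions and a batch of 64, as in Table 8.
priorities = [1.0, 2.0, 4.0]
probs = sampling_probabilities(priorities)
batch = sample_indices(priorities, batch_size=64, rng=random.Random(0))
```

With an exponent below 1, high-priority transitions are still favored, but less sharply than their raw priorities would suggest.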

A.3 Navigation Examples

Figures 9–13 show sample navigations performed by the DocQN model. Each example lists the question, the answer aliases, and the answer node numbers. The observation tokens are highlighted according to the self-attention weights given by the model. By choosing the action ANS, the agent executes the RC model to obtain an answer prediction, which is part of the observation in the next step.

Figure 9: Navigation example 1. Although the document is more general than the question subject and the answer appears multiple times across the document, the agent finds the correct context for answering the question.
Figure 10: Navigation example 2. The agent explores several sections before reaching the most relevant one.
Figure 11: Navigation example 3. The agent quickly navigates to the answer, without running the RC model.
Figure 12: Navigation example 4. The agent leverages the RC model and decides to stop after observing enough information to answer.
Figure 13: Navigation example 5. After reading through the “Television episodes” section, the agent goes back to the “Production” section to find the answer.