DocQN
Author implementation of "Learning to Search in Long Documents Using Document Structure" (Mor Geva and Jonathan Berant, 2018)
Reading comprehension models are based on recurrent neural networks that sequentially process the document tokens. As interest turns to answering more complex questions over longer documents, sequential reading of large portions of text becomes a substantial bottleneck. Inspired by how humans use document structure, we propose a novel framework for reading comprehension. We represent documents as trees, and model an agent that learns to interleave quick navigation through the document tree with more expensive answer extraction. To encourage exploration of the document tree, we propose a new algorithm, based on Deep Q-Network (DQN), which strategically samples tree nodes at training time. Empirically we find our algorithm improves question answering performance compared to DQN and a strong information-retrieval (IR) baseline, and that ensembling our model with the IR baseline results in further gains in performance.
This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/
Reading comprehension (RC), the task of reading documents and answering questions about their content, has attracted immense attention recently. While early work focused on simple questions and short paragraphs [Hermann et al.2015, Rajpurkar et al.2016, Trischler et al.2017, Onishi et al.2016], current work is shifting towards more complex questions that require reasoning over long documents [Joshi et al.2017, Hewlett et al.2016, Welbl et al.2017, Kočisky et al.2017].
Long documents pose a challenge for current RC models, as they are dominated by recurrent neural networks (RNNs) [Chen et al.2016, Kadlec et al.2016, Xiong et al.2017]. RNNs process documents token-by-token, and thus using them for long documents is prohibitive. A common solution is to retrieve part of the document with an IR approach [Chen et al.2017, Clark and Gardner2017] or a cheap model [Watanabe et al.2017], and run an RNN over the retrieved excerpts. However, as documents become longer and questions become complex, two problems emerge: (a) retrieving all the necessary text with a one-shot IR method when performing complex reasoning becomes harder, and thus thousands of tokens are retrieved [Clark and Gardner2017]. (b) Running even a cheap model over the document in its entirety becomes expensive [Choi et al.2017].
Humans, in lieu of a mental inverted index, use document structure to guide their search for answers. E.g., the answer to “What high school did Leonard Cohen go to?” is likely to appear in “Early life”, while the answer to “How hot is it in Melbourne in July?” is likely to appear in “Climate”. In this work we investigate whether we can train a model to navigate through a document using its structure and find the answer, while reading only a small portion of the entire document.
We represent documents as trees and train an agent that navigates through the document tree until returning a final answer. Figure 1 illustrates this process. Our agent reads the question “The vacation destinations of Pattaya and Phuket are in which country?” and starts navigation at the title of the document. After reading a paragraph, and skipping “History”, it drills down to “Geography” until finally halting at a paragraph that specifies the answer (“Thailand”). The agent observes at each step only a glimpse of the local text to determine its next action, which can be movement to a tree node, answering the question with a more expensive RC model, or terminating navigation. Thus, the agent consumes only a small fraction of the entire document.
Our training data is question-document-answer triplets, without gold navigation paths, and thus we train our model with the Deep Q-Network (DQN) algorithm. Because the dataset is biased towards answers appearing at the beginning of the document, the algorithm tends to stop early and does not explore the document well. To overcome this challenge, we propose DocQN: a variant of DQN for tree navigation that improves exploration by sampling nodes from multiple parts of the tree.
We evaluate the ability of our agent to navigate to paragraphs containing the answer on a variant of TriviaQA [Joshi et al.2017] and find that: (a) DocQN navigates better than DQN in documents both quantitatively and qualitatively. (b) While DocQN observes only 6% of the document tokens, it outperforms an IR method in end-to-end QA performance. (c) An ensemble of DocQN and IR substantially improves both navigation and end-to-end QA performance over the ensemble components.
To summarize, in this paper we ask: can an agent use document structure and learn to find answers for complex questions in long documents? We propose a new model and training algorithm that overcomes an inherent bias in the data, answering the aforementioned question in the affirmative. Our code and dataset are available at https://github.com/mega002/DocQN.
We work in the traditional RC setup, where we are given question-document-answer triplets as a training set, and aim to learn a function that finds the answer for an unseen question-document pair. Unlike prior work, we assume documents are trees, where every tree node corresponds to a structural element and is labeled with text. Specifically, the root is labeled with the document title, sections and subsections are labeled by their title, and paragraphs and sentences are labeled by the text they contain. In addition, we order all non-sentence tree nodes by a pre-order traversal (which corresponds to the linear order of text in the document), and take a node's position in this order as its index. Sentence nodes take the index of their parent (a paragraph).
Figure 1 shows an example tree, where for each node we show the relevant structural element and index (sentence nodes are not shown in the figure).
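To make this representation concrete, here is a minimal Python sketch of a document tree with pre-order node indices; the class and function names are illustrative and not taken from the released code:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DocNode:
    """A structural element of a document: title, (sub)section, paragraph, or sentence."""
    kind: str                                   # "title" | "section" | "paragraph" | "sentence"
    text: str                                   # the label text of this node
    children: List["DocNode"] = field(default_factory=list)
    parent: Optional["DocNode"] = None
    index: int = -1                             # pre-order index; sentences inherit their parent's index

def assign_indices(root: DocNode) -> None:
    """Order all non-sentence nodes by pre-order traversal; sentences take their parent's index."""
    counter = 0
    def visit(node: DocNode) -> None:
        nonlocal counter
        if node.kind == "sentence":
            node.index = node.parent.index
        else:
            node.index = counter
            counter += 1
        for child in node.children:
            child.parent = node
            visit(child)
    visit(root)
```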
With this document representation, answering questions can be viewed as a Markov Decision Process (MDP): in each state the agent is located at a particular tree node, actions allow movement through the document tree, answering the question with an RC model, or stopping, and the reward is based on whether the agent locates a node that contains the answer.
Our agent interleaves actions that navigate in the document with an action that runs an RC model at a certain document position and extracts an answer. Thus, the agent can decide to continue navigation after extracting a certain answer. This is strictly more expressive than existing approaches that combine IR with RC, where some text is retrieved exactly once before applying an RC model. As RC shifts to reasoning over complex questions, navigating and reading multiple parts of the document will become necessary. We show in Section 5 that this approach indeed improves QA performance.
To test our framework, we capitalize on the recently released TriviaQA dataset, which contains a large number of question-answer pairs, along with a small set of documents that (in almost all cases) contain the answer. TriviaQA is suitable for our purposes as it is a large-scale dataset, where questions are relatively complex and documents are fairly long. The dataset includes only raw text, and thus for every evidence document we built a tree representation by extracting the HTML metadata from the corresponding Wikipedia page and constructing the document structure from it.
Because our goal is to investigate whether a model can learn to search through a document, it is important that a non-negligible fraction of the questions require navigation through the document. However, in Wikipedia each document starts with a preface that summarizes the document, and thus often contains the answer. Consequently, a model that ignores the question and document and always stops in the first paragraph is likely to obtain good performance. Figure 2 shows the distribution of the first answer occurrence (FAO) in a document in TriviaQA over the node indices (x-axis), where all tree nodes except sentences are considered (see Section 2). We find that in most question-document pairs the FAO is in the first few paragraphs, and that in 60% of the cases it is in the preface section.
To alleviate this heavy bias, we derive a new dataset, termed TriviaQA-NoP, in which we remove the preface section from every document. After removing the preface, 2,144 out of 77,582 (2.8%) questions and 6,124 out of 138,537 (4.4%) question-document pairs are left without an answer and are removed. To further reduce the number of cases where an answer cannot be inferred from a document, we drop question-document pairs where: (a) the answer appears only in titles; (b) the answer is a single character; (c) the FAO node index exceeds a fixed threshold (in most such cases the answer is an item in a list). Finally, the dataset includes 91.6% of the questions and 88.4% of the question-document pairs from the original dataset. We provide full statistics on the dataset in Tables 1 and 2.
To verify that questions remain answerable in TriviaQA-NoP after removing the preface, we perform manual analysis on a random sample of 278 questions and 475 documents from the training set (Table 3). We find that the proportion of answerable questions and question-document pairs remains high and is reduced by less than 3% in comparison to TriviaQA. This demonstrates that indeed, for most questions and documents, the context necessary for answering the question also appears in the document body and not only in the preface.
Figure 2 shows the FAO node index distribution in TriviaQA-NoP. We observe, compared to TriviaQA, that the first occurrence of an answer is much more spread out across the document, and that the median increases from 3 to 14, which will require more navigation from the agent. However, even in TriviaQA-NoP answers tend to appear at the beginning of the document, because document content is usually organized by importance. This bias results in an exploration challenge for our training algorithm, which we will address in Section 4.
In this section, we describe a model for the agent and a training algorithm based on DQN [Mnih et al.2015]. Specifically, we introduce a tree-sampling strategy, which addresses the exploration challenge stemming from the bias towards answers early in the document.
We represent the MDP as a tuple $(\mathcal{S}, \mathcal{A}, R, T)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $R$ is a reward function, and $T$ is a deterministic transition function. Our model implements an action-value function $Q(s, a)$, which takes a state $s$ and returns a value for every action $a \in \mathcal{A}$. This function defines the greedy policy $\pi(s) = \arg\max_{a} Q(s, a)$. We now describe the state space, actions and reward function.
Table 4: An example observation and the navigation features.

| | Example |
|---|---|
| observation | "Phuket Province Name" |
| height | 2 |
| depth | 1 |
| h_dist_start | 0 |
| h_dist_end | 2 |
| parent.h_dist_start | 0 |
| parent.h_dist_end | 0 |
| navigation step | 1 |
Given a tree node, a state is a tuple consisting of the question, an observation, an answer prediction, and navigation and answer prediction features. An observation is a sequence of tokens produced by recursively concatenating a fixed-length prefix of the text in the node's label to the observation of the node's parent. An answer prediction is the sequence of tokens extracted by an RC model, if an RC model was already run at the current node (and a null token otherwise). The answer prediction features describe the distribution over answer spans produced by the RC model, which reflects its confidence: the entropy of the distribution, the logit value of the predicted span, and the number of tokens in the prediction. Navigation features provide information on the relative location of the node in the document. An example observation and the full list of navigation features are given in Table 4. Note that the state does not depend on the history of visited tree nodes (except for the navigation step feature, which can be approximated by the shortest path from the root to any node). While incorporating history could be beneficial, a memory-less model enables us to explore tree-sampling strategies, which is important for training (Section 4.2).
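As an illustration, the following sketch (reusing the DocNode class above) shows one plausible way to build the observation and navigation features; the exact feature semantics are our reading of Table 4 and should be treated as assumptions:

```python
MAX_NODE_TOKENS = 20   # per-node prefix length; 20 matches the hyper-parameter table, but is illustrative here

def observation(node):
    """Recursively concatenate a fixed-length prefix of each label on the path root -> node."""
    if node is None:
        return []
    return observation(node.parent) + node.text.split()[:MAX_NODE_TOKENS]

def depth(node):
    return 0 if node.parent is None else 1 + depth(node.parent)

def subtree_height(node):
    return 0 if not node.children else 1 + max(subtree_height(c) for c in node.children)

def horizontal_dists(node):
    """(number of siblings before the node, number of siblings after it)."""
    siblings = node.parent.children if node.parent else [node]
    i = siblings.index(node)
    return i, len(siblings) - 1 - i

def navigation_features(node, step):
    """Assumed semantics for the navigation features listed in Table 4."""
    h_start, h_end = horizontal_dists(node)
    p_start, p_end = horizontal_dists(node.parent) if node.parent else (0, 0)
    return {"height": subtree_height(node), "depth": depth(node),
            "h_dist_start": h_start, "h_dist_end": h_end,
            "parent.h_dist_start": p_start, "parent.h_dist_end": p_end,
            "navigation step": step}
```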
We define the following set of actions. Let $p$ be a node with an ordered list of children $c_1, \dots, c_m$, and let $c_i$ be a child of $p$. We define five movement actions (Figure 3): Down moves from $p$ to its first child $c_1$, Right moves from $c_i$ to $c_{i+1}$, and Left moves from $c_i$ to $c_{i-1}$. Because moving upwards reaches a node we already visited, we define UpR, which moves from $c_i$ to the node to the right of $p$, and UpL, which moves from $c_i$ to the node to the left of $p$. If an illegal action is chosen (e.g., Left from a first child), the agent stays in its current position. The action Answer returns an answer (and a distribution over spans) by running an RC model on the current node, unless it is a sentence, in which case the model is run on the paragraph containing it. After Answer, the agent can resume navigation. The action Stop also returns the answer given the current node, but terminates navigation.
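A sketch of the action set and the deterministic movement transitions over the tree (the precise targets of UpR and UpL are our reading of the description above):

```python
from enum import Enum, auto

class Action(Enum):
    DOWN = auto()
    RIGHT = auto()
    LEFT = auto()
    UP_RIGHT = auto()
    UP_LEFT = auto()
    ANSWER = auto()
    STOP = auto()

def sibling(node, offset):
    """Sibling at the given offset, or None if it does not exist."""
    if node.parent is None:
        return None
    sibs = node.parent.children
    j = sibs.index(node) + offset
    return sibs[j] if 0 <= j < len(sibs) else None

def move(node, action):
    """Movement transitions; an illegal move keeps the agent in place."""
    if action == Action.DOWN:
        target = node.children[0] if node.children else None
    elif action == Action.RIGHT:
        target = sibling(node, +1)
    elif action == Action.LEFT:
        target = sibling(node, -1)
    elif action == Action.UP_RIGHT:   # skip the already-visited parent (assumed semantics)
        target = sibling(node.parent, +1) if node.parent else None
    elif action == Action.UP_LEFT:
        target = sibling(node.parent, -1) if node.parent else None
    else:
        target = node                 # Answer and Stop do not move the agent
    return target if target is not None else node
```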
Our goal is to develop an agent that can navigate in the document, and thus we define the reward based on whether the agent stops at a node that contains the gold answer text (this is noisy, because the answer might appear there spuriously). While a simple reward would be an indicator for whether the agent stopped at a correct node, such a reward would not capture the proximity of the agent to the answer. Moreover, we would like to take the overall document length into account, rewarding successful navigation in long documents. Therefore, we define a reward with the following structure: when stopping, the reward is proportional to the distance from the agent's node to the closest tree node containing the answer (measured over node indices, as defined above), normalized by the document length. An additional reward is given if navigation is successful, and a small penalty is given for every other action, to encourage shorter trajectories. We further penalize the Answer action to discourage frequent usage of the RC model.
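The exact reward formula did not survive the text extraction; the following is one plausible reconstruction that matches the verbal description (distance-normalized stop reward, a bonus for successful navigation, and per-step and per-Answer penalties), and should be read as our assumption rather than the authors' exact definition:

```latex
r(s, a) =
\begin{cases}
  b \cdot \mathbb{1}\big[d(v) = d(v^*)\big] - \dfrac{|d(v) - d(v^*)|}{N} & \text{if } a = \textsc{Stop} \\
  -c_{\text{ans}} & \text{if } a = \textsc{Answer} \\
  -c_{\text{step}} & \text{otherwise}
\end{cases}
```

Here $v$ is the agent's node, $v^*$ the closest node containing the answer, $d(\cdot)$ the node index, $N$ the number of nodes in the document, $b$ a bonus for stopping exactly at an answer node, and $c_{\text{ans}} > c_{\text{step}} > 0$ small penalties.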
Training the navigation model is based on DQN [Mnih et al.2015]. In DQN, at every step an agent at state $s_t$ selects an action $a_t$ with an $\epsilon$-greedy policy, given the current action-value function $Q(s, a; \theta)$ parameterized by $\theta$. The agent observes a reward $r_t$ and a state $s_{t+1}$, and adds the transition $(s_t, a_t, r_t, s_{t+1})$ to a replay memory buffer $\mathcal{D}$ that holds a large number of recent transitions. The parameters $\theta$ are then optimized so that the action-value function better matches the observed rewards. This is done by sampling a batch of random transitions from $\mathcal{D}$ and minimizing the regression loss

$$L(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \Big[ \big( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \big)^2 \Big] \qquad (1)$$

where $\theta^{-}$ are the parameters of a target network, a periodic copy of $\theta$ that is not optimized, and $\gamma$ is a discount factor.
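A minimal PyTorch sketch of this update, folding in the Double Q-learning variant mentioned below; the tensor names and shapes are illustrative and not taken from the released implementation:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """One regression step of Eq. (1) on a batch of sampled transitions."""
    s, a, r, s_next, done = batch                           # states, actions, rewards, next states, terminal flags
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s, a; theta)
    with torch.no_grad():
        # Double Q-learning: the online network selects the action, the target network evaluates it
        a_next = q_net(s_next).argmax(dim=1, keepdim=True)
        q_next = target_net(s_next).gather(1, a_next).squeeze(1)
        target = r + gamma * (1.0 - done) * q_next          # no bootstrapping after Stop
    return F.mse_loss(q_sa, target)
```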
We also add some of the recent enhancements to DQN [Hessel et al.2017], which have proved useful in our setup. Specifically, we implement Double Q-Learning [van Hasselt et al.2016], Prioritized Experience Replay [Schaul et al.2015], and Dueling Networks [Wang et al.2016].
The DQN algorithm consists of episodes, where in each episode the agent is placed at an initial state $s_1$, from which it starts taking actions. In our setup, this state corresponds to the root of the document tree. Because the data is biased towards answers appearing at the beginning of the document, the agent learns that stopping early improves the reward and gets stuck in a local minimum, where it ceases to explore. Examining Figures 2 and 8, we observe that DQN learns to stop very early in the document compared to the FAO node index distribution of TriviaQA-NoP, amplifying the bias in the data.
To address this issue, we suggest DocQN, a variant of DQN aimed at increasing exploration when navigating in a document structure. DocQN capitalizes on two properties. First, DQN is an off-policy algorithm that trains from stored transitions. Second, our model is memory-less, and thus we can sample any tree node and compute the corresponding state. Therefore, we modify DQN: instead of initializing every episode at the root state and performing a sequence of actions, we sample states from distributions that explore the document better. By exploring transitions from across the document, the model learns from more distant parts of the document.
Illustration of the different state distributions for a specific document tree. Nodes with a darker color have a higher sampling probability, and nodes marked in red are paragraph nodes containing the answer.
Algorithm 1 provides the details of DocQN. The algorithm initializes the replay memory buffer, the model and target network parameters, and chooses a distribution over tree nodes. At the beginning of every episode, with some probability (line 4), which is annealed during training, we sample state transitions using this distribution and store them in the replay buffer; with the complementary probability (line 10), we start at the initial state and sample a trajectory as in DQN. Parameters are updated as in DQN, by sampling transitions from the replay buffer and minimizing Equation 1. If the chosen distribution samples nodes from various locations in the document, the algorithm will explore better and will not get stuck at the beginning of the document. We consider the following instantiations of the sampling distribution:
Uniform sampling: uniform over nodes, except that we discourage sampling sentences: with some fixed probability we uniformly sample a leaf (sentence), and otherwise we uniformly sample an inner node.
Backward sampling: we uniformly sample a paragraph node that contains the gold answer, then uniformly sample a number of movement actions, and perform that many random movement actions from the sampled paragraph to output a node. This results in a node that is "close" to the answer, and can be viewed as similar to bi-directional search or backward search [Lao et al.2015]. (A sketch of both sampling strategies is given below.)
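The two strategies can be sketched in Python as follows, reusing the tree and movement-action sketches above; the probability of sampling a sentence and the maximum number of backward steps were not preserved in the text, so the values below are placeholders:

```python
import random

def all_nodes(root):
    """Collect every node of the document tree."""
    stack, nodes = [root], []
    while stack:
        n = stack.pop()
        nodes.append(n)
        stack.extend(n.children)
    return nodes

def sample_uniform(root, p_sentence=0.1):
    """Uniform over nodes, with sentences down-weighted (p_sentence is a placeholder value)."""
    nodes = all_nodes(root)
    sentences = [n for n in nodes if n.kind == "sentence"]
    inner = [n for n in nodes if n.kind != "sentence"]
    pool = sentences if (sentences and random.random() < p_sentence) else inner
    return random.choice(pool)

def sample_backward(answer_paragraphs, max_steps=10):
    """Start from a paragraph containing the gold answer and take a few random movement actions."""
    node = random.choice(answer_paragraphs)
    moves = [Action.DOWN, Action.RIGHT, Action.LEFT, Action.UP_RIGHT, Action.UP_LEFT]
    for _ in range(random.randint(0, max_steps)):
        node = move(node, random.choice(moves))
    return node
```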
Figure 4 shows the distribution over tree nodes for a specific document tree where the answer appears in two paragraph nodes (ignoring sentence nodes). We see that sequential sampling (as in DQN) puts most of the probability mass close to the root, uniform sampling is uniform (across paragraphs, as sentence nodes are omitted), and backward sampling is concentrated close to answer nodes. This illustrates how different distributions result in different exploration strategies.
We briefly describe the neural architecture (Figure 5) and provide full details in the supplementary material. As explained in Section 4.1, the input to the network is the state, which comprises the question tokens, observation tokens, answer prediction tokens, and the navigation and answer prediction features. The question, observation and answer prediction tokens are encoded with pre-trained word embeddings and trained character embeddings, where character embeddings are followed by a convolutional layer with max pooling, yielding a single vector per token. Each token is then represented by concatenating its word embedding with its character embedding.

Question tokens and observation tokens are then fed into a BiLSTM and an LSTM [Hochreiter and Schmidhuber1997], respectively, and the outputs are compressed to a single vector through self-attention [Cheng et al.2016], resulting in one vector for the question and one for the observation. Answer prediction tokens are fed into an LSTM, whose last hidden state is concatenated to the answer prediction features, creating a third vector for the answer prediction.

We concatenate these three vectors and pass them through a one-layer feed-forward network that then branches into two networks according to the Dueling DQN architecture [Wang et al.2016]. In each branch we also concatenate the navigation features. One branch predicts the value of the state, and the other branch predicts the advantage of every possible action. The output of the network is computed as in Dueling DQN [Wang et al.2016].
Our experimental evaluation aims to answer the following questions: (a) Can document structure be used to learn to navigate to an answer in a document? (b) How does DocQN compare to DQN? (c) How does DocQN compare to IR methods that observe the entire document?
We evaluate on TriviaQA-NoP. Because our focus is on the navigation ability of the agent, we train a single RC model and fix it in all experiments. Specifically, we download RaSoR [Lee et al.2017, Salant and Berant2018] (https://github.com/shimisalant/RaSoR) and exactly follow the procedure described by the authors of TriviaQA for training an RC model, i.e., we train RaSoR on a fixed-length prefix of each document in TriviaQA-NoP. As a sanity check for our RC model, we also train and evaluate RaSoR on the original TriviaQA dataset. Indeed, RaSoR obtains EM and F1 scores that are substantially higher than the baseline reported in [Joshi et al.2017].
We use two metrics: First, we measure navigation accuracy, i.e., for a question-document pair, whether a method returns text containing a gold answer (if the agent stops at a sentence node we evaluate the encompassing paragraph). Because questions in TriviaQA-NoP often have more than one evidence document, we also measure aggregated navigation accuracy, where we give credit if the agent navigated correctly in any of the documents. This gives performance assuming an oracle that always chooses the best document for the question. Because the test set in TriviaQA is hidden, we evaluated navigation accuracy on the development set only.
In addition, we measure end-to-end QA performance with the official Exact Match (EM) and F1 metrics. The EM metric measures the percentage of predictions that match exactly any of the answer aliases, and the F1 metric measures the average overlap between the prediction and the answer. To aggregate evidence from multiple documents, we follow [Joshi et al.2017] and define the score of an answer to be the sum of probabilities for that answer across all documents.
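For concreteness, the cross-document aggregation can be sketched as follows; the normalization of answer strings is our simplification:

```python
from collections import defaultdict

def aggregate_answers(per_document_predictions):
    """per_document_predictions: (answer_string, probability) pairs, one per evidence document.
    The score of an answer is the sum of its probabilities across all documents."""
    scores = defaultdict(float)
    for answer, prob in per_document_predictions:
        scores[answer.strip().lower()] += prob
    return max(scores, key=scores.get)
```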
We compare the following models:
DocQN: This is our main model, where episodes are initialized with the tree-sampling distribution described in Section 4.
DQN: DQN algorithm without state sampling.
DocQN-Coupled / DQN-Coupled: A less expressive version of each model, where the actions Answer and Stop are coupled, i.e., the agent can use the RC model exactly once and then stops. This is similar to a setup where retrieval is performed once and is not interleaved with navigation.
RandomWalk: An agent that selects an action at each step uniformly at random.
RandomPara: An agent that randomly selects a non-sentence tree node.
Doc-Tf-Idf: An IR baseline, where a paragraph is selected based on its tf-idf score. We implement the tf-idf scheme of DocQA [Clark and Gardner2017], a high-performing system on TriviaQA. Here, idf is computed for each paragraph in the context of the current document, and the paragraph with the highest cosine similarity to the question is chosen. Note that Doc-Tf-Idf processes the entire document (up to tens of thousands of tokens), while DocQN processes only a small fraction.
Tf-Idf: Vanilla tf-idf, where the idf score is computed from all documents in TriviaQA-NoP.
Ensemble: We ensemble DocQN with Doc-Tf-Idf in two ways: (1) for finding the final answer, we aggregate scores as described above, that is, we sum the probabilities from both models over all documents and choose the answer with the highest score; (2) for navigation, we tune a threshold on the development set and choose between the DocQN prediction and the Doc-Tf-Idf prediction according to the index of the node where DocQN stopped. (A sketch of both strategies is given after this list.)
ReadTop: Following [Joshi et al.2017], a model that runs the RC model on the first tokens of each document. Note that running the RC model on the text retrieved by DocQN involves consuming far fewer tokens on average.
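A sketch of the two ensembling strategies from the Ensemble entry above, reusing aggregate_answers from the earlier sketch; the direction of the node-index comparison is an assumption, since only the existence of a tuned threshold is stated:

```python
def ensemble_answer(docqn_preds, tfidf_preds):
    """Answer-level ensembling: sum answer probabilities from both models over all documents."""
    return aggregate_answers(docqn_preds + tfidf_preds)

def ensemble_navigation(docqn_node, tfidf_paragraph, threshold):
    """Navigation-level ensembling: keep the DocQN prediction when it stopped early in the
    document (assumed direction), and fall back to Doc-Tf-Idf otherwise."""
    return docqn_node if docqn_node.index <= threshold else tfidf_paragraph
```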
We report the value of all hyper-parameters in the supplementary material (all fixed without tuning).
Tables 5 and 6 show the results of our experiments. Focusing on navigation accuracy, we see that randomly walking or randomly choosing a paragraph yields low performance. Vanilla Tf-Idf performs considerably better than the random baselines, but is outperformed by all other models. Comparing DQN to DocQN, we see that DocQN outperforms DQN. Allowing DQN and DocQN to run the RC model during navigation improves their performance, but the accuracy gain is substantially higher for DocQN (2.3% accuracy and 3.1% aggregated accuracy) than for DQN (0.4% accuracy and 0.7% aggregated accuracy). Doc-Tf-Idf, which has access to the entire document, outperforms DocQN, which consumes on average 6% of the entire document; nonetheless, the two models obtain the same aggregated accuracy. The good performance of Doc-Tf-Idf shows that in TriviaQA-NoP answer paragraphs share a lot of lexical material with the question. Importantly, an ensemble of Doc-Tf-Idf with DocQN substantially improves overall performance, reaching an accuracy of 35% and an aggregated accuracy of 48%. This is in sharp contrast to an ensemble with DQN, where for any value of the threshold, Doc-Tf-Idf performs better on its own.
Examining end-to-end performance, DocQN again outperforms DQN, but now DocQN is also better than Doc-Tf-Idf. This suggests that when Doc-Tf-Idf selects a paragraph with the answer, the answer is often difficult to extract with the RC model. Again, the ensemble leads to a dramatic increase in performance, showing that Doc-Tf-Idf and DocQN are complementary. However, ReadTop, which consumes many more tokens per document than DocQN, substantially outperforms all navigation models. This shows that the bias towards answers at the beginning of documents is still strong in TriviaQA-NoP.
To further elucidate the differences between navigation models, Figure 7 shows navigation accuracy of different models and the proportion of samples for different node indices of the FAO. We see that DQN outperforms DocQN when the answer is at the top of the document, but DocQN dominates DQN when the answer is further down, showing that DocQN learns to find answers deeper in the document. Doc-Tf-Idf has a more balanced navigation accuracy across the document, which explains why an ensemble of Doc-Tf-Idf with DocQN works, as they are complementary to one another.
Figure 6 shows a navigation example, which includes the navigation step, node index, observation, and the action taken. For the observation we also highlight the attention distribution from the self-attention component. In this example the question is about culture, and we see the agent going into multiple sections, reading them, running the RC model and continuing forward, until finally stopping at a paragraph that contains the answer. We provide many more examples in the supplementary material.
Table 7 highlights some differences between DQN and DocQN. DocQN has longer trajectories, and stops at sentence nodes more frequently, suggesting it reads the document at a finer granularity. Additionally, it leverages the RC model to collect more information and confidence during navigation, by choosing the action ANSWER more frequently. Both models consume less than 7% of the entire document. Figure 8 illustrates the navigation stopping point, and shows that DocQN navigates deeper into the document.
Table 7: Navigation behavior of DQN and DocQN.

| | DQN | DocQN |
|---|---|---|
| Path length | avg. 7.7, range [1,36] | avg. 15.2, range [3,100] |
| Answer predictions | avg. 2.8 | avg. 4.1 |
| Tokens consumed | 3.4% | 6.2% |
| Stopping node | 0.5% title, 2.2% headline, 89.0% paragraph, 8.3% sentence | 0% title, 0.4% headline, 75.1% paragraph, 24.5% sentence |
Handling the challenges of reasoning over multiple long documents has recently been gaining momentum [Shen et al.2017]. As mentioned, some approaches use IR to reduce the amount of processed text [Chen et al.2017, Clark and Gardner2017], while others use cheap or parallelizable models to handle long documents [Hewlett et al.2017, Swayamdipta et al.2018, Wang et al.2018a]. Searching for answers while using a trained RC model as a black box was also explored recently by Wang et al. [2018], for open-domain questions and multiple short evidence texts from the Web. Another thrust has focused on skimming text in a sequential manner [Yu et al.2017], or designing recurrent architectures that can consume text quickly [Bradbury et al.2017, Seo et al.2018, Campos et al.2018, Yu et al.2018]. However, to the best of our knowledge, no work has previously applied these methods to long documents such as Wikipedia pages.
In this work we use TriviaQA-NoP to evaluate our navigation-based approach and compare it to an IR baseline. While there are various aspects to consider in such an evaluation setup, our choice of data was driven mainly by the requirement for long and structured context. Recently, several new datasets, such as WikiHop and NarrativeQA, were published [Welbl et al.2017, Wadhwa et al.2018, Kočisky et al.2017]. These datasets address the tendency of RC models to match local context patterns, and are designed for multi-step reasoning.
Our work is also related to several papers which model an agent that navigates in an environment to find objects in an image [Ba et al.2015], relations in a knowledge-base [Das et al.2018], or documents on the web [Nogueira and Cho2016].
We investigate whether document structure can be leveraged to train an agent that finds answers to questions in long documents while reading only a small fraction of them. We show that an agent that reads 6% of the document can improve QA performance compared to an IR method that utilizes the entire document, and that ensembling the two improves performance further. We also present DocQN, an algorithm that promotes better exploration of the document, and show that it outperforms DQN both quantitatively and qualitatively.
Our approach represents a conceptual departure from previous methods for reading long documents, as it interleaves searching for an answer in the document with extracting the answer from a particular paragraph, which we show improves both navigation and QA performance. We expect that as RC models tackle longer documents that require reasoning and reading text that is spread in multiple parts of the document, models that can efficiently navigate and collect evidence will become more and more crucial. Our agent provides a first step in this important research direction.
We thank Eunsol Choi from Washington University for helpful discussions and comments on the paper. This research was supported by the Yandex Initiative in Machine Learning and by The Israel Science Foundation grant 942/16. This work was completed in partial fulfillment for the Ph.D degree of the first author.
Here, we elaborate on the network architecture, which was briefly described in the paper. Given an input state, we consider its question tokens, observation tokens and answer prediction tokens.
For every question, observation and answer prediction token $w$, we create an embedding in two steps. First, we embed the token with a pre-trained GloVe word embedding matrix, yielding a word-level vector $e^{word}_w$. Next, we apply character-level embeddings with a learned embedding matrix, and the embedded characters are summarized with a max-pooling convolutional neural network, yielding $e^{char}_w = \mathrm{maxpool}\big(\mathrm{CNN}(w_1, \dots, w_{|w|})\big)$, where $w_1, \dots, w_{|w|}$ are the embedded characters of $w$. Concatenating the two components yields the final word-level embedding $e_w = [e^{word}_w ; e^{char}_w]$.
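A hedged PyTorch sketch of this token embedder; the embedding dimensions and filter size follow the hyper-parameter table, while the number of convolution filters and the use of ReLU are assumptions:

```python
import torch
import torch.nn as nn

class TokenEmbedder(nn.Module):
    """Pre-trained word vectors concatenated with a max-pooled character CNN."""
    def __init__(self, glove_weights, n_chars, char_dim=20, filter_size=5):
        super().__init__()
        self.word_emb = nn.Embedding.from_pretrained(glove_weights, freeze=True)
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.conv = nn.Conv1d(char_dim, char_dim, kernel_size=filter_size, padding=filter_size // 2)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq_len); char_ids: (batch, seq_len, word_len)
        w = self.word_emb(word_ids)                                  # (B, T, word_dim)
        B, T, L = char_ids.shape
        c = self.char_emb(char_ids.view(B * T, L)).transpose(1, 2)   # (B*T, char_dim, L)
        c = torch.relu(self.conv(c)).max(dim=2).values               # max pooling over characters
        return torch.cat([w, c.view(B, T, -1)], dim=-1)              # e_w = [word ; char]
```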
The question is encoded with a BiLSTM, where for every timestamp $t$ the forward and backward outputs are concatenated: $h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$. The outputs are then summarized with self-attention into a single question vector $\bar{q} = \sum_t \alpha_t h_t$, where the coefficients $\alpha_t$ are obtained by feeding the outputs to a two-layer feed-forward network and normalizing the output logits with a softmax.
The observation sequence encoding is obtained in an analogous manner, except that we use an LSTM rather than a BiLSTM.
The answer prediction encoding is obtained by running the answer prediction tokens through an LSTM and concatenating its last hidden state with the answer prediction feature vector.
The final state representation is formed by concatenating the encoded question, the encoded observation, and the answer prediction encoding.
The input node (navigation) features are concatenated to an upper layer, as we describe next, to increase their weight in the final predictions. We have found that incorporating these features in this way accelerates the learning of navigation.
We implement a Dueling DQN architecture, where the final Q-values are composed of a state-value prediction $V(s)$ and advantage predictions $A(s, a)$ for every possible action $a$. Denoting by FFNN a single-layer feed-forward neural network, each branch feeds the state representation through an FFNN, concatenates the navigation features, and applies another FFNN to produce $V(s)$ in one branch and $A(s, a)$ in the other. The final Q-value prediction is obtained by subtracting the average advantage: $Q(s, a) = V(s) + A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a')$.
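A hedged PyTorch sketch of the dueling head; the layer sizes and the exact point at which the navigation features enter are assumptions consistent with the description above:

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Value and advantage branches over the state representation plus navigation features,
    combined by subtracting the mean advantage."""
    def __init__(self, state_dim, nav_dim, n_actions, hidden=512):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden + nav_dim, 1)
        self.advantage = nn.Linear(hidden + nav_dim, n_actions)

    def forward(self, state_vec, nav_feats):
        h = torch.cat([self.shared(state_vec), nav_feats], dim=-1)
        v = self.value(h)                                    # V(s)
        a = self.advantage(h)                                # A(s, a)
        return v + a - a.mean(dim=-1, keepdim=True)          # Q(s, a)
```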
Table 8 summarizes the hyper-parameters used for building and training the DocQN models.
| | Parameter | Value |
|---|---|---|
| Network | Maximum node observation length | 20 tokens |
| | Maximum observation length | 120 tokens |
| | Word embedding dimension | 300 |
| | Character embedding dimension | 20 |
| | Convolution filter size | 5 |
| | BiLSTM and LSTM hidden dimension | 300 |
| | First feed-forward layer dimension | 512 |
| | Second feed-forward layer dimension | 256 |
| | Dropout rate | 0.2 |
| Training | RMSprop learning rate | 0.0001 |
| | Batch size | 64 |
| | Target network period | 10K steps |
| | Initial memory size | 50K transitions |
| | Maximal memory size | 300K transitions |
| | Action sampling | |
| | Discount factor | |
| | Prioritization usage | 0.6 |
| | Prioritization importance sampling | |
| Sampling | State sampling | |
| | Annealing steps | M steps |
| | Sampling repetitions | 5 |
| | Maximum navigation length (train) | 30 steps |
| | Maximum navigation length (test) | 100 steps |
| | Interpolation coefficient | 0.5 |
Figures 9, 10, 11, 12 and 13 show sample navigations performed by the DocQN model. In each example, we show the question, the answer aliases, and the answer node indices. The observation tokens are highlighted according to the self-attention weights given by the model. By choosing the action ANS (Answer), the agent executes the RC model to obtain an answer prediction, which becomes part of the observation in the next step.