Reinforcement learning refers to learning strategies for sequential decision-making tasks, where a system takes actions in a particular state with the goal of maximizing a long-term reward. Recently, several tasks that involve states and actions described by natural language have been studied, such as text-based games (Narasimhan et al., 2015; He et al., 2016a), web navigation (Nogueira and Cho, 2016), information extraction (Narasimhan et al., 2016), Reddit popularity prediction and tracking (He et al., 2016b), and human-computer dialogue systems (Wen et al., 2016; Li et al., 2016). Some of these studies make no use of external or world knowledge, while others (such as information extraction and task-oriented dialogue systems) directly interact with an (often static) database.
External knowledge – both general and domain-specific – has been shown to be useful in many natural language tasks, such as in question answering (Yang et al., 2003; Katz et al., 2005; Lin, 2002), information extraction (Agichtein and Gravano, 2000; Etzioni et al., 2011; Wu and Weld, 2010), computer games (Branavan et al., 2012), and dialog systems (Ammicht et al., 1999; Yan et al., 2016). However, in reinforcement learning, incorporating external knowledge is relatively rare, mainly due to the domain-specific nature of reinforcement learning tasks, e.g. Atari games (Mnih et al., 2015) and the game of Go (Silver et al., 2016). Of particular interest in our work is external knowledge represented by unstructured text, such as news feeds, Wikipedia pages, search engine results, and manuals, as opposed to a structured knowledge base.
Our study is conducted on the task of Reddit popularity prediction proposed in He et al. (2016b), which is a sequential decision-making problem based on a large-scale real-world natural language data set. In this task, a specified number of discussion threads predicted to be popular are recommended, chosen from a fixed window of recent comments to track. The authors proposed a reinforcement learning solution in which the state is formed by the collection of comments in the threads being tracked, and actions correspond to selecting a subset of new comments to follow (sub-actions) from the set of recent contributions to the discussion. Since comments are potentially redundant (multiple respondents can have similar reactions), the study found that sub-actions were best evaluated in combination. The computational complexity of the combinatorial action space was sidestepped by randomly sampling a fixed number of candidates from the full action space. A major drawback of random sampling in this application is that popular comments are rare and easily missed.
We make two main contributions in this paper. The first is a novel architecture for incorporating unstructured external knowledge into reinforcement learning. More specifically, information from the original state is used to query the knowledge source (here, an evolving collection of documents corresponding to other online discussions about world events), and the state representation is augmented by the outcome of the query. Thus, the agent can use both the local context (reinforcement learning environment) and the global context (e.g. recent discussions about world news) when making decisions. Second, we propose to use a two-stage Q-learning framework that makes it feasible to explore the full combinatorial natural language action space. A first Q-function is used to efficiently generate a list of sub-optimal candidate actions, and a second more sophisticated Q-function reranks the list to pick the best action.
On Reddit, users reply to posts and other comments in a threaded (tree-structured) discussion. Comments (and posts) are associated with a karma score, a combination of positive and negative votes from registered users indicating the popularity of the comment. In prior work (He et al., 2016b), popularity prediction in Reddit discussions (comment recommendation) was proposed for studying reinforcement learning with a large-scale natural language action space. At each time step t, the agent receives a string of text that describes the state and several strings of text that describe the potential actions (new comments to consider). The agent attempts to pick the best action for the purpose of maximizing the long-term reward. In a real-time scenario, the final karma of a comment is not immediately available, so prediction of popularity is based on the text of the comment as well as the context of the discussion history. Moreover, a comment with low karma now may lead to more discussion and popular comments in the future. It is therefore natural to formulate this task as a reinforcement learning problem.
More specifically, the set of comments being tracked at time t is denoted C_t. The state, action, and immediate reward are defined as follows:
State: all previously tracked comments, as well as the post (root node of the tree), i.e. s_t = C_0 ∪ C_1 ∪ ⋯ ∪ C_t, where C_0 contains only the post.
Action: an action is taken when a total of N new comments c_t^1, …, c_t^N appear as nodes in the subtrees of C_t; the agent picks a subset of K comments to be tracked in the next time step, i.e. a_t = C_{t+1} ⊆ {c_t^1, …, c_t^N} with |C_{t+1}| = K.
Reward: r_{t+1} is the accumulated karma score of the comments in C_{t+1}. (The karma score is observed from an archived version of the discussion; it is not shown at the time the comment is posted, so it is not available to a real-time system.)
In this task, because an action corresponds to a set of K comments (sub-actions) chosen from a larger set of N candidates, the action space is combinatorial. The candidate set, and hence the action space, is also time-varying, reflecting the flow of the discussion along the chosen paths.
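The combinatorial blow-up is easy to make concrete. The sketch below (our illustration; the sizes N = 10 and K = 3 are arbitrary examples, not the paper's settings) enumerates the K-subsets that form the action set:

```python
from itertools import combinations
from math import comb

# Hypothetical sizes: N new comments per step, K comments to keep tracking.
N, K = 10, 3

candidates = [f"comment_{i}" for i in range(N)]

# Each action is an unordered set of K sub-actions (comments).
actions = list(combinations(candidates, K))

# The action space grows as C(N, K): 120 actions for N = 10, K = 3.
assert len(actions) == comb(N, K) == 120
```

For busier discussions and larger K, explicit enumeration of all C(N, K) actions quickly becomes too expensive, which motivates sampling or the two-stage search discussed later.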
The standard Q-learning defines a function Q(s, a) as the expected return starting from state s and taking action a:

Q(s, a) = E[ Σ_{l ≥ 0} γ^l r_{t+l+1} | s_t = s, a_t = a ],

where γ ∈ [0, 1] denotes a discount factor. The Q-function associated with an optimal policy can be found by the Q-learning recursion (Watkins and Dayan, 1992):

Q(s_t, a_t) ← Q(s_t, a_t) + η_t ( r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ),

where η_t is the learning rate of the algorithm. In He et al. (2016b), two deep Q-learning architectures are proposed, both with separate networks for the state and action spaces yielding embeddings h_s and h_a, respectively. Those embeddings are combined with a general interaction function g(h_s, h_a) to approximate the Q-values, Q(s, a) ≈ g(h_s, h_a), as in He et al. (2016a), where the approach of using separate networks for natural language state and action spaces is termed a Deep Reinforcement Relevance Network (DRRN).
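As a toy sketch of the DRRN idea (our own minimal illustration with random, untrained weights; the hidden dimension 20 and vocabulary size 5,000 match the experimental setup, everything else is assumed), two separate feed-forward networks embed the state and action texts, and an inner product serves as the interaction function g:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(dims):
    """Random-init weights for a small feed-forward net (illustrative only)."""
    return [rng.normal(0, 0.1, size=(d_in, d_out))
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def forward(ws, x):
    for w in ws:
        x = np.tanh(x @ w)
    return x

VOCAB = 5000                        # bag-of-words input dimension
state_net = mlp([VOCAB, 20, 20])    # separate network for the state text
action_net = mlp([VOCAB, 20, 20])   # separate network for the action text

def q_value(state_bow, action_bow):
    """DRRN-style Q: interaction (here an inner product) of the two embeddings."""
    h_s = forward(state_net, state_bow)
    h_a = forward(action_net, action_bow)
    return float(h_s @ h_a)

s = rng.random(VOCAB)   # stand-in bag-of-words state vector
a = rng.random(VOCAB)   # stand-in bag-of-words action vector
q = q_value(s, a)       # a scalar Q-value estimate
```

In the actual models the interaction function and the training loop are richer; this only shows the structural idea of separate state and action networks combined by g.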
3 Related Work
There has been increasing interest in applying deep reinforcement learning to a variety of problems, including tasks involving natural language. To control agents directly from high-dimensional sensory inputs, the Deep Q-Network (Mnih et al., 2015) has been proposed and has shown high capacity and scalability in handling a large state space. Another stream of work in recent deep learning research is the attention mechanism (Bahdanau et al., 2015; Sukhbaatar et al., 2015; Vinyals et al., 2015), where a probability distribution is computed in order to attend to certain parts of a collection of data. The attention mechanism has been shown to handle long sequences or large collections of data while remaining quite interpretable. The attention-based work closest to ours is the memory network (MemNN) (Weston et al., 2014; Sukhbaatar et al., 2015). Most work on MemNNs uses embeddings of a query and documents to compute the attention weights for memory slots. Here, we propose models that also use non-content-based features (time, popularity) for memory addressing. This helps retrieve content that provides complementary information to what is modeled in the query embedding vector. In addition, the content-based component of our query scheme uses TF-IDF-based semantic similarity, since the memory comprises a very large corpus of external documents, which makes end-to-end learning of attention features impractical.
Multiple studies have explored interacting with a database (or knowledge base) using reinforcement learning. Narasimhan et al. (2016) present a framework for acquiring and incorporating external evidence to improve extraction accuracy in domains where training data is scarce. In task-oriented human-computer dialogue, Wen et al. (2016) introduce a neural network-based trainable dialogue system with a database operator module. Dhingra et al. (2016) propose a dialogue agent that provides users with an entity from a knowledge base by interactively asking for its attributes. In question answering, knowledge representation and reasoning also play a central role (Ferrucci et al., 2010; Boyd-Graber et al., 2012). Our goal differs from these studies in that we do not directly optimize a domain-specific knowledge search; instead, we use external world knowledge to enrich the state representation in a reinforcement learning task.
Our task of tracking popular Reddit comments is somewhat related to the approach to multi-document summarization described in (Daumé III et al., 2009). One difference with respect to our problem is that the space of text available for selection evolves over time. In addition, in our case the agent has no access to an optimal policy, in contrast to the SEARN algorithm used in that work.
To address overestimation of action values, double Q-learning (Hasselt, 2010; Van Hasselt et al., 2015) has been proposed, leading to performance gains on several Atari games. Dulac-Arnold et al. (2016) present a policy architecture that works efficiently with a large number of actions. While a combinatorial action space can be large and discrete, this method does not directly apply in our case, because the possible actions change across states. Instead, we borrow the philosophy of double Q-learning and propose a two-stage Q-learning approach that reduces computational complexity by using a first Q-function to construct a quick but rough estimate over the combinatorial action space, and a second Q-function to rerank a set of candidate actions.
The work described in this paper improves over He et al. (2016b) by augmenting the state representation with external knowledge and by combining the two architectures they proposed in a two-stage Q-learning framework that enables exploration of the full action space. The first DRRN evaluates an action by treating sub-actions as independent and summing their contributions to the Q-value (Figure 1(a)); the second models potential redundancy among sub-actions by using a BiLSTM at the comment level (Figure 1(b)).
4 Incorporating External Knowledge into the State Representation
This approach is inspired by the observation that in a real-world decision-making process, it is usually beneficial to consider background knowledge. Here, we introduce a mechanism to incorporate external language knowledge into decision making. The intuition is that the agent maintains a memory space that helps with decision making: when a new state arrives, the agent consults this external knowledge and picks relevant resources to support the decision.
The architecture we propose is illustrated in Figure 2. Every time the agent reads the state information from the environment, it performs a lookup operation in external knowledge in its memory. This external knowledge could be a static knowledge base, or more generally it can be a dynamic database. In our experiments, the agent keeps an evolving collection of documents from the worldnews subreddit. We use an attention mechanism that produces a probability distribution over the entire external knowledge resource. This weight vector is computed by considering a set of features measuring the relevance between the current state and the “world knowledge” of the agent. More specifically, we consider the following three types of relevance:
Timing features: when users express their opinions on a website such as Reddit, they are likely referring to recent news events. We use two indicator features to represent whether a document from the external knowledge was posted within the past 24 hours, or within the past 7 days, relative to the time of the new state. We denote these features as x^day and x^week, respectively.
Semantic similarity: we use standard tf-idf (term-frequency inverse-document-frequency) weighting (Salton and McGill, 1986) and compute cosine similarity scores as a measure of semantic relevance between the current state and each document in the external knowledge. We denote this semantic similarity as x^sim.
Popularity: for Reddit posts/comments, we use the karma score as a measure of popularity; highly popular topics may occur more often in the environment. To compensate for the range differences among the relevance measures, we normalize the karma scores (detailed descriptions are given in Section 6) so that the feature values fall in the range (0, 1]. We denote this normalized popularity score as x^pop.
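The semantic-similarity feature could be computed, for instance, with off-the-shelf tf-idf utilities. The snippet below is a hypothetical sketch using scikit-learn (our library choice for illustration, not necessarily the paper's implementation; documents and state text are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical external-knowledge documents and a current state description.
documents = [
    "nasa plans a new mission to mars next year",
    "stock markets fall amid economic uncertainty",
    "scientists debate whether mars once had a thick atmosphere",
]
state = "could we create an atmosphere like earth has on mars"

vectorizer = TfidfVectorizer()
doc_vecs = vectorizer.fit_transform(documents)
state_vec = vectorizer.transform([state])

# One similarity feature per document; higher = more semantically relevant.
sims = cosine_similarity(state_vec, doc_vecs)[0]
```

Here the Mars-related documents receive higher similarity to the Mars-related state than the unrelated finance document, which is exactly the behavior the x^sim feature relies on.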
For each state, the agent extracts the above features for each document i in the external knowledge, forming a 4-dimensional feature vector x_i = [x_i^day, x_i^week, x_i^sim, x_i^pop]. The attention weights are then computed as a linear combination of the features followed by a softmax over the entire external knowledge:

α = Softmax(w · x_1, …, w · x_D),

where the softmax operates over the collection of D documents and α has dimension equal to the number of documents. Note that in our experimental setting, the softmax applies only to documents that exist before the new comments appear, which simulates a "real-time" dynamic external knowledge resource. The attention weights are then multiplied with the document embeddings e_i to form a vector representation (embedding) of "world" knowledge:

e^world = Σ_i α_i e_i.

The world embedding is concatenated with the original state embedding to enrich the representation of the environment.
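Putting the pieces together, the attention step can be sketched in a few lines of numpy (all sizes, weights, and embeddings below are random placeholders standing in for learned or extracted quantities):

```python
import numpy as np

rng = np.random.default_rng(1)

D, EMB = 6, 20  # number of external documents, embedding size (illustrative)

# Per-document relevance features: [past_day, past_week, tfidf_sim, popularity].
feats = np.stack([
    rng.integers(0, 2, D),   # x_day: posted within the past 24 hours?
    rng.integers(0, 2, D),   # x_week: posted within the past 7 days?
    rng.random(D),           # x_sim: tf-idf cosine similarity to the state
    rng.random(D),           # x_pop: normalized karma-based popularity
], axis=1)                   # shape (D, 4)

w = rng.normal(size=4)       # learnable feature weights (random for the sketch)

# Linear combination of features, softmax over the document collection.
logits = feats @ w
alpha = np.exp(logits - logits.max())
alpha /= alpha.sum()

doc_emb = rng.normal(size=(D, EMB))  # document embeddings
world_emb = alpha @ doc_emb          # attention-weighted "world" embedding

state_emb = rng.normal(size=EMB)
augmented_state = np.concatenate([state_emb, world_emb])  # enriched state
```

In a real-time setting, only documents already available at the current step would participate in the softmax, matching the dynamic knowledge collection described above.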
5 Two-Stage Q-learning for a Combinatorial Action Space
There are two challenges associated with a combinatorial action space. One is the development of a Q-function framework for estimating the long-term reward; this is addressed in He et al. (2016b). The other is the potentially high computational complexity of evaluating Q over every possible state–action pair (s_t, a_t). In deep Q-learning, most of the computation time is spent on the forward pass from actions to Q-values. Back-propagation complexity is not affected by the combinatorial action space, since we only need to back-propagate through the one action the agent has chosen.
One way to sidestep the computational complexity is to randomly pick a fixed number, say m, of candidate actions and perform the max operation over only those. While this is widely used in the reinforcement learning literature, it is problematic in our application because the large and highly skewed action space makes it likely that good actions are missed. Here we propose two-stage Q-learning to reduce the search complexity. More specifically, we can rewrite the max operation as:

max_{a ∈ A_t} Q_2(s_t, a) ≈ max_{a ∈ top-m(Q_1, A_t)} Q_2(s_t, a),

where top-m(Q_1, A_t) means picking the top-m actions, ranked by Q_1, from the whole action set A_t.
In the case of Q_1 being DRRN-Sum, we can rewrite Q_1 as:

Q_1(s_t, a_t) = Σ_{i=1}^{K} q(s_t, c_t^i),     (1)

which is simplified by precomputing the sub-action values q(s_t, c_t^i), i = 1, …, N. Here q is the simple DRRN introduced in He et al. (2016a).
To elaborate, the idea is to use a first Q-function Q_1 to perform a quick but rough ranking of the actions in A_t. The second Q-function Q_2, which can be more sophisticated, is used to rerank the top-m candidate actions. This is effectively a beam search with coarse-to-fine models and reranking. It ensures that all comments are explored, while the architecture remains sophisticated enough to capture detailed dependencies between sub-actions, such as information redundancy. In our experiments, we pick Q_1 to be DRRN-Sum and Q_2 to be DRRN-BiLSTM. While DRRN-Sum's independence assumption across sub-actions is too strong, the model is relatively easy to train. Since the parameters on the action side are tied across sub-actions, we can train a DRRN with K = 1 and then apply the model to each pair (s_t, c_t^i). This yields the sub-action Q-values q(s_t, c_t^i), so maximizing Equation 1 reduces to sorting N values. We thus avoid the huge computational cost of first generating all actions from sub-actions and then applying a general Q-function approximation to obtain Q_1. In Section 6, we train a DRRN (with K = 1) and then copy the parameters to DRRN-Sum, which can be used to evaluate the full action space. (The whole two-stage Q-learning framework is summarized in Algorithm 1 in the Appendix.)
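A compact sketch of the two-stage idea follows (toy numbers throughout; the redundancy penalty in q2 is a stand-in for what a reranker like DRRN-BiLSTM would learn, and for clarity we enumerate subsets here rather than exploiting the sorting shortcut described above):

```python
from itertools import combinations
from heapq import nlargest

# Hypothetical per-comment (sub-action) values from the first-stage Q-function.
sub_q = {"c1": 0.9, "c2": 0.1, "c3": 0.8, "c4": 0.3, "c5": 0.7}
K, m = 2, 3  # track K comments; keep top-m candidate actions for reranking

# Stage 1: score every K-subset by the sum of its sub-action values (cheap,
# additive, ignores interactions), and keep the m highest-scoring actions.
all_actions = combinations(sub_q, K)
candidates = nlargest(m, all_actions, key=lambda a: sum(sub_q[c] for c in a))

# Stage 2: rerank candidates with a more expensive Q-function that can model
# redundancy between sub-actions (a toy stand-in for a learned reranker).
def q2(action):
    base = sum(sub_q[c] for c in action)
    redundancy = 0.5 if set(action) == {"c1", "c3"} else 0.0  # toy penalty
    return base - redundancy

best = max(candidates, key=q2)
```

In this toy example the additive first stage ranks {c1, c3} highest, but the reranker penalizes their redundancy and selects {c1, c5} instead, which is exactly the kind of correction the second stage is meant to provide.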
6.1 Data set and preprocessing
We carry out experiments on the task of predicting popular discussion threads on Reddit, as proposed by He et al. (2016b). Specifically, we conduct experiments on data from 5 subreddits: askscience, askmen, todayilearned, askwomen, and politics, which cover diverse genres and topics. To ensure long enough discussion threads, we filter out discussion trees with fewer than 100 comments. For each of the 5 subreddits, we randomly partition 90% of the data for online training and 10% for testing. Our evaluation metric is the accumulated karma score. For each setting, we obtain the mean (average reward) and standard deviation (shown as error bars or numbers in brackets) from 5 independent runs, each over 10,000 episodes. In all our experiments, the number of new comments considered per step, N, is fixed. The basic subreddit statistics are shown in Table 1. We also report random-policy performance and oracle upper-bound performance. (Upper bounds are estimated by exhaustively searching each discussion tree for the maximum-karma discussion threads, with overlapping comments counted only once; this upper bound may not be attainable in a real-time setting.) For askscience and different K's, the upper-bound performance ranges from 1991.3 (K = 2) to 2298.0 (K = 5).
Table 1: Basic statistics for each subreddit: # posts (in thousands), # comments (in millions), random-policy reward, and oracle upper bound.
In preprocessing, we remove punctuation and lowercase all words. We use a bag-of-words representation for each state s_t and comment c_t^i in discussion tracking, and for each document in the external knowledge source. The vocabulary contains the 5,000 most frequent words, and the out-of-vocabulary rate is 7.1%. We use fully-connected feed-forward neural networks, with hidden layers of dimension 20, to compute the state, action, and document embeddings.
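The preprocessing described above can be sketched as follows (toy corpus, and a vocabulary of 4 words instead of 5,000):

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and strip punctuation, as in the preprocessing step."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()

corpus = ["This GAME is Great!", "the game of go", "Go is hard."]
counts = Counter(tok for doc in corpus for tok in tokenize(doc))

VOCAB_SIZE = 4  # the paper keeps the 5,000 most frequent words; 4 for this toy
vocab = {w: i for i, (w, _) in enumerate(counts.most_common(VOCAB_SIZE))}

def bow(text):
    """Bag-of-words count vector; out-of-vocabulary words are dropped."""
    vec = [0] * len(vocab)
    for tok in tokenize(text):
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec
```

Any word outside the most-frequent list (here, e.g. "a") simply contributes nothing to the vector, which is how the out-of-vocabulary rate arises.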
Our Q-learning agent uses ε-greedy exploration throughout online training and testing, with a constant discount factor γ. During training, we use experience replay (Lin, 1992) with a memory size of 10,000. For each experience replay, 500 episodes are generated, and experience tuples are stored in a first-in-first-out fashion. We use mini-batch stochastic gradient descent with a batch size of 100 and a constant learning rate. We train separate models for different subreddits.
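A minimal sketch of the first-in-first-out replay memory with uniform mini-batch sampling (sizes shrunk from 10,000 and 100 for illustration; the experience tuples are placeholders):

```python
import random
from collections import deque

MEMORY_SIZE = 5   # 10,000 in the experiments; small here for illustration
BATCH_SIZE = 2    # 100 in the experiments

replay = deque(maxlen=MEMORY_SIZE)  # first-in-first-out by construction

# Store hypothetical (state, action, reward, next_state) experience tuples.
for t in range(8):
    replay.append((f"s{t}", f"a{t}", float(t), f"s{t+1}"))

# Once the memory is full, the oldest tuples are evicted automatically.
# Sample a mini-batch uniformly for one stochastic gradient step.
batch = random.sample(list(replay), BATCH_SIZE)
```

Using a bounded deque keeps the FIFO eviction policy implicit, so the training loop only ever appends and samples.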
6.2 Incorporating external knowledge
We first study the effect of incorporating external knowledge, without considering the combinatorial action space. More specifically, we set K = 1 and use the simple DRRN, so each action picks a single comment from the new candidates to track. Our proposed method uses a state representation augmented with world knowledge, as illustrated in Figure 2.
We utilize the worldnews subreddit as our external knowledge source. This subreddit consists of 9.88k posts. We define each document in the world knowledge to be a post plus its top-5 comments ranked by karma score. The agent keeps a growing collection of documents: at each time t, the external knowledge contains the documents from worldnews that appeared before time t. To compute the popularity score of a document, we simply sum the karma scores of the post and its top-5 comments, and then normalize by dividing by the highest score in the external knowledge. (Unlike in Fang et al. (2016), the summed karma scores do not follow a Zipfian distribution, so we do not use quantization or any nonlinear transformation.) The popularity feature values used for computing attention thus fall in the range (0, 1].
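The popularity score can be computed as in this sketch (the karma values are invented for illustration):

```python
# Hypothetical worldnews documents: post karma plus top-5 comment karmas.
docs = [
    {"post_karma": 1200, "comment_karmas": [300, 150, 90, 40, 10]},
    {"post_karma": 80,   "comment_karmas": [20, 10, 5, 2, 1]},
]

# Popularity = karma of the post plus its top-5 comments, ...
raw = [d["post_karma"] + sum(d["comment_karmas"]) for d in docs]

# ... normalized by the highest score so values fall in (0, 1].
top = max(raw)
x_pop = [r / top for r in raw]
```

The most popular document always receives the value 1, and every other document a positive fraction of it.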
For comparison, we experiment with a baseline DRRN without any external knowledge. We also construct a baseline DRRN with hand-crafted rules for picking documents from external knowledge. Those rules include: i) documents within the past-day, ii) documents within the past-week, iii) 10 semantically most similar documents, iv) 10 most popular documents. We use a bag-of-words representation and construct the world embedding used to augment the state representation.
We compare multiple ways of incorporating external knowledge for different subreddits, and in Figure 3 we show performance gains over a baseline DRRN (without any external knowledge). The experimental results show that the DRRN using a learned attention mechanism to retrieve relevant knowledge outperforms all configurations of DRRNs with hand-crafted retrieval rules, and significantly outperforms the DRRN baseline that does not use external knowledge. We also observe that different relevance features have different impact across subreddits. For example, for askscience, past-day documents have higher impact than past-week documents, while for politics past-week documents are more important. The most-popular documents actually have a negative effect for todayilearned, mainly because those are the documents that are most popular over the entire history, while todayilearned discussions value information about recent events. (In principle, since we are concatenating the world embedding to obtain an augmented state representation, the result should not get worse; we hypothesize that the degradation is due to overfitting and the use of mismatched documents, as in the most-popular setting for todayilearned.) Nevertheless, the attention mechanism learns to rely on the proper features to retrieve useful knowledge for the needs of different domains.
6.3 Two-stage Q-learning for a combinatorial action space
In this subsection we study the effect of two-stage Q-learning, without considering external knowledge. We first train a DRRN (with K = 1) and copy its parameters to DRRN-Sum as Q_1. We then train Q_2 = DRRN-BiLSTM as before, except that we use Q_1 = DRRN-Sum to explore the whole action space and obtain the top-m candidate actions.
| | random + DRRN-BiLSTM | all + DRRN-Sum | two-stage (DRRN-Sum, DRRN-BiLSTM) |
|---|---|---|---|
| K=2 | 573.2 (12.9) | 663.3 (8.7) | 676.9 (5.5) |
| K=3 | 711.1 (8.7) | 793.1 (8.1) | 833.9 (5.7) |
| K=4 | 854.7 (16.0) | 964.5 (12.0) | 987.1 (12.1) |
| K=5 | 980.9 (21.1) | 1099.4 (15.9) | 1101.3 (13.8) |
On askscience, we try multiple settings with K ∈ {2, 3, 4, 5}; the results are shown in Table 2. We compare the proposed two-stage Q-learning with two single-stage Q-learning baselines. The first baseline, following the method in He et al. (2016b), uses random subsampling to obtain a fixed number of candidate actions and takes the max over them using DRRN-BiLSTM. The second baseline uses DRRN-Sum and explores the whole action space. The proposed two-stage Q-learning uses DRRN-Sum for picking candidate actions and DRRN-BiLSTM for reranking. We observe a large improvement when switching from "random" to "all", showing that exploring the entire action space is critical in this task. There is a consistent further gain from using two-stage Q-learning instead of single-stage Q-learning with DRRN-Sum, showing that a more sophisticated value function for reranking also helps performance.
| Candidates | Q-function | askscience | askmen | todayilearned | askwomen | politics |
|---|---|---|---|---|---|---|
| random | DRRN-BiLSTM | 711.1 (8.7) | 139.0 (3.6) | 606.9 (15.8) | 135.0 (1.3) | 177.9 (3.3) |
| all | DRRN-Sum | 793.1 (8.1) | 142.5 (2.3) | 679.4 (11.4) | 145.9 (2.4) | 180.6 (6.3) |
| DRRN-Sum | DRRN-BiLSTM | 833.9 (5.7) | 148.0 (5.5) | 697.9 (9.4) | 149.6 (3.3) | 204.7 (4.2) |
In Table 3, we compare two-stage Q-learning with the two baselines across different subreddits, with K = 3. The findings are consistent with those for askscience. Since different subreddits may have very different karma score distributions and language styles, these results suggest that the algorithm applies well across community interaction styles.
During testing, we compare the runtime of the DRRN-BiLSTM Q-function under the different candidate-selection strategies, simulating over 10,000 episodes. The search times for random selection and for the two-stage Q-function are similar, both nearly constant across settings, and the two-stage approach substantially reduces test runtime compared to exploring the whole action space. (Training DRRN-BiLSTM on the whole action space is intractable, so for that comparison we used a DRRN-BiLSTM model trained on the subsampled space; it achieves worse performance than two-stage Q-learning, probably due to the mismatch between training and testing.)
6.4 Combined results
| State (question) | Most-attended documents | Least-attended document |
|---|---|---|
| Would it be possible to artificially create an atmosphere like Earth has on Mars? | Ultimate Reality TV: A Crazy Plan for a Mars Colony - It might become the mother of all reality shows. Fully 704 candidates are soon to begin competing for a trip to Mars to establish a colony there. / 'Alien thigh bone' on Mars: Excitement from alien hunters at 'evidence' of extraterrestrial life. Mars likely never had enough oxygen in its atmosphere and elsewhere to support more complex organisms. / The Gaia (General Authority on Islamic Affairs) and the UAE (United Arab Emirates) have issued a fatwa on people living on Mars, due to the religious reasoning that there is no reason to be there. | North Korea's internet is offline; massive DDOS attack presumed. |
| Does our sun have any unique features compared to any other star? | Star Wars: Episode VII begins filming in UAE desert. This can't possibly be a modern Star Wars movie! I don't see a green screen in sight! Ya, it's more like Galaxy news. / African Pop Star turns white (and causes controversy) with new line of skin whitening cream. I would like to see an unshopped photo of her in natural lighting. / Dwarf planet discovery hints at a hidden Super Earth in solar system - The body, which orbits the sun at a greater distance than any other known object, may be shepherded by an unseen planet. | Hong Kong democracy movement hit by 2018. The vote has no standing in law; by attempting to sabotage it, the Chinese(?) are giving it legitimacy. |
In Figure 4, we present an ablation study of the effects of incorporating external knowledge and/or two-stage Q-learning across different subreddits. Each of the two proposed contributions improves reinforcement learning performance in a natural language scenario with a combinatorial action space, and combining the two approaches improves performance further. In our task, two-stage Q-learning provides the larger gain; however, in all cases, incorporating external knowledge gives a consistent additional gain on top of two-stage Q-learning.
We conduct case studies in Table 4, showing examples of the most- and least-attended documents in the external knowledge given the state description (the documents are shortened for brevity). In the first example, the state is a question about the atmosphere on Mars, and the most-attended documents are correctly related to Mars living conditions, from various sources and aspects. In the second example, the state concerns the sun's features compared with other stars. Interestingly, although the agent attends to the top documents due to some topic-word matching (e.g., sun, star), the picked documents reflect popularity more than topic relevance. The least-attended documents are totally irrelevant in both examples, as expected.
In this paper we introduce two approaches for improving natural language based decision making in a combinatorial action space. The first is to augment the state representation of the environment by incorporating external knowledge through a learnable attention mechanism. The second is to use a two-stage Q-learning framework for exploring the entire combinatorial action space, while avoiding enumeration of all possible action combinations. Our experimental results show that both proposed approaches improve the performance in the task of predicting popular Reddit threads.
- Agichtein and Gravano (2000) E. Agichtein and L. Gravano. 2000. Snowball: Extracting relations from large plain-text collections. In Proceedings of the fifth ACM conference on Digital libraries. ACM, pages 85–94.
- Ammicht et al. (1999) E. Ammicht, A. L Gorin, and T. Alonso. 1999. Knowledge collection for natural language spoken dialog systems. In EUROSPEECH.
- Bahdanau et al. (2015) D. Bahdanau, K. Cho, and Y. Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations.
- Boyd-Graber et al. (2012) J. Boyd-Graber, B. Satinoff, H. He, and H. Daumé III. 2012. Besting the quiz master: Crowdsourcing incremental classification games. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, pages 1290–1301.
- Branavan et al. (2012) S. Branavan, D. Silver, and R. Barzilay. 2012. Learning to win by reading manuals in a Monte-Carlo framework. Journal of Artificial Intelligence Research 43:661–704.
- Daumé III et al. (2009) H. Daumé III, J. Langford, and D. Marcu. 2009. Search-based structured prediction. Machine learning 75(3):297–325.
- Dhingra et al. (2016) B. Dhingra, L. Li, X. Li, J. Gao, Y-N Chen, F. Ahmed, and L. Deng. 2016. End-to-end reinforcement learning of dialogue agents for information access. arXiv preprint arXiv:1609.00777 .
- Dulac-Arnold et al. (2016) G. Dulac-Arnold, R. Evans, H. Van Hasselt, P. Sunehag, T. Lillicrap, and J. Hunt. 2016. Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679 .
- Etzioni et al. (2011) O. Etzioni, A. Fader, J. Christensen, S. Soderland, and M. Mausam. 2011. Open information extraction: The second generation. In IJCAI. volume 11, pages 3–10.
- Fang et al. (2016) H. Fang, H. Cheng, and M. Ostendorf. 2016. Learning latent local conversation modes for predicting community endorsement in online discussions. In Proc. Int. Workshop Natural Language Processing for Social Media. page 55.
- Ferrucci et al. (2010) D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A Kalyanpur, A. Lally, J W. Murdock, E. Nyberg, J. Prager, et al. 2010. Building watson: An overview of the DeepQA project. AI magazine 31(3):59–79.
- Hasselt (2010) Hado V. Hasselt. 2010. Double Q-learning. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23. Curran Associates, Inc., pages 2613–2621. http://papers.nips.cc/paper/3964-double-q-learning.pdf.
- He et al. (2016a) J. He, J. Chen, X. He, J. Gao, L. Li, L. Deng, and M. Ostendorf. 2016a. Deep reinforcement learning with a natural language action space. In Proc. Annu. Meeting Assoc. for Computational Linguistics (ACL).
- He et al. (2016b) J. He, M. Ostendorf, X. He, J. Chen, J. Gao, L. Li, and L. Deng. 2016b. Deep reinforcement learning with a combinatorial action space for predicting popular reddit threads. In Proc. of the 2016 Conference on Empirical Methods in Natural Language Processing. http://aclweb.org/anthology/D16-1189.
- Katz et al. (2005) B. Katz, G. Marton, G. C Borchardt, A. Brownell, S. Felshin, D. Loreto, J. Louis-Rosenberg, B. Lu, F. Mora, S. Stiller, et al. 2005. External knowledge sources for question answering. In TREC.
- Li et al. (2016) J. Li, W. Monroe, A. Ritter, D. Jurafsky, M. Galley, and J. Gao. 2016. Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, pages 1192–1202. https://aclweb.org/anthology/D16-1127.
- Lin (2002) J. J Lin. 2002. The web as a resource for question answering: Perspectives and challenges. In LREC.
- Lin (1992) L-J Lin. 1992. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning 8(3–4):293–321.
- Mnih et al. (2015) V. Mnih, K. Kavukcuoglu, D. Silver, A. A Rusu, J. Veness, M. G Bellemare, A. Graves, M. Riedmiller, A. K Fidjeland, G. Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529–533.
- Narasimhan et al. (2015) K. Narasimhan, T. Kulkarni, and R. Barzilay. 2015. Language understanding for text-based games using deep reinforcement learning. In Proc. of the 2015 Conference on Empirical Methods in Natural Language Processing. pages 1–11. http://aclweb.org/anthology/D15-1001.
- Narasimhan et al. (2016) K. Narasimhan, A. Yala, and R. Barzilay. 2016. Improving information extraction by acquiring external evidence with reinforcement learning. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, pages 2355–2365. https://aclweb.org/anthology/D16-1261.
- Nogueira and Cho (2016) R. Nogueira and K. Cho. 2016. End-to-end goal-driven web navigation. In Advances in Neural Information Processing Systems 29. pages 1903–1911.
- Salton and McGill (1986) G. Salton and M. J McGill. 1986. Introduction to modern information retrieval. McGraw-Hill, Inc.
- Silver et al. (2016) D. Silver, A. Huang, C. J Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489.
- Sukhbaatar et al. (2015) S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus. 2015. End-to-end memory networks. In Advances in neural information processing systems. pages 2440–2448.
- Van Hasselt et al. (2015) H. Van Hasselt, A. Guez, and D. Silver. 2015. Deep reinforcement learning with double Q-learning. CoRR, abs/1509.06461 .
- Vinyals et al. (2015) O. Vinyals, Ł. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. Hinton. 2015. Grammar as a foreign language. In Advances in Neural Information Processing Systems. pages 2773–2781.
- Watkins and Dayan (1992) C. JCH Watkins and P. Dayan. 1992. Q-learning. Machine learning 8(3-4):279–292.
- Wen et al. (2016) T.-H. Wen, M. Gasic, N. Mrksic, L. M Rojas-Barahona, P.-H. Su, S. Ultes, D. Vandyke, and S. Young. 2016. A network-based end-to-end trainable task-oriented dialogue system. arXiv preprint arXiv:1604.04562 .
- Weston et al. (2014) J. Weston, S. Chopra, and A. Bordes. 2014. Memory networks. arXiv preprint arXiv:1410.3916 .
- Wu and Weld (2010) F. Wu and D. S Weld. 2010. Open information extraction using wikipedia. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pages 118–127.
- Yan et al. (2016) Z. Yan, N. Duan, J. Bao, P. Chen, M. Zhou, Z. Li, and J. Zhou. 2016. Docchat: An information retrieval approach for chatbot engines using unstructured documents. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, pages 516–525. http://www.aclweb.org/anthology/P16-1049.
- Yang et al. (2003) H. Yang, T-S Chua, S. Wang, and C-K Koh. 2003. Structured use of external knowledge for event-based open domain question answering. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval. ACM, pages 33–40.
Appendix A Algorithm table for two-stage Q-learning
As shown in Algorithm 1.
Appendix B URLs for subreddits used in this paper
As shown in Table 5. All post ids will be released for future work on this task.