Reinforcement Learning with External Knowledge and Two-Stage Q-functions for Predicting Popular Reddit Threads

04/20/2017 · by Ji He, et al. · Microsoft · University of Washington

This paper addresses the problem of predicting the popularity of comments in an online discussion forum using reinforcement learning, focusing on two challenges that arise from having natural language state and action spaces. First, the state representation, which characterizes the history of comments tracked in a discussion at a particular point, is augmented to incorporate the global context represented by discussions on world events available in an external knowledge source. Second, a two-stage Q-learning framework is introduced, making it feasible to search the combinatorial action space while also accounting for redundancy among sub-actions. We experiment with five Reddit communities, showing that the two methods improve over previously reported results on this task.

1 Introduction

Reinforcement learning refers to learning strategies for sequential decision-making tasks, where a system takes actions at a particular state with the goal of maximizing a long-term reward. Recently, several tasks that involve states and actions described by natural language have been studied, such as text-based games (Narasimhan et al., 2015; He et al., 2016a), web navigation (Nogueira and Cho, 2016), information extraction (Narasimhan et al., 2016), Reddit popularity prediction and tracking (He et al., 2016b), and human-computer dialogue systems (Wen et al., 2016; Li et al., 2016). Some of these studies ignore the use of external knowledge or world knowledge, while others (such as information extraction and task-oriented dialogue systems) directly interact with an (often) static database.

External knowledge – both general and domain-specific – has been shown to be useful in many natural language tasks, such as in question answering (Yang et al., 2003; Katz et al., 2005; Lin, 2002), information extraction (Agichtein and Gravano, 2000; Etzioni et al., 2011; Wu and Weld, 2010), computer games (Branavan et al., 2012), and dialog systems (Ammicht et al., 1999; Yan et al., 2016). However, in reinforcement learning, incorporating external knowledge is relatively rare, mainly due to the domain-specific nature of reinforcement learning tasks, e.g. Atari games (Mnih et al., 2015) and the game of Go (Silver et al., 2016). Of particular interest in our work is external knowledge represented by unstructured text, such as news feeds, Wikipedia pages, search engine results, and manuals, as opposed to a structured knowledge base.

Our study is conducted on the task of Reddit popularity prediction proposed by He et al. (2016b), which is a sequential decision-making problem based on a large-scale real-world natural language data set. In this task, a specified number of discussion threads predicted to be popular are recommended, chosen from a fixed window of recent comments to track. The authors proposed a reinforcement learning solution in which the state is formed by the collection of comments in the threads being tracked, and actions correspond to selecting a subset of new comments to follow (sub-actions) from the set of recent contributions to the discussion. Since comments are potentially redundant (multiple respondents can have similar reactions), the study found that sub-actions were best evaluated in combination. The computational complexity of the combinatorial action space was sidestepped by randomly sampling a fixed number of candidates from the full action space. A major drawback of random sampling in this application is that popular comments are rare and easily missed.

We make two main contributions in this paper. The first is a novel architecture for incorporating unstructured external knowledge into reinforcement learning. More specifically, information from the original state is used to query the knowledge source (here, an evolving collection of documents corresponding to other online discussions about world events), and the state representation is augmented by the outcome of the query. Thus, the agent can use both the local context (reinforcement learning environment) and the global context (e.g. recent discussions about world news) when making decisions. Second, we propose to use a two-stage Q-learning framework that makes it feasible to explore the full combinatorial natural language action space. A first Q-function is used to efficiently generate a list of sub-optimal candidate actions, and a second more sophisticated Q-function reranks the list to pick the best action.

2 Task

On Reddit, users reply to posts and to other comments in a threaded (tree-structured) discussion. Comments (and posts) are associated with a karma score, a combination of positive and negative votes from registered users that indicates the comment's popularity. In prior work (He et al., 2016b), popularity prediction in Reddit discussions (comment recommendation) was proposed for studying reinforcement learning with a large-scale natural language action space. At each time step $t$, the agent receives a string of text that describes the state and several strings of text that describe the potential actions (new comments to consider), and attempts to pick the best action so as to maximize the long-term reward. In a real-time scenario, the final karma of a comment is not immediately available, so the prediction of popularity is based on the text of the comment as well as the context of the discussion history. It is common for a comment with low karma to eventually lead to more discussion and popular comments later on. It is thus natural to formulate this task as a reinforcement learning problem.

More specifically, the set of comments that are being tracked at time step $t$ is denoted as $M_t$, with $|M_t| = K$. The state, action, and immediate rewards are defined as follows:

  • State: all previously tracked comments, as well as the post (root node of the tree), i.e. $s_t = \{\text{post}\} \cup M_1 \cup \dots \cup M_t$.

  • Action: an action is taken when a total of $N_t$ new comments $c_t^1, \dots, c_t^{N_t}$ appear as nodes in the subtrees of $M_t$, and the agent picks a set of $K$ comments to be tracked in the next time step: $a_t = M_{t+1} \subseteq \{c_t^1, \dots, c_t^{N_t}\}$, with $|M_{t+1}| = K$, and $M_{t+1} = \emptyset$ if no new comments appear (i.e. $N_t = 0$).

  • Reward: $r_{t+1}$ is the accumulated karma score of the comments in $M_{t+1}$. (The karma score is observed from an archived version of the discussion; it is not immediately shown at the time of the comment, so it is not available to a real-time system.)

In this task, because an action corresponds to a set of comments (sub-actions) chosen from a larger set of candidates, the action space is combinatorial. The number of new comments $N_t$ and the candidates themselves are also time-varying, reflecting the flow of the discussion in the paths chosen.
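To make the bookkeeping above concrete, the sketch below implements a toy version of this environment in Python. It is a minimal illustration under simplifying assumptions (a flat dictionary mapping each comment to its direct replies stands in for the discussion tree, and only direct replies to tracked comments are offered as candidates); the class and field names are ours, not the authors'.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Comment:
    cid: str        # comment id
    parent: str     # id of the parent comment (or "POST" for top-level comments)
    text: str
    karma: int      # taken from an archived snapshot; not visible in real time


@dataclass
class ThreadTrackingEnv:
    """Toy tracking MDP: the state is the post plus all previously tracked
    comments; an action picks up to K of the newly appearing comments."""
    post_text: str
    replies: Dict[str, List[Comment]]            # parent id -> direct replies
    k: int = 3
    tracked: Set[str] = field(default_factory=lambda: {"POST"})
    state_text: List[str] = field(default_factory=list)

    def state(self) -> List[str]:
        # State: the post plus all previously tracked comments.
        return [self.post_text] + self.state_text

    def candidates(self) -> List[Comment]:
        # New comments appearing under the currently tracked comments.
        return [c for parent in self.tracked for c in self.replies.get(parent, [])]

    def step(self, picked: List[Comment]) -> float:
        assert len(picked) <= self.k
        reward = float(sum(c.karma for c in picked))       # accumulated karma
        self.tracked = {c.cid for c in picked}             # comments tracked next
        self.state_text.extend(c.text for c in picked)     # state grows over time
        return reward
```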

The standard Q-learning defines a function $Q(s, a)$ as the expected return starting from state $s$ and taking action $a$:

$$Q(s_t, a_t) = \mathbb{E}\Big[\textstyle\sum_{l \ge 0} \gamma^{l}\, r_{t+1+l} \,\Big|\, s_t, a_t\Big],$$

where $\gamma$ denotes a discount factor. The Q-function associated with an optimal policy can be found by the Q-learning recursion (Watkins and Dayan, 1992):

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \eta\,\big(r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\big),$$

where $\eta$ is the learning rate of the algorithm. In He et al. (2016b), two deep Q-learning architectures are proposed, both with separate networks for the state and action spaces yielding embeddings $h_s$ and $h_a$, respectively. Those embeddings are combined with a general interaction function $g(\cdot)$ to approximate the Q-values, $Q(s, a) = g(h_s, h_a)$, as in He et al. (2016a), where the approach of using separate networks for natural language state and action spaces is termed a Deep Reinforcement Relevance Network (DRRN).
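As a concrete illustration of the separate state- and action-side networks, here is a small NumPy sketch of a DRRN-style Q-value, using an inner product as one simple choice of interaction function; the layer sizes, random toy inputs, and function names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def mlp(x, layers):
    """Small feed-forward tower (tanh activations)."""
    h = x
    for W, b in layers:
        h = np.tanh(W @ h + b)
    return h


def drrn_q(state_vec, action_vec, state_net, action_net):
    """DRRN-style Q-value: embed state and action with separate networks,
    then combine the embeddings with an interaction function (inner product)."""
    h_s = mlp(state_vec, state_net)
    h_a = mlp(action_vec, action_net)
    return float(h_s @ h_a)


# Toy usage with random parameters: vocabulary size 50, hidden dimension 20.
rng = np.random.default_rng(0)

def init(sizes):
    return [(0.1 * rng.standard_normal((m, n)), np.zeros(m))
            for n, m in zip(sizes[:-1], sizes[1:])]

state_net, action_net = init([50, 20, 20]), init([50, 20, 20])
state = rng.random(50)                                  # bag-of-words features
comments = [rng.random(50) for _ in range(10)]          # candidate sub-actions
q_values = [drrn_q(state, c, state_net, action_net) for c in comments]
best = int(np.argmax(q_values))                         # greedy choice
```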

Figure 1: Different deep Q-learning architectures: (a) DRRN-Sum, (b) DRRN-BiLSTM.

3 Related Work

There has been increasing interest in applying deep reinforcement learning to a variety of problems, including tasks involving natural language. To control agents directly from high-dimensional sensory inputs, the Deep Q-Network (Mnih et al., 2015) has been proposed and has shown high capacity and scalability for handling a large state space. Another stream of work in recent deep learning research is the attention mechanism (Bahdanau et al., 2015; Sukhbaatar et al., 2015; Vinyals et al., 2015), where a probability distribution is computed to pay attention to certain parts of a collection of data. It has been shown that the attention mechanism can handle long sequences or a large collection of data, while being quite interpretable. The attention-mechanism work that is closest to ours is the memory network (MemNN) (Weston et al., 2014; Sukhbaatar et al., 2015). Most work on MemNNs uses embeddings of a query and documents to compute the attention weights for memory slots. Here, we propose models that also use non-content-based features (time, popularity) for memory addressing. This helps retrieve content that provides complementary information to what is modeled in the query embedding vector. In addition, the content-based component of our query scheme uses TF-IDF-based semantic similarity, since the memory comprises a very large corpus of external documents, which makes end-to-end learning of attention features impractical.

Multiple studies have explored interacting with a database (or knowledge base) using reinforcement learning. Narasimhan et al. (2016) present a framework for acquiring and incorporating external evidence to improve extraction accuracy in domains where training data is scarce. In task-oriented human-computer dialogue, Wen et al. (2016) introduce a neural network-based trainable dialogue system with a database operator module. Dhingra et al. (2016) propose a dialogue agent that provides users with an entity from a knowledge base by interactively asking for its attributes. In question answering, knowledge representation and reasoning also play a central role (Ferrucci et al., 2010; Boyd-Graber et al., 2012). Our goal differs from these studies in that we do not directly optimize a domain-specific knowledge search; instead, we use external world knowledge to enrich the state representation in a reinforcement learning task.

Our task of tracking popular Reddit comments is somewhat related to the approach to multi-document summarization described in Daumé III et al. (2009). A difference with respect to our problem is that the space of text for selection evolves over time. In addition, in our case, the agent has no access to an optimal policy, in contrast to the SEARN algorithm used in that work.

To address overestimation of action values, double Q-learning (Hasselt, 2010; Van Hasselt et al., 2015) has been proposed and yields performance gains on several Atari games. Dulac-Arnold et al. (2016) present a policy architecture that works efficiently with a large number of actions. While a combinatorial action space can be large and discrete, this method does not directly apply in our case because the possible actions change across states. Instead, we borrow the philosophy of double Q-learning and propose a two-stage Q-learning approach that reduces computational complexity by using a first Q-function to construct a quick yet rough estimate over the combinatorial action space, and a second Q-function to rerank a set of sub-optimal actions.

The work described in this paper improves over He et al. (2016b) by augmenting the state representation with external knowledge and by combining the two architectures they proposed within a two-stage Q-learning framework that enables exploration of the full action space. The first DRRN evaluates an action by treating sub-actions as independent and summing their contributions to the Q-value (Figure 1(a)); the second models potential redundancy among sub-actions by using a BiLSTM at the comment level (Figure 1(b)).

4 Incorporating External Knowledge into the State Representation

This approach is inspired by the observation that in a real-world decision-making process, it is usually beneficial to consider background knowledge. Here, we introduce a mechanism to incorporate external language knowledge into decision making. The intuition is that the agent maintains a memory of external knowledge, and when a new state arrives, it retrieves the relevant resources from this memory to inform its decision.

Figure 2: Incorporating external knowledge to augment a state-side representation with an attention mechanism. The attention features depend on the state and time stamp, helping the agent learn to pay different attention to external knowledge given different states. The shaded blue parts are learned end-to-end within reinforcement learning.

The architecture we propose is illustrated in Figure 2. Every time the agent reads the state information from the environment, it performs a lookup operation in external knowledge in its memory. This external knowledge could be a static knowledge base, or more generally it can be a dynamic database. In our experiments, the agent keeps an evolving collection of documents from the worldnews subreddit. We use an attention mechanism that produces a probability distribution over the entire external knowledge resource. This weight vector is computed by considering a set of features measuring the relevance between the current state and the “world knowledge” of the agent. More specifically, we consider the following three types of relevance:

  • Timing features: when users express their opinions on a website such as Reddit, it is likely they are referring to more recent news events. We use two indicator features to represent whether a document from the external knowledge is within the past 24 hours, or within the past 7 days, relative to the time of the new state. We denote these features as $x_{\text{day}}$ and $x_{\text{week}}$, respectively.

  • Semantic similarity: we use standard tf-idf (term frequency–inverse document frequency) weighting (Salton and McGill, 1986) and compute cosine similarity scores as a measure of semantic relevance between the current state and each document in the external knowledge. We denote this semantic similarity as $x_{\text{sim}}$.

  • Popularity: for Reddit posts/comments, we may use the karma score as a measure of popularity. It is possible that highly popular topics will occur more often in the environment. To compensate for the range differences among the relevance measures, we normalize the karma scores (detailed descriptions are given in Section 6) so that the feature values fall in the range $[0, 1]$. We denote this normalized popularity score as $x_{\text{pop}}$.

For each state $s_t$, the agent extracts the above features for each document $d_i$ in the external knowledge, forming a 4-dimensional feature vector $x_{t,i} = [x_{\text{day}}, x_{\text{week}}, x_{\text{sim}}, x_{\text{pop}}]$. The attention weights are then computed as a linear combination of these features followed by a softmax over the entire external knowledge:

$$\alpha_t = \mathrm{softmax}\big(w^\top x_{t,1}, \dots, w^\top x_{t,D}\big),$$

where the softmax operates over the collection of documents and $\alpha_t$ has dimension equal to the number of documents $D$. Note that in our experimental setting, the softmax applies only to documents that exist before the new comments appear, which simulates a "real-time" dynamic external knowledge resource. The attention weights are then multiplied with the document embeddings $h_{d_i}$ to form a vector representation (embedding) of "world" knowledge:

$$h_{\text{world},t} = \sum_{i=1}^{D} \alpha_{t,i}\, h_{d_i}.$$

The world embedding is concatenated with the original state embedding to enrich the agent's understanding of the environment.
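The retrieval step described above can be sketched as follows: build the 4-dimensional relevance features per document, score them with a learned weight vector, apply a softmax restricted to documents that already exist at the current time, and take the attention-weighted sum of document embeddings. The feature ordering, variable names, and linear scorer are assumptions that follow the description above, not the authors' exact code.

```python
import numpy as np

DAY, WEEK = 24 * 3600, 7 * 24 * 3600


def relevance_features(state_vec, state_time, doc_vecs, doc_times, doc_pop):
    """Per-document features: past-day and past-week indicators,
    tf-idf cosine similarity to the state, and normalized popularity."""
    feats = []
    for v, t, p in zip(doc_vecs, doc_times, doc_pop):
        age = state_time - t
        sim = float(state_vec @ v /
                    (np.linalg.norm(state_vec) * np.linalg.norm(v) + 1e-8))
        feats.append([float(0 <= age <= DAY), float(0 <= age <= WEEK), sim, p])
    return np.array(feats)                        # shape (num_docs, 4)


def world_embedding(feats, doc_embeds, doc_times, state_time, w):
    """Softmax attention over documents existing before the current state,
    followed by a weighted sum of document embeddings."""
    scores = feats @ w                            # linear combination of features
    valid = np.asarray(doc_times) <= state_time   # "real-time" restriction
    scores = np.where(valid, scores, -np.inf)
    scores = scores - scores[valid].max()         # numerical stability
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum()                   # attention weights
    return alpha @ doc_embeds                     # world embedding
```

In the full model, the weight vector and the state and document embedding networks are the components learned end-to-end within reinforcement learning, and the returned world embedding is concatenated with the state embedding.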

5 Two-Stage Q-learning for a Combinatorial Action Space

There are two challenges associated with a combinatorial action space. One is the development of a Q-function framework for estimating the long-term reward; this is addressed in He et al. (2016b). The other is the potentially high computational complexity, due to evaluating $Q(s_t, a_t)$ over every possible action $a_t$. In the case of deep Q-learning, most of the time is spent on the forward pass from actions to Q-values. For back-propagation, since we only need to back-propagate through the one action the agent has chosen, complexity is not affected by the combinatorial action space.

One solution that sidesteps the computational complexity is to randomly pick a fixed number of candidate actions and perform the $\max$ operation over only those. While this is widely used in the reinforcement learning literature, it is problematic in our application because the large and highly skewed action space makes it likely that good actions are missed. Here we propose two-stage Q-learning to reduce the search complexity. More specifically, we rewrite the $\max_{a_t} Q(s_t, a_t)$ operation as:

$$\max_{a_t \in \mathcal{A}_L} Q_2(s_t, a_t),$$

where

$$\mathcal{A}_L = \operatorname{arg\,topL}_{\,a_t \in \mathcal{A}_t} Q_1(s_t, a_t), \qquad (1)$$

and $\operatorname{arg\,topL}$ means picking the top-$L$ actions from the whole action set $\mathcal{A}_t$.

In the case of $Q_1$ being DRRN-Sum, we can rewrite $Q_1(s_t, a_t)$ as:

$$Q_1(s_t, a_t) = \sum_{c \in a_t} q(s_t, c),$$

which is simplified by precomputing the sub-action values $q(s_t, c_t^i)$, $i = 1, \dots, N_t$. Here $q$ is the simple DRRN introduced in He et al. (2016a).

To elaborate, the idea is to use a first Q-function $Q_1$ to perform a quick but rough ranking of the actions. The second Q-function $Q_2$, which can be more sophisticated, is used to rerank the top-$L$ candidate actions. This is effectively a beam search with coarse-to-fine models and reranking. It ensures that all comments are explored, while allowing the architecture to be sophisticated enough to capture detailed dependencies between sub-actions, such as information redundancy. In our experiments, we pick $Q_1$ to be DRRN-Sum and $Q_2$ to be DRRN-BiLSTM. While DRRN-Sum's independence assumption over sub-actions is too strong, the model is relatively easy to train. Since the parameters on the action side are tied across sub-actions, we can train a DRRN with $K = 1$ and then apply the model to each state–sub-action pair $(s_t, c_t^i)$. This yields the sub-action Q-values $q(s_t, c_t^i)$, so computing Equation 1 is equivalent to sorting $N_t$ values. We thus avoid the huge computational cost of first generating actions from sub-actions and then applying a general Q-function approximation to obtain $Q_1(s_t, a_t)$. In Section 6, we train a DRRN (with $K = 1$) and then copy its parameters to DRRN-Sum, which can be used to evaluate the full action space. (The whole two-stage Q framework is summarized in Algorithm 1 in the Appendix.)
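The selection procedure can be sketched as below: stage one scores every sub-action once with a cheap value q(s, c), forms a small beam of candidate K-subsets ranked by the sum of their sub-action values, and stage two reranks that beam with the slower, dependency-aware Q-function. Building the beam from combinations of the top-ranked sub-actions is a simplification we adopt for the sketch of the top-L selection in Equation 1; the arguments q1_sub and q2_action are placeholders standing in for the DRRN/DRRN-Sum and DRRN-BiLSTM scorers.

```python
import heapq
from itertools import combinations
from typing import Callable, Sequence, Tuple


def two_stage_select(
    state,
    sub_actions: Sequence,                          # the N_t new comments
    q1_sub: Callable[[object, object], float],      # cheap per-sub-action value q(s, c)
    q2_action: Callable[[object, Tuple], float],    # expensive action-level Q-function
    k: int,                                         # sub-actions per action (K)
    beam: int = 10,                                 # candidate actions to rerank (L)
) -> Tuple:
    # Stage 1: one cheap forward pass per sub-action (N_t evaluations,
    # not one per K-subset of the combinatorial action space).
    q1 = {c: q1_sub(state, c) for c in sub_actions}

    # Form candidate actions from the best-scoring sub-actions and keep the
    # top `beam` K-subsets under the additive (DRRN-Sum-style) score.
    pool = sorted(sub_actions, key=q1.get, reverse=True)[: k + beam]
    candidates = heapq.nlargest(
        beam,
        combinations(pool, k),
        key=lambda action: sum(q1[c] for c in action),
    )

    # Stage 2: rerank the shortlist with the more sophisticated Q-function
    # (e.g. one that models redundancy among the chosen comments).
    return max(candidates, key=lambda action: q2_action(state, action))
```

In the paper's configuration, q1_sub corresponds to the shared DRRN sub-action scorer (so the stage-one action score is the DRRN-Sum value) and q2_action to the DRRN-BiLSTM reranker.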

6 Experiments

6.1 Data set and preprocessing

We carry out experiments on the task of predicting popular discussion threads on Reddit, as proposed by He et al. (2016b). Specifically, we conduct experiments on data from 5 subreddits: askscience, askmen, todayilearned, askwomen, and politics, which cover diverse genres and topics. In order to have long enough discussion threads, we filter out discussion trees with fewer than 100 comments. For each of the 5 subreddits, we randomly partition 90% of the data for online training and 10% for testing. Our evaluation metric is accumulated karma score. For each setting we report the mean (average reward) and standard deviation (shown as error bars or numbers in brackets) over 5 independent runs, each over 10,000 episodes. In all our experiments we use a fixed window of new comments at each time step, as in He et al. (2016b). The basic subreddit statistics are shown in Table 1. We also report random-policy performance and an oracle upper bound. (Upper bounds are estimated by exhaustively searching each discussion tree for the maximum-karma discussion threads, with overlapping comments counted only once. This upper bound may not be attainable in a real-time setting. For askscience, across different values of $K$, the upper bound ranges from 1991.3 to 2298.0.)

Subreddit # Posts (in k) # Comments (in M) Random Upper bound
askscience 0.94 0.32 321.3 2109.0
askmen 4.45 1.06 132.4 651.4
todayilearned 9.44 5.11 390.3 2679.6
askwomen 3.57 0.81 132.4 651.4
politics 4.86 2.18 149.3 967.7
worldnews 9.88 5.99 205.8 1853.4
Table 1: Basic statistics of filtered subreddits

In preprocessing we remove punctuation and lowercase all words. We use bag-of-words representations for each state $s_t$ and comment $c_t^i$ in discussion tracking, and for each document in the external knowledge source. The vocabulary contains the most frequent 5,000 words, and the out-of-vocabulary rate is 7.1%. We use fully-connected feed-forward neural networks with hidden dimension 20 to compute the state, action, and document embeddings.
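A minimal sketch of this preprocessing, assuming a simple regex tokenizer and a vocabulary built from training-set counts (both are our simplifications):

```python
import re
from collections import Counter
from typing import Dict, List

import numpy as np


def tokenize(text: str) -> List[str]:
    # Lowercase and remove punctuation, then split on whitespace.
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()


def build_vocab(corpus: List[str], size: int = 5000) -> Dict[str, int]:
    # Keep the most frequent `size` words from the training corpus.
    counts = Counter(tok for doc in corpus for tok in tokenize(doc))
    return {w: i for i, (w, _) in enumerate(counts.most_common(size))}


def bag_of_words(text: str, vocab: Dict[str, int]) -> np.ndarray:
    vec = np.zeros(len(vocab))
    for tok in tokenize(text):
        idx = vocab.get(tok)          # out-of-vocabulary tokens are dropped
        if idx is not None:
            vec[idx] += 1.0
    return vec
```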

Our Q-learning agent uses $\epsilon$-greedy exploration throughout online training and testing, with a fixed discount factor $\gamma$. During training, we use experience replay (Lin, 1992) with a memory size of 10,000. For each experience replay, 500 episodes are generated and the tuples are stored in a first-in-first-out fashion. We use mini-batch stochastic gradient descent with a batch size of 100 and a constant learning rate. We train separate models for different subreddits.

6.2 Incorporating external knowledge

We first study the effect of incorporating external knowledge, without considering the combinatorial action space. More specifically, we set $K = 1$ and use the simple DRRN, so each action picks a single comment from the new candidates to track. Our proposed method uses a state representation augmented with world knowledge, as illustrated in Figure 2.

We utilize the worldnews subreddit as our external knowledge source. This subreddit consists of 9.88k posts. We define each document in the world knowledge to be the post plus its top-5 comments ranked by karma score. The agent keeps a growing collection of documents: at each time $t$, the external knowledge contains the documents from worldnews that appeared before time $t$. To compute the popularity score of each document, we simply sum the karma scores of the post and its top-5 comments. The karma scores are then normalized by dividing by the highest score in the external knowledge. (Unlike in Fang et al. (2016), the summed karma scores do not follow a Zipfian distribution, so we do not use quantization or any nonlinear transformation.) Thus the popularity feature values used for computing attention fall in the range $[0, 1]$.
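A small sketch of the popularity feature as described here (the helper names are ours): sum the karma of the post and its top-5 comments, then divide by the current maximum over the documents in the agent's memory so that the value lies in [0, 1].

```python
from typing import List


def document_popularity(post_karma: int, comment_karmas: List[int]) -> int:
    """Raw popularity = karma of the post plus its top-5 comments by karma."""
    return post_karma + sum(sorted(comment_karmas, reverse=True)[:5])


def normalized_popularity(raw_scores: List[int]) -> List[float]:
    """Divide by the highest score among documents currently in memory,
    so the attention feature falls in [0, 1]."""
    top = max(raw_scores) or 1
    return [s / top for s in raw_scores]
```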

Figure 3: DRRN (with multiple ways of incorporating external knowledge) performance gains over baseline DRRN (without external knowledge) across 5 different subreddits

For comparison, we experiment with a baseline DRRN without any external knowledge. We also construct baseline DRRNs with hand-crafted rules for picking documents from the external knowledge. Those rules include: i) documents from the past day, ii) documents from the past week, iii) the 10 semantically most similar documents, and iv) the 10 most popular documents. In each case, we use a bag-of-words representation of the selected documents to construct the world embedding used to augment the state representation.

We compare multiple ways of incorporating external knowledge for different subreddits and show performance gains over the baseline DRRN (without any external knowledge) in Figure 3. The experimental results show that the DRRN using a learned attention mechanism to retrieve relevant knowledge outperforms all other configurations of DRRNs with rules for knowledge retrieval, and significantly outperforms the DRRN baseline that does not use external knowledge. We also observe that different relevance features have different impacts across subreddits. For example, for askscience, past-day documents have higher impact than past-week documents, while for politics past-week documents are more important. The most-popular documents actually have a negative effect for todayilearned, mainly because those are the documents that are most popular throughout the entire history, while todayilearned discussions value information about recent events. (In principle, since we are concatenating the world embedding to obtain an augmented state representation, the result should not get worse. We hypothesize the degradation is due to overfitting and the use of mismatched documents, as in the most-popular setting for todayilearned.) Nevertheless, the attention mechanism learns to rely on the proper features to retrieve useful knowledge for the needs of different domains.

6.3 Two-stage Q-learning for a combinatorial action space

In this subsection we study the effect of two-stage Q-learning, without considering external knowledge. We first train a DRRN (with $K = 1$) and copy its parameters to DRRN-Sum, which serves as $Q_1$. We then train $Q_2$ = DRRN-BiLSTM as before, except that we use $Q_1$ = DRRN-Sum to explore the whole action space and obtain the top-$L$ candidate actions.

       random + DRRN-BiLSTM   all + DRRN-Sum   DRRN-Sum → DRRN-BiLSTM
K=2    573.2 (12.9)           663.3 (8.7)      676.9 (5.5)
K=3    711.1 (8.7)            793.1 (8.1)      833.9 (5.7)
K=4    854.7 (16.0)           964.5 (12.0)     987.1 (12.1)
K=5    980.9 (21.1)           1099.4 (15.9)    1101.3 (13.8)
Table 2: A performance comparison across different values of $K$ on the askscience subreddit

On askscience, we try multiple settings with $K = 2, 3, 4, 5$; the results are shown in Table 2. We compare the proposed two-stage Q-learning with two single-stage Q-learning baselines. The first baseline, following the method in He et al. (2016b), uses random subsampling to obtain a fixed number of candidate actions and takes the max over them using DRRN-BiLSTM. The second baseline uses DRRN-Sum and explores the whole action space. The proposed two-stage Q-learning uses DRRN-Sum for picking the top-$L$ candidates and DRRN-BiLSTM for reranking. We observe a large improvement when switching from "random" to "all", showing that exploring the entire action space is critical in this task. There is also a consistent gain from using two-stage Q-learning instead of single-stage Q-learning with DRRN-Sum, showing that using a more sophisticated value function for reranking also helps performance.

                          askscience    askmen       todayilearned   askwomen     politics
random + DRRN-BiLSTM      711.1 (8.7)   139.0 (3.6)  606.9 (15.8)    135.0 (1.3)  177.9 (3.3)
all + DRRN-Sum            793.1 (8.1)   142.5 (2.3)  679.4 (11.4)    145.9 (2.4)  180.6 (6.3)
DRRN-Sum → DRRN-BiLSTM    833.9 (5.7)   148.0 (5.5)  697.9 (9.4)     149.6 (3.3)  204.7 (4.2)
Table 3: A performance comparison (across different subreddits) with $K = 3$

In Table 3, we compare two-stage Q-learning with the two baselines across different subreddits, with $K = 3$. The findings are consistent with those for askscience. Since different subreddits may have very different karma score distributions and language styles, our results suggest that the algorithm transfers well across community interaction styles.

During testing, we compare the runtime of the DRRN-BiLSTM Q-function under the different action-selection strategies, simulating over 10,000 episodes. The search times for random selection and for the two-stage Q-function are similar, and both are nearly constant as $K$ varies. Using two-stage Q-learning, the test runtime is reduced substantially compared to exploring the whole action space with DRRN-BiLSTM. (Training DRRN-BiLSTM over the whole action space is intractable, so for that comparison we used a DRRN-BiLSTM model trained on the subsampled space at test time. This achieves worse performance than two-stage Q-learning, probably due to the mismatch between training and testing.)

6.4 Combined results

Figure 4: Ablation study on effects by incorporating external knowledge and/or two-stage Q-learning across 5 different subreddits.
Example 1
  State: Would it be possible to artificially create an atmosphere like Earth has on Mars?
  Top-1: Ultimate Reality TV: A Crazy Plan for a Mars Colony - It might become the mother of all reality shows. Fully 704 candidates are soon to begin competing for a trip to Mars to establish a colony there.
  Top-2: ‘Alien thigh bone’ on Mars: Excitement from alien hunters at ‘evidence’ of extraterrestrial life. Mars likely never had enough oxygen in its atmosphere and elsewhere to support more complex organisms.
  Top-3: The Gaia (General Authority on Islamic Affairs) and the UAE (United Arab Emirates) have issued a fatwa on people living on mars, due to the religious reasoning that there is no reason to be there.
  Least: North Korea’s internet is offline; massive DDOS attack presumed.

Example 2
  State: Does our sun have any unique features compared to any other star?
  Top-1: Star Wars: Episode VII begins filming in UAE desert. This can’t possibly be a modern Star Wars movie! I don’t see a green screen in sight! Ya, it’s more like Galaxy news.
  Top-2: African Pop Star turns white (and causes controversy) with new line of skin whitening cream. I would like to see an unshopped photo of her in natural lighting.
  Top-3: Dwarf planet discovery hints at a hidden Super Earth in solar system - The body, which orbits the sun at a greater distance than any other known object, may be shepherded by an unseen planet.
  Least: Hong Kong democracy movement hit by 2018. The vote has no standing in law, by attempting to sabotage it, the Chinese(?) are giving it legitimacy.

Table 4: States and the most/least attended documents (partial text), showing how the agent learns to attend to different parts of the external knowledge

In Figure 4, we present an ablation study of the effects of incorporating external knowledge and/or two-stage Q-learning across different subreddits. The two contributions we propose each improve reinforcement learning performance in a natural language scenario with a combinatorial action space, and combining the two approaches further improves performance. In our task, two-stage Q-learning provides the larger gain; however, in all cases, incorporating external knowledge consistently gives additional gains on top of two-stage Q-learning.

We conduct case studies in Table 4, showing examples of the most and least attended documents in the external knowledge given the state description. The documents are shortened for brevity. In the first example, the state is a question about the atmosphere on Mars. The most-attended documents are correctly related to living conditions on Mars, from various sources and perspectives. In the second example, the state asks about the sun's features compared to other stars. Interestingly, although the agent attends to the top documents due to some topic-word matching (e.g., sun, star), the picked documents reflect popularity more than topic relevance. The least-attended documents are totally irrelevant in both examples, as expected.

7 Conclusion

In this paper we introduce two approaches for improving natural language based decision making in a combinatorial action space. The first augments the state representation of the environment by incorporating external knowledge through a learnable attention mechanism. The second uses a two-stage Q-learning framework to explore the entire combinatorial action space while avoiding enumeration of all possible action combinations. Our experimental results show that both proposed approaches improve performance on the task of predicting popular Reddit threads.

References

Supplementary Material

Appendix A Algorithm table for two-stage Q-learning

As shown in Algorithm 1.

1:  Initialize the Reddit popularity prediction environment and load the dictionary.
2:  Initialize a DRRN (equivalent to DRRN-Sum with $K = 1$) with small random weights and train it. DRRN-Sum shares the same parameters as this DRRN.
3:  Initialize replay memory $\mathcal{D}$ to a fixed capacity.
4:  for each episode do
5:     Randomly pick a discussion tree.
6:     Read the raw state text and the list of sub-action texts from the simulator, and convert them to representations $s_1$ and $\{c_1^1, \dots, c_1^{N_1}\}$.
7:     Compute $q(s_1, c_1^i)$ for the list of sub-actions using DRRN forward activation.
8:     For each candidate action $a$, form the value $Q_1(s_1, a)$ by summing its sub-action values.
9:     Keep a list $\mathcal{A}_L$ of the top-$L$ actions, where each action consists of $K$ sub-actions.
10:     for each time step $t$ in the episode do
11:        Compute $Q_2(s_t, a)$ for $a \in \mathcal{A}_L$, the list of top-$L$ actions, using DRRN-BiLSTM forward activation.
12:        Select an action $a_t$ based on a policy derived from $Q_2$. Execute $a_t$ in the simulator.
13:        Observe reward $r_{t+1}$. Read the next state text and the next list of sub-action texts, and convert them to representations $s_{t+1}$ and $\{c_{t+1}^1, \dots, c_{t+1}^{N_{t+1}}\}$.
14:        Compute $q(s_{t+1}, c_{t+1}^i)$ for the list of sub-actions using the DRRN.
15:        For each candidate action $a$, form the value $Q_1(s_{t+1}, a)$.
16:        Keep a list $\mathcal{A}'_L$ of the top-$L$ actions, where each action consists of $K$ sub-actions.
17:        Store the transition $(s_t, a_t, r_{t+1}, s_{t+1}, \mathcal{A}'_L)$ in $\mathcal{D}$.
18:        if during training then
19:           Sample a random mini-batch of transitions $(s_j, a_j, r_{j+1}, s_{j+1}, \mathcal{A}^j_L)$ from $\mathcal{D}$.
20:           Set $y_j = r_{j+1} + \gamma \max_{a \in \mathcal{A}^j_L} Q_2(s_{j+1}, a)$ (or $y_j = r_{j+1}$ if $s_{j+1}$ is terminal).
21:           Perform a gradient descent step on $(y_j - Q_2(s_j, a_j))^2$ with respect to the network parameters. Back-propagation is performed only for the chosen action $a_j$, though there are $L$ candidate actions.
22:        end if
23:     end for
24:  end for
Algorithm 1 Two-stage Q-learning in a combinatorial action space ($Q_1$: DRRN-Sum, $Q_2$: DRRN-BiLSTM)

Appendix B URLs for subreddits used in this paper

As shown in Table 5. All post ids will be released for future work on this task.

Subreddit URL
askscience https://www.reddit.com/r/askscience/
askmen https://www.reddit.com/r/askmen/
todayilearned https://www.reddit.com/r/todayilearned/
askwomen https://www.reddit.com/r/askwomen/
politics https://www.reddit.com/r/politics/
worldnews https://www.reddit.com/r/worldnews/
Table 5: URLs of subreddit data sets