1 Introduction
The explosive growth of the World Wide Web has generated massive amounts of data. As a consequence, the information overload problem has become progressively severe [Chang et al. (2006)]. Thus, how to identify objects that satisfy users’ information needs at the appropriate time and place has become increasingly important, which has motivated three representative information seeking mechanisms: search, recommendation, and online advertising. The search mechanism outputs objects that match the query, the recommendation mechanism generates a set of items that match users’ implicit preferences, and the online advertising mechanism is analogous to search and recommendation except that the objects to be presented are advertisements [Garcia-Molina et al. (2011)]. Numerous efforts have been made to design intelligent methods for these three information seeking mechanisms. However, traditional techniques often face several common challenges. First, the majority of existing methods treat information seeking as a static task and generate objects following a fixed greedy strategy. This may fail to capture the dynamic nature of users’ preferences (or of the environment). Second, most traditional methods are developed to maximize the short-term reward, while completely neglecting whether the suggested objects will contribute more to the long-term reward [Shani et al. (2005)]. Note that the reward is defined differently across information seeking tasks, for example as click-through rate (CTR), revenue, or dwell time.
Recent years have witnessed the rapid development of reinforcement learning (RL) techniques and a wide range of RL-based applications. Under the RL schema, we tackle complex problems by acquiring experience through interactions with a dynamic environment. The result is an optimal policy that can provide solutions to complex tasks without any specific instructions [Kaelbling et al. (1996)]. Employing RL for information seeking can naturally resolve the aforementioned challenges. First, by treating information seeking tasks as sequential interactions between an RL agent (the system) and users (the environment), the agent can continuously update its strategies according to users’ real-time feedback during the interactions, until the system converges to the optimal policy that generates the objects that best match users’ dynamic preferences. Second, RL frameworks are designed to maximize the long-term cumulative reward from users. Therefore, the agent can identify objects with a small immediate reward that nevertheless make large contributions to the reward in the long run.
Given the advantages of reinforcement learning, there has been tremendous interest in developing RL-based information seeking techniques. Thus, it is timely and necessary to provide an overview of information seeking techniques from a reinforcement learning perspective. In this survey, we present a comprehensive overview of state-of-the-art RL-based information seeking techniques and discuss some future directions. The remainder of the survey is organized as follows. In Section 2, we introduce the technical foundations of reinforcement learning based information seeking techniques. Then, in Sections 3 to 5, we review the three key information seeking tasks (search, recommendation, and online advertising) with representative algorithms. Finally, we conclude the work with several future research directions.
2 Technical Foundations
Reinforcement learning is learning how to map situations to actions [Sutton and Barto (1998)]. The two fundamental elements in RL are to formulate the situations (mathematical models) and to learn the mapping (policy learning).
2.1 Problem Formulation
In reinforcement learning, there are two main settings for problem formulation: multi-armed bandits (without state transition) and Markov decision processes (with state transition).
2.1.1 Multi-Armed Bandits
The Multi-Armed Bandit (MAB) problem is a simple model for the exploration/exploitation trade-off [Varaiya and Walrand (1983)]. Formally, a MAB can be defined as follows.
Definition 2.1.
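A MAB is a 3-tuple $\langle \mathcal{A}, \mathcal{R}, \pi \rangle$, where $\mathcal{A}$ is the set of actions (arms) and $a \in \mathcal{A}$, $\mathcal{R}$ is the reward distribution when performing action $a$, and the policy $\pi$ describes a probability distribution over the possible actions.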
An arm with the highest expected reward is called the best arm (denoted as $a^*$) and its expected reward is the optimal reward. An algorithm for MAB, at each time step $t$, samples an arm $a_t$ and receives a reward $r_t$. When making its selection, the algorithm depends on the history (i.e., actions and rewards) up to time $t-1$. The contextual bandit model (a.k.a. associative bandits or bandits with side information) is an extension of MAB that takes additional information into account [Auer et al. (2002), Lu et al. (2010)].
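As a minimal sketch of the interaction loop described above, the following Python snippet runs an epsilon-greedy learner on simulated Bernoulli arms. The arm means, the value of `epsilon`, and the function name are illustrative assumptions; epsilon-greedy is only one simple way of using the history of actions and rewards, and the sketch is not taken from any of the works cited here.

```python
import random

def epsilon_greedy_bandit(arm_means, steps=5000, epsilon=0.1, seed=0):
    """Run epsilon-greedy on Bernoulli arms; return (pull counts, value estimates)."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms        # pulls per arm (a summary of the history)
    estimates = [0.0] * n_arms   # running mean reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                            # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])   # exploit
        # Sample the reward r_t from the (hypothetical) Bernoulli reward distribution.
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # incremental mean
    return counts, estimates

# Hypothetical three-armed problem; the last arm is the best arm.
counts, estimates = epsilon_greedy_bandit([0.2, 0.5, 0.8])
print(counts, estimates)
```

Under these assumptions the learner should concentrate most of its pulls on the arm with the highest mean while still spending roughly an `epsilon` fraction of steps on exploration.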
2.1.2 Markov Decision Process
A Markov decision process (MDP) is a classical formalization of sequential decision making, which is a mathematically idealized form of reinforcement learning problem [Bellman (2013)]. We define an MDP as follows.
Definition 2.2.
A Markov Decision Process is a 5-tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \pi \rangle$, where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a discrete set of actions, $\mathcal{T}$ is the state transition function, which maps a state $s$ into a new state $s'$ in response to the selected action $a$, $\mathcal{R}$ is the reward distribution when performing action $a$ in state $s$, and the policy $\pi$ describes the behavior of an agent as a probability distribution over the possible actions.
The agent and environment interact at each of a sequence of discrete time steps $t = 0, 1, 2, \ldots$. Consequently, a sequence or trajectory is generated as $s_0, a_0, r_1, s_1, a_1, r_2, \ldots$. In general, we seek to maximize the expected discounted return, where the return is defined as $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$, where $\gamma$ ($0 \le \gamma \le 1$) is the discount rate.
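The discounted return can be computed for a finite reward sequence with a short backward recursion. The sketch below assumes the standard convention that the return is the sum of future rewards, each weighted by the discount rate raised to the number of steps until it is received; the function name and the example values are illustrative.

```python
def discounted_return(rewards, gamma):
    """Discounted return of a finite reward sequence [r_{t+1}, r_{t+2}, ...]."""
    g = 0.0
    # Work backwards using the recursion G_t = r_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1 + 0.5 + 0.25 = 1.75
```

With a discount rate close to 0 the agent is myopic and cares mostly about the next reward; with a rate close to 1 later rewards count almost as much as immediate ones.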
The Partially Observable Markov Decision Process (POMDP) is an extension of MDP to the case where the state of the system is not necessarily observable [Åström (1965), Smallwood and Sondik (1973), Sondik (1978), Kaelbling et al. (1998)].
2.1.3 Multi-Agent Setting
The generalization of the Markov Decision Process to the multi-agent case is the stochastic game [Bowling and Veloso (2002), Shoham et al. (2003), Busoniu et al. (2008)], defined as follows.
Definition 2.3.
A multi-agent game is a tuple $\langle N, \mathcal{S}, \mathcal{A}_1, \ldots, \mathcal{A}_N, \mathcal{T}, \mathcal{R}_1, \ldots, \mathcal{R}_N, \pi_1, \ldots, \pi_N \rangle$, where $N$ is the number of agents, $\mathcal{S}$ is the discrete set of environment states, $\mathcal{A}_i$ is the discrete set of actions for agent $i$, $\mathcal{T}$ is the state transition probability function, $\mathcal{R}_i$ is the reward function of agent $i$, and $\pi_i$ is the policy adopted by agent $i$.
In the multi-agent game, the state transition is the result of the joint action of all the agents $\mathbf{a}_t = (a_t^1, \ldots, a_t^N)$, where $a_t^i$ denotes the action taken by agent $i$ at time step $t$. The reward also depends on the joint action. If $\pi_1 = \cdots = \pi_N$, i.e., all the agents adopt the same policy to maximize the same expected return, the multi-agent game is fully cooperative. If $N = 2$ and $\pi_1 = -\pi_2$, i.e., the two agents have opposite policies, the game is fully competitive. Mixed games are stochastic games that are neither fully cooperative nor fully competitive.
2.2 Policy Learning
Reinforcement learning is a class of learning problems in which the goal of an agent (or of multiple agents) is to find a policy that optimizes some measure of its long-term performance. RL solutions can be categorized in different ways. Here we examine them from two perspectives: whether the full model is available, and how the optimal policy is found.
2.2.1 Model-based vs. Model-free
Reinforcement learning algorithms that explicitly learn system models and use them to solve MDP problems are model-based methods. Model-based RL is strongly influenced by control theory and is often explained in terms of different disciplines. These methods include popular algorithms such as Dyna [Sutton (1991)], Prioritized Sweeping [Moore and Atkeson (1993)], Q-iteration [Busoniu et al. (2010)], Policy Gradient (PG) [Williams (1992)], and variations of PG [Baxter and Bartlett (2001), Kakade (2001)]. Model-free methods ignore the model and focus on estimating the value functions directly from interaction with the environment. To accomplish this, they depend heavily on sampling and observation; thus they do not need to know the inner workings of the system. Some examples of these methods are Q-learning [Kröse (1995)], SARSA [Rummery and Niranjan (1994)], LSPI [Lagoudakis and Parr (2003)], and Actor-Critic [Konda and Tsitsiklis (1999)].
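To illustrate what learning directly from sampled interaction looks like, here is a hedged sketch of textbook tabular Q-learning. The toy chain environment, the hyperparameters, and all names are assumptions made for this example and do not come from any of the cited works; note that the update never consults a transition model, only the sampled next state and reward.

```python
import random

def q_learning_chain(n_states=5, episodes=500, alpha=0.1, gamma=0.9,
                     epsilon=0.1, seed=0):
    """Tabular Q-learning on a toy chain: action 1 moves right, action 0 moves left.
    Reaching the last state yields reward 1 and ends the episode."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    goal = n_states - 1
    for _ in range(episodes):
        s = 0
        while s != goal:
            if rng.random() < epsilon or q[s][0] == q[s][1]:
                a = rng.randrange(2)                  # explore, or break ties at random
            else:
                a = 0 if q[s][0] > q[s][1] else 1     # greedy action
            s_next = min(s + 1, goal) if a == 1 else max(s - 1, 0)
            r = 1.0 if s_next == goal else 0.0
            # Bootstrap from the sampled transition; no model of the chain is used.
            target = r + gamma * max(q[s_next])
            q[s][a] += alpha * (target - q[s][a])
            s = s_next
    return q

q = q_learning_chain()
greedy_policy = [0 if left > right else 1 for left, right in q[:-1]]
print(greedy_policy)
```

In this toy setting the greedy policy extracted from the learned table should move right in every non-terminal state.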
2.2.2 Value function vs. Policy search
Algorithms that first find the optimal value functions and then extract optimal policies are value function methods, such as Dyna, Q-learning, SARSA, and DQN [Mnih et al. (2015)]. The alternative approaches are policy search methods, which solve an MDP problem by directly searching in the space of policies. An important class of policy search methods is that of Policy Gradient (PG) algorithms [Williams (1992), Baxter and Bartlett (2001), Kakade (2001), Deisenroth and Rasmussen (2011)]. These methods aim to model and optimize the policy directly. The policy is usually modeled with a parameterized function $\pi_\theta$ with respect to $\theta$. The value of the reward (objective) function depends on this policy, and various algorithms can then be applied to optimize $\theta$ for the best reward. A further series of algorithms uses PG to search in the policy space while at the same time estimating a value function. The most important class of these methods is Actor-Critic (AC) and its variations [Konda and Tsitsiklis (1999), Peters et al. (2005), Peters and Schaal (2008), Bhatnagar et al. (2007), Bhatnagar et al. (2009)]. These are two-timescale algorithms in which the critic uses Temporal-Difference (TD) learning with a linear approximation architecture and the actor is updated in an approximate gradient direction based on information provided by the critic.
3 Reinforcement Learning for Search
Search aims to find and rank a set of objects (e.g., documents, records) based on a user query [Yin, Hu, Tang, Daly, Zhou, Ouyang, Chen, Kang, Deng, Nobata, et al. (Yin et al.)]. In this section, we review RL applications in key topics of search.
3.1 Query Understanding
Query understanding is the primary task through which a search engine understands users’ information needs. It can be useful for improving general search relevance and user experience, and for helping users accomplish tasks [Croft et al. (2010)]. In [Nogueira and Cho (2017)], RL is leveraged to solve the query reformulation task: a neural network based query reformulation framework is proposed, which rewrites a query to maximize the number of relevant documents returned. In the proposed framework, a search engine is treated as a black box that an agent learns to use in order to retrieve more relevant items, which opens the possibility of training an agent to use a search engine for a task other than the one it was originally intended for. Additionally, the upper-bound performance of an RL-based model is estimated in a given environment. In [Nogueira et al. (2018)], a multi-agent based method is introduced to efficiently learn diverse query reformulations. It is argued that it is easier to train multiple sub-agents than a single generalist one, since each sub-agent only needs to learn a policy that performs well for a subset of examples. In the proposed framework, an agent consists of multiple specialized sub-agents and a meta-agent that learns to aggregate the answers from the sub-agents to produce a final answer. Thus, the method makes learning faster through parallelism.
3.2 Ranking
Relevance ranking is the core problem of information retrieval [Yin et al. (2016)], and learning to rank (LTR) is the key technology in relevance ranking. In LTR, approaches that directly optimize the ranking evaluation measures are representative and have been proved to be effective [Yue et al. (2007), Xu and Li (2007), Xu et al. (2008)]. These methods usually optimize only the evaluation measure calculated at a predefined ranking position, e.g., NDCG at rank $K$ in [Xu and Li (2007)]. The information carried by the documents after rank $K$ is neglected. To solve this problem, an LTR model, MDPRank, is proposed in [Zeng et al. (2017)] based on a Markov decision process, which can leverage the measures calculated at all of the ranking positions. The reward function is defined based upon the IR evaluation measures, and the model parameters can be learned by maximizing the accumulated rewards over all of the decisions. Implicit relevance feedback refers to an interactive process between search engine and user, and has been proven to be very effective for improving retrieval accuracy [Lv and Zhai (2009)]. Both bandits and MDPs can model such an interactive process naturally [Vorobev et al. (2015), Katariya et al. (2016), Katariya et al. (2017)]. In [Kveton et al. (2015)], cascading bandits are introduced to identify the most attractive items, and the goal of the agent is to maximize its total reward with respect to the list of the most attractive items. By maintaining state transitions, an MDP is able to model the user state in the interaction with the search engine. In [Zeng et al. (2018)], the interactive process is formulated as an MDP and a Recurrent Neural Network is applied to process the feedback.
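The limitation of a measure truncated at a fixed rank can be seen in a small sketch of NDCG at a cutoff `k`. The gain `2^rel - 1` and the `log2` position discount used here are one common variant and are assumed for illustration; the cited works may define the measure slightly differently, and the relevance labels are made up.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k positions of a ranked list."""
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG@k normalized by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

ranking = [3, 2, 0, 1]         # hypothetical graded relevance labels, in ranked order
print(ndcg_at_k(ranking, 2))   # the mis-ordering at positions 3 and 4 is invisible here
print(ndcg_at_k(ranking, 4))   # the full-list measure does penalize it
```

A model trained only against the truncated measure receives no signal about the lower positions, which is the gap that a reward defined at every ranking position is meant to close.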
Beyond relevance ranking, another important goal is to provide search results that cover a wide range of topics for a query, i.e., search result diversification [Santos et al. (2015), Xu et al. (2017)]. Typical methods formulate the problem of constructing a diverse ranking as a process of greedy sequential document selection. To select an optimal document for a position, it is critical for a diverse ranking model to capture the utility of the information users have perceived from the preceding documents. To explicitly model the utility perceived by the users, the construction of a diverse ranking is formalized as a process of sequential decision making, and the process is modeled as a continuous-state Markov decision process, referred to as MDP-DIV [Xia et al. (2017)]. The ranking of documents is formalized as a sequence of decisions, and each action corresponds to selecting one document from the candidate set. In the parameter training phase, the REINFORCE policy gradient algorithm is adopted and the expected long-term discounted reward in terms of the diversity evaluation measure is maximized. For more work on diversity ranking, see [Feng et al. (2018), Kapoor et al. (2018)].
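For reference, the generic REINFORCE estimator [Williams (1992)] can be written as follows. The notation is an assumption made here rather than that of the MDP-DIV paper: $\theta$ denotes the policy parameters, $\pi_\theta(a_t \mid s_t)$ the probability of selecting document $a_t$ in state $s_t$, $G_t$ the return observed from step $t$ onward, $T$ the episode length, and $\eta$ a learning rate. Some presentations additionally weight each term by $\gamma^t$.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\pi_\theta}\!\left[ \sum_{t=0}^{T-1} G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right],
\qquad
\theta \leftarrow \theta + \eta \, G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)
```

The expectation is approximated by sampling complete rankings from the current policy, so selections that led to a high return become more probable after the update.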
3.3 Whole-Page Optimization
To improve user experience, modern search engines aggregate versatile results from different verticals: webpages, news, images, video, shopping, knowledge cards, local maps, etc. Page presentation is broadly defined as the strategy for presenting a set of items on a search result page (SERP), which is much more expressive than a ranked list. Finding a proper presentation for a gallery of heterogeneous results is critical for modern search engines. One approach to efficiently learning to optimize a large decision space is fractional factorial design. However, the method can cause a combinatorial explosion when the search space is large. In [Hill et al. (2017)], a bandit formulation is applied to explore the layout space efficiently, and hill-climbing is used to select optimal content in real time. The model avoids a combinatorial explosion in model complexity by only considering pairwise interactions between page components. This approach is a greedy alternating optimization strategy that can run online in real time. In [Wang et al. (2016), Wang et al. (2018)], a framework is proposed to learn the optimal page presentation for rendering heterogeneous results onto a SERP. It leverages the MDP setting, and the agent is designed as the algorithm that determines the presentation of page content on a SERP for each incoming search query. To solve the critical efficiency problem, it proposes a policy-based learning method that can rapidly choose actions from the high-dimensional space.
3.4 Session Search
Task-oriented search includes a series of search iterations triggered by query reformulations within a session. A Markov chain is observed in session search: the user’s judgment of the search results in the prior iteration influences the user’s behavior in the next search iteration. In [Luo et al. (2014)], session search is modeled as a dual-agent stochastic game based on a Partially Observable Markov Decision Process (POMDP). The authors mathematically model the dynamics of session search as a cooperative game between the user and the search engine, in which the user and the search engine work together to jointly maximize the long-term cumulative reward. Log-based document re-ranking is a special type of session search that re-ranks documents based on historical search logs, which include the target user’s personalized query log and other users’ search activities. The re-ranking aims to offer a better order of the initially retrieved documents [Zhang et al. (2014)]. Deep reinforcement learning has also been applied in e-commerce search engines [Hu et al. (2018), Feng et al. (2018)]. To better utilize the correlation between different ranking steps, RL is used to learn an optimal ranking policy that maximizes the expected accumulated reward in a search session [Hu et al. (2018)]. That work formally defines the multi-step ranking problem in the search session as an MDP, denoted as SSMDP, and proposes a novel policy gradient algorithm for learning an optimal ranking policy, which is able to deal with the problem of high reward variance and unbalanced reward distribution. In [Feng et al. (2018)], multi-scenario ranking is formulated as a fully cooperative, partially observable, multi-agent sequential decision problem, denoted as MA-RDPG. MA-RDPG has a communication component for passing messages, several private agents that take ranking actions, and a centralized critic for evaluating the overall performance of the co-working agents. Agents collaborate with each other by sharing a global action-value function and passing messages that encode historical information across scenarios.
4 Reinforcement Learning for Recommendation
Recommender systems aim to capture users’ preferences according to their feedback (or behaviors, e.g., ratings and reviews) and suggest items that match these preferences. In this section, we briefly review how RL is adapted to several key tasks in recommendation.
4.1 Exploitation/Exploration Dilemma
Traditional recommender systems suffer from the exploitation-exploration dilemma, where exploitation means recommending items that are predicted to best match users’ preferences, while exploration means recommending items randomly to collect more user feedback. The contextual bandit models an agent that attempts to balance the competing exploitation and exploration tasks in order to maximize the accumulated long-term reward over a considered period. The traditional strategies for balancing exploitation and exploration in the bandit setting are $\epsilon$-greedy [Watkins (1989)], EXP3 [Auer et al. (2002)], and UCB1 [Auer et al. (2002)]. In the news feed scenario, the exploration/exploitation problem of personalized news recommendation is modeled as a contextual bandit problem [Li et al. (2010)], and a learning algorithm, LinUCB, is proposed to select articles sequentially for specific users based on the users’ and articles’ contextual information, in order to maximize the total user clicks.
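As a hedged illustration of how an optimism-based strategy trades off the two goals, the sketch below implements the UCB1 index rule on simulated Bernoulli arms, where each arm could stand for a candidate item and the reward for a click. The click probabilities, horizon, and function name are assumptions; LinUCB extends the same idea with a linear model over context features, which is not shown here.

```python
import math
import random

def ucb1(arm_means, steps=5000, seed=0):
    """UCB1 on Bernoulli arms: pull each arm once, then maximize mean + bonus."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, steps + 1):
        if t <= n_arms:
            arm = t - 1   # initialization round: try every arm once
        else:
            # The bonus shrinks as an arm is pulled more, so rarely tried arms get revisited.
            arm = max(range(n_arms),
                      key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts

# Hypothetical click probabilities for three candidate items.
ucb_counts = ucb1([0.2, 0.5, 0.8])
print(ucb_counts)
```

Unlike purely random exploration, the confidence bonus directs exploration toward arms whose estimates are still uncertain, so clearly inferior arms are abandoned quickly.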
4.2 Temporal Dynamics
Most existing recommender systems, such as collaborative filtering, content-based, and learning-to-rank methods, have been extensively studied under the stationary environment (reward) assumption, where users’ preferences are assumed to be static. However, this assumption usually does not hold in reality since users’ preferences are dynamic, and thus the reward distributions usually change over time. In the bandit setting, a variable reward function is usually introduced to delineate the dynamic nature of the environment. For instance, a particle learning based dynamical context drift model is proposed to model the change of the reward mapping function in the multi-armed bandit problem, where the drift of the reward mapping function is learned as a group of random walk particles, and well-fitted particles are dynamically chosen to describe the mapping function [Zeng et al. (2016)]. A contextual bandit algorithm is presented that detects changes of the environment according to the reward estimation confidence and updates the arm selection policy accordingly [Wu et al. (2018)]. A change-detection based framework under the piecewise-stationary reward assumption for the multi-armed bandit problem is proposed in [Liu et al. (2018)], where upper confidence bound (UCB) policies are used to detect change points actively and restart the UCB indices. Another solution for capturing users’ dynamic preferences is to introduce the MDP setting [Chen et al. (2018), Liu et al. (2018), Zhao, Zhang, Ding, Xia, Tang, and Yin (Zhao et al.), Zou et al. (2019)]. Under the MDP setting, the state is introduced to represent the user’s preference, and the state transition captures the dynamic nature of the user’s preference over time. In [Zhao, Zhang, Ding, Xia, Tang, and Yin (Zhao et al.)], a user’s dynamic preference (the agent’s state) is learned from his/her browsing history. Each time the recommender system suggests an item to a user, the user browses this item and provides feedback (skip, click, or purchase), which reveals the user’s satisfaction with the recommended item. According to the feedback, the recommender system updates its state to represent the user’s new preferences [Zhao, Zhang, Ding, Xia, Tang, and Yin (Zhao et al.)].
4.3 Long Term User Engagement
User engagement in recommendation is the assessment of users’ desirable (even essential) responses to the items (products, services, or information) suggested by recommender systems [Lalmas et al. (2014)]. User engagement can be measured not only in terms of immediate responses (e.g., clicks and ratings of the recommended items), but more importantly in terms of long-term responses (e.g., repeat purchases) [Schopfer and Keller (Schopfer and Keller)]. In [Wu et al. (2017)], the problem of long-term user engagement optimization is formulated as a sequential decision making problem. In each iteration, the agent needs to estimate the risk of losing a user based on the user’s dynamic response to past recommendations. Then, a bandit-based method [Wu et al. (2017)] is introduced to balance the immediate user click and the expected future clicks when the user revisits the recommender system. In practical recommendation sessions, users sequentially access multiple scenarios, such as the entrance pages and the item detail pages, and each scenario has its own recommendation strategy. A multi-agent reinforcement learning based approach (DeepChain) is proposed in [Zhao et al. (2019)], which can capture the sequential correlation among different scenarios and jointly optimize multiple recommendation strategies. Specifically, a model-based reinforcement learning technique is introduced to reduce the training data requirement and execute more accurate strategy updates. In the news feed scenario [Zheng et al. (2018)], to incorporate more user feedback information, the long-term user response (i.e., how frequently the user returns) is considered as a supplement to the user’s immediate click behavior, and a Deep Q-Learning based framework is proposed to optimize the news recommendation strategies.
4.4 Page-Wise Recommendation
In practical recommender systems, users are typically recommended a page of items each time. In this setting, the recommender system needs to jointly (1) select a set of complementary and diverse items from a larger candidate item set and (2) form an item display (layout configuration) strategy that places the items in a 2D web page so as to maximize reward. Given the massive number of items, the action space is extremely large if we treat each whole-page recommendation as one action. To mitigate the issue of the large action space, a Deep Deterministic Policy Gradient (DDPG) algorithm is proposed [Dulac-Arnold et al. (2015)], where the Actor generates a deterministic optimal action according to the current state, and the Critic outputs the Q-value of this state-action pair. DDPG reduces the computational cost of conventional value-based reinforcement learning methods, and is thus a fitting choice for the whole-page recommendation setting [Cai et al. (2018a), Cai et al. (2018b)]. Several approaches have recently been presented to enhance efficiency [Choi et al. (2018), Chen et al. (2018)]. In [Zhao, Xia, Zhang, Ding, Yin, and Tang (Zhao et al.), Zhao et al. (2017)], CNN techniques are introduced to capture the item display patterns and users’ feedback on each item in the page. To represent each item, item embedding, category embedding, and feedback embedding are leveraged, which can help to generate complementary and diverse recommendations and capture users’ interests within the pages. Bandit techniques are also leveraged for whole-page recommendation [Wang et al. (2017), Lacerda (2017)]. For instance, the whole-page recommendation task is considered as a combinatorial semi-bandit problem, where the system recommends $S$ actions from a candidate set of $K$ actions and displays the selected items in $S$ (out of $M$) positions [Wang et al. (2017)].
5 Reinforcement Learning for Online Advertising
The goal of online advertising is to assign the right advertisements to the right users so as to maximize the revenue, click-through rate (CTR), or return on investment (ROI) of the advertising campaign. The two main marketing strategies in online advertising are guaranteed delivery (GD) and real-time bidding (RTB).
5.1 Guaranteed Delivery
In guaranteed delivery, advertisements that share a single idea and theme are grouped into campaigns, and are charged on a pay-per-campaign basis for a pre-specified number of deliveries (clicks or impressions) [Salomatin et al. (2012)]. Most popular GD solutions are based on offline optimization algorithms that are then adjusted for the online setup. However, deriving the optimal strategy to allocate impressions is challenging, especially when the environment is unstable in real-world applications. In [Wu et al. (2018)], a multi-agent reinforcement learning (MARL) approach is proposed to derive cooperative policies for the publisher to maximize its target in an unstable environment. The impression allocation problem is formulated as an auction problem in which each contract can submit virtual bids for individual impressions. With this formulation, the optimal impression allocation strategy is derived by solving for the optimal bidding functions of the contracts.
5.2 Real-Time Bidding
RTB allows an advertiser to submit a bid for each individual impression within a very short time frame. The ad selection task is typically modeled as a multi-armed bandit (MAB) problem under the setting that samples from each arm are i.i.d., feedback is immediate, and rewards are stationary [Yang and Lu (2016), Nuara et al. (2018), Gasparini et al. (2018), Tang et al. (2013), Xu et al. (2013), Yuan et al. (2013), Schwartz et al. (2017)]. The payoff functions of a MAB are allowed to evolve, but they are assumed to evolve slowly over time. On the other hand, in the circulation of an advertising campaign, display ads are regularly created while others are removed. The problem of multi-armed bandits with budget constraints and variable costs is studied in [Ding, Qin, Zhang, and Liu (Ding et al.)]. In this case, pulling an arm of the bandit yields a random reward at a random cost, and the algorithm aims to maximize the long-term reward by pulling arms under a constrained budget. This setting can model Internet advertising more precisely than previous works, in which pulling an arm is costless or has a fixed cost.
Under the MAB setting, the bid decision is considered as a static optimization problem of either treating the value of each impression independently or setting a bid price for each segment of ad volume. However, the bidding for a given ad campaign happens repeatedly during its life span before the budget runs out. Thus, the MDP setting has also been studied [Cai et al. (2017), Tang (2017), Wang et al. (2018), Zhao et al. (2018), Rohde et al. (2018), Wu et al. (2018), Jin et al. (2018)]. A model-based reinforcement learning framework is proposed to learn bid strategies in RTB advertising [Cai et al. (2017)], where a neural network is used to approximate the state value, which can better deal with the scalability problem of large auction volume and limited campaign budget. A model-free deep reinforcement learning method is proposed to solve the bidding problem with a constrained budget [Wu et al. (2018)]: the problem is modeled as a control problem, and RewardNet is designed to generate rewards, instead of using the immediate reward, in order to avoid the reward design trap. A multi-agent bidding model is presented that takes the other advertisers’ bidding in the system into consideration, and a clustering approach is introduced to handle the challenge of a large number of advertisers [Jin et al. (2018)].
6 Conclusion and Future Directions
In this article, we present an overview of information seeking from the reinforcement learning perspective. We first introduce the mathematical foundations of RL-based information seeking approaches. Then we review state-of-the-art algorithms for three representative information seeking mechanisms: search, recommendation, and advertising. Next, we discuss some interesting research directions on reinforcement learning that can bring information seeking research to a new frontier.
First, most existing works train a policy within one scenario, while overlooking users’ behaviors (preferences) in other scenarios [Feng et al. (2018)]. This results in a suboptimal policy, which calls for collaborative RL frameworks that consider search, recommendation, and advertising scenarios simultaneously. Second, the type of reward function varies among different computational tasks. More sophisticated reward functions should be designed to achieve more goals of information seeking, such as increasing the supervising degree of recommendations. Third, more types of user-agent interactions could be incorporated into RL frameworks, such as adding items to the shopping cart, users’ repeat purchase behavior, users’ dwell time in the system, and users’ chats with customer service representatives or AI dialog system agents. Fourth, testing a new algorithm is expensive, since deploying the algorithm in a practical system requires substantial engineering effort, and it may also have negative impacts on user experience if the algorithm is not mature. Thus, online environment simulators or offline evaluation methods based on historical logs are necessary to pre-train and evaluate new algorithms before launching them online. Finally, there is an increasing demand for an open online reinforcement learning environment for information seeking, which can advance the RL and information seeking communities and achieve better consistency between offline and online performance.
Acknowledgements
Xiangyu Zhao and Jiliang Tang are supported by the National Science Foundation (NSF) under grant numbers IIS-1714741, IIS-1715940 and CNS-1815636, and by a grant from the Criteo Faculty Research Award.
References
 Åström (1965) Åström, K. J. 1965. Optimal control of Markov processes with incomplete state information. Journal of Mathematical Analysis and Applications 10, 1, 174–205.
 Auer et al. (2002) Auer, P., Cesa-Bianchi, N., and Fischer, P. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47, 2-3, 235–256.
 Auer et al. (2002) Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. 2002. The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32, 1, 48–77.
 Baxter and Bartlett (2001) Baxter, J. and Bartlett, P. L. 2001. Infinite-horizon policy-gradient estimation. J. Artif. Intell. Res. 15, 319–350.
 Bellman (2013) Bellman, R. 2013. Dynamic programming. Courier Corporation.
 Bhatnagar et al. (2007) Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., and Lee, M. 2007. Incremental natural actor-critic algorithms. In NIPS ’07.
 Bhatnagar et al. (2009) Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., and Lee, M. 2009. Natural actor-critic algorithms. Automatica 45, 11, 2471–2482.
 Bowling and Veloso (2002) Bowling, M. H. and Veloso, M. M. 2002. Multiagent learning using a variable learning rate. Artif. Intell. 136, 2, 215–250.
 Busoniu et al. (2010) Busoniu, L., Babuska, R., De Schutter, B., and Ernst, D. 2010. Reinforcement learning and dynamic programming using function approximators. CRC press.
 Busoniu et al. (2008) Busoniu, L., Babuska, R., and Schutter, B. D. 2008. A comprehensive survey of multiagent reinforcement learning. IEEE Trans. Systems, Man, and Cybernetics, Part C 38, 2, 156–172.
 Cai et al. (2017) Cai, H., Ren, K., Zhang, W., Malialis, K., Wang, J., Yu, Y., and Guo, D. 2017. Real-time bidding by reinforcement learning in display advertising. In WSDM ’17.
 Cai et al. (2018a) Cai, Q., Filos-Ratsikas, A., Tang, P., and Zhang, Y. 2018a. Reinforcement mechanism design for e-commerce. In WWW ’18.
 Cai et al. (2018b) Cai, Q., Filos-Ratsikas, A., Tang, P., and Zhang, Y. 2018b. Reinforcement mechanism design for fraudulent behaviour in e-commerce. In AAAI ’18.
 Chang et al. (2006) Chang, C., Kayed, M., Girgis, M. R., and Shaalan, K. F. 2006. A survey of web information extraction systems. IEEE Trans. Knowl. Data Eng. 18, 10, 1411–1428.
 Chen et al. (2018) Chen, H., Dai, X., Cai, H., Zhang, W., Wang, X., Tang, R., Zhang, Y., and Yu, Y. 2018. Large-scale interactive recommendation with tree-structured policy gradient. CoRR abs/1811.05869.
 Chen et al. (2018) Chen, S., Yu, Y., Da, Q., Tan, J., Huang, H., and Tang, H. 2018. Stabilizing reinforcement learning in dynamic environment with application to online recommendation. In SIGKDD ’18.
 Choi et al. (2018) Choi, S., Ha, H., Hwang, U., Kim, C., Ha, J., and Yoon, S. 2018. Reinforcement learning based recommender system using biclustering technique. CoRR abs/1801.05532.
 Croft et al. (2010) Croft, W. B., Bendersky, M., Li, H., and Xu, G. 2010. Query representation and understanding workshop. SIGIR Forum 44, 2, 48–53.
 Deisenroth and Rasmussen (2011) Deisenroth, M. P. and Rasmussen, C. E. 2011. PILCO: A model-based and data-efficient approach to policy search. In ICML ’11.
 Ding et al. (2013) Ding, W., Qin, T., Zhang, X., and Liu, T. 2013. Multi-armed bandit with budget constraint and variable costs. In AAAI ’13.
 Dulac-Arnold et al. (2015) Dulac-Arnold, G., Evans, R., van Hasselt, H., Sunehag, P., Lillicrap, T., Hunt, J., Mann, T., Weber, T., Degris, T., and Coppin, B. 2015. Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679.
 Feng et al. (2018) Feng, J., Li, H., Huang, M., Liu, S., Ou, W., Wang, Z., and Zhu, X. 2018. Learning to collaborate: Multi-scenario ranking via multi-agent reinforcement learning. In WWW ’18.
 Feng et al. (2018) Feng, Y., Xu, J., Lan, Y., Guo, J., Zeng, W., and Cheng, X. 2018. From greedy selection to exploratory decision-making: Diverse ranking with policy-value networks. In SIGIR ’18.
 Garcia-Molina et al. (2011) Garcia-Molina, H., Koutrika, G., and Parameswaran, A. G. 2011. Information seeking: convergence of search, recommendations, and advertising. Commun. ACM 54, 11, 121–130.
 Gasparini et al. (2018) Gasparini, M., Nuara, A., Trovò, F., Gatti, N., and Restelli, M. 2018. Targeting optimization for internet advertising by learning from logged bandit feedback. In IJCNN ’18.
 Hill et al. (2017) Hill, D. N., Nassif, H., Liu, Y., Iyer, A., and Vishwanathan, S. V. N. 2017. An efficient bandit algorithm for realtime multivariate optimization. In SIGKDD ’17.
 Hu et al. (2018) Hu, Y., Da, Q., Zeng, A., Yu, Y., and Xu, Y. 2018. Reinforcement learning to rank in e-commerce search engine: Formalization, analysis, and application. In SIGKDD ’18.
 Jin et al. (2018) Jin, J., Song, C., Li, H., Gai, K., Wang, J., and Zhang, W. 2018. Real-time bidding with multi-agent reinforcement learning in display advertising. In CIKM ’18.
 Kaelbling et al. (1998) Kaelbling, L. P., Littman, M. L., and Cassandra, A. R. 1998. Planning and acting in partially observable stochastic domains. Artif. Intell. 101, 1-2, 99–134.
 Kaelbling et al. (1996) Kaelbling, L. P., Littman, M. L., and Moore, A. W. 1996. Reinforcement learning: A survey. J. Artif. Intell. Res. 4, 237–285.
 Kakade (2001) Kakade, S. 2001. A natural policy gradient. In NIPS ’01.
 Kapoor et al. (2018) Kapoor, S., Keswani, V., Vishnoi, N. K., and Celis, L. E. 2018. Balanced news using constrained bandit-based personalization. In IJCAI ’18.
 Katariya et al. (2017) Katariya, S., Kveton, B., Szepesvári, C., Vernade, C., and Wen, Z. 2017. Bernoulli rank-1 bandits for click feedback. In IJCAI ’17.
 Katariya et al. (2016) Katariya, S., Kveton, B., Szepesvári, C., and Wen, Z. 2016. DCM bandits: Learning to rank with multiple clicks. In ICML ’16.
 Konda and Tsitsiklis (1999) Konda, V. R. and Tsitsiklis, J. N. 1999. Actor-critic algorithms. In NIPS ’99.
 Kröse (1995) Kröse, B. J. A. 1995. Learning from delayed rewards. Robotics and Autonomous Systems 15, 4, 233–235.
 Kveton et al. (2015) Kveton, B., Szepesvári, C., Wen, Z., and Ashkan, A. 2015. Cascading bandits: Learning to rank in the cascade model. In ICML ’15.
 Lacerda (2017) Lacerda, A. 2017. Multi-objective ranked bandits for recommender systems. Neurocomputing 246, 12–24.
 Lagoudakis and Parr (2003) Lagoudakis, M. G. and Parr, R. 2003. Least-squares policy iteration. Journal of Machine Learning Research 4, 1107–1149.
 Lalmas et al. (2014) Lalmas, M., O’Brien, H., and Yom-Tov, E. 2014. Measuring User Engagement. Synthesis Lectures on Information Concepts, Retrieval, and Services. Morgan & Claypool Publishers.
 Li et al. (2010) Li, L., Chu, W., Langford, J., and Schapire, R. E. 2010. A contextual-bandit approach to personalized news article recommendation. In WWW ’10.
 Liu et al. (2018) Liu, F., Lee, J., and Shroff, N. B. 2018. A change-detection based framework for piecewise-stationary multi-armed bandit problem. In AAAI ’18.
 Liu et al. (2018) Liu, F., Tang, R., Li, X., Ye, Y., Chen, H., Guo, H., and Zhang, Y. 2018. Deep reinforcement learning based recommendation with explicit user-item interactions modeling. CoRR abs/1810.12027.
 Lu et al. (2010) Lu, T., Pál, D., and Pal, M. 2010. Contextual multi-armed bandits. In AISTATS ’10.
 Luo et al. (2014) Luo, J., Zhang, S., and Yang, H. 2014. Win-win search: dual-agent stochastic game in session search. In SIGIR ’14.
 Lv and Zhai (2009) Lv, Y. and Zhai, C. 2009. Adaptive relevance feedback in information retrieval. In CIKM ’09.
 Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M. A., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540, 529–533.
 Moore and Atkeson (1993) Moore, A. W. and Atkeson, C. G. 1993. Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning 13, 103–130.
 Nogueira et al. (2018) Nogueira, R., Bulian, J., and Ciaramita, M. 2018. Learning to coordinate multiple reinforcement learning agents for diverse query reformulation. CoRR abs/1809.10658.
 Nogueira and Cho (2017) Nogueira, R. and Cho, K. 2017. Task-oriented query reformulation with reinforcement learning. In EMNLP ’17.
 Nuara et al. (2018) Nuara, A., Trovò, F., Gatti, N., and Restelli, M. 2018. A combinatorial-bandit algorithm for the online joint bid/budget optimization of pay-per-click advertising campaigns. In AAAI ’18.
 Peters and Schaal (2008) Peters, J. and Schaal, S. 2008. Natural actor-critic. Neurocomputing 71, 7-9, 1180–1190.
 Peters et al. (2005) Peters, J., Vijayakumar, S., and Schaal, S. 2005. Natural actor-critic. In ECML ’05.
 Rohde et al. (2018) Rohde, D., Bonner, S., Dunlop, T., Vasile, F., and Karatzoglou, A. 2018. Recogym: A reinforcement learning environment for the problem of product recommendation in online advertising. CoRR abs/1808.00720.
 Rummery and Niranjan (1994) Rummery, G. A. and Niranjan, M. 1994. On-line Q-learning using connectionist systems. Vol. 37. University of Cambridge, Department of Engineering Cambridge, England.
 Salomatin et al. (2012) Salomatin, K., Liu, T., and Yang, Y. 2012. A unified optimization framework for auction and guaranteed delivery in online advertising. In CIKM ’12.
 Santos et al. (2015) Santos, R. L. T., MacDonald, C., and Ounis, I. 2015. Search result diversification. Foundations and Trends in Information Retrieval 9, 1, 1–90.
 Schopfer and Keller (2014) Schopfer, S. and Keller, T. 2014. Long term recommender benchmarking for mobile shopping list applications using markov chains. In RecSys ’14.
 Schwartz et al. (2017) Schwartz, E. M., Bradlow, E. T., and Fader, P. S. 2017. Customer acquisition via display advertising using multi-armed bandit experiments. Marketing Science 36, 4, 500–522.
 Shani et al. (2005) Shani, G., Heckerman, D., and Brafman, R. I. 2005. An MDP-based recommender system. Journal of Machine Learning Research 6, 1265–1295.
 Shoham et al. (2003) Shoham, Y., Powers, R., and Grenager, T. 2003. Multi-agent reinforcement learning: a critical survey. Tech. rep., Stanford University.
 Smallwood and Sondik (1973) Smallwood, R. D. and Sondik, E. J. 1973. The optimal control of partially observable markov processes over a finite horizon. Operations Research 21, 5, 1071–1088.
 Sondik (1978) Sondik, E. J. 1978. The optimal control of partially observable markov processes over the infinite horizon: Discounted costs. Operations Research 26, 2, 282–304.
 Sutton (1991) Sutton, R. S. 1991. Dyna, an integrated architecture for learning, planning, and reacting. SIGART Bulletin 2, 4, 160–163.
 Sutton and Barto (1998) Sutton, R. S. and Barto, A. G. 1998. Introduction to reinforcement learning. Vol. 135. MIT press Cambridge.
 Tang et al. (2013) Tang, L., Rosales, R., Singh, A., and Agarwal, D. 2013. Automatic ad format selection via contextual bandits. In CIKM ’13.
 Tang (2017) Tang, P. 2017. Reinforcement mechanism design. In IJCAI ’17.
 Varaiya and Walrand (1983) Varaiya, P. and Walrand, J. C. 1983. Multi-armed bandit problems and resource sharing systems. In Computer Performance and Reliability, Proceedings of the International Workshop, Pisa, Italy, September 26-30, 1983. 181–196.
 Vorobev et al. (2015) Vorobev, A., Lefortier, D., Gusev, G., and Serdyukov, P. 2015. Gathering additional feedback on search results by multi-armed bandits with respect to production ranking. In WWW ’15.
 Wang et al. (2018) Wang, W., Jin, J., Hao, J., Chen, C., Yu, C., Zhang, W., Wang, J., Wang, Y., Li, H., Xu, J., and Gai, K. 2018. Learning to advertise with adaptive exposure via constrained two-level reinforcement learning. CoRR abs/1809.03149.
 Wang et al. (2017) Wang, Y., Ouyang, H., Wang, C., Chen, J., Asamov, T., and Chang, Y. 2017. Efficient ordered combinatorial semi-bandits for whole-page recommendation. In AAAI ’17.
 Wang et al. (2016) Wang, Y., Yin, D., Jie, L., Wang, P., Yamada, M., Chang, Y., and Mei, Q. 2016. Beyond ranking: Optimizing whole-page presentation. In WSDM ’16.
 Wang et al. (2018) Wang, Y., Yin, D., Jie, L., Wang, P., Yamada, M., Chang, Y., and Mei, Q. 2018. Optimizing whole-page presentation for web search. TWEB 12, 3, 19:1–19:25.
 Watkins (1989) Watkins, C. J. C. H. 1989. Learning from delayed rewards. Ph.D. thesis, King’s College, Cambridge.
 Williams (1992) Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, 229–256.
 Wu et al. (2018) Wu, D., Chen, C., Yang, X., Chen, X., Tan, Q., Xu, J., and Gai, K. 2018. A multi-agent reinforcement learning method for impression allocation in online display advertising. CoRR abs/1809.03152.
 Wu et al. (2018) Wu, D., Chen, X., Yang, X., Wang, H., Tan, Q., Zhang, X., Xu, J., and Gai, K. 2018. Budget constrained bidding by model-free reinforcement learning in display advertising. In CIKM ’18.
 Wu et al. (2018) Wu, Q., Iyer, N., and Wang, H. 2018. Learning contextual bandits in a non-stationary environment. In SIGIR ’18.
 Wu et al. (2017) Wu, Q., Wang, H., Hong, L., and Shi, Y. 2017. Returning is believing: Optimizing long-term user engagement in recommender systems. In CIKM ’17.
 Xia et al. (2017) Xia, L., Xu, J., Lan, Y., Guo, J., Zeng, W., and Cheng, X. 2017. Adapting markov decision process for search result diversification. In SIGIR ’17.
 Xu and Li (2007) Xu, J. and Li, H. 2007. Adarank: a boosting algorithm for information retrieval. In SIGIR ’07.
 Xu et al. (2008) Xu, J., Liu, T., Lu, M., Li, H., and Ma, W. 2008. Directly optimizing evaluation measures in learning to rank. In SIGIR ’08.
 Xu et al. (2017) Xu, J., Xia, L., Lan, Y., Guo, J., and Cheng, X. 2017. Directly optimize diversity evaluation measures: A new approach to search result diversification. ACM TIST 8, 3, 41:1–41:26.
 Xu et al. (2013) Xu, M., Qin, T., and Liu, T. 2013. Estimation bias in multi-armed bandit algorithms for search advertising. In NIPS ’13.
 Yang and Lu (2016) Yang, H. and Lu, Q. 2016. Dynamic contextual multi arm bandits in display advertisement. In ICDM ’16.
 Yin et al. (2016) Yin, D., Hu, Y., Tang, J., Jr., T. D., Zhou, M., Ouyang, H., Chen, J., Kang, C., Deng, H., Nobata, C., Langlois, J., and Chang, Y. 2016. Ranking relevance in yahoo search. In SIGKDD ’16.
 Yuan et al. (2013) Yuan, S., Wang, J., and van der Meer, M. 2013. Adaptive keywords extraction with contextual bandits for advertising on parked domains. CoRR abs/1307.3573.

 Yue et al. (2007) Yue, Y., Finley, T., Radlinski, F., and Joachims, T. 2007. A support vector method for optimizing average precision. In SIGIR ’07.
 Zeng et al. (2016) Zeng, C., Wang, Q., Mokhtari, S., and Li, T. 2016. Online context-aware recommendation with time varying multi-armed bandit. In SIGKDD ’16.
 Zeng et al. (2017) Zeng, W., Xu, J., Lan, Y., Guo, J., and Cheng, X. 2017. Reinforcement learning to rank with markov decision process. In SIGIR ’17.
 Zeng et al. (2018) Zeng, W., Xu, J., Lan, Y., Guo, J., and Cheng, X. 2018. Multi page search with reinforcement learning to rank. In ICTIR ’18.
 Zhang et al. (2014) Zhang, S., Luo, J., and Yang, H. 2014. A POMDP model for content-free document re-ranking. In SIGIR ’14.
 Zhao et al. (2018) Zhao, J., Qiu, G., Guan, Z., Zhao, W., and He, X. 2018. Deep reinforcement learning for sponsored search real-time bidding. In SIGKDD ’18.
 Zhao et al. (2018) Zhao, X., Xia, L., Zhang, L., Ding, Z., Yin, D., and Tang, J. 2018. Deep reinforcement learning for page-wise recommendations. In RecSys ’18.
 Zhao et al. (2019) Zhao, X., Xia, L., Zhao, Y., Tang, J., and Yin, D. 2019. Model-based reinforcement learning for whole-chain recommendations. arXiv preprint arXiv:1902.03987.
 Zhao et al. (2018) Zhao, X., Zhang, L., Ding, Z., Xia, L., Tang, J., and Yin, D. 2018. Recommendations with negative feedback via pairwise deep reinforcement learning. In SIGKDD ’18.
 Zhao et al. (2017) Zhao, X., Zhang, L., Ding, Z., Yin, D., Zhao, Y., and Tang, J. 2017. Deep reinforcement learning for list-wise recommendations. arXiv preprint arXiv:1801.00209.
 Zheng et al. (2018) Zheng, G., Zhang, F., Zheng, Z., Xiang, Y., Yuan, N. J., Xie, X., and Li, Z. 2018. DRN: A deep reinforcement learning framework for news recommendation. In WWW ’18.
 Zou et al. (2019) Zou, L., Xia, L., Ding, Z., Yin, D., Song, J., and Liu, W. 2019. Reinforcement learning to diversify recommendations. In DASFAA ’19.