1 Introduction
Over the past decade, contextual-bandit algorithms have been successfully deployed in a number of industrial applications, such as personalized recommender systems li2010contextual ; tang2014ensemble ; tang2015personalized , advertisement personalization bouneffouf2012contextual ; tang2013automatic , and learning-to-rank slivkins2013ranked . Because these algorithms minimize online regret, they are a natural choice when one needs to minimize the cumulative cost during online learning or to address the tradeoff between exploitation and exploration.
In the standard contextual-bandit setting langford2008epoch , nature and the player play a repeated game. Nature defines a reward function mapping contexts (sets of features) to real-valued rewards, which is not known a priori to the player. At each step, nature reveals a set of arms, each with a context. The player observes the contexts, selects one arm, and then receives a reward. The goal of the player is to minimize the cumulative regret or to maximize the cumulative reward.
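The repeated game described above can be sketched in a few lines. This is a minimal illustration only: the names `play`, `hidden_reward`, `first_arm`, and `oracle` are hypothetical, contexts are assumed to be uniform random vectors, and the reward is assumed to be a noiseless function of the chosen context.

```python
import random
from typing import Callable, List

Context = List[float]
Policy = Callable[[List[Context]], int]


def play(policy: Policy, reward_fn: Callable[[Context], float],
         num_arms: int, dim: int, steps: int, seed: int) -> float:
    """Run the repeated game and return the player's cumulative reward."""
    nature = random.Random(seed)  # nature's source of contexts
    total = 0.0
    for _ in range(steps):
        # Nature reveals one context (feature vector) per arm.
        contexts = [[nature.random() for _ in range(dim)] for _ in range(num_arms)]
        # The player observes all contexts and selects one arm.
        arm = policy(contexts)
        # Only the reward of the selected arm is received.
        total += reward_fn(contexts[arm])
    return total


def hidden_reward(context: Context) -> float:
    """Hypothetical reward function; the player does not know it a priori."""
    return sum(context) / len(context)


def first_arm(contexts: List[Context]) -> int:
    """A naive player that ignores the contexts."""
    return 0


def oracle(contexts: List[Context]) -> int:
    """An idealized player that knows the reward function."""
    return max(range(len(contexts)), key=lambda i: hidden_reward(contexts[i]))
```

Because both players face the same sequence of contexts when the seed is shared, the gap between `play(oracle, ...)` and `play(first_arm, ...)` is the cumulative regret of the naive player.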
There are plenty of algorithms for contextual bandits, most of which are value-based. The simplest heuristic is the $\epsilon$-greedy method and its variants, which employ a function approximator to estimate the value (expected reward) of choosing each arm, exploit the arm with the largest estimated value with probability $1-\epsilon$, and explore randomly with probability $\epsilon$, where $\epsilon$ is a constant or diminishing positive value. It has been widely used in Reinforcement Learning (RL) algorithms such as Deep Q-Network, which achieves human-level performance at playing Atari 2600 games mnih2015human . However, $\epsilon$-greedy is known as a local exploration method whose cumulative regret is linear in the number of trials, since there is always a positive probability of choosing a suboptimal arm.
As a result, directed exploration methods are desired, and many successful methods have been developed, such as Upper Confidence Bounds (UCB), Thompson Sampling (TS), and their variants. When the expected reward is linear in the context, li2010contextual ; chu2011contextual ; abbasi2011improved proposed LinUCB, and chapelle2011empirical ; agrawal2013thompson used TS with linear priors. For nonlinear cases, filippi2010parametric proposed GLM-UCB using generalized linear models, and krause2011contextual ; srinivas2012information used Gaussian Processes to model the reward functions. UCB and TS methods are known to achieve sublinear regret auer2002finite ; abe2003reinforcement ; li2010contextual ; chu2011contextual ; abbasi2011improved ; chapelle2011empirical ; agrawal2013thompson ; filippi2010parametric ; krause2011contextual ; srinivas2012information . The idea of UCB has also been applied to a model-based RL algorithm called UCRL, with regret bounds jaksch2010near .
However, the applicability of these methods is heavily limited, especially for large-scale problems with high-dimensional contexts and complicated reward functions, for the following reasons:

First, these methods tend to oversimplify the form of the reward. Although the (generalized) linear and Gaussian Process cases have been studied, the vast majority of general reward functions remain unsolved.

Second, it is difficult to obtain an accurate estimate of the posterior distribution at each time step. Closed-form solutions exist for small-scale linear bandits if the design matrix is not singular or ill-posed, but not for general cases.

Third, it is often assumed that the reward of an arm is uniquely determined by the context, and that the distribution of contexts is independent of the agent's actions. However, this may not hold in real-world applications such as recommender systems, where the behavior of users depends heavily on the history, i.e., the items that they viewed in previous rounds.

Last but not least, value-based methods are meant to find deterministic policies, which loses generality since, in reality, optimal policies are sometimes stochastic (e.g., a deterministic dialogue generation system is never considered desirable). Another limitation is that a subtle change in the Q-function may cause a discontinuous jump in the resulting policy, which makes it hard for these iteration-based algorithms to converge sutton2000policy .
In light of these observations, we propose Policy Gradients for Contextual Bandits (PGCB), which uses a policy gradient method to solve general contextual bandits without unrealistic assumptions or prior knowledge, and which achieves fast convergence and low regret through two useful heuristics called Time-Dependent Greed and Actor-Dropout.
Unlike previous value-based work, we show that from a policy-based perspective, the performance objective of an algorithm is determined only by its marginal expected probability of choosing each arm (in expectation over the other arms). PGCB therefore adopts a class of policies in which the expected probability of choosing an action has a simple closed form and can be estimated efficiently, so that the search space is dramatically reduced. By optimizing directly in policy space and predicting a stochastic policy, it naturally fits problems that require randomized actions.
The proposed method uses an actor network to predict the policy and a critic network to estimate the value of choosing each arm, similar to sutton2000policy ; silver2014deterministic . Given the expressive power of deep neural networks, a broad range of reward functions can be approximated. We then show that the policy gradient can be estimated by sampling contexts from the history trajectories. PGCB naturally extends the experience replay technique adam2012experience ; heess2015memory to a finer-grained sampling procedure. The network weights are updated at each step by stochastic gradient descent with mini-batch training, so the method is computationally efficient. We present compatible conditions for the Q-function approximation and prove that the gradient is unbiased under these conditions. We prove in the supplementary material that the variance of the gradient is lower than that of vanilla policy gradient algorithms.
Directed exploration and Greedy in the Limit with Infinite Exploration (GLIE) are obtained through Time-Dependent Greed and Actor-Dropout. Time-Dependent Greed increases the level of greed over time, since we would like a stochastic policy to explore a lot in the early stage and then converge to a greedy policy. Actor-Dropout applies dropout to the actor network during both training and prediction, so that dropout approximates Bayesian estimation and Monte Carlo sampling. It provides PGCB with directed exploration, rather than local exploration as in $\epsilon$-greedy. The two techniques may be of independent interest for similar problems.
Furthermore, PGCB applies directly to contextual bandits in a Markov Decision Process (MDP) setting, i.e., with states and state transitions. Suppose that, at each step, contexts are drawn i.i.d. from a distribution conditional on the current state. When an arm is chosen, the immediate reward is determined by both the state and the selected context, and the state then transitions to the next state. Such a model is tailored to a wide range of important realistic applications, such as personalized recommender systems, where users' preferences are regarded as states and items are regarded as arms with contexts shani2005mdp ; taghipour2008hybrid , natural language generation, where queries (or previous sentences) are regarded as states and the corresponding candidate replies (or next sentences) are regarded as arms with contexts yu2017seqgan ; zhou2018elastic , and e-commerce, where the private information (e.g., cost, reputation) of sellers can be viewed as states and different commercial strategies are regarded as contexts cai2018reinforcement .
We evaluate PGCB on toy datasets and a realistic dataset of music recommendation. Compared with baseline methods including LinUCB, GLM-UCB, Thompson Sampling, $\epsilon$-greedy, and vanilla policy gradients, PGCB converges quickly and achieves the lowest cumulative regret and the highest average reward in various standard contextual-bandit settings. Moreover, when states and state transitions are included in the real-world recommender environment, GLM-UCB and TS fail to incorporate information from states, while PGCB consistently outperforms the other baselines.
2 Problem formulation
We first introduce the standard contextual-bandit problem. At each step, we have a set of contexts that corresponds to arms, where is the context of the arm. The contexts are independently and identically distributed random variables with outcome space . The action is to select an arm in . Let denote the context of the selected arm. The immediate reward is denoted by , where is a function that takes the context as input and outputs a random reward, which is not known to the player a priori. For ease of notation, we use to denote the matrix of all contexts, and use to denote the one chosen by action . A policy is a function that maps the contexts to a distribution over actions. We denote the action determined by policy by a random variable , regardless of whether the policy is stochastic or deterministic. The performance of a policy is measured, as usual, by the expected reward of the chosen arm over all possible contexts:
(1) 
When the policy is parameterized as , our learning task is to learn that maximizes .
Next, we introduce the generalization of contextual bandits to an MDP setting with states and state transitions, which is referred to as MDP-CB. At each step , the player observes its state as well as a set of contexts correlated to the state . We assume that the distributions of contexts are independent conditional on the state: for all , where is the probability density of contexts given state . When an action is selected, a reward is received and the state transitions to the next state according to a Markovian state transition probability . Note that the setting in this paper is different from generalized bandits with transitions, such as restless bandits whittle1988restless and other works.
The goal is to find a policy that maximizes the expected cumulative discounted reward, so the objective is where () is a discount factor that balances short-term and long-term rewards.
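To make the role of the discount factor concrete, the following sketch computes the discounted cumulative reward of one observed reward sequence. The function name `discounted_return` and the argument name `gamma` are illustrative choices, not notation taken from the paper.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of rewards where the reward at step t is weighted by gamma ** t."""
    total = 0.0
    # Accumulate backwards so each earlier step discounts everything after it once.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

A discount factor close to 0 makes the objective depend almost only on the immediate reward, while a value close to 1 weights later rewards nearly as much as early ones.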
As in previous work on policy gradients sutton2000policy ; silver2014deterministic , we denote by the probability density at state after transitioning for time steps from state . We assume that the environments satisfy the property that, for any policy , the discounted distribution of states is always stationary. We denote the discounted state density by , where is the probability density of initial states. Since the reward and the state transition are determined by the state and the chosen context, we define the action-value function . The objective can then be rewritten as
(2) 
3 Policy Gradients for Contextual Bandits
In this section we investigate several key features of our proposed PGCB method. We discuss the standard case of contextual bandits, as well as how to extend it to the generalized MDP-CB.
3.1 Marginal probability for choosing an arm and policy gradients
Because the reward in a bandit problem depends only on the selected context, we claim that for any policy , there exists a permutation invariant policy that performs at least as well. Please refer to the supplementary material for the proof.
Definition 1 (Permutation invariant policy).
A policy is said to be permutation invariant if, for all and any of its permutations , it satisfies
Lemma 1.
For any policy , there exists a permutation invariant policy s.t. .
Lemma 1 states that we can, without loss of generality, focus on permutation invariant policies. The objective is then
(3) 
where is the marginal probability of choosing an arm with context (in expectation of randomness of the other arms, denoted as ), by a permutation invariant policy:
(4) 
Suppose we have a score function which takes the context as inputs and outputs a score, where are the parameters. We can construct a class of permutation invariant policies with the score function, parameterized by :
(5) 
where is an operator that satisfies permutation invariance.
Note that this class includes the policies of most well-known value-based bandit algorithms. For example, if the score function is the estimate of the reward, and chooses the arm with the maximum estimated reward with probability $1-\epsilon$ and chooses randomly with probability $\epsilon$, the policy is exactly the well-known $\epsilon$-greedy policy sutton1998reinforcement . If the score function is the sum of the reward estimate and an upper confidence bound, and chooses the arm with the maximum score, the result is the well-known upper confidence bound (UCB) algorithm auer2002finite ; li2010contextual .
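The two examples can be read as the same recipe, a per-arm score followed by a selection rule that does not depend on the order of the arms. The sketch below illustrates this under the assumption that reward estimates and confidence bonuses are already available as plain lists; `epsilon_greedy` and `ucb_choice` are hypothetical helper names.

```python
import random
from typing import Sequence


def argmax(scores: Sequence[float]) -> int:
    """Index of the arm with the largest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


def epsilon_greedy(estimates: Sequence[float], epsilon: float,
                   rng: random.Random) -> int:
    """Score = reward estimate; explore uniformly with probability epsilon."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))
    return argmax(estimates)


def ucb_choice(estimates: Sequence[float], bonuses: Sequence[float]) -> int:
    """Score = reward estimate plus confidence bonus; always take the maximum."""
    scores = [e + b for e, b in zip(estimates, bonuses)]
    return argmax(scores)
```

With `epsilon = 0` the first rule is purely greedy, and an arm with a lower estimate but a large bonus can still be selected by the second rule, which is the directed exploration that UCB provides.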
The policy gradient for the standard contextual bandits can be directly derived from (3)
(6) 
To extend to MDP-CB, we use to denote the augmented context formed by pairing a state with a single context , i.e., . Given a policy , the states can be roughly thought of as drawn from the discounted stationary distribution . As we have already defined the density of contexts given the state as , we obtain the discounted density of the augmented context, by the axioms of probability, as . Since we assume the state distribution is stationary, it is natural that is also stationary.
Then by applying the same technique as we derive the marginal probability, we derive the performance objective as follows:
(7) 
where is the marginal probability of choosing given .
Now we derive the gradient of , similar to the result in sutton2000policy . The proof is deferred to the supplementary material. Surprisingly, the result only replaces in (6) by , as
Theorem 2 (Policy gradient bandits theorem).
Assuming the policy leads to stationary distributions for states and contexts, the unbiased policy gradient is
(8) 
where is the state-action value, and is the discounted density of .
However, the marginal probability of choosing an arm is not explicitly known for an arbitrary policy . We therefore put forward a family of stochastic policies for which this marginal probability has a closed form and the gradient of can be estimated efficiently.
3.1.1 Subclass of policies with closed-form marginal probabilities of choosing arms
We now propose a class of policies for our PGCB algorithm, show how to estimate the marginal probability of choosing an arm for this class, and show how to estimate the policy gradient efficiently. For standard contextual bandits, following the form of a policy described in (5), we define a class of stochastic policies, denoted by , as
(9) 
where is a normalization and Multinoulli returns a multinoulli random variable. The form of our policy (9) generalizes several important policies in reinforcement learning. For example, when is an exponential function , it reduces to the well-known softmax policy, which trades off exploitation against exploration. If approaches infinity, it converges to an argmax policy that chooses the arm with the highest score. For any policy ,
(10) 
which is a continuous positive function of the parameters . It is therefore straightforward to estimate by sampling from the contexts that the player has seen before. We denote by the contexts that have appeared up to step . The estimate of is unbiased because, as assumed, all contexts in are i.i.d. samples from the context space.
For MDP-CB, it is straightforward to extend (9) to include states by replacing with .
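The sampling-based estimate can be sketched as follows. The sketch assumes the softmax instance of the policy class, in which an arm is chosen with probability proportional to the exponential of its score, and it assumes that the marginal probability of a context is the average of that choice probability over sets of competing contexts resampled from history. The names `choice_probabilities` and `estimate_marginal`, and the arguments `score_fn` and `num_samples`, are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence

Context = Sequence[float]


def choice_probabilities(scores: Sequence[float]) -> List[float]:
    """Softmax over arm scores, shifted by the maximum for numerical stability."""
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    normalizer = sum(weights)
    return [w / normalizer for w in weights]


def estimate_marginal(context: Context, history: List[Context],
                      score_fn: Callable[[Context], float], num_arms: int,
                      num_samples: int, rng: random.Random) -> float:
    """Monte Carlo estimate of the probability that `context` is chosen,
    averaged over competing contexts resampled from the history."""
    total = 0.0
    for _ in range(num_samples):
        # Draw the other num_arms - 1 contexts from previously seen contexts.
        others = rng.sample(history, num_arms - 1)
        scores = [score_fn(context)] + [score_fn(c) for c in others]
        total += choice_probabilities(scores)[0]
    return total / num_samples
```

Because the estimate only needs stored contexts and the current score function, it can be recomputed at every step without any new interaction with the environment.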
3.1.2 Actor-critic with compatible function approximations
For standard contextual bandits, the most direct way to estimate the reward function is to apply supervised learning to find an estimator with parameter minimizing the mean squared error, i.e.,
(11) 
where is the set of chosen contexts and is the received reward for choosing context . This is how most value-based bandit methods work, such as $\epsilon$-greedy, LinUCB, and Thompson Sampling. However, we argue that in a policy-based framework this introduces bias. Since our goal is to maximize the expected reward rather than to minimize the empirical loss as in supervised learning, the marginal probabilities of choosing an arm must be taken into consideration, and the form of cannot be chosen arbitrarily. Similarly, when states and state transitions are involved in MDP-CB, we also need to find an appropriate to approximate . Similar to sutton2000policy ; silver2014deterministic , we define the following compatible conditions to ensure that the policy gradient is orthogonal to the error in value approximation. The proof is deferred to the supplementary material.
Theorem 3.
The policy gradient using function approximation
(12) 
is unbiased to (8) if the following conditions are satisfied:
(i) the gradients for the value function and the policy are compatible,
(ii) the value function parameters reach a local minimum of the mean squared error over the stationary context distribution such that
3.2 The basic PGCB algorithm
We now formally propose the policy gradient algorithm for general contextual-bandit problems, which we call PGCB. Recall that our policy returns a Multinoulli random variable which chooses by
The key step in the updates is to estimate the marginal expected probability for each arm. When estimating it for some context, say , in some state , we resample from the other contexts to get another contexts from the same state. Similar to previous actor-critic algorithms lillicrap2015continuous , we can use Sarsa updates sutton1998reinforcement to estimate the action-value function and then update the policy parameters, respectively, by the following policy gradient algorithm for contextual bandits,
(13)  
(14)  
(15)  
(16)  
(17) 
In practice, the gradients can be applied on mini-batches by modern optimizers, from vanilla stochastic gradient descent to the Adam optimizer kingma2014adam , which we use in our experiments. PGCB therefore fits naturally with deep Reinforcement Learning and Online Learning, and techniques from these areas may also be applied. Note that PGCB also applies to the standard setting without states.
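As an illustration of the critic side of such an update, the sketch below performs one textbook Sarsa step for a linear action-value approximation over augmented-context features. It is a generic sketch rather than the exact update of PGCB: the linear critic, the feature vectors, and the names `q_value` and `sarsa_update` are assumptions made for the example.

```python
from typing import List, Sequence


def q_value(weights: Sequence[float], features: Sequence[float]) -> float:
    """Linear critic: the value estimate is a dot product of weights and features."""
    return sum(w * x for w, x in zip(weights, features))


def sarsa_update(weights: Sequence[float], features: Sequence[float],
                 reward: float, next_features: Sequence[float],
                 gamma: float, step_size: float) -> List[float]:
    """One Sarsa step: move the critic toward reward + gamma * next value."""
    td_error = (reward + gamma * q_value(weights, next_features)
                - q_value(weights, features))
    # For a linear critic, the gradient of the value is the feature vector.
    return [w + step_size * td_error * x for w, x in zip(weights, features)]
```

Here `features` stands for the state paired with the chosen context at the current step, and `next_features` for the pair actually chosen at the next step, which is what makes the update on-policy.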
3.3 Two useful heuristics: Time-Dependent Greed and Actor-Dropout
Greedy in the Limit with Infinite Exploration (GLIE) is the basic criterion desired for bandit algorithms. GLIE means exploring all the actions infinitely many times and then converging to a greedy policy that reaches the globally optimal reward if run for long enough. Value-based methods can satisfy GLIE if a positive but diminishing exploration value is given to all the actions. For policy-based methods, however, this is not straightforward, because one cannot explicitly show the exploration level of a stochastic policy.
For PGCB, in contrast, it is easy to obtain GLIE through Time-Dependent Greed, which applies a Time-Dependent Greed factor to the scoring function . A straightforward usage is to let , where is a predetermined positive constant and is the current time step. When , the policy tends to choose only the arm with the largest score. Moreover, the marginal probability remains positive under the assumption that for all contexts , so any arm has infinitely many chances to be explored if the algorithm runs for long enough. This technique can also be applied to other policy-based RL methods.
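One way to picture a greed factor that grows with time is a softmax whose sharpness increases with the step count. The sketch below assumes a greed factor equal to the current step divided by a fixed positive constant, applied as a multiplier on the scores inside the exponential; this particular form, and the names `greedy_probabilities` and `greed_scale`, are assumptions for illustration and may differ from the schedule used by PGCB.

```python
import math
from typing import List, Sequence


def greedy_probabilities(scores: Sequence[float], step: int,
                         greed_scale: float) -> List[float]:
    """Softmax over scores whose sharpness grows linearly with the time step."""
    greed = step / greed_scale  # assumed schedule: 0 at the start, unbounded later
    top = max(scores)
    # Subtracting the maximum keeps the exponentials in a safe numeric range.
    weights = [math.exp(greed * (s - top)) for s in scores]
    normalizer = sum(weights)
    return [w / normalizer for w in weights]
```

At step 0 the policy is uniform, so every arm is explored; as the step count grows, the probability mass concentrates on the arm with the largest score while every arm keeps a positive probability in exact arithmetic.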
Directed exploration is also desired in contextual bandits. UCB and TS methods are well known for directed exploration: they automatically trade off exploration against exploitation and achieve sublinear total regret. The basic insight of UCB and TS is to learn the model uncertainty during the online decision-making process and to explore the arms with larger uncertainty (or higher potential to yield a large reward). Often a Bayesian framework is used to model the uncertainty by estimating the posterior distributions. However, these methods are limited in that they require prior knowledge of the reward or parameter distributions.
We therefore propose Actor-Dropout for PGCB to achieve directed exploration. The idea is simple: use dropout both when training the actor network and when predicting the policy. It has been shown that a neural network with dropout applied before its weight layers is mathematically equivalent to an approximation of the probabilistic deep Gaussian process gal2016dropout . Actor-Dropout thus naturally learns the uncertainty and performs Monte Carlo sampling when predicting the policy, simply by using dropout. Since Actor-Dropout needs no prior knowledge, it applies to more general cases than UCB and TS.
In practice, adding dropout to just one layer of weights is good enough for exploration. For example, for a fully connected actor network, one can apply dropout to the weights before the output layer, with a dropout ratio of 0.5 or 0.67. This can be understood as training several actors and randomly picking one at each step, which trades off exploration against exploitation since each actor learns something different from the others. We also found Actor-Dropout worth trying for other RL or Online Learning tasks in the exploration phase.
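A minimal sketch of the prediction side of this idea is given below for a tiny fully connected actor with one hidden layer. Dropout is kept active at prediction time, so each forward pass uses a random sub-network. The layer sizes, the ReLU activation, the inverted-dropout rescaling, and the name `actor_score` are assumptions made for the example.

```python
import random
from typing import Sequence


def actor_score(context: Sequence[float],
                hidden_weights: Sequence[Sequence[float]],
                output_weights: Sequence[float],
                dropout_rate: float, rng: random.Random) -> float:
    """Score one context with dropout applied to the units feeding the output layer.

    dropout_rate must be in [0, 1); surviving units are rescaled by 1 / keep.
    """
    # Hidden layer with ReLU activation.
    hidden = [max(0.0, sum(w * x for w, x in zip(row, context)))
              for row in hidden_weights]
    keep = 1.0 - dropout_rate
    # Dropout stays on at prediction time: each call samples a new mask.
    masked = [h / keep if rng.random() < keep else 0.0 for h in hidden]
    return sum(w * h for w, h in zip(output_weights, masked))
```

Calling `actor_score` repeatedly on the same context gives a spread of scores, and that spread plays the role of the uncertainty that UCB and TS obtain from explicit posteriors.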
In Figure 1(d), we show that in experiments Actor-Dropout significantly helps PGCB to converge. The growth rate of cumulative regret for PGCB without Actor-Dropout is similar to that of $\epsilon$-greedy, indicating that the original algorithm fails to converge and has linear regret. When equipped with Actor-Dropout, the regret is smaller; in particular, when the dropout rate is set to , the growth rate of PGCB's regret is similar to that of GLM-UCB, which can theoretically achieve sublinear regret. Empirically, then, Actor-Dropout is a strong tool for helping PGCB converge, even with almost no assumptions on the problem.
4 Experiments
4.1 Experiments on toy datasets of standard contextual bandits
We test PGCB on three standard contextual-bandit problems. We simulate a contextual-bandit environment with arms at each step, where each arm is represented by a dimensional context sampled i.i.d. uniformly from a unit cube , . Once an arm with context is chosen by the player, the environment returns a reward . We test three reward functions: (a) the linear reward with Gaussian noise, as ; (b) the Bernoulli reward, as , where is the probability of returning reward for choosing ; (c) the mixed reward, which returns a random linear reward with probability and a zero reward with probability , as a mixture of linear and Bernoulli rewards. and are coefficients unknown to the player a priori. and are white noise terms that introduce some randomness. We use the cumulative regret as the evaluation metric, defined as the cumulative difference between the reward received and the reward of the optimal arm.
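The evaluation metric can be sketched for the linear-reward case as follows. The sketch tracks the expected regret, i.e., it compares the expected reward of the chosen arm with that of the best arm, so the zero-mean noise term drops out. The number of arms, the coefficient vector, and the names `regret_curve` and `dot` are hypothetical, and the policy here is static, with no learning from feedback.

```python
import random
from typing import Callable, List, Sequence

Context = List[float]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def regret_curve(policy: Callable[[List[Context]], int],
                 coefficients: Sequence[float], num_arms: int,
                 steps: int, seed: int) -> List[float]:
    """Cumulative expected regret after each step under a linear reward."""
    rng = random.Random(seed)
    curve: List[float] = []
    total = 0.0
    for _ in range(steps):
        # Contexts are drawn i.i.d. uniformly from the unit cube.
        contexts = [[rng.random() for _ in coefficients] for _ in range(num_arms)]
        expected = [dot(coefficients, c) for c in contexts]
        arm = policy(contexts)
        # Per-step regret: best expected reward minus that of the chosen arm.
        total += max(expected) - expected[arm]
        curve.append(total)
    return curve
```

A curve that keeps growing at a constant rate corresponds to linear regret, while a curve that flattens out indicates that the policy has converged to choosing near-optimal arms.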
We compare PGCB with the following algorithms:

$\epsilon$-greedy: It estimates the reward with a network. It chooses the arm with the largest estimated value with probability $1-\epsilon$ and chooses randomly otherwise.

LinUCB: The widely studied version of UCB for linear rewards by li2010contextual ; chu2011contextual ; abbasi2011improved , which uses a linear function to approximate the reward and chooses the arm with the maximum sum of the estimated reward and the estimated confidence bound.

GLM-UCB: The UCB method for generalized linear rewards proposed by filippi2010parametric , which can handle nonlinear rewards if the reward function can be fitted by a generalized linear model of the contexts, such as Bernoulli rewards and Poisson rewards.

Thompson Sampling: It uses the same function approximation as LinUCB or GLM-UCB for linear and nonlinear rewards, respectively. It samples from the posterior distribution of the parameters, estimates each arm's value, and chooses the arm with the maximum estimate chapelle2011empirical ; agrawal2013thompson .
The experimental setup is as follows. For PGCB, fully connected networks with a hidden layer of nodes are used. At each step we sample batches of size and optimize the loss with the Adam gradient descent algorithm kingma2014adam . For LinUCB, GLM-UCB, and Thompson Sampling, we use the same training procedures as suggested in li2010contextual . For $\epsilon$-greedy, we use the same value function approximation as PGCB, and $\epsilon$ is set to . We run each algorithm 20 times and average the cumulative regrets. Results are shown in Figure 1. PGCB converges faster and has lower regret in these cases, while the UCB and TS methods sometimes converge more slowly and $\epsilon$-greedy fails to converge.
4.2 Experiments on a music recommendation dataset
We test PGCB on a real-world music recommendation dataset provided by KKBox and open-sourced on Kaggle.com (https://www.kaggle.com/c/kkbox-music-recommendation-challenge/). The challenge is to predict the chances of a user listening to a song repeatedly. We construct two simulators based on the distributions of the dataset with different settings: one without explicit states, the other with states and state transitions. At each time step, a user comes to the system. We set the last 3 songs the system previously recommended to the user, together with the corresponding feedback (listened or not), as the current state. The recommender system selects one song from songs randomly sampled from the user's listening history and recommends it to the user. If the user listens to it again (this is the original target for the supervised learning dataset), the system gets a reward ; otherwise it gets a reward . Each song has a context vector of size , including information about the song's genre, artists, composers, and language. Each simulation consists of 5 million time steps, and each simulation is repeated 5 times. Since the optimal choice is unknown, we use the average reward as the performance metric.
The experimental setup in the setting without states is as follows. PGCB uses networks with two hidden layers of sizes 60 and 20. $\epsilon$-greedy has exactly the same network structure as PGCB. Both PGCB and $\epsilon$-greedy are trained with the Adam algorithm with the same learning rate on random batches of size . GLM-UCB is tested here because of the Bernoulli rewards. As shown in Figure 2(a), PGCB outperforms the other algorithms. Traditional contextual-bandit methods learn well from the beginning, which indicates that they are good at exploration, but their average rewards quickly stop increasing due to the limited fitting power of generalized linear models. $\epsilon$-greedy outperforms GLM-UCB and TS after a long run, but it learns poorly at the beginning. Compared with these algorithms, PGCB has the best performance from the beginning to the end of the learning process.
Next, we describe the experimental setup with states. We enlarge the first hidden layer of PGCB from 60 to 90 units because the network now takes both the contexts and the states as input. UCB and TS here take the augmented contexts as inputs, with generalized-linear-model priors. We also test vanilla PG sutton2000policy as a baseline, with the same network structure and training details as PGCB. The result of the experiment is shown in Figure 2(b). PGCB outperforms the other algorithms by a larger margin than in the previous experiment. Interestingly, both UCB and TS get almost the same average rewards as in the previous experiment, which indicates that they can hardly make any use of the information from states. PGCB learns faster and achieves state-of-the-art performance in this task.
5 Conclusion
This paper has studied how to use the actor-critic algorithm with neural networks to solve general contextual bandits, including the standard case and MDP-CB. We first show that the class of permutation invariant policies is sufficient for our problem, and then derive that the performance of a policy depends on its marginal expected probability of choosing each arm. We next propose a subclass of policies in which the objective has a simple closed form and is differentiable with respect to the parameters. We prove that policies in this class have a succinct form of gradient if the actor and the critic satisfy a compatible condition, resulting in the proposed PGCB algorithm. Furthermore, additional techniques are proposed to significantly improve the performance and the convergence behavior. Time-Dependent Greed ensures that the algorithm is GLIE. Actor-Dropout, which uses dropout on the actor network as a Bayesian approximation, empirically improves PGCB to sublinear regret. By testing on toy datasets and a recommendation dataset, we showed that PGCB achieves state-of-the-art performance for both classic contextual bandits and MDP-CB with state transitions in a real-world scenario. Future work could study Actor-Dropout in more general RL environments such as robotics or game playing. It is also a promising direction to extend our results to variants of bandits with states, e.g., choosing multiple arms at each step or having more general conditions.
References
 [1] Yasin AbbasiYadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
 [2] Naoki Abe, Alan W Biermann, and Philip M Long. Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica, 37(4):263–293, 2003.
 [3] Sander Adam, Lucian Busoniu, and Robert Babuska. Experience replay for realtime reinforcement learning control. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(2):201–212, 2012.

[4]
Shipra Agrawal and Navin Goyal.
Thompson sampling for contextual bandits with linear payoffs.
In
International Conference on Machine Learning
, pages 127–135, 2013.  [5] Peter Auer, Nicolo CesaBianchi, and Paul Fischer. Finitetime analysis of the multiarmed bandit problem. Machine learning, 47(23):235–256, 2002.
 [6] Djallel Bouneffouf, Amel Bouzeghoub, and Alda Lopes Gançarski. A contextualbandit algorithm for mobile contextaware recommender system. In International Conference on Neural Information Processing, pages 324–331. Springer, 2012.
 [7] Qingpeng Cai, Aris FilosRatsikas, Pingzhong Tang, and Yiwei Zhang. Reinforcement mechanism design for ecommerce. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, pages 1339–1348. International World Wide Web Conferences Steering Committee, 2018.
 [8] Olivier Chapelle and Lihong Li. An empirical evaluation of thompson sampling. In Advances in neural information processing systems, pages 2249–2257, 2011.

[9]
Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire.
Contextual bandits with linear payoff functions.
In
Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics
, pages 208–214, 2011.  [10] Sarah Filippi, Olivier Cappe, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems, pages 586–594, 2010.

[11]
Yarin Gal and Zoubin Ghahramani.
Dropout as a bayesian approximation: Representing model uncertainty in deep learning.
In international conference on machine learning, pages 1050–1059, 2016.  [12] Nicolas Heess, Jonathan J Hunt, Timothy P Lillicrap, and David Silver. Memorybased control with recurrent neural networks. arXiv preprint arXiv:1512.04455, 2015.
 [13] Thomas Jaksch, Ronald Ortner, and Peter Auer. Nearoptimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
 [14] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [15] Andreas Krause and Cheng S Ong. Contextual gaussian process bandit optimization. In Advances in Neural Information Processing Systems, pages 2447–2455, 2011.

[16]
John Langford and Tong Zhang.
The epochgreedy algorithm for multiarmed bandits with side information.
In Advances in neural information processing systems, pages 817–824, 2008.  [17] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextualbandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670, 2010.
 [18] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 [19] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
 [20] Guy Shani, David Heckerman, and Ronen I Brafman. An mdp-based recommender system. Journal of Machine Learning Research, 6(Sep):1265–1295, 2005.
 [21] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In ICML, 2014.
 [22] Aleksandrs Slivkins, Filip Radlinski, and Sreenivas Gollapudi. Ranked bandits in metric spaces: learning diverse rankings over large document collections. Journal of Machine Learning Research, 14(Feb):399–436, 2013.
 [23] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias W Seeger. Information-theoretic regret bounds for gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250–3265, 2012.
 [24] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
 [25] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
 [26] Nima Taghipour and Ahmad Kardan. A hybrid web recommender system based on q-learning. In Proceedings of the 2008 ACM symposium on Applied computing, pages 1164–1168. ACM, 2008.
 [27] Liang Tang, Yexi Jiang, Lei Li, and Tao Li. Ensemble contextual bandits for personalized recommendation. In Proceedings of the 8th ACM Conference on Recommender Systems, pages 73–80. ACM, 2014.
 [28] Liang Tang, Yexi Jiang, Lei Li, Chunqiu Zeng, and Tao Li. Personalized recommendation via parameter-free contextual bandits. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 323–332. ACM, 2015.
 [29] Liang Tang, Romer Rosales, Ajit Singh, and Deepak Agarwal. Automatic ad format selection via contextual bandits. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management, pages 1587–1594. ACM, 2013.
 [30] Peter Whittle. Restless bandits: Activity allocation in a changing world. Journal of applied probability, 25(A):287–298, 1988.
 [31] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI, pages 2852–2858, 2017.
 [32] Ganbin Zhou, Ping Luo, Yijun Xiao, Fen Lin, Bo Chen, and Qing He. Elastic responding machine for dialog generation with dynamically mechanism selecting. In AAAI, 2018.
Proof of Lemma 1
Proof.
Suppose there exists a policy such that
(i) it is not permutation invariant, i.e. there exists and some permutation operator that ;
(ii) the expected reward following is larger than that of all permutation invariant policies that .
Then it follows that
(18) 
where the expectation is over all sets of contexts. Recall that the contexts are drawn i.i.d. from the same distribution, so we have
(19) 
so there exists at least one that
(20) 
But because is not permutation invariant, we can construct a policy that is permutation invariant, where ; then
(21) 
which contradicts (19) and (20). So Lemma 1 holds. ∎
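To make the notion of a permutation invariant policy concrete, the following is a minimal sketch, not the construction used in the proof. It assumes a policy that scores every arm with one shared function of that arm's context and normalizes the scores with a softmax; the names `score_arm`, `permutation_invariant_policy`, and `weights` are hypothetical. Under this assumption, reordering the arms only reorders the output probabilities, so the probability assigned to a given context does not depend on its position.

```python
import numpy as np

def score_arm(context, weights):
    """Hypothetical shared scoring function applied to a single arm's context."""
    return float(np.tanh(context @ weights))

def permutation_invariant_policy(contexts, weights):
    """Softmax over per-arm scores; each score depends only on that arm's context."""
    scores = np.array([score_arm(c, weights) for c in contexts])
    scores -= scores.max()  # subtract the max for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

rng = np.random.default_rng(0)
contexts = rng.normal(size=(4, 3))  # 4 arms, 3 features each (illustrative sizes)
weights = rng.normal(size=3)
perm = rng.permutation(4)

probs = permutation_invariant_policy(contexts, weights)
probs_permuted = permutation_invariant_policy(contexts[perm], weights)

# The arm at position i after permuting is the original arm perm[i],
# so its probability must equal probs[perm[i]].
print(np.allclose(probs[perm], probs_permuted))
```

A policy that instead consumed the concatenated contexts through position-specific weights would generally fail this check, which is the kind of policy that condition (i) describes.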
Proof of Theorem 2
Proof.
We denote the state value for a given state under policy as
(22) 
it follows that
(23) 
By repeatedly unrolling the equation, we have
(24) 
Integrating both sides over the start state and recalling the discounted state density and discounted augmented context density , we get the policy gradient as
(25) 
∎
Proof of Theorem 3
Lower variance of the gradients of PGCB than vanilla PG
We prove that the variance of updating the actor and the critic of PGCB is less than that of vanilla PG.
Since context does not exist in the classic formulation of reinforcement learning, it is often regarded as part of the state information. Given a stochastic policy , PG has policy gradients
(28) 
where denotes a unit vector and is the probability for choosing the arm. For simplicity, we write . Since we focus on policy gradients, we assume that PG has a critic function with the same form as PGCB. The corresponding update steps for PG are
(29)  
(30) 
Since contextual bandits involve discrete actions with high-dimensional random contexts, we claim that PGCB achieves lower estimation variance than classic stochastic policy gradient methods such as [25]. The reasons are twofold. First, by the lemma on permutation invariant policies (Lemma 1), permutation invariant policies are sufficient for contextual bandit problems. PGCB adopts a class of stochastic policies where the only input of the policy is for . In contrast, other policy gradient methods must treat a state and the whole set of contexts together as the input of the policy function, so a larger sample space is usually necessary, which results in lower sample efficiency. Second, even with the same form of policy, standard actor-critic methods tend to converge more slowly than PGCB because the expected probabilities of choosing arms in PGCB are estimated more efficiently.
In this section we make a fair comparison of the variances of PG and PGCB by assuming that they share the same policy and action-value functions.
Lemma 4.
Given a policy and a value approximation , both and are unbiased estimators for the true gradients of the action-value approximation
(31) 
And . Additionally if PGCB uses a fixed , as , with probability we have
(32) 
Proof.
It is obvious that both in and in are unbiased estimators of . So both and are unbiased estimators of .
To analyze the variance, we focus on the estimations of the probability of choosing an arm: and . Let Then for PGCB,
(33) 
where denotes the probability of choosing at the time of sampling. In the worst case, PGCB samples exactly the same set of arms every time, and then . Otherwise, if there exist and such that the samples are different, i.e. , then the correlation is strictly less than 1 and we have in this case. Finally, once enough time steps have passed, since is a fixed positive integer, the probability of each arm being sampled at most once is
So with probability the sampled contexts are all different from each other, the estimated probabilities of choosing an arm are i.i.d., and hence . ∎
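The three cases in the proof (identical samples, partially different samples, all samples distinct) follow the standard behaviour of an average of correlated estimators. The sketch below is an illustration under stated assumptions, not the paper's exact expressions: it assumes `n` estimators with common variance `sigma2` and a common pairwise correlation `rho`, and it assumes that samples are drawn uniformly with replacement from a pool of `pool_size` stored contexts. All names are hypothetical.

```python
def variance_of_mean(sigma2, rho, n):
    """Variance of the average of n estimators that share variance sigma2
    and pairwise correlation rho (standard equicorrelation identity)."""
    return sigma2 * (1.0 + (n - 1) * rho) / n

def prob_all_distinct(pool_size, n):
    """Probability that n uniform draws with replacement from a pool of
    pool_size items are all distinct (assumed sampling scheme)."""
    p = 1.0
    for i in range(n):
        p *= 1.0 - i / pool_size
    return p

n = 4
print(variance_of_mean(1.0, 1.0, n))  # identical samples: no reduction
print(variance_of_mean(1.0, 0.5, n))  # correlation below 1: strict reduction
print(variance_of_mean(1.0, 0.0, n))  # uncorrelated samples: sigma2 / n
print(prob_all_distinct(10, n))       # small pool: repeats are likely
print(prob_all_distinct(10**6, n))    # large pool: repeats become negligible
```

With `n` held fixed, `prob_all_distinct` tends to 1 as the pool grows, which mirrors the argument that the sampled contexts are eventually all different with high probability.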
Applying a technique similar to that of Lemma 4, we obtain the following theorem, which states that the policy gradient (12) has no higher variance than the gradient in PG.
Theorem 5.
.
Proof.
Note that in practice PGCB does not necessarily need to set to a large integer, since it is naturally a finer-grained form of experience replay [3]. Surprisingly, when , PGCB can perform better than even in the simplest setting. In the next section, we present experimental results showing that PGCB with achieves better performance in various settings than other baseline methods, including PG.
The results can be interpreted as follows. From a statistical point of view, PGCB benefits from a resampling technique, so its estimates have lower variance. From an optimization perspective, PGCB reduces the correlation between the estimated probabilities of choosing arms within the same time step, so it is less prone to over-exploitation and overfitting than PG. For example, when the estimated values of the contexts are given, an optimizer for PG simultaneously increases one arm’s chosen probability and reduces the others’, which drives the policy toward a deterministic one: the arm with the largest estimated value gets a chosen probability close to 1, and the others get arbitrarily small probabilities close to 0. Afterwards, the arms with near-zero chosen probabilities hardly have any influence on further updates, so PG is eventually likely to overfit the existing data. In contrast, when PGCB estimates the gradients, even if an arm is not better than its competitors at its own time step, it may still be upgraded because it outranks some arms from other time steps. Therefore, PGCB tends to be more robust and to explore better than PG.
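The collapse of PG toward a deterministic policy can be reproduced in a toy setting. The sketch below is a simplified illustration, not the experimental setup of the paper: it assumes a softmax policy over free logits for a single set of arms, fixed estimated values, and exact (noise-free) gradient ascent on the expected value. The function names, step count, and learning rate are hypothetical choices.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def train_softmax_pg(values, steps=5000, lr=1.0):
    """Exact gradient ascent on J = sum_i pi_i * values_i for a softmax policy.
    The gradient with respect to logit i is pi_i * (values_i - J), so raising
    one arm's probability necessarily lowers the others'."""
    logits = np.zeros_like(values, dtype=float)
    for _ in range(steps):
        probs = softmax(logits)
        expected_value = probs @ values
        logits += lr * probs * (values - expected_value)
    return softmax(logits)

values = np.array([1.0, 0.5, 0.2])  # fixed estimated values for three arms
final_probs = train_softmax_pg(values)
print(final_probs)  # mass concentrates on the arm with the largest value
```

Because each logit's gradient is scaled by that arm's own probability, arms whose probability has shrunk toward zero receive vanishing updates, which matches the observation that they hardly influence further training.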