Policy Gradients for Contextual Bandits

We study a generalized contextual-bandits problem in which a state determines the distribution of the arms' contexts and affects the immediate reward of choosing an arm. The problem applies to a wide range of realistic settings such as personalized recommender systems and natural language generation. We put forward a class of policies in which the marginal probability of choosing an arm (in expectation over the other arms) in each state has a simple closed form and is differentiable. In particular, the gradient of this class of policies has a succinct form: an expectation, over pairs of states and single contexts, of the action value multiplied by the gradient of the marginal probability. These findings naturally lead to an algorithm, coined policy gradient for contextual bandits (PGCB). As a further theoretical guarantee, we show that the variance of the PGCB gradient estimate is lower than that of the standard policy gradient algorithm. We also derive the off-policy gradients, and evaluate PGCB on a toy dataset as well as a music recommender dataset. Experiments show that PGCB outperforms both classic contextual-bandits methods and policy gradient methods.



1 Introduction

Over the past decade, contextual-bandit algorithms have been successfully deployed in a number of industrial applications, such as personalized recommender systems li2010contextual ; tang2014ensemble ; tang2015personalized , advertisement personalization bouneffouf2012contextual ; tang2013automatic , and learning-to-rank slivkins2013ranked . Because they minimize online regret, contextual-bandit algorithms are well suited to settings where one needs to minimize the cumulative cost during online learning or to address the trade-off between exploitation and exploration.

In the standard contextual-bandits setting langford2008epoch , nature and the player play a repeated game. Nature defines a reward function mapping contexts (a set of features) to real-valued rewards, which is not known a priori to the player. At each step, nature gives a set of arms, each with a context. The player observes the contexts, selects one arm and then receives a reward. The goal of the player is to minimize the cumulative regret or to maximize the cumulative reward.

There are plenty of algorithms for contextual bandits, most of which are value-based. The simplest heuristic is the $\epsilon$-greedy method and its variants, which employ a function approximator to estimate the value (expected reward) of choosing each arm, exploit the arm with the largest estimated value with probability $1-\epsilon$, and explore randomly with probability $\epsilon$, where $\epsilon$ is a constant or diminishing positive value. It has been widely used in Reinforcement Learning (RL) algorithms such as Deep Q-Network, which achieves human-level performance at playing Atari 2600 games mnih2015human . However, $\epsilon$-greedy is known as a local exploration method: with a constant $\epsilon$, its cumulative regret grows linearly in the number of trials, since there is always a positive probability of choosing a sub-optimal arm.
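As an illustration of the rule just described, here is a minimal Python sketch of $\epsilon$-greedy arm selection. The function name `epsilon_greedy`, the value estimates, and the choice $\epsilon = 0.1$ are assumptions made for the example; in practice the estimates would come from a learned function approximator.

```python
import random

def epsilon_greedy(values, epsilon, rng):
    """Pick an arm index: explore uniformly with probability epsilon, else exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(values))  # explore: uniform over all arms
    # exploit: arm with the largest estimated value (first one on ties)
    return max(range(len(values)), key=lambda i: values[i])

rng = random.Random(0)
estimates = [0.2, 0.9, 0.5]  # hypothetical value estimates for three arms
picks = [epsilon_greedy(estimates, 0.1, rng) for _ in range(10000)]
greedy_share = picks.count(1) / len(picks)  # fraction of steps that chose the best arm
```

With a constant $\epsilon$, a fixed fraction of steps keeps landing on sub-optimal arms however accurate the estimates become, which is the source of the linear regret mentioned above.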

As a result, directed exploration methods are desired, and many successful methods have been developed, such as Upper Confidence Bounds (UCB), Thompson Sampling (TS), and their variants. When the expected reward is linear in the context, li2010contextual ; chu2011contextual ; abbasi2011improved proposed Lin-UCB, and chapelle2011empirical ; agrawal2013thompson used TS with linear priors. For the non-linear cases, filippi2010parametric proposed GLM-UCB using generalized linear models, and krause2011contextual ; srinivas2012information used Gaussian Processes to model the reward functions. UCB and TS methods are known to achieve sub-linear regret auer2002finite ; abe2003reinforcement ; li2010contextual ; chu2011contextual ; abbasi2011improved ; chapelle2011empirical ; agrawal2013thompson ; filippi2010parametric ; krause2011contextual ; srinivas2012information . The idea of UCB has also been applied to a model-based RL algorithm called UCRL, which comes with regret bounds jaksch2010near .

However, the applicability of these methods is heavily limited, especially for large-scale problems with high-dimensional contexts and complicated reward functions, due to the following reasons:

  • First, these methods tend to over-simplify the form of the reward. Although the (generalized) linear cases and Gaussian Process cases have been studied, the vast majority of general reward functions remain unsolved.

  • Secondly, it is difficult to obtain an accurate estimate of the posterior distributions at each time step. Closed-form solutions exist for small-scale linear bandits if the design matrix is not singular or ill-posed, but not for general cases.

  • Thirdly, it is often assumed that the reward of an arm is uniquely determined by the context, and that the distribution of contexts is independent of the agent’s actions. However, this may not hold in real-world applications such as recommender systems, where the behavior of users depends heavily on the history, i.e., the items that they viewed in previous rounds.

  • Last but not least, value-based methods are meant to find deterministic policies, which loses generality since, in reality, optimal policies are sometimes stochastic (e.g., a deterministic dialogue generation system is never considered desirable). Another limitation is that a subtle change in the Q-function may cause a discontinuous jump in the resulting policy, which makes it hard for these iteration-based algorithms to converge sutton2000policy .

In light of these observations, we propose Policy Gradients for Contextual Bandits (PGCB), which uses a policy gradient method to solve general contextual bandits without unrealistic assumptions or prior knowledge, and which achieves fast convergence and low regret with the help of two heuristics called Time-Dependent Greed and Actor-Dropout.

In contrast to previous value-based works, we show that from a policy-based perspective, the performance objective of an algorithm is determined only by its marginal expected probability of choosing each arm (in expectation over the other arms). PGCB therefore adopts a class of policies in which the expected probability of choosing an action has a simple closed form and can be estimated efficiently, so that the search space is dramatically reduced. By optimizing directly in the policy space and predicting a stochastic policy, it naturally fits problems that require randomized actions.

The proposed method uses an actor network to predict the policy and a critic network to estimate the value of choosing each arm, similar to sutton2000policy ; silver2014deterministic . Given the expressive power of deep neural networks, a wide range of reward functions can be approximated. We then show that the policy gradient can be estimated by sampling contexts from the history trajectories. PGCB naturally extends the experience replay technique adam2012experience ; heess2015memory to a finer-grained sampling procedure. The network weights are updated at each step by stochastic gradient descent with mini-batch training, which makes the method computationally efficient. We present compatible conditions for the Q-function approximation and prove that the gradient is unbiased under these conditions. We prove in the supplementary material that the variance of the gradient is lower than that of vanilla policy gradient algorithms.

Directed exploration and Greedy in the Limit with Infinite Exploration (GLIE) are provided by Time-Dependent Greed and Actor-Dropout. Time-Dependent Greed lets the level of greed increase over time, since we would like a stochastic policy to explore a lot in the early stage and then converge to a greedy policy. Actor-Dropout uses dropout when training and predicting with the actor network. It approximates Bayesian estimation and Monte Carlo sampling through dropout, and gives PGCB directed exploration rather than local exploration like $\epsilon$-greedy. The two techniques may be of independent interest for similar problems.

Furthermore, PGCB applies directly to contextual bandits in a Markov Decision Process (MDP) setting, i.e., with states and state transitions. Suppose that, at each step, contexts are drawn i.i.d. from a distribution conditional on the current state. When an arm is chosen, the immediate reward is decided by both the state and the selected context, and the state then transitions to the next state. Such a model fits a wide range of important realistic applications: personalized recommender systems, where users’ preferences are regarded as states and items are regarded as arms with contexts shani2005mdp ; taghipour2008hybrid ; natural language generation, where queries (or previous sentences) are regarded as states and the corresponding candidate replies (or the next sentence) are regarded as arms with contexts yu2017seqgan ; zhou2018elastic ; and e-commerce, where the private information (e.g., cost, reputation) of sellers can be viewed as states and different commercial strategies are regarded as contexts cai2018reinforcement .

We evaluate PGCB on toy datasets and a realistic dataset of music recommendation. Comparing against baseline methods including Lin-UCB, GLM-UCB, Thompson Sampling, $\epsilon$-greedy, and vanilla policy gradients, we find that PGCB converges fast and achieves the lowest cumulative regret and the highest average reward in various standard contextual-bandits settings. Moreover, when states and state transitions are included in the real-world recommender environment, GLM-UCB and TS fail to incorporate information from states, whereas PGCB consistently outperforms the other baselines.

2 Problem formulation

We first introduce the standard contextual-bandits problem. At each step, we have a set of contexts $\{c_1, \dots, c_m\}$ that corresponds to $m$ arms, where $c_i$ is the context of the $i$-th arm. The contexts $c_1, \dots, c_m$ are independently and identically distributed random variables with outcome space $\mathcal{C}$. The action is to select an arm $a \in \{1, \dots, m\}$. Let $c_a$ denote the context of the selected arm. The immediate reward is denoted by $R(c_a)$, where $R$ is a function that takes the context as input and outputs a random reward, and which is not known to the player a priori. For ease of notation, we use $\mathbf{c} = (c_1, \dots, c_m)$ to denote the matrix of all contexts, so that $c_a$ is the one chosen by action $a$. A policy $\pi$ is a function that maps the contexts to a distribution over actions. We denote the action determined by policy $\pi$ by a random variable $\pi(\mathbf{c})$, regardless of whether the policy is stochastic or deterministic. The performance of a policy is measured as usual by the expected reward of the chosen arm over all possible contexts:

$$J(\pi) = \mathbb{E}\left[ R\left(c_{\pi(\mathbf{c})}\right) \right], \tag{1}$$

where the expectation is taken over the contexts $\mathbf{c}$, the randomness of the policy, and the randomness of the reward. When the policy is parameterized as $\pi_\theta$, our learning task is to learn $\theta$ that maximizes $J(\pi_\theta)$.

Next, we introduce the generalization of contextual bandits to an MDP setting with states and state transitions, which we refer to as MDP-CB. At each step $t$, the player observes its state $s_t$ as well as a set of contexts $\mathbf{c}_t = (c_{t,1}, \dots, c_{t,m})$ correlated with the state. We assume that the contexts are independent conditioned on the state: $c_{t,i} \sim p(\cdot \mid s_t)$ for all $i$, where $p(\cdot \mid s)$ is the probability density of contexts given state $s$. When an action $a_t$ is selected, a reward $r_t = R(s_t, c_{t,a_t})$ is received and the state transitions to the next state according to a Markovian state transition probability $s_{t+1} \sim T(\cdot \mid s_t, c_{t,a_t})$. Note that the setting in this paper differs from generalized bandits with transitions such as restless bandits whittle1988restless and other works.

The goal is to find a policy that maximizes the expected cumulative discounted reward, so the objective is $J(\pi) = \mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t\right]$, where $\gamma$ ($0 < \gamma < 1$) is a discount factor that balances short- and long-term rewards.
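To make the discounted objective concrete, the following minimal Python sketch computes the discounted return of one finite reward trajectory. It assumes the first reward carries weight $\gamma^0 = 1$; the function name `discounted_return` and the sample rewards are illustrative and not taken from the paper.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], accumulated backwards for numerical simplicity."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# Three unit rewards with gamma = 0.5: 1 + 0.5 + 0.25
example = discounted_return([1.0, 1.0, 1.0], 0.5)
```

A smaller $\gamma$ puts more weight on immediate rewards, while $\gamma$ close to 1 makes later rewards count almost as much as early ones.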

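As in previous works on policy gradients sutton2000policy ; silver2014deterministic , we denote by $p(s \to s', t, \pi)$ the probability density at state $s'$ after transitioning for $t$ time steps from state $s$. We assume that the environment satisfies the property that, for any policy $\pi$, the discounted distribution of states is stationary. We denote the discounted state density by $\rho^\pi(s') = \int_{\mathcal{S}} \sum_{t=1}^{\infty} \gamma^{t-1} p_0(s)\, p(s \to s', t, \pi)\, ds$, where $\mathcal{S}$ is the state space and $p_0$ is the probability density of initial states. Since the reward and the state transition are determined by the state and the chosen context, we define the action-value function $Q^\pi(s, c) = \mathbb{E}\left[\sum_{k=1}^{\infty} \gamma^{k-1} r_{t+k-1} \mid s_t = s,\, c_{t,a_t} = c;\, \pi\right]$. We rewrite the objective as

$$J(\pi) = \mathbb{E}_{s \sim \rho^\pi,\; \mathbf{c} \sim p(\cdot \mid s)}\left[ R\left(s, c_{\pi(s, \mathbf{c})}\right) \right]. \tag{2}$$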

3 Policy Gradients for Contextual Bandits

In this section we investigate several key features of our proposed PGCB method. We discuss the standard case of contextual bandits, as well as how to extend it to solve the generalized MDP-CB.

3.1 Marginal probability for choosing an arm and policy gradients

Because the reward in a bandit problem depends only on the selected context, we claim that for any policy $\pi$ there exists a permutation invariant policy that performs at least as well. Please refer to the supplementary material for the proof.

Definition 1 (Permutation invariant policy).

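A policy $\pi$ is said to be permutation invariant if, for all $\mathbf{c}$ and any permutation $\mathbf{c}'$ of $\mathbf{c}$, the chosen contexts $c_{\pi(\mathbf{c})}$ and $c'_{\pi(\mathbf{c}')}$ have the same distribution.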

Lemma 1.

For any policy $\pi$, there exists a permutation invariant policy $\pi'$ s.t. $J(\pi') \ge J(\pi)$.

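Lemma 1 states that we can WLOG focus on permutation invariant policies. The objective is then

$$J(\pi) = m\, \mathbb{E}_{c}\left[ p_\pi(c)\, \mathbb{E}[R(c)] \right], \tag{3}$$

where $p_\pi(c)$ is the marginal probability of choosing an arm with context $c$ (in expectation over the randomness of the other arms, denoted as $c_{-i}$), by a permutation invariant policy:

$$p_\pi(c) = \mathbb{E}_{c_{-i}}\left[ \Pr\left(\pi(\mathbf{c}) = i \mid c_i = c\right) \right]. \tag{4}$$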

Suppose we have a score function $\mu_\theta$ which takes a context as input and outputs a score, where $\theta$ are the parameters. We can construct a class of permutation invariant policies with the score function, parameterized by $\theta$:

$$\pi_\theta(\mathbf{c}) = f\left(\mu_\theta(c_1), \dots, \mu_\theta(c_m)\right), \tag{5}$$

where $f$ is an operator that satisfies permutation invariance.

Note that this class of policies includes the policies of most well-known value-based bandit algorithms. For example, if the score function is the estimate of the reward, and the operator chooses the arm with the maximum estimated reward with probability $1-\epsilon$ and chooses randomly with probability $\epsilon$, the policy is exactly the well-known $\epsilon$-greedy policy sutton1998reinforcement . If the score function is the sum of the reward estimate and an upper confidence bound, and the operator chooses the arm with the maximum score, the result is the well-known upper confidence bound (UCB) algorithm auer2002finite ; li2010contextual .

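The policy gradient for the standard contextual bandits can be directly derived from (3):

$$\nabla_\theta J(\pi_\theta) = m\, \mathbb{E}_{c}\left[ \nabla_\theta p_\theta(c)\, \mathbb{E}[R(c)] \right], \tag{6}$$

where $p_\theta$ abbreviates the marginal probability $p_{\pi_\theta}$.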

To extend to MDP-CB, we use $\tilde{c}$ to denote the augmented context obtained by pairing a state $s$ with a single context $c$, i.e., $\tilde{c} = (s, c)$. Given a policy $\pi$, the states can be roughly thought of as drawn from the discounted stationary distribution $\rho^\pi(s)$. As we already defined the density of contexts given the state as $p(c \mid s)$, the discounted density of the augmented context is $\rho^\pi(\tilde{c}) = \rho^\pi(s)\, p(c \mid s)$ by the product rule of probability. Since we assume the state distribution is stationary, $\rho^\pi(\tilde{c})$ is also stationary.

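Then, by applying the same technique used to derive the marginal probability, we obtain the performance objective:

$$J(\pi) = m\, \mathbb{E}_{\tilde{c} \sim \rho^\pi}\left[ p_\pi(\tilde{c})\, \mathbb{E}[R(\tilde{c})] \right], \tag{7}$$

where $p_\pi(\tilde{c}) = p_\pi(s, c)$ is the marginal probability of choosing $c$ given $s$.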

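Now we derive the gradient of $J(\pi_\theta)$, similar to the result in sutton2000policy . The proof is deferred to the supplementary material. Surprisingly, the result only replaces the expected reward $\mathbb{E}[R(c)]$ in (6) by the action value $Q^\pi(\tilde{c})$, as stated below.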

Theorem 2 (Policy gradient bandits theorem).

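Assuming the policy $\pi_\theta$ leads to stationary distributions for states and contexts, the unbiased policy gradient is

$$\nabla_\theta J(\pi_\theta) = m\, \mathbb{E}_{\tilde{c} \sim \rho^\pi}\left[ \nabla_\theta p_\theta(\tilde{c})\, Q^\pi(\tilde{c}) \right], \tag{8}$$

where $Q^\pi(\tilde{c})$ is the state-action value, and $\rho^\pi(\tilde{c})$ is the discounted density of $\tilde{c}$.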

However, the marginal probability of choosing an arm is not explicitly known for an arbitrary policy $\pi$. We therefore put forward a family of stochastic policies for which this marginal probability has a closed form and the gradient of the objective can be estimated efficiently.

3.1.1 Sub-class of policies that has closed-form marginal probabilities of choosing arms

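Now we propose a class of policies for our PGCB algorithm, show how to estimate the marginal probability of choosing an arm for this class, and estimate the policy gradient efficiently. For the standard contextual bandits, following the form of a policy described in (5), we define a class of stochastic policies, denoted by $\Pi$, as

$$\pi_\theta(\mathbf{c}) = \mathrm{Multinoulli}\left\{ \sigma(\mu_\theta(c_1)), \dots, \sigma(\mu_\theta(c_m)) \right\}, \qquad \sigma(\mu_\theta(c_i)) = \frac{g(\mu_\theta(c_i))}{\sum_{j=1}^{m} g(\mu_\theta(c_j))}, \tag{9}$$

where $\sigma$ is a normalization built from a positive, differentiable function $g$, and Multinoulli returns a multinoulli random variable. The form of our policy (9) generalizes several important policies in reinforcement learning. For example, when $g$ is an exponential function $g(x) = e^{\beta x}$ with $\beta > 0$, it reduces to the well-known softmax policy, which trades off exploitation and exploration. If $\beta$ approaches infinity, it converges to an argmax policy that chooses the arm with the highest score. For any policy $\pi_\theta \in \Pi$,

$$p_\theta(c) = \mathbb{E}_{c_{-i}}\left[ \frac{g(\mu_\theta(c))}{g(\mu_\theta(c)) + \sum_{j \ne i} g(\mu_\theta(c_j))} \right], \tag{10}$$

which is a continuous positive function of the parameters $\theta$. So it is straightforward to estimate $p_\theta(c)$ by sampling from the contexts that the player has seen before. We denote by $D_t$ the contexts that have appeared up to step $t$. The estimate of $p_\theta(c)$ is unbiased because, as assumed, all contexts in $D_t$ are i.i.d. from the context space.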
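The sampling-based estimate of the marginal probability described above can be sketched as follows. This is a minimal illustration that assumes an exponential normalization (a softmax over scores); the function names `softmax_probs` and `estimate_marginal`, the linear score, and the toy history of one-dimensional contexts are all hypothetical choices made for the example.

```python
import math
import random

def softmax_probs(scores):
    """Normalize exp(score) over arms; subtract the max for numerical stability."""
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def estimate_marginal(context, history, score, num_arms, num_samples, rng):
    """Monte Carlo estimate of the chance that `context` is chosen when the
    other num_arms - 1 contexts are resampled from previously seen contexts."""
    acc = 0.0
    for _ in range(num_samples):
        others = [rng.choice(history) for _ in range(num_arms - 1)]
        scores = [score(context)] + [score(c) for c in others]
        acc += softmax_probs(scores)[0]  # probability assigned to `context`
    return acc / num_samples

history = [(i / 10.0,) for i in range(11)]   # toy 1-d contexts seen so far
linear_score = lambda c: 3.0 * c[0]          # hypothetical score function
p_high = estimate_marginal((1.0,), history, linear_score, 5, 200, random.Random(1))
p_low = estimate_marginal((0.0,), history, linear_score, 5, 200, random.Random(1))
```

Because the estimate is an average of smooth functions of the scores, it can be differentiated with respect to the score parameters, which is what the policy gradient requires.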

For MDP-CB, it is straightforward to extend (9) to include states by replacing each context $c$ with the augmented context $\tilde{c} = (s, c)$.

3.1.2 Actor-critic with compatible function approximations

For the standard contextual bandits, the most direct way to estimate the reward function is to apply supervised learning to find an estimator $\hat{R}_w$ with parameter $w$ minimizing the mean squared error, i.e.,

$$\min_w \frac{1}{|D^a_t|} \sum_{c \in D^a_t} \left( \hat{R}_w(c) - r_c \right)^2, \tag{11}$$

where $D^a_t$ is the set of chosen contexts and $r_c$ is the received reward for choosing context $c$. This is how most value-based bandit methods work, such as $\epsilon$-greedy, Lin-UCB and Thompson Sampling. However, we argue that in a policy-based framework it introduces bias. Since our goal is to maximize the expected reward rather than to minimize the empirical loss as in supervised learning, the marginal probabilities of choosing an arm must be taken into consideration, and the form of $\hat{R}_w$ cannot be chosen arbitrarily. Similarly, when states and state transitions are involved in MDP-CB, we also need to find an appropriate $Q^w$ to approximate $Q^\pi$. Similar to sutton2000policy ; silver2014deterministic , we define the following compatible conditions to ensure that the policy gradient is orthogonal to the error in value approximation. The proof is deferred to the supplementary material.

Theorem 3.

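The policy gradient using function approximation

$$\nabla_\theta J(\pi_\theta) \approx m\, \mathbb{E}_{\tilde{c} \sim \rho^\pi}\left[ \nabla_\theta p_\theta(\tilde{c})\, Q^w(\tilde{c}) \right] \tag{12}$$

is unbiased with respect to (8) if the following conditions are satisfied: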

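(i) the gradients for the value function and the policy are compatible, $\nabla_w Q^w(\tilde{c}) = \nabla_\theta p_\theta(\tilde{c})$;

(ii) the value function parameters $w$ reach a local minimum of the mean squared error over the stationary context distribution such that $\nabla_w\, \mathbb{E}_{\tilde{c} \sim \rho^\pi}\left[ \left( Q^w(\tilde{c}) - Q^\pi(\tilde{c}) \right)^2 \right] = 0$.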

3.2 The basic PGCB algorithm

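We now formally propose the policy gradients algorithm for general contextual-bandits problems, coined PGCB. Recall that our policy returns a Multinoulli random variable which chooses arm $i$ with probability $g(\mu_\theta(s, c_i)) / \sum_{j=1}^{m} g(\mu_\theta(s, c_j))$, where $g$ is the positive function used for normalization.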

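The key step in the updates is to estimate the marginal expected probability for each arm. When estimating it for some context, say $c$, in some state $s$, we resample from the other contexts to obtain another $m-1$ contexts from the same state. Similar to previous actor-critic algorithms lillicrap2015continuous , we can use Sarsa updates sutton1998reinforcement to estimate the action-value function and then update the policy parameters, which gives the following policy gradients algorithm for contextual bandits:

$$\delta_t = r_t + \gamma\, Q^w(s_{t+1}, c_{t+1, a_{t+1}}) - Q^w(s_t, c_{t, a_t}),$$
$$w_{t+1} = w_t + \alpha_w\, \delta_t\, \nabla_w Q^w(s_t, c_{t, a_t}),$$
$$\theta_{t+1} = \theta_t + \alpha_\theta \sum_{i=1}^{m} \nabla_\theta p_\theta(s_t, c_{t,i})\, Q^w(s_t, c_{t,i}),$$

where $\alpha_w$ and $\alpha_\theta$ are the learning rates of the critic and the actor.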

In practice, the updates can be applied on mini-batches with modern optimizers, from vanilla stochastic gradient descent to Adam kingma2014adam , which we use in our experiments. PGCB therefore fits naturally into deep Reinforcement Learning and Online Learning, and techniques from these areas may also be applied. Note that PGCB also applies to the standard setting without states.
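To show how the pieces fit together, here is a deliberately small Python sketch of an actor-critic loop in the standard setting without states. It is a toy under stated assumptions, not the authors' implementation: contexts are one-dimensional, the actor score is linear (`theta * c`), the policy is a softmax, the critic is a linear regressor on the immediate reward (no Sarsa bootstrapping is needed without states), and the marginal-probability gradient is estimated by resampling contexts from the history. All names and hyperparameters are hypothetical.

```python
import math
import random

def softmax(scores):
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def run_pgcb_sketch(steps=2000, num_arms=5, batch=4,
                    lr_actor=0.1, lr_critic=0.1, seed=0):
    rng = random.Random(seed)
    theta = 0.0          # actor: score(c) = theta * c
    w0, w1 = 0.0, 0.0    # critic: Q(c) = w0 + w1 * c
    history = []         # every context observed so far
    for _ in range(steps):
        contexts = [rng.random() for _ in range(num_arms)]
        history.extend(contexts)
        probs = softmax([theta * c for c in contexts])
        arm = rng.choices(range(num_arms), weights=probs)[0]
        chosen = contexts[arm]
        reward = chosen + rng.gauss(0.0, 0.1)  # toy linear reward, unknown to the learner
        # Critic: one SGD step on the squared error for the chosen context.
        err = reward - (w0 + w1 * chosen)
        w0 += lr_critic * err
        w1 += lr_critic * err * chosen
        # Actor: ascend Q(c) * d p(c) / d theta on contexts resampled from history.
        for _ in range(batch):
            c0 = rng.choice(history)
            group = [c0] + [rng.choice(history) for _ in range(num_arms - 1)]
            p = softmax([theta * c for c in group])
            mean_c = sum(pj * cj for pj, cj in zip(p, group))
            grad_p = p[0] * (c0 - mean_c)  # derivative of p[0] w.r.t. theta
            theta += lr_actor * (w0 + w1 * c0) * grad_p
    return theta, (w0, w1)

theta, critic = run_pgcb_sketch()
```

In this toy problem larger contexts yield larger rewards, so the actor parameter should drift upward and make the policy increasingly prefer arms with large contexts. A practical version would replace the linear actor and critic with neural networks and the hand-written gradient with automatic differentiation.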

3.3 Two useful heuristics: Time-Dependent Greed and Actor-Dropout

Greedy in the Limit with Infinite Exploration (GLIE) is a basic criterion desired for bandit algorithms. GLIE means exploring every action infinitely often while converging to a greedy policy, which reaches the globally optimal reward if the algorithm runs for long enough. Value-based methods can satisfy GLIE if a positive but diminishing exploration value is given to all actions. For policy-based methods it is less straightforward, because the exploration level of a stochastic policy is not explicit.

For PGCB, in contrast, GLIE is easy to obtain through Time-Dependent Greed, which applies a time-dependent greed factor to the scoring function. A straightforward usage is to let the factor increase with the current time step $t$, at a rate set by a pre-determined positive constant. As $t \to \infty$, the policy tends to choose only the arm with the largest score. The marginal probability also remains positive as long as every context keeps a strictly positive weight under the policy, so any arm gets an infinite chance to be explored if the algorithm runs for long enough. This technique can also be applied to other policy-based RL methods.
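A minimal Python sketch of the idea follows. The exact schedule is not recoverable from the text, so the logarithmic growth `scale * log(1 + step)` used here is purely an assumption for illustration, as are the names `greed_factor` and `policy_probs` and the sample scores.

```python
import math

def softmax(scores):
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def greed_factor(step, scale=1.0):
    """Hypothetical schedule: grows without bound, but slowly, as steps accumulate."""
    return scale * math.log(1.0 + step)

def policy_probs(scores, step, scale=1.0):
    """Softmax policy whose scores are multiplied by the time-dependent greed factor."""
    factor = greed_factor(step, scale)
    return softmax([factor * s for s in scores])

scores = [0.2, 0.9, 0.5]                 # hypothetical arm scores
early = policy_probs(scores, 1)          # close to exploratory
middle = policy_probs(scores, 100)
late = policy_probs(scores, 10 ** 6)     # close to greedy
```

The probability of the best arm rises with the step count while every arm keeps a positive probability at any finite step. How fast the factor may grow while still exploring every arm infinitely often depends on the schedule, which is why a slowly growing one is assumed here.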

Directed exploration is also desired in contextual bandits. UCB and TS methods are well known for directed exploration: they automatically trade off exploration and exploitation and achieve sub-linear total regret. The basic insight of UCB and TS is to learn the model uncertainty during the online decision-making process and to explore the arms with larger uncertainty (or higher potential for a large reward). Often a Bayesian framework is used to model the uncertainty by estimating the posterior distributions. However, these methods are limited in that they require prior knowledge about the rewards or the parameter distributions.

We therefore propose Actor-Dropout for PGCB to achieve directed exploration. The idea is simple: use dropout both when training the actor network and when predicting the policy. It has been shown that a neural network with dropout applied before its weight layers is mathematically equivalent to an approximation of the probabilistic deep Gaussian process gal2016dropout . Actor-Dropout thus learns the uncertainty and performs Monte Carlo sampling when predicting the policy, simply by using dropout. Since Actor-Dropout needs no prior knowledge, it applies to more general cases than UCB and TS.

In practice, adding dropout to just one layer of weights is enough for exploration. For example, for a fully connected actor network, one can apply dropout to the weights before the output layer, with a dropout ratio of 0.5 or 0.67. This can be understood as training several actors and randomly picking one at each step, which trades off exploration and exploitation since each actor learns something different from the others. We also found Actor-Dropout worth trying in the exploration phase of other RL or Online Learning tasks.
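The following Python sketch illustrates the mechanism on the last layer only. It assumes inverted dropout (kept units are rescaled by `1 / keep`) and a single mask shared by all arms within a step, which mimics picking one of several actors per step; the function names, weights, and hidden activations are hypothetical, and `drop_rate` must be below 1.

```python
import random

def sample_mask(size, drop_rate, rng):
    """One inverted-dropout mask: 0 for dropped units, 1/keep for kept ones."""
    keep = 1.0 - drop_rate
    return [1.0 / keep if rng.random() < keep else 0.0 for _ in range(size)]

def masked_scores(hidden_per_arm, out_weights, mask):
    """Score every arm with the same mask applied to the last layer's inputs."""
    return [sum(m * w * h for m, w, h in zip(mask, out_weights, hidden))
            for hidden in hidden_per_arm]

rng = random.Random(0)
out_weights = [0.5, -0.2, 0.8, 0.1]        # hypothetical output-layer weights
hidden_per_arm = [[1.0, 0.0, 0.5, 0.2],    # hypothetical hidden activations, one row per arm
                  [0.2, 0.9, 0.1, 0.7],
                  [0.4, 0.4, 0.4, 0.4]]
draws = []
for _ in range(5):
    mask = sample_mask(len(out_weights), 0.5, rng)  # fresh mask at every step
    draws.append(masked_scores(hidden_per_arm, out_weights, mask))
```

Each mask yields a different set of scores, so arms whose scores depend on uncertain units are sometimes ranked first and get explored, without any explicit prior over rewards.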

In Figure 1(d), we show experimentally that Actor-Dropout significantly helps PGCB to converge. The growth rate of the cumulative regret of PGCB without Actor-Dropout is similar to that of $\epsilon$-greedy, indicating that the original algorithm fails to converge and has linear regret. When equipped with Actor-Dropout, the regret is smaller, and for a suitable dropout rate the growth rate of PGCB’s regret is similar to that of GLM-UCB, which can theoretically achieve sub-linear regret. Empirically, then, Actor-Dropout is a strong tool for helping PGCB converge, even with almost no assumptions on the problem.

4 Experiments

4.1 Experiments on toy datasets of standard contextual bandits

We test PGCB on three standard contextual-bandits problems. We simulate a contextual-bandits environment with $m$ arms at each step, where each arm is represented by a $d$-dimensional context sampled i.i.d. uniformly from the unit cube $[0,1]^d$. Once an arm with context $c$ is chosen by the player, the environment returns a reward $R(c)$. We test three reward functions: (a) the linear reward with Gaussian noise; (b) the Bernoulli reward, where the probability of returning reward 1 depends on the chosen context $c$; (c) the mixed reward, which returns a random linear reward with some probability and a zero reward otherwise, as a mixture of linear and Bernoulli rewards. The coefficients of the reward functions are unknown to the player a priori, and white noise is added to introduce some randomness. We use the cumulative regret as the evaluation metric, which is defined as the cumulative difference between the reward received and the reward of the optimal arm.
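As a concrete reading of the metric, the short Python sketch below accumulates per-step regret. It assumes access to the expected reward of every arm at each step (available in a simulator) and therefore computes regret against expected rewards rather than noisy received rewards; the function name and the sample numbers are illustrative.

```python
def cumulative_regret(expected_rewards_per_step, chosen_arms):
    """Running sum of (best expected reward - expected reward of the chosen arm)."""
    total, curve = 0.0, []
    for rewards, arm in zip(expected_rewards_per_step, chosen_arms):
        total += max(rewards) - rewards[arm]
        curve.append(total)
    return curve

expected = [[0.1, 0.9], [0.5, 0.2], [0.3, 0.3]]   # hypothetical expected rewards per step
curve = cumulative_regret(expected, [0, 0, 1])     # regret of 0.8 at step 1, then none
```

A curve that flattens indicates that the algorithm has settled on near-optimal arms, whereas a straight line with positive slope corresponds to linear regret.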

We compare PGCB with the following algorithms:

  • $\epsilon$-greedy: It estimates the reward with a network. It chooses the arm with the largest estimated value with probability $1-\epsilon$ and chooses randomly otherwise.

  • Lin-UCB: The widely studied version of UCB for linear rewards by li2010contextual ; chu2011contextual ; abbasi2011improved , which uses a linear function to approximate the reward, and chooses the arm with the maximum sum of the estimated reward and the estimated confidence bound.

  • GLM-UCB: The UCB method for generalized linear rewards proposed by filippi2010parametric , which can handle non-linear rewards if the reward function can be fitted by a generalized linear model of the contexts, such as Bernoulli rewards and Poisson rewards.

  • Thompson Sampling: It uses the same function approximation as Lin-UCB or GLM-UCB for linear and non-linear rewards. It samples from the posterior distribution of the parameters, estimates each arm’s value, and chooses the arm with the maximum estimate chapelle2011empirical ; agrawal2013thompson .

Figure 1: Experiments on standard contextual bandits. (a) Linear rewards: PGCB performs comparably to Lin-UCB and is much better than the other two. (b) Bernoulli rewards: PGCB outperforms the others by large margins. (c) Mixed rewards: PGCB outperforms the others. TS is not tested because there are few tools for applying it in this setting. (d) PGCB with different levels of Actor-Dropout. Without Actor-Dropout, PGCB has linear regret like $\epsilon$-greedy and fails to converge; with Actor-Dropout, PGCB empirically converges and attains low regret similar to UCB.

The experimental setup is as follows. For PGCB, fully connected networks with a single hidden layer are used. At each step we sample mini-batches and optimize the loss with the Adam optimizer kingma2014adam . For Lin-UCB, GLM-UCB and Thompson Sampling, we use the training procedures suggested in li2010contextual . For $\epsilon$-greedy, we use the same value function approximation as PGCB, with $\epsilon$ set to a fixed value. We run each algorithm 20 times and average the cumulative regrets. Results are shown in Figure 1. PGCB converges faster and has lower regret in these cases, while the UCB and TS methods sometimes converge more slowly and $\epsilon$-greedy fails to converge.

4.2 Experiments on a music recommendation dataset

We test PGCB on a real-world music recommendation dataset provided by KKBox and open-sourced on Kaggle.com (https://www.kaggle.com/c/kkbox-music-recommendation-challenge/). The challenge is to predict the chances of a user listening to a song repeatedly. We construct two simulators based on the distributions of the dataset with different settings: one without explicit states, the other with states and state transitions. At each time step, a user comes to the system. We take the last 3 songs that the system previously recommended to the user, together with the corresponding feedback (listened or not), as the current state. The recommender system then selects one song from a set of candidate songs randomly sampled from the user’s listening history and recommends it to the user. The reward is binary: the system receives the success reward if the user listens to the song again (this is the original target of the supervised learning dataset), and the failure reward otherwise. Each song has a fixed-size context vector, including information about the song’s genre, artists, composers, and language. Each simulation consists of 5 million time steps and is repeated 5 times. Since the optimal choice is unknown, we use the average reward as the performance metric.

The experimental setup in the setting without states is as follows. PGCB uses networks with two hidden layers of sizes 60 and 20. $\epsilon$-greedy has exactly the same network structure as PGCB. Both PGCB and $\epsilon$-greedy are trained with the Adam algorithm, using the same learning rate and random mini-batches of the same size. GLM-UCB is tested here because of the Bernoulli rewards. As shown in Figure 2(a), PGCB outperforms the other algorithms. Traditional contextual-bandits methods learn well from the beginning, which indicates that they are good at exploration, but their average rewards soon stop increasing because of the limited fitting power of generalized linear models. $\epsilon$-greedy outperforms GLM-UCB and TS after a long run, but it learns poorly at the beginning. Compared with these algorithms, PGCB has the best performance from the beginning to the end of the learning process.

Figure 2: Average rewards of episodes for (a) music recommender without explicit states; (b) music recommender with states and state transitions. The solid lines are averaged from 5 repeated runs.

Next, we explain the experimental setup with states. We enlarge the first hidden layer of PGCB from 60 to 90 units because the network now takes both the contexts and the states as input. UCB and TS here take the augmented contexts as inputs, with priors based on generalized linear models. We also test vanilla PG sutton2000policy as a baseline, with the same network structure and training details as PGCB. The result of the experiment is shown in Figure 2(b). PGCB outperforms the other algorithms by a larger margin than in the previous experiment. An interesting fact is that both UCB and TS get almost the same average rewards as in the previous experiment, which indicates that they can hardly make any use of the information from states. PGCB learns faster and achieves the best performance among the compared methods in this task.

5 Conclusion

This paper has studied how to use the actor-critic algorithm with neural networks to solve general contextual bandits, including the standard case and MDP-CB. We first show that the class of permutation invariant policies is sufficient for our problem, and then derive that the performance of a policy depends on its marginal expected probability of choosing each arm. We next propose a sub-class of policies in which the objective has a simple closed form and is differentiable with respect to the parameters. We prove that policies in this class have a succinct form of gradient if the actor and the critic satisfy a compatible condition, resulting in the proposed PGCB algorithm. Furthermore, additional techniques are proposed to significantly improve the performance and the convergence behavior. Time-Dependent Greed makes the algorithm GLIE. Actor-Dropout, which uses dropout on the actor network as a Bayesian approximation, empirically improves PGCB to sub-linear regret. By testing on a toy dataset and a recommendation dataset, we showed that PGCB achieves the best performance among the compared methods for both classic contextual bandits and MDP-CB with state transitions in a real-world scenario. Future work could study Actor-Dropout in more general RL environments such as robotics or game playing. It is also a promising direction to extend our results to variants of bandits with states, e.g., choosing multiple arms at each step, or to more general conditions.


  • [1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
  • [2] Naoki Abe, Alan W Biermann, and Philip M Long. Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica, 37(4):263–293, 2003.
  • [3] Sander Adam, Lucian Busoniu, and Robert Babuska. Experience replay for real-time reinforcement learning control. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(2):201–212, 2012.
  • [4] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135, 2013.
  • [5] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
  • [6] Djallel Bouneffouf, Amel Bouzeghoub, and Alda Lopes Gançarski. A contextual-bandit algorithm for mobile context-aware recommender system. In International Conference on Neural Information Processing, pages 324–331. Springer, 2012.
  • [7] Qingpeng Cai, Aris Filos-Ratsikas, Pingzhong Tang, and Yiwei Zhang. Reinforcement mechanism design for e-commerce. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, pages 1339–1348. International World Wide Web Conferences Steering Committee, 2018.
  • [8] Olivier Chapelle and Lihong Li. An empirical evaluation of thompson sampling. In Advances in neural information processing systems, pages 2249–2257, 2011.
  • [9] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
  • [10] Sarah Filippi, Olivier Cappe, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems, pages 586–594, 2010.
  • [11] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
  • [12] Nicolas Heess, Jonathan J Hunt, Timothy P Lillicrap, and David Silver. Memory-based control with recurrent neural networks. arXiv preprint arXiv:1512.04455, 2015.
  • [13] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
  • [14] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [15] Andreas Krause and Cheng S Ong. Contextual gaussian process bandit optimization. In Advances in Neural Information Processing Systems, pages 2447–2455, 2011.
  • [16] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in neural information processing systems, pages 817–824, 2008.
  • [17] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670, 2010.
  • [18] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • [19] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
  • [20] Guy Shani, David Heckerman, and Ronen I Brafman. An mdp-based recommender system. Journal of Machine Learning Research, 6(Sep):1265–1295, 2005.
  • [21] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In ICML, 2014.
  • [22] Aleksandrs Slivkins, Filip Radlinski, and Sreenivas Gollapudi. Ranked bandits in metric spaces: learning diverse rankings over large document collections. Journal of Machine Learning Research, 14(Feb):399–436, 2013.
  • [23] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias W Seeger. Information-theoretic regret bounds for gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250–3265, 2012.
  • [24] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
  • [25] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pages 1057–1063, 2000.
  • [26] Nima Taghipour and Ahmad Kardan. A hybrid web recommender system based on q-learning. In Proceedings of the 2008 ACM symposium on Applied computing, pages 1164–1168. ACM, 2008.
  • [27] Liang Tang, Yexi Jiang, Lei Li, and Tao Li. Ensemble contextual bandits for personalized recommendation. In Proceedings of the 8th ACM Conference on Recommender Systems, pages 73–80. ACM, 2014.
  • [28] Liang Tang, Yexi Jiang, Lei Li, Chunqiu Zeng, and Tao Li. Personalized recommendation via parameter-free contextual bandits. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 323–332. ACM, 2015.
  • [29] Liang Tang, Romer Rosales, Ajit Singh, and Deepak Agarwal. Automatic ad format selection via contextual bandits. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management, pages 1587–1594. ACM, 2013.
  • [30] Peter Whittle. Restless bandits: Activity allocation in a changing world. Journal of applied probability, 25(A):287–298, 1988.
  • [31] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI, pages 2852–2858, 2017.
  • [32] Ganbin Zhou, Ping Luo, Yijun Xiao, Fen Lin, Bo Chen, and Qing He. Elastic responding machine for dialog generation with dynamically mechanism selecting. In AAAI, 2018.

Proof of Lemma 1


Suppose there exists a policy such that

(i) it is not permutation invariant, i.e. there exists and some permutation operator that ;

(ii) The expected reward following is larger than all permutation invariant policies that .

Then it follows that


where the expectation is over all sets of contexts. Recall that the contexts are drawn i.i.d. from the same distribution, so we have


so there exists at least one that


But because is not permutation invariant, we find a policy that is permutation invariant, where , then


which contradicts (19) and (20). So Lemma 1 holds. ∎

Proof of Theorem 2


We denote the state-value for a given state under policy as


it follows that


By repeatedly unrolling the equation, we have


Integrating both sides over the start-state and recalling the discounted state density and discounted augmented context density , we get the policy gradient as


Proof of Theorem 3


By condition (ii), since we assumed that the distribution of contexts is stationary with respect to the policy , it is easy to see that, when the conditions hold,


Then by condition (i) we have


which is the difference between (8) and (12). ∎

Lower variance of the gradients of PGCB than vanilla PG

We prove that the variance of updating the actor and the critic of PGCB is less than that of vanilla PG.

Since contexts do not exist in the classic formulation of reinforcement learning, they are often regarded as part of the state information. Given a stochastic policy , PG has policy gradients


where denotes a unit vector and is the probability of choosing the arm. For simplicity, we write . Since we focus on policy gradients, we assume that PG has a critic function with the same form as PGCB. The corresponding update steps for PG are


Since contextual bandits involve discrete actions with high-dimensional random contexts, we claim that PGCB achieves lower estimation variance than classic stochastic policy gradient methods such as [25]. The reasons are two-fold. Firstly, by Lemma 1, permutation invariant policies are sufficient for contextual-bandits problems. PGCB adopts a class of stochastic policies where the only input of the policy is for . In contrast, other policy gradient methods must treat a state and the whole set of contexts together as the input of the policy function, so a larger sample space is usually necessary, which results in lower sample efficiency. Secondly, even with the same form of policy, normal actor-critic methods tend to converge more slowly than PGCB because the expected probabilities of choosing arms in PGCB are estimated more efficiently.
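The first reason can be illustrated with a minimal sketch, assuming a hypothetical one-hidden-layer scorer with shared weights (`W1`, `w2`) and a softmax over arms. Neither choice is taken from the paper, whose policy class is defined in the main text. Because each arm is scored from the pair (state, own context) only, reordering the contexts simply reorders the output probabilities.

```python
import numpy as np

def arm_scores(state, contexts, W1, w2):
    """Score every arm from (state, its own context) with shared weights.

    Hypothetical scorer for illustration only, not the paper's network.
    """
    n_arms = contexts.shape[0]
    # Pair the same state with each arm's context: shape (n_arms, d_s + d_c).
    inputs = np.concatenate([np.tile(state, (n_arms, 1)), contexts], axis=1)
    hidden = np.tanh(inputs @ W1)
    return hidden @ w2

def policy_probs(state, contexts, W1, w2):
    """Softmax over per-arm scores (an assumed normalisation)."""
    scores = arm_scores(state, contexts, W1, w2)
    scores = scores - scores.max()  # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

rng = np.random.default_rng(0)
state = rng.normal(size=3)          # one state vector
contexts = rng.normal(size=(5, 4))  # five arms, one context each
W1 = rng.normal(size=(7, 8))
w2 = rng.normal(size=8)

probs = policy_probs(state, contexts, W1, w2)
perm = rng.permutation(5)
probs_perm = policy_probs(state, contexts[perm], W1, w2)

# Reordering the arms reorders the probabilities in the same way.
print(np.allclose(probs_perm, probs[perm]))
```

A policy that instead consumed the concatenation of all contexts would have to learn this symmetry from data, which is one way to read the sample-efficiency argument above.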

In this section we make a fair comparison for variances between PG and PGCB by assuming that they share the same policy and action-value functions.

Lemma 4.

Given a policy and a value approximation , both and

are unbiased estimators for the true gradients of action-value approximation


And . Additionally if PGCB uses a fixed , as , with probability we have


It is obvious that both in and in are unbiased to . So both and are unbiased to .

To analyze the variance, we focus on the estimations of the probability of choosing an arm: and . Let Then for PGCB,


where denotes the probability of choosing at the time of sampling. In the worst case, it samples exactly the same set of arms every time, then . Otherwise, if there exist and for which the samples are different such that , then the correlation is strictly less than 1 and we have in this case. Finally, when enough time steps have passed, because is a fixed positive integer, the probability of each arm being sampled at most once is

So with probability the sampled contexts are all different from each other, hence the estimated probabilities of choosing an arm are i.i.d., and then . ∎
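The three cases in the proof can be related to a standard identity for the average of K equally correlated estimates, each with variance σ² and pairwise correlation ρ: the variance of the average is σ²(1 + (K − 1)ρ)/K. The sketch below checks this from the covariance matrix. The names `sigma2`, `rho` and `k` are illustrative assumptions rather than the paper's notation, and equal pairwise correlation is a simplification.

```python
import numpy as np

def variance_of_mean(sigma2, rho, k):
    """Variance of the average of k estimates with common variance sigma2
    and common pairwise correlation rho (illustrative assumption)."""
    # Covariance matrix: sigma2 on the diagonal, rho * sigma2 elsewhere.
    cov = sigma2 * ((1.0 - rho) * np.eye(k) + rho * np.ones((k, k)))
    ones = np.ones(k)
    # Var(mean) = 1^T cov 1 / k^2
    return float(ones @ cov @ ones) / k**2

sigma2, k = 2.0, 8
var_identical = variance_of_mean(sigma2, 1.0, k)    # same samples every time
var_partial = variance_of_mean(sigma2, 0.3, k)      # some samples differ
var_independent = variance_of_mean(sigma2, 0.0, k)  # all samples independent

print(var_identical, var_partial, var_independent)
```

With ρ = 1 the average is no better than a single estimate, with ρ < 1 it is strictly better, and with ρ = 0 the variance falls by a factor of K.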

Applying a technique similar to that of Lemma 4, we obtain the following theorem, which states that the policy gradient (12) has no higher variance than the gradient in PG.

Theorem 5.



Similar to the proof of Lemma 4, we denote the variance of by .


By the update rules (16) of PGCB, the variance of is


Because the sampled contexts in each sampling procedure are assumed to be independent and identically distributed, we have for all and the theorem is proved. ∎

Note that in practice PGCB does not need to set to a large integer, since it is naturally a finer-grained form of experience replay [3]. Surprisingly, when , PGCB can have a better performance than even in the simplest setting. In the next section, we present experimental results showing that PGCB with achieves better performance in various settings compared to other baseline methods, including PG.

The results can be interpreted as follows. From a statistical point of view, PGCB takes advantage of a resampling technique, so its estimates have lower variance. From an optimization perspective, PGCB reduces the correlation between the estimated probabilities of choosing arms within the same time step, so it is less likely than PG to suffer from over-exploitation and over-fitting. For example, when the estimated values of contexts are given, an optimizer for PG would simultaneously increase one arm's chosen probability and reduce the others', which trains the policy into a deterministic one: the arm with the largest estimated value gets a chosen probability close to 1, and the others get arbitrarily small probabilities close to 0. Afterwards, the arms with near-zero chosen probabilities hardly have any influence on further updates. So eventually, PG is likely to over-fit the existing data. In contrast, when PGCB estimates the gradients, even if an arm is not better than its competitors at its own time step, it may still be upweighted because it outranks some arms from other time steps. Therefore, PGCB tends to be more robust and explores better than PG.
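The saturation effect described above can be sketched under the assumption of a softmax policy over arm logits, which is an illustrative parameterisation and not necessarily the one used by PG in the experiments. The softmax Jacobian, dp_i/dz_j = p_i(δ_ij − p_j), shrinks towards zero as the policy becomes nearly deterministic, so arms with negligible probability contribute almost nothing to further updates.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over arm logits."""
    shifted = logits - np.max(logits)
    exp_shifted = np.exp(shifted)
    return exp_shifted / exp_shifted.sum()

def softmax_jacobian(probs):
    """Jacobian d p_i / d z_j = p_i * (delta_ij - p_j)."""
    return np.diag(probs) - np.outer(probs, probs)

# A policy that still spreads probability over the arms (hypothetical logits).
logits_soft = np.array([1.0, 0.5, 0.0])
# A nearly deterministic policy: one arm dominates.
logits_sharp = np.array([20.0, 0.5, 0.0])

p_soft = softmax(logits_soft)
p_sharp = softmax(logits_sharp)

jac_soft = softmax_jacobian(p_soft)
jac_sharp = softmax_jacobian(p_sharp)

# Largest sensitivity of any arm probability to any logit.
max_soft = np.abs(jac_soft).max()
max_sharp = np.abs(jac_sharp).max()

print(p_sharp)
print(max_soft, max_sharp)
```

In the saturated case every Jacobian entry is tiny, so a gradient step barely moves any arm's probability, which is consistent with the over-fitting behaviour attributed to PG in the paragraph above.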