Greedy Bandits with Sampled Context

07/27/2020 · by Dom Huh, et al. · George Mason University

Bayesian strategies for contextual bandits have proven promising in single-state reinforcement learning tasks by modeling uncertainty using context information from the environment. In this paper, we propose Greedy Bandits with Sampled Context (GB-SC), a method for contextual multi-armed bandits that develops a prior from the context information using Thompson Sampling and selects arms using an epsilon-greedy policy. The GB-SC framework allows for evaluation of context-reward dependency and, by leveraging the developed prior, provides robustness to partially observable context vectors. Our experimental results show competitive performance on the Mushroom environment in terms of expected regret and expected cumulative regret, as well as insights into how each context subset affects decision-making.




1 Introduction

Many real-world tasks, which often require balancing the exploration-exploitation trade-off Sutton and Barto (1998) through experimentation, can be formulated in the multi-armed bandit (MAB) paradigm. The MAB paradigm, in short, describes a system of machines, commonly referred to as arms, each with an unknown, often Bernoulli, reward distribution. For example, in a personalized web-content recommendation task, an arm would represent a piece of web content, and playing that arm would represent advertising that content. A MAB algorithm thus attempts to maximize the expected reward, approaching the optimal reward received from playing the optimal arms, based on past experimentation with different arms. Regret is the metric used to measure the difference between the expected reward and the optimal reward. Under the scope of the MAB paradigm, there exist three classes of MAB algorithms: ε-greedy Watkins (1989), upper confidence bound maximization Lai and Robbins (1985); Agrawal (1995); Katehakis and Robbins (1995); Auer et al. (2002), and uncertainty analysis under the Bayesian framework Bradt et al. (1956); Gittens and Dempster (1979); Thompson (1933). In this paper, we focus on the latter approach. Specifically, we utilize Thompson Sampling Thompson (1933); Chapelle and Li (2011), a Bayesian strategy that optimizes the MAB paradigm by probability matching using a Beta distribution, to handle context processing, and ε-greedy, a policy that chooses a random action with probability ε and acts greedily with probability 1 − ε, for arm selection.
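The two ingredients above can be sketched in a few lines. The following is a minimal Python sketch (names and structure are our own, not the paper's implementation) of Thompson Sampling over a Beta distribution together with an ε-greedy choice rule:

```python
import random

class BetaArm:
    """A Bernoulli arm modeled by a Beta(alpha, beta) posterior."""

    def __init__(self):
        self.alpha = 1.0  # prior: one pseudo-success
        self.beta = 1.0   # prior: one pseudo-failure

    def sample(self):
        # Thompson Sampling: draw from the posterior to probability-match.
        return random.betavariate(self.alpha, self.beta)

    def update(self, reward):
        # Standard Bernoulli update: success bumps alpha, failure bumps beta.
        if reward > 0:
            self.alpha += 1.0
        else:
            self.beta += 1.0

def epsilon_greedy_choice(values, epsilon):
    # With probability epsilon explore uniformly; otherwise exploit the
    # action with the highest estimated value.
    if random.random() < epsilon:
        return random.randrange(len(values))
    return max(range(len(values)), key=lambda i: values[i])
```

GB-SC, described below, combines these two pieces: Thompson Sampling to process context, and ε-greedy to select arms.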

Contextual MAB problems Li et al. (2010), in contrast to vanilla MAB, take into account the context presented at each trial, and often adapt past MAB algorithms with appropriate modifications Li et al. (2010); Chu et al. (2011); Bietti et al. (2018); Riquelme et al. (2018), representing each arm with a set of context features. These context features are typically processed using linear or nonlinear models, or a neural network.

In this paper, we introduce Greedy Bandits with Sampled Context (GB-SC), a contextual MAB framework using Thompson Sampling. To process the features, GB-SC treats each unique value of each context feature as a distribution, which is modeled using Thompson Sampling over trials. To select which arm to play, GB-SC formulates a confidence value for each action in the action space, conditioning on the K highest-confidence samples of the context feature distributions, and uses an ε-greedy policy to decide whether to play or not.

2 Preliminaries

To test our model, we experimented on the Mushroom environment specified in Guez (2015); Blundell et al. (2015); Riquelme et al. (2018), using the Mushroom dataset from the UC Irvine Machine Learning Repository Dua and Graff (2017). The Mushroom dataset includes 8124 examples of 23 species of mushrooms, each with 22 features and labeled poisonous or safe. The rules of the Mushroom environment are as follows: eating a safe mushroom provides a reward of +5; eating a poisonous mushroom provides a reward of +5 with probability 1/2 and a reward of -35 otherwise; and not eating provides a reward of 0. The action set contains whether to eat or not eat the mushroom. We ran 1500 trials to initially tune our algorithm before any tests.
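The reward rules above can be written directly as a small function; this is a sketch of the environment dynamics as described, with names of our own choosing:

```python
import random

def mushroom_reward(eat, poisonous, rng=random.random):
    """Reward rules of the Mushroom environment as described above.

    eat       -- the agent's chosen action (True = eat the mushroom)
    poisonous -- the hidden label of the presented mushroom
    rng       -- source of uniform randomness, injectable for testing
    """
    if not eat:
        return 0
    if not poisonous:
        return 5
    # Eating a poisonous mushroom: +5 with probability 1/2, else -35.
    return 5 if rng() < 0.5 else -35
```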

We diverge from the oracle defined in Guez (2015); Blundell et al. (2015); Riquelme et al. (2018): our oracle receives a reward of +5 for a safe mushroom, and for a poisonous mushroom receives a reward of 0 if the intended reward was -35, with the reward remaining +5 otherwise. Thus, the oracle we use to compute the optimal expected regret and the optimal cumulative expected regret acts as a better optimal policy than the one described in Guez (2015); Blundell et al. (2015); Riquelme et al. (2018), as it takes into account the positive upside of taking the risk of playing a poisonous mushroom. Over 100 independent 50-arm trials, the expected cumulative reward of our oracle is 60.24 points higher than that of the oracle in past works, with a standard deviation of 16.46 points.
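A sketch of this modified oracle, under our reading of the description above (the function name and argument are assumptions): the oracle always eats, but a poisonous mushroom's realized -35 penalty is replaced by 0, keeping the +5 upside.

```python
def oracle_reward(poisonous, realized_reward):
    """Reward received by the modified oracle on one mushroom.

    poisonous       -- hidden label of the mushroom
    realized_reward -- reward the environment would have given for eating
                       (+5 always for safe; +5 or -35 for poisonous)
    """
    if not poisonous:
        return 5
    # The -35 penalty is clamped to 0; the lucky +5 outcome is kept.
    return 0 if realized_reward == -35 else 5
```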

3 Greedy Bandits with Sampled Context (GB-SC)

In this section, we first describe the context processing procedure, then the arm selection policy, and lastly formalize the Greedy Bandits with Sampled Context (GB-SC) algorithm in the context of the Mushroom environment discussed in Section 2.

Given the context from the environment, we can assume the context to be a set of discrete features, which we will refer to as context subsets. Within each context subset, we can model a random variable for each unique value. Thus, we have a set of random variables for each context subset. Assuming an action space consisting of 2 actions, we can utilize Thompson Sampling to model a Beta distribution for each random variable. Fig. 1 illustrates the context processing described here, given a context vector with three explicitly shown context subsets. Further, the example shows activation of only a single node within each context subset.

With the activated nodes within the context subsets, we sample from their associated random variables to obtain a confidence value for each. We determine confidence based on how close the sample is to either zero or one: if the sample value is closer to 1, we assign the confidence to one action, and if the sample value is closer to 0, we assign the confidence to the other action.


Lastly, we average the K highest confidences for each action to obtain our action confidence. The choice of K will be explored in Section 4.
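The confidence step above can be sketched as follows. This is our own illustrative Python, with an assumed action mapping (sample > 0.5 supports action 1, otherwise action 0) and confidence measured as closeness of the sample to either extreme:

```python
def action_confidences(samples, k):
    """Compute per-action confidence from Beta samples of activated nodes.

    samples -- one sample in [0, 1] per activated context-subset node
    k       -- number of highest confidences to average per action
    """
    per_action = {0: [], 1: []}
    for s in samples:
        action = 1 if s > 0.5 else 0
        # Confidence: how close the sample lies to 0 or 1.
        per_action[action].append(max(s, 1.0 - s))
    result = {}
    for action, confs in per_action.items():
        confs.sort(reverse=True)
        # If fewer than k confidences exist, fall back to using all of them.
        top = confs[:k] if len(confs) >= k else confs
        result[action] = sum(top) / len(top) if top else 0.0
    return result
```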


For arm selection, we use the traditional ε-greedy policy to determine whether to play or not. We set the exploration value ε to the inverse of the current number of trials explored, t, to ensure logarithmic regret, which was theoretically guaranteed in Auer et al. (2002).

Figure 1: Context Processing + Confidence: The context vector is split into context subsets, each activating only one node within its respective context subset. Each activated node is sampled, and samples with the highest confidence are used to compute the action confidence.

The GB-SC algorithm combines the context processing, confidence computation, and ε-greedy policy to address the contextual MAB problem. For the Mushroom environment described in Section 2, GB-SC first creates 22 sets of random variables, each with distributions initialized to Beta(α, β), where α = 1 and β = 1. GB-SC follows the update rule of Thompson Sampling specified in Thompson (1933), with the update additionally scaled by the reward value: if the reward was positive, we increased the α parameter of that random variable by the reward value, and if the reward was negative, we increased the β parameter of that random variable by the absolute reward value. The action confidence is computed by sampling the activated random variables and averaging the K highest confidences for each action; different values of K are experimented with in Section 4. If there is not a sufficient number of confidence values for an action, we simply use all available confidence values. Then, GB-SC utilizes the ε-greedy policy, with ε set to 1/t where t is the current number of trials completed, to determine whether to play the arm or not. Playing an arm, in the context of the Mushroom environment, corresponds to eating the mushroom.
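The reward-scaled posterior update described above can be sketched as a pure function (the parameter names α and β follow the standard Beta parameterization; the function name is our own):

```python
def scaled_update(params, reward):
    """Reward-scaled Thompson update for one context node.

    params -- (alpha, beta) of the node's Beta distribution
    reward -- the observed reward for the played arm
    """
    alpha, beta = params
    if reward > 0:
        alpha += reward        # positive reward grows alpha by its value
    elif reward < 0:
        beta += abs(reward)    # negative reward grows beta by its magnitude
    return alpha, beta
```

In the Mushroom environment this means a +5 reward adds 5 to α, while a -35 reward adds 35 to β, so a single poisoning outweighs several safe meals.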

4 Results

As discussed in Section 2, we tested the GB-SC algorithm on the Mushroom environment, tracking the expected regret on each arm and the expected cumulative regret over 50-arm trials against the oracle we defined. We used 1500 trials to obtain the final distribution for each context node; progress can be seen in Fig. 2 for different values of K, which represents the number of highest confidence values taken into account when calculating the action confidence. We updated the ε for the ε-greedy policy every 150 trials of 50 arms; thus the time step of Fig. 2 is 150 trials.

We can see that there is a difference in how well the model converges depending on the value of K. So, we tested various values of K and compared performance in terms of expected cumulative regret on 50 arms over 10 trials in Fig. 3. Based on Fig. 3, we can conclude that the variance increases as K increases, possibly because too many factors then play a role in arm selection. Thus, the optimal values for K seem to be between 1 and 3.

To gain insight into which features in the context vector are used more frequently than others, we can derive feature importance within the context vector on the Mushroom environment. To determine this frequency, we explore two different approaches, and recommend other approaches be experimented with in future work.

Figure 2: Regret over Trials: The expected regret on each arm and the expected cumulative regret over 50-arm trials across 1500 trials, at a time step of 150 trials, are shown for different values of K, along with the regret of a random policy on this environment.
Figure 3: Expected cumulative regret over K: The expected cumulative regret over 50-arm trials is measured after the 1500 trials. The measurements were taken over 10 independent trials, and the mean and standard deviation are shown.

One option is to visualize all the distributions in each context subset and try to understand how each feature affects arm selection. This can be seen in Fig. 4, using GB-SC with K = 3. From Fig. 4, we can evaluate which context subsets provide higher confidence values. To clarify, the closer the distribution is to one, the higher the probability that the context subset will be used in the action confidence for play, whereas the closer the distribution is to zero, the higher the probability that it will be used in the action confidence for no play. However, this approach does not scale to larger context vectors, as the complexity of taking this approach and understanding how each context subset affects arm selection grows linearly with the number of context subsets and the number of unique values within each context subset.

Figure 4: Learned priors over Context Subsets: The probability density of the Beta distributions that represent the context subsets are shown on a symmetrical log scale.
Figure 5: Utilization Frequency over Context Subsets: The count frequency over each context subset is shown. The count is incremented by the number of arms that used the context subset over 10 independent trials.

Another option we explore is counting how frequently each context subset is selected for the action confidence calculation over a given number of trials. In Fig. 5, we set the number of trials to 10, using GB-SC with K = 3, and we can see both the most and the least utilized context subsets in computing the action confidence. Interestingly, every context subset is used at least once, but some context subsets were never used for no play.

Additionally, we experimented with partially observable context vectors, where some context subsets are missing. For this purpose, we allocate a random variable in each context subset to handle nil values. For priority masking, we removed the more frequently utilized context subsets with higher probability based on Fig. 5, and for random masking, we masked all context subsets with equal probability. The results can be seen in Fig. 6, using GB-SC with K = 3.
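The two masking schemes above can be sketched as follows; this is our own illustrative Python (function names and the use of `None` as the nil value are assumptions), where each masked subset is replaced by the nil value handled by its dedicated random variable:

```python
import random

def random_mask(context, p, rng=random.random):
    """Random masking: drop each context subset with equal probability p."""
    return [None if rng() < p else v for v in context]

def priority_mask(context, counts, budget, rng=random):
    """Priority masking: drop up to `budget` subsets, with probability
    proportional to each subset's utilization count (e.g. from Fig. 5)."""
    masked = list(context)
    # Sample subset indices weighted by utilization; duplicates collapse,
    # so at most `budget` subsets are masked.
    idx = rng.choices(range(len(context)), weights=counts, k=budget)
    for i in set(idx):
        masked[i] = None
    return masked
```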

Figure 6: Masking Context Subsets on Performance: The difference in expected cumulative regret over 10 independent trials between a masked context vector and a non-masked context vector.

5 Conclusion

In this paper, we have proposed the Greedy Bandits with Sampled Context (GB-SC) algorithm, a method for handling the contextual MAB problem by using Thompson Sampling for context processing and an ε-greedy policy for arm selection. We demonstrated competitive performance compared to the baselines shown in Riquelme et al. (2018), and showed additional benefits, such as insight into feature importance and robustness to arms with partially observable context vectors. In future work, we hope to extend the approach to continuous context vectors, either through neural memory or some other method.


  • R. Agrawal (1995) Sample mean based index policies with o(log n) regret for the multi-armed bandit problem. Advances in Applied Probability 27 (4), pp. 1054–1078. External Links: ISSN 00018678, Link Cited by: §1.
  • P. Auer, N. Cesa-Bianchi, and P. Fischer (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learning 47, pp. 235–256. External Links: Document Cited by: §1, §3.
  • A. Bietti, A. Agarwal, and J. Langford (2018) A contextual bandit bake-off. External Links: 1802.04064 Cited by: §1.
  • C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra (2015) Weight uncertainty in neural networks. External Links: 1505.05424 Cited by: §2, §2.
  • R. N. Bradt, S. M. Johnson, and S. Karlin (1956) On sequential designs for maximizing the sum of observations. Ann. Math. Statist. 27 (4), pp. 1060–1074. External Links: Document, Link Cited by: §1.
  • O. Chapelle and L. Li (2011) An empirical evaluation of thompson sampling. In Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger (Eds.), pp. 2249–2257. External Links: Link Cited by: §1.
  • W. Chu, L. Li, L. Reyzin, and R. Schapire (2011) Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, G. Gordon, D. Dunson, and M. Dudík (Eds.), Proceedings of Machine Learning Research, Vol. 15, Fort Lauderdale, FL, USA, pp. 208–214. External Links: Link Cited by: §1.
  • D. Dua and C. Graff (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: §2.
  • J. Gittens and M. Dempster (1979) Bandit processes and dynamic allocation indices [with discussion]. Journal of the Royal Statistical Society. Series B: Methodological 41, pp. 148–177. External Links: Document Cited by: §1.
  • A. Guez (2015) Sample-based search methods for bayes-adaptive planning. Ph.D. Thesis. Cited by: §2, §2.
  • M. N. Katehakis and H. Robbins (1995) Sequential choice from several populations. Proceedings of the National Academy of Sciences 92 (19), pp. 8584–8585. External Links: Document, ISSN 0027-8424, Link, Cited by: §1.
  • T. Lai and H. Robbins (1985) Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6 (1), pp. 4 – 22. External Links: ISSN 0196-8858, Document, Link Cited by: §1.
  • L. Li, W. Chu, J. Langford, and R. E. Schapire (2010) A contextual-bandit approach to personalized news article recommendation. CoRR abs/1003.0146. External Links: Link, 1003.0146 Cited by: §1.
  • C. Riquelme, G. Tucker, and J. Snoek (2018) Deep bayesian bandits showdown: an empirical comparison of bayesian deep networks for thompson sampling. External Links: 1802.09127 Cited by: §1, §2, §2, §5.
  • R. S. Sutton and A. G. Barto (1998) Introduction to reinforcement learning. 1st edition, MIT Press, Cambridge, MA, USA. External Links: ISBN 0262193981 Cited by: §1.
  • W. R. Thompson (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25 (3/4), pp. 285–294. External Links: ISSN 00063444, Link Cited by: §1, §3.
  • C. J. C. H. Watkins (1989) Learning from delayed rewards. Ph.D. Thesis, King’s College, Cambridge, UK. Cited by: §1.