1 Introduction
Many real-world tasks that require balancing the exploration-exploitation tradeoff Sutton and Barto (1998) through experimentation can be formulated as a multi-armed bandit (MAB) problem. The MAB paradigm, in short, describes a system of machines, commonly referred to as arms, each with an unknown, often Bernoulli, reward distribution. For example, in personalized web-content recommendation, an arm would represent a piece of web content, and playing that arm would represent advertising that content. A MAB algorithm attempts to maximize the expected reward, based on past experimentation with different arms, so as to approach the optimal reward, which is the reward received from playing the optimal arms. Regret is the metric that measures the difference between the optimal reward and the expected reward. Within the MAB paradigm, there exist three classes of MAB algorithms: $\epsilon$-greedy Watkins (1989), upper confidence bound maximization Lai and Robbins (1985); Agrawal (1995); Katehakis and Robbins (1995); Auer et al. (2002), and uncertainty analysis under the Bayesian framework Bradt et al. (1956); Gittens and Dempster (1979); Thompson (1933). In this paper, we focus on the last approach: we use Thompson Sampling Thompson (1933); Chapelle and Li (2011), a Bayesian strategy that optimizes the MAB paradigm by probability matching using a Beta distribution, to handle context processing, and $\epsilon$-greedy, a policy that chooses a random action with probability $\epsilon$ and acts greedily with probability $1 - \epsilon$, for arm selection.
Contextual MAB problems Li et al. (2010), in contrast to vanilla MAB, take into account the context presented at each trial, and often reuse existing MAB algorithms with appropriate modifications Li et al. (2010); Chu et al. (2011); Bietti et al. (2018); Riquelme et al. (2018), representing each arm with a set of context features. These context features are often processed using linear or nonlinear models, frequently a neural network.
In this paper, we introduce Greedy Bandits with Sampled Context (GBSC), a contextual MAB framework using Thompson Sampling. To process the features, GBSC treats each unique value of each context feature as a distribution, each of which is modeled using Thompson Sampling over trials. To select which arm to play, GBSC formulates a confidence value for each action in the action space, conditioned on the highest-confidence samples from the context feature distributions, and uses an $\epsilon$-greedy policy to decide whether to play or not.
2 Preliminaries
To test our model, we experimented on the Mushroom environment specified in Guez (2015); Blundell et al. (2015); Riquelme et al. (2018), using the Mushroom dataset from the UC Irvine Machine Learning Repository Dua and Graff (2017). The Mushroom dataset includes 8124 examples of 23 species of mushrooms, each with 22 features and labeled poisonous or safe. The rules of the Mushroom environment are as follows: eating a safe mushroom provides a reward of +5, eating a poisonous mushroom provides a reward of +5 with probability 1/2 and a reward of -35 otherwise, and not eating provides a reward of 0. The set of actions consists of eating or not eating the mushroom. We ran 1500 trials to initially tune our algorithm before any tests.
Our oracle diverges from the one defined in Guez (2015); Blundell et al. (2015); Riquelme et al. (2018). It receives a reward of +5 for a safe mushroom; for a poisonous mushroom, it receives a reward of 0 if the intended reward was -35, and otherwise the reward remains +5. Thus, our oracle, which is used to compute the optimal expected regret and the optimal cumulative expected regret, acts as a better optimal policy than the one described in Guez (2015); Blundell et al. (2015); Riquelme et al. (2018), as it takes into account the positive upside of taking the risk of playing on a poisonous mushroom. Over 100 independent trials of 50 arms each, the expected cumulative reward of our oracle is 60.24 points higher than that of the oracle in past works, with a standard deviation of 16.46 points.
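The reward rules and the two oracles described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and the reading of the earlier oracle (eat only safe mushrooms) is an assumption based on the description in this section.

```python
import random

SAFE_REWARD = 5
POISON_LUCKY_REWARD = 5
POISON_UNLUCKY_REWARD = -35


def realized_eat_reward(is_poisonous, rng):
    """Reward for eating a mushroom under the environment rules."""
    if not is_poisonous:
        return SAFE_REWARD
    # A poisonous mushroom pays +5 or -35 with equal probability.
    if rng.random() < 0.5:
        return POISON_LUCKY_REWARD
    return POISON_UNLUCKY_REWARD


def prior_oracle_reward(is_poisonous):
    """Assumed earlier oracle: eat safe mushrooms, skip poisonous ones."""
    return 0 if is_poisonous else SAFE_REWARD


def risk_aware_oracle_reward(eat_reward):
    """Oracle of this section: skip only when eating would yield -35."""
    return max(eat_reward, 0)


if __name__ == "__main__":
    rng = random.Random(0)
    outcome = realized_eat_reward(True, rng)
    print(outcome, prior_oracle_reward(True), risk_aware_oracle_reward(outcome))
```

Under these rules, the risk-aware oracle earns 0.5 × 5 = 2.5 more per poisonous mushroom in expectation than an oracle that never eats poisonous mushrooms, which is the source of the gap in expected cumulative reward between the two.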
3 Greedy Bandits with Sampled Context (GBSC)
In this section, we first describe the context processing procedure, then the arm selection policy, and lastly formalize the Greedy Bandits with Sampled Context (GBSC) algorithm in the context of the Mushroom environment discussed in Section 2.
Given the context from the environment, we assume the context to be a set of discrete features, which we will refer to as context subsets. Within each context subset, we model a random variable for each unique value. Thus, we have a set of random variables for each context subset. Assuming an action space consisting of two actions, we can use Thompson Sampling to model a Beta distribution for each random variable. The example in Fig. 1 illustrates the context processing described here for a context vector with three explicitly shown context subsets. Further, the example shows the activation of only a single node within each context subset.
With the activated nodes within the context subsets, we sample from their associated random variables to obtain a confidence value for each. We determine the confidence based on how close the sample is to either zero or one. If the sample value is closer to 1, we assign the confidence to one action, and if the sample value is closer to 0, we assign the confidence to the other action.
(1) 
Lastly, for each action, we average the highest confidence values to obtain the action confidence. The number of highest confidence values to average will be explored in Section 4.
(2) 
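The sampling and averaging steps above can be sketched in a few lines. The exact confidence formula is not recoverable from the text, so this sketch assumes that a sample at or above 0.5 gives the play action a confidence equal to the sample, and a sample below 0.5 gives the no-play action a confidence of one minus the sample. The names `sample_confidences`, `action_confidence`, and `top_n` are hypothetical, as is returning 0.0 for an action with no confidence values.

```python
import random

PLAY, NO_PLAY = 1, 0


def sample_confidences(active_params, rng):
    """Draw one Beta sample per activated node and map it to an
    (action, confidence) pair.

    active_params: list of (alpha, beta), one per context subset.
    Assumption: a sample closer to 1 supports PLAY with confidence s;
    a sample closer to 0 supports NO_PLAY with confidence 1 - s.
    """
    pairs = []
    for alpha, beta in active_params:
        s = rng.betavariate(alpha, beta)
        if s >= 0.5:
            pairs.append((PLAY, s))
        else:
            pairs.append((NO_PLAY, 1.0 - s))
    return pairs


def action_confidence(pairs, action, top_n):
    """Average the top_n highest confidences assigned to `action`.

    Uses every available value when fewer than top_n exist, and
    returns 0.0 when the action received none (an assumption).
    """
    values = sorted((c for a, c in pairs if a == action), reverse=True)
    if not values:
        return 0.0
    top = values[:top_n]
    return sum(top) / len(top)


if __name__ == "__main__":
    rng = random.Random(0)
    # Three activated nodes, all starting from Beta(1, 1).
    pairs = sample_confidences([(1.0, 1.0)] * 3, rng)
    print(action_confidence(pairs, PLAY, 2), action_confidence(pairs, NO_PLAY, 2))
```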
For arm selection, we use the traditional framework of the $\epsilon$-greedy policy to determine whether to play or not. We set the exploration value, $\epsilon$, to the inverse of the current number of trials explored, to ensure a logarithmic regret, which was theoretically guaranteed in Auer et al. (2002).
(3) 
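A minimal sketch of this arm selection rule, assuming the action confidences are held in a list indexed by action and that trials are counted from 1. The function names are illustrative only.

```python
import random


def epsilon_for_trial(t):
    """Exploration value as the inverse of the trial count (t >= 1)."""
    return 1.0 / t


def epsilon_greedy(action_confidences, t, rng):
    """Return an action index: a uniformly random one with probability
    epsilon, otherwise the one with the highest action confidence."""
    epsilon = epsilon_for_trial(t)
    if rng.random() < epsilon:
        return rng.randrange(len(action_confidences))
    return max(range(len(action_confidences)),
               key=lambda i: action_confidences[i])


if __name__ == "__main__":
    rng = random.Random(0)
    # Index 0 = no play, index 1 = play (an arbitrary convention here).
    for t in (1, 10, 100):
        print(t, epsilon_for_trial(t), epsilon_greedy([0.62, 0.81], t, rng))
```

With this schedule the first trial is always exploratory, since the exploration value equals 1, and exploration becomes rarer as the trial count grows.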
The GBSC algorithm combines the context processing, confidence, and $\epsilon$-greedy policy to address the contextual MAB problem. For the Mushroom environment described in Section 2, GBSC first creates 22 sets of random variables, each with a distribution initialized to $\mathrm{Beta}(\alpha, \beta)$, where $\alpha = 1$ and $\beta = 1$. GBSC follows the update rule of Thompson Sampling specified in Thompson (1933), and additionally we scale the update by the reward value. Thus, if the reward is positive, we increase the $\alpha$ parameter of that random variable by the reward value, and if the reward is negative, we increase the $\beta$ parameter of that random variable by the absolute reward value. The action confidence is computed by sampling the activated random variables and averaging the highest confidence values for each action. Different settings for the number of highest confidence values averaged are examined in Section 4. If an action has fewer confidence values than that number, we simply use all of its confidence values. Then, GBSC applies the $\epsilon$-greedy policy, with $\epsilon$ set to the inverse of the current number of trials completed, to determine whether to play the arm or not. In the context of the Mushroom environment, choosing whether to play an arm means choosing whether or not to eat a mushroom.
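The reward-scaled update can be sketched as below. The table layout (one Beta node per pair of feature index and feature value) and the choice to update every node activated by the context are assumptions made for illustration; the text does not specify which activated nodes receive the update.

```python
from collections import defaultdict


def new_node_table():
    """Map (feature_index, feature_value) to [alpha, beta], with every
    node starting at Beta(1, 1)."""
    return defaultdict(lambda: [1.0, 1.0])


def update_active_nodes(nodes, context, reward):
    """Apply the reward-scaled update to each node activated by
    `context` (a sequence of discrete feature values).

    A positive reward is added to alpha, the magnitude of a negative
    reward is added to beta, and a zero reward changes nothing.
    """
    for feature_index, value in enumerate(context):
        params = nodes[(feature_index, value)]
        if reward > 0:
            params[0] += reward
        elif reward < 0:
            params[1] += abs(reward)


if __name__ == "__main__":
    nodes = new_node_table()
    context = ["x", "s", "n"]  # hypothetical feature values
    update_active_nodes(nodes, context, 5)
    update_active_nodes(nodes, context, -35)
    print(dict(nodes))
```

Because the update is scaled by the reward, a single -35 outcome shifts a node toward zero much more strongly than a single +5 outcome shifts it toward one.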
4 Results
As discussed in Section 2, we tested the GBSC algorithm on the Mushroom environment, tracking the expected regret on each arm and the expected cumulative regret over trials of 50 arms, measured against the oracle we have defined. We used 1500 trials to obtain the final distribution for each context node, and progress can be seen in Fig. 2 for different settings of the number of highest confidence values taken into account when calculating the action confidence. We updated $\epsilon$ for the $\epsilon$-greedy policy every 150 trials of 50 arms; thus, each time step in Fig. 2 corresponds to 150 trials.
We can see that how well the model converges does depend on this number. We therefore tested various settings and compared their performance in terms of expected cumulative regret on 50 arms with 10 trials in Fig. 3. Based on Fig. 3, we can conclude that the variance increases as the number of highest confidence values increases, possibly because too many factors then play a role in the arm selection. Thus, the optimal setting seems to be between 1 and 3.
To gain insight into which features in the context vector are used more frequently than others, we can derive feature importance within the context vector on the Mushroom environment. To determine this frequency, we explore two different approaches, and recommend that other approaches be examined in future work.
One option is to visualize all the distributions in each context subset and try to understand how each feature affects arm selection. This can be seen in Fig. 4, using GBSC with the three highest confidence values. From Fig. 4, we can evaluate which context subsets provide higher confidence values. To clarify, the closer the distribution is to one, the higher the probability that the context subset will be used for the action confidence for play, whereas the closer the distribution is to zero, the higher the probability that the context subset will be used for the action confidence for no play. However, this approach does not scale to larger context vectors, as the effort of understanding how each context subset affects arm selection grows linearly with the number of context subsets and the number of unique values within each context subset.
Another option we explore is counting how many times each context subset is selected for action confidence calculations over a given, arbitrary number of trials. In Fig. 5, we set the number of trials to 10, using GBSC with the three highest confidence values, and we can see both the most utilized and the least utilized context subsets for computing the action confidence. Interestingly, every context subset is used at least once, although some context subsets were never used for no play.
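The counting approach can be sketched as follows, assuming each trial step is recorded as a list of (subset index, action, confidence) entries and that a context subset counts as "selected" when its confidence is among the highest values averaged for an action. All names here are hypothetical.

```python
from collections import Counter

PLAY, NO_PLAY = 1, 0


def top_contributors(entries, action, top_n):
    """Return the subset indices behind the top_n highest confidences
    assigned to `action` in one step."""
    ranked = sorted((e for e in entries if e[1] == action),
                    key=lambda e: e[2], reverse=True)
    return [subset for subset, _, _ in ranked[:top_n]]


def count_usage(steps, top_n):
    """Count, per action, how often each context subset contributed
    to the action confidence across all recorded steps."""
    counts = {PLAY: Counter(), NO_PLAY: Counter()}
    for entries in steps:
        for action in counts:
            counts[action].update(top_contributors(entries, action, top_n))
    return counts


if __name__ == "__main__":
    steps = [
        [(0, PLAY, 0.9), (1, PLAY, 0.6), (2, NO_PLAY, 0.8)],
        [(0, PLAY, 0.7), (1, NO_PLAY, 0.9), (2, NO_PLAY, 0.6)],
    ]
    usage = count_usage(steps, top_n=1)
    print(usage[PLAY].most_common(), usage[NO_PLAY].most_common())
```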
Additionally, we experimented with partially observable context vectors, where some context subsets are missing. For this purpose, a random variable should be allocated in each context subset to handle nil values. For priority masking, we removed the more frequently utilized context subsets with higher probability, based on Fig. 5, and for random masking, we masked all context subsets with equal probability. The results can be seen in Fig. 6, using GBSC with the three highest confidence values.
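The two masking schemes can be sketched as below. The text does not state how usage counts are converted into masking probabilities, so the proportional scaling in `priority_mask_probs`, the `max_p` parameter, and the `NIL` sentinel are all assumptions introduced for illustration.

```python
import random

NIL = None  # hypothetical sentinel for a missing context subset


def random_mask_probs(num_subsets, p):
    """Random masking: every context subset shares the same probability."""
    return [p] * num_subsets


def priority_mask_probs(usage_counts, max_p):
    """Priority masking (assumed form): probability proportional to the
    usage count, so the most used subset is masked with probability max_p."""
    top = max(usage_counts)
    if top == 0:
        return [0.0] * len(usage_counts)
    return [max_p * count / top for count in usage_counts]


def mask_context(context, mask_probs, rng):
    """Replace each feature value by NIL with its masking probability.
    A NIL value would then activate the dedicated nil node of its subset."""
    return [NIL if rng.random() < p else value
            for value, p in zip(context, mask_probs)]


if __name__ == "__main__":
    rng = random.Random(0)
    context = ["x", "s", "n"]  # hypothetical feature values
    print(mask_context(context, random_mask_probs(3, 0.3), rng))
    print(mask_context(context, priority_mask_probs([10, 5, 0], 0.5), rng))
```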
5 Conclusion
In this paper, we have proposed the Greedy Bandits with Sampled Context (GBSC) algorithm, a method for handling contextual MAB problems that uses Thompson Sampling for context processing and an $\epsilon$-greedy policy for arm selection. We demonstrated competitive performance compared to the baselines shown in Riquelme et al. (2018), and we showed additional benefits, such as insight into feature importance and robustness to partially observable context vectors. In future work, we hope to extend the approach to continuous context vectors, either through neural memory or some other method.
References
Agrawal (1995). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability 27 (4), pp. 1054–1078. ISSN 0001-8678.
Auer et al. (2002). Finite-time analysis of the multi-armed bandit problem. Machine Learning 47, pp. 235–256.
Bietti et al. (2018). A contextual bandit bake-off. arXiv:1802.04064.
Blundell et al. (2015). Weight uncertainty in neural networks. arXiv:1505.05424.
Bradt et al. (1956). On sequential designs for maximizing the sum of n observations. Ann. Math. Statist. 27 (4), pp. 1060–1074.
Chapelle and Li (2011). An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger (Eds.), pp. 2249–2257.

Chu et al. (2011). Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, G. Gordon, D. Dunson, and M. Dudík (Eds.), Proceedings of Machine Learning Research, Vol. 15, Fort Lauderdale, FL, USA, pp. 208–214.
Dua and Graff (2017). UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences.
Gittens and Dempster (1979). Bandit processes and dynamic allocation indices [with discussion]. Journal of the Royal Statistical Society. Series B: Methodological 41, pp. 148–177.
Guez (2015). Sample-based search methods for Bayes-adaptive planning. Ph.D. Thesis.
Katehakis and Robbins (1995). Sequential choice from several populations. Proceedings of the National Academy of Sciences 92 (19), pp. 8584–8585. ISSN 0027-8424. https://www.pnas.org/content/92/19/8584.full.pdf
Lai and Robbins (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6 (1), pp. 4–22. ISSN 0196-8858.
Li et al. (2010). A contextual-bandit approach to personalized news article recommendation. CoRR abs/1003.0146.
Riquelme et al. (2018). Deep Bayesian bandits showdown: an empirical comparison of Bayesian deep networks for Thompson sampling. arXiv:1802.09127.
Sutton and Barto (1998). Introduction to reinforcement learning. 1st edition, MIT Press, Cambridge, MA, USA. ISBN 0262193981.
Thompson (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25 (3/4), pp. 285–294. ISSN 0006-3444.
Watkins (1989). Learning from delayed rewards. Ph.D. Thesis, King's College, Cambridge, UK.