Introduction
In sequential decision problems such as clinical trials [35] or recommender systems [29, 12], a decision-making algorithm must select among several actions at each given time point. Each of these actions is associated with side information, or context (e.g., a user's profile), and the reward feedback is limited to the chosen option. For example, in clinical trials [35, 17, 11], the actions correspond to the treatment options being compared, the context is the patient's medical record (e.g., health condition, family history, etc.), and the reward represents the outcome (successful or not) of the proposed treatment. In this setting, we are looking for a good tradeoff between the exploration of a new drug and the exploitation of the known one.
This inherent exploration-exploitation tradeoff exists in many sequential decision problems, and is traditionally modeled as the multi-armed bandit (MAB) problem, stated as follows: there are K "arms" (possible actions), each associated with a fixed but unknown reward probability distribution [24, 4, 27]. At each step, an agent plays an arm (chooses an action) and receives a reward. This reward is drawn according to the selected arm's distribution and is independent of the previous actions. A particularly useful version of MAB is the contextual multi-armed bandit problem. In this problem, at each iteration, before choosing an arm, the agent observes a d-dimensional feature vector, or context, associated with each arm. The learner uses these contexts, along with the rewards of the arms played in the past, to choose which arm to play in the current iteration. Over time, the learner's aim is to collect enough information about the relationship between the context vectors and rewards, so that it can predict the next best arm to play by looking at the corresponding contexts (feature vectors)
[2]. One well-known solution for the contextual bandit problem is the LINUCB algorithm, which is based on online ridge regression and uses the concept of an upper confidence bound [26] to strategically balance exploration and exploitation. The parameter alpha of LINUCB essentially controls this exploration/exploitation tradeoff. The problem is that it is difficult to decide in advance the optimal value of alpha. We introduce in this paper two algorithms, named "OPLINUCB" and "DOPLINUCB", that compute the value of alpha, in both stationary and switching environments, by adaptively balancing exploration and exploitation according to the context. The main contributions of this paper include (1) proposing two new algorithms for learning the exploration parameter, in both stationary and non-stationary settings, (2) extending the existing bandit algorithms to this new setting, and (3) evaluating the algorithms empirically on a variety of datasets.
Related Work
The multi-armed bandit problem is a model of the exploration versus exploitation tradeoff, where a player gets to pick, within a finite set of decisions, the one maximizing the cumulative reward. This problem has been extensively studied. Optimal solutions have been provided using a stochastic formulation [24, 4, 15, 28, 13, 8, 9, 14, 30], a Bayesian formulation [18, 33, 16], or an adversarial formulation [6, 5, 7]. However, these approaches do not take into account the context, which may affect the arm's performance. In LINUCB [26, 21]
, in Contextual Thompson Sampling (CTS) [2], and in the neural bandit [3], the authors assume a linear dependency between the expected reward of an action and its context; the representation space is modeled using a set of linear predictors. However, the exploration magnitude in these algorithms needs to be given by the user. The authors in [31] address the exploration tradeoff by learning a good exploration strategy on offline tasks built from synthetic data, on which the contextual bandit setting can be simulated. Based on these simulations, the proposed algorithm uses an imitation learning strategy to learn a good exploration policy that can then be applied to true contextual bandit tasks at test time. The authors compare their algorithm to seven strong baseline contextual bandit algorithms on a set of three hundred real-world datasets, on which it outperforms the alternatives in most settings, especially when differences in rewards are large.
In [10], the authors show that greedy algorithms that exploit current estimates without any exploration may be suboptimal in general. However, exploration-free greedy algorithms are desirable in practical settings where exploration may be costly or unethical (e.g., clinical trials). They find that a simple greedy algorithm can be rate-optimal if there is sufficient randomness in the observed contexts. They prove that this is always the case for a two-armed bandit under a general class of context distributions that satisfy a condition they term covariate diversity. Furthermore, even absent this condition, they show that a greedy algorithm can be rate-optimal with positive probability.
As we can see, none of these previously proposed approaches involves dynamically learning the exploration tradeoff in the contextual bandit setting, which is the main focus of this work.
Key Notions
This section focuses on introducing the key notions used in this paper.
The Contextual Bandit Problem.
Following [25], this problem is defined as follows. At each time point (iteration) t, a player is presented with a context (feature vector) c(t) before choosing an arm k from the set of arms A = {1, ..., K}. We will denote by C the set of features (variables) defining the context. Let r(t) = (r_1(t), ..., r_K(t)) denote a reward vector, where r_k(t) is the reward at time t associated with the arm k. Herein, we will primarily focus on the Bernoulli bandit with binary rewards, i.e. r_k(t) in {0, 1}. Let pi denote a policy mapping contexts to arms, and let D denote the joint distribution over contexts and rewards. We will assume that the expected reward is a linear function of the context, i.e. E[r_k(t) | c(t)] = mu_k^T c(t), where mu_k is an unknown weight vector (to be learned from the data) associated with the arm k.
Thompson Sampling (TS).
TS [34], also known as Bayesian posterior sampling, is a classical approach to the multi-armed bandit problem, where the reward r for choosing an arm k at time t is assumed to follow a distribution Pr(r | theta_k) with parameter theta_k. Given a prior Pr(theta_k) on these parameters, their posterior distribution is given by the Bayes rule, Pr(theta_k | r) proportional to Pr(r | theta_k) Pr(theta_k). A particular case of the Thompson Sampling approach assumes a Bernoulli bandit problem, with rewards being 0 or 1, and the parameters theta_k following a Beta prior.
TS initially assumes each arm k to have a uniform prior Beta(1, 1) on theta_k (the probability of success). At time t, having observed S_k(t) successes (reward = 1) and F_k(t) failures (reward = 0) for arm k, the algorithm updates the distribution on theta_k as Beta(S_k(t) + 1, F_k(t) + 1). The algorithm then generates independent samples from these posterior distributions of the theta_k, and selects the arm with the largest sample value. For more details, see, for example, [1].
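The update-then-sample loop above can be sketched as follows (a minimal illustration of Bernoulli Thompson Sampling, not the paper's algorithm; the function and variable names, and the simulated `true_probs`, are our own):

```python
import random

def thompson_sampling(true_probs, horizon=10_000, seed=0):
    """Bernoulli Thompson Sampling with Beta(1, 1) priors.

    `true_probs` (the arms' success probabilities) is only used to
    simulate rewards; the algorithm itself never observes it.
    """
    rng = random.Random(seed)
    n_arms = len(true_probs)
    successes = [0] * n_arms  # S_k(t)
    failures = [0] * n_arms   # F_k(t)
    total_reward = 0
    for _ in range(horizon):
        # Sample theta_k ~ Beta(S_k + 1, F_k + 1) for each arm ...
        samples = [rng.betavariate(successes[k] + 1, failures[k] + 1)
                   for k in range(n_arms)]
        # ... and play the arm with the largest sampled value.
        k = max(range(n_arms), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_probs[k] else 0
        successes[k] += reward
        failures[k] += 1 - reward
        total_reward += reward
    return successes, failures, total_reward
```

On two arms with success probabilities 0.2 and 0.8, the posterior over the better arm concentrates quickly, so most of the pulls end up on it.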
Regression Tree
In full generality, unsupervised learning is an ill-posed problem. Nevertheless, in some situations it is possible to back the clustering problem by a supervised one. In our setting, we can use the reward estimation as a supervision signal for the group creation. Thus, we just need a supervised learning technique that explicitly creates groups that we can reuse. See [23] for a survey on such techniques. Among all existing approaches, the use of a tree built by recursive partitioning is a popular one. It allows estimating different means under specific explanatory variables and has an interpretation as a regression analysis [32]. Moreover, some efficient implementations, such as CART [20] and C4.5, are available.
We consider regression models describing the conditional distribution of a response variable Y given the status of m covariates by means of tree-structured recursive partitioning. The m-dimensional covariate vector X = (X_1, ..., X_m) is taken from a sample space. Both the response variable and the covariates may be measured at arbitrary scales. We assume that the conditional distribution D(Y | X) of the response variable given the covariates depends on a function f of the covariates: D(Y | X) = D(Y | X_1, ..., X_m) = D(Y | f(X_1, ..., X_m)), with f unknown. Let L_n = {(Y_i, X_1i, ..., X_mi); i = 1, ..., n} be a learning sample for training the regression model, consisting of a random sample of n independent and identically distributed observations, possibly with some covariate observations missing.
CTree (Conditional Inference Tree) is an algorithm derived from CART. CTree, proposed by Hothorn et al. (2006, doi:10.1198/106186006X133933), is a non-parametric class of regression trees embedding tree-structured regression models into a well-defined theory of conditional inference procedures. The main advantage of CTree is its ability to handle any kind of regression problem, including nominal, ordinal, numeric, censored, as well as multivariate response variables, and arbitrary measurement scales of the covariates. CTree also manages the bias induced by maximizing a splitting criterion over all possible splits simultaneously (Shih, 2004). The algorithm associates with the learning sample a vector of non-negative integer-valued case weights; each node of a tree is represented by such a vector, having non-zero elements when the corresponding observations are elements of the node and zero otherwise.
This algorithm stops when the global null hypothesis of independence between the response and any of the covariates cannot be rejected at a pre-specified nominal level. Details about variable selection, stopping criteria, splitting criteria, missing values, and surrogate splits can be found in Hothorn et al. (2006, doi:10.1198/106186006X133933). The algorithm handles missing values and uses a Bonferroni correction to counteract the problem of multiple comparisons. The computational complexity of the variable selection depends on the nature of the covariates: for a continuous variable, searching the optimal split is of order O(n); for a nominal covariate measured at L levels, the number of possible splits to evaluate is bounded by 2^(L-1) - 1.
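To make the recursive-partitioning idea concrete, here is a minimal single-covariate regression tree whose leaves store conditional means. It is a deliberate simplification: it uses the exhaustive CART-style sum-of-squared-errors split and a minimum-leaf-size stopping rule, not CTree's permutation tests and Bonferroni-corrected stopping criterion; all names are our own.

```python
def sse(ys):
    """Sum of squared errors of ys around their mean."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(xs, ys):
    """Exhaustive search for the threshold minimizing total SSE
    (the CART criterion; CTree would use a conditional test instead)."""
    best = None  # (total_sse, threshold)
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        total = sse(left) + sse(right)
        if best is None or total < best[0]:
            best = (total, t)
    return best

def grow_tree(xs, ys, min_leaf=5):
    """Recursively partition one continuous covariate; a leaf is the
    conditional mean of the response within that cell."""
    split = best_split(xs, ys)
    if split is None or len(ys) < 2 * min_leaf:
        return sum(ys) / len(ys)
    _, t = split
    left = [(x, y) for x, y in zip(xs, ys) if x < t]
    right = [(x, y) for x, y in zip(xs, ys) if x >= t]
    if len(left) < min_leaf or len(right) < min_leaf:
        return sum(ys) / len(ys)
    return (t,
            grow_tree([x for x, _ in left], [y for _, y in left], min_leaf),
            grow_tree([x for x, _ in right], [y for _, y in right], min_leaf))

def predict(tree, x):
    """Descend to a leaf and return its stored conditional mean."""
    while isinstance(tree, tuple):
        t, l, r = tree
        tree = l if x < t else r
    return tree
```

On a response that jumps from 0 to 1 at x = 10, the tree recovers the step: it splits at the jump and the two leaves store the two conditional means.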
Algorithms for Learning the Exploration Value
We describe here two algorithms that learn the exploration parameter of the contextual bandit algorithm.
OPLINUCB
The first proposed algorithm is named "OPLINUCB", for Non-Parametric LINUCB. This algorithm solves a two-level multi-armed bandit problem. The first level is a classical multi-armed bandit problem applied to find the exploration parameter of the algorithm. The second level is a contextual bandit problem that uses the parameter found in the first level to find the optimal arm to play. Let N_i(t) be the number of times the i-th exploration value has been selected so far, let C_i(t) be the cumulative reward associated with the exploration value alpha_i, and let r_k(t) be the reward associated with the arm k at time t. The algorithm takes as input the candidate values for alpha, as well as the initial values of the Beta distribution parameters in TS. At each iteration t, we update the values of those parameters, S_i and F_i (steps 5 and 6), to represent the current total number of successes and failures, respectively, and then sample the "probability of success" parameter theta_i from the corresponding Beta distribution, separately for each exploration value, to estimate the mean reward conditioned on the use of the value alpha_i (step 7). The pseudocode of OPLINUCB is sketched in Algorithm 2.
In Algorithm 2, A_t is the set of arms available at iteration t, x_{t,k} is the feature vector of arm k with dimension d, theta_k is the unknown coefficient vector of the features of arm k, alpha is a constant, and D_k is a design matrix of dimension m x d at trial t, whose rows correspond to the m training inputs (e.g., the contexts previously observed for arm k), and c_k (a vector of length m) is the corresponding response vector (e.g., the corresponding click/no-click user feedback). Applying ridge regression to the training data gives an estimate of the coefficients: theta_hat_k = (D_k^T D_k + I_d)^{-1} D_k^T c_k, where I_d is the d x d identity matrix and the components of c_k are independent conditioned on the corresponding rows of D_k.
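The two levels can be sketched as follows. This is a simplified reconstruction based on the description above, not the authors' exact pseudocode: the class and function names, the Beta(1, 1) initialization, and the disjoint-LinUCB update rule are assumptions for illustration.

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB with online ridge regression (after Li et al., 2010)."""
    def __init__(self, n_arms, dim):
        self.A = [np.eye(dim) for _ in range(n_arms)]    # D_k^T D_k + I_d
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # D_k^T c_k

    def select(self, contexts, alpha):
        scores = []
        for A, b, x in zip(self.A, self.b, contexts):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b  # ridge estimate theta_hat_k
            scores.append(theta @ x + alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, context, reward):
        self.A[arm] += np.outer(context, context)
        self.b[arm] += reward * context

def oplinucb(contexts_stream, reward_fn, candidate_alphas, seed=0):
    """First level: Thompson Sampling over the candidate alpha values.
    Second level: LinUCB played with the sampled alpha."""
    rng = np.random.default_rng(seed)
    n_arms, dim = contexts_stream[0].shape
    bandit = LinUCB(n_arms, dim)
    S = np.ones(len(candidate_alphas))  # successes + 1 per alpha_i
    F = np.ones(len(candidate_alphas))  # failures + 1 per alpha_i
    total = 0
    for contexts in contexts_stream:
        i = int(np.argmax(rng.beta(S, F)))  # sample theta_i, pick alpha_i
        arm = bandit.select(contexts, candidate_alphas[i])
        r = reward_fn(arm, contexts[arm])
        bandit.update(arm, contexts[arm], r)
        S[i] += r
        F[i] += 1 - r
        total += r
    return total
```

With a toy reward function that always pays arm 0, the inner LinUCB quickly locks onto the paying arm and the outer sampler concentrates on whichever alpha earns the most reward.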
DOPLINUCB
The second proposed algorithm is named "DOPLINUCB", for Dynamic Non-Parametric LINUCB. This algorithm solves a two-level contextual bandit problem. In the first level, the algorithm uses the context to decide on the exploration value of the algorithm; this is done using the CTree algorithm described in Algorithm 1. The second level is also a contextual bandit problem, where the algorithm uses the context and the exploration value provided by the first level to find the optimal arm to play. The pseudocode of DOPLINUCB is sketched in Algorithm 3.
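The first level (choosing the exploration value as a function of the context) can be illustrated in isolation. The sketch below is a strong simplification and entirely our own construction: a single fixed binary split on one feature stands in for the CTree partition, and one Beta posterior is kept per (leaf, candidate alpha) pair; the paper instead grows the partition with CTree.

```python
import random

class ContextualAlphaSelector:
    """Choose an exploration value alpha as a function of the context.

    A fixed threshold on context[0] plays the role of the tree
    partition (illustrative assumption; not the CTree of Algorithm 1).
    """
    def __init__(self, candidate_alphas, threshold=0.5, seed=0):
        self.alphas = candidate_alphas
        self.threshold = threshold
        self.rng = random.Random(seed)
        # One Beta(1, 1) posterior per (leaf, candidate alpha) pair.
        self.stats = {leaf: [[1, 1] for _ in candidate_alphas]
                      for leaf in (0, 1)}

    def _leaf(self, context):
        return 0 if context[0] < self.threshold else 1

    def choose(self, context):
        """Thompson Sampling over alphas, inside the context's leaf."""
        leaf = self._leaf(context)
        samples = [self.rng.betavariate(s, f) for s, f in self.stats[leaf]]
        i = max(range(len(samples)), key=lambda j: samples[j])
        return i, self.alphas[i]

    def update(self, context, i, reward):
        leaf = self._leaf(context)
        self.stats[leaf][i][0] += reward
        self.stats[leaf][i][1] += 1 - reward
```

When the best alpha differs between the two leaves, the per-leaf posteriors diverge and the selector learns a context-dependent exploration value, which is the behavior DOPLINUCB aims for.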
Experimentation
In the following, we consider two scenarios: one where we evaluate the algorithms in a stationary environment, and a second with a non-stationary environment.
Stationary Environment
In this scenario we consider the case of two Bernoulli arms; our dataset is the Adult dataset [22]. The reward is derived from the task of predicting whether a person makes over 50K a year. Each person is described by some categorical and continuous attributes (age, work class, etc.). We fix an interval of candidate values for alpha, discretized into one hundred values. The mean, median, min, and max of the empirical average cumulative regret of LinUCB over the different values of alpha are provided in Table 1. The reward function of each arm is stationary. We ran LinUCB with the hundred different values of alpha and compared its empirical average cumulative regret with those of DOPLINUCB and OPLINUCB. Without a training set, it is not possible to use DOPLINUCB; OPLINUCB, however, still chooses an alpha among the hundred candidate values. DOPLINUCB and OPLINUCB cannot beat the best fixed value (min in Table 1), but OPLINUCB beats the median and the mean every time.
Table 1: Empirical average cumulative regret in the stationary environment, for different training set sizes.

size of train set |     0   |   1000  |   5000  |  10 000
max               |   5216  |   5116  |   4406  |   3658
min               |   5096  |   4964  |   4297  |   3560
mean              |  5152.1 | 5035.17 | 5035.17 | 3605.63
median            |   5149  |   5031  |   5031  |   3605
OPLINUCB          |   5121  |   4981  |   4334  |   3571
DOPLINUCB         |    --   |   5014  |   4399  |   3654
Non-Stationary Environment
In this experiment, we consider a more challenging setting: the reward function of each arm changes after a fixed number of iterations. We keep a training set of 5000 items. Table 2 shows the cumulative regrets when the reward function of each arm changes at iteration 100, 1000, or 10 000.
We can observe that in this scenario DOPLINUCB outperforms the best fixed value of alpha (min in Table 2) when the switch happens at iteration 100 or 1000. This can be explained by the fact that the proposed algorithm learns a dependency between the context and the right exploration value, which is not possible with any fixed value. These results show that when we are facing a non-stationary environment, having a dynamic exploration-exploitation tradeoff is useful: learning this context/exploration dependency is beneficial for the contextual bandit algorithm.
Table 2: Cumulative regret in the non-stationary environment, for different switch iterations.

switch iteration |    100   |   1000   |  10 000
max              |   16123  |   15970  |   11063
min              |   15869  |   15331  |    9291
mean             | 15973.04 | 15584.21 | 10179.37
median           |  15960.5 |   15570  |   10200
OPLINUCB         |   16061  |   15630  |   12601
DOPLINUCB        |   13720  |   13569  |   10702
Conclusion
We have studied the problem of learning the exploration-exploitation tradeoff in the contextual bandit problem with a linear reward function. In the traditional algorithms that solve the contextual bandit problem, the exploration is a parameter that is tuned by the user. Our proposed algorithms, in contrast, learn to choose the right exploration parameter in an online manner, based on the observed context and the immediate reward received for the chosen action. We have presented two algorithms that use a bandit to find the optimal exploration value for the contextual bandit algorithm. The evaluation showed that the two proposed algorithms give better results than fixed choices of the exploration parameter in most settings, which we hope is a first step toward a fully automated multi-armed bandit algorithm.
References
[1] (2012) Analysis of Thompson sampling for the multi-armed bandit problem. In COLT 2012 - The 25th Annual Conference on Learning Theory, Edinburgh, Scotland, pp. 39.1–39.26.
[2] (2013) Thompson sampling for contextual bandits with linear payoffs. In ICML (3), pp. 127–135.
[3] (2014) A neural networks committee for the contextual bandit problem. In Neural Information Processing - 21st International Conference, ICONIP 2014, Kuching, Malaysia, Proceedings, Part I, pp. 374–381.
[4] (2002) Finite-time analysis of the multi-armed bandit problem. Machine Learning 47 (2-3), pp. 235–256.
[5] (2002) The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32 (1), pp. 48–77.
[6] (1998) Online learning with malicious noise and the closure algorithm. Ann. Math. Artif. Intell. 23 (1-2), pp. 83–99.
[7] (2018) Using contextual bandits with behavioral constraints for constrained online movie recommendation. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, Stockholm, Sweden, pp. 5802–5804.
[8] (2019) Incorporating behavioral constraints in online AI systems. AAAI 2019. arXiv:1809.05720.
[9] (2019) Using multi-armed bandits to learn ethical priorities for online AI systems. IBM Journal of Research and Development 63 (4/5), pp. 1:1–1:13.
[10] (2017) Mostly exploration-free algorithms for contextual bandits. arXiv preprint arXiv:1704.09011.
[11] (2012) A contextual-bandit algorithm for mobile context-aware recommender system. In Neural Information Processing - 19th International Conference, ICONIP 2012, Doha, Qatar, Proceedings, Part III, Lecture Notes in Computer Science, Vol. 7665, pp. 324–331.
[12] (2013) Risk-aware recommender systems. In Neural Information Processing - 20th International Conference, ICONIP 2013, Daegu, Korea, Proceedings, Part I, Lecture Notes in Computer Science, Vol. 8226, pp. 57–65.
[13] (2016) Multi-armed bandit problem with known trend. Neurocomputing 205, pp. 16–21.
[14] (2014) Contextual bandit for active learning: active Thompson sampling. In Neural Information Processing - 21st International Conference, ICONIP 2014, Kuching, Malaysia, Proceedings, Part I, pp. 405–412.
[15] (2019) Optimal exploitation of clustering and history information in multi-armed bandit. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, pp. 2016–2022.
[16] (2017) Context attentive bandits: contextual bandit with restricted context. In IJCAI 2017, Melbourne, Australia, pp. 1468–1475.
[17] (2017) Bandit models of human behavior: reward processing in mental disorders. In Artificial General Intelligence - 10th International Conference, AGI 2017, Melbourne, VIC, Australia, Proceedings, Lecture Notes in Computer Science, Vol. 10414, pp. 237–248.
[18] (2019) A survey on practical applications of multi-armed and contextual bandits. CoRR abs/1904.10040.
[19] (2016) Exponentiated gradient exploration for active learning. Computers 5 (1), pp. 1.
[20] (1984) Classification and regression trees. Wadsworth and Brooks, Monterey, CA.
[21] (2011) Contextual bandits with linear payoff functions. In AISTATS, JMLR Proceedings, Vol. 15, pp. 208–214.
[22] (1994) UCI machine learning repository. Silicon Graphics.
[23] (2007) Supervised machine learning: a review of classification techniques. Informatica 31, pp. 249–268.
[24] (1985) Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6 (1), pp. 4–22.
[25] (2008) The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, pp. 817–824.
[26] (2010) A contextual-bandit approach to personalized news article recommendation. CoRR.
[27] (2018) Contextual bandit with adaptive feature extraction. In 2018 IEEE International Conference on Data Mining Workshops, ICDM Workshops, Singapore, pp. 937–944.
[28] (2018) Contextual bandit with adaptive feature extraction. In 2018 IEEE International Conference on Data Mining Workshops, ICDM Workshops, Singapore, pp. 937–944.
[29] (2015) Bandits and recommender systems. In Machine Learning, Optimization, and Big Data - First International Workshop, MOD 2015, Taormina, Sicily, Italy, Revised Selected Papers, pp. 325–336.
[30] (2018) Interpretable multi-objective reinforcement learning through policy orchestration. CoRR abs/1809.08343.
[31] (2019) Meta-learning for contextual bandit exploration. arXiv preprint arXiv:1901.08159.
[32] (1999) The asymptotic theory of permutation statistics. Mathematical Methods of Statistics 8, pp. 220–250.
[33] (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25, pp. 285–294.
[34] (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25 (3/4), pp. 285–294.
[35] (2015) Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical Science 30 (2), pp. 199.