In sequential decision problems such as clinical trials or recommender systems [29, 12], a decision-making algorithm must select among several actions at each time point. Each of these actions is associated with side information, or context (e.g., a user’s profile), and the reward feedback is limited to the chosen option. For example, in clinical trials [35, 17, 11], the actions correspond to the treatment options being compared, the context is the patient’s medical record (e.g., health condition, family history, etc.), and the reward represents the outcome (successful or not) of the proposed treatment. In this setting, we are looking for a good trade-off between exploring the new drug and exploiting the known drug.
This inherent exploration-exploitation trade-off exists in many sequential decision problems, and is traditionally modeled as a multi-armed bandit (MAB) problem, stated as follows: there are $K$ “arms” (possible actions), each associated with a fixed but unknown reward probability distribution [24, 4, 27]. At each step, an agent plays an arm (chooses an action) and receives a reward. This reward is drawn according to the selected arm’s law and is independent of the previous actions.
A particularly useful version of MAB is the contextual multi-armed bandit problem. In this problem, at each iteration, before choosing an arm, the agent observes a $d$-dimensional feature vector, or context, associated with each arm. The learner uses these contexts, along with the rewards of the arms played in the past, to choose which arm to play in the current iteration. Over time, the learner’s aim is to collect enough information about the relationship between the context vectors and rewards, so that it can predict the next best arm to play by looking at the corresponding contexts (feature vectors). One smart solution for the contextual bandit is the LINUCB algorithm, which is based on online ridge regression and uses the concept of an upper confidence bound to strategically balance exploration and exploitation. Its parameter $\alpha$ essentially controls the exploration/exploitation balance. The problem is that it is difficult to decide in advance the optimal value of $\alpha$. We introduce in this paper two algorithms, named “OPLINUCB” and “DOPLINUCB”, that compute the optimal value of $\alpha$ in both stationary and switching environments by adaptively balancing exploration and exploitation according to the context.
The main contributions of this paper are (1) proposing two new algorithms, for both stationary and non-stationary settings, (2) extending existing bandit algorithms to this new setting, and (3) evaluating the algorithms empirically on a variety of datasets.
The multi-armed bandit problem is a model of the exploration versus exploitation trade-off, where a player picks, within a finite set of decisions, the one maximizing the cumulative reward. This problem has been extensively studied. Optimal solutions have been provided using a stochastic formulation [24, 4, 15, 28, 13, 8, 9, 14, 30], a Bayesian formulation [18, 33, 16], or an adversarial formulation [6, 5, 7]. However, these approaches do not take into account the context, which may affect the arm’s performance. In LINUCB [26, 21], in Contextual Thompson Sampling (CTS) [2], and in the neural bandit [3], the authors assume a linear dependency between the expected reward of an action and its context; the representation space is modeled using a set of linear predictors. However, the exploration magnitude of these algorithms needs to be given by the user.
The authors of [31] address the exploration trade-off by learning a good exploration strategy offline on synthetic data, on which the contextual bandit setting can be simulated. Based on these simulations, the proposed algorithm uses an imitation learning strategy to learn a good exploration policy that can then be applied to true contextual bandit tasks at test time. The authors compare the algorithm to seven strong baseline contextual bandit algorithms on a set of three hundred real-world datasets, on which it outperforms alternatives in most settings, especially when differences in rewards are large.
In [10], the authors show that greedy algorithms that exploit current estimates without any exploration may be sub-optimal in general. However, exploration-free greedy algorithms are desirable in practical settings where exploration may be costly or unethical (e.g., clinical trials). They find that a simple greedy algorithm can be rate-optimal if there is sufficient randomness in the observed contexts. They prove that this is always the case for a two-armed bandit under a general class of context distributions that satisfy a condition they term covariate diversity. Furthermore, even absent this condition, they show that a greedy algorithm can be rate-optimal with positive probability.
As we can see, none of these previously proposed approaches dynamically learns the exploration trade-off in the contextual bandit setting, which is the main focus of this work.
This section focuses on introducing the key notions used in this paper.
The Contextual Bandit Problem.
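Following [25], this problem is defined as follows. At each time point (iteration) $t \in \{1, \dots, T\}$, a player is presented with a context (feature vector) $c(t) \in \mathbb{R}^d$ before choosing an arm $k \in \mathcal{A} = \{1, \dots, K\}$. We will denote by $C = \{C_1, \dots, C_d\}$ the set of features (variables) defining the context. Let $r(t) = (r_1(t), \dots, r_K(t))$ denote a reward vector, where $r_k(t) \in [0, 1]$ is a reward at time $t$ associated with the arm $k \in \mathcal{A}$. Herein, we will primarily focus on the Bernoulli bandit with binary reward, i.e. $r_k(t) \in \{0, 1\}$. Let $\pi : C \rightarrow \mathcal{A}$ denote a policy. Also, $D_{c,r}$ denotes a joint distribution over $(c, r)$. We will assume that the expected reward is a linear function of the context, i.e. $E[r_k(t) \mid c(t)] = \mu_k^\top c(t)$, where $\mu_k$ is an unknown weight vector (to be learned from the data) associated with the arm $k$.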
Thompson Sampling (TS).
TS [34], also known as Bayesian posterior sampling, is a classical approach to the multi-armed bandit problem, where the reward $r_k(t)$ for choosing an arm $k$ at time $t$ is assumed to follow a distribution $Pr(r_t \mid \tilde{\mu})$ with the parameter $\tilde{\mu}$. Given a prior $Pr(\tilde{\mu})$ on these parameters, their posterior distribution is given by the Bayes rule, $Pr(\tilde{\mu} \mid r_t) \propto Pr(r_t \mid \tilde{\mu}) Pr(\tilde{\mu})$. A particular case of the Thompson Sampling approach assumes a Bernoulli bandit problem, with rewards being 0 or 1, and the parameters following the Beta prior.
TS initially assumes arm $k$ to have prior $Beta(1, 1)$ on $\mu_k$ (the probability of success). At time $t$, having observed $S_k(t)$ successes (reward = 1) and $F_k(t)$ failures (reward = 0), the algorithm updates the distribution on $\mu_k$ as $Beta(S_k(t) + 1, F_k(t) + 1)$. The algorithm then generates independent samples $\theta_k(t)$ from these posterior distributions of the $\mu_k$, and selects the arm with the largest sample value. For more details, see, for example, [1].
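The Beta-Bernoulli procedure described above (count successes and failures per arm, sample from each Beta posterior, play the arm with the largest sample) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `thompson_sampling`, the simulated arms `success_probs`, and the seed are hypothetical, and the true success probabilities are used only to simulate rewards.

```python
import random


def thompson_sampling(success_probs, horizon, seed=0):
    """Beta-Bernoulli Thompson Sampling on simulated arms.

    success_probs: hidden success probability of each arm (hypothetical
    simulation input; the decision rule never reads it directly).
    Returns (pulls, successes, failures), one entry per arm.
    """
    rng = random.Random(seed)
    n_arms = len(success_probs)
    successes = [0] * n_arms  # observed rewards equal to 1, per arm
    failures = [0] * n_arms   # observed rewards equal to 0, per arm
    pulls = [0] * n_arms
    for _ in range(horizon):
        # One posterior sample per arm, starting from a Beta(1, 1) prior.
        samples = [
            rng.betavariate(successes[k] + 1, failures[k] + 1)
            for k in range(n_arms)
        ]
        arm = max(range(n_arms), key=samples.__getitem__)
        reward = 1 if rng.random() < success_probs[arm] else 0
        if reward == 1:
            successes[arm] += 1
        else:
            failures[arm] += 1
        pulls[arm] += 1
    return pulls, successes, failures
```

For instance, `thompson_sampling([0.2, 0.8], horizon=2000)` should concentrate most of its pulls on the second arm, since its posterior samples quickly dominate.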
In full generality, unsupervised learning is an ill-posed problem. Nevertheless, in some situations it is possible to back the clustering problem by a supervised one. In our setting, we can use the reward estimation as supervision for the group creation. Thus we just need a supervised learning technique which explicitly creates groups that we can reuse. See [23] for a survey on such techniques. Among all existing approaches, the use of a tree built by recursive partitioning is a popular one. It allows estimating different means under specific explanatory variables and can be interpreted as a regression analysis. Moreover, efficient implementations such as “CART” [20] and “C4.5” are available.
We consider regression models describing the conditional distribution of a response variable $Y$ given the status of $m$ covariates by means of tree-structured recursive partitioning. The $m$-dimensional covariate vector $X = (X_1, \dots, X_m)$ is taken from a sample space $\mathcal{X} = \mathcal{X}_1 \times \dots \times \mathcal{X}_m$. Both the response variable and the covariates may be measured at arbitrary scales. We assume that the conditional distribution $D(Y \mid X)$ of the response variable $Y$ given the covariates $X$ depends on a function $f$ of the covariates:
$$D(Y \mid X) = D(Y \mid X_1, \dots, X_m) = D(Y \mid f(X_1, \dots, X_m)).$$
Let $\mathcal{L}_n$ be a learning sample for fitting a model of the regression relationship, i.e., a random sample of $n$ independent and identically distributed observations, possibly with some covariate observations $X_{ji}$ missing:
$$\mathcal{L}_n = \{(Y_i, X_{1i}, \dots, X_{mi});\ i = 1, \dots, n\}.$$
CTree (conditional inference tree) is an algorithm derived from CART. It was proposed in doi:10.1198/106186006X133933 (2006) and is a non-parametric class of regression trees embedding tree-structured regression models into a well-defined theory of conditional inference procedures. The main advantage of CTree is that it handles any kind of regression problem, including nominal, ordinal, numeric, censored, and multivariate response variables, and arbitrary measurement scales of the covariates. CTree also manages the bias induced by maximizing a splitting criterion over all possible splits simultaneously (SHIH2004457). The following algorithm considers, for a learning sample $\mathcal{L}_n$, non-negative integer-valued case weights $w = (w_1, \dots, w_n)$. Each node of a tree is represented by a vector of case weights having non-zero elements when the corresponding observations are elements of the node and zero otherwise.
This algorithm stops when the global null hypothesis of independence between the response and any of the covariates cannot be rejected at a pre-specified nominal level. Details about variable selection, stopping criteria, splitting criteria, missing values, and surrogate splits can be found in doi:10.1198/106186006X133933 (2006). This algorithm handles missing values and uses a Bonferroni correction to counteract the problem of multiple comparisons. The computational complexity of the variable selection depends on the nature of the covariates: for a continuous variable, searching for the optimal split is of order $n \log n$; for a nominal covariate measured at $K$ levels, the number of possible binary splits to evaluate is at most $2^{K-1} - 1$.
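The stopping rule and the Bonferroni correction mentioned above can be illustrated with a short sketch. It assumes that one unadjusted p-value per covariate is already available from a test of independence between that covariate and the response; computing those p-values is out of scope here. The function names `select_covariate` and `n_binary_splits` and the default level of 0.05 are hypothetical choices for illustration, not part of any CTree implementation.

```python
def select_covariate(p_values, level=0.05):
    """Return the index of the covariate to split on, or None to stop.

    p_values: one unadjusted p-value per covariate (assumed precomputed).
    level: nominal significance level (hypothetical default).
    """
    m = len(p_values)
    if m == 0:
        return None
    # Bonferroni adjustment for m simultaneous tests, capped at 1.
    adjusted = [min(1.0, m * p) for p in p_values]
    best = min(range(m), key=adjusted.__getitem__)
    if adjusted[best] > level:
        # Global null hypothesis of independence is not rejected: stop.
        return None
    return best


def n_binary_splits(levels):
    """Number of distinct binary splits of a nominal covariate."""
    return 2 ** (levels - 1) - 1
```

With three covariates and p-values `[0.2, 0.004, 0.03]`, the adjusted values are 0.6, 0.012 and 0.09, so the second covariate is selected; with `[0.2, 0.03, 0.4]` the smallest adjusted value is 0.09 and the recursion stops at the 0.05 level.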
Algorithms for Learning the Exploration Value
We describe here two algorithms that learn the exploration value of the contextual bandit algorithm.
The first proposed algorithm is named “OPLINUCB”, for Non-Parametric LINUCB. This algorithm has to solve a two-level multi-armed bandit problem. The first level is a classical multi-armed bandit problem applied to find the exploration parameter of the algorithm. The second level is a contextual bandit problem that uses the parameter found in the first level to find the optimal arm to play. Let $n_i(t)$ be the number of times the $i$-th exploration value has been selected so far, let $r_i^f(t)$ be the cumulative reward associated with the exploration value $i$, and let $r_k(t)$ be the reward associated with the arm $k$ at time $t$. The algorithm takes as input the candidate values for $\alpha$, as well as the initial values of the Beta distribution parameters in TS. At each iteration, we update the values of those parameters, $S_i$ and $F_i$ (steps 5 and 6), to represent the current total number of successes and failures, respectively, and then sample the “probability of success” parameter $\theta_i$ from the corresponding Beta distribution, separately for each exploration value $i$, to estimate $\mu_i$, which is the mean reward conditioned on the use of the exploration value $i$ (step 7).
The pseudo-code of OPLINUCB is sketched in Algorithm 2.
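As a rough illustration of the first level only, the sketch below runs Thompson Sampling over a finite list of candidate exploration values. It is a simplified reading of the description above under stated assumptions, not the authors' pseudo-code: the class `ExplorationSelector`, the helper `run_first_level`, and the function `reward_prob_for` are hypothetical, and `reward_prob_for` stands in for the second-level contextual bandit by returning the probability of a reward of 1 when a given exploration value is used.

```python
import random


class ExplorationSelector:
    """Thompson Sampling over candidate exploration values (sketch)."""

    def __init__(self, candidates, prior_success=1.0, prior_failure=1.0, seed=0):
        self.candidates = list(candidates)
        # Beta parameters per candidate, initialised from the prior.
        self.successes = [prior_success] * len(self.candidates)
        self.failures = [prior_failure] * len(self.candidates)
        self.rng = random.Random(seed)

    def select(self):
        """Sample one value per candidate and return the best index."""
        samples = [
            self.rng.betavariate(s, f)
            for s, f in zip(self.successes, self.failures)
        ]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, index, reward):
        """Record a binary reward obtained while using candidate `index`."""
        if reward == 1:
            self.successes[index] += 1
        else:
            self.failures[index] += 1


def run_first_level(selector, reward_prob_for, horizon, seed=1):
    """Simulate the loop; reward_prob_for(alpha) replaces the second level."""
    rng = random.Random(seed)
    counts = [0] * len(selector.candidates)
    for _ in range(horizon):
        i = selector.select()
        alpha = selector.candidates[i]
        reward = 1 if rng.random() < reward_prob_for(alpha) else 0
        selector.update(i, reward)
        counts[i] += 1
    return counts
```

In a full system the simulated reward would be replaced by the reward that the contextual bandit obtains after playing an arm with the selected exploration value.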
In Algorithm 2, $\mathcal{A}_t$ is the set of arms at iteration $t$, where $x_{a,t}$ is the $d$-dimensional feature vector of arm $a$, $\theta_a^*$ is the unknown coefficient vector associated with the feature vector $x_{a,t}$, $\alpha$ is a constant, and $A_a = D_a^\top D_a + I_d$. $D_a$ is a design matrix of dimension $m \times d$ at trial $t$, whose rows correspond to $m$ training inputs (e.g., $m$ contexts that are observed previously for arm $a$), and $c_a \in \mathbb{R}^m$ is the corresponding response vector (e.g., the corresponding $m$ click/no-click user feedback). Applying ridge regression to the training data $(D_a, c_a)$ gives an estimate of the coefficients: $\hat{\theta}_a = (D_a^\top D_a + I_d)^{-1} D_a^\top c_a$, where $I_d$ is the $d \times d$ identity matrix and the components of $c_a$ are independent conditioned on the corresponding rows in $D_a$.
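The ridge regression estimate and the upper confidence bound used by LINUCB can be sketched without any external library. The code below is an illustrative sketch, assuming one independent linear model per arm and a regularizer equal to the identity matrix; the class `LinUCBArm` and the function `choose_arm` are hypothetical names. Instead of inverting the matrix at each step, it keeps the inverse up to date with the Sherman-Morrison formula, which is an implementation choice of this sketch and not something stated in the text.

```python
import math


class LinUCBArm:
    """Per-arm ridge regression state for a LinUCB-style score (sketch)."""

    def __init__(self, dim):
        # Inverse of (D^T D + I_d); equal to the identity before any data.
        self.a_inv = [[1.0 if i == j else 0.0 for j in range(dim)]
                      for i in range(dim)]
        self.b = [0.0] * dim  # running sum of reward * context, i.e. D^T c

    def _mat_vec(self, x):
        return [sum(row[j] * x[j] for j in range(len(x))) for row in self.a_inv]

    def theta(self):
        """Ridge estimate (D^T D + I_d)^{-1} D^T c."""
        return self._mat_vec(self.b)

    def ucb(self, x, alpha):
        """Estimated reward plus alpha times the confidence width."""
        a_inv_x = self._mat_vec(x)
        mean = sum(t * xi for t, xi in zip(self.theta(), x))
        width = math.sqrt(max(0.0, sum(xi * v for xi, v in zip(x, a_inv_x))))
        return mean + alpha * width

    def update(self, x, reward):
        """Add one (context, reward) pair via a rank-one inverse update."""
        a_inv_x = self._mat_vec(x)
        denom = 1.0 + sum(xi * v for xi, v in zip(x, a_inv_x))
        dim = len(x)
        for i in range(dim):
            for j in range(dim):
                # The stored inverse is symmetric, so the outer product
                # of a_inv_x with itself is the Sherman-Morrison correction.
                self.a_inv[i][j] -= a_inv_x[i] * a_inv_x[j] / denom
        for i in range(dim):
            self.b[i] += reward * x[i]


def choose_arm(arms, contexts, alpha):
    """Return the index of the arm with the largest score."""
    scores = [arm.ucb(x, alpha) for arm, x in zip(arms, contexts)]
    return max(range(len(arms)), key=scores.__getitem__)
```

A larger `alpha` widens the confidence term, so rarely played arms are favored; with `alpha = 0` the rule reduces to a greedy choice on the ridge estimates.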
The second proposed algorithm is named “DOPLINUCB”, for Dynamic Non-Parametric LINUCB. This algorithm has to solve a two-level contextual bandit problem. In the first level, the algorithm uses the context to decide on the exploration value; this is done using the CTree algorithm described in Algorithm 1. The second level is also a contextual bandit problem, where the algorithm uses the context and the exploration value provided by the first level to find the optimal arm to play. The pseudo-code of DOPLINUCB is sketched in Algorithm 3.
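To make the idea of a context-dependent exploration value concrete, the sketch below maps each context to a cell of a partition and picks, per cell, the candidate value with the best mean reward in an offline training log. This is only one possible reading and rests on several assumptions: the partition function `leaf_of` is a hand-written stand-in for the leaf assignment of a fitted tree (it is not CTree), the log format `(context, alpha, reward)` is hypothetical, and the name `fit_exploration_policy` is introduced here for illustration.

```python
from collections import defaultdict


def fit_exploration_policy(training_log, leaf_of, candidates):
    """Build a function mapping a context to an exploration value.

    training_log: iterable of (context, alpha, reward) triples (assumed format).
    leaf_of: maps a context to a hashable cell id (stand-in for a tree leaf).
    candidates: candidate exploration values; the first one is the fallback
    for cells that never appear in the training log.
    """
    totals = defaultdict(lambda: [0.0, 0])  # (cell, alpha) -> [sum, count]
    for context, alpha, reward in training_log:
        entry = totals[(leaf_of(context), alpha)]
        entry[0] += reward
        entry[1] += 1

    best = {}  # cell -> (alpha, mean reward)
    for (cell, alpha), (total, count) in totals.items():
        mean = total / count
        if cell not in best or mean > best[cell][1]:
            best[cell] = (alpha, mean)

    default = candidates[0]

    def policy(context):
        return best.get(leaf_of(context), (default, None))[0]

    return policy
```

The returned `policy` would then supply the exploration value passed to the second-level contextual bandit at each iteration. With an empty training log every context falls back to the first candidate, which mirrors the remark that a training set is needed for the context-dependent variant to be useful.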
We present in the following two scenarios: one where we evaluate the algorithms in a stationary environment, and a second one with a non-stationary environment.
In this scenario we consider the case of two Bernoulli arms; our dataset is the Adult dataset [22]. The reward comes from a prediction task: determining whether a person makes over 50K a year. Each person is defined by categorical and continuous information (age, work class, etc.). We fix an interval of possible $\alpha$ values and sample it with a constant step. The mean, median, min and max of the empirical average cumulative regret of LinUCB over the different values of $\alpha$ are provided in Table 1. The reward function of each arm is stationary. We ran LinUCB with a hundred different values of $\alpha$ and compared its empirical average cumulative regret with that of DOPLINUCB and OPLINUCB. Without a training set, it is not possible to use DOPLINUCB. Nevertheless, OPLINUCB chooses an $\alpha$ among the 100 candidate values. DOPLINUCB and OPLINUCB cannot beat the best value (min in Table 1), but OPLINUCB beats the median and the mean every time.
| size of train set | 0 | 1000 | 5000 | 10 000 |
In this experiment, we consider a challenging setting: the reward function of each arm changes at a fixed iteration. We keep a training set of 5000 items. Table 2 shows the cumulative regret when the reward function of each arm changes at iteration 100, 1000, or 10 000.
We can observe that in this scenario the proposed algorithm outperforms the best fixed value (min in the table). This can be explained by the fact that the proposed algorithm learns a dependency between the context and the right exploration value, which is not the case for the min approach. These results show that, when facing a non-stationary environment, a dynamic exploration-exploitation trade-off is useful: learning this context/exploration dependency is beneficial for the contextual bandit algorithm.
| switch iteration | 100 | 1000 | 10 000 |
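For readers who want to reproduce the flavor of a switching experiment, the sketch below computes the cumulative expected regret of a policy on Bernoulli arms whose success probabilities change at fixed intervals. Everything in it is an assumption made for illustration: the text does not specify how the reward functions change, so the phase list `phases`, the wrap-around schedule, the policy signature, and the function name `cumulative_regret` are hypothetical, and the numbers are not the paper's data.

```python
import random


def cumulative_regret(policy, phases, switch_every, horizon, seed=0):
    """Cumulative expected regret on switching Bernoulli arms (sketch).

    phases: list of per-phase success probabilities, one list per phase;
    the active phase advances every `switch_every` iterations and wraps.
    policy: callable (t, history) -> arm index, where history is the list
    of (arm, reward) pairs observed so far.
    """
    rng = random.Random(seed)
    history = []
    regret = 0.0
    for t in range(horizon):
        probs = phases[(t // switch_every) % len(phases)]
        arm = policy(t, history)
        reward = 1 if rng.random() < probs[arm] else 0
        history.append((arm, reward))
        # Expected regret: gap between the best arm and the chosen arm.
        regret += max(probs) - probs[arm]
    return regret
```

For example, a policy that always plays the first arm on `phases = [[0.8, 0.2], [0.2, 0.8]]` with a switch every 100 iterations pays nothing during the first phase and a gap of 0.6 per step during the second.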
We have studied the problem of learning the exploration-exploitation trade-off in the contextual bandit problem with a linear reward function. In the traditional algorithms that solve the contextual bandit problem, the exploration is a parameter that is tuned by the user. In contrast, our proposed algorithms learn to choose the right exploration parameter in an online manner, based on the observed context and the immediate reward received for the chosen action. We have presented two algorithms that use a bandit to find the optimal exploration value of the contextual bandit algorithm. The evaluation showed encouraging results for the two proposed algorithms, which we hope is a first step toward an automated multi-armed bandit algorithm.
- [1] (2012) Analysis of Thompson sampling for the multi-armed bandit problem. In COLT 2012 - The 25th Annual Conference on Learning Theory, June 25-27, 2012, Edinburgh, Scotland, pp. 39.1–39.26.
- [2] (2013) Thompson sampling for contextual bandits with linear payoffs. In ICML (3), pp. 127–135.
- [3] A neural networks committee for the contextual bandit problem. In Neural Information Processing - 21st International Conference, ICONIP 2014, Kuching, Malaysia, November 3-6, 2014. Proceedings, Part I, pp. 374–381.
- [4] (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learning 47 (2-3), pp. 235–256.
- [5] (2002) The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32 (1), pp. 48–77.
- [6] (1998) On-line learning with malicious noise and the closure algorithm. Ann. Math. Artif. Intell. 23 (1-2), pp. 83–99.
- [7] Using contextual bandits with behavioral constraints for constrained online movie recommendation. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, pp. 5802–5804.
- [8] (2019) Incorporating behavioral constraints in online AI systems. AAAI 2019.
- [9] (2019) Using multi-armed bandits to learn ethical priorities for online AI systems. IBM Journal of Research and Development 63 (4/5), pp. 1:1–1:13.
- [10] (2017) Mostly exploration-free algorithms for contextual bandits. arXiv preprint arXiv:1704.09011.
- [11] (2012) A contextual-bandit algorithm for mobile context-aware recommender system. In Neural Information Processing - 19th International Conference, ICONIP 2012, Doha, Qatar, November 12-15, 2012, Proceedings, Part III, T. Huang, Z. Zeng, C. Li, and C. Leung (Eds.), Lecture Notes in Computer Science, Vol. 7665, pp. 324–331.
- [12] (2013) Risk-aware recommender systems. In Neural Information Processing - 20th International Conference, ICONIP 2013, Daegu, Korea, November 3-7, 2013. Proceedings, Part I, M. Lee, A. Hirose, Z. Hou, and R. M. Kil (Eds.), Lecture Notes in Computer Science, Vol. 8226, pp. 57–65.
- [13] (2016) Multi-armed bandit problem with known trend. Neurocomputing 205, pp. 16–21.
- [14] Contextual bandit for active learning: active Thompson sampling. In Neural Information Processing - 21st International Conference, ICONIP 2014, Kuching, Malaysia, November 3-6, 2014. Proceedings, Part I, pp. 405–412.
- [15] (2019) Optimal exploitation of clustering and history information in multi-armed bandit. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, S. Kraus (Ed.), pp. 2016–2022.
- [16] (2017) Context attentive bandits: contextual bandit with restricted context. In IJCAI 2017, Melbourne, Australia, August 19-25, 2017, pp. 1468–1475.
- [17] (2017) Bandit models of human behavior: reward processing in mental disorders. In Artificial General Intelligence - 10th International Conference, AGI 2017, Melbourne, VIC, Australia, August 15-18, 2017, Proceedings, T. Everitt, B. Goertzel, and A. Potapov (Eds.), Lecture Notes in Computer Science, Vol. 10414, pp. 237–248.
- [18] (2019) A survey on practical applications of multi-armed and contextual bandits. CoRR abs/1904.10040.
- [19] (2016) Exponentiated gradient exploration for active learning. Computers 5 (1), pp. 1.
- [20] (1984) Classification and regression trees. Wadsworth and Brooks, Monterey, CA.
- [21] (2011) Contextual bandits with linear payoff functions. In AISTATS, G. J. Gordon, D. B. Dunson, and M. Dudik (Eds.), JMLR Proceedings, Vol. 15, pp. 208–214.
- [22] (1994) UCI machine learning repository. Silicon Graphics.
- [23] (2007) Supervised machine learning: a review of classification techniques. Informatica 31, pp. 249–268.
- [24] (1985) Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6 (1), pp. 4–22.
- [25] The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in neural information processing systems, pp. 817–824.
- [26] (2010) A contextual-bandit approach to personalized news article recommendation. CoRR.
- [27] Contextual bandit with adaptive feature extraction. In 2018 IEEE International Conference on Data Mining Workshops, ICDM Workshops, Singapore, Singapore, November 17-20, 2018, pp. 937–944.
- [28] (2018) Contextual bandit with adaptive feature extraction. In 2018 IEEE International Conference on Data Mining Workshops, ICDM Workshops, Singapore, Singapore, November 17-20, 2018, H. Tong, Z. J. Li, F. Zhu, and J. Yu (Eds.), pp. 937–944.
- [29] (2015) Bandits and recommender systems. In Machine Learning, Optimization, and Big Data - First International Workshop, MOD 2015, Taormina, Sicily, Italy, July 21-23, 2015, Revised Selected Papers, pp. 325–336.
- [30] Interpretable multi-objective reinforcement learning through policy orchestration. CoRR abs/1809.08343.
- [31] (2019) Meta-learning for contextual bandit exploration. arXiv preprint arXiv:1901.08159.
- [32] (1999) The asymptotic theory of permutation statistics. Mathematical Methods of Statistics 8, pp. 220–250.
- [33] (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25, pp. 285–294.
- [34] (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25 (3/4), pp. 285–294.
- [35] (2015) Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical science: a review journal of the Institute of Mathematical Statistics 30 (2), pp. 199.