1 Introduction
The contextual bandit problem is a variant of the extensively studied multi-armed bandit problem Lai and Robbins (1985); Gittins (1979); Auer et al. (2002); Lin et al. (2018), where at each iteration the agent observes a context (feature vector) and uses it, along with the rewards of the arms played in the past, to decide which arm to play Langford and Zhang (2008); Agarwal et al. (2009); Auer et al. (2003); Bouneffouf and Féraud (2016). The objective of the agent is to learn the relationship between the context and the reward, in order to find the arm-selection policy that maximizes the cumulative reward over time.
Recently, a promising variant of contextual bandits, called the Context Attentive Bandit (CAB), was proposed in Bouneffouf et al. (2017): no context is given by default, but the agent can request to observe (to focus its "attention" on) a limited number of context variables at each iteration. We propose here an extension of this problem setting: a small subset of context variables is revealed at each iteration (i.e., the context is partially observable), after which the agent chooses a limited number of additional features to observe; the sizes of both subsets and the set of immediately observed features are fixed across iterations. The agent must learn to select both the best additional features and, subsequently, the best arm to play, given the resulting observed features. (The original Context Attentive Bandit corresponds to the special case where the initially revealed subset is empty.)
The proposed setting is motivated by several real-life applications. For instance, in a clinical setting, a doctor may first take a look at a patient's medical record (partially observed context) to decide which medical tests (additional context variables) to perform, before choosing a treatment plan (selecting an arm to play). It is often too costly or even impossible to conduct all possible tests (i.e., observe the full context); therefore, given the limit on the number of tests, the doctor must decide which subset of tests will result in the maximally effective treatment choice (maximize the reward). Similar problems arise in multi-skill orchestration for AI agents. For example, in dialog orchestration, a user's query is first directed to a number of domain-specific agents, each providing a different response, and then the best response is selected. However, it might be too costly to request the answers from all domain-specific experts, especially in multi-purpose dialog systems with a very large number of domain experts. Given a limit on the number of experts to use for each query, the orchestration agent must choose the best subset of experts. In this application, the query is the immediately observed part of the overall context, while the responses of the domain-specific experts are the initially unobserved features, from which a limited subset must be selected and observed before choosing an arm, i.e., deciding on the best of the available responses. For multi-purpose dialog systems, such as personal home assistants, retrieving features or responses from every domain-specific agent is computationally expensive or intractable, with the potential to cause a poor user experience, again underscoring the need for effective feature selection.
Overall, the main contributions of this paper include: (1) a generalization of the Context Attentive Bandit, along with a first lower bound for this problem; (2) an algorithm, called Context Attentive Thompson Sampling, for stationary and non-stationary environments, together with its regret bound in the stationary case; and (3) an extensive empirical evaluation demonstrating the advantages of the proposed algorithm over the previous context-attentive bandit approach Bouneffouf et al. (2017) on a range of datasets, in both stationary and non-stationary settings.
2 Related Work
The contextual bandit (CB) problem has been extensively studied in the past, and a variety of solutions have been proposed. In LinUCB Li et al. (2010); Abbasi-Yadkori et al. (2011); Li et al. (2011); Bouneffouf and Rish (2019), in the Neural Bandit Allesiardo et al. (2014), and in linear Thompson Sampling Agrawal and Goyal (2013); Balakrishnan et al. (2019a, b), a linear dependency is assumed between the expected reward given the context and an action taken after observing this context; the representation space is modeled using a set of linear predictors. However, the context is assumed to be fully observable, which is not the case in this work. Motivated by dimensionality-reduction tasks, Abbasi-Yadkori et al. (2012) studied a sparse variant of stochastic linear bandits, in which a function depending on many features must be optimized while only a small, initially unknown subset of those features is relevant. Similarly, Carpentier and Munos (2012) considered high-dimensional stochastic linear bandits with sparsity, combining ideas from compressed sensing and bandit theory to derive a novel algorithm. In Oswal et al. (2020), the authors explore a form of the linear bandit problem in which the algorithm receives the usual stochastic rewards as well as stochastic feedback about which features are relevant to the rewards, and they propose an algorithm with a regret guarantee that requires no prior knowledge of which features are relevant. In Bastani and Bayati (2015), the problem is formulated as a multi-armed bandit (MAB) problem with high-dimensional covariates, and a new efficient bandit algorithm based on the LASSO estimator is presented. However, all of the above work, unlike ours, assumes fully observable context variables, which is not always realistic, as discussed in the previous section. In Bouneffouf et al. (2017), the authors proposed the framework of contextual bandit with restricted context, where observing the whole feature vector at each iteration is too costly or impossible, but the agent can request to observe the values of an arbitrary subset of features within a given budget, i.e., a limit on the number of observed features. This paper explores a more general problem and, unlike Bouneffouf et al. (2017), provides a theoretical analysis of both the problem and the proposed algorithm. The Context Attentive Bandit problem is also related to the budgeted learning problem, where a learner can access only a limited number of attributes from the training set or from the test set (see for instance Cesa-Bianchi et al. (2011)). In Foster et al. (2016), the authors studied the online budgeted learning problem and showed a significant negative result: no polynomial-time algorithm can achieve sublinear regret (under a standard complexity-theoretic assumption). To overcome this negative result, an additional assumption is necessary. Here, following Durand and Gagné (2014), we assume that the expected reward of selecting a subset of features is the sum of the expected rewards of selecting the features individually. We obtain an efficient algorithm, with algorithmic complexity linear in the time horizon.
3 Context Attentive Bandit Problem
We now introduce the problem setting, outlined in Algorithm 1. At each time point, the environment generates a context (feature vector), which the agent cannot observe fully; only a partial observation is allowed: the values of a fixed subset of observed features are revealed. Based on this partially observed context, the agent may request an additional subset of the unobserved features, within a fixed budget on its size. The goal of the agent is to maximize its total reward over time via (1) the optimal choice of the additional feature set, given the initially observed features, and (2) the optimal choice of an arm based on the resulting observed context. We assume
an unknown probability distribution of the reward given the context and the action taken in that context; in the following, expectations are taken over this distribution.
The contextual bandit problem. Following Langford and Zhang (2008), this problem is defined as follows. At each time point, an agent is presented with a context (i.e., a feature vector) before choosing an arm, and then observes the reward associated with the chosen arm at that time point. A policy maps a context into an arm. We assume that the expected reward is a linear function of the context.
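To make the interaction concrete, the following is a minimal sketch of one round of the protocol of Algorithm 1. The names (`N`, `K`, `cab_round`, the toy agents) and the Gaussian context are illustrative assumptions, not the paper's implementation; `U` and `V` denote the numbers of immediately observed and additionally requested features, matching the U+V notation used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: N total features, U always-observed, V extra requests, K arms.
N, U, V, K = 10, 3, 2, 4
OBSERVED = list(range(U))  # indices of the immediately observed features

def cab_round(pick_features, pick_arm, reward_fn):
    """One round of the CAB interaction protocol (a sketch of Algorithm 1)."""
    x = rng.normal(size=N)                     # full context, hidden from the agent
    partial = {i: x[i] for i in OBSERVED}      # partially observable context
    extra = pick_features(partial)             # agent requests V additional features
    assert len(extra) == V and not set(extra) & set(OBSERVED)
    partial.update({i: x[i] for i in extra})   # requested features are revealed
    arm = pick_arm(partial)                    # arm chosen from the enlarged context
    return reward_fn(x, arm)                   # reward may depend on the full context

# Toy agents: request the first V hidden features, then play a fixed arm.
reward = cab_round(lambda p: list(range(U, U + V)),
                   lambda p: 0,
                   lambda x, a: float(x[a] > 0))
```

A learning agent replaces the two lambdas with policies updated from past rewards.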
Assumption 1 (linear contextual bandit): whatever the subset of selected features, the expected reward of each arm is a linear function of the observed context, where each arm is associated with an unknown weight vector, and both the weight vectors and the contexts are assumed bounded.
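In standard linear-payoff notation, this assumption takes the following form (a reconstruction; the symbols $\theta_a$, $x_t$ and the unit norm bounds are our assumptions):

```latex
\mathbb{E}\left[ r_{t,a} \mid x_t \right] \;=\; \theta_a^{\top} x_t,
\qquad \theta_a \ \text{unknown}, \quad
\lVert \theta_a \rVert_2 \le 1, \quad \lVert x_t \rVert_2 \le 1 .
```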
Contextual Combinatorial Bandit. The contextual combinatorial bandit problem Qin et al. (2014) can be viewed as a game where the agent sequentially observes a context, selects a subset of items (here, features), and observes the reward corresponding to the selected subset; the goal is to maximize the reward over time. Each feature is associated with a random variable indicating the reward obtained when choosing that feature at the current time point.
Assumption 2 (linear contextual combinatorial bandit): the mean reward of selecting a set of features is the sum of the mean rewards of the individual features, and the expected reward of selecting each feature is a linear function of the observed context vector, with an unknown bounded weight vector associated with that feature. Let the policy class consist of the linear policies that use only the features coming from the union of the fixed observed subset and a selected additional subset. The objective of the Context Attentive Bandit (Algorithm 1) is to find an optimal policy within this class over T iterations, or time points, so that the total reward is maximized.
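Under the additivity in Assumption 2, the combinatorial choice of a feature subset collapses to a per-feature ranking. A minimal sketch (the names `W`, `x_obs`, `candidates` are ours, not the paper's):

```python
import numpy as np

def select_features(W, x_obs, candidates, V):
    """Pick the V unobserved features with the largest estimated individual
    rewards w_i . x_obs; by the additivity assumption this maximizes the
    estimated reward of the whole subset (a sketch, names hypothetical)."""
    scores = np.asarray([W[i] @ x_obs for i in candidates])
    top = np.argsort(scores)[::-1][:V]          # indices of the V best scores
    return [candidates[j] for j in top]

# Toy example: 4 candidate features with 2-dimensional weight vectors.
W = np.array([[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0], [0.5, 0.5]])
picked = select_features(W, np.array([1.0, 1.0]), [0, 1, 2, 3], 2)
# scores are 1.0, 2.0, 0.0, 1.0 → picked == [1, 3]
```

Without additivity, ranking individual features would not suffice, and the subset choice would be a hard combinatorial problem.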
Definition 1 (Optimal Policy for CAB).
The optimal policy for the CAB problem selects, at each time point, the arm that maximizes the expected reward under the best admissible subset of additional features.
Definition 2 (Cumulative regret).
The cumulative regret over T iterations of a policy is defined as the expected difference between the cumulative reward of the optimal policy and that of the chosen policy.
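In the usual bandit notation (a reconstruction; $\pi^{*}$ is the optimal policy of Definition 1, $\pi$ the learner's policy, and $r_{t,a}$ the reward of arm $a$ at time $t$), Definition 2 reads:

```latex
R(T) \;=\; \mathbb{E}\left[\, \sum_{t=1}^{T} \left( r_{t,\pi^{*}(x_t)} \;-\; r_{t,\pi(x_t)} \right) \right].
```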
Property 1 (Regret decomposition).
The cumulative regret over T iterations of a policy can be decomposed into two terms: the regret due to selecting suboptimal arms, and the regret due to selecting suboptimal subsets of features.
Remark 1.
The CAB problem generalizes the contextual bandit with restricted context problem (Bouneffouf et al. (2017)). Indeed, when the subset of initially observed features is empty, the expected reward of selecting a feature reduces to a constant, namely the corresponding coordinate of the unknown weight vector.
Before introducing an algorithm for solving the above CAB problem, we will derive a lower bound on the expected regret of any algorithm used to solve this problem.
Theorem 1.
For any policy solving the Context Attentive Bandit problem (Algorithm 1) under Assumptions 1 and 2, there exist reward distributions such that the regret accumulated by the policy over T iterations is lower bounded as follows:
4 Context Attentive Thompson Sampling (CATS)
We now propose an algorithm for solving the CAB problem, called Context Attentive Thompson Sampling (CATS), and summarize it in Algorithm 2. The basic idea of CATS is to use linear Thompson Sampling Agrawal and Goyal (2013) to solve two linear bandit problems: selecting the set of additional relevant features given the initially observed ones, and selecting the best arm given the resulting observed context. Linear Thompson Sampling assumes a Gaussian prior for the likelihood function, for each arm and for each feature; the corresponding posteriors at each time step are then also Gaussian.
The algorithm takes as input the total number of features, the number of features initially observed, the number of additional features to observe, the set of observed features, the number of actions, the time horizon, the distribution parameter used in linear Thompson Sampling, and a function of time used to adapt the algorithm to non-stationary linear bandits.
At each iteration, the values of the features in the observed subset are revealed (line 4 of Algorithm 2). The parameter vectors are then sampled for each feature (lines 5-7) from the posterior distribution (line 6), and the subset of best estimated features at the current time is selected (lines 8-9). Once the corresponding feature values are observed (line 10), linear Thompson Sampling is applied in lines 11-15 to choose an arm. When the reward of the selected arm is observed (line 15), the parameters are updated (lines 16-19).
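The Gaussian-posterior sampling and update steps (lines 5-7 and 16-19, used analogously per feature and per arm) can be sketched as a generic linear Thompson Sampling routine in the style of Agrawal and Goyal (2013). The class and parameter names are hypothetical, and the discount `gamma` stands in for the non-stationarity function of time:

```python
import numpy as np

class LinearTS:
    """Gaussian-posterior linear Thompson Sampling (a sketch; `v` is the
    exploration scale, and gamma=1.0 recovers the stationary update)."""
    def __init__(self, d, v=0.25):
        self.B = np.eye(d)       # posterior precision: I + sum of x x^T
        self.f = np.zeros(d)     # reward-weighted context sum
        self.v = v
    def sample(self, rng):
        mu = np.linalg.solve(self.B, self.f)          # posterior mean
        cov = self.v ** 2 * np.linalg.inv(self.B)     # posterior covariance
        return rng.multivariate_normal(mu, cov)       # sampled weight vector
    def update(self, x, r, gamma=1.0):
        self.B = gamma * self.B + np.outer(x, x)      # gamma < 1 forgets the past
        self.f = gamma * self.f + r * x

# Usage: recover theta = (1, -1) from noisy linear rewards.
rng = np.random.default_rng(1)
ts, theta = LinearTS(2), np.array([1.0, -1.0])
for _ in range(500):
    x = rng.normal(size=2)
    ts.update(x, theta @ x + 0.1 * rng.normal())
mu = np.linalg.solve(ts.B, ts.f)   # posterior mean approaches theta
```

CATS maintains one such model per feature (for lines 5-9) and one per arm (for lines 11-19).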
Remark 2 (Algorithmic complexity).
At each time step, Algorithm 2 sorts the estimated feature rewards and inverts the covariance matrices of the Gaussian posteriors, so its per-step algorithmic complexity does not grow with the time horizon.
Due to Assumption 2, the CATS algorithm benefits from an algorithmic complexity linear in the time horizon, overcoming the negative result stated in Foster et al. (2016). Before providing an upper bound on the regret of CATS (where Õ(·) hides logarithmic factors), we need an additional assumption on the noise.
Assumption 3 (Sub-Gaussian noise): at every time step, the reward noise is conditionally sub-Gaussian given the history of observations.
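The standard form of this condition (a reconstruction; $\eta_t$ denotes the reward noise, $R$ the sub-Gaussian scale, and $\mathcal{F}_{t-1}$ the history up to time $t-1$) is:

```latex
\forall \lambda \in \mathbb{R}: \qquad
\mathbb{E}\left[ e^{\lambda \eta_t} \,\middle|\, \mathcal{F}_{t-1} \right]
\;\le\; \exp\!\left( \frac{\lambda^{2} R^{2}}{2} \right).
```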
Lemma 1.
(Theorem 2 in Agrawal and Goyal (2013)) When the measurement noise satisfies Assumption 3, the regret of Thompson Sampling in the linear bandit problem is, with high probability, upper bounded by:
We can now derive the following result.
Theorem 2.
When the measurement noise satisfies Assumption 3, the regret of CATS (Algorithm 2) is, with high probability, upper bounded by:
Proof.
Theorem 2 states that the regret of CATS is the sum of two terms: the first is the regret due to selecting a suboptimal arm, while the second is the regret due to selecting a suboptimal subset of features. A gap remains between the lower bound for the Context Attentive Bandit problem and the upper bound of the proposed algorithm, in both the arm-selection and the feature-selection terms. This gap comes from the upper bound on the regret of CATS, which relies on Lemma 1. It suggests that using linear bandits based on upper confidence balls Abbasi-Yadkori et al. (2011), whose regret bounds have a tighter dependence on the context dimension, could reduce the theoretical gap. As we show in the next section, we nevertheless choose the Thompson Sampling approach for its better empirical performance.
5 Experiments
We compare the proposed CATS algorithm with: (1) Random-EI: in addition to the observed features, this algorithm selects a Random subset of features of the specified size at Each Iteration (hence Random-EI), and then invokes the linear Thompson Sampling algorithm. (2) Random-fix: this algorithm invokes linear Thompson Sampling on a subset of features that is randomly selected once, prior to seeing any data samples, and remains fixed. (3) Thompson Sampling with Restricted Context (TSRC), the state-of-the-art method for context-attentive bandits proposed in Bouneffouf et al. (2017): TSRC solves the CBRC (contextual bandit with restricted context) problem discussed earlier, deciding at each iteration which features to observe (referred to as the unobserved context). In our setting, however, U of the features are immediately observed at each iteration (referred to as the known context); the TSRC decision mechanism is then used to select the V additional unknown features to observe, followed by linear Thompson Sampling on the U+V observed features. (4) CA-LinUCB: CATS with the contextual Thompson Sampling replaced by LinUCB. (5) CATS-fix: a heuristic where the feature exploration is stopped after a number of iterations (we report an average over the best results). Empirical evaluation of Random-fix, Random-EI, CATS-fix, CA-LinUCB, CATS and TSRC (for the TS- and LinUCB-type algorithms, we used the same exploration parameter values as in Chapelle and Li (2011)) was performed on several publicly available datasets, as well as on a proprietary corporate dialog orchestration dataset. The publicly available Covertype and CNAE-9 were featured in the original TSRC paper, and Warfarin Sharabiani et al. (2015) is a historically popular dataset for evaluating bandit methods. To simulate the known and unknown context spaces, we randomly fix a fraction of the context feature space of each dataset to be known at the onset, and explore a subset of the unknown features. To consider the possibility of non-stationarity in the unknown context space over time, we introduce a weight-decay parameter that reduces the effect of past examples when updating the CATS parameters. We refer to the stationary case, with no decay, as CATS. For the non-stationary setting, we simulate non-stationarity in the unknown feature space by duplicating each dataset, randomly fixing the known context in the same manner as above, and shuffling the unknown feature set / label pairs. We then stochastically replace events in the original dataset with their shuffled counterparts, with the probability of replacement increasing uniformly with each additional event. We refer to the non-stationary case as NCATS and set the decay as defined by the GP-UCB algorithm Srinivas et al. (2009). We compare NCATS to NCATS-fix and NCA-LinUCB, the non-stationary versions of CATS-fix and CA-LinUCB. We also compare NCATS to Weighted TSRC (WTSRC), the non-stationary version of TSRC, also developed by Bouneffouf et al. (2017); WTSRC updates its feature selection model based only on recent events, where recent events are defined by a time period, or "window", whose size we choose empirically.
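The drift simulation described above can be sketched as follows. This is a minimal reconstruction; the array names are ours:

```python
import numpy as np

def simulate_drift(X_unknown, y, rng):
    """Stochastically replace each event with its counterpart from a shuffled
    copy of the dataset; the replacement probability grows uniformly from 0
    to 1 over the stream, so drift accumulates with time (a sketch)."""
    T = len(y)
    perm = rng.permutation(T)                    # shuffled counterparts
    Xs, ys = X_unknown[perm], y[perm]
    X_out, y_out = X_unknown.copy(), y.copy()
    for t in range(T):
        if rng.random() < t / max(T - 1, 1):     # probability rises with t
            X_out[t], y_out[t] = Xs[t], ys[t]
    return X_out, y_out
```

Early events are almost never replaced, while late events almost always are, which produces a gradually drifting unknown-feature distribution.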
We report the total average reward divided by T over 200 trials, across a range of V corresponding to various percentages of the unknown features, for each algorithm in Table 1.
The results in Table 1 are promising, with our methods outperforming the state of the art in the majority of cases across both settings. The most notable exception is the CNAE-9 dataset, where TSRC sometimes outperforms or nearly matches CATS. This outcome is somewhat expected: in the original work on TSRC Bouneffouf et al. (2017), the mean error rate of TSRC on CNAE-9 was only slightly lower than the error obtained by randomly fixing a subset of unknown features to reveal for each event. This observation suggests that, for this particular dataset, there may not be a subset of features that is noticeably more predictive of the reward than the rest of the features. We also observe that the LinUCB version of CATS performs comparably to CATS, with a slight advantage to CATS. Another observation is that CATS-fix performs better than CATS in some situations; a possible explanation is that, after finding the best features, the algorithm no longer needs to explore and can focus on finding the best arms based on these features.
We perform a deeper analysis on the Covertype dataset, examining multi-staged selection of the unknown context features. In CATS, the known context is used to select all additional context features at once. In a multi-staged approach, the known context grows and is used to select each of the additional context features incrementally (one feature at a time). For the stationary case, we denote these two variants of the CATS algorithm as CATS and CATS-Staged respectively, and report their performance when 10% of the context is randomly fixed, across various values of V, in Figure 1(a). Note that when the remaining 90% of the features are revealed, the CATS and TSRC methods all reduce to simple linear Thompson Sampling with the full feature set. Similarly, when 0 additional features are revealed, the methods all reduce to linear Thompson Sampling with a sparsely represented known context. Observe that CATS consistently outperforms CATS-Staged across all values of V tested. CATS-Staged likely suffers because incremental feature selection adds non-stationarity to the known context: CATS learns relationships between the known and unknown features, while CATS-Staged must learn these relationships as the known context grows. Nonetheless, both methods outperform TSRC. In the non-stationary case, we set the decay using the GP-UCB schedule, refer to the single- and multi-staged cases as NCATS and NCATS-Staged, and illustrate their performance in Figure 1(b). Here we observe that NCATS and NCATS-Staged have comparable performance, and the improvement over the baseline, in this case WTSRC, is even greater than in the stationary case.
Next, we evaluate our methods on Customer Assistant, a proprietary multi-skill dialog orchestration dataset. Recall that this kind of application motivates the CAB setting because there is a natural divide between the known and unknown context spaces: the query and its associated features are known at the onset, while the potential responses and their associated features are only known for the domain-specific agents the query is posed to. Customer Assistant orchestrates 9 domain-specific agents, referred to as skills in the discussion that follows. In this application, example skills lie in the domains of payroll, compensation, travel, health benefits, and so on. In addition to a textual response to a user query, the skills orchestrated by Customer Assistant also return the following features: an intent, a short string descriptor that categorizes the perceived intent of the query, and a confidence, a real value between 0 and 1 indicating how confident a skill is that its response is relevant to the query. Skills have multiple intents associated with them. The orchestrator uses all the features associated with the query and the candidate responses from all the skills to choose which skill should carry the conversation.
The Customer Assistant dataset contains 28,412 events associated with a correct skill response. We encode each query by averaging 50-dimensional GloVe word embeddings Pennington et al. (2014) over the words in the query, and for each skill we create a feature set consisting of its confidence and a one-hot encoding of its intent. The skill feature set sizes for the 9 skills are 181, 9, 4, 7, 6, 27, 110, 297, and 30, respectively. We concatenate the query features and all of the skill features to form a 721-dimensional context feature vector for each event in this dataset. Recall that there is no need to simulate the known and unknown contexts here: in a live setting, the query features are immediately calculable or known, whereas the confidence and intent necessary to build a skill's feature set are unknown until the skill is executed. Because the confidence and intent of a skill are both accessible post-execution, we reveal them together, slightly modifying the objective of CATS to reveal unknown skill feature sets instead of unknown individual features for each event. We again perform a deeper analysis, examining multi-staged selection of the unknown context feature sets. For the stationary case, the results are summarized in Figure 2(a): both the CATS-Staged and CATS methods outperform TSRC by a large margin. For the non-stationary case, we simulate non-stationarity in the same manner as for the publicly available datasets, except using the natural partition of the query features as the known context and the skill feature sets as the unknown context instead of simulated percentages. We set the decay using the GP-UCB schedule and illustrate the performance of NCATS and NCATS-Staged alongside WTSRC in Figure 2(b). Here we observe that NCATS slightly outperforms NCATS-Staged, and both outperform the WTSRC baseline.
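The context construction described above can be sketched as follows. The GloVe dictionary, function name, and toy feature sets are stand-ins; in the paper, the full concatenation is 721-dimensional:

```python
import numpy as np

def build_context(query_tokens, glove, skill_feature_sets, dim=50):
    """Mean-pool the GloVe vectors of the query words (zeros if no word is in
    the vocabulary), then concatenate every skill's [confidence, one-hot
    intent] block into one flat context vector (a sketch)."""
    vecs = [glove[w] for w in query_tokens if w in glove]
    q = np.mean(vecs, axis=0) if vecs else np.zeros(dim)
    return np.concatenate([q] + list(skill_feature_sets))

# Toy example: 3-d "GloVe" vectors and two skills with tiny feature sets.
glove = {"pay": np.array([1.0, 0.0, 1.0]), "stub": np.array([0.0, 2.0, 1.0])}
skills = [np.array([0.9, 1.0, 0.0]),            # confidence + 2-way intent one-hot
          np.array([0.1, 0.0, 1.0, 0.0])]       # confidence + 3-way intent one-hot
ctx = build_context(["pay", "stub"], glove, skills, dim=3)
# ctx = [0.5, 1.0, 1.0,  0.9, 1.0, 0.0,  0.1, 0.0, 1.0, 0.0]
```

In the CAB setting, only the query block of this vector is known up front; each skill block is revealed only if that skill is among the feature sets the agent chooses to request.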
6 Conclusions and Future Work
We have introduced a novel bandit problem with a partially observable context and the option of requesting a limited number of additional observations. We have also proposed an algorithm designed to take advantage of the initial partial observations in order to improve its choice of which additional features to observe, and demonstrated its advantages over the prior art, a standard context-attentive bandit with no partial observation of the context prior to the feature selection step. Our problem setting is motivated by several realistic scenarios, including medical applications as well as multi-domain dialog systems. Note that our current formulation assumes that all unobserved features have equal observation cost. A more practical assumption is that some features may be more costly to observe than others; in future work, we plan to expand the notion of budget to accommodate scenarios with different feature costs.
7 Broader Impact
This problem has broader impact in several domains, such as voice assistants, healthcare, and e-commerce.

Better medical diagnosis. In a clinical setting, it is often too costly or infeasible to conduct all possible tests; therefore, given the limit on the number of tests, the doctor must decide, in an iterative manner, which subset of tests will result in the maximally effective treatment choice. A doctor may first take a look at a patient's medical record to decide which medical tests to perform, before choosing a treatment plan.

Better user preference modeling. Our approach can help to develop better chatbots and automated personal assistants. For example, following a request such as "play music", an AI-based home assistant must learn to ask several follow-up questions (from a list of possible questions) to better understand the intent of the user and to remove ambiguities: e.g., what type of music do you prefer (jazz, pop, etc.)? Would you like it on the hi-fi system or on the TV? Another example: a support desk chatbot, in response to a user's complaint ("My Internet connection is bad"), must learn to ask a sequence of appropriate questions (from a list of possible connection issues): How far is your WiFi hotspot? Do you have a 4G subscription? These scenarios are well handled by the framework we propose in this paper.

Better recommendations. Voice assistants, and recommendation systems in general, tend to lock us into our preferences, which can have deleterious effects: recommendations based only on the past history of a user's choices may reinforce undesirable tendencies, e.g., suggesting online content that matches a user's particular bias (racist, sexist, etc.). On the contrary, our approach could potentially help a user break out of this loop, by asking additional questions (i.e., observing additional features) and suggesting items (e.g., news) that broaden the user's horizons.
References
Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári (2011). Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312–2320.
Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári (2012). Online-to-confidence-set conversions and application to sparse stochastic bandits. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2012, pp. 1–9.
D. Agarwal, B.-C. Chen, and P. Elango (2009). Explore/exploit schemes for web content optimization. In Proceedings of the 2009 Ninth IEEE International Conference on Data Mining, ICDM '09, pp. 1–10.
S. Agrawal and N. Goyal (2013). Thompson sampling for contextual bandits with linear payoffs. In ICML (3), pp. 127–135.
R. Allesiardo, R. Féraud, and D. Bouneffouf (2014). A neural networks committee for the contextual bandit problem. In Neural Information Processing: 21st International Conference, ICONIP 2014, Proceedings, Part I, pp. 374–381.
P. Auer, N. Cesa-Bianchi, and P. Fischer (2002). Finite-time analysis of the multi-armed bandit problem. Machine Learning 47(2-3), pp. 235–256.
P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire (2003). The nonstochastic multi-armed bandit problem. SIAM Journal on Computing 32(1), pp. 48–77.
A. Balakrishnan, D. Bouneffouf, N. Mattei, and F. Rossi (2019a). Incorporating behavioral constraints in online AI systems. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, pp. 3–11.
A. Balakrishnan, D. Bouneffouf, N. Mattei, and F. Rossi (2019b). Using multi-armed bandits to learn ethical priorities for online AI systems. IBM Journal of Research and Development 63(4/5), pp. 1:1–1:13.
H. Bastani and M. Bayati (2015). Online decision-making with high-dimensional covariates. Available at SSRN 2661896.
D. Bouneffouf and R. Féraud (2016). Multi-armed bandit problem with known trend. Neurocomputing 205, pp. 16–21.
D. Bouneffouf, I. Rish, G. A. Cecchi, and R. Féraud (2017). Context attentive bandits: contextual bandit with restricted context. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, pp. 1468–1475.
D. Bouneffouf and I. Rish (2019). A survey on practical applications of multi-armed and contextual bandits. CoRR abs/1904.10040.
A. Carpentier and R. Munos (2012). Bandit theory meets compressed sensing for high dimensional stochastic linear bandit. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2012, pp. 190–198.
N. Cesa-Bianchi, S. Shalev-Shwartz, and O. Shamir (2011). Efficient learning with partially observed attributes. Journal of Machine Learning Research 12, pp. 2857–2878.
O. Chapelle and L. Li (2011). An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pp. 2249–2257.
W. Chu, L. Li, L. Reyzin, and R. E. Schapire (2011). Contextual bandits with linear payoff functions. In AISTATS, JMLR Proceedings, Vol. 15, pp. 208–214.
A. Durand and C. Gagné (2014). Thompson sampling for combinatorial bandits and its application to online feature selection. In AAAI Workshop: Sequential Decision-Making with Big Data.
D. J. Foster, S. Kale, and H. Karloff (2016). Online sparse linear regression. In COLT.
J. C. Gittins (1979). Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, Series B (Methodological) 41(2), pp. 148–177.
T. L. Lai and H. Robbins (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6(1), pp. 4–22.
J. Langford and T. Zhang (2008). The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, pp. 817–824.
L. Li, W. Chu, J. Langford, and R. E. Schapire (2010). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, WWW '10, pp. 661–670.
L. Li, W. Chu, J. Langford, and X. Wang (2011). Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In WSDM, pp. 297–306.
B. Lin, D. Bouneffouf, G. A. Cecchi, and I. Rish (2018). Contextual bandit with adaptive feature extraction. In 2018 IEEE International Conference on Data Mining Workshops (ICDMW), pp. 937–944.
U. Oswal, A. Bhargava, and R. Nowak (2020). Linear bandits with feature feedback. In AAAI.
J. Pennington, R. Socher, and C. D. Manning (2014). GloVe: global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
L. Qin, S. Chen, and X. Zhu (2014). Contextual combinatorial bandit and its application on diversified online recommendation. In Proceedings of the 2014 SIAM International Conference on Data Mining, pp. 461–469.
A. Sharabiani et al. (2015). Revisiting warfarin dosing using machine learning techniques. Computational and Mathematical Methods in Medicine 2015.
N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger (2009). Gaussian process optimization in the bandit setting: no regret and experimental design. arXiv preprint arXiv:0912.3995.