1 Introduction
Serverless execution enables scalable and modular deployment of models for contemporary applications, including AI agents. In the context of dialog systems, such modular design entails connecting multiple dialog agents or "skills", each trained independently on different or overlapping tasks, to form a unified system with the capabilities of each of its constituent skills. The orchestrator, or control module, for such a system can be a deterministic module employing “If this then that” (IFTTT) logic or a more complex functional programming framework such as Amazon Lambda, or it can itself be a learnable model trained with supervised or reinforcement learning.
This is a fairly common scenario in contemporary personal home assistant devices such as Amazon Alexa and Google Home, where developers can integrate their own independently developed skills with the assistant’s core infrastructure. Here, the assistant itself is responsible for invoking skills in response to user input. Invocation of these skills falls into two categories: explicit invocation and implicit invocation. Explicit invocation occurs when the user names the skill they want to interact with. This requires the user to state the name of the skill along with one of a set of predefined invocation phrases that trigger the skill [1], and is an application of IFTTT logic. Implicit invocation, on the other hand, does not provide the assistant with the name of the skill the user wants to interact with, and requires the assistant to understand the user’s query along with the available skills’ capabilities in order to select the most appropriate skill to respond with [1, 2]. Implicit invocation has a clear advantage in facilitating more natural conversation, and it removes the knowledge barrier of skill naming and understanding that explicit invocation requires.
Dialog orchestration models that use implicit invocation tend to follow either the apriori approach or the posterior approach. Apriori orchestrator models are built exclusively using features known prior to executing any skills, whereas posterior models execute skills to extract supplemental features. By this definition, posterior approaches ought to match or beat comparable apriori methods, as they use a superset of the features used in the apriori approach. Most recent work has focused on the posterior approach, including submissions to the Alexa Prize competition [3, 4]. Both of these approaches have leveraged supervised learning techniques, which necessitate training data and regular updating if deployed live.
Online orchestration models could remove this hurdle, enabling a cold-start deployment. Online orchestration also does not require a fixed label space, allowing new skills to be added to the agent seamlessly. While a multitude of reinforcement learning models are viable orchestration candidates, we investigate the use of contextual bandits for the task. The contextual bandit problem is a variant of the extensively studied multi-armed bandit problem [5, 6, 7], where at each iteration, before choosing an arm, the agent observes an $N$-dimensional context, or feature vector, and uses it to predict the next best arm to play [8, 9, 10, 11]. Every time an arm is played, a reward value is observed. Over time, the agent’s aim is to collect enough information about the relationship between the context vectors and rewards, so that it can predict the next best arm to play by looking at the corresponding context [8, 11].
Posterior orchestration, online or otherwise, is not without its challenges. Recall that in posterior dialog orchestration, a user’s query is often directed to a number of domain-specific skills and the best response is returned. In this case, the pre-execution features, i.e. features extracted from the query, can be immediately observed, but the features or responses from skills, the post-execution features, cannot. For multi-purpose dialog systems, like personal home assistants, executing and retrieving features or responses from every skill can be computationally expensive or intractable, with the potential to cause a poor user experience. Moreover, executing skills in some use cases may necessitate API requests associated with actual costs. Thus, while posterior dialog orchestration models are in many ways conceptually preferable to apriori approaches, in practice they carry an often unaccounted cost. The challenge here is introducing a budget on the number of post-execution features that can be extracted. Some existing supervised posterior orchestration methods recognize this challenge and avoid retrieving all post-execution features. As an example, Kim et al. [12] present a set of efficient and scalable neural shortlisting-reranking models for personal assistants. The shortlisting stage efficiently trims all the skills down to a list of top-k candidates, and the reranking stage performs a list-wise reranking of the initial top-k skills with additional contextual information. Beyond the lack of cold-start support inherent to all supervised approaches, the amount of data necessary to effectively train this kind of neural model is a limiting factor for low-data use cases. Given that online orchestration avoids these pitfalls, we develop a novel bandit algorithm that handles this challenge of limited access to post-execution features.
The goal of our research is to build a dialog orchestration framework which can utilize query and user features along with the conversational context to route the dialog in a multi-skill system. Overall, the main contributions of this paper are (1) an online approach to dialog orchestration, (2) a new variant of the context attentive bandit problem, motivated by limitations of posterior dialog orchestration, and (3) an empirical evaluation demonstrating the advantages of our proposed method over a range of datasets and settings.
2 Background
The contextual bandit problem has been extensively studied in the past, and a variety of solutions have been proposed. In LinUCB [13, 14, 15] and in Contextual Thompson Sampling (CTS) [11], a linear dependency is assumed between the context and the expected reward of an action taken after observing this context; the representation space is modeled using a set of linear predictors. However, the context is assumed to be fully observable, which is not the case in this work.
Motivated by dimensionality reduction tasks, Abbasi-Yadkori et al. [16] studied a sparse variant of stochastic linear bandits, where only a relatively small and unknown subset of features is relevant to a multivariate function optimization. Similarly, Carpentier & Munos [17] also considered high-dimensional stochastic linear bandits with sparsity, where only a small number of components are assumed to be nonzero, and where the dimension of the context is larger than the sampling budget. Bastani & Bayati [18] consider a multi-armed bandit (MAB) problem with high-dimensional covariates and present a new efficient bandit algorithm based on the LASSO estimator. Their regret analysis demonstrates that the proposed algorithm achieves near-optimal performance in comparison to an oracle that knows all the problem parameters. Still, all of the above work, unlike ours, assumes full observability of the context variables, which is not the case in many important applications.
Finally, Bouneffouf et al. [19] developed the idea of context attentive bandits: a case of the contextual bandit problem, referred to as contextual bandit with restricted context (CBRC), where observing the whole feature vector at each iteration is impossible, and the agent can only request to see some limited number of those features. The upper bound (budget) on the size of the feature subset is fixed for all iterations, but within this budget, the agent can choose any feature subset of that size. However, in the posterior dialog orchestration application, while the full context may be too costly or impossible to see, some partial observation of the context, e.g. query or user features, can be known to an agent initially, along with the ability to observe unknown context features, up to a certain limit, as in CBRC.
Motivated by the limitations of posterior dialog orchestration, we extend the context attentive bandit to a special case which we call the Context Attentive Bandit with Observations (CABO). In the CABO setting, observing the full context vector at each iteration is impossible, but a small subset of context features is observable, and a fixed number of the unobserved features, within a budget, can be revealed. The goal here is to leverage the observable features to select the best unknown feature subset at each iteration to maximize overall reward.
3 Problem Setting
We begin by formally defining concepts our novel bandit problem setting builds upon, such as contextual bandit and contextual combinatorial bandit.
The Contextual Bandit Problem. Following Langford & Zhang [8], this problem is defined as follows. At each time point (iteration) $t \in \{1, \dots, T\}$, an agent is presented with a context (feature vector) $\mathbf{c}(t) \in \mathbb{R}^N$ before choosing an arm $k \in A = \{1, \dots, K\}$. We denote by $C = \{C_1, \dots, C_N\}$ the set of features (variables) defining the context. Let $\mathbf{r}(t) = (r_1(t), \dots, r_K(t))$ denote a reward vector, where $r_k(t) \in [0, 1]$ is a reward at time $t$ associated with the arm $k \in A$. Herein, we will primarily focus on the Bernoulli bandit with binary reward, i.e. $r_k(t) \in \{0, 1\}$. Let $\pi : \mathbb{R}^N \to A$ denote a policy, mapping a context $\mathbf{c}(t)$ into an action $k$. We assume some probability distribution $P_c(\mathbf{c})$ over the contexts in $\mathbb{R}^N$, and a distribution $P_r(r \mid \mathbf{c}, k)$ of the reward, given the context and the action taken in that context. We assume that the expected reward (with respect to the distribution $P_r(r \mid \mathbf{c}, k)$) is a linear function of the context, i.e. $E[r_k(t) \mid \mathbf{c}(t)] = \mu_k^\top \mathbf{c}(t)$, where $\mu_k$ is an unknown weight vector associated with the arm $k$; the agent’s objective is to learn $\mu_k$ from the data so it can optimize its cumulative reward over time.
Contextual Combinatorial Bandit. Our feature subset selection approach builds upon the Contextual Combinatorial Bandit (CCB) problem [20], specified as follows. Each arm $k \in \{1, \dots, K\}$ is associated with the corresponding variable $x_k(t) \in \mathbb{R}$ indicating the reward obtained when choosing the $k$-th arm at time $t$, for $t \geq 1$. In the contextual combinatorial bandit setting, the agent sequentially observes a context $\mathbf{c}(t)$, selects a subset of arms $M(t) \subseteq \{1, \dots, K\}$ from a constrained set of arm subsets $\mathcal{M} \subseteq \mathcal{P}(K)$, where $\mathcal{P}(K)$ is the powerset of $\{1, \dots, K\}$, and observes a reward $r(t)$ associated with the selected subset of arms. Here we define the reward function used to compute $r(t)$ as a sum of the outcomes of the arms in $M(t)$, i.e. $r(t) = \sum_{k \in M(t)} x_k(t)$, although one can also use nonlinear rewards. The objective of the CCB algorithm is to maximize the reward over time. We consider here a stochastic model, where the expectation of $x_k(t)$ observed for an arm $k$ is a linear function of the context, i.e. $E[x_k(t) \mid \mathbf{c}(t)] = \theta_k^\top \mathbf{c}(t)$, where $\theta_k$ is an unknown weight vector (to be learned from the data) associated with the arm $k$. The distributions of $x_k(t)$ can be different for each arm. The global rewards $r(t)$ are also random variables, independent and distributed according to some unknown distribution with some expectation $\mu(t)$.
3.1 CABO: Context Attentive Bandit with Observations
Building off the contextual bandit and contextual combinatorial bandit problems, we formally define a novel type of bandit problem, called Context Attentive Bandit with Observations (CABO).
As mentioned above, $\mathbf{c}(t) \in \mathbb{R}^N$ will denote a vector of values assigned to an (ordered) set of random context variables, or features, $C = (C_1, \dots, C_N)$, at time $t$. Let $C^d \subseteq C$, $|C^d| = d$, $d \leq N$, denote a subset of features of size $d$, and let $\mathbf{c}^d$ denote a vector from a subspace of $\mathbb{R}^N$, denoted $\mathbb{R}^N_{C^d}$, which is defined as the subspace containing all sparse vectors with features (coordinates) outside of the subset $C^d$ set to zero.
We assume that at each time point $t$ the environment generates a feature vector $\mathbf{c}(t)$ which the agent cannot observe fully. However, unlike the previously introduced CBRC setting [19], the agent now has a partial observation of the context, i.e. it can see a small subset of observed features $C^V \subset C$, where $|C^V| = V < N$. Given these observations $\mathbf{c}^V$, the agent is allowed to request more features to observe (similar to the CBRC setting), up to $D$ (desired) features in total, including the initial set $C^V$, where $C^D$ denotes the final set of observed features. Assuming that the unobserved features are all of the same fixed cost, there is a budget of $U = D - V$ features imposed on the agent. The goal of the agent is to maximize its total reward over time via (1) the optimal choice of the additional observations, given the initial ones, and (2) the optimal choice of a subsequent action based on the resulting extended observation.
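The restricted-context idea above can be illustrated with a minimal sketch. The function names `restrict_context` and `extend_observation` are hypothetical helpers introduced only for this illustration; they are not taken from the paper. The sketch assumes features are identified by integer index and that every unobserved feature has the same unit cost.

```python
import numpy as np

def restrict_context(context, feature_subset):
    """Return a sparse copy of `context`: coordinates outside
    `feature_subset` are set to zero (hypothetical helper)."""
    restricted = np.zeros_like(context)
    idx = np.asarray(sorted(feature_subset), dtype=int)
    restricted[idx] = context[idx]
    return restricted

def extend_observation(known, requested, budget):
    """Union of the initially known features and at most `budget`
    newly requested ones (unit cost per feature is an assumption)."""
    new = [i for i in requested if i not in known]
    if len(new) > budget:
        raise ValueError("requested features exceed the budget")
    return set(known) | set(new)

# Full context generated by the environment; the agent never sees it directly.
context = np.array([0.5, -1.0, 2.0, 0.0, 3.0])
known = {0, 1}                                   # initially observed features
final = extend_observation(known, requested=[4], budget=1)
restricted = restrict_context(context, final)    # what the agent acts on
```

Here the agent starts with two observed features, spends its budget of one on feature 4, and acts on a vector in which the remaining coordinates are zeroed out.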
Let us now formally define the set of all policies $\Pi$, i.e. possible mappings from the agent’s observations to its actions restricted to the proposed problem setting, as the set of compound functions built from three mappings: (1) an observation function mapping a given subset of features $C^d \in \mathcal{C}^d$, where $\mathcal{C}^d$ denotes the set of all subsets of $C$ of size $d$, to a vector $\mathbf{c}^d \in \mathbb{R}^N_{C^d}$, the subspace of $\mathbb{R}^N$ defined for the corresponding subset of features; (2) a feature-selection function mapping the initial set of observed features $C^V$ to the extended set of features to be observed, $C^D$, with $C^V \subseteq C^D$ and $|C^D| = D$; and (3) a decision function mapping the observed extended feature subset $\mathbf{c}^D$ into an action (a.k.a. bandit’s arm) $k \in A$, which results in a reward $r$.
The objective of a contextual bandit algorithm is to find an optimal policy $\pi^* \in \Pi$, over $T$ iterations or time points, so that the total reward is maximized.
4 Methodology
4.1 CATSO: Context Attentive Thompson Sampling with Observations
We propose a novel method for solving the CABO problem, which we name Context Attentive Thompson Sampling with Observations (CATSO), and summarize it in Algorithm 1. The combinatorial task of selecting the best subset of features is treated as a contextual combinatorial bandit (CCB) problem [20], and the subsequent decision-making (action selection) task as a contextual bandit problem solved by Contextual Thompson Sampling (CTS) [11].
The algorithm takes the total number of features $N$, the number of initially observed features $V$, and the total desired number of features to observe $D$ as inputs. We use $U = D - V$ to denote our budget, the number of unobserved features to reveal. We will use several stages $S$, up to $U$, to reveal features. When $S = 1$, the observed features are used to select all $U$ features as a set, whereas when $S = U$, the set of observed features is updated incrementally and used to select each of the $U$ additional features one at a time. The algorithm also requires the hyperparameter $v$, the exploration parameter used in Thompson Sampling.
The algorithm iterates over $T$ steps, where at each iteration $t$, the values of the features in the original observed subset $C^V$ are observed first. The current set of already observed features and the corresponding observed context are maintained over all stages, and are initialized to $C^V$ and $\mathbf{c}^V(t)$ respectively. At each iteration $t$, a parameter vector is sampled from the corresponding multivariate Gaussian distribution (step 10) for each feature not yet observed, to estimate its expected reward. Thereafter, at each stage, the best subset of features is selected, with its size equal to the number of unknown features to explore at each stage.
Once a subset of features is selected using the contextual combinatorial bandit approach, the algorithm switches to the contextual bandit setting to choose an arm based on the context consisting now of a subset of features (steps 17-24).
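The staged feature selection described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's Algorithm 1: the function `select_features` and its arguments are hypothetical, each candidate feature is scored by a linear function of the currently observed context, and the number of stages is assumed to divide the budget evenly. The posterior update of the per-feature parameters is omitted.

```python
import numpy as np

def select_features(context, known, budget, stages, theta_mean, theta_cov, rng):
    """Reveal `budget` unobserved features over `stages` rounds.

    context    : full context vector; only revealed entries are ever read
    known      : indices observed for free at the start
    theta_mean : (N, N) array, row i is the posterior mean for feature i
    theta_cov  : (N, N, N) array, theta_cov[i] is the covariance for feature i
    """
    if budget % stages:
        raise ValueError("stages must divide the budget in this sketch")
    n = context.shape[0]
    observed = set(known)
    per_stage = budget // stages
    for _ in range(stages):
        # Restricted context: zero outside the currently observed features.
        c_obs = np.zeros(n)
        idx = sorted(observed)
        c_obs[idx] = context[idx]
        candidates = [i for i in range(n) if i not in observed]
        # Thompson step: sample one parameter vector per candidate feature.
        scores = {
            i: rng.multivariate_normal(theta_mean[i], theta_cov[i]) @ c_obs
            for i in candidates
        }
        best = sorted(candidates, key=lambda i: scores[i], reverse=True)[:per_stage]
        observed.update(best)  # with stages > 1, later rounds see a larger context
    return observed

# Toy setup (all values are illustrative assumptions).
rng = np.random.default_rng(0)
n = 6
context = np.ones(n)
theta_mean = np.zeros((n, n))
theta_mean[:, 0] = [0.0, 5.0, 1.0, 4.0, 2.0, 3.0]   # scores driven by feature 0
theta_cov = np.stack([1e-12 * np.eye(n)] * n)        # nearly deterministic samples
all_at_once = select_features(context, {0}, 2, 1, theta_mean, theta_cov, rng)
one_by_one = select_features(context, {0}, 2, 2, theta_mean, theta_cov, rng)
```

With a single stage, both extra features are chosen from the initial observation alone; with as many stages as the budget, each newly revealed feature enters the observed context before the next one is chosen.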
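We assume that the expected reward is a linear function of a restricted context, $E[r_k(t) \mid \mathbf{c}^D(t)] = \mu_k^\top \mathbf{c}^D(t)$.
We assume that the reward $r_k(t)$ for choosing arm $k$ at time $t$ follows a parametric likelihood function $Pr(r(t) \mid \tilde{\mu})$, and that the posterior distribution at time $t+1$, $Pr(\tilde{\mu} \mid r(t)) \propto Pr(r(t) \mid \tilde{\mu}) \, Pr(\tilde{\mu})$, is given by a multivariate Gaussian distribution $\mathcal{N}(\hat{\mu}_k(t+1), v^2 B_k(t+1)^{-1})$, where $B_k(t) = I_d + \sum_{\tau=1}^{t-1} \mathbf{c}(\tau) \mathbf{c}(\tau)^\top$ with $d$ the size of the context vectors $\mathbf{c}$, and $\hat{\mu}_k(t) = B_k(t)^{-1} \left( \sum_{\tau=1}^{t-1} \mathbf{c}(\tau) r_k(\tau) \right)$.
At each time point $t$, and for each arm $k$, a $d$-dimensional vector $\tilde{\mu}_k$ is sampled from $\mathcal{N}(\hat{\mu}_k(t), v^2 B_k(t)^{-1})$, an arm is chosen such that $\mathbf{c}^D(t)^\top \tilde{\mu}_k$ is maximized (step 20 in the algorithm), a reward $r_k(t)$ is obtained for choosing arm $k$, and finally the relevant parameters are updated.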
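The arm-selection step can be illustrated with a minimal sketch of linear Thompson Sampling in the style of Agrawal & Goyal. The class name `LinearThompsonSampling` and its attribute names are chosen for this illustration only, and the sketch keeps separate statistics per arm, updated only when that arm is played. It takes whatever context vector it is given, so a restricted (zero-padded) context can be passed in directly.

```python
import numpy as np

class LinearThompsonSampling:
    """Per-arm Gaussian posterior over a linear reward model (sketch)."""

    def __init__(self, n_arms, dim, v=0.25, seed=0):
        self.v = v                                   # exploration parameter
        self.rng = np.random.default_rng(seed)
        # B_k starts at the identity and accumulates outer products of contexts.
        self.B = np.stack([np.eye(dim) for _ in range(n_arms)])
        # f_k accumulates reward-weighted contexts for arm k.
        self.f = np.zeros((n_arms, dim))

    def choose(self, context):
        scores = []
        for B_k, f_k in zip(self.B, self.f):
            B_inv = np.linalg.inv(B_k)
            mu_hat = B_inv @ f_k                     # posterior mean
            mu_tilde = self.rng.multivariate_normal(mu_hat, self.v ** 2 * B_inv)
            scores.append(context @ mu_tilde)
        return int(np.argmax(scores))

    def update(self, arm, context, reward):
        self.B[arm] += np.outer(context, context)
        self.f[arm] += reward * context

# Toy usage: a tiny v makes the sampled parameters almost equal to the mean.
agent = LinearThompsonSampling(n_arms=2, dim=2, v=1e-6)
e0, e1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
agent.update(0, e0, 1.0)    # arm 0 was rewarded in context e0
agent.update(1, e1, 1.0)    # arm 1 was rewarded in context e1
chosen = agent.choose(e0)
```

A larger `v` widens the sampled distribution and therefore increases exploration; the near-zero value in the toy usage is only there to make the outcome easy to follow.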
4.1.1 CATSO in Nonstationary Setting
Practical posterior dialog orchestration applications motivate the need to consider the possibility of nonstationary unobserved context features. In posterior dialog orchestration, we assume each domain-specific skill outputs features pertaining to its query response. In some use cases, each skill could be independently updated at any time, changing these features. As a result, similar queries, which would likely define the observable context $\mathbf{c}^V$, can elicit vastly different distributions of response features, the unknown context, over time. The main problem with any stationary algorithm is that it gives equal weight to all of its history. In a nonstationary environment, if there is no specific assumption about how the environment will change, a simple idea is to use a weighting function to lessen the effect of the past on current decisions. Since we are using CTS as our base model and it uses ridge regression, implementing instance weighting is straightforward. We propose assigning decaying weights to the past examples in the ridge regression. The same kind of weights are also applied in the calculation of the confidence width. Following the notation from Algorithm 1, $\lambda$ in this case represents the decay parameter. In order to compute the optimal value of $\lambda$, we use the GP-UCB algorithm [21], which solves the multi-armed bandit problem in continuous space. Computing $\lambda$ is done via the following decision rule:
$$\lambda_t = \arg\max_{\lambda \in F} \; \mu_{t-1}(\lambda) + \sqrt{\beta} \, \sigma_{t-1}(\lambda)$$
with $\beta$ as the GP-UCB exploration parameter, $\mu_{t-1}$ the mean reward, $\sigma_{t-1}$ the uncertainty, and $F$ the search space for $\lambda$. The GP-UCB algorithm is initialized with the search space and, at each iteration, uses the above equation to calculate a different $\lambda$, which is then used by CATSO. More detail on the GP-UCB algorithm can be found in [21].
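The two ingredients above can be sketched in a few lines. Both functions are hypothetical illustrations: the text does not specify the exact weighting scheme, so `decayed_ridge_update` assumes a simple exponential decay of the accumulated data term, and `ucb_pick` applies the upper-confidence rule over a finite grid of candidate decay values, taking the posterior mean and uncertainty as given rather than fitting a Gaussian process.

```python
import numpy as np

def decayed_ridge_update(B, f, context, reward, decay):
    """One update of discounted ridge-regression statistics.

    Assumed scheme: past data terms are multiplied by `decay` before the
    new example is added; decay = 1.0 recovers the stationary update.
    """
    eye = np.eye(context.shape[0])
    # Shrink only the data part of B, keeping the identity regularizer intact.
    B_new = eye + decay * (B - eye) + np.outer(context, context)
    f_new = decay * f + reward * context
    return B_new, f_new

def ucb_pick(grid, mean, std, beta):
    """Pick the candidate maximizing mean + sqrt(beta) * std."""
    scores = np.asarray(mean) + np.sqrt(beta) * np.asarray(std)
    return grid[int(np.argmax(scores))]

# Toy usage with illustrative numbers.
B, f = np.eye(2), np.zeros(2)
B, f = decayed_ridge_update(B, f, np.array([1.0, 0.0]), 1.0, decay=1.0)
B, f = decayed_ridge_update(B, f, np.array([0.0, 1.0]), 0.0, decay=0.5)
decay_choice = ucb_pick([0.5, 0.9, 1.0], mean=[0.2, 0.5, 0.4],
                        std=[0.1, 0.0, 0.3], beta=4.0)
```

In the toy usage the first example's contribution is halved by the second update, and the candidate with the highest optimistic score is selected even though its mean is not the largest.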
5 Experiments
We assess Context Attentive Thompson Sampling with Observations (CATSO) with respect to the current state of the art for context attentive bandits, Thompson Sampling with Restricted Context (TSRC). TSRC solves the contextual bandit with restricted context (CBRC) problem discussed earlier: it selects a set of unknown features at each event while assuming no observable features exist initially. For a total number of $N$ features, we refer to the $V$ observed features as the known context and the $N - V$ unobserved context features as the unknown context. In our use of the TSRC algorithm, at each iteration, the known context is observed, the TSRC decision mechanism independently chooses $U$ unknown context features to reveal, and Contextual Thompson Sampling (CTS) is invoked. Empirical evaluation of CATSO and TSRC was performed on publicly available classification datasets and on a proprietary corporate dialog orchestration dataset.
The publicly available Covertype (https://archive.ics.uci.edu/ml/datasets.html) and CNAE-9 datasets were featured in the original TSRC paper, and Warfarin [22] is a historically popular dataset for evaluating bandit methods. The details of these datasets are summarized in Table 1.
Table 1: Datasets.
Datasets     Instances   Features   Classes
Covertype    500 000     95         7
CNAE-9       1080        856        9
Warfarin     5528        93         3
For the stationary setting, we randomly fix a set percentage of the context feature space of each dataset to be known at the onset and explore a subset of $U$ unknown features. For CATSO, we fix the decay parameter $\lambda$ to reflect the stationary setting and choose a fixed number of stages $S$. For the nonstationary setting, we simulate nonstationarity in the unknown feature space by duplicating each dataset, randomly fixing the known context in the same manner as above, and shuffling the (unknown feature set, label) pairs. Then we stochastically replace events in the original dataset with their shuffled counterparts, with the probability of replacement increasing uniformly with each additional event. For this nonstationary setting, in which we refer to our method as NCATSO, we again use the same $S$, but use the $\lambda$ selected by the GP-UCB algorithm. We compare NCATSO to Weighted TSRC (WTSRC), the nonstationary version of TSRC also developed by Bouneffouf et al. [19]. WTSRC makes updates to its feature selection model based only on recent events, where recent events are defined by a time period, or "window". We choose a fixed window size for WTSRC. We report the total average reward across a range of $U$ corresponding to various percentages of $N$ for each algorithm in each setting in Table 2.
The results in Table 2 are promising, with our methods outperforming the state of the art in the majority of cases across both settings. The most notable exception is the CNAE-9 dataset, where CATSO sometimes outperforms and other times only nearly matches TSRC. This outcome is somewhat expected, for in the original work on TSRC [19], the mean error rate of TSRC on CNAE-9 was only slightly lower than that of randomly fixing a subset of unknown features to reveal for each event. This suggests that the operating premise of TSRC, that some features are more predictive of reward than others, does not hold on this dataset. On top of this assumption, CATSO also assumes that there exist relationships between the known and unknown context features, likely causing a small compounding of error.
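The nonstationarity simulation described above can be sketched as follows. The function `simulate_nonstationarity` is a hypothetical illustration, and the linear replacement schedule (probability rising from 0 for the first event to 1 for the last) is an assumption about what "increasing uniformly" means.

```python
import numpy as np

def simulate_nonstationarity(X, y, known_idx, seed=0):
    """Replace events with shuffled (unknown features, label) pairs.

    X         : (n_events, n_features) feature matrix
    y         : (n_events,) labels
    known_idx : column indices of the known context (never shuffled)
    """
    rng = np.random.default_rng(seed)
    n_events, n_features = X.shape
    unknown_idx = np.setdiff1d(np.arange(n_features), known_idx)

    # Duplicate the dataset and shuffle unknown features together with labels.
    perm = rng.permutation(n_events)
    X_shuf, y_shuf = X.copy(), y[perm].copy()
    X_shuf[:, unknown_idx] = X[perm][:, unknown_idx]

    # Assumed schedule: replacement probability grows linearly from 0 to 1.
    p_replace = np.arange(n_events) / max(n_events - 1, 1)
    replaced = rng.random(n_events) < p_replace

    X_out, y_out = X.copy(), y.copy()
    X_out[replaced] = X_shuf[replaced]
    y_out[replaced] = y_shuf[replaced]
    return X_out, y_out, replaced

# Toy usage: 10 events, 4 features, the first two columns are the known context.
X = np.arange(40.0).reshape(10, 4)
y = np.arange(10)
X_ns, y_ns, replaced = simulate_nonstationarity(X, y, known_idx=[0, 1])
```

Because unknown features and labels are permuted together, each replaced event keeps a valid pairing between its unknown features and its label while the relationship to the known context is broken.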
The proprietary corporate dialog application, Customer Assistant, orchestrates 9 domain-specific skills, which we label arbitrarily in the discussion that follows. In this application, example skills lie in the domains of payroll, compensation, travel, health benefits, and so on. Each skill is designed with a multi-turn conversation dialog tree. In addition to a textual response to a user query, the skills orchestrated by Customer Assistant also return the following features: an intent, a short string descriptor that categorizes the perceived intent of the query, and a confidence, a real value between 0 and 1 indicating how confident a skill is that its response is relevant to the query. Skills have multiple intents associated with them. The orchestrator uses all the features associated with the query and the candidate responses from all the skills to choose which skill should carry the conversation at a given event.
We accessed the training data for each skill to find example queries that Customer Assistant has valid responses for, amounting to 28,412 queries in total. Accordingly, we take the correct class for a query to be the skill for which it was an example. For the 127 queries common to more than one skill’s training data, one of the skills was randomly assigned as the correct class. We pose each query to all of the skills to extract the associated intents and confidences, and add noise sampled from a distribution with parameters (0.1, 0.05) to all of the confidences to avoid built-in biases. We encode each query by averaging the 50-dimensional GloVe word embeddings [23] of its words, and for each skill we create a feature set consisting of its confidence and a one-hot encoding of its intent. The skill feature set sizes for the nine skills are 181, 9, 4, 7, 6, 27, 110, 297, and 30 respectively. We concatenate the query features and all of the skill features to form a 721-dimensional context feature vector for each event in this dataset. In contrast to the publicly available datasets, here there is no need to simulate the known and unknown contexts; in a live setting the query features are immediately calculable or known, whereas the confidence and intent necessary to build a skill’s feature set are unknown until a skill is executed. Because the confidence and intent for a skill are both accessible post-execution, we reveal them together. We accommodate this by slightly modifying the objective of CATSO to reveal unknown skill feature sets instead of unknown individual features for each event.
We perform a deeper analysis of the Customer Assistant dataset, examining the cases $S = 1$ and $S = U$, where $S$ is the number of stages used to reveal features. Recall that when $S = 1$, the known context, in this case the query features, is used to select all $U$ additional context feature sets at once, whereas when $S = U$, the known context grows and is used to select each of the $U$ additional context feature sets incrementally. Keeping the decay parameter fixed at its stationary setting, for the stationary case we denote these two cases of the CATSO algorithm as CATSO-1 and CATSO-U respectively and report their performance across various $U$, the number of unknown skill feature sets revealed. Note that when all 9 skill feature sets are revealed, the CATSO and TSRC methods all reduce to simple Contextual Thompson Sampling (CTS) with the full feature set. Similarly, when 0 skill feature sets are revealed, the methods all reduce to CTS with a sparsely represented context of the query features. CTS suffers from this sparsity, so we also consider a case we call CTS-query, in which the context is exclusively the query features. CTS-query is thus an apriori online approach to dialog orchestration that completely ignores the existence of post-execution features. The results for the stationary case are summarized in Figure 1.
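The construction of the 721-dimensional context described above (a 50-dimensional averaged query embedding followed by one feature set per skill) can be sketched as follows. The helper names and the toy embedding table are hypothetical; real GloVe vectors would be loaded from the published embedding files, and the layout of each skill feature set (confidence first, then the one-hot intent) is an assumption.

```python
import numpy as np

SKILL_FEATURE_SIZES = [181, 9, 4, 7, 6, 27, 110, 297, 30]  # sizes from the text
EMBED_DIM = 50

def encode_query(tokens, embeddings):
    """Average the embeddings of known tokens; zeros if none are known."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(EMBED_DIM)

def encode_skill(confidence, intent_id, size):
    """Confidence in slot 0, one-hot intent in the remaining size - 1 slots."""
    feats = np.zeros(size)
    feats[0] = confidence
    feats[1 + intent_id] = 1.0
    return feats

def build_context(tokens, embeddings, skill_outputs):
    """Concatenate the query features with every skill's feature set."""
    parts = [encode_query(tokens, embeddings)]
    for (confidence, intent_id), size in zip(skill_outputs, SKILL_FEATURE_SIZES):
        parts.append(encode_skill(confidence, intent_id, size))
    return np.concatenate(parts)

# Toy embedding table standing in for GloVe vectors (illustrative values only).
toy_embeddings = {"pay": np.ones(EMBED_DIM), "day": np.zeros(EMBED_DIM)}
skill_outputs = [(0.9, 0)] * 9          # (confidence, intent index) per skill
ctx = build_context(["pay", "day", "unknown"], toy_embeddings, skill_outputs)
```

The dimensions add up as 50 + (181 + 9 + 4 + 7 + 6 + 27 + 110 + 297 + 30) = 50 + 671 = 721, matching the context size stated in the text. In the budgeted setting, only the query block and the feature sets of executed skills would be filled in; the rest would stay at zero.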
CATSO-U appears to slightly outperform CATSO-1 across all tested $U$, and both methods outperform TSRC by a large margin. Also notice that our posterior methods CATSO-1 and CATSO-U outperform the apriori method CTS-query even under very small post-execution feature budgets, as low as 2 skill feature sets.
For the nonstationary case we simulate nonstationarity in the same manner as for the publicly available datasets, except that we use the natural partition of the query features as the known context and the skill feature sets as the unknown context instead of simulated percentages. We use the GP-UCB algorithm to set the decay parameter, refer to the $S = 1$ and $S = U$ cases as NCATSO-1 and NCATSO-U, and illustrate their performance alongside WTSRC and CTS-query in Figure 2. Here we observe that NCATSO-1 slightly outperforms NCATSO-U, and both outperform the WTSRC baseline. Notice that the posterior approach NCATSO-1 outperforms CTS-query, the apriori approach, when approximately 3 or more skill feature sets are revealed.
6 Conclusions and Future Work
In this paper we consider how to address the challenges of posterior dialog orchestration using an online approach. We formulate CABO, a new variant of context attentive bandits motivated by practical budgets on skill execution, and demonstrate that our new bandit algorithm beats the existing state-of-the-art context attentive bandit algorithm on simulated (non-dialog) and dialog datasets across stationary and nonstationary settings.
Customer Assistant has now been deployed to over 100,000 users with a thumbs up/down feature that allows users to provide individual feedback to each of the responses. The system also allows Subject Matter Experts (SMEs) to provide explicit labels to each event, enabling human evaluation of all the models.
Theoretical regret bounds on the proposed algorithm will follow in a more theory-focused work. Our current algorithm treats multi-skill dialog orchestration as a single-class problem, where after each query, only one skill response is returned to the user. However, in many use cases, multiple responses ought to be returned to the user. We would like to shift our algorithm to the multi-class setting, perhaps by using the contextual combinatorial bandit approach in the arm selection process in addition to its current role in the unknown context feature selection process. Also, our current formulation assumes that all of the unobserved features are of the same cost, and thus the budget on cost is equivalent to a budget on the number of features. We plan on expanding this notion of budget to accommodate settings where unobserved features have different costs. Other directions for future work include using non-bandit algorithms in the context feature selection stage and exploring nonstationarity in the known context space.
References
 [1] Google. Overview  actions on google  google developers. https://developers.google.com/actions/discovery/, 2018.
 [2] Amazon. Understand how users invoke custom skills. https://developer.amazon.com/docs/customskills/understandinghowusersinvokecustomskills.html, 2018.
 [3] Ioannis Papaioannou, Amanda Cercas Curry, Jose L Part, Igor Shalyminov, Xinnuo Xu, and Yanchao Yu. Alana: Social Dialogue using an Ensemble Model and a Ranker trained on User Feedback. 1st Proceedings of Alexa Prize, pages 1–10, 2017.
 [4] Oluwatosin Adewale, Alex Beatson, Davit Buniatyan, Jason Ge, Mikhail Khodak, Holden Lee, Niranjani Prasad, Nikunj Saunshi, Ari Seff, Karan Singh, Daniel Suo, Cyril Zhang, and Sanjeev Arora. Pixie: A Social Chatbot. Alexa Prize Proceedings 2017, pages 1–10, 2017.
 [5] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.
 [6] John C Gittins. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society: Series B (Methodological), 41(2):148–164, 1979.
 [7] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multi-armed bandit problem. Machine learning, 47(2-3):235–256, 2002.
 [8] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in neural information processing systems, pages 817–824, 2008.
 [9] Deepak Agarwal, Bee-Chung Chen, and Pradheep Elango. Explore/exploit schemes for web content optimization. In 2009 Ninth IEEE International Conference on Data Mining, pages 1–10. IEEE, 2009.
 [10] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multi-armed bandit problem. SIAM journal on computing, 32(1):48–77, 2002.
 [11] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135, 2013.
 [12] Young-Bum Kim, Dongchan Kim, Joo-Kyung Kim, and Ruhi Sarikaya. A scalable neural shortlisting-reranking approach for large-scale domain classification in natural language understanding. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), pages 16–24, 2018.
 [13] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670. ACM, 2010.
 [14] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
 [15] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
 [16] Yasin Abbasi-Yadkori, David Pal, and Csaba Szepesvari. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Artificial Intelligence and Statistics, pages 1–9, 2012.
 [17] Alexandra Carpentier and Rémi Munos. Bandit theory meets compressed sensing for high dimensional stochastic linear bandit. In Artificial Intelligence and Statistics, pages 190–198, 2012.
 [18] Hamsa Bastani and Mohsen Bayati. Online decision-making with high-dimensional covariates. Available at SSRN 2661896, 2015.
 [19] Djallel Bouneffouf, Irina Rish, Guillermo A Cecchi, and Raphaël Féraud. Context attentive bandits: contextual bandit with restricted context. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 1468–1475. AAAI Press, 2017.
 [20] Lijing Qin, Shouyuan Chen, and Xiaoyan Zhu. Contextual combinatorial bandit and its application on diversified online recommendation. In Proceedings of the 2014 SIAM International Conference on Data Mining, pages 461–469. SIAM, 2014.
 [21] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.
 [22] Ashkan Sharabiani, Adam Bress, Elnaz Douzali, and Houshang Darabi. Revisiting warfarin dosing using machine learning techniques. Computational and mathematical methods in medicine, 2015, 2015.
 [23] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.