1 Introduction
The Multi-Armed Bandit (MAB) problem is widely studied in probability theory and reinforcement learning, and dates back to clinical trial studies by Thompson [25]. Robbins [21] formulated the setting in 1952: a learner faces a set of arms (options/choices) to explore, given little knowledge about the properties of each arm. At each step, the learner chooses an arm and receives a reward, with the purpose of minimizing regret, or equivalently maximizing cumulative reward. The binomial bandit is the most common bandit format, restricting rewards to be binary. Solving the MAB problem involves balancing the acquisition of new knowledge (exploration) against the use of existing knowledge (exploitation) when selecting an arm at each round based on the state of each arm. The upper confidence bound (UCB) algorithm was shown to be an optimal solution, managing a regret bound on the order of O(log(T)) [18][17][1][3]. In online experimentation, the Thompson Sampling (TS) algorithm has attracted much attention due to its simple implementation and its robustness under batch updating. The TS algorithm for the binomial bandit achieves the optimal regret bound as well [14].

Many modern online applications (e.g., UI layout) involve configurations with multiple dimensions to be optimized, such as font size, background color, title text, module location, and item image, where each dimension contains multiple options [12][20]. In this paper, we call this the Multivariate-MAB problem. The exploration space faces an exponentially exploding number of possible configurations as dimensions are added to the decision making. The TS algorithm is reported to converge slowly to the optimal solution when dealing with the Multivariate-MAB problem [12]. To speed up convergence, one common enhanced TS solution is to model the expected reward as a generalized linear model (TS-GLM) [10][6][23]
with a probit/logit link function over up-to-m-way dimension-interaction features. TS-GLM gives up the ability to fit certain complex interactions in exchange for a lower-dimensional parameter space, and achieves better solutions. However, updating the derived posterior sampling algorithm in TS-GLM demands imputing the multivariate coefficients and creates a computational burden at each iteration [23][24][12]. To relieve this burden, Hill et al. [12] proposed hill-climbing multivariate optimization [7] for TS-GLM, and showed that it obtains faster convergence while searching only a polynomially sized portion of the parameter space.

Different from TS-GLM, our proposed framework, called Thompson Sampling Path Planning (TS-PP), combats the curse of dimensionality in a straightforward way: a series of processes that operate sequentially, each focusing on one dimension. Furthermore, it naturally treats arm rewards with m-way dimension interactions via m-dimensional joint distributions. Our novelty includes:
(a) modeling the arm selection procedure under a tree structure; (b) efficient arm-candidate search strategies under decision graphs/trees; (c) remarkable convergence improvement through straightforward yet effective arm-space pruning; and (d) concise and fast posterior sampling of the reward function under a beta-binomial model, even with m-way dimension interactions. Compared to TS-GLM, TS-PP avoids deriving complex and slow posterior sampling in a GLM while still effectively leveraging m-way dimension interactions, and achieves even better performance by reducing the arm space with efficient search strategies.

This paper is organized as follows: we first introduce the problem setting and notation; we then explain our approach in detail and discuss the differences among several variations; finally, we examine algorithm performance in a simulation study and conclude.
2 Multivariate-MAB Problem Formulation
We start with the formulation of the contextual multivariate MAB: the sequential selection of a layout (e.g., a web page), which contains a template with multiple dimensions, each dimension offering several options, under a context (e.g., user preference), for the purpose of minimizing the expected cumulative regret.
For each selected layout, a reward is received from the environment. Here only binary rewards are discussed, but our approach can be extended to categorical/numeric rewards as well. In the layout template, each dimension has a set of alternative options, one of which is the selected option for that dimension. For simplicity, we further assume in the following description that every dimension has the same number of options. The chosen layout is denoted by the tuple of selected options. The context includes extra environment information that may impact a layout's expected reward (e.g., device type, user segments, etc.).
At each step t, the bandit algorithm selects an arm A_t from the search space, taking the revealed context into account, in order to minimize the cumulative regret over T rounds:

R(T) = Σ_{t=1}^{T} [θ(A*_t) − θ(A_t)]

where A*_t stands for the best possible arm at step t and θ(·) denotes an arm's expected reward. Generally, R(T) is on the order of √T under linear payoff settings [11][10][2], while the optimal regret of the non-contextual multivariate MAB is on the order of log T [18]. In this paper, we focus on the categorical-contextual multivariate MAB, where the context features are purely categorical variables. By solving the multivariate MAB independently for each combination of context values (assuming there are not too many), it is trivial to show that the optimal regret bound is still of order log T. Without loss of generality, we set the context feature to a constant and ignore it in the following discussion.

3 Related Work
3.1 Probabilistic Model for Multivariate-MAB
To model the multivariate bandit reward of a layout A = (a_1, …, a_D), where a_i is the option selected for dimension i, we denote the feature vector combining the selected options and the interactions among them (possibly nonlinear) as x_A. The features may involve only up-to-m-way dimension interactions instead of capturing all possible interactions. The linear model with pairwise interactions is as follows:

g(θ_A) = μ + Σ_i w_{a_i} + Σ_{i<j} w_{a_i, a_j}    (1)

where θ_A is the expected reward (success rate) of layout A and the weights are fixed but unknown coefficients. The model contains a common bias term μ, a weight w_{a_i} for each selected dimension value of the layout, and a weight w_{a_i, a_j} for each 2-way dimension interaction. The subindices i and j refer to dimensions i and j, respectively.
Under the GLM setting, the success rate of the reward is mapped through a link function, which can be either the inverse of the normal CDF (probit model) or the logistic function (logit model). For given weights, the likelihood of a reward follows the corresponding probit or logit model. The posterior sampling distribution of the reward integrates this likelihood against a fixed prior on the weights. Updating the posterior at each step requires re-fitting the GLM on the cumulative historical rewards, which is cumbersome and creates a computational burden that grows with time.
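To make the pairwise-interaction feature construction concrete, here is a minimal sketch (our own illustration, not the paper's implementation) of a feature map with one-hot main-effect and 2-way interaction indicators plus a logit link; the function names and encoding are assumptions:

```python
import itertools
import math

def pairwise_features(layout, n_options):
    """Binary feature vector: bias, one-hot per dimension value, and
    one-hot per 2-way combination of dimension values (cf. Eq. (1))."""
    d = len(layout)
    feats = [1.0]                                     # common bias term
    for v in layout:                                  # main effects
        feats += [1.0 if v == k else 0.0 for k in range(n_options)]
    for i, j in itertools.combinations(range(d), 2):  # 2-way interactions
        for a in range(n_options):
            for b in range(n_options):
                feats.append(1.0 if (layout[i], layout[j]) == (a, b) else 0.0)
    return feats

def logit_success_rate(layout, weights, n_options):
    """Success rate under a logit link: sigmoid of the weighted features."""
    x = pairwise_features(layout, n_options)
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))
```

For d dimensions with n options each the feature length is 1 + d·n + C(d,2)·n², e.g. 19 for d = 3, n = 2; with all weights at zero the success rate is exactly 0.5, which makes the link easy to sanity-check.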
3.2 Thompson Sampling
Thompson sampling (TS) [22] is widely adopted for solving bandit and reinforcement learning problems by balancing exploitation and exploration. It uses standard Bayesian techniques to form the posterior distribution of rewards, and hence allocates traffic to each arm in proportion to its probability of being the best arm under the posterior distribution.
Normally we model a binary response with a binomial distribution and a Beta prior, forming the posterior Beta(α + S, β + F), where S and F are the numbers of successes and failures encountered so far at an arm, and α and β are prior parameters, set to 1 for a uniform prior. At the selection stage in each round, traffic is implicitly allocated as follows: a single draw is simulated from the posterior of each arm, and the arm with the largest draw is selected. At the update stage, the collected reward is used to update the hidden state of the selected arm.

Practically, to solve the Multivariate-MAB problem, the MAB algorithm directly adopts TS to select among all possible layouts, while the D-MABs algorithm decomposes the Multivariate-MAB into per-dimension sub-MABs and runs TS independently for each dimension. We discuss the two algorithms in more detail in the following sections.
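The beta-binomial TS loop described above can be sketched as follows (a minimal illustration; the arm encoding, helper names, and toy reward rates are our own):

```python
import random

def thompson_select(state):
    """Draw one sample per arm from its Beta posterior; pick the argmax.

    `state` maps arm -> [successes, failures]; priors alpha=beta=1 (uniform).
    """
    draws = {arm: random.betavariate(1 + s, 1 + f)
             for arm, (s, f) in state.items()}
    return max(draws, key=draws.get)

def thompson_update(state, arm, reward):
    """Fold a binary reward into the chosen arm's hidden state."""
    state[arm][0] += reward        # success count
    state[arm][1] += 1 - reward    # failure count

# Toy run: arm "b" has the higher true success rate, so it should
# dominate the selections after enough rounds.
random.seed(0)
true_rate = {"a": 0.3, "b": 0.7}
state = {"a": [0, 0], "b": [0, 0]}
for _ in range(2000):
    arm = thompson_select(state)
    thompson_update(state, arm, 1 if random.random() < true_rate[arm] else 0)
pulls = {arm: s + f for arm, (s, f) in state.items()}
```

Because traffic follows the posterior probability of being best, the pull counts concentrate on the better arm without any explicit exploration schedule.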
3.3 MonteCarlo Tree Search
Monte-Carlo Tree Search (MCTS) [5][8] is a best-first heuristic search algorithm for quickly locating the best leaf node in a tree structure. In game-tree problems it has achieved great success, especially when the number of leaves is large. Generally, each round of MCTS consists of four steps [9]: selection, expansion, simulation, and back-propagation. A simplified case needs only the selection and back-propagation steps. In the selection step, MCTS starts from the root of the tree and applies a selection policy to choose successive child nodes until a leaf node is reached. The back-propagation step uses the reward to update information (such as hidden states) in the nodes on the path from the selected leaf back to the root. In the artificial intelligence literature, the most successful MCTS algorithm, UCT, uses UCB [16] as its node selection policy. To the best of our knowledge, applying TS as the node selection policy in MCTS (TS-MCTS) has not been well investigated in the literature [13].

By introducing a hierarchical dimensional structure over bandit arms, we can build a tree structure over the arms and deploy MCTS with TS techniques for arm selection. We prefer the TS node selection policy due to its robust performance under batch updates. Inspired by this idea, we establish the TS Path Planning algorithm to solve the Multivariate-MAB problem.
4 Approach
In this paper, we propose the TS path planning algorithm (TS-PP) for the Multivariate-MAB problem, to overcome the exponential explosion of the arm search space faced by the MAB algorithm. Stimulated by the MCTS idea, we utilize a similar heuristic search strategy to locate the best arm under a tree structure. We call such a tree structure a "decision tree", constructed purely from the layout dimensions. Notably, there are multiple decision trees, constructed from different sequential dimension orders over the same leaf nodes, and together they assemble a "decision graph". Under a decision tree/graph, the arm selection procedure is decomposed into a series of decision-making processes that operate sequentially, each focusing on value selection within one dimension. At each sequential decision process, we apply TS as the policy for selecting the successive child node (dimension value). The sequential order of dimensions (the "decision path") is determined by the path planning strategy.

Figure 1 shows an example of a decision graph, a decision tree, and a decision path. Without loss of generality, we assume the dimensions are tagged in an arbitrary fixed order. The decision tree in Figure 1(b) compactly represents the joint probabilities of all arms (leaf nodes) and internal nodes. Here we borrow notation from [15]. The structure of a decision tree consists of nodes and directed edges. Each level of the decision tree represents a dimension, and each node at that level corresponds to a value of that dimension. Directed edges connect parent nodes to child nodes, where the arrow represents a conditional (joint) relationship. Associated with each node is a joint probability conditioned on all of the node's predecessors (red arrows in Figure 1(b)). Based on the chain rule, the likelihood of an arm is the product of the conditional probabilities along its root-to-leaf path. In practice, each conditional probability can be represented by hidden states from a Beta distribution (for binary rewards) at the corresponding node, given its predecessors, and the node's states can be updated in the back-propagation stage (as in MCTS). As in TS, the chance of an arm being the best depends on its posterior, and hence is also partially related to the node-level posteriors. Instead of sampling directly from the posterior distributions of arms, sampling from the distribution associated with each node can also provide guidance on value selection for that dimension. Figure 1(a) uses a decision graph to compactly represent the decision trees. Once a decision path (red arrow in Figure 1(a)) is determined, the decision graph degenerates to a decision tree for a detailed view. With this abstraction, we further extend the naive MCTS idea with several other path planning strategies.

4.1 TS-PP Template
Algorithm 1 provides the TS-PP template to give a big-picture view of our proposal. The proposed path planning algorithms use different path planning strategies to obtain a candidate arm, navigating from node to node from the start to the destination of the decision graph, and applying TS within each selected node (dimension) to pick the best value for that dimension, conditional on keeping the multivariate options selected at predecessor nodes unchanged. This conditional posterior sampling distribution, as mentioned previously, relies on the hidden states of the dimension values within the current node given the predecessor nodes' dimension-value choices.
We recognize that searching for candidates in this way might get stuck at suboptimal arms. To address this issue, we intentionally repeat the candidate search several times and reapply the TS trick among these candidates for the final arm selection. Once the arm is chosen at a step, we back-propagate the collected reward to update the hidden states of the nodes along all possible paths from the selected leaf to the root across the decision trees. Each node's hidden states, given its pre-path, correspond to a joint density, and it is worth noting that any relative order of the pre-path represents the same joint distribution (with the same hidden states). In practice, this full back-propagation is computationally expensive, but it can also be implemented lazily, trading cached memory for a lower per-step computational cost.
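Assuming the canonical-prefix keying just described (any relative order of the pre-path shares one hidden state), the template's back-propagation and final candidate selection might be sketched as follows; all names and the data layout are our own:

```python
import itertools
import random

def backpropagate(states, layout, reward):
    """Update the Beta hidden state (successes, failures) of every node on
    every root-to-leaf path consistent with `layout`: each dimension value
    paired with every subset of the other fixed values as its pre-path,
    canonically sorted so that all orderings share one hidden state."""
    items = sorted(layout.items())
    for dim, val in items:
        rest = [it for it in items if it[0] != dim]
        for r in range(len(rest) + 1):
            for subset in itertools.combinations(rest, r):
                key = (tuple(subset), (dim, val))
                s, f = states.get(key, (0, 0))
                states[key] = (s + reward, f + (1 - reward))

def leaf_draw(states, layout):
    """One TS draw for a full layout, read from a leaf-level node."""
    items = sorted(layout.items())
    s, f = states.get((tuple(items[:-1]), items[-1]), (0, 0))
    return random.betavariate(1 + s, 1 + f)

def tspp_step(search_fn, n_candidates, states):
    """TS-PP template: repeat the path-planning search to collect
    candidates, then let a final TS round pick among them."""
    candidates = [search_fn(states) for _ in range(n_candidates)]
    return max(candidates, key=lambda c: leaf_draw(states, c))

# Demo: one reward back-propagated through a 2-dimension layout,
# then a step choosing among random candidates.
random.seed(0)
states = {}
backpropagate(states, {0: 1, 1: 0}, 1)
chosen = tspp_step(lambda st: {0: random.randrange(2), 1: random.randrange(2)},
                   5, states)
```

The subset enumeration makes the exponential cost of eager back-propagation explicit; a lazy variant would defer these updates and cache them instead.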
4.2 Path Planning Procedure
We propose four path planning procedures for candidate searching: Full Path Finding (FPF), Partial Path Finding (PPF), Destination Shift (DS) and Boosted Destination Shift (Boosted-DS). To construct an arm candidate under the decision graph (Figure 1(a)), FPF starts from the root and sequentially optimizes the dimensions one by one in a completely random order, following a depth-first search (DFS) strategy. Sticking with the top-down flavor while extending D-MABs, PPF follows a breadth-first search (BFS) strategy with dimensional joint-distribution independence (explained later) across the sub-dimensions. Finally, inspired by hill-climbing [7][12], which starts from a random initial arm (a bottom node in the decision graph) and optimizes the value of one dimension with all other dimension values fixed, we discuss the bottom-up-flavored DS and its advanced version, Boosted-DS. The following explains the four methods in detail.
Full Path Finding. FPF is the direct application of MCTS and corresponds to a DFS algorithm in graph search. Starting from the top, FPF randomly picks a permutation of the dimensions, each with equal chance, to construct a decision tree, and recursively applies the TS policy to the nodes on the path from root to leaf in that decision tree. It follows the permuted dimension order to sequentially optimize the value for each dimension. Since we repeat FPF several times, each iteration picks a different decision tree (permutation of dimensions) and constructs one candidate. The computational and space requirements of full path finding grow quickly with the number of dimensions; a lazy back-propagation implementation can improve the computation and space costs separately.
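A minimal sketch of one FPF candidate search, with a generic `node_draw(prefix, dim, value)` sampler standing in for the conditional Beta posteriors (an assumed interface, not the paper's exact code):

```python
import random

def fpf_candidate(n_dims, n_options, node_draw):
    """Full Path Finding: pick a uniformly random permutation of the
    dimensions (one decision tree), then fix each dimension's value in
    that order by the largest posterior draw conditioned on the values
    already fixed."""
    order = list(range(n_dims))
    random.shuffle(order)                  # random decision tree
    fixed = {}
    for dim in order:
        fixed[dim] = max(range(n_options),
                         key=lambda v: node_draw(dict(fixed), dim, v))
    return fixed

# Demo with a dummy sampler in place of the Beta posteriors.
random.seed(0)
cand = fpf_candidate(4, 3, lambda prefix, dim, v: random.random())
```

Repeating this search yields candidates from different decision trees, which is exactly the repetition the template exploits.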
Partial Path Finding. In contrast, PPF corresponds to a BFS algorithm. An m-th-order partial path finding (PPF-m) recursively applies the TS policy to the nodes on a pre-path down to level m-1 of the decision graph, then simultaneously visits the remaining dimensions (unvisited nodes) in parallel at level m and applies the TS policy accordingly. Specifically, the D-MABs method is equivalent to PPF-1, which adopts the dimension-independence assumption. The pseudocode in Algorithm 2 between lines 5 and 10 illustrates the PPF-2 algorithm, which assumes pairwise joint-distribution independence: two dimensions are conditionally independent of the rest if and only if their pairwise joint density factorizes out of the full joint, in which case we call the pairwise joint distributions independent. Pairwise dimensional joint-distribution independence thus means the full joint over dimensions factorizes into such pairwise joints. Intuitively, PPF-2 assumes pairwise interactions between dimensions, as it draws samples from pairwise dimensional joint distributions. Generally, PPF-m maps to up-to-m-way interactions in the regression model. PPF-2 achieves its optimal computational and space cost if we load only the hidden states from the top two levels of the decision graph into memory.
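One way the PPF-2 candidate search could look, with `root_draw` and `pair_draw` as hypothetical samplers for the level-1 and pairwise-joint posteriors (interfaces assumed for illustration):

```python
import random

def ppf2_candidate(n_dims, n_options, root_draw, pair_draw):
    """Partial Path Finding of order 2: TS first fixes the top-level
    dimension, then optimizes every remaining dimension in parallel using
    only its pairwise joint with that fixed value."""
    first = 0                          # e.g. take dimension 0 as the top level
    v1 = max(range(n_options), key=lambda v: root_draw(first, v))
    layout = {first: v1}
    for dim in range(1, n_dims):       # breadth-first over the level-2 nodes
        layout[dim] = max(range(n_options),
                          key=lambda v: pair_draw(first, v1, dim, v))
    return layout

# Demo with dummy samplers in place of the posteriors.
random.seed(0)
cand = ppf2_candidate(4, 3, lambda d, v: random.random(),
                      lambda d1, v1, d2, v2: random.random())
```

Because only the top two levels of states are touched, the memory footprint stays far below the full decision tree.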
Destination Shift. DS randomly picks an initial arm (a bottom node in the decision graph) and performs the hill-climbing method, cycling through all dimensions for several rounds. At each round, we randomly choose a dimension to optimize and return the best value for that dimension based on the posterior sampling distribution, conditional on the remaining dimension values being fixed. We then use this value to generate the next arm from the current one. The computational and space costs are modest, since only leaf-level states are needed.
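The DS climb can be sketched as below, with `leaf_sample(layout)` a hypothetical posterior sampler over full layouts; the round count and demo sampler are our own choices:

```python
import random

def ds_candidate(n_dims, n_options, leaf_sample, n_rounds=60):
    """Destination Shift: start from a random layout (bottom node) and
    hill-climb, each round re-optimizing one randomly chosen dimension via
    posterior draws over the full layout, other dimensions held fixed."""
    layout = [random.randrange(n_options) for _ in range(n_dims)]
    for _ in range(n_rounds):
        dim = random.randrange(n_dims)
        scores = []
        for v in range(n_options):
            trial = list(layout)
            trial[dim] = v
            scores.append((leaf_sample(trial), v))
        layout[dim] = max(scores)[1]
    return layout

# Demo: with a sampler whose noise is smaller than the option gap, the
# climb settles on the all-ones layout.
random.seed(0)
best = ds_candidate(3, 2, lambda lay: sum(lay) + 0.1 * random.random())
```

Only full-layout (leaf) states are consulted, which is the structural reason DS cannot borrow strength across similar arms the way top-down strategies do.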
Boosted Destination Shift. Boosted-DS utilizes a boosted TS function (bstTS) instead of the plain TS function for value optimization at each target dimension node. It extends our earlier intuition that sampling from an m-dimensional joint distribution maps one-to-one to the m-way interaction weights in a regression model. The pseudocode in Algorithm 2 between lines 17 and 22 describes the Boosted-DS-2 sampling strategy, which follows Equation 1 under the pairwise-interaction assumption. Instead of a single draw for the arm, at each round with a target dimension it sums samples drawn from the 1-way density of the target dimension value and from all pairwise joint distributions of the target dimension value with each of the other fixed dimension values. Generally, m-th-order Boosted-DS (Boosted-DS-m) takes the sum of samples drawn from up-to-m-dimensional joint distributions. The computational and space costs of Boosted-DS-2 remain manageable if we store all the needed hidden states.
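The bstTS score just described might be computed as follows, with `one_draw` and `pair_draw` as hypothetical samplers for the 1-way and pairwise-joint posteriors:

```python
def boosted_draw_2(layout, dim, value, one_draw, pair_draw):
    """bstTS score for setting `dim` to `value`: one draw from the 1-way
    density of (dim, value) plus a draw from each pairwise joint of
    (dim, value) with every other fixed dimension value."""
    total = one_draw(dim, value)
    for other, v in enumerate(layout):
        if other != dim:
            total += pair_draw(dim, value, other, v)
    return total

# Deterministic check: samplers that just echo the candidate value, so a
# 3-dimension layout yields one 1-way draw plus two pairwise draws.
score = boosted_draw_2([0, 0, 0], 0, 1,
                       lambda d, v: float(v),
                       lambda d1, v1, d2, v2: float(v1))
```

Summing these draws mirrors summing the main-effect and pairwise-interaction terms of Equation 1, which is the one-to-one mapping the text appeals to.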
In summary, FPF utilizes hidden states from the same decision tree at each iteration; PPF and Boosted-DS utilize only hidden states on the top levels of the decision graph; DS utilizes hidden states on the leaf nodes. DS and Boosted-DS randomly pick a layout to start and keep improving it dimension by dimension until convergence, while FPF and PPF do not randomly guess the other dimension values. All four algorithms approximate the process of finding the best bandit arm by pruning the decision search trees and greedily optimizing a sequential process through all dimensions. As the greedy approach significantly reduces the search space, the convergence performance is expected to beat the traditional Thompson sampling method (MAB).
5 Empirical Validation and Analysis
We illustrate the performance of our algorithms (FPF, PPF, DS and Boosted-DS) on a simulated data set, comparing with the MVT [12], MAB [12] and D-MABs [12] base models mentioned before. Specifically, we evaluate (a) the average cumulative regret, (b) the convergence speed, and (c) the efficiency of optimal arm selection among these models under the same simulation environment settings. For a fair analysis, the mechanism and parameters for generating the simulation data set are chosen completely at random. We also replicate all algorithms multiple (H) times and take the average to eliminate evaluation bias due to TS's probabilistic randomness. Furthermore, we extensively examine the cumulative regret of the proposed algorithms by varying (1) the relative strength of the interactions between dimensions and (2) the complexity of the arm space (altering the numbers of dimensions and options) to gain a comprehensive understanding of our model.
5.1 Simulation Settings
Simulated reward data is generated by a Bernoulli simulator whose success rate is linear in the m-way dimension interactions:

(2)

where a scaling variable and a few control parameters govern the signal. We intentionally generate the weights independently from a normal distribution, and set the scale parameters to control the overall signal-to-noise ratio as well as the relative strength among the m-way interactions.
In this paper, we use pairwise dimension interactions (m = 2) in the simulator settings above, with dimension and option counts chosen to yield 1000 possible layouts. To observe the convergence of each model and eliminate randomness, the simulation is run over many time steps with multiple replications. On each simulation replica and at each time step, a layout is chosen by each algorithm, and a binary reward is sampled from the Bernoulli simulator with the success rate given by Equation 2 under pre-generated random weights. We use the same hill-climbing model parameter settings to compare the FPF, PPF-2, DS, Boosted-DS-2 and MVT2 methods.
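As one possible realization of such a simulator (the parameter names and the squashing of the linear score into (0, 1) are our assumptions; the exact form of Equation 2 may differ):

```python
import itertools
import math
import random

def make_simulator(n_dims, n_options, interaction_scale, seed=0):
    """Bernoulli simulator in the spirit of Eq. (2): success rate is a
    squashed linear function of main effects plus scaled pairwise
    interaction weights, all drawn i.i.d. normal."""
    rng = random.Random(seed)
    main = {(d, v): rng.gauss(0, 1)
            for d in range(n_dims) for v in range(n_options)}
    pair = {(i, a, j, b): interaction_scale * rng.gauss(0, 1)
            for i, j in itertools.combinations(range(n_dims), 2)
            for a in range(n_options) for b in range(n_options)}

    def success_rate(layout):
        z = sum(main[(d, v)] for d, v in enumerate(layout))
        z += sum(pair[(i, layout[i], j, layout[j])]
                 for i, j in itertools.combinations(range(n_dims), 2))
        return 1.0 / (1.0 + math.exp(-z))   # keep the rate in (0, 1)

    def reward(layout):
        return 1 if rng.random() < success_rate(layout) else 0

    return success_rate, reward

# Demo: a small simulator with 3 dimensions of 2 options each.
rate, reward = make_simulator(3, 2, 0.5, seed=42)
p = rate((0, 1, 0))
```

The `interaction_scale` knob plays the role of the relative interaction-strength parameter varied in the experiments below.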
5.2 Numerical Results
Figure 2 shows histograms of the arm exploration and selection activity for each algorithm, as well as the distribution of success rates over arms in our simulator. The horizontal axis is the success rate of the selected arm, while the vertical axis is the probability density of the histogram. The success rate density of the Bernoulli simulator is symmetrically distributed, which coincides with our simulation setting. The severity of the right skewness of each algorithm's histogram reveals its efficiency and momentum in recognizing badly performing arms and quickly shifting the search toward the best possible arms. Although MAB is theoretically guaranteed to achieve optimal performance in the long run, the histograms may empirically explain why MVT2, FPF, PPF-2 and Boosted-DS-2 outperform MAB in many ways. It is worth mentioning that the search behavior (performance) of DS is similar to MAB, but DS has a simpler computational cost. This suggests that the DS strategy by itself, starting path planning from the bottom, offers limited improvement in heuristic arm search over MAB. The underlying reason could be that only a small fraction of arms is explored at an early stage, when little is known about each arm; a start-from-the-top strategy can exploit dimensional analogues and learn an arm's reward distribution from other arms with similar characteristics. In turn, this helps to rapidly shift toward better-performing arms. The proposed Boosted-DS-2 overcomes DS's issue by using TS samples from the top levels; the heavy right skewness in the Boosted-DS-2 histogram confirms this hypothesis.

To measure the effectiveness of optimal selection, we use the average regret, the convergence rate and the best arm rate. We define the convergence rate as the proportion of trials selecting the most-selected layout over a moving window with batch size 1000. We further define the best arm rate as the proportion of trials selecting the best possible layout within one batch.

where the most-often-selected layout and the best possible layout are identified within each batch. Ideally, we prefer the convergence rate and the best arm rate to both approach 1, meaning the algorithm converges its selection to a single arm (convergence) that is also the best arm. In practice, a fully converged batch of trials almost surely selects the same layout, but that layout may be suboptimal rather than globally optimal.
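The two batch metrics can be computed directly; a minimal sketch (the toy batch is our own example):

```python
from collections import Counter

def convergence_rate(batch):
    """Share of trials in a batch that chose the most-selected layout."""
    return Counter(batch).most_common(1)[0][1] / len(batch)

def best_arm_rate(batch, best_layout):
    """Share of trials in a batch that chose the best possible layout."""
    return sum(1 for c in batch if c == best_layout) / len(batch)

# Demo batch: layout (0, 1) chosen 7 times, the true best (1, 1) 3 times,
# so convergence is high toward a suboptimal layout.
batch = [(0, 1)] * 7 + [(1, 1)] * 3
```

This toy batch illustrates the caveat above: a high convergence rate (0.7 toward one layout) can coexist with a low best arm rate (0.3).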
Simulated performance results are displayed in Figure 3, where the x-axis is the time step. The path planning algorithms demonstrate advantages over the base models, especially FPF, PPF-2 and Boosted-DS-2. We see that PPF-2 and Boosted-DS-2 quickly reach low regret (and high reward) within relatively few steps, followed by FPF and MVT2. Although Boosted-DS-2 and MVT2 share the fastest convergence speed, followed by PPF-2 and then FPF, FPF holds the highest best arm rate, and its cumulative regret (and reward) catches up over longer horizons. The intuition behind this is that FPF entails the most complex model space, considering full dimension interactions: it not only looks from the top levels of the decision graph to quickly eliminate badly performing dimension values, but also drills down to the leaf arms to correct oversights from the higher levels. The exponential space cost, and the computational cost it entails, is our concern with FPF compared with PPF-2 and Boosted-DS-2.
In our experiment, PPF-2, Boosted-DS-2 and MVT2 all assume models with pairwise interactions in one way or another, which happens to match our simulator setting. In practice, extra effort is needed to model the reward function correctly, which is beyond this paper's scope. PPF-2 and Boosted-DS-2 both achieve lower regret more efficiently than MVT2; however, PPF-2 attains a better best arm rate than Boosted-DS-2. Our takeaway is that the hill-climbing strategy has two drawbacks. First, it is equivalent to a bottom-up path planning strategy in our framework, which is not as efficient as the top-down strategy, as discussed before; Boosted-DS-2 combats this weakness by using TS samples on the top levels to mimic draws from the lower levels. Second, hill-climbing starts by randomly guessing the other dimension values, which easily ends in a good-enough arm selection (low regret and high convergence) but not always the best one (low best arm rate). Meanwhile, D-MABs struggles in performance, as its assumption of independence between dimensions does not match our simulator.
Although PPF-2, Boosted-DS-2 and MVT2 all share a simplified model complexity (both in computation and in parameter space), MVT2 takes much longer per iteration than the other two. Table 1 shows the iteration speed of these algorithms in our implementation. In fact, MVT2 is the slowest algorithm, due to the heavy computational burden of updating the regression coefficients in its posterior sampling distribution.
Algorithm      Iteration Speed
FPF            7.04 it/s
PPF-2          22.39 it/s
DS             2.01 it/s
Boosted-DS-2   1.38 it/s
MVT2           0.25 it/s
We further extend our simulation results on average cumulative regret by varying the interaction strength as well as varying the numbers of options and dimensions to change the space complexity, in Figure 4. We skip MVT2 due to time limitations (MVT2 takes 5 days per experiment). As the interaction strength varies from weak to strong, the pattern in Figure 4(a) is consistent with the earlier results. The only exception is that D-MABs attains the dominant (lowest) regret when the interaction strength is weak, as its no-interaction assumption is then close to the truth; since D-MABs is equivalent to PPF-1, it should also perform similarly to PPF-2 when the interaction strength is weak. Next, we analyze the impact of model complexity on performance. We systematically vary the number of options and the number of dimensions in Figures 4(b) and 4(c) respectively, and observe that the relative performance ordering still holds. Based on these extensive experiments, we assert that our proposed method is consistently superior.
In summary, our simulation results suggest that TS-PP performs well overall on multivariate bandit problems with large search spaces when a dimension hierarchy structure exists. FPF accomplishes the best performance, but PPF-2 deserves attention for implementation due to its computational efficiency and comparable performance.
6 Conclusions
In this paper, we presented TS-PP algorithms that take advantage of the hierarchical dimension structure of bandit arms to quickly find the best arm. TS-PP utilizes decision graphs/trees to model the arm reward success rate with m-way dimension interactions, and adopts TS within MCTS for the heuristic search of arm selection. It combats the curse of dimensionality in a straightforward way, using a series of processes that operate sequentially, each focusing on one dimension. Based on our simulation results, it achieves superior cumulative regret and convergence speed compared with MVT, MAB and D-MABs on large decision spaces. We presented four variations of our algorithm and concluded that FPF and PPF deliver the best performance. We highlight PPF for its implementation simplicity and high efficiency.
It is trivial to extend our algorithm to the contextual bandit problem with finite categorical context features, but extending it from discrete to continuous contextual variables is worth further exploration. We note some related TS-MCTS work [4] dealing with continuous rewards in this area. Finally, a full understanding of the mechanism by which our heuristic greedy approach approximates TS over arms is still under investigation.
References
[1] Rajeev Agrawal. Sample mean based index policies by O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27(4):1054–1078, 1995.
[2] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135, 2013.
[3] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
[4] Aijun Bai, Feng Wu, Zongzhang Zhang, and Xiaoping Chen. Thompson sampling based Monte-Carlo planning in POMDPs. In Twenty-Fourth International Conference on Automated Planning and Scheduling, 2014.
[5] Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.
[6] Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
[7] George Casella and Roger L Berger. Statistical Inference, volume 2. Duxbury, Pacific Grove, CA, 2002.
[8] Guillaume Chaslot, Sander Bakkes, Istvan Szita, and Pieter Spronck. Monte-Carlo tree search: A new framework for game AI. 2008.
[9] Guillaume M JB Chaslot, Mark HM Winands, H Jaap van den Herik, Jos WHM Uiterwijk, and Bruno Bouzy. Progressive strategies for Monte-Carlo tree search. New Mathematics and Natural Computation, 4(03):343–357, 2008.
[10] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
[11] Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback. In 21st Annual Conference on Learning Theory, pages 355–366, 2008.
[12] Daniel N Hill, Houssam Nassif, Yi Liu, Anand Iyer, and SVN Vishwanathan. An efficient bandit algorithm for realtime multivariate optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1813–1821. ACM, 2017.
[13] Takahisa Imagawa and Tomoyuki Kaneko. Enhancements in Monte Carlo tree search algorithms for biased game trees. In 2015 IEEE Conference on Computational Intelligence and Games (CIG), pages 43–50. IEEE, 2015.
[14] Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. Thompson sampling: An asymptotically optimal finite-time analysis. pages 199–213, 2012.
[15] Mykel J Kochenderfer. Decision Making Under Uncertainty: Theory and Application. MIT Press, 2015.
[16] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In Proceedings of the 17th European Conference on Machine Learning, ECML'06, pages 282–293, Berlin, Heidelberg, 2006. Springer-Verlag.
[17] Tze Leung Lai et al. Adaptive treatment allocation and the multi-armed bandit problem. The Annals of Statistics, 15(3):1091–1114, 1987.
[18] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
[19] Benedict C May, Nathan Korda, Anthony Lee, and David S Leslie. Optimistic Bayesian sampling in contextual-bandit problems. Journal of Machine Learning Research, 13(Jun):2069–2106, 2012.
[20] Vivek Nair, Zhe Yu, Tim Menzies, Norbert Siegmund, and Sven Apel. Finding faster configurations using FLASH. IEEE Transactions on Software Engineering, 2018.
[21] Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.
[22] Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen, et al. A tutorial on Thompson sampling. Foundations and Trends in Machine Learning, 11(1):1–96, 2018.
[23] Steven L Scott. A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26(6):639–658, 2010.
[24] Steven L Scott. Multi-armed bandit experiments in the online service economy. Applied Stochastic Models in Business and Industry, 31(1):37–45, 2015.
[25] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.