Efficient Multivariate Bandit Algorithm with Path Planning

09/06/2019 · Keyu Nie, et al.

In this paper, we address the exponential explosion of arms in the multivariate multi-armed bandit (Multivariate-MAB) problem that arises when the arm dimension hierarchy is considered. We propose a framework called Thompson sampling with path planning (TS-PP), which utilizes decision graphs/trees to model the arm reward success rate with m-way dimension interactions and adopts Thompson sampling (TS) for heuristic search of arm selection. It combats the curse of dimensionality in a straightforward way, using a serial process that operates sequentially by focusing on one dimension per step. To the best of our knowledge, we are the first to solve the Multivariate-MAB problem using a graph path planning strategy and ideas akin to Monte-Carlo tree search. Compared with traditional models such as generalized linear regression, our tree-based method achieves faster convergence, more efficient optimal-arm allocation, and lower cumulative regret, as validated by simulation studies.


1 Introduction

The multi-armed bandit (MAB) problem is widely studied in probability theory and reinforcement learning and dates back to clinical trial studies by Thompson [25]. Robbins [21] formulated the setting in 1952: a learner is given a set of arms (options/choices) with little knowledge about the properties of each arm. At each step, the learner chooses an arm and receives a reward from that choice, with the purpose of minimizing the regret, or equivalently maximizing the cumulative reward. The binomial bandit is the most common bandit format, restricting the reward to be binary (r ∈ {0, 1}). Solving the MAB problem involves balancing the acquisition of new knowledge (exploration) against the use of existing knowledge (exploitation) when making the arm selection at each round based on the state of each arm. The upper confidence bound (UCB) algorithm was shown to be an optimal solution, keeping the regret on the order of O(log T) [18][17][1][3]. In online experiments, the Thompson sampling (TS) algorithm has attracted a lot of attention due to its simplicity of implementation and robustness under batch updating. The TS algorithm for the binomial bandit achieves the optimal regret bound as well [14].

Many modern online applications (e.g., UI layout) have configurations involving multiple dimensions to be optimized, such as font size, background color, title text, module location, item image, etc., where each dimension contains multiple options [12][20]. In this paper, we call this the Multivariate-MAB problem. The exploration space faces an exponentially exploding number of possible configurations as dimensions are added to the decision making. The TS algorithm is reported to converge slowly to the optimal solution [12] when dealing with the Multivariate-MAB problem. To speed up convergence, one common enhancement of TS is to model the expected reward as a generalized linear model (TS-GLM) [10][6][23] through a probit/logit link function with m-way dimension interaction features. TS-GLM gives up the ability to fit certain complex interactions in exchange for a lower-dimensional parameter space, and thereby achieves a better solution. However, updating the derived posterior sampling algorithm in TS-GLM requires imputing the multivariate coefficients and creates a computational burden at each iteration [23][24][12]. To relieve this burden, Hill et al. [12] proposed hill-climbing multivariate optimization [7] for TS-GLM and observed faster convergence with a polynomial, rather than exhaustive, exploration of the parameter space.

Different from TS-GLM, our proposed framework, called path planning (TS-PP), combats the curse of dimensionality in a straightforward way through a serial process that operates sequentially and focuses on one dimension at each component step. Furthermore, it naturally treats the arm reward with m-way dimension interactions through m-dimensional joint distributions. Our novelty includes:

(a) modeling the arm selection procedure under a tree structure; (b) efficient arm-candidate search strategies under decision graphs/trees; (c) remarkable convergence improvement through straightforward but effective arm-space pruning; (d) concise and fast posterior sampling of the reward function under a beta-binomial model, even when m-way dimension interactions are considered. Compared to TS-GLM, TS-PP avoids deriving complex and slow posterior sampling for a GLM while still effectively leveraging the m-way dimension interactions, and it achieves even better performance by reducing the arm space with efficient search strategies.

This paper is organized as follows: we first introduce the problem setting and notation; we then explain our approach in detail and discuss the differences among several variations; finally, we examine the algorithm performance in a simulation study and conclude.

2 Multivariate-MAB Problem Formulation

We start with the formulation of the contextual multivariate MAB: the sequential selection of a layout (e.g., a web page) built from a template with D dimensions, each containing N options, under a context (e.g., user preference), with the purpose of minimizing the expected cumulative regret.

For each selected layout, a reward is received from the environment. Here only a binary reward (r ∈ {0, 1}) is discussed, but our approach can be extended to categorical/numeric rewards as well. In the layout template, each dimension i offers a set of alternative options, and we denote the selected option by a_i. For simplicity, we further assume in the following description that every dimension has the same number N of options. The chosen layout can then be denoted as the tuple A = (a_1, ..., a_D). The context includes extra environment information that may impact the layout's expected reward (e.g., device type, user segments, etc.).

At each step t, the bandit algorithm selects an arm A_t from the search space, taking the revealed context into consideration, in order to minimize the cumulative regret over T rounds:

R(T) = \sum_{t=1}^{T} \big[ \mathbb{E}(r_t \mid A_t^*) - \mathbb{E}(r_t \mid A_t) \big],

where A_t^* stands for the best possible arm at step t. Generally, the regret is on the order of O(\sqrt{T}) under linear payoff settings [11][10][2], although the optimal regret of the non-contextual multivariate MAB is on the order of O(log T) [18]. In this paper, we focus on the categorical-contextual multivariate MAB, where the context features are purely categorical variables. By solving the multivariate MAB independently for each combination of context values (assuming there are not too many), it is trivial to show that the optimal regret bound is still O(log T). Without loss of generality, we set the context feature to a constant and ignore it in the following discussion.

3 Related Work

3.1 Probabilistic Model for Multivariate-MAB

To model the multivariate bandit reward of layout A, we denote the feature vector combining A and the (possibly non-linear) interactions within A as B_A. The feature vector may involve only up to m-way dimension interactions (m < D) instead of capturing all D-way interactions. The linear model with pairwise interactions is as follows:

f(A) = \beta_0 + \sum_{i=1}^{D} \beta_i(a_i) + \sum_{i<j} \beta_{ij}(a_i, a_j),     (1)

where the coefficients \beta are fixed but unknown. The model contains a common bias term \beta_0, a weight \beta_i(a_i) for each dimension of the layout, and 2-way dimension interaction terms \beta_{ij}(a_i, a_j), where the sub-indices i and j refer to dimensions i and j respectively.

Under the GLM setting, g(\mu_A) = f(A), where \mu_A is the success rate of the reward and g is the link function, which can be either the inverse of the normal CDF (a probit model) or the logit function (a logit model). For a given A, the likelihood of the reward is Bernoulli with success rate \Phi(f(A)) for the probit model or 1/(1 + e^{-f(A)}) for the logit model. The posterior sampling distribution of the reward integrates this likelihood with some fixed prior distribution over the weights. Updating the posterior at step t requires refitting the GLM for the weights from the cumulative historical rewards, which is cumbersome and creates a computational burden that grows over time.

Hill et al. [12] proposed MVT2 by assuming a probit model with pairwise interactions between dimensions (Equation 1) and employing hill-climbing multivariate optimization to achieve faster convergence.
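To make Equation 1 and the link functions concrete, the following minimal sketch evaluates probit and logit success rates for a small layout. The weight tables, sizes, and layout encoding are hypothetical illustrations, not the authors' implementation.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
D, N = 3, 4  # hypothetical: 3 dimensions, 4 options each

# Hypothetical GLM weights: bias, per-dimension main effects, pairwise interactions.
beta0 = 0.1
beta_main = rng.normal(0, 0.3, size=(D, N))         # beta_i(a_i)
beta_pair = rng.normal(0, 0.1, size=(D, D, N, N))   # beta_ij(a_i, a_j), only i < j used

def linear_score(layout):
    """Linear predictor of Equation 1 for a layout given as a tuple of option indices."""
    score = beta0
    for i, ai in enumerate(layout):
        score += beta_main[i, ai]
    for i in range(D):
        for j in range(i + 1, D):
            score += beta_pair[i, j, layout[i], layout[j]]
    return score

def success_rate(layout, link="probit"):
    f = linear_score(layout)
    if link == "probit":
        return norm.cdf(f)            # inverse link = standard normal CDF
    return 1.0 / (1.0 + np.exp(-f))   # logistic link

layout = (0, 2, 1)
print(success_rate(layout, "probit"), success_rate(layout, "logit"))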

3.2 Thompson Sampling

Thompson sampling (TS) [22] is widely adopted for solving bandit and reinforcement learning problems to balance exploitation and exploration. It uses standard Bayesian techniques to form a posterior distribution over rewards and allocates traffic to each arm in proportion to the probability of that arm being the best arm under the posterior distribution.

Normally we model a binary response with a binomial likelihood and a Beta(α, β) prior, which yields the posterior Beta(α + s_A, β + f_A), where s_A and f_A are the numbers of successes and failures encountered so far at arm A, and α and β are prior parameters, set to 1 for a uniform prior. At the selection stage in round t, TS implicitly allocates traffic as follows: it simulates a single draw θ_A from the posterior for each arm, and the arm with the largest θ_A among all arms is selected. At the update stage, it collects the reward, which is used to update the hidden state (success/failure counts) of the selected arm.

Practically, to solve the Multivariate-MAB problem, the algorithm MAB directly applies TS to select one out of all N^D arms, while the algorithm D-MABs decomposes the Multivariate-MAB into D sub-MABs (one per dimension) and runs TS independently to select one out of N options for each sub-MAB (dimension). We discuss more details of these two algorithms in the following sections.
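As a reference point, the sketch below implements plain Beta-Bernoulli Thompson sampling over the full set of N^D layouts, i.e., the flat MAB baseline described above. It is only an illustration under our own naming conventions (succ, fail, select_arm), not the paper's code.

import itertools
import numpy as np

rng = np.random.default_rng(1)
D, N = 3, 4                                          # hypothetical sizes
arms = list(itertools.product(range(N), repeat=D))   # all N**D layouts

# Hidden states: Beta(1 + successes, 1 + failures) per arm (uniform prior).
succ = {a: 0 for a in arms}
fail = {a: 0 for a in arms}

def select_arm():
    """TS selection: one posterior draw per arm, pick the largest."""
    draws = {a: rng.beta(1 + succ[a], 1 + fail[a]) for a in arms}
    return max(draws, key=draws.get)

def update(arm, reward):
    """Back-propagate the binary reward into the arm's hidden state."""
    if reward:
        succ[arm] += 1
    else:
        fail[arm] += 1

# One toy round against a random Bernoulli environment.
true_rate = {a: rng.uniform(0.05, 0.3) for a in arms}
a = select_arm()
update(a, rng.random() < true_rate[a])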

3.3 Monte-Carlo Tree Search

Monte-Carlo tree search (MCTS) [5][8] is a best-first heuristic search algorithm for quickly locating the best leaf node in a tree structure. In game tree problems, it has achieved great success, especially when the number of leaves is large. Generally, each round of MCTS consists of four steps [9]: selection, expansion, simulation, and back-propagation. A simplified variant needs only the selection and back-propagation steps. At the selection step, MCTS starts from the root of the tree and adopts a selection policy to choose successive child nodes until a leaf node is reached. The back-propagation step uses the reward to update information (such as hidden states) in the nodes on the path from the selected leaf to the root. In the artificial intelligence literature, the most successful MCTS algorithm, UCT, uses UCB [16] as its node selection policy. To the best of our knowledge, applying TS as the node selection policy in MCTS (TS-MCTS) has not been well investigated in the literature [13].

By introducing a hierarchical dimensional structure over bandit arms, we can build a tree structure over the arms and deploy MCTS with TS techniques for arm selection. We prefer the TS node selection policy due to its robust performance in batch-update situations. Inspired by this idea, we establish the TS path planning algorithm to solve the Multivariate-MAB problem.

4 Approach

In this paper, we propose the TS path planning algorithm (TS-PP) for the Multivariate-MAB problem to overcome the exponential explosion of the arm search space faced by the MAB algorithm. Inspired by the MCTS idea, we use a similar heuristic search strategy to locate the best arm under a tree structure. We call this tree structure a "decision tree", constructed purely from the D dimensions. Notably, there are D! decision trees, corresponding to the different sequential orders of the dimensions over the same leaf nodes, and together they assemble a "decision graph". Under a decision tree/graph, the arm selection procedure is decomposed into a serial process of decision making that operates sequentially and focuses on value selection within one dimension per step. At each sequential decision step, we apply TS as the policy for selecting the successive child node (dimension value). The sequential order of dimensions (the "decision path") is determined by the path planning strategy.

Figure 1 shows an example of a decision graph, a decision tree, and a decision path. Without loss of generality, we assume the dimensions are visited in the order 1, 2, ..., D, which is one arbitrary permutation of the dimensions. The decision tree in Figure 1(b) compactly represents the joint probabilities of all arms (leaf nodes) and internal nodes. Here we borrow notation from [15]: a_{1:i} and P(a_i | a_{1:i-1}) are compact ways to write (a_1, ..., a_i) and P(a_i | a_1, ..., a_{i-1}) respectively, where the sub-index i indicates that the value comes from dimension i. The structure of a decision tree consists of nodes and directed edges. Each level i of the decision tree represents a dimension, and each node at that level corresponds to a value for dimension i. Directed edges connect parent nodes to child nodes, where the arrow represents the conditional (joint) relationship. Associated with each node is a joint probability conditioned on all of its predecessor nodes, shown by the red arrows in Figure 1(b). Based on the chain rule, the likelihood of arm A = (a_1, ..., a_D) factorizes as the product of the node-wise conditionals P(a_i | a_{1:i-1}) along the path from root to leaf. In practice, each of these conditionals can be represented by the hidden states of a Beta distribution (for binary rewards) stored at the corresponding node (given its predecessor path), and the node states are updated at the back-propagation stage (as in MCTS). As in TS, the chance of arm A being the best arm depends on its success probability, and hence is also partially related to these node-wise quantities. Instead of sampling directly from the posterior distributions of the arms, sampling from the distribution associated with each node can also provide guidance on value selection for that node's dimension. Figure 1(a) uses the decision graph to compactly represent all of the decision trees. Once a decision path (red arrows in Figure 1(a)) is determined, the decision graph degenerates to a decision tree for a detailed view. With this abstraction, we further extend the naive MCTS idea with several other path planning strategies.

(a) Decision Graph
(b) Decision Tree under a fixed dimension order
Figure 1: Path Planning Overview

4.1 TS-PP Template

Algorithm 1 provides the TS-PP template to give a big-picture view of our proposal. The proposed path planning algorithms use different path planning strategies to obtain a candidate arm: they navigate from one node to the next, from the start to the destination of the decision graph, and apply TS within the selected node (dimension) to pick the best value for that dimension, conditioned on keeping the options already selected at the predecessor nodes unchanged. As mentioned previously, this conditional posterior sampling distribution relies on the hidden states of the dimension values within the current node, given the dimension-value choices of the predecessor nodes.

1:Input D, N, S, prior parameters α, β
2:for step t = 1, 2, ... do
3:     for search s = 1, ..., S do ▷ Candidates Construction Stage
4:          A_s ← PathPlanning() ▷ Path Planning Procedure
5:          θ_s ← Sample(Path = A_s)
6:     Select Arm A_t = A_{argmax_s θ_s}
7:     Update History Rewards and back-propagate to node hidden states
8:function Sample(Path = [a_1, ..., a_D])
9:     (s, f) ← hidden states of the node identified by Path
10:     Sample θ ~ Beta(α + s, β + f)
11:     Return θ
Algorithm 1 TS-PP Template

We understand that searching for candidates in this way might get stuck at sub-optimal arms. To address this issue, we intentionally repeat the candidate search S (≥ 1) times and re-apply the TS trick among these candidates for the final arm selection. Once the arm is chosen at step t, we back-propagate the collected reward to update the hidden states of the nodes on all possible paths from the selected leaf to the root across the decision trees. The Sample function loads the hidden states of the node identified by its path into memory; these correspond to the joint density of the dimension values on that path. It is worth noting that any relative order of the same dimension values represents the same joint distribution (and therefore the same hidden states). In practice, eagerly back-propagating to all such nodes is computationally demanding, but it can also be implemented as lazy back-propagation with cached memory to reduce the per-step cost.
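One minimal way to realize the node hidden states is sketched below, under our own assumed data layout rather than the authors' code: index each node by the unordered set of (dimension, value) pairs on its path, so that any relative order of the same dimension values maps to the same Beta state, and back-propagate a binary reward into every subset of the chosen layout (the eager variant; a lazy variant would update only the states it later reads).

import itertools
from collections import defaultdict

import numpy as np

rng = np.random.default_rng(2)

# Hidden state per node: [successes, failures]; keyed by a frozenset of
# (dimension, value) pairs so the order along the path does not matter.
states = defaultdict(lambda: [0, 0])

def node_key(pairs):
    return frozenset(pairs)

def sample_node(pairs, prior=(1, 1)):
    """One TS draw from the Beta posterior of the node identified by `pairs`."""
    s, f = states[node_key(pairs)]
    return rng.beta(prior[0] + s, prior[1] + f)

def backpropagate(layout, reward):
    """Eagerly update every non-empty subset of the selected layout's
    (dimension, value) pairs, i.e. all nodes on all paths from leaf to root."""
    pairs = list(enumerate(layout))          # [(dim, value), ...]
    for r in range(1, len(pairs) + 1):
        for subset in itertools.combinations(pairs, r):
            s_f = states[node_key(subset)]
            s_f[0 if reward else 1] += 1

# Toy usage: observe a reward for layout (a_1, a_2, a_3) = (0, 2, 1).
backpropagate((0, 2, 1), reward=1)
print(sample_node([(0, 0), (1, 2)]))  # draw for the pairwise node (dim 0 = 0, dim 1 = 2)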

4.2 Path Planning Procedure

We propose four path planning procedures for candidate searching: Full Path Finding (FPF), Partial Path Finding (PPF), Destination Shift (DS), and Boosted Destination Shift (Boosted-DS). To construct an arm candidate under the decision graph (Figure 1(a)), FPF starts from the root and sequentially optimizes the dimensions one by one in a completely random order, following a depth-first search (DFS) strategy. Sticking with the top-down flavor while extending D-MABs, PPF follows a breadth-first search (BFS) strategy with m-dimensional joint distribution independence (explained later) over all m-subsets of the D dimensions. Finally, inspired by hill climbing [7][12], which starts from a random initial arm (a bottom node in the decision graph) and optimizes the value of one dimension with all other dimension values fixed, we discuss the bottom-up DS and its advanced version Boosted-DS. The following explains the four methods in detail.

Full Path Finding FPF is the direct application of MCTS and corresponds to a DFS-style graph search. Starting from the top, FPF randomly picks a permutation of the dimensions, with all permutations equally likely, to construct a decision tree, and recursively applies the TS policy to the nodes on the path from root to leaf in that tree. It follows the sampled dimension order to sequentially optimize the value of each dimension, fixing the optimized value of the target dimension before moving to the next one. Since we repeat FPF S times, each iteration picks a different decision tree (permutation of dimensions) and constructs one candidate. The computational and space costs of full path finding grow with the number of searches S and with the number of node hidden states maintained; a lazy back-propagation implementation reduces both.

1:procedure Full Path Finding
2:     for index i in a random order of (1, ..., D) do
3:          a_i ← TS(tgtDim = i, prePath = [values already chosen]) and append a_i to prePath
4:     Constructed Candidate A = (a_1, ..., a_D)
5:procedure Partial Path Finding (PPF2)
6:     Random Pick Dimension d
7:     a_d ← TS(tgtDim = d, prePath = []) and prePath ← [a_d]
8:     for index i ≠ d (in parallel) do
9:          a_i ← TS(tgtDim = i, prePath = [a_d]) and collect a_i
10:     Constructed Candidate A = (a_1, ..., a_D)
11:procedure Destination Shift
12:     Initial A = random layout
13:     for round k = 1, 2, ... do
14:          Random Pick Dimension d
15:          a_d ← TS(tgtDim = d, prePath = A without a_d) and update A
16:     Constructed Candidate A
17:procedure Boosted Destination Shift (Boosted-DS2)
18:     Initial A = random layout
19:     for round k = 1, 2, ... do
20:          Random Pick Dimension d
21:          a_d ← bstTS(tgtDim = d, prePath = A without a_d) and update A
22:     Constructed Candidate A
23:function TS(tgtDim = i, prePath = ρ)
24:     for each value v of dimension i do
25:          θ_v ← Sample(Path = ρ + [v])
26:     Return argmax_v θ_v
27:function bstTS(tgtDim = i, prePath = ρ)
28:     for each value v of dimension i do
29:          θ_v ← Sample(Path = [v])
30:          for index j in ρ do
31:               θ_v ← θ_v + Sample(Path = [v, ρ_j])
32:          end for
33:     Return argmax_v θ_v
Algorithm 2 Path Planning Procedures

Partial Path Finding In contrast, PPF corresponds to a BFS-style algorithm. The m-th partial path finding (PPFm) recursively applies the TS policy to the nodes on a pre-path down to level m−1 in the decision graph; it then visits all remaining dimensions (unvisited nodes) in parallel at level m and applies the TS policy to each of them. In particular, the D-MABs method is equivalent to PPF1, which adopts the dimension-independence assumption. The pseudocode in Algorithm 2 between lines 5 and 10 illustrates the PPF2 algorithm, which assumes pairwise joint distribution independence. Mathematically, dimensions i and j are conditionally independent given dimension d if and only if P(a_i, a_j | a_d) = P(a_i | a_d) P(a_j | a_d); in that case we say the joint distributions of (a_d, a_i) and (a_d, a_j) are independent. Pairwise dimensional joint distribution independence then means that P(a_1, ..., a_D) = P(a_d) ∏_{i ≠ d} P(a_i | a_d) for any dimension d, i.e., all other dimensions are conditionally independent of one another given dimension d. Intuitively, PPF2 assumes pairwise interactions between dimensions, as it draws samples from pairwise dimensional joint distributions. Generally, PPFm maps to up-to-m-way interactions in the regression model. The computational and space costs of PPF2 stay modest if we only load the hidden states from the top two levels of the decision graph into memory.

Destination Shift DS randomly picks an initial arm (a bottom node in the decision graph) and performs the hill-climbing method, cycling through the dimensions for a number of rounds. At each round, we randomly choose a dimension to optimize and return the best value for that dimension based on the posterior sampling distribution conditioned on the remaining dimension values being fixed. We then replace the current value of the chosen dimension with the returned value to generate the next layout. The computational and space costs of DS are low, since it only touches leaf-level hidden states.

Boosted Destination Shift Boosted-DS uses the bstTS function instead of TS for value optimization on each target dimension node. It extends our earlier intuition that sampling from an m-dimensional joint distribution corresponds one-to-one to the m-way interaction weights in a regression model. The pseudocode in Algorithm 2 between lines 17 and 22 describes the 2-Boosted-DS (Boosted-DS2) sampling strategy, which follows Equation 1 with the pairwise-interaction assumption. Instead of a single draw for the full arm, at each round with target dimension d it sums samples drawn from the 1-way density of each candidate value of dimension d and from all pairwise joint distributions of that value with the fixed values of the other dimensions. Generally, m-th Boosted-DS (Boosted-DSm) takes the sum of samples drawn from joint distributions of up to m dimensions. The computational and space costs of Boosted-DS2 remain modest if we store all of the needed hidden states.
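The boosted scoring step can be sketched as follows (our own simplification with hypothetical names, assuming the order-invariant Beta states from the earlier sketch are kept in a dictionary called states): for each candidate value of the target dimension, it adds one draw from that value's 1-way state and one draw from each pairwise state formed with the currently fixed values of the other dimensions.

import numpy as np

rng = np.random.default_rng(3)

# Toy order-invariant store: frozenset of (dim, value) pairs -> (successes, failures).
states = {}  # hypothetical; in practice shared with the back-propagation step

def beta_draw(pairs, prior=(1, 1)):
    s, f = states.get(frozenset(pairs), (0, 0))
    return rng.beta(prior[0] + s, prior[1] + f)

def boosted_ts(tgt_dim, current_layout, n_options):
    """Boosted-DS2 value selection for tgt_dim: score each candidate value by the
    sum of a 1-way draw and all pairwise draws with the other fixed dimension values."""
    best_val, best_score = None, -np.inf
    for v in range(n_options):
        score = beta_draw([(tgt_dim, v)])                       # 1-way marginal draw
        for j, aj in enumerate(current_layout):
            if j != tgt_dim:
                score += beta_draw([(tgt_dim, v), (j, aj)])     # pairwise joint draw
        if score > best_score:
            best_val, best_score = v, score
    return best_val

# Toy usage: re-optimize dimension 1 of a 3-dimensional layout with 4 options each.
print(boosted_ts(tgt_dim=1, current_layout=(0, 2, 3), n_options=4))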

In summary, FPF utilizes hidden states from the same decision tree at each iteration; PPF and Boosted-DS only utilize hidden states from the top levels of the decision graph; DS utilizes hidden states at the leaf nodes. DS and Boosted-DS randomly pick a layout to start and keep improving it dimension by dimension until convergence, while FPF and PPF do not randomly guess the other dimension values. All four algorithms approximate the process of finding the best bandit arm by pruning the decision search trees and greedily optimizing a sequential process through all dimensions. As the greedy approach significantly reduces the search space, the convergence performance is expected to beat the traditional Thompson sampling method MAB.
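To make the contrast between the top-down (FPF) and bottom-up (DS) strategies concrete, here is a compact, self-contained sketch reusing the same order-invariant Beta-state idea. It is our own illustration with hypothetical helper names (draw, ts_pick), not the paper's implementation.

import numpy as np

rng = np.random.default_rng(4)
D, N = 3, 4                      # hypothetical sizes
states = {}                      # frozenset of (dim, value) pairs -> (succ, fail)

def draw(pairs):
    s, f = states.get(frozenset(pairs), (0, 0))
    return rng.beta(1 + s, 1 + f)

def ts_pick(tgt_dim, pre_path):
    """TS over values of tgt_dim, conditioned on the fixed (dim, value) pairs."""
    return max(range(N), key=lambda v: draw(list(pre_path) + [(tgt_dim, v)]))

def fpf_candidate():
    """Full Path Finding: top-down, random dimension order, one value per level."""
    order = rng.permutation(D)
    pre_path, layout = [], [None] * D
    for d in order:
        v = ts_pick(d, pre_path)
        pre_path.append((d, v))
        layout[d] = v
    return tuple(layout)

def ds_candidate(rounds=10):
    """Destination Shift: bottom-up hill climbing from a random layout."""
    layout = list(rng.integers(0, N, size=D))
    for _ in range(rounds):
        d = int(rng.integers(D))
        others = [(j, layout[j]) for j in range(D) if j != d]
        layout[d] = ts_pick(d, others)
    return tuple(layout)

print(fpf_candidate(), ds_candidate())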

5 Empirical Validation and Analysis

We illustrate the performance of our algorithms (FPF, PPF, DS, and Boosted-DS) on a simulated data set, comparing them with the MVT [12], MAB [12], and D-MABs [12] base models mentioned before. Specifically, we evaluate (a) the average cumulative regret, (b) the convergence speed, and (c) the efficiency of optimal-arm selection among these models under the same simulation environment settings. To ensure a fair analysis, the mechanism and parameters for generating the simulated data set are chosen completely at random. We also replicate all algorithms multiple (H) times and take the average to eliminate evaluation bias due to the probabilistic randomness of TS. Furthermore, we extensively examine the cumulative regret of the proposed algorithms by varying (1) the relative strength of the interactions between dimensions and (2) the complexity of the arm space (altering D and N) to gain a comprehensive understanding of our model.

5.1 Simulation Settings

Simulated reward data are generated by a Bernoulli simulator whose success rate is linear in the m-way dimension interactions:

\mu_A = c \cdot f_m(A),     (2)

where f_m(A) is a linear score with up-to-m-way interaction terms (as in Equation 1 for m = 2) and c is a scaling variable. We intentionally generate the weights independently at random, and the remaining control parameters are set to govern the overall signal-to-noise ratio as well as the relative strength among the m-way interactions.

In this paper, we set m = 2 (pairwise dimension interactions) and fix D and N in the above simulator settings so that there are 1000 possible layouts. To observe the convergence of each model and eliminate randomness, the simulation is run for a large number of time steps and replicated H times. In each simulation replica and at each time step t, a layout is chosen by each algorithm, and a binary reward is sampled from the Bernoulli simulator with the success rate given by Equation 2 under the pre-generated random weights. We use the same hill-climbing parameter settings across FPF, PPF2, DS, Boosted-DS2, and MVT2 for the comparison.
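For concreteness, a pairwise-interaction Bernoulli simulator in the spirit of Equation 2 can be sketched as follows. The exact form and constants of Equation 2 are not recovered here; the sizes, weight distributions, scaling, and control parameter below are illustrative assumptions only.

import numpy as np

rng = np.random.default_rng(5)
D, N = 3, 10                     # illustrative sizes giving N**D = 1000 layouts

# Illustrative weights for main effects and pairwise interactions.
beta_main = rng.normal(0.0, 1.0, size=(D, N))
beta_pair = rng.normal(0.0, 1.0, size=(D, D, N, N))
alpha = 0.5                      # assumed control of relative interaction strength
c = 0.05                         # assumed scaling variable

def success_rate(layout):
    """Success rate linear in a pairwise-interaction score, clipped into (0, 1).
    This only mimics the structure of Equation 2; the constants are assumptions."""
    score = sum(beta_main[i, a] for i, a in enumerate(layout))
    score += alpha * sum(beta_pair[i, j, layout[i], layout[j]]
                         for i in range(D) for j in range(i + 1, D))
    return float(np.clip(0.2 + c * score, 0.01, 0.99))

def pull(layout):
    """Draw one binary reward from the simulator for a chosen layout."""
    return int(rng.random() < success_rate(layout))

print(success_rate((0, 2, 9)), pull((0, 2, 9)))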

5.2 Numerical Results

Figure 2 shows histograms of arm exploration and selection activity for each algorithm, as well as the distribution of success rates over arms in our simulator. The horizontal axis is the success rate of the selected arm, and the vertical axis is the probability density in the histogram. The success rate density of the Bernoulli simulator is symmetrically distributed, which coincides with our simulation setting. The severity of the right skewness of each algorithm's histogram reveals how efficiently it recognizes badly performing arms and quickly adjusts its search toward the best possible arms. Although MAB is theoretically guaranteed to achieve optimal performance in the long run, the histograms might empirically explain why MVT2, FPF, PPF2, and Boosted-DS2 outperform MAB in many ways. It is worth mentioning that the search behavior (and performance) of DS is similar to that of MAB, but DS has lower computational complexity. This leads us to conclude that the DS strategy by itself, starting the path planning from the bottom, offers limited improvement in heuristic arm search over MAB. The underlying reason could be that only a small fraction of arms is explored at the early stage and little information is known about each arm; a strategy that starts from the top can exploit dimensional analogues and learn an arm's reward distribution from other arms with similar characteristics, which in turn helps to rapidly shift toward better-performing arms. The proposed Boosted-DS2 overcomes this issue of DS by using TS samples from the top levels, and the heavy right skewness of the Boosted-DS2 histogram confirms this.

Figure 2: Histogram of expected reward for historical arm search.

To quantify the effectiveness of optimal arm selection, we use the average regret, the convergence rate, and the best arm rate. We define the convergence rate as the proportion of trials that select the most frequently selected layout within a moving window of batch size 1000. We further define the best arm rate as the proportion of trials that select the best possible layout within one batch:

convergence rate = #{trials in the batch selecting the most often selected layout} / batch size,   best arm rate = #{trials in the batch selecting the best possible layout} / batch size.

Ideally, both the convergence rate and the best arm rate approach 1, meaning that the algorithm converges to selecting a single arm (convergence) that is also the best arm. In practice, a fully converged batch of trials almost surely selects the same layout (possibly sub-optimal), but not necessarily the globally optimal layout.
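The two diagnostics can be computed per batch as in the short sketch below; the layout identity is whatever hashable representation the algorithm selects, and the function name is our own.

from collections import Counter

def batch_metrics(selected_layouts, best_layout):
    """Convergence rate and best arm rate over one batch of selected layouts."""
    counts = Counter(selected_layouts)
    most_common_layout, most_common_count = counts.most_common(1)[0]
    convergence_rate = most_common_count / len(selected_layouts)
    best_arm_rate = counts[best_layout] / len(selected_layouts)
    return convergence_rate, best_arm_rate

# Toy usage on a batch of 5 selections (the paper uses a batch size of 1000).
print(batch_metrics([(0, 1), (0, 1), (0, 1), (2, 3), (0, 1)], best_layout=(0, 1)))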

Simulated performance results are displayed in Figure 3, where the x-axis is the time step. The path planning algorithms demonstrate advantages over the base models, especially FPF, PPF2, and Boosted-DS2. We see that PPF2 and Boosted-DS2 quickly reach low regret (and high reward) within the early steps, followed by FPF and MVT2 somewhat later. Although Boosted-DS2 and MVT2 share the fastest convergence speed, followed by PPF2 and then FPF, FPF holds the highest best arm rate, and its cumulative regret (and reward) catches up over longer horizons. The intuition behind this is that FPF covers the most complex model space by considering the full dimension interactions: it not only looks from the top level of the decision graph to quickly eliminate badly performing dimension values, but also drills down to the leaf arms to correct oversights from the higher levels. The exponential space complexity, or a computational cost proportional to the number of arms, is our concern with FPF compared to PPF2 and Boosted-DS2.

(a) Average Regret.
(b) Convergence Rate
(c) Best Arm Rate
Figure 3: Performance of algorithms on simulated data under the simulator settings of Section 5.1.

In our experiment, PPF2, Boosted-DS2, and MVT2 all assume models with pairwise interactions in one way or another, which happens to match our simulator setting. In practice, extra effort is needed to correctly model the reward function, which is out of this paper's scope. PPF2 and Boosted-DS2 both achieve lower regret more efficiently than MVT2. However, PPF2 attains a better best arm rate than Boosted-DS2. Our takeaway is that the hill-climbing strategy has two drawbacks. First, it is equivalent to a bottom-up path planning strategy in our framework, which is not as efficient as a top-down strategy, as discussed before; Boosted-DS2 combats this weakness by using TS samples from the top levels to mimic draws from the lower levels. Second, hill climbing starts by randomly guessing the other dimension values, which easily ends up at a good-enough arm (low regret and high convergence) but not always the best one (low best arm rate). Meanwhile, D-MABs struggles in performance because its assumption of independence between dimensions does not match our simulator.

Although PPF2, Boosted-DS2, and MVT2 all share a simplified model complexity (both in computation and in parameter space), MVT2 takes a longer time per iteration than the other two. Table 1 shows the iteration speed of these algorithms in our implementation. In fact, MVT2 is the slowest algorithm because of the heavy computational burden of updating the regression coefficients in its posterior sampling distribution.

Algorithm Iteration Speed
FPF 7.04 it/s
PPF2 22.39 it/s
DS 2.01 it/s
Boosted-DS2 1.38 it/s
MVT2 0.25 it/s
Table 1: Algorithm Iteration Speed Comparison

We further extend our simulation results on average cumulative regret by varying the interaction strength, as well as varying D and N to change the space complexity, in Figure 4. We skip MVT2 due to time limitations (MVT2 takes 5 days per experiment). As the interaction-strength parameter varies up to 1 in fixed steps, we see a pattern consistent with the earlier result in Figure 3(a). The only exception is that D-MABs becomes dominant in regret performance when the interaction strength is weak, since its no-interaction assumption is then close to the truth. D-MABs is equivalent to PPF1, so it should also perform similarly to PPF2 when the interaction strength is weak. Next, we analyze the impact of model complexity on performance. We systematically vary D and N in Figures 4(b) and 4(c) respectively, and observe that the relative performance ordering among the algorithms still holds. Based on these extensive experiments, we assert that our proposed method is consistently superior.

In summary, our simulation results suggest that TS-PP performs well overall for the multivariate bandit problem with a large search space when a dimension hierarchy structure exists. FPF achieves the best performance; however, PPF2 is attractive for implementation due to its computational efficiency with comparable performance.

(a) Algorithm performance when the interaction strength varies
(b) Algorithm performance when D varies
(c) Algorithm performance when N varies
Figure 4: Performance of algorithms when D, N, and the interaction strength vary. Average regret values are cumulatively averaged over the number of iterations.

6 Conclusions

In this paper, we presented TS-PP algorithms that take advantage of the hierarchical dimension structure of bandit arms to quickly find the best arm. TS-PP utilizes decision graphs/trees to model the arm reward success rate with m-way dimension interactions, and adopts TS within an MCTS-style heuristic search for arm selection. It combats the curse of dimensionality in a straightforward way, using a serial process that operates sequentially by focusing on one dimension per step. Based on our simulation results, it achieves superior cumulative regret and convergence speed compared with MVT, MAB, and D-MABs on a large decision space. We presented four variations of our algorithm and concluded that FPF and PPF deliver the best performance. We highlight PPF for its implementation simplicity together with its high efficiency.

It is trivial to extend our algorithm to the contextual bandit problem with finitely many categorical context features, but how to extend it from discrete to continuous contextual variables is worth further exploration. We note some related work on TS-MCTS [4] dealing with continuous rewards in this area. Finally, fully understanding how the heuristic greedy approach in our method approximates TS over the full set of arms is still under investigation.

References

  • [1] Rajeev Agrawal. Sample mean based index policies by O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27(4):1054–1078, 1995.
  • [2] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135, 2013.
  • [3] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
  • [4] Aijun Bai, Feng Wu, Zongzhang Zhang, and Xiaoping Chen. Thompson sampling based monte-carlo planning in pomdps. In Twenty-Fourth International Conference on Automated Planning and Scheduling, 2014.
  • [5] Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in games, 4(1):1–43, 2012.
  • [6] Sébastien Bubeck, Nicolo Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
  • [7] George Casella and Roger L Berger. Statistical inference, volume 2. Duxbury Pacific Grove, CA, 2002.
  • [8] Guillaume Chaslot, Sander Bakkes, Istvan Szita, and Pieter Spronck. Monte-carlo tree search: A new framework for game ai. 2008.
  • [9] Guillaume M. J. B. Chaslot, Mark H. M. Winands, H. Jaap van den Herik, Jos W. H. M. Uiterwijk, and Bruno Bouzy. Progressive strategies for monte-carlo tree search. New Mathematics and Natural Computation, 4(03):343–357, 2008.
  • [10] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
  • [11] Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback. 21st Annual Conference on Learning Theory, pages 355–366, 2008.
  • [12] Daniel N Hill, Houssam Nassif, Yi Liu, Anand Iyer, and SVN Vishwanathan. An efficient bandit algorithm for realtime multivariate optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1813–1821. ACM, 2017.
  • [13] Takahisa Imagawa and Tomoyuki Kaneko. Enhancements in monte carlo tree search algorithms for biased game trees. In 2015 IEEE Conference on Computational Intelligence and Games (CIG), pages 43–50. IEEE, 2015.
  • [14] Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. Thompson sampling: An asymptotically optimal finite-time analysis. In International Conference on Algorithmic Learning Theory, pages 199–213, 2012.
  • [15] Mykel J Kochenderfer. Decision making under uncertainty: theory and application. MIT press, 2015.
  • [16] Levente Kocsis and Csaba Szepesvári. Bandit based monte-carlo planning. In Proceedings of the 17th European Conference on Machine Learning, ECML’06, pages 282–293, Berlin, Heidelberg, 2006. Springer-Verlag.
  • [17] Tze Leung Lai et al. Adaptive treatment allocation and the multi-armed bandit problem. The Annals of Statistics, 15(3):1091–1114, 1987.
  • [18] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.
  • [19] Benedict C May, Nathan Korda, Anthony Lee, and David S Leslie. Optimistic bayesian sampling in contextual-bandit problems. Journal of Machine Learning Research, 13(Jun):2069–2106, 2012.
  • [20] Vivek Nair, Zhe Yu, Tim Menzies, Norbert Siegmund, and Sven Apel. Finding faster configurations using flash. IEEE Transactions on Software Engineering, 2018.
  • [21] Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.
  • [22] Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen, et al. A tutorial on thompson sampling. Foundations and Trends® in Machine Learning, 11(1):1–96, 2018.
  • [23] Steven L Scott. A modern bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26(6):639–658, 2010.
  • [24] Steven L Scott. Multi-armed bandit experiments in the online service economy. Applied Stochastic Models in Business and Industry, 31(1):37–45, 2015.
  • [25] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.