1 Introduction
Markov decision processes (MDPs) are a standard model for planning under uncertainty [Puterman 1994]. An MDP is defined by a set of possible agent states S, a set of agent actions A, a stochastic transition function P, and a reward function R. Depending on the problem domain and the representation language, the description of the MDP can be either declarative or generative (or mixed); in any case, the description is assumed to be concise. While declarative models provide the agents with greater algorithmic flexibility, generative models are more expressive, and both types of models allow for simulated execution of all feasible action sequences from any state of the MDP. The current state of the agent is fully observable, and the objective of the agent is to act so as to maximize its accumulated reward. In the finite-horizon setting that will be used for most of the paper, the reward is accumulated over some predefined number of steps H.
The desire to handle MDPs with state spaces of size exponential in the size of the model description has led researchers to consider online planning in MDPs. In online planning, rather than computing a high-quality policy for the entire MDP before taking any action, the agent focuses only on which action to perform next. The decision process consists of a deliberation phase, also known as planning, terminated either according to a predefined schedule or due to an external interrupt, and followed by the recommendation of an action for the current state. Once that action is applied in the real environment, the decision process is repeated from the resulting state to select the next action, and so on.
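The decision loop just described can be sketched as follows; `plan_action` stands in for any online planner (such as an MCTS variant), and the `env.step` interface is an illustrative assumption rather than anything prescribed by the paper:

```python
def online_planning_loop(env, initial_state, horizon, plan_action):
    """Generic online-planning loop: deliberate, act, observe, repeat.

    `env.step(state, action)` returning (next_state, reward) and the
    `plan_action(state, steps_to_go)` planner are assumed interfaces.
    """
    state, total_reward = initial_state, 0.0
    for steps_to_go in range(horizon, 0, -1):
        action = plan_action(state, steps_to_go)   # deliberation phase
        state, reward = env.step(state, action)    # act in the real environment
        total_reward += reward
    return total_reward
```

The planner is re-invoked from every obtained state, which is exactly what distinguishes online planning from computing a full policy up front.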
The quality of the action a recommended for state s with h steps-to-go is assessed in terms of the probability that a is suboptimal, and in terms of the (closely related) measure of simple regret. The latter captures the performance loss that results from taking a and only then following an optimal policy for the remaining steps, instead of following an optimal policy from the beginning [Bubeck and Munos 2010].
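In standard MDP notation (the symbols Q, V*, and SR below are ours, introduced for concreteness), the simple regret of recommending action a at state s with h steps-to-go can be written as:

```latex
\mathrm{SR}(s, a, h) \;=\; V^*(s, h) - Q(s, a, h),
\qquad
V^*(s, h) \;=\; \max_{a' \in A(s)} Q(s, a', h),
```

with Q(s, a, h) = R(s, a) + \sum_{s'} P(s' \mid s, a)\, V^*(s', h - 1) and V^*(\cdot, 0) \equiv 0; the recommended action a is suboptimal exactly when Q(s, a, h) < V^*(s, h).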
With a few recent exceptions developed for declarative MDPs [Bonet and Geffner 2012; Kolobov, Mausam, and Weld 2012; Busoniu and Munos 2012], most algorithms for online MDP planning are variants of what is called Monte-Carlo tree search (MCTS). One of the earliest and best-known MCTS algorithms for MDPs is the sparse sampling algorithm of Kearns, Mansour, and Ng [1999]. Sparse sampling offers near-optimal action selection in discounted MDPs by constructing a sampled lookahead tree in time exponential in the discount factor and the suboptimality bound, but independent of the state space size. However, if terminated before an action has been proven near-optimal, sparse sampling offers no quality guarantees on its action selection; thus it does not really fit the setup of online planning. Several later works introduced interruptible, anytime MCTS algorithms for MDPs, with UCT [Kocsis and Szepesvári 2006] probably being the most widely used such algorithm these days. Anytime MCTS algorithms are designed to provide convergence to the best action if enough time is given for deliberation, as well as a gradual reduction of performance loss over the deliberation time [Sutton and Barto 1998; Péret and Garcia 2004; Kocsis and Szepesvári 2006; Coquelin and Munos 2007; Cazenave 2009; Rosin 2011; Tolpin and Shimony 2012]. While UCT and its successors have been devised specifically for MDPs, some of these algorithms are also successfully used in partially observable and adversarial settings [Gelly and Silver 2011; Sturtevant 2008; Bjarnason, Fern, and Tadepalli 2009; Balla and Fern 2009; Eyerich, Keller, and Helmert 2010].
In general, the relative empirical attractiveness of the various MCTS planning algorithms depends on the specifics of the problem at hand and usually cannot be predicted ahead of time. When it comes to formal guarantees on the expected performance improvement over the planning time, very few of these algorithms provide such guarantees for general MDPs, and none breaks the barrier of worst-case polynomial-rate reduction of simple regret and choice-error probability over time.
This is precisely our contribution here. We introduce a new Monte-Carlo tree search algorithm, BRUE, that guarantees exponential-rate reduction of both simple regret and choice-error probability over time, for general MDPs over finite state spaces. The algorithm is based on a simple and efficiently implementable sampling scheme in which different parts of each sample are dedicated to different competing exploratory objectives. The motivation for this objective decoupling came from a recently growing understanding that current MCTS algorithms for MDPs do not optimize the reduction of simple regret directly, but only via optimizing what is called cumulative regret, a performance measure suitable for the (very different) setting of reinforcement learning [Bubeck and Munos 2010; Busoniu and Munos 2012; Tolpin and Shimony 2012; Feldman and Domshlak 2012]. Our empirical evaluation on some standard MDP benchmarks for comparison between MCTS planning algorithms shows that BRUE not only provides superior performance guarantees, but is also very effective in practice and compares favorably to the state of the art. We then extend BRUE with a variant of "learning by forgetting." The resulting family of algorithms generalizes BRUE, improves the exponential factor in the upper bound on its reduction rate, and exhibits even more attractive empirical performance.

2 Monte-Carlo Planning
MCTS, a high-level scheme for Monte-Carlo tree search that gives rise to various specific algorithms for online MDP planning, is depicted in Figure 1. Starting with the current state s0, MCTS performs an iterative construction of a tree T rooted at s0. At each iteration, MCTS issues a state-space sample from s0, expands the tree using the outcome of that sample, and updates information stored at the nodes of T. Once the simulation phase is over, MCTS uses the information collected at the nodes of T to recommend an action to perform in s0. For compatibility of the notation with prior literature, in what follows we refer to the tree nodes via the states associated with these nodes. Note that, due to the Markovian nature of MDPs, there is no reason to distinguish between nodes associated with the same state at the same depth. Hence, the actual graph constructed by most instances of MCTS forms a DAG over such nodes. By A(s) in what follows, we refer to the subset of actions applicable in state s.
MCTS [input: state s0]
  create search tree T with root node s0
  while time permits:
    issue a state-space sample from s0; expand and update T
  return the recommended action for s0

Figure 1: A high-level scheme for Monte-Carlo tree search.
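A minimal executable rendering of this scheme is sketched below; the four component names (`sample`, `expand`, `update`, `recommend`) are placeholders for the parameters that concrete instances such as UCT fix, and their signatures are our assumptions:

```python
def mcts(root_state, time_permits, sample, expand, update, recommend):
    """High-level MCTS scheme: repeatedly sample, expand, update, then recommend.

    `sample`, `expand`, `update`, and `recommend` are the scheme's
    instance-specific components (hypothetical signatures).
    """
    tree = {root_state: {}}               # node -> per-action statistics
    while time_permits():
        trace = sample(tree, root_state)  # issue a state-space sample from the root
        expand(tree, trace)               # grow the tree with part of the trace
        update(tree, trace)               # refresh statistics along the trace
    return recommend(tree, root_state)
```

Concrete instances differ only in how these four components are filled in.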
Numerous concrete instances of MCTS have been proposed, with UCT [Kocsis and Szepesvári 2006] probably being the most popular such algorithm these days [Gelly and Silver 2011; Sturtevant 2008; Bjarnason, Fern, and Tadepalli 2009; Balla and Fern 2009; Eyerich, Keller, and Helmert 2010; Keller and Eyerich 2012a]. To give a concrete sense of MCTS's components, as well as to ground some intuitions discussed later on, below we describe the specific setting of MCTS corresponding to the core UCT algorithm; Figure 2 illustrates the tree construction, with n denoting the number of state-space samples.

Sampling: The samples are all issued from the root node s0. A sample ends either when a sink state is reached, that is, a state with no applicable actions, or when the horizon is exhausted. Each node/action pair (s, a) is associated with a counter n(s, a) and a value accumulator Q̂(s, a). Both n(s, a) and Q̂(s, a) are initialized to 0, and then updated by the update procedure. Given s, the next-on-the-sample action is selected according to the deterministic UCB1 policy [Auer, Cesa-Bianchi, and Fischer 2002a], originally proposed for optimal cumulative-regret minimization in stochastic multi-armed bandit (MAB) problems [Robbins 1952]: If n(s, a) > 0 for all a ∈ A(s), then
(1) argmax over a ∈ A(s) of [ Q̂(s, a) + √(2 ln n(s) / n(s, a)) ], where n(s) = Σ over a' ∈ A(s) of n(s, a').
Otherwise, the action is selected uniformly at random from the still unexplored actions. In both cases, the successor state is then sampled according to the conditional probability P(· | s, a) induced by the transition function.
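The UCB1 selection rule can be sketched as follows; the exploration constant `c` and the tie-breaking convention are illustrative choices, not prescribed by the paper:

```python
import math

def ucb1_action(actions, n, q, c=math.sqrt(2)):
    """Select an action by UCB1: untried actions first, then the maximizer of
    the empirical mean plus an exploration bonus (Eq. 1)."""
    untried = [a for a in actions if n[a] == 0]
    if untried:
        return untried[0]
    total = sum(n[a] for a in actions)   # n(s): total visits of the node
    return max(actions,
               key=lambda a: q[a] + c * math.sqrt(math.log(total) / n[a]))
```

The bonus term grows with the node's total visit count and shrinks with the action's own count, so under-sampled actions are periodically revisited.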

Expansion: Each state-space sample induces a state trace inside T, as well as a state trace outside of T. In principle, T can be expanded with any prefix of the latter trace; a popular choice in prior work appears to be expanding T with only the uppermost new node. (If T is constructed as a DAG, it is expanded with the first node along the sample that leaves T.)

Update: For each node s along the sample that is now part of the expanded tree T, the counter n(s, a) is incremented and the estimated value Q̂(s, a) is updated as
(2) Q̂(s, a) ← Q̂(s, a) + (r − Q̂(s, a)) / n(s, a),
where r is the reward accumulated from s onward along the sample.
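Eq. 2 is the standard running-average update; in code (the variable names are ours):

```python
def update_stats(n, q, s, a, reward):
    """Running-average update of Q-hat (Eq. 2): increments the visit counter
    and folds `reward` into the empirical mean for the pair (s, a)."""
    n[(s, a)] = n.get((s, a), 0) + 1
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + (reward - old) / n[(s, a)]
```

After k updates with rewards r1, …, rk, the accumulator holds exactly their arithmetic mean, without storing the individual rewards.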

Recommendation: Interestingly, the action recommendation protocol of UCT was never properly specified, and different applications of UCT adopt different decision rules, including maximization of the estimated value, maximization of the augmented estimated value as in Eq. 1, maximization of the number of times the action was selected during the simulations, as well as randomized protocols based on the information collected at the root.
The key property of UCT is that its exploration of the search space is obtained by considering a hierarchy of forecasters, each minimizing its own cumulative regret, that is, the loss of total reward incurred while exploring the environment [Auer, Cesa-Bianchi, and Fischer 2002a]. Each such pseudo-agent forecaster corresponds to a state/steps-to-go pair. In that respect, according to Theorem 6 of Kocsis and Szepesvári (2006), UCT asymptotically achieves the best possible (logarithmic) cumulative regret. However, as recently pointed out in numerous works [Bubeck and Munos 2010; Busoniu and Munos 2012; Tolpin and Shimony 2012; Feldman and Domshlak 2012], cumulative regret does not seem to be the right objective for online MDP planning, because the rewards "collected" during the simulation phase are fictitious. Furthermore, the work of Bubeck, Munos, and Stoltz (2011) on multi-armed bandits shows that minimizing cumulative regret and minimizing simple regret are somewhat competing objectives. Indeed, the same Theorem 6 of Kocsis and Szepesvári (2006) claims only a polynomial-rate reduction of the probability of choosing a non-optimal action, and the results of Bubeck et al. (2011) on simple-regret minimization in MABs with stochastic rewards imply that UCT achieves only polynomial-rate reduction of simple regret over time. Some attempts have recently been made to adapt UCT, and MCTS-based planning in general, to optimizing simple regret in online MDP planning directly, and some of these attempts were empirically rather successful [Tolpin and Shimony 2012; Hay, Shimony, Tolpin, and Russell 2012]. However, to the best of our knowledge, none of them breaks UCT's barrier of worst-case polynomial-rate reduction of simple regret over time.
3 Simple Regret Minimization in MDPs
We now show that exponential-rate reduction of simple regret in online MDP planning is achievable. To do so, we first motivate and introduce a family of algorithms with a two-phase scheme for generating state-space samples, and then describe a concrete algorithm from this family, BRUE, that (1) guarantees that the probability of recommending a non-optimal action asymptotically converges to zero at an exponential rate, and (2) achieves exponential-rate reduction of simple regret over time.
3.1 Exploratory concerns in online MDP planning
The work of Bubeck, Munos, and Stoltz (2011) on pure exploration in multi-armed bandit (MAB) problems was probably the first to stress that the minimal simple regret can increase as the bound on the cumulative regret decreases. At a high level, Bubeck et al. show that efficient schemes for simple-regret minimization in MABs should be as exploratory as possible, thus improving the expected quality of the recommendation issued at the end of the learning process. In particular, they showed that simple round-robin sampling of the MAB actions, followed by recommending the action with the highest empirical mean, yields exponential-rate reduction of simple regret, while the UCB1 strategy, which balances between exploration and exploitation, yields only polynomial-rate reduction of that measure. In that respect, the situation with MDPs is seemingly no different, and thus Monte-Carlo MDP planning should focus on exploration only. However, the answer to the question of what it means to be "as exploratory as possible" with MDPs is less straightforward than in the special case of MABs.
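The round-robin pure-exploration scheme can be sketched as follows; the Bernoulli arms in the test are an illustrative setup, and `budget >= num_arms` is assumed so that every arm gets sampled:

```python
import random

def uniform_exploration(pull, num_arms, budget, rng=None):
    """Pure exploration in a MAB: sample arms round-robin, then recommend
    the arm with the highest empirical mean."""
    rng = rng or random.Random(0)
    counts = [0] * num_arms
    sums = [0.0] * num_arms
    for t in range(budget):
        arm = t % num_arms              # round-robin sampling
        sums[arm] += pull(arm, rng)
        counts[arm] += 1
    means = [s / c for s, c in zip(sums, counts)]
    return max(range(num_arms), key=lambda a: means[a])
```

Note the contrast with UCB1: no exploitation at all during sampling, and all the "greediness" is deferred to the final recommendation.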
For an intuition as to why the "pure exploration dilemma" in MDPs is somewhat complicated, consider the state/steps-to-go pairs (s, h) as pseudo-agents, all acting on behalf of the root pseudo-agent (s0, H), which aims at minimizing its own simple regret in the stochastic MAB induced by its applicable actions A(s0). Clearly, if an oracle would provide the root pseudo-agent with an optimal action, then no further deliberation would be needed until after the execution of that action. However, the task characteristics of the root pseudo-agent are an exception rather than the rule. Suppose that an oracle provides us with optimal actions for all pseudo-agents but the root. Despite the richness of this information, the root in some sense remains as clueless as it was before: to choose between the actions in A(s0), it needs, at the very least, some ordinal information about the expected values of these alternatives. Hence, when sampling the futures, each non-root pseudo-agent should be devoted to two objectives:

identifying an optimal action, and

estimating the actual value of that action, because this information is needed by the pseudo-agent's predecessor(s) in the tree.
Note that both these objectives are exploratory, yet the problem is that they are somewhat competing. In that respect, the choices made by UCT actually make sense: each sample issued by UCT at a node is a priori devoted both to increasing the confidence that some current candidate action is indeed optimal, and to improving the estimate of that candidate's value, as if assuming that the candidate is optimal. However, while such overloading of the samples is unavoidable in the "learning while acting" setup of reinforcement learning, it need not be the case in online planning. Moreover, this sample overloading in UCT comes at a high price: as shown by Coquelin and Munos (2007), the number of samples after which the bounds of UCT on both simple and cumulative regret become meaningful might be as high as hyper-exponential in the horizon.
3.2 Separation of Concerns at the Extreme
Separating the two aforementioned exploratory concerns is at the focus of our investigation here. Let s0 be a state of an MDP with rewards in [0, 1], at most K applicable actions per state, at most N possible outcome states per action, and finite horizon H. First, to get a sense of what separation of exploratory concerns in online planning can buy us, we begin with a MAB perspective on MDPs, with each arm in the MAB corresponding to a "flat" policy of acting for H steps starting from the current state s0. A "flat" policy is a minimal partial mapping from state/steps-to-go pairs to actions that fully specifies an acting strategy in the MDP for H steps, starting at s0. Sampling such an arm is straightforward, as the policy prescribes precisely which action should be applied at every state that can possibly be encountered along its execution. The reward of such an arm is stochastic, with support [0, H], and the number of arms in this schematic MAB is exponential in the number of state/steps-to-go pairs reachable from s0.
Now, consider a simple algorithm that systematically samples each "flat" policy in a loop and updates the value estimate of the corresponding arm with the obtained reward. If stopped at iteration n, the algorithm recommends the first action of the arm/policy with the best empirical value. By the n-th iteration of this algorithm, each arm will have been sampled at least ⌊n/m⌋ times, where m is the number of arms. Therefore, using Hoeffding's inequality, the probability that the chosen arm is suboptimal in our MAB is bounded by
(3)
where Δ is the minimal optimality gap among the arms, and thus the expected simple regret can be bounded as
(4)
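The shape of the Hoeffding argument behind Eqs. 3–4 can be sketched as follows (a hedged reconstruction in our notation, with m denoting the number of arms, Δ the smallest optimality gap among the arms, and per-arm rewards having support [0, H]):

```latex
\Pr\big(\text{chosen arm suboptimal}\big)
  \;\le\; m\, \exp\!\Big(-\frac{\lfloor n/m \rfloor\, \Delta^2}{2 H^2}\Big),
\qquad
\mathbb{E}\big[\text{simple regret}\big]
  \;\le\; H\, m\, \exp\!\Big(-\frac{\lfloor n/m \rfloor\, \Delta^2}{2 H^2}\Big).
```

The second bound follows from the first because H trivially upper-bounds the simple regret of any recommendation.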
Note that this algorithm uses each sample to update the estimate of only a single policy. However, recalling that the arms in our MAB problem are actually compound policies, the same sample can in principle be used to update the estimates of all the policies that are consistent with it, in the sense that wherever both policies are defined along the sample, they prescribe the same action. The resulting algorithm generates samples by choosing the actions along them uniformly at random, and uses the outcome of each sample to update all the policies consistent with it. Note that the arms now cannot be sampled systematically as before, because the set of policies updated at each iteration is stochastic.
Since the sampling is uniform, the probability of any given policy being updated by the sample issued at any iteration is at least K^(−H). For an arm/policy π, let T_n(π) denote the number of samples issued in the first n iterations that are consistent with π. The probability that the best empirical arm after n iterations is suboptimal is bounded by
(5)
Each of the two terms on the right-hand side can be bounded as
(6)
where both bounds are by the Hoeffding inequality. In turn, similarly to Eq. 4, the simple regret of this algorithm is bounded by
(7)
Since H is a trivial upper bound on the simple regret, the bound in Eq. 7 becomes effective only once it drops below H, that is, for
(8)
Note that this transition-period length is still much better than that of UCT, which is hyper-exponential in the horizon. Moreover, unlike in UCT, the rate of the simple-regret reduction beyond that point is exponential in the number of iterations.
3.3 Two-Phase Sampling and BRUE
While both the simple-regret convergence rate and the transition-period length of this uniform-sampling scheme are more attractive than those of UCT, this in itself is not of much help: the scheme requires explicit reasoning about an enormous number of arms, and thus cannot be efficiently implemented. However, it does show the promise of separation of concerns in online planning. We now introduce a family of MCTS algorithms, referred to as MCTS2e, that allows this promise to be utilized to a large extent.
The instances of the MCTS2e family vary along four parameters: the switching-point function, the exploration policy, the estimation policy, and the update policy. With respect to these four parameters, the MCTS2e components are as follows.

Similarly to UCT, each node/action pair (s, a) is associated with variables n(s, a) and Q̂(s, a). However, while the counters n(s, a) are initialized to 0, the value accumulators Q̂(s, a) are schematically initialized to −∞.

Sampling: Each iteration of MCTS2e corresponds to a single state-space sample of the MDP, and these samples are all issued from the root node s0. A sample ends either when a sink state is reached or when the horizon is exhausted. The generation of the sample at iteration i is done in two phases: the actions at the first σ(i) steps of the sample are selected according to the exploration policy of the algorithm, while the actions at the remaining steps are selected according to its estimation policy.

Expansion: T is expanded with the suffix of the sample's state sequence that is new to T.

Recommendation: The recommended action is chosen uniformly at random among the actions maximizing Q̂(s0, a).
In what follows, for d ∈ {1, …, H}, the i-th iteration of MCTS2e will be called a d-iteration if σ(i) = d. At a high level, the two phases of sample generation respectively target the two exploratory objectives of online MDP planning: while the sample prefixes aim at exploring the options, the sample suffixes aim at improving the value estimates for the current candidates for the optimal action. In particular, this separation allows us to introduce a specific instance, BRUE (short for Best Recommendation with Uniform Exploration; the name is carried over from our first presentation of the algorithm in [Feldman and Domshlak 2012], where "estimation" was referred to as "recommendation"), that is tailored to simple-regret minimization. The setting of BRUE is described below, and Figure 3 illustrates its dynamics.

The switching-point function is
(9)
that is, the depth of exploration is chosen by a round-robin over {1, …, H}, in reverse order.

At state s, the exploration policy samples an action uniformly at random from A(s), while the estimation policy samples an action uniformly at random, but only among the actions maximizing Q̂(s, a).

Update: For a sample issued at a d-iteration, only the state/action pair immediately preceding the switching state along the sample is updated. That is, the information obtained by the second phase of the sample is used only for improving the estimate at that pair, and is not pushed further up the sample. While this may appear wasteful and even counterintuitive, this locality of update is required to satisfy the formal guarantees of BRUE discussed below.
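Putting the pieces together, one BRUE iteration can be sketched as follows. This is a simplified sketch under stated assumptions: estimates are indexed by state only (the paper indexes them by state and steps-to-go), `actions`/`step` form an assumed generative-model interface, and `H - (i % H)` is our rendering of the reverse round-robin switching depth of Eq. 9:

```python
import random

def brue_iteration(i, s0, H, actions, step, n, q, rng):
    """One BRUE sample: uniform exploration down to the switching depth,
    greedy estimation below it, and an update only at the state/action
    pair immediately preceding the switching state."""
    sigma = H - (i % H)        # switching depth: round-robin in reverse order
    state, switch, tail_reward = s0, None, 0.0
    for depth in range(1, H + 1):
        acts = actions(state)
        if not acts:           # sink state: the sample ends
            break
        if depth <= sigma:     # exploration phase: uniform action choice
            a = rng.choice(acts)
        else:                  # estimation phase: greedy on current estimates
            best = max(q.get((state, b), 0.0) for b in acts)
            a = rng.choice([b for b in acts if q.get((state, b), 0.0) == best])
        nxt, r = step(state, a)
        if depth == sigma:     # the pair immediately preceding the switch
            switch = (state, a)
        if depth >= sigma:     # reward accumulated by the estimation suffix
            tail_reward += r
        state = nxt
    if switch is not None:     # local update at the switching pair only
        n[switch] = n.get(switch, 0) + 1
        old = q.get(switch, 0.0)
        q[switch] = old + (tail_reward - old) / n[switch]
```

The design point to notice is that the exploration prefix contributes no value updates at all; only the switching pair absorbs the estimation suffix's reward.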
Before we proceed with the formal analysis of BRUE, a few comments on it, as well as on the MCTS2e sampling scheme in general, are in order. First, the MCTS2e template is rather general, and some of its parametrizations will not even guarantee convergence to the optimal action. This, for instance, will be the case under a (seemingly minor) modification of BRUE to a purely uniform estimation policy. In short, MCTS2e should be parametrized with care. Second, while in what follows we focus on BRUE, other instances of MCTS2e may turn out to be empirically effective as well with respect to the reduction of simple regret over time. Some of them, similarly to BRUE, may also guarantee exponential-rate reduction of simple regret over time. Hence, we clearly cannot, and do not, claim any uniqueness of BRUE in that respect. Finally, some other families of MCTS algorithms, more sophisticated than MCTS2e, can give rise to even more (formally and/or empirically) efficient optimizers of simple regret. The set of algorithms that we discuss later on is one such example.
4 Upper Bounds on Simple Regret Reduction Rate with BRUE
For the sake of simplicity, in our formal analysis of BRUE we assume uniqueness of the optimal policy; that is, at each state and each number of steps-to-go, there is a single optimal action. Let T_n be the graph obtained by BRUE after n iterations, and let Q̂_n(s, a, h) denote the accumulated value for action a at state s with h steps-to-go. For all state/steps-to-go pairs, the recommendation strategy is randomized, uniformly choosing among the actions maximizing Q̂_n. We also use some additional auxiliary notation.

K = max over states s of |A(s)|, i.e., the maximal number of actions per state.

p0 = min { P(s′ | s, a) : P(s′ | s, a) > 0 }, i.e., the likelihood of the least likely (but still possible) outcome of an action in our problem.

Δ, i.e., the smallest difference between the value of an optimal action and that of a second-best action at a state with just one step-to-go.
Our key result on the BRUE algorithm is Theorem 1 below. The proof of Theorem 1, as well as of several required auxiliary claims, is given in Appendix A. Here we outline only the key issues addressed by the proof, and provide a high-level flow of the proof in terms of a few central auxiliary claims.
Theorem 1
Let BRUE be called on a state s0 of an MDP with rewards in [0, 1] and finite horizon H. There exist pairs of parameters, dependent only on the MDP, such that, after n iterations of BRUE, we have simple regret bounded as
(10) 
and choice-error probability bounded as
(11) 
In particular, these bounds hold for
(12) 
and
(13) 
Before we proceed any further, some discussion of the statements in Theorem 1 is in order. First, the parameters in the bounds established by Theorem 1 are problem-dependent: in addition to the dependence on the horizon H and the choice branching factor K (which is unavoidable), the parameters also depend on the distribution parameters p0 and Δ. While it is possible that this dependence can be partly alleviated, Bubeck, Munos, and Stoltz (2011) showed that distribution-free exponential bounds on the simple-regret reduction rate cannot be achieved even in MABs, that is, even in single-step-to-go MDPs (see Remark 2 of that work, which is based on a lower bound on cumulative regret established by Auer, Cesa-Bianchi, Freund, and Schapire (2002)). Second, the specific parameters provided by Eqs. 12 and 13 are worst-case for MDPs with parameters K, p0, and Δ, and the bound in Eq. 10 becomes effective after
iterations, for some small constant. While there is still some gap between this transition-period length and the transition-period length of the theoretical algorithm of Section 3.2 (see Eq. 8), this gap is not that large. (Some of this gap can probably be eliminated by more accurate bounding in the numerous bounding steps towards the proof of Theorem 1; however, all such improvements we tried made the already lengthy proof of Theorem 1 even more involved.)
The proof of Lemma 2 below constitutes the crux of the proof of Theorem 1. Once we have proven this lemma, the proof of Theorem 1 stems from it in a more-or-less direct manner.
Lemma 2
Let BRUE be called on a state s0 of an MDP with rewards in [0, 1] and finite horizon H. For each h ∈ {1, …, H}, there exist parameters, dependent only on the MDP, such that, for each state s reachable from s0 in H − h steps and any n, it holds that
(14) 
In particular, these bounds hold for
(15) 
and
(16) 
The proof of Lemma 2 is by induction on the number of steps-to-go h. Starting with the induction basis for h = 1, it is easy to verify that, by the Chernoff-Hoeffding inequality,
(17)
that is, the assertion is satisfied with the corresponding pair of parameters. Now, assuming the claim holds for h, below we outline the proof for h + 1, relegating the actual proof in full detail to Appendix A.
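For intuition on the base case of Eq. 17: with rewards in [0, 1], the one-step estimate is an empirical mean of i.i.d. samples, so the standard Chernoff-Hoeffding bound gives (a reconstruction in our notation, with t the number of samples of the pair (s, a) at one step-to-go):

```latex
\Pr\big(\big|\hat{Q}_t(s, a, 1) - Q(s, a, 1)\big| \ge \epsilon\big)
  \;\le\; 2\, \exp\!\big(-2\, t\, \epsilon^2\big).
```

It is precisely the i.i.d. structure used here that fails deeper in the tree, as points (F1) and (F2) below explain.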
In the proof of the inductive step, it is crucial to note that applying the Chernoff-Hoeffding bound directly, as was done in Eq. 17, is invalid. There are two reasons for this.
 (F1)

With one step-to-go, the estimate is an unbiased estimator of the true action value. In contrast, the estimates deeper inside the tree (at nodes with more than one step-to-go) are biased. This bias stems from the estimates possibly being based on numerous suboptimal choices in the subtree rooted at the corresponding node.
 (F2)

With one step-to-go, the summands accumulated by the estimator are independent. This is not so with more steps-to-go, where the accumulated reward depends on the selection of actions at subsequent nodes, which in turn depends on previous rewards.
However, we show that these deficiencies of the estimates can still be overcome, through a novel modification of the seminal Hoeffding-Azuma inequality.
Lemma 3 (Modified HoeffdingAzuma inequality)
Let X1, X2, …, Xn be a sequence of random variables with bounded support. If the conditional expectations of the Xi satisfy
(18)
for some parameters, then, for all ε > 0, it holds that
(19)  
(20) 
Together with Lemma 4 below, the inequalities provided by Lemma 3 allow us to prove the inductive step in the proof of the central Lemma 2. Note that the specific bound in Lemma 3 is selected so as to maximize the exponent coefficient. The probabilities of interest in Eqs. 19–20 can also be bounded in other ways; for further details, we refer the reader to Discussion 14 in Appendix A.
Definition 1
Let M be an MDP with rewards in [0, 1], planned for initial state s0 and finite horizon H. Let s be a state reachable from s0 with h steps still to go, let a be an action applicable in s, and let π be the policy induced by running BRUE on s0 until exactly t samples have finished their exploration phase by applying action a at s with h steps still to go. Given that,

X is a random variable, corresponding to the reward obtained by taking a at s, and then following π for the remaining steps.

E is the event in which the sample is issued along the optimal actions at each of the choice points delegated to π.

Lemma 4
Together with the modified version of the Hoeffding-Azuma bound in Lemma 3, the bounds established in Lemma 4 allow us to derive concentration bounds around the true action values, as in Lemma 5 below, which serves as the key building block for proving the inductive step in the proof of Lemma 2.
Lemma 5
Let BRUE be called on a state s0 of an MDP with rewards in [0, 1] and finite horizon H. For each state s reachable with h steps still to go, each action a applicable in s, and any t, it holds that
(23) 
5 Learning With Forgetting and BRUE(α)
When we consider the evolution of the action-value estimates in BRUE over time (as in all other Monte-Carlo algorithms for online MDP planning), we can see that, in internal nodes, these estimates are based on biased samples, where the bias stems from the selection of non-optimal actions at descendant nodes. This bias tends to shrink as more samples are accumulated down the tree. Consequently, the estimates become more accurate, the probability of selecting an optimal action increases accordingly, and the bias at the ancestor nodes shrinks in turn. An interesting question in this context is: shouldn't samples obtained at different stages of the sampling process be weighted differently? Intuition tells us that biased samples still provide valuable information, especially when they are all we have, but the value of this information decreases as we obtain more and more accurate samples. Hence, in principle, putting more weight on samples with smaller bias could increase the accuracy of our estimates. The key question, of course, is which of all the possible weighting schemes are both reasonable to employ and preserve the exponential-rate reduction of expected simple regret.
Here we describe BRUE(α), an algorithm that generalizes BRUE by basing the estimates only on a fraction of the most recent samples. We discuss the value of this addition both from the perspective of the formal guarantees and from the perspective of empirical prospects. BRUE(α) differs from BRUE in two points:

In addition to the variables n(s, a) and Q̂(s, a), each node/action pair in T is associated with a list of the rewards collected at each of the samples that are responsible for the current estimate Q̂(s, a).

When a sample is issued at a d-iteration and updates the variables at the switching pair, that update is done not according to Eq. 2 as in BRUE, but according to:
(24)
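Since Eq. 24 is not reproduced above, the following sketch only illustrates the stated idea of basing the estimate on a fraction α of the most recent rewards; it is our rendering, not necessarily the paper's exact update rule:

```python
import math

def forgetting_estimate(rewards, alpha):
    """Estimate a value from only the ceil(alpha * n) most recent rewards,
    discarding the older (typically more biased) samples."""
    n = len(rewards)
    if n == 0:
        return 0.0
    keep = max(1, math.ceil(alpha * n))   # always keep at least one sample
    recent = rewards[-keep:]
    return sum(recent) / len(recent)
```

With alpha = 1 this degenerates to the plain empirical mean of BRUE; smaller alpha forgets early, more heavily biased samples faster.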
Theorem 6
Let BRUE(α) be called on a state s0 of an MDP with rewards in [0, 1] and finite horizon H. There exist pairs of parameters, dependent only on the MDP, such that, after n iterations of BRUE(α), we have simple regret bounded as
(25) 
and choice-error probability bounded as
(26) 
The proof of Theorem 6 follows from Lemma 7 below, similarly to the way Theorem 1 follows from Lemma 2. Note that in Theorem 6 we do not provide explicit expressions for the constants as we did in Theorem 1 for BRUE. This is because the expressions that can be extracted from the recursive formulas in this case do not bring much insight. However, we discuss the potential benefits of choosing α in the context of our proof of Theorem 6.
Lemma 7
Let BRUE(α) be called on a state s0 of an MDP with rewards in [0, 1] and finite horizon H. For each h ∈ {1, …, H}, there exist parameters, dependent only on the MDP, such that, for each state s reachable from s0 in H − h steps and any n, it holds that
(27) 
The proof of Lemma 7 is by induction, following the same lines as the proof of Lemma 2. In fact, it deviates from the latter only in the application of the modified Hoeffding-Azuma inequality, which has to be further modified to capture partial sums as in BRUE(α).
Lemma 8 (Modified HoeffdingAzuma inequality for partial sums)
Let X1, X2, …, Xn be a sequence of random variables with bounded support. If the conditional expectations of the Xi satisfy
(28)
for some parameters, then, for all ε > 0, it holds that
(29)  
(30) 
Considering the benefits of "sample forgetting" as in BRUE(α), let us compare the bound in Lemma 8 to the bound