# Simple Regret Optimization in Online Planning for Markov Decision Processes

We consider online planning in Markov decision processes (MDPs). In online planning, the agent focuses on its current state only, deliberates about the set of possible policies from that state onwards and, when interrupted, uses the outcome of that exploratory deliberation to choose what action to perform next. The performance of algorithms for online planning is assessed in terms of simple regret, which is the agent's expected performance loss when the chosen action, rather than an optimal one, is followed. To date, state-of-the-art algorithms for online planning in general MDPs are either best effort, or guarantee only polynomial-rate reduction of simple regret over time. Here we introduce a new Monte-Carlo tree search algorithm, BRUE, that guarantees exponential-rate reduction of simple regret and error probability. This algorithm is based on a simple yet non-standard state-space sampling scheme, MCTS2e, in which different parts of each sample are dedicated to different exploratory objectives. Our empirical evaluation shows that BRUE not only provides superior performance guarantees, but is also very effective in practice and favorably compares to state-of-the-art. We then extend BRUE with a variant of "learning by forgetting." The resulting set of algorithms, BRUE(alpha), generalizes BRUE, improves the exponential factor in the upper bound on its reduction rate, and exhibits even more attractive empirical performance.


## 1 Introduction

Markov decision processes (MDPs) are a standard model for planning under uncertainty [Puterman, 1994]. An MDP is defined by a set of possible agent states $S$, a set of agent actions $A$, a stochastic transition function, and a reward function $R$. Depending on the problem domain and the representation language, the description of the MDP can be either declarative or generative (or mixed). In any case, the description of the MDP is assumed to be concise. While declarative models provide the agents with greater algorithmic flexibility, generative models are more expressive, and both types of models allow for simulated execution of all feasible action sequences, from any state of the MDP. The current state of the agent is fully observable, and the objective of the agent is to act so as to maximize its accumulated reward. In the finite horizon setting that will be used for most of the paper, the reward is accumulated over some predefined number of steps $H$.

The desire to handle MDPs with state spaces of size exponential in the size of the model description has led researchers to consider online planning in MDPs. In online planning, the agent, rather than computing a quality policy for the entire MDP before taking any action, focuses only on what action to perform next. The decision process consists of a deliberation phase, aka planning, terminated either according to a predefined schedule or due to an external interrupt, and followed by a recommended action for the current state. Once that action is applied in the real environment, the decision process is repeated from the obtained state to select the next action and so on.
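The deliberate-act-repeat cycle described above can be sketched as follows. This is only an illustration: the callables `actions`, `step`, and `plan`, as well as the `budget` parameter, are hypothetical names introduced here and do not come from the paper.

```python
from typing import Callable, Hashable, List, Tuple

State = Hashable
Action = Hashable


def online_planning_episode(
    s0: State,
    horizon: int,
    actions: Callable[[State], List[Action]],
    step: Callable[[State, Action], Tuple[State, float]],
    plan: Callable[[State, int, int], Action],
    budget: int,
) -> float:
    """Run one episode of online planning and return the accumulated reward.

    `plan(s, steps_to_go, budget)` stands for any planner that deliberates
    about state `s` only and recommends a single action (assumed interface).
    `step(s, a)` applies the action in the real environment.
    """
    s, total = s0, 0.0
    for steps_to_go in range(horizon, 0, -1):
        if not actions(s):                  # sink state: no applicable action
            break
        a = plan(s, steps_to_go, budget)    # deliberation phase
        s, r = step(s, a)                   # act, then observe the new state
        total += r
    return total
```

The planner is consulted afresh at every visited state, so the quality of each single recommendation is what matters, which is what simple regret measures.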

The quality of the action $a$, recommended for state $s$ with $H$ steps-to-go, is assessed in terms of the probability that $a$ is sub-optimal, and in terms of the (closely related) measure of simple regret $\Delta_H[s,a]$. The latter captures the performance loss that results from taking $a$ and then following an optimal policy $\pi^*$ for the remaining $H-1$ steps, instead of following $\pi^*$ from the beginning [Bubeck and Munos, 2010]. That is,

$$\Delta_H[s,a] = Q_H(s,\pi^*(s,H)) - Q_H(s,a),$$

where

$$Q_H(s,a) = \mathbb{E}_{s'}\left[R(s,a,s') + Q_{H-1}(s',\pi^*(s',H-1))\right].$$
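For a small, explicitly enumerated MDP, the two definitions above can be evaluated exactly by backward induction. The sketch below assumes a hypothetical tabular encoding (`P[s][a]` is a list of `(probability, next_state, reward)` triples) and takes $Q_0 \equiv 0$; neither is prescribed by the paper, which targets MDPs far too large for such enumeration.

```python
from typing import Dict, List, Tuple

# Hypothetical tabular model: P[s][a] = [(prob, next_state, reward), ...].
Model = Dict[str, Dict[str, List[Tuple[float, str, float]]]]


def q_values(P: Model, H: int) -> List[Dict[str, Dict[str, float]]]:
    """Return Q with Q[h][s][a] = optimal value of taking a at s, h steps to go."""
    V = {s: 0.0 for s in P}                       # value with 0 steps to go
    Q: List[Dict[str, Dict[str, float]]] = [{s: {} for s in P}]
    for _ in range(H):
        Qh = {
            s: {a: sum(p * (r + V[s2]) for p, s2, r in outcomes)
                for a, outcomes in P[s].items()}
            for s in P
        }
        # Optimal continuation: the best action value (0 at sink states).
        V = {s: max(Qh[s].values(), default=0.0) for s in P}
        Q.append(Qh)
    return Q


def simple_regret(Q, s: str, a: str, H: int) -> float:
    """Delta_H[s, a]: loss of taking a first and acting optimally afterwards."""
    return max(Q[H][s].values()) - Q[H][s][a]


# Toy example: "safe" is optimal at s0 for H = 2.
toy: Model = {
    "s0": {"safe": [(1.0, "good", 0.5)],
           "risky": [(0.5, "good", 1.0), (0.5, "bad", 0.0)]},
    "good": {"stay": [(1.0, "good", 1.0)]},
    "bad": {"stay": [(1.0, "bad", 0.0)]},
}
Q = q_values(toy, 2)
```

In this toy model, `simple_regret(Q, "s0", "risky", 2)` evaluates to `0.5`, while the regret of `"safe"` is `0.0`.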

With a few recent exceptions developed for declarative MDPs [Bonet and Geffner, 2012; Kolobov et al., 2012; Busoniu and Munos, 2012], most algorithms for online MDP planning constitute variants of what is called Monte-Carlo tree search (MCTS). One of the earliest and best-known MCTS algorithms for MDPs is the sparse sampling algorithm by Kearns, Mansour, and Ng [Kearns et al., 1999]. Sparse sampling offers a near-optimal action selection in discounted MDPs by constructing a sampled lookahead tree in time exponential in the discount factor and the suboptimality bound, but independent of the state space size. However, if terminated before an action has proved to be near-optimal, sparse sampling offers no quality guarantees on its action selection. Thus it does not really fit the setup of online planning. Several later works introduced interruptible, anytime MCTS algorithms for MDPs, with UCT [Kocsis and Szepesvári, 2006] probably being the most widely used such algorithm these days. Anytime MCTS algorithms are designed to provide convergence to the best action if enough time is given for deliberation, as well as a gradual reduction of performance loss over the deliberation time [Sutton and Barto, 1998; Péret and Garcia, 2004; Kocsis and Szepesvári, 2006; Coquelin and Munos, 2007; Cazenave, 2009; Rosin, 2011; Tolpin and Shimony, 2012]. While UCT and its successors have been devised specifically for MDPs, some of these algorithms are also successfully used in partially observable and adversarial settings [Gelly and Silver, 2011; Sturtevant, 2008; Bjarnason et al., 2009; Balla and Fern, 2009; Eyerich et al., 2010].

In general, the relative empirical attractiveness of the various MCTS planning algorithms depends on the specifics of the problem at hand and cannot usually be predicted ahead of time. When it comes to formal guarantees on the expected performance improvement over the planning time, very few of these algorithms provide such guarantees for general MDPs, and none breaks the barrier of the worst-case only polynomial-rate reduction of simple regret and choice-error probability over time.

This is precisely our contribution here. We introduce a new Monte-Carlo tree search algorithm, BRUE, that guarantees exponential-rate reduction of both simple regret and choice-error probability over time, for general MDPs over finite state spaces. The algorithm is based on a simple and efficiently implementable sampling scheme, MCTS2e, in which different parts of each sample are dedicated to different competing exploratory objectives. The motivation for this objective decoupling came from a recently growing understanding that the current MCTS algorithms for MDPs do not optimize the reduction of simple regret directly, but only via optimizing what is called cumulative regret, a performance measure suitable for the (very different) setting of reinforcement learning [Bubeck and Munos, 2010; Busoniu and Munos, 2012; Tolpin and Shimony, 2012; Feldman and Domshlak, 2012]. Our empirical evaluation on some standard MDP benchmarks for comparison between MCTS planning algorithms shows that BRUE not only provides superior performance guarantees, but is also very effective in practice and compares favorably with the state of the art. We then extend BRUE with a variant of “learning by forgetting.” The resulting family of algorithms, BRUE(α), generalizes BRUE, improves the exponential factor in the upper bound on its reduction rate, and exhibits even more attractive empirical performance.

## 2 Monte-Carlo Planning

MCTS, a high-level scheme for Monte-Carlo tree search that gives rise to various specific algorithms for online MDP planning, is depicted in Figure 1. Starting with the current state $s_0$, MCTS performs an iterative construction of a search tree rooted at $s_0$. At each iteration, MCTS issues a state-space sample from $s_0$, expands the tree using the outcome of that sample, and updates information stored at the nodes of the tree. Once the simulation phase is over, MCTS uses the information collected at the nodes of the tree to recommend an action to perform in $s_0$. For compatibility of the notation with prior literature, in what follows we refer to the tree nodes via the states associated with these nodes. Note that, due to the Markovian nature of MDPs, it is unreasonable to distinguish between nodes associated with the same state at the same depth. Hence, the actual graph constructed by most instances of MCTS forms a DAG over state/depth nodes. By $A(s) \subseteq A$ in what follows, we refer to the subset of actions applicable in state $s$.

Numerous concrete instances of MCTS have been proposed, with UCT [Kocsis and Szepesvári, 2006] probably being the most popular such algorithm these days [Gelly and Silver, 2011; Sturtevant, 2008; Bjarnason et al., 2009; Balla and Fern, 2009; Eyerich et al., 2010; Keller and Eyerich, 2012a]. To give a concrete sense of the components of MCTS, as well as to ground some intuitions discussed later on, below we describe the specific setting of MCTS corresponding to the core UCT algorithm, and Figure 2 illustrates the tree construction, with $n$ denoting the number of state-space samples.

• Sampling: The samples $\rho = \langle s_0, a_1, s_1, \ldots, a_k, s_k \rangle$ are all issued from the root node $s_0$. The sample ends either when a sink state is reached, that is, $A(s_k) = \emptyset$, or when $k = H$. Each node/action pair $(s,a)$ is associated with a counter $n(s,a)$ and a value accumulator $\hat{Q}(s,a)$. Both $n(s,a)$ and $\hat{Q}(s,a)$ are initialized to $0$, and then updated by the update procedure. Given $s_i$, the next-on-the-sample action $a_{i+1}$ is selected according to the deterministic UCB1 policy [Auer et al., 2002a], originally proposed for optimal cumulative regret minimization in stochastic multi-armed bandit (MAB) problems [Robbins, 1952]: If $n(s_i,a) > 0$ for all $a \in A(s_i)$, then

$$a_{i+1} = \operatorname*{argmax}_{a}\left[\hat{Q}(s_i,a) + c\sqrt{\frac{\log n(s_i)}{n(s_i,a)}}\right], \tag{1}$$

where $n(s) = \sum_{a} n(s,a)$. Otherwise, $a_{i+1}$ is selected uniformly at random from the still unexplored actions $\{a \in A(s_i) \mid n(s_i,a) = 0\}$. In both cases, $s_{i+1}$ is then sampled according to the conditional probability distribution over the outcomes of $a_{i+1}$ in $s_i$, induced by the transition function.

• Tree expansion: Each state-space sample $\rho$ induces a state trace inside the tree, as well as a state trace outside of the tree. In principle, the tree can be expanded with any prefix of the latter trace; a popular choice in prior work appears to be expanding the tree with only its upper-most node. (If the tree is constructed as a DAG, it is expanded with the first node along $\rho$ that leaves the tree.)

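• Update: For each node $s_i$ along $\rho$ that is now part of the expanded tree, the counter $n(s_i,a_{i+1})$ is incremented and the estimated $Q$-value is updated as

$$\hat{Q}(s_i,a_{i+1}) \leftarrow \hat{Q}(s_i,a_{i+1}) + \frac{R_i - \hat{Q}(s_i,a_{i+1})}{n(s_i,a_{i+1})}, \tag{2}$$

where $R_i$ is the reward accumulated along $\rho$ from $s_i$ onwards.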

• Recommendation: Interestingly, the action recommendation protocol of UCT was never properly specified, and different applications of UCT adopt different decision rules, including maximization of the estimated $Q$-value, of the augmented estimated $Q$-value as in Eq. 1, of the number of times the action was selected during the simulation, as well as randomized protocols based on the information collected at the root.
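The selection rule of Eq. 1 and the incremental update of Eq. 2 can be sketched together as below. The class name `UctStats`, the default exploration constant `c = sqrt(2)`, and the flat `(state, action)` keying are assumptions made for illustration; the text above does not fix the value of $c$.

```python
import math
import random
from collections import defaultdict


class UctStats:
    """Per-node statistics with UCB1 selection (Eq. 1) and mean update (Eq. 2)."""

    def __init__(self, c: float = math.sqrt(2), rng: random.Random = None):
        self.c = c                          # exploration constant (assumed value)
        self.rng = rng or random.Random(0)
        self.n = defaultdict(int)           # counters n(s, a), initialized to 0
        self.q = defaultdict(float)         # accumulators Q-hat(s, a), initialized to 0

    def select(self, s, actions):
        # Unexplored actions are tried first, uniformly at random.
        untried = [a for a in actions if self.n[(s, a)] == 0]
        if untried:
            return self.rng.choice(untried)
        n_s = sum(self.n[(s, a)] for a in actions)
        return max(
            actions,
            key=lambda a: self.q[(s, a)]
            + self.c * math.sqrt(math.log(n_s) / self.n[(s, a)]),
        )

    def update(self, s, a, reward_to_go: float) -> None:
        # Incremental mean: Q <- Q + (R - Q) / n.
        self.n[(s, a)] += 1
        self.q[(s, a)] += (reward_to_go - self.q[(s, a)]) / self.n[(s, a)]
```

Note that the bonus term keeps pulling samples toward rarely tried actions only at a logarithmic rate, which is the source of the exploitation bias discussed next.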

The key property of UCT is that its exploration of the search space is obtained by considering a hierarchy of forecasters, each minimizing its own cumulative regret, that is, the loss of the total reward incurred by exploring the environment [Auer et al., 2002a]. Each such pseudo-agent forecaster corresponds to a state/steps-to-go pair $(s,h)$. In that respect, according to Theorem 6 of Kocsis and Szepesvári (2006), UCT asymptotically achieves the best possible (logarithmic) cumulative regret. However, as recently pointed out in numerous works [Bubeck and Munos, 2010; Busoniu and Munos, 2012; Tolpin and Shimony, 2012; Feldman and Domshlak, 2012], cumulative regret does not seem to be the right objective for online MDP planning, and this is because the rewards “collected” at the simulation phase are fictitious. Furthermore, the work of Bubeck, Munos, and Stoltz (2011) on multi-armed bandits shows that minimizing cumulative regret and minimizing simple regret are somewhat competing objectives. Indeed, the same Theorem 6 of Kocsis and Szepesvári (2006) claims only a polynomial-rate reduction of the probability of choosing a non-optimal action, and the results of Bubeck et al. (2011) on simple regret minimization in MABs with stochastic rewards imply that UCT achieves only polynomial-rate reduction of the simple regret over time. Some attempts have recently been made to adapt UCT, and MCTS-based planning in general, to optimizing simple regret in online MDP planning directly, and some of these attempts were empirically rather successful [Tolpin and Shimony, 2012; Hay et al., 2012]. However, to the best of our knowledge, none of them breaks UCT’s barrier of the worst-case polynomial-rate reduction of simple regret over time.

## 3 Simple Regret Minimization in MDPs

We now show that exponential-rate reduction of simple regret in online MDP planning is achievable. To do so, we first motivate and introduce a family of algorithms, MCTS2e, with a two-phase scheme for generating state-space samples, and then describe a concrete algorithm from this family, BRUE, that (1) guarantees that the probability of recommending a non-optimal action asymptotically converges to zero at an exponential rate, and (2) achieves exponential-rate reduction of simple regret over time.

### 3.1 Exploratory concerns in online MDP planning

The work of Bubeck, Munos, and Stoltz (2011) on pure exploration in multi-armed bandit (MAB) problems was probably the first to stress that the minimal simple regret can increase as the bound on the cumulative regret decreases. At a high level, Bubeck et al. (2011) show that efficient schemes for simple regret minimization in MAB should be as exploratory as possible, thus improving the expected quality of the recommendation issued at the end of the learning process. In particular, they showed that the simple round-robin sampling of MAB actions, followed by recommending the action with the highest empirical mean, yields exponential-rate reduction of simple regret, while the UCB1 strategy that balances between exploration and exploitation yields only polynomial-rate reduction of that measure. In that respect, the situation with MDPs is seemingly no different, and thus Monte-Carlo MDP planning should focus on exploration only. However, the answer to the question of what it means to be “as exploratory as possible” with MDPs is less straightforward than it is in the special case of MABs.

For an intuition as to why the “pure exploration dilemma” in MDPs is somewhat complicated, consider the state/steps-to-go pairs $(s,h)$ as pseudo-agents, all acting on behalf of the root pseudo-agent $(s_0,H)$ that aims at minimizing its own simple regret in a stochastic MAB induced by the applicable actions $A(s_0)$. Clearly, if an oracle would provide $(s_0,H)$ with an optimal action $\pi^*(s_0,H)$, then no further deliberation would be needed until after the execution of $\pi^*(s_0,H)$. However, the task characteristics of $(s_0,H)$ are an exception rather than a rule. Suppose that an oracle provides us with optimal actions for all pseudo-agents $(s,h)$ but $(s_0,H)$. Despite the richness of this information, $(s_0,H)$ in some sense remains as clueless as it was before: To choose between the actions in $A(s_0)$, $(s_0,H)$ needs, at the very least, some ordinal information about the expected value of these alternatives. Hence, when sampling the futures, each non-root pseudo-agent $(s,h)$ should be devoted to two objectives:

1. identifying an optimal action $\pi^*(s,h)$, and

2. estimating the actual value of that action, because this information is needed by the predecessor(s) of $(s,h)$ in the tree.

Note that both these objectives are exploratory, yet the problem is that they are somewhat competing. In that respect, the choices made by UCT actually make sense: Each sample issued by UCT at $(s,h)$ is a priori devoted both to increasing the confidence that some current candidate $a$ for $\pi^*(s,h)$ is indeed $\pi^*(s,h)$, as well as to improving the estimate of $Q_h(s,a)$, while acting as if $\pi^*(s,h) = a$. However, while such an overloading of the samples is unavoidable in the “learning while acting” setup of reinforcement learning, this should not necessarily be the case in online planning. Moreover, this sample overloading in UCT comes with a high price: As was shown by Coquelin and Munos (2007), the number of samples after which the bounds of UCT on both simple and cumulative regret become meaningful might be as high as hyper-exponential in $H$.

### 3.2 Separation of Concerns at the Extreme

Separating the two aforementioned exploratory concerns is the focus of our investigation here. Let $s_0$ be a state of an MDP with rewards in $[0,1]$, $K$ applicable actions at each state, $B$ possible outcome states for each action, and finite horizon $H$. First, to get a sense of what separation of exploratory concerns in online planning can buy us, we begin with a MAB perspective on MDPs, with each arm in the MAB corresponding to a “flat” policy of acting for $H$ steps starting from the current state $s_0$. A “flat” policy $\pi$ is a minimal partial mapping from state/steps-to-go pairs to actions that fully specifies an acting strategy in the MDP for $H$ steps, starting at $s_0$. Sampling such an arm is straightforward as $\pi$ prescribes precisely which action should be applied at every state that can possibly be encountered along the execution of $\pi$. The reward of such an arm is stochastic, with support $[0,H]$, and the number of arms in this schematic MAB is $K^{B^H}$.

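Now, consider a simple algorithm that systematically samples each “flat” policy in a round-robin loop, and updates the estimation of the corresponding arm with the obtained reward. If stopped at iteration $n$, the algorithm recommends $\pi(s_0,H)$, where $\pi$ is the arm/policy with the best empirical value $\hat{\mu}_{\pi,n}$. By iteration $n$ of this algorithm, each arm will be sampled at least $\lfloor n/K^{B^H} \rfloor$ times. Therefore, using Hoeffding’s inequality, the probability that the chosen arm $\pi$ is sub-optimal in our MAB is bounded by

$$P\left\{\hat{\mu}_{\pi,n} > \hat{\mu}_{\pi^*,n}\right\} = P\left\{\hat{\mu}_{\pi,n} - \hat{\mu}_{\pi^*,n} - (-\Delta_\pi) \ge \Delta_\pi\right\} \le \exp\left(-\frac{\left\lfloor n/K^{B^H}\right\rfloor \Delta_\pi^2}{2H^2}\right), \tag{3}$$

where $\Delta_\pi = \mu_{\pi^*} - \mu_\pi$, and thus the expected simple regret can be bounded as

$$\mathbb{E}\, r_n \le H K^{B^H} \exp\left(-\frac{\left\lfloor n/K^{B^H}\right\rfloor d^2}{2H^2}\right). \tag{4}$$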

Note that the round-robin algorithm uses each sample $\rho = \langle s_0, a_1, s_1, \ldots, a_H, s_H \rangle$ to update the estimation of only a single policy $\pi$. However, recalling that arms in our MAB problem are actually compound policies, the same sample can in principle be used to update the estimates of all policies $\pi'$ that are consistent with $\rho$ in the sense that, for $0 \le i \le H-1$, $\pi'(s_i, H-i)$ is defined and it is defined as $a_{i+1}$. The resulting uniform-sampling algorithm generates samples by choosing the actions along them uniformly at random, and uses the outcome of each sample to update all the policies consistent with it. Note that sampling the arms in the uniform-sampling algorithm cannot be done systematically as in the round-robin algorithm because the set of policies updated at each iteration is stochastic.

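Since the sampling is uniform, the probability of any policy to be updated by the sample issued at any iteration of the uniform-sampling algorithm is $1/K^H$. For an arm $\pi$, let $N_{\pi,n}$ denote the number of samples issued at the first $n$ iterations that are consistent with the policy $\pi$. The probability that $\pi$, the best empirical arm after $n$ iterations, is sub-optimal is bounded by

$$P\left\{\hat{\mu}_{\pi,n} > \hat{\mu}_{\pi^*,n}\right\} \le P\left\{\hat{\mu}_{\pi,n} - \mu_\pi \ge \frac{\Delta_\pi}{2}\right\} + P\left\{\mu_{\pi^*} - \hat{\mu}_{\pi^*,n} \ge \frac{\Delta_\pi}{2}\right\}. \tag{5}$$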

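Each of the two terms on the right-hand side can be bounded as:

$$\begin{aligned}
P\left\{\hat{\mu}_{\pi,n} - \mu_\pi \ge \frac{\Delta_\pi}{2}\right\}
&\le P\left\{N_{\pi,n} \le \frac{n}{2K^H}\right\} + P\left\{N_{\pi,n} > \frac{n}{2K^H},\ \hat{\mu}_{\pi,n} - \mu_\pi \ge \frac{\Delta_\pi}{2}\right\} \\
&\overset{(\dagger)}{\le} e^{-\frac{n}{2K^{2H}}} + \sum_{i=\frac{n}{2K^H}+1}^{n} P\left\{N_{\pi,n} = i\right\} P\left\{\hat{\mu}_{\pi,n} - \mu_\pi \ge \frac{\Delta_\pi}{2} \,\middle|\, N_{\pi,n} = i\right\} \\
&\le e^{-\frac{n}{2K^{2H}}} + P\left\{\hat{\mu}_{\pi,n} - \mu_\pi \ge \frac{\Delta_\pi}{2} \,\middle|\, N_{\pi,n} = \frac{n}{2K^H}+1\right\} \sum_{i=\frac{n}{2K^H}+1}^{n} P\left\{N_{\pi,n} = i\right\} \\
&\le e^{-\frac{n}{2K^{2H}}} + P\left\{\hat{\mu}_{\pi,n} - \mu_\pi \ge \frac{\Delta_\pi}{2} \,\middle|\, N_{\pi,n} = \frac{n}{2K^H}+1\right\} \\
&\overset{(\ddagger)}{\le} e^{-\frac{n}{2K^{2H}}} + e^{-\frac{n\Delta_\pi^2}{4K^H H^2}} \\
&\le 2e^{-\frac{n\Delta_\pi^2}{4K^{2H}H^2}},
\end{aligned} \tag{6}$$

where $(\dagger)$ and $(\ddagger)$ are by the Hoeffding inequality. In turn, similarly to Eq. 4, the simple regret for the uniform-sampling algorithm is bounded by

$$\mathbb{E}\, r_n \le 4HK^{B^H} e^{-\frac{n d^2}{4K^{2H}H^2}}. \tag{7}$$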

Since $H$ is a trivial upper bound on $\mathbb{E}\, r_n$, the bound in Eq. 7 becomes effective only when its right-hand side falls to about $H$, that is, roughly for

$$n > (K^2B)^H \cdot 4\left(\frac{H}{d}\right)^2 \log K. \tag{8}$$

Note that this transition period length is still much better than that of UCT, which is hyper-exponential in $H$. Moreover, unlike in UCT, the rate of the simple regret reduction is then exponential in the number of iterations.
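To get a feel for the magnitudes involved, the bound of Eq. 7 and the transition period of Eq. 8 can be evaluated numerically. The sketch below reads Eq. 7 as $4HK^{B^H}\exp(-nd^2/(4K^{2H}H^2))$ and Eq. 8 as $(K^2B)^H \cdot 4(H/d)^2 \log K$ with a natural logarithm; it works in log space because $K^{B^H}$ overflows quickly. The function names are introduced here for illustration only.

```python
import math


def log_regret_bound(n: float, K: int, B: int, H: int, d: float) -> float:
    """Natural log of the right-hand side of Eq. 7 (as read above)."""
    return (math.log(4 * H)
            + (B ** H) * math.log(K)
            - n * d * d / (4 * K ** (2 * H) * H * H))


def transition_period(K: int, B: int, H: int, d: float) -> float:
    """Right-hand side of Eq. 8 (as read above)."""
    return (K * K * B) ** H * 4 * (H / d) ** 2 * math.log(K)


# Tiny illustrative instance (values chosen arbitrarily).
K, B, H, d = 2, 2, 2, 0.5
n0 = transition_period(K, B, H, d)
```

Under this reading, the bound evaluated exactly at the Eq. 8 threshold equals $4H$, a constant factor above the trivial bound $H$, and it decays exponentially in $n$ from there on.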

### 3.3 Two-phase sampling and BRUE

While both the simple regret convergence rate and the length of the transition period of the uniform-sampling algorithm are more attractive than those of UCT, this in itself is not much of a help: the uniform-sampling algorithm requires explicit reasoning about $K^{B^H}$ arms, and thus it cannot be efficiently implemented. However, it does show the promise of separation of concerns in online planning. We now introduce a family of algorithms, referred to as MCTS2e, that allows utilizing this promise to a large extent.

The instances of the MCTS2e family vary along four parameters: switching point function $\sigma$, exploration policy, estimation policy, and update policy. With respect to these four parameters, the components in MCTS2e are as follows.

• Similarly to , each node/action pair is associated with variables and . However, while counters are initialized to , value accumulators are schematically initialized to .

• Sampling: Each iteration of MCTS2e corresponds to a single state-space sample $\rho = \langle s_0, a_1, s_1, \ldots, a_k, s_k \rangle$ of the MDP, and these samples are all issued from the root node $s_0$. The sample ends either when a sink state is reached, that is, $A(s_k) = \emptyset$, or when $k = H$. The generation of $\rho$ is done in two phases: At iteration $n$, the actions at states $s_0, \ldots, s_{\sigma(n)-1}$ are selected according to the exploration policy of the algorithm, while the actions at states $s_{\sigma(n)}, \ldots, s_{k-1}$ are selected according to its estimation policy.

• Tree expansion: The tree is expanded with the suffix of the state sequence of $\rho$ that is new to the tree.

• Update: For each state $s_i$ along $\rho$, the update policy of the algorithm prescribes whether it should be updated. If $s_i$ should be updated, then the counter $n(s_i,a_{i+1})$ is incremented and the estimated $Q$-value $\hat{Q}(s_i,a_{i+1})$ is updated according to Eq. 2.

• Recommendation: The recommended action is chosen uniformly at random among the actions $a$ maximizing $\hat{Q}(s_0,a)$.

In what follows, for $h \in \{1, \ldots, H\}$, the $n$-th iteration of MCTS2e will be called an $h$-iteration if $\sigma(n) = h$. At a high level, the two phases of sample generation respectively target the two exploratory objectives of online MDP planning: While the sample prefixes aim at exploring the options, the sample suffixes aim at improving the value estimates for the current candidates for $\pi^*$. In particular, this separation allows us to introduce a specific MCTS2e instance, BRUE, that is tailored to simple regret minimization. (BRUE is short for Best Recommendation with Uniform Exploration; the name is carried over from our first presentation of the algorithm in [Feldman and Domshlak, 2012], where “estimation” was referred to as “recommendation.”) The setting of BRUE is described below, and Figure 3 illustrates its dynamics.

• The switching point function is

$$\sigma(n) = H - ((n-1) \bmod H), \tag{9}$$

that is, the depth of exploration is chosen by a round-robin on $\{1, \ldots, H\}$, in reverse order.

• At state $s$, the exploration policy samples an action uniformly at random, while the estimation policy samples an action uniformly at random, but only among the actions $a$ that maximize $\hat{Q}(s,a)$.

• For a sample $\rho$ issued at iteration $n$, only the state/action pair $(s_{\sigma(n)-1}, a_{\sigma(n)})$ immediately preceding the switching state $s_{\sigma(n)}$ along $\rho$ is updated. That is, the information obtained by the second phase of $\rho$ is used only for improving the estimate at state $s_{\sigma(n)-1}$, and is not pushed further up the sample. While that may appear wasteful and even counterintuitive, this locality of update is required to satisfy the formal guarantees of BRUE discussed below.
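The three ingredients above (the switching function of Eq. 9, uniform exploration followed by greedy estimation, and the single local update) can be put together in a compact sketch. It is written under stated assumptions rather than taken from the paper: the generative model is exposed through hypothetical callables `actions(s)` and `step(s, a, rng)`, statistics are keyed by `(depth, state, action)`, and never-updated pairs are treated as having value minus infinity when the estimation policy picks a maximizer.

```python
import random
from collections import defaultdict


def sigma(n: int, H: int) -> int:
    """Switching point of Eq. 9: H, H-1, ..., 1, H, H-1, ..."""
    return H - ((n - 1) % H)


def brue(s0, H, actions, step, iterations, seed=0):
    """Sketch of BRUE; returns the recommended action for s0.

    actions(s) -> list of applicable actions (empty at sink states).
    step(s, a, rng) -> (next_state, reward), a generative model (assumed API).
    """
    rng = random.Random(seed)
    count = defaultdict(int)   # n(depth, s, a)
    q = {}                     # Q-hat(depth, s, a); a missing key means "never updated"

    def greedy(depth, s):
        # Estimation policy: uniform among maximizers of Q-hat.
        acts = actions(s)
        vals = {a: q.get((depth, s, a), float("-inf")) for a in acts}
        best = max(vals.values())
        return rng.choice([a for a in acts if vals[a] == best])

    for n in range(1, iterations + 1):
        switch = sigma(n, H)
        s, trace, rewards = s0, [], []
        for i in range(H):
            if not actions(s):
                break
            # Exploration policy before the switching point, estimation after it.
            a = rng.choice(actions(s)) if i < switch else greedy(i, s)
            s_next, r = step(s, a, rng)
            trace.append((i, s, a))
            rewards.append(r)
            s = s_next
        if len(trace) >= switch:
            # Update only the pair immediately preceding the switching state.
            x = trace[switch - 1]
            ret = sum(rewards[switch - 1:])
            count[x] += 1
            old = q.get(x, 0.0)
            q[x] = old + (ret - old) / count[x]
    return greedy(0, s0)
```

Each iteration touches exactly one estimate, so every accumulated return at a node is generated by the then-current greedy choices below it rather than by exploratory ones.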

Before we proceed with the formal analysis of BRUE, a few comments on it, as well as on the MCTS2e sampling scheme in general, are in order. First, the template of MCTS2e is rather general, and some of its parametrizations will not even guarantee convergence to the optimal action. This, for instance, will be the case with a (seemingly minor) modification of BRUE to a purely uniform estimation policy. In short, MCTS2e should be parametrized with care. Second, while in what follows we focus on BRUE, other instances of MCTS2e may appear to be empirically effective as well with respect to the reduction of simple regret over time. Some of them, similarly to BRUE, may also guarantee exponential-rate reduction of simple regret over time. Hence, we clearly cannot, and do not, claim any uniqueness of BRUE in that respect. Finally, some other families of MCTS algorithms, more sophisticated than MCTS2e, can give rise to even more (formally and/or empirically) efficient optimizers of simple regret. The BRUE(α) set of algorithms that we discuss later on is one such example.

## 4 Upper Bounds on Simple Regret Reduction Rate with BRUE

For the sake of simplicity, in our formal analysis of BRUE we assume uniqueness of the optimal policy $\pi^*$; that is, at each state $s$ and each number $h$ of steps-to-go, there is a single optimal action, and it is $\pi^*(s,h)$. Considering the graph obtained by BRUE after $n$ iterations, let $\hat{Q}_h(s,a)$ denote the accumulated value for action $a$ at state $s$ with $h$ steps-to-go. For all state/steps-to-go pairs $(s,h)$, $\pi^B_n(s,h)$ is a randomized strategy, uniformly choosing among actions maximizing $\hat{Q}_h(s,a)$. We also use some additional auxiliary notation.

• $K = \max_s |A(s)|$, i.e., the maximal number of actions per state.

• $p$, i.e., the likelihood of the least likely (but still possible) outcome of an action in our problem.

• $d$, i.e., the smallest difference between the value of the optimal and a second-best action at a state with just one step-to-go.

Our key result on the BRUE algorithm is Theorem 1 below. The proof of Theorem 1, as well as of several required auxiliary claims, is given in Appendix A. Here we outline only the key issues addressed by the proof, and provide a high-level flow of the proof in terms of a few central auxiliary claims.

###### Theorem 1

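Let BRUE be called on a state $s_0$ of an MDP with rewards in $[0,1]$ and finite horizon $H$. There exist pairs of parameters $c, c' > 0$, dependent only on $\{K, p, d, H\}$, such that, after $n$ iterations of BRUE, we have simple regret bounded as

$$\mathbb{E}\,\Delta_H[s_0, \pi^B_n(s_0,H)] \le Hc \cdot e^{-c'n}, \tag{10}$$

and choice-error probability bounded as

$$P\left\{\pi^B_n(s_0,H) \ne \pi^*(s_0,H)\right\} \le c \cdot e^{-c'n}. \tag{11}$$

In particular, these bounds hold for

$$c = \frac{4K^{3H^2-2H}\,(H!)^3 \prod_{h=1}^{H-1}(h!)^4 \cdot 24^{H-1} \cdot 16^{(H-1)^2}}{d^{\,2H^2-4H+2}\; p^{\,3H^2-3H}}, \tag{12}$$

and

$$c' = \frac{3d^{\,2H-2}\, p^{\,2H-1}}{2H \cdot 16^{H-1}\,(H!)^2\, K^{2H}}. \tag{13}$$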

Before we proceed any further, some discussion of the statements in Theorem 1 is in order. First, the parameters $c$ and $c'$ in the bounds established by Theorem 1 are problem-dependent: in addition to the dependence on the horizon $H$ and the choice branching factor $K$ (which is unavoidable), the parameters $c$ and $c'$ also depend on the distribution parameters $p$ and $d$. While it is possible that this dependence can be partly alleviated, Bubeck, Munos, and Stoltz (2011) showed that distribution-free exponential bounds on the simple regret reduction rate cannot be achieved even in MABs, that is, even in single-step-to-go MDPs (see Remark 2 of Bubeck et al., 2011, which is based on a lower bound on the cumulative regret established by Auer, Cesa-Bianchi, Freund, and Schapire, 2002). Second, the specific parameters $c$ and $c'$ provided by Eqs. 12 and 13 are worst-case for MDPs with parameters $K$, $p$, and $d$, and the bound in Eq. 10 becomes effective after

$$n > \frac{\ln(c)}{c'} = O\left[\left(\frac{KH}{pd}\right)^{\varepsilon H^2}\right]$$

iterations, for some small constant $\varepsilon$. While there is still some gap between this transition period length and the transition period length of the theoretical uniform-sampling algorithm of Section 3.2 (see Eq. 8), this gap is not that large. (Some of this gap can probably be eliminated by more accurate bounding in the numerous bounding steps towards the proof of Theorem 1. However, all such improvements we tried made the already lengthy proof of Theorem 1 even more involved.)

The proof of Lemma 2 below constitutes the crux of the proof of Theorem 1. Once we have proven this lemma, the proof of Theorem 1 stems from it in a more-or-less direct manner.

###### Lemma 2

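Let BRUE be called on a state $s_0$ of an MDP with rewards in $[0,1]$ and finite horizon $H$. For each $h \in \{1, \ldots, H\}$, there exist parameters $c_h, c'_h > 0$, dependent only on $\{K, p, d, H\}$, such that, for each state $s$ reachable from $s_0$ in $H-h$ steps and any $t > 0$, it holds that

$$\begin{aligned}
P\left\{\hat{Q}_h(s,a) - Q_h(s,a) \ge \frac{d}{2} \,\middle|\, n_h(s,a) = t\right\} &\le c_h e^{-c'_h t}, \\
P\left\{\hat{Q}_h(s,a) - Q_h(s,a) \le -\frac{d}{2} \,\middle|\, n_h(s,a) = t\right\} &\le c_h e^{-c'_h t}.
\end{aligned} \tag{14}$$

In particular, these bounds hold for

$$c_h = \frac{K^{2Hh+h^2-2H-1}\,(h!)^3 \prod_{i=1}^{h-1}(i!)^4 \cdot 24^{h-1} \cdot 16^{(h-1)^2}}{d^{\,2(h-1)^2} \cdot p^{\,2Hh+h^2-2H-h}}, \tag{15}$$

and

$$c'_h = \frac{3d^{\,2(h-1)}\, p^{\,H+h-1}}{16^{h-1}\,(h!)^2\, K^{H+h-1}}. \tag{16}$$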

The proof for Lemma 2 is by induction on $h$. Starting with the induction basis for $h = 1$, it is easy to verify that, by the Chernoff-Hoeffding inequality,

$$P\left\{\left|\hat{Q}_1(s,a) - Q_1(s,a)\right| \ge \frac{d}{2} \,\middle|\, n(s,a) = t\right\} \le 2e^{-\frac{d^2}{2}t}, \tag{17}$$

that is, the assertion is satisfied for $h = 1$ with constants that can be read off the right-hand side of Eq. 17. Now, assuming the claim holds for $h \ge 1$, below we outline the proof for $h+1$, relegating the actual proof in full detail to Appendix A.
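As a quick numerical illustration of the base case, the bound of Eq. 17, read as $2e^{-d^2 t/2}$, can be inverted to get the number of updates $t$ after which a one-step-to-go estimate is within $d/2$ of its true value with a prescribed confidence. The helper names below are introduced here for illustration.

```python
import math


def hoeffding_two_sided(d: float, t: int) -> float:
    """Right-hand side of Eq. 17 (as read above), capped at the trivial bound 1."""
    return min(1.0, 2.0 * math.exp(-d * d * t / 2.0))


def samples_for(d: float, eps: float) -> int:
    """Smallest integer t for which 2 * exp(-d^2 * t / 2) <= eps."""
    return math.ceil(2.0 * math.log(2.0 / eps) / (d * d))
```

For instance, with a gap of `d = 0.1`, driving the error probability below `0.01` takes `samples_for(0.1, 0.01)` updates, which is on the order of a thousand; the requirement grows as $1/d^2$.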

In the proof for $h+1$, it is crucial to note the invalidity of applying the Chernoff-Hoeffding bound directly, as was done in Eq. 17. There are two reasons for this.

(F1) For $h = 1$, $\hat{Q}_1(s,a)$ is an unbiased estimator of $Q_1(s,a)$, that is, $\mathbb{E}\,\hat{Q}_1(s,a) = Q_1(s,a)$. In contrast, the estimates inside the tree (at nodes with $h > 1$) are biased. This bias stems from $\hat{Q}_h(s,a)$ possibly being based on numerous sub-optimal choices in the sub-tree rooted in $s$.

(F2) For $h = 1$, the summands accumulated by $\hat{Q}_1(s,a)$ are independent. This is not so for $h > 1$, where the accumulated reward depends on the selection of actions in subsequent nodes, which in turn depends on previous rewards.

However, we show that these deficiencies can still be overcome through a novel modification of the seminal Hoeffding-Azuma inequality.

###### Lemma 3 (Modified Hoeffding-Azuma inequality)

Let

be a sequence of random variables with support

and . If , and

 P{E[Xi|X1,…,Xi−1]≠μ}≤cpe−cei, (18)

for some and , then, for all , it holds that

 P{t∑i=1Xi≥μt+tδ} ≤ [1+cp2h2δ2c2e]⋅e−3δ2ce2h2t, (19) P{t∑i=1Xi≤μt−tδ} ≤ [1+cp2h2δ2c2e]⋅e−3δ2ce2h2t. (20)

Together with Lemma 4 below, the inequalities provided by Lemma 3 allow us to prove the induction hypothesis in the proof of the central Lemma 2. Note that the specific bound in Lemma 3 is selected so to maximize the exponent coefficient. For any , the probabilities of interest in Eqs. 19-20 can also be bounded by

 [1+cpce(1−β)e−ce(1−β)2h2]e−3δ2ceβ2h2t;

for further details, we refer the reader to Discussion 14 in Appendix A.

###### Definition 1

Let be an MDP with rewards in , planned for initial state and finite horizon . Let be a state reachable from with steps still to go, let be an action applicable in , and let be a policy induced by running on until exactly samples have finished their exploration phase with applying action at with steps still to go. Given that,

• is a random variable, corresponding to the reward obtained by taking at , and then following for the remaining steps.

• is the event in which is sampled along the optimal actions at each of the choice points delegated to .

###### Lemma 4

Let be an MDP with rewards in , planned for initial state and finite horizon . Let be a state reachable from with steps still to go, and be an action applicable in . Considering and as in Definition 1, for any , if Lemma 2 holds for horizon , then

$$P\left\{\neg E_{t,h+1}(s,a)\right\} \le 2Kh(2+c_h)\, e^{-\frac{p c'_h}{6K} t}, \tag{21}$$

$$\delta_{t,h+1}(s,a) \le 2Kh^2(2+c_h)\, e^{-\frac{p c'_h}{6K} t}. \tag{22}$$

Together with a modified version of the Hoeffding-Azuma bound in Lemma 3, the bounds established in Lemma 4 allow us to derive concentration bounds for around as in Lemma 5 below, which serves the key building block for proving the induction hypothesis in the proof of Lemma 2.

###### Lemma 5

Let be called on a state of an MDP with rewards in and finite horizon . For each state reachable with steps still to go, each action applicable, and any , it holds that

 P{∣∣ˆQh+1(s,a)−Qh+1(s,a)∣∣≥d2∣∣∣nh+1(s,a)=t}≤(3456⋅K3(h+1)3chd2p2c′2h)e−d2pc′h16(h+1)2K. (23)

## 5 Learning With Forgetting and BRUE(α)

When we consider the evolution of action value estimates in BRUE over time (as well as in all other Monte-Carlo algorithms for online MDP planning), we can see that, in internal nodes, these estimates are based on biased samples that stem from the selection of non-optimal actions at descendant nodes. This bias tends to shrink as more samples are accumulated down the tree. Consequently, the estimates become more accurate, the probability of selecting an optimal action increases accordingly, and the bias of ancestor nodes shrinks in turn. An interesting question in this context is: shouldn’t we weigh samples obtained at different stages of the sampling process differently? Intuition tells us that biased samples still provide us with valuable information, especially when they are all we have, but the value of this information decreases as we obtain more and more accurate samples. Hence, in principle, putting more weight on samples with smaller bias could increase the accuracy of our estimates. The key question, of course, is which of all possible weighting schemes are both reasonable to employ and preserve the exponential-rate reduction of expected simple regret.

Here we describe BRUE(α), an algorithm that generalizes BRUE by basing the estimates only on the $\alpha$ fraction of most recent samples. We discuss the value of this addition both from the perspective of the formal guarantees and from the perspective of empirical prospects. BRUE(α) differs from BRUE in two points:

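• In addition to the variables $n(s,a)$ and $\hat{Q}(s,a)$, each node/action pair $(s,a)$ in BRUE(α) is associated with a list $L(s,a)$ of rewards, collected at each of the samples that are responsible for the current estimate $\hat{Q}(s,a)$.

• When a sample $\rho$ is issued at iteration $n$, and BRUE(α) updates the variables at $x = (s_{\sigma(n)-1}, a_{\sigma(n)})$, that update is done not according to Eq. 2 as in BRUE, but according to:

$$\begin{aligned}
n(x) &\leftarrow n(x) + 1, \\
L(x)[n(x)] &\leftarrow \sum_{i=\sigma(n)-1}^{k-1} R(s_i, a_{i+1}, s_{i+1}), \\
\hat{Q}(x) &\leftarrow \frac{1}{\lceil \alpha \cdot n(x) \rceil} \sum_{i=n(x)-\lceil \alpha \cdot n(x) \rceil}^{n(x)} L(x)[i].
\end{aligned} \tag{24}$$
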
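The forgetting update of Eq. 24 amounts to keeping the per-pair reward list and averaging only its most recent part. The sketch below is an illustration with a hypothetical class name; it averages exactly the $\lceil \alpha \cdot n(x) \rceil$ most recent entries, which is an interpretation of the summation range in Eq. 24 chosen so that the number of summands matches the normalizer.

```python
import math


class ForgettingEstimate:
    """Q-hat for one node/action pair, based on the alpha fraction of newest samples."""

    def __init__(self, alpha: float):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must lie in (0, 1]")
        self.alpha = alpha
        self.rewards = []      # the list L(x), oldest entry first
        self.q = 0.0

    def update(self, reward_to_go: float) -> float:
        self.rewards.append(reward_to_go)
        n = len(self.rewards)                    # n(x) after the increment
        # Window size ceil(alpha * n); float rounding can enlarge it by one
        # when alpha * n is meant to be an exact integer.
        window = math.ceil(self.alpha * n)
        self.q = sum(self.rewards[-window:]) / window
        return self.q
```

With `alpha = 1.0` this reduces to the plain running mean of Eq. 2, whereas smaller values discard the oldest, typically most biased, returns.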
###### Theorem 6

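Let BRUE(α) be called on a state $s_0$ of an MDP with rewards in $[0,1]$ and finite horizon $H$. There exist pairs of parameters $c, c' > 0$, dependent only on $\{\alpha, K, p, d, H\}$, such that, after $n$ iterations of BRUE(α), we have simple regret bounded as

$$\mathbb{E}\,\Delta_H[s_0, \pi^B_n(s_0,H)] \le Hc \cdot e^{-c'n}, \tag{25}$$

and choice-error probability bounded as

$$P\left\{\pi^B_n(s_0,H) \ne \pi^*(s_0,H)\right\} \le c \cdot e^{-c'n}. \tag{26}$$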

The proof for Theorem 6 follows from Lemma 7 below similarly to the way Theorem 1 follows from Lemma 2. Note that in Theorem 6 we do not provide explicit expressions for the constants $c$ and $c'$ as we did in Theorem 1 (for BRUE). This is because the expressions that can be extracted from the recursive formulas in this case do not bring much insight. However, we discuss the potential benefits of choosing $\alpha < 1$ in the context of our proof of Theorem 6.

###### Lemma 7

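Let BRUE(α) be called on a state $s_0$ of an MDP with rewards in $[0,1]$ and finite horizon $H$. For each $h \in \{1, \ldots, H\}$, there exist parameters $c_h, c'_h > 0$, dependent only on $\{\alpha, K, p, d, H\}$, such that, for each state $s$ reachable from $s_0$ in $H-h$ steps and any $t > 0$, it holds that

$$\begin{aligned}
P\left\{\hat{Q}_h(s,a) - Q_h(s,a) \ge \frac{d}{2} \,\middle|\, n_h(s,a) = t\right\} &\le c_h e^{-c'_h t}, \\
P\left\{\hat{Q}_h(s,a) - Q_h(s,a) \le -\frac{d}{2} \,\middle|\, n_h(s,a) = t\right\} &\le c_h e^{-c'_h t}.
\end{aligned} \tag{27}$$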

The proof for Lemma 7 is by induction, following the same line as the proof for Lemma 2. In fact, it deviates from the latter only in the application of the modified Hoeffding-Azuma inequality, which has to be further modified to capture the partial sums used by BRUE(α).

###### Lemma 8 (Modified Hoeffding-Azuma inequality for partial sums)

Let be a sequence of random variables with support and . If , and

 P{E[Xi|X1,…,Xi−1]≠μ}≤cpe−cei, (28)

for some and , then, for all , it holds that

 P⎧⎨⎩t∑i=t−⌈αt⌉Xi≥μt+tδ⎫⎬⎭ ≤ [1+cpce(1−α)e−ce(1−α)2t]e−3δ2ce2h2αt, (29) P⎧⎨⎩t∑i=t−⌈αt⌉Xi≤μt−tδ⎫⎬⎭ ≤ [1+cpce(1−α)e−ce(1−α)2t]e−3δ2ce2h2αt. (30)

Considering the benefits of “sample forgetting” as in , let us compare the bound in Lemma 8 to the bound

 e