1 Introduction
In a Markov Decision Process (MDP), an agent observes its current state $s$ from a state space $S$ and picks an action $a$ from an action space $A$, before transitioning to a next state $s'$ drawn from a transition kernel $P(s' \mid s, a)$ and receiving a bounded reward $r \in [0, 1]$ drawn from a reward kernel $P(r \mid s, a)$. The agent must act so as to optimise its expected cumulative discounted reward $\mathbb{E}\left[\sum_t \gamma^t r_t\right]$, also called expected return, where $\gamma \in [0, 1)$ is the discount factor. In Online Planning [14], we do not assume that the transition and reward kernels are known, as in Dynamic Programming [1]; we only assume access to the MDP through a generative model (e.g. a simulator) which yields samples of the next state and reward when queried. Finally, we consider a fixed-budget setting where the generative model can only be called a maximum number of times, called the budget $n$.
Monte-Carlo Tree Search (MCTS) algorithms were historically motivated by the application of computer Go, and made a first appearance in the CrazyStone software [8]. They were later reformulated in the setting of Multi-Armed Bandits by [12] with their Upper Confidence bounds applied to Trees (UCT) algorithm. Despite its popularity, UCT has been shown to suffer from several limitations: its sample complexity can be at least doubly-exponential for some problems (e.g. when a narrow optimal path is hidden in a suboptimal branch), which is much worse than uniform planning [7]. The Sparse Sampling algorithm of [11] achieves better worst-case performance, but it is still non-polynomial and does not adapt to the structure of the MDP. In stark contrast, the Optimistic Planning for Deterministic systems (OPD) algorithm considered by [10] in the case of deterministic transitions and rewards exploits the structure of the cumulative discounted reward to achieve a problem-dependent polynomial bound on sample complexity. A similar line of work in a deterministic setting is that of SOOP and OPC by [3, 4], though they focus on continuous action spaces. OPD was later extended to stochastic systems with the Open-Loop Optimistic Planning (OLOP) algorithm introduced by [2] in the open-loop setting: we only consider sequences of actions independently of the states that they lead to. This restriction in the space of policies causes a loss of optimality, but greatly simplifies the planning problem in the cases where the state space is large or infinite. More recent work such as StOP [15] and TrailBlazer [9] focuses on the probably approximately correct (PAC) framework: rather than simply recommending an action to maximise the expected rewards, they return an approximation of the value at the root that holds with high probability. This highly demanding framework puts a severe strain on these algorithms, which were developed for theoretical analysis only and cannot be applied to real problems.
Contributions
The goal of this paper is to study the practical performance of OLOP when applied to numerical problems. Indeed, OLOP was introduced along with a theoretical sample complexity analysis, but no experiment was carried out. Our contribution is threefold:

First, we show that in our experiments OLOP is overly pessimistic, especially in the low-budget regime, and we provide an intuitive explanation by casting light on an unintended effect that alters the behaviour of OLOP.

Second, we circumvent this issue by leveraging modern tools from the bandits literature to design and analyse a modified version with tighter upper-confidence bounds, called KL-OLOP. We show that we retain the asymptotic regret bounds of OLOP while improving its performance by an order of magnitude in numerical experiments.

Third, we provide a time and memory efficient implementation of OLOP and KL-OLOP, bringing an exponential speedup that allows these algorithms to scale to high sample budgets.
The paper is structured as follows: in section 2, we present OLOP, give some intuition on its limitations, and introduce KL-OLOP, whose sample complexity is further analysed in section 3. In section 4, we propose an efficient implementation of the two algorithms. Finally, in section 6, we evaluate them in several numerical experiments.
1.0.1 Notations
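Throughout the paper, we follow the notations from [2] and use the standard notations over alphabets: a finite word $a \in A^*$ of length $h$ represents a sequence of actions $(a_1, \dots, a_h) \in A^h$. Its prefix of length $t \le h$ is denoted $a_{1:t} = (a_1, \dots, a_t) \in A^t$. $A^\infty$ denotes the set of infinite sequences of actions. Two finite sequences $a \in A^*$ and $b \in A^*$ can be concatenated as $ab \in A^*$, the sets of finite and infinite suffixes of $a$ are respectively $aA^* = \{c \in A^* : \exists b \in A^* \text{ such that } c = ab\}$ and $aA^\infty$ defined likewise, and the empty sequence is $\emptyset$.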
During the planning process, the agent iteratively selects sequences of actions until it reaches the allowed budget of $n$ actions. More precisely, at time $t$ during the $m$-th sequence, the agent played $a^m_{1:t} = a^m_1 \cdots a^m_t \in A^t$ and receives a reward $Y^m_t$. We denote the probability distribution of this reward as $\nu(a^m_{1:t}) = P(Y^m_t \mid s_t, a^m_t)$, and its mean as $\mu(a^m_{1:t})$, where $s_t$ is the current state.
After this exploration phase, the agent selects an action $a(n)$ so as to minimise the simple regret $r_n = V - V(a(n))$, where $V = V(\emptyset)$ and $V(a)$ refers to the value of a sequence of actions $a \in A^h$, that is, the maximum expected discounted cumulative reward one may obtain after executing $a$:
$$V(a) = \sup_{b \in aA^\infty} \sum_{t=1}^{\infty} \gamma^t \mu(b_{1:t}) \tag{1}$$
2 Kullback-Leibler Open-Loop Optimistic Planning
In this section we present KL-OLOP, a combination of the OLOP algorithm of [2] with the tighter Kullback-Leibler upper confidence bounds from [5]. We first frame both algorithms in a common structure before specifying their implementations.
2.1 General structure
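First, following OLOP, the total sample budget $n$ is split into $M$ trajectories of length $L$ in the following way:
$$M \text{ is the largest integer such that } M \left\lceil \frac{\log M}{2 \log 1/\gamma} \right\rceil \le n, \qquad L = \left\lceil \frac{\log M}{2 \log 1/\gamma} \right\rceil.$$
The lookahead tree of depth $L$ is denoted $\mathcal{T} = \bigcup_{h=0}^{L} A^h$.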
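As a rough illustration, the budget split can be computed numerically. The sketch below assumes the rule of [2], namely that $M$ is the largest integer with $M \lceil \log M / (2 \log 1/\gamma) \rceil \le n$ and that $L$ is the corresponding ceiling term; the function names `episode_length` and `split_budget` are hypothetical and not taken from the paper.

```python
import math


def episode_length(num_episodes: int, gamma: float) -> int:
    """Trajectory length L = ceil(log M / (2 log(1 / gamma))) for M episodes."""
    return math.ceil(math.log(num_episodes) / (2 * math.log(1 / gamma)))


def split_budget(budget: int, gamma: float) -> tuple[int, int]:
    """Return (M, L), with M the largest integer such that M * L(M) <= budget."""
    if budget < 1 or not 0.0 < gamma < 1.0:
        raise ValueError("expected budget >= 1 and 0 < gamma < 1")
    episodes = 1
    # M * L(M) is nondecreasing in M, so a linear scan finds the largest feasible M.
    while (episodes + 1) * episode_length(episodes + 1, gamma) <= budget:
        episodes += 1
    return episodes, episode_length(episodes, gamma)


if __name__ == "__main__":
    print(split_budget(100, 0.8))
```

Note that for very small budgets the scan stops at $M = 1$, for which $L = 0$; this degenerate case is kept as-is in the sketch.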
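Then, we introduce some useful definitions. Consider episode $1 \le m \le M$. For any $1 \le h \le L$ and $a \in A^h$, let
$$T_a(m) = \sum_{s=1}^{m} \mathbb{1}\{a^s_{1:h} = a\}$$
be the number of times we played an action sequence starting with $a$, and $S_a(m)$ the sum of rewards collected at the last transition of the sequence $a$:
$$S_a(m) = \sum_{s=1}^{m} Y^s_h \, \mathbb{1}\{a^s_{1:h} = a\}$$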
The empirical mean reward of $a$ is $\hat{\mu}_a(m) = S_a(m) / T_a(m)$ if $T_a(m) > 0$, and $+\infty$ otherwise. Here, we provide a more general form for upper and lower confidence bounds on these empirical means:
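$$U^\mu_a(m) = \max \left\{ q \in I : T_a(m) \, d(\hat{\mu}_a(m), q) \le f(m) \right\} \tag{2}$$
$$L^\mu_a(m) = \min \left\{ q \in I : T_a(m) \, d(\hat{\mu}_a(m), q) \le f(m) \right\} \tag{3}$$
where $I$ is an interval, $d$ is a divergence on $I \times I$ and $f$ is a nondecreasing function. They are left unspecified for now; their particular implementations and associated properties are discussed in the following sections.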
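These upper bounds for intermediate rewards finally enable us to define an upper bound for the value $V(a)$ of the entire sequence of actions $a \in A^h$:
$$U_a(m) = \sum_{t=1}^{h} \gamma^t U^\mu_{a_{1:t}}(m) + \frac{\gamma^{h+1}}{1 - \gamma} \tag{4}$$
where the term $\frac{\gamma^{h+1}}{1 - \gamma}$ comes from upper-bounding by one every reward-to-go in the sum (1), for $t \ge h + 1$. In [2], there is an extra step to "sharpen the bounds" of sequences $a \in A^L$ by taking:
$$B_a(m) = \inf_{1 \le t \le L} U_{a_{1:t}}(m) \tag{5}$$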
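To make definitions (4) and (5) concrete, here is a minimal sketch computing the U-values along a single action sequence and the sharpened B-value. It assumes that the reward upper bounds $U^\mu_{a_{1:t}}(m)$ are already available as a list, that (4) reads $U_{a_{1:h}}(m) = \sum_{t=1}^{h} \gamma^t U^\mu_{a_{1:t}}(m) + \gamma^{h+1}/(1-\gamma)$, and that (5) takes the minimum over prefixes; the function names are hypothetical.

```python
def u_values(reward_ucbs: list[float], gamma: float) -> list[float]:
    """U-values of the prefixes a_{1:1}, ..., a_{1:L} of one sequence.

    reward_ucbs[t - 1] is the reward upper bound of the prefix of length t.
    """
    values = []
    discounted_sum = 0.0
    for h, ucb in enumerate(reward_ucbs, start=1):
        discounted_sum += gamma ** h * ucb
        # Every reward after depth h is upper-bounded by one.
        values.append(discounted_sum + gamma ** (h + 1) / (1 - gamma))
    return values


def b_value(reward_ucbs: list[float], gamma: float) -> float:
    """Sharpened bound: the smallest U-value over all prefixes."""
    return min(u_values(reward_ucbs, gamma))


if __name__ == "__main__":
    print(u_values([0.5, 0.5, 0.5], 0.9), b_value([0.5, 0.5, 0.5], 0.9))
    print(u_values([1.5, 1.5, 1.5], 0.9), b_value([1.5, 1.5, 1.5], 0.9))
```

With reward bounds inside $[0, 1]$ the U-values decrease with depth and the B-value is the deepest one; with reward bounds above one they increase and the B-value collapses to the U-value of the first action.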
The general algorithm structure is shown in Algorithm 1. We now discuss two specific implementations that differ in their choice of divergence $d$ and nondecreasing function $f$. They are compared in Table 1.
Algorithm      | OLOP           | KL-OLOP
Interval $I$   | $\mathbb{R}$   | $[0, 1]$
Divergence $d$ | $d_{QUAD}$     | $d_{BER}$
$f(m)$         | $4 \log M$     | $2 \log M + 2 \log \log M$
2.2 OLOP
2.3 An unintended behaviour
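From the definition of $U_a(m)$ as an upper bound of the value of the sequence $a$, we expect increasing sequences $(a_{1:t})_t$ to have nonincreasing upper bounds. Indeed, every new action $a_t$ encountered along the sequence is a potential loss of optimality. However, this property is only true if the upper bound $U^\mu_a(m)$ defined in (2) belongs to the reward interval $[0, 1]$.
Lemma 1
(Monotonicity of $U_a(m)$ along a sequence)

If it holds that $U^\mu_b(m) \in [0, 1]$ for all $b \in A^*$, then for any $a \in A^L$ the sequence $(U_{a_{1:h}}(m))_{1 \le h \le L}$ is nonincreasing, and we simply have $B_a(m) = U_a(m)$.

Conversely, if $U^\mu_b(m) > 1$ for all $b \in A^*$, then for any $a \in A^L$ the sequence $(U_{a_{1:h}}(m))_{1 \le h \le L}$ is nondecreasing, and we have $B_a(m) = U_{a_{1:1}}(m)$.
Proof
We prove the first proposition, and the same reasoning applies to the second. For $a \in A^L$ and $1 \le h \le L - 1$, we have by (4):
$$U_{a_{1:h+1}}(m) - U_{a_{1:h}}(m) = \gamma^{h+1} U^\mu_{a_{1:h+1}}(m) + \frac{\gamma^{h+2}}{1 - \gamma} - \frac{\gamma^{h+1}}{1 - \gamma} = -\gamma^{h+1} \left(1 - U^\mu_{a_{1:h+1}}(m)\right) \le 0$$
We can conclude that $(U_{a_{1:h}}(m))_{1 \le h \le L}$ is nonincreasing and that $B_a(m) = \inf_{1 \le h \le L} U_{a_{1:h}}(m) = U_{a_{1:L}}(m) = U_a(m)$. ∎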
Yet, the Chernoff-Hoeffding bounds used in OLOP start in the $U^\mu_a(m) > 1$ regime (initially $U^\mu_a(m) = +\infty$) and can remain in this regime for a long time, especially in the near-optimal branches where $\hat{\mu}_a(m)$ is close to one.
Under these circumstances, Lemma 1 has a drastic effect on the search behaviour. Indeed, as long as a subtree under the root verifies $U^\mu_a(m) > 1$ for every sequence $a$, all these sequences share the same B-value $B_a(m) = U_{a_{1:1}}(m)$. This means that OLOP cannot differentiate them and exploit information from their shared history as intended, and behaves as uniform sampling instead. Once the early depths have been explored sufficiently, OLOP resumes its intended behaviour, but the problem is only shifted to deeper unexplored subtrees.
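A small numerical sketch can give a feel for how long a node may stay in the regime where its reward bound exceeds one. It assumes the Chernoff-Hoeffding reward bound of OLOP [2], $\hat{\mu}_a(m) + \sqrt{2 \log M / T_a(m)}$, and a fixed empirical mean; the function names and the value $M = 100$ are hypothetical choices made for illustration only.

```python
import math


def hoeffding_ucb(mean: float, plays: int, num_episodes: int) -> float:
    """Chernoff-Hoeffding reward bound: mean + sqrt(2 log M / T), +inf if unplayed."""
    if plays == 0:
        return math.inf
    return mean + math.sqrt(2 * math.log(num_episodes) / plays)


def plays_until_bound_in_range(mean: float, num_episodes: int) -> int:
    """Smallest number of plays T for which the bound is at most 1 (needs mean < 1)."""
    return math.ceil(2 * math.log(num_episodes) / (1 - mean) ** 2)


if __name__ == "__main__":
    for mean in (0.5, 0.9):
        print(mean, plays_until_bound_in_range(mean, num_episodes=100))
```

Under these assumptions, a node with empirical mean 0.9 needs several hundred visits before its bound drops below one, which is more than the total number of episodes in this toy setting.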
This consideration motivates us to leverage the recent developments in the Multi-Armed Bandits literature, and modify the upper-confidence bounds for the expected rewards so that they respect the reward bounds.
2.4 KL-OLOP
We propose a novel implementation of Algorithm 1 where we leverage the analysis of the kl-UCB algorithm from [5] for multi-armed bandits with general bounded rewards. Likewise, we use the Bernoulli Kullback-Leibler divergence defined on the interval $I = [0, 1]$ by:
$$d_{BER}(p, q) = p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q}$$
with, by convention, $0 \log 0 = 0$ and $x \log \frac{x}{0} = +\infty$ for $x > 0$. This divergence and the corresponding bounds are illustrated in Figure 1.
$U^\mu_a(m)$ and $L^\mu_a(m)$ can be efficiently computed using Newton iterations, as for any $p \in [0, 1]$ the function $q \mapsto d_{BER}(p, q)$ is strictly convex and increasing (resp. decreasing) on the interval $[p, 1]$ (resp. $[0, p]$).
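The computation of these bounds can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the bounds take the kl-UCB form $\max \{q \in [0, 1] : T \, d_{BER}(\hat{\mu}, q) \le f\}$, and it swaps the Newton iterations mentioned above for a plain bisection, which relies on the same monotonicity of $q \mapsto d_{BER}(p, q)$ on $[p, 1]$. All function and parameter names are hypothetical.

```python
import math


def kl_bernoulli(p: float, q: float) -> float:
    """Bernoulli KL divergence, with 0 log 0 = 0 and x log(x / 0) = +inf for x > 0."""
    def term(x: float, y: float) -> float:
        if x == 0:
            return 0.0
        if y == 0:
            return math.inf
        return x * math.log(x / y)
    return term(p, q) + term(1 - p, 1 - q)


def kl_ucb(mean: float, plays: int, threshold: float, tol: float = 1e-9) -> float:
    """Largest q in [mean, 1] such that plays * kl(mean, q) <= threshold."""
    if plays == 0:
        return 1.0  # the constraint is vacuous, so the bound is the top of [0, 1]
    low, high = mean, 1.0
    while high - low > tol:
        mid = (low + high) / 2
        if plays * kl_bernoulli(mean, mid) <= threshold:
            low = mid  # mid is still feasible: move the lower end up
        else:
            high = mid
    return low


def kl_lcb(mean: float, plays: int, threshold: float, tol: float = 1e-9) -> float:
    """Lower bound, using the symmetry kl(p, q) = kl(1 - p, 1 - q)."""
    return 1.0 - kl_ucb(1.0 - mean, plays, threshold, tol)


if __name__ == "__main__":
    print(kl_lcb(0.5, 10, 2.0), kl_ucb(0.5, 10, 2.0))
```

By construction the returned bounds always lie in $[0, 1]$, unlike a bound of the form $\hat{\mu} + \sqrt{\cdot}$.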
Moreover, we use the constant function $f_2 : m \mapsto 2 \log M + 2 \log \log M$. This choice is justified at the end of section 5. Because $f_2$ is lower than the function $f_1 : m \mapsto 4 \log M$ used in OLOP, Figure 1 shows that the bounds are tighter and hence less conservative than those of OLOP, which should increase the performance, provided that their associated probability of violation does not invalidate the regret bound of OLOP.
Remark 1 (Upper bounds sharpening)
The introduction of the B-values $B_a(m)$ was made necessary in OLOP by the use of Chernoff-Hoeffding confidence bounds, which are not guaranteed to belong to $[0, 1]$. On the contrary, we have in KL-OLOP that $U^\mu_a(m) \in I = [0, 1]$ by construction. By Lemma 1, the upper bounds sharpening step in line 1 of Algorithm 1 is now superfluous, as we trivially have $B_a(m) = U_a(m)$ for all $a \in A^L$.
3 Sample complexity
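We say that $u_n = \tilde{O}(v_n)$ if there exist $\alpha, \beta > 0$ such that $u_n \le \alpha \log(v_n)^\beta v_n$. Let us denote the proportion of near-optimal nodes $\kappa_2$ as:
$$\kappa_2 = \limsup_{h \to \infty} \left| \left\{ a \in A^h : V(a) \ge V - \frac{2 \gamma^{h+1}}{1 - \gamma} \right\} \right|^{1/h}$$
Theorem 3.1 (Sample complexity)
We show that KL-OLOP enjoys the same asymptotic regret bounds as OLOP. More precisely, for any $\kappa' > \kappa_2$, KL-OLOP satisfies:
$$\mathbb{E}\, r_n = \begin{cases} \tilde{O}\left(n^{-\frac{\log 1/\gamma}{\log \kappa'}}\right) & \text{if } \gamma \sqrt{\kappa'} > 1 \\ \tilde{O}\left(n^{-\frac{1}{2}}\right) & \text{if } \gamma \sqrt{\kappa'} \le 1 \end{cases}$$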
4 Time and memory complexity
After having considered the sample efficiency of OLOP and KL-OLOP, we now turn to their time and memory complexities. We only discuss the case of KL-OLOP for ease of presentation, but all results easily extend to OLOP.
Algorithm 1 requires, at each episode, computing and storing in memory the reward upper bounds and U-values of all nodes in the tree $\mathcal{T} = \bigcup_{h=0}^{L} A^h$. Hence, its time and memory complexities are
$$C(\text{Algorithm 1}) = O(M |\mathcal{T}|) = O(M K^L) \tag{6}$$
The curse of dimensionality brought by the branching factor $K = |A|$ and horizon $L$ makes it intractable in practice to actually run KL-OLOP in its original form, even for small problems. However, most of this computation and memory usage is wasted, as with reasonable sample budgets $n$ the vast majority of the tree $\mathcal{T}$ will not actually be explored and hence does not hold any valuable information.
We propose in Algorithm 2 a lazy version of KL-OLOP which only stores and processes the explored subtree, as shown in Figure 2, while preserving the inner workings of the original algorithm.
Theorem 4.1 (Consistency)
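Property 1 (Time and memory complexity)
Algorithm 2 has time and memory complexities of:
$$C(\text{Algorithm 2}) = O(M^2 K L)$$
The corresponding complexity gain compared to the original Algorithm 1 is:
$$\frac{C(\text{Algorithm 2})}{C(\text{Algorithm 1})} = \frac{n}{K^{L-1}}$$
which highlights that only a subtree corresponding to the sample budget $n$ is processed instead of the whole search tree $\mathcal{T}$.
Proof
At episode $m$, we compute and store in memory the reward upper bounds and U-values of all nodes in the explored subtree $\mathcal{T}^+_m$. Moreover, the tree $\mathcal{T}^+_m$ is constructed iteratively by adding $K$ nodes at most $L$ times at each episode from 0 to $m$. Hence, $|\mathcal{T}^+_m| = O(m K L)$. This yields directly $C(\text{Algorithm 2}) = \sum_{m=1}^{M} O(m K L) = O(M^2 K L)$. ∎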
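The idea of storing only the explored subtree can be sketched with a dictionary-based tree. This is a simplified illustration under stated assumptions, not Algorithm 2 itself: it creates a node only when an action sequence is actually played (rather than adding all $K$ children of an expanded node), and the names `Node`, `record_trajectory` and `tree_size` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Statistics of one explored action sequence; children are created on first visit."""
    count: int = 0             # number of plays of this sequence, T_a(m)
    reward_sum: float = 0.0    # rewards collected at its last transition, S_a(m)
    children: dict = field(default_factory=dict)


def record_trajectory(root: Node, actions: list, rewards: list) -> None:
    """Update the statistics of every prefix of a played action sequence."""
    node = root
    for action, reward in zip(actions, rewards):
        node = node.children.setdefault(action, Node())
        node.count += 1
        node.reward_sum += reward


def tree_size(node: Node) -> int:
    """Number of stored nodes, root included."""
    return 1 + sum(tree_size(child) for child in node.children.values())


if __name__ == "__main__":
    root = Node()
    for actions in ([0, 0, 0, 0], [0, 0, 1, 1], [1, 0, 0, 0]):
        record_trajectory(root, actions, [0.5] * len(actions))
    print(tree_size(root), "stored nodes versus", 2 ** 5 - 1, "in the full binary tree")
```

Each episode adds at most $L$ nodes in this variant, so the stored tree grows linearly with the number of episodes instead of exponentially with the depth.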
5 Proof of Theorem 3.1
We follow step by step the pyramidal proof of [2], and adapt it to the Kullback-Leibler upper confidence bound. The adjustments resulting from the change of confidence bounds are highlighted. The proofs of lemmas which are not significantly altered are listed in the Supplementary Material.
We start by recalling their notations. Let and such that . Considering sequences of actions of length , we define the subset of nearoptimal sequences and the subset of suboptimal sequences that were nearoptimal at depth :
By convention, . From the definition of , we have that for any , there exists a constant C such that for any ,
Hence, we also have .
Now, for , with , , we define the set of suffixes of in that have been played at least a certain number of times:
and the random variable:
Lemma 2 (Regret and suboptimal pulls)
The following holds true:
The rest of the proof is devoted to the analysis of the term . The next lemma describes under which circumstances a suboptimal sequence of actions in can be selected.
Lemma 3 (Conditions for suboptimal pull)
Assume that at step we select a suboptimal sequence : there exist such that . Then, it implies that one of the following propositions is true:
(UCB violation) 
or
(LCB violation) 
or
(Large CI) 
Proof
As and because the U-values are nonincreasing along sequences of actions (see Remark 1 and Lemma 1), we have . Moreover, by Algorithm 1, we have and , so and finally .
Assume that (UCB violation) is false, then:
(7) 
Assume that (LCB violation) is false, then:
(8) 
By taking the difference (7) - (8),
But , so , which yields (Large CI) and concludes the proof. ∎
In the following lemma, for each episode we bound the probability of (UCB violation) or (LCB violation) by a desired confidence level , whose choice we postpone until the end of this proof. For now, we simply assume that we picked a function that satisfies . We also denote .
Lemma 4 (Boundary crossing probability)
The following holds true, for any and ,
Proof
Since , we have,
In order to bound this quantity, we reduce the question to the application of a deviation inequality. For all , we have on the event that . Therefore, for all , by definition of :
As is continuous on , we have by letting that:
Since d is nondecreasing on ,
We have thus shown the following inclusion:
Decomposing according to the values of yields:
We now apply the deviation inequality provided in Lemma 2 of Appendix A in [5]: , provided that ,
By choosing , it comes
The same reasoning gives: . ∎
Lemma 5 (Confidence interval length and number of plays)
Proof
We start by providing an explicit upper bound for the length of the confidence interval $[L^\mu_a(m), U^\mu_a(m)]$. By Pinsker's inequality:
$$d_{BER}(p, q) \ge 2 (p - q)^2$$
Hence for all ,
And thus, for all , by definition of and :
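The Pinsker step can be spelled out as a short worked derivation. It is a sketch under the assumption that the confidence bounds have the form $U^\mu_a(m) = \max \{ q \in [0, 1] : T_a(m) \, d_{BER}(\hat{\mu}_a(m), q) \le f(m) \}$ (and the corresponding minimum for $L^\mu_a(m)$), with $T_a(m) > 0$; the constants in the original lemma may be stated differently.

```latex
\begin{aligned}
T_a(m)\, d_{BER}\big(\hat{\mu}_a(m), U^\mu_a(m)\big) \le f(m)
  &\implies 2 \big(U^\mu_a(m) - \hat{\mu}_a(m)\big)^2 \le \frac{f(m)}{T_a(m)}
  && \text{(Pinsker: } d_{BER}(p, q) \ge 2 (p - q)^2 \text{)} \\
  &\implies U^\mu_a(m) - \hat{\mu}_a(m) \le \sqrt{\frac{f(m)}{2\, T_a(m)}}, \\
T_a(m)\, d_{BER}\big(\hat{\mu}_a(m), L^\mu_a(m)\big) \le f(m)
  &\implies \hat{\mu}_a(m) - L^\mu_a(m) \le \sqrt{\frac{f(m)}{2\, T_a(m)}}, \\
\text{so that}\quad U^\mu_a(m) - L^\mu_a(m)
  &\le 2 \sqrt{\frac{f(m)}{2\, T_a(m)}} = \sqrt{\frac{2 f(m)}{T_a(m)}}.
\end{aligned}
```

The length of the confidence interval therefore shrinks at rate $1 / \sqrt{T_a(m)}$, which is what links a large confidence interval to a small number of plays.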
Lemma 6
Let and . Then implies that either equation (UCB violation) or (LCB violation) is satisfied or the following proposition is true:
(11) 
Lemma 7
Let and . Then the following holds true,
Lemma 8
Let . The following holds true,