Tree-search based algorithms have recently achieved notable success at solving sequential, highly combinatorial problems such as the challenging game of Go (Enzenberger et al., 2010; Silver et al., 2016). Such algorithms use a generative model of the environment to simulate episodes starting from the current state of the agent (Sutton, 1991; Sutton and Barto, 1998). This allows the exploration of reachable states and actions and results in the construction of an (unbalanced) scenario tree that aims at identifying promising branches within a limited computational budget. When the computational budget is exhausted, the recommended action at the root node is applied and a new tree is built in the resulting state. Overall, this results in a closed loop control process.
We are interested in stochastic problems with large (e.g. continuous) state spaces and a short decision time (budget). In this setting, open loop planning algorithms have proven to be successful (Bubeck and Munos, 2010) and even to outperform (Weinstein and Littman, 2012) the standard approaches that consider closed loop policy trees such as UCT (Kocsis and Szepesvári, 2006). They seek optimal sequences of actions (plans) rather than optimal policies, despite the sub-optimal nature of a plan in stochastic environments. Indeed, computing a plan prevents feedback on the explored states but reduces the complexity of the state space exploration. Given a tree computed by an open loop planning algorithm, we propose to keep the sub-tree reached by the application of the recommended action and to directly use it as the main tree for the subsequent time step, without re-planning. The motivation is to spare the computational cost of tree building at subsequent time steps, hence reducing the number of calls to the simulator. The benefit is twofold. On the one hand, it reduces energy consumption for systems with low computational resources (Wilson et al., 2014, 2016). On the other hand, the saved computational time can be re-invested into other tasks. In particular, this approach is suited to low-level (i.e. high-frequency) control, where developing a new tree at every time step is cumbersome. In this framework, Perez et al. (2012a) and Heusner (2011) considered keeping the tree in deterministic environments but observed a negative impact as the sub-trees were systematically kept without analysis. Moreover, they lose the aforementioned computational gain by refining the sub-trees.
In this paper, we study the impact of using the subsequent sub-trees as main trees for the next action steps without further re-planning. We claim that in weakly stochastic environments, the performance reached is comparable to that of algorithms that systematically discard the tree. Our contribution is threefold. (1) We introduce a new algorithm called OLTA (Section 3), which performs a systematic analysis of the sub-tree and decides at each time step whether to re-plan. (2) We upper bound the probability of selecting a suboptimal action within a sub-tree, the sense of optimality being defined in an open loop fashion (Section 4). Additionally, we show that this upper bound decays logarithmically with the sub-tree depth. (3) We show in our experiments the benefit of applying such a method both in terms of performance and computational cost saving (Section 5).
2.1 Markov Decision Process
We model the planning problem as a Markov Decision Process (MDP) where an agent sequentially takes actions with the general goal of maximizing the cumulative return fed back by the environment (Puterman, 2014). We refer to the state space as and the action space as . We suppose the number of actions to be finite with , thus we write . We also consider that the available actions are independent of the state the agent lies in. The state transition function is stochastic and we note the probability of reaching state after taking action in state . The reward model is denoted by and refers to the scalar reward received while performing the transition . We assume that this reward function is deterministic. Finally, we suppose the horizon of the MDP is infinite and we note the discount factor which represents the importance of the subsequent collected rewards.
2.2 Tree Representation
When a generative model of the MDP is available, it becomes possible to use it within planning algorithms. Tree-search algorithms use this model in order to build a tree of what may possibly occur in the current situation of the agent (Sutton, 1991; Sutton and Barto, 1998; Silver et al., 2008). In the stochastic setting with potentially infinitely many states, we use a tree structure similar to the one used by Bubeck and Munos (2010). The tree built at each time step consists of a look-ahead search of the possible outcomes while following some action plan starting from the current state of the agent . Thus, the root node of the tree is labelled by the unique state . The edges correspond to the available actions, being the branching factor of the tree. The tree itself represents a set of action sequences, or plans, originating from its root node.
We emphasize the fact that this tree structure implies that we search for a state-independent optimal sequence of actions (open loop plan), which is in general sub-optimal compared to a state-dependent policy search. The THTS family of algorithms in particular (Keller and Helmert, 2013) defines trees with chance and decision nodes, while our structure does not apply an equality operator on the sampled states. Following Bubeck and Munos (2010) and Weinstein and Littman (2012), we argue that closed-loop application of the first action in optimal open loop plans, although theoretically suboptimal, can be competitive with these methods in practice, while being more sample-efficient.
Since the transition model is stochastic, the non-root nodes are not labelled by a unique state. Instead, every such node is associated to a state distribution resulting from the application of the action plan leading to the considered node and starting from . During the exploration, we consider saving all sampled states at each non-root node. A comprehensive illustration of such a tree can be found in Figure 1. This approach extends straightforwardly to Partially Observable Markov Decision Processes (POMDP) (Silver and Veness, 2010).
Given a tree-search, open loop planning algorithm, we call the tree at depth , that is the sub-tree resulting from the application of the first recommended actions. Hence denotes the whole tree, the tree starting from the node reached by the application of the first recommended action and so on.
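The tree structure described in this section can be sketched as a small data structure. This is only an illustrative sketch, not the authors' implementation: the names `Node`, `child`, `is_fully_expanded` and `subtree` are hypothetical, actions are assumed to be indexed by integers, and each node simply stores the states sampled while following the action plan that leads to it.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Node:
    """Node of an open loop tree, identified by the action plan leading to it."""
    incoming_action: Optional[int] = None  # None for the root node
    sampled_states: List[Any] = field(default_factory=list)  # empirical state distribution
    children: Dict[int, "Node"] = field(default_factory=dict)  # one edge per tried action

    def child(self, action: int) -> "Node":
        """Return the child reached by `action`, creating it on first use."""
        if action not in self.children:
            self.children[action] = Node(incoming_action=action)
        return self.children[action]

    def is_fully_expanded(self, n_actions: int) -> bool:
        """True when every available action has been tried at least once."""
        return len(self.children) == n_actions


def subtree(root: Node, plan: List[int]) -> Node:
    """Follow a sequence of actions from the root and return the node reached."""
    node = root
    for action in plan:
        node = node.children[action]
    return node
```

With this representation, the sub-tree obtained after applying the first recommended actions is just the node returned by `subtree(root, plan)`; no state equality test is ever needed.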
2.3 Open Loop UCT
For the sake of clarity and in order to clearly separate the tree building properties from the open loop execution presented in the next section, we define an open loop planning algorithm utilizing the presented tree structure that we call Open Loop UCT (OLUCT). The difference between UCT and OLUCT is that OLUCT is not provided with an equality operator over states. Within the THTS terminology, this means that decision and chance nodes do not correspond to a single state but to the state distribution reachable by the action plan leading to the node. Hence decision and chance nodes are associated to the state distribution which makes OLUCT an open loop planning algorithm. The fundamental consequence is that an action value within our tree is computed w.r.t. the parent node’s state distribution rather than a single state.
Apart from this, OLUCT uses the same exploration procedure as UCT. Within a node, we note
the estimated expected return of action after samples of this action. is the number of trials of action up to time of the OLUCT procedure. An Upper Confidence Bound (UCB) strategy (Auer et al., 2002) is applied at each node where each action is seen as an arm of a bandit problem. The tree policy selects the action with the highest UCB:
where is an exploration term ensuring that all actions will be sampled infinitely often. The parameter drives the exploration-exploitation trade-off. The OLUCT tree building procedure is detailed in Algorithm 1.
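As an illustration of the tree policy, the following sketch selects an action with a UCB rule at a single node. It assumes the common UCB1-style bonus `c * sqrt(ln(total) / n_a)`; the exact form of the exploration term used in the paper is not reproduced here, and the function name `ucb_select` and the default value of `c` are hypothetical.

```python
import math
from typing import Sequence


def ucb_select(values: Sequence[float], counts: Sequence[int], c: float = 0.7) -> int:
    """Return the index of the action with the highest upper confidence bound.

    values[a] is the current return estimate of action a at this node and
    counts[a] is the number of times a has been tried at this node.
    """
    # Untried actions are selected first so that every count becomes positive.
    for action, n in enumerate(counts):
        if n == 0:
            return action
    total = sum(counts)
    scores = [
        values[a] + c * math.sqrt(math.log(total) / counts[a])  # assumed UCB1-style bonus
        for a in range(len(values))
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

A larger `c` favours rarely tried actions, while `c = 0` reduces the rule to greedy selection on the current estimates.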
3 OLTA (Open Loop Tree-search Algorithm)
In order to control the execution of open loop plans, we propose a new algorithm called OLTA (Algorithm 2). It relies on a generic open loop planning algorithm to generate a tree, rooting from the current state. For the next execution time step, it decides either to use the sub-tree reached by the recommended action or to trigger a re-planning by building a new tree. If no re-planning is triggered, then the recommended action of the sub-tree is applied without using the additional information of the new state observed after the transition. This results in an open loop control process and spares the cost of developing a new tree starting at this state. The intuition behind OLTA is that several consecutive recommended actions in an optimal branch of the tree can be reliable, despite the randomness of the environment. A major example of such a case is low-level control, where consecutive sampled states are close to each other.
In this paper, for its performance and simplicity, we chose to implement OLUCT as the open loop planning algorithm utilized by OLTA. However, any other algorithm generating trees as described in Section 2.2 could be used in the same way (e.g. OLOP (Bubeck and Munos, 2010), or HOLOP (Weinstein and Littman, 2012)).
One important feature of OLTA is the so-called “decisionCriterion”, based on which the agent decides either to use the first sub-tree following the recommended action, or to re-build a new tree from the current state. The decision is based on a comparison between the characteristics of the resulting sub-tree and the current state of the agent. In the next section, we discuss different decision criteria, leading to the consideration of a family of different algorithms.
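The control loop described in this section can be sketched as follows. This is a minimal illustration rather than a transcription of Algorithm 2: the planner, the recommendation rule, the sub-tree lookup and the decision criterion are passed in as callables, and all names (`olta_episode`, `build_tree`, `recommend`, `descend`, `keep_subtree`) are hypothetical.

```python
from typing import Any, Callable, Optional, Tuple


def olta_episode(
    env_step: Callable[[Any, int], Tuple[Any, float, bool]],
    build_tree: Callable[[Any], Any],
    recommend: Callable[[Any], int],
    descend: Callable[[Any, int], Optional[Any]],
    keep_subtree: Callable[[Any, Any], bool],
    initial_state: Any,
    max_steps: int,
) -> Tuple[float, int]:
    """Run one episode and return (cumulative reward, number of tree builds)."""
    state = initial_state
    tree = None
    total_reward = 0.0
    n_builds = 0
    for _ in range(max_steps):
        # Re-plan when no sub-tree is available or the criterion rejects it.
        if tree is None or not keep_subtree(tree, state):
            tree = build_tree(state)
            n_builds += 1
        action = recommend(tree)
        state, reward, done = env_step(state, action)
        total_reward += reward
        if done:
            break
        # Candidate tree for the next step: the sub-tree reached by the action.
        tree = descend(tree, action)
    return total_reward, n_builds
```

A criterion that always returns `False` recovers systematic re-planning, while a criterion that only checks full expansion of the sub-tree root corresponds to the simplest variant discussed next.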
3.2 Decision Criterion
The simplest implementation of the decision criterion is to keep the sub-tree only if its root node is fully expanded. This means that each action has been sampled at least once. We call the resulting algorithm Plain OLTA. It naively trusts the value estimates of the sub-tree, thus applies the whole plan of recommended actions at each depth until it reaches a partially expanded node. Therefore, Plain OLTA is expected to perform better in deterministic environments. In stochastic cases however, those estimates may be biased because of the different sources of uncertainty within the MDP (reward function, state transition function and action selection). For this reason, we seek more robust criteria to base the decision on.
A natural way to decide whether to keep the sub-tree or not is to track if the recommended action is optimal w.r.t. the new state of the agent. Here we make an important distinction between a state-wise optimal action and a node-wise optimal action. The first one is the action recommended by the optimal policy in a specific state. We note it , with the optimal state-action value function. In order to define the second one, we introduce
, the state random variable at the root node of. Its distribution results from the application of the first recommended actions starting from , so . The node-wise optimal action maximizes the expected return given the state distribution of the node. We note it where is the optimal action value function w.r.t. the state distribution at the root node of , that is . Following Bellemare et al. (2017), a distributional Bellman equation can be expressed in terms of three sources of randomness that are: the stochastic reward function; the random return; and the transition operator with , and . Mathematically, we have the following distributional Bellman equations:
with and . Unfortunately, at the root node of for , open loop tree-search algorithms do not estimate but . The bias introduced by the state distribution implies that in the general case we have no guarantee that . The risk is that the set of possible realizations of can include states where is sub-optimal, in which case the resulting return evaluations would weigh in favour of a different action than . In other words, introducing the notion of domination domain for an action as , if is not included in , then the risk of the recommended action being state-wise sub-optimal is increased. Conversely, if is included in the domination domain of , then the optimal action will be selected given that the budget is “big enough” w.r.t. the chosen tree-search algorithm’s performance. Consequently, one should base the decision criterion on the analysis of and the action domination domains. To compute these domains, Rachelson and Lagoudakis (2010) use the properties of Lipschitz-MDPs. Although the following discussion is inspired by this work, the consideration of Lipschitz-MDPs is out of the scope of this paper. We discuss below the construction of decision criteria that will be illustrated in Section 5.
Current state analysis & POMDP setting. The current state of the agent can be compared to the empirical state distribution at the root node of the sub-tree. If is large, then the value estimators are related to the locality of the state space the agent lies in. If not, then the node-wise optimal action may not be state-wise optimal. This consideration presupposes the identification of a state metric for which two close states have a high chance of being in the same action domination domain. Alternatively, in the case of a POMDP, a belief distribution on the current state is available instead of the current state itself (Kaelbling et al., 1998). In such a case, a direct comparison between this distribution and can be performed (e.g. with a Wasserstein metric). Note that making use of the current state of the agent makes the algorithm closed-loop, by definition. We use the terminology “open-loop” in order to distinguish OLTA from classical closed-loop tree-search algorithms that systematically re-plan, rooting from the current state (e.g. OLOP (Bubeck and Munos, 2010) performs closed-loop execution).
State distribution analysis. The dispersion and multi-modality of can motivate not re-using a sub-tree. A high dispersion involves the possibility that does not belong to a single action domination domain, in which case a re-planning should be triggered. The same consideration applies in terms of multi-modality. Conversely, a narrow, mono-modal state distribution is a good hint that is contained in a single action domination domain.
Return distribution analysis. A widespread or a multi-modal return distribution for the recommended action in a node may indicate a strong dependency on the region of the state space we lie in. If
covers different action domination domains, each of these domains may contribute a different return distribution to the node’s return estimates, thus inducing a high variance on this distribution or even a multi-modality. In this case, it could be beneficial to trigger the re-planning. Alternatively, even after re-planning, widespread or multi-modal return distributions can naturally arise as a result of the MDP’s reward and transition models.
We do not provide a unique generic method to base the decision criterion on. Indeed, we believe that it is a strongly problem-dependent issue and that efficient heuristics can be built accordingly. However, the analyses of the state and return distributions constitute promising indicators and we exemplify their use in the experiments of the last section.
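As an example of the distribution comparison mentioned for the POMDP setting, the sketch below computes the 1-Wasserstein distance between two empirical distributions of scalar states. It assumes one-dimensional states and equally sized, equally weighted sample sets, in which case the distance is the mean absolute difference between sorted samples. The function names and the threshold are hypothetical and are not taken from the paper's experiments.

```python
from typing import Sequence


def wasserstein_1d(xs: Sequence[float], ys: Sequence[float]) -> float:
    """1-Wasserstein distance between two equally sized empirical samples."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("both samples must be non-empty and of equal size")
    # For equal-weight samples on the real line, the optimal coupling pairs
    # the i-th smallest value of xs with the i-th smallest value of ys.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def keep_subtree_by_belief(
    belief_samples: Sequence[float],
    node_samples: Sequence[float],
    threshold: float,
) -> bool:
    """Keep the sub-tree when the belief is close to the node's state distribution."""
    return wasserstein_1d(belief_samples, node_samples) <= threshold
```

For multi-dimensional states or unequal sample sizes, a general optimal transport solver would be needed instead of this closed form.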
4 Theoretical Analysis
In this section, we demonstrate that the algorithm asymptotically provides node-wise optimal actions for any sub-tree of depth . We first derive an upper bound on the failure probability that converges towards zero when the initial budget of the algorithm goes to infinity. Then, we characterize the loss of performance guarantees between subsequent depths and show a logarithmic decay of the upper bound. The demonstration unfolds as follows: first we write a lower bound for the number of trials of the actions at the root of in Lemma 1; then we write an upper bound on the failure probability given a known budget at depth in Lemma 2; finally we derive a recursive relation between the upper bounds of subsequent trees that leads to our result in Theorem 1.
We note the budget used to develop i.e. the number of times the first recommended actions have been selected by the tree policy. We note the number of times the action at the root node of has been selected by the OLUCT tree policy after expansions of . Similarly, denotes the estimate of the return of the action at depth after expansions of the sub-tree . We write the index of the action chosen by the tree policy at depth after expansions of . We have:
The recommended action at depth given a budget is . Following Kocsis and Szepesvári (2006), we assume that the empirical estimates converge and write and . Then, we define for , where we note the index of the node-wise optimal action at the root node of . We make the assumption that only one action is optimal in a given node. The minimum return difference between a suboptimal action and the optimal one at depth is .
Lemma 1 (Lower bound for the number of trials). For any sub-tree developed with a budget , there exists a constant such that for all . Furthermore, we have the following sequence of lower bounds for the budget with the ceiling function:
Proof. The first result is borrowed from Kocsis and Szepesvári (2006) where they show it for a generic bandit problem. The extension to our case with a given budget is straightforward. The sequence of lower bounds can be derived by observing that and applying the previous lower bound. ∎
Lemma 2 (Upper bound on the failure probability at depth given the budget ). For any sub-tree developed with a budget we have the following upper bound on the failure probability, conditioned by the budget :
Proof. Let us first bound the failure probability with the probability of overestimating a suboptimal action and underestimating the optimal one up to .
From now on, the proof reduces to the analysis of one of the two terms on the right-hand side of the last inequality, since both can be treated in the same way. Let us consider the first term:
where we first write the joint probability, then apply Hoeffding’s inequality, followed by Lemma 1 and the fact that a convex combination is upper bounded by its largest element. Similarly to Kocsis and Szepesvári (2006), we shall assume that the UCT constant is appropriately chosen for the tail inequalities to be verified. ∎
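For reference, the generic form of Hoeffding's inequality invoked in the proof above can be stated as follows. The statement below is the textbook version for independent, identically distributed variables bounded in $[0, 1]$; the symbols $X_i$, $n$, $\mu$ and $\epsilon$ are introduced here for illustration only and are not the paper's notation. In the tree setting the sampled returns are not stationary, which is why the proof additionally assumes an appropriately chosen exploration constant, as in Kocsis and Szepesvári (2006).

```latex
% Hoeffding's inequality for i.i.d. X_1, ..., X_n in [0, 1] with mean \mu and any \epsilon > 0
\[
  \mathbb{P}\left( \frac{1}{n}\sum_{i=1}^{n} X_i \geq \mu + \epsilon \right)
  \leq \exp\left(-2 n \epsilon^{2}\right),
  \qquad
  \mathbb{P}\left( \frac{1}{n}\sum_{i=1}^{n} X_i \leq \mu - \epsilon \right)
  \leq \exp\left(-2 n \epsilon^{2}\right).
\]
```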
Theorem 1 (Upper bound on the failure probability at depth ). For an initial budget of and for any sub-tree developed with a budget , we have the following recursive relation for the upper bound on the failure probability, conditioned by the initial budget :
Additionally, for any depth given the initial budget :
Where with and .
This result shows a logarithmic decay between the upper bounds on the failure probability of two subsequent trees. Asymptotically, at any depth, this upper bound converges towards zero. An illustration can be found in Figure 2 for several different depths. This result highlights the fact that the deeper the sub-tree is, the less one can rely on the recommended action at the root node. However, we should note that these upper bounds are derived without making further hypotheses on the MDP and express a worst-case value. In practice, depending on the problem, subsequent sub-trees can be highly relevant w.r.t. the current state of the agent. We show in the next section that a performance equal to that of OLUCT can be reached with a smaller computational budget and fewer calls to the generative model.
5 Empirical Analysis
We compared OLUCT with OLTA on a discrete 1D track environment (code available at https://github.com/erwanlecarpentier/1dtrack.git) and a continuous Physical Travelling Salesman Problem (PTSP; https://github.com/erwanlecarpentier/flatland.git) (Perez et al., 2012b). We implemented five decision criteria, leading to five variations of OLTA.
5.1 Heuristic decision criteria
A relevant decision criterion w.r.t. the treated problem allows OLTA to discard a sub-tree when its first recommended action may not be state-wise optimal given the current state of the agent. We implemented five different tests to base this decision on, and evaluated them independently, which led to the following variations of OLTA.
Plain OLTA. The simplest decision criterion that discards a sub-tree only if its root-node is not fully expanded.
State Distribution Modality (SDM-OLTA). Test whether the empirical state distribution is multi-modal or not. If yes, discard the tree if the current state of the agent does not belong to a majority mode. We define a majority mode by a mode comprising more than of the sampled states.
State Distribution Variance (SDV-OLTA). Test whether the empirical state distribution variance is above a certain threshold . Discard the tree if it is the case. For multi-dimensional state spaces such as in the PTSP, the Variance-Mean-Ratio (VMR) is considered for the different orders of magnitude to be comparable.
State Distance to State Distribution (SDSD-OLTA). Compute the Mahalanobis distance (De Maesschalck et al., 2000) of the current state from the empirical state distribution. Discard the tree if it is above a selected threshold .
Return Distribution Variance (RDV-OLTA). Test whether the empirical return distribution variance is above a certain threshold . Discard the tree if it is the case.
A more selective decision criterion can easily be derived by combining the previously described decision criteria and discarding the tree if one of them recommends to do so.
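The threshold-based criteria above can be sketched as simple predicates that return `True` when the sub-tree should be kept. The sketch assumes scalar states, for which the Mahalanobis distance reduces to the absolute deviation from the mean divided by the standard deviation. The function names and thresholds are hypothetical, and the modality test of SDM-OLTA is not covered.

```python
import statistics
from typing import Iterable, Sequence


def plain_keep(n_expanded_actions: int, n_actions: int) -> bool:
    """Plain OLTA: keep the sub-tree only if its root node is fully expanded."""
    return n_expanded_actions == n_actions


def sdv_keep(states: Sequence[float], threshold: float) -> bool:
    """SDV: keep unless the empirical state variance exceeds the threshold."""
    return statistics.pvariance(states) <= threshold


def sdsd_keep(states: Sequence[float], current_state: float, threshold: float) -> bool:
    """SDSD: keep unless the current state is too far from the state distribution."""
    mean = statistics.fmean(states)
    std = statistics.pstdev(states)
    if std == 0.0:
        # Degenerate distribution: only an exact match counts as close.
        return current_state == mean
    return abs(current_state - mean) / std <= threshold  # 1D Mahalanobis distance


def rdv_keep(returns: Sequence[float], threshold: float) -> bool:
    """RDV: keep unless the empirical return variance exceeds the threshold."""
    return statistics.pvariance(returns) <= threshold


def combined_keep(checks: Iterable[bool]) -> bool:
    """Discard the tree as soon as one criterion recommends discarding it."""
    return all(checks)
```

Combining the predicates with `all` mirrors the more selective criterion described above, since a single failing test is enough to trigger a re-planning.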
5.2 1D Track Environment
The 1D track environment (Figure 3) is a 1D discrete world where an agent can either go right or left. The initial state is the “middle” state . The reward is everywhere except for the transition to the two terminal states and for which it is . The action space is . We introduce a transition misstep probability which is the probability of ending up in the opposite state after taking an action, for : and . The same applies for the action. If , the optimal policy is to go left at , to act randomly at and to go right at . The simulation settings are: ; (budget); ; (simulation horizon for ); ; . The decision criteria parameters were tuned to: ; ; ; . We generated episodes for each value of and recorded 3 performance measures: loss (number of time steps to termination); computational cost (measured computation time); and number of calls to the generative model. We display two different graphs of the loss; the second one highlights the relative performance between OLTA and OLUCT.
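A generative model for a track of this kind can be sketched in a few lines. The numerical settings of the experiment are not reproduced here: the track length `n_states`, the terminal reward of 1 and the integer encoding of states and actions are assumptions made for illustration, and `track_step` is a hypothetical name.

```python
import random
from typing import Tuple

LEFT, RIGHT = -1, +1  # assumed integer encoding of the two actions


def track_step(
    state: int,
    action: int,
    q: float,
    rng: random.Random,
    n_states: int = 5,  # hypothetical track length; cells 0 and n_states - 1 are terminal
) -> Tuple[int, float, bool]:
    """Sample one transition of the 1D track with misstep probability q."""
    # With probability q the agent moves in the direction opposite to the action.
    move = -action if rng.random() < q else action
    next_state = min(max(state + move, 0), n_states - 1)
    done = next_state in (0, n_states - 1)
    reward = 1.0 if done else 0.0  # assumed terminal reward, zero elsewhere
    return next_state, reward, done
```

Setting `q = 0` gives a deterministic track, the case in which naively re-using sub-trees is expected to work best.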
The motivation behind the use of such a benchmark is to test open loop control in a highly stochastic environment where feedback of the current state is highly informative about the optimal action. First, notice that the parameters are tuned so that the OLUCT algorithm can easily find the optimal action and that the derived plan at the root node of is optimal. If a misstep occurs after the first action, OLUCT re-plans systematically, whereas OLTA has to detect that a misstep occurred and trigger a re-planning accordingly. As seen in Figure 4, the non-plain OLTA variants and OLUCT achieved a very comparable loss. Plain-OLTA had a weaker performance due to its systematic re-use of the sub-trees. Notice that some variations of OLTA such as SDV-OLTA achieved a better mean loss than OLUCT for some values of . Due to the high variance, this observation cannot lead to the conclusion that OLTA can outperform OLUCT. However, it emphasizes the fact that the performances are very similar. In terms of both computational cost and number of calls to the generative model, OLTA widely outperforms OLUCT. As increases, this computational gain vanishes and the cost of SDM-OLTA and SDV-OLTA catches up with that of OLUCT. This reflects the discriminative power of their decision criteria, which discard more trees. RDV-OLTA and SDSD-OLTA kept a lower computational cost while reasonably matching the performance of OLUCT. As expected, the computational cost of Plain-OLTA stays low. The apparent similarity between the number of calls to the generative model and the computational cost indicates that computing our decision criteria is much less expensive than re-planning.
5.3 Physical Travelling Salesman Problem
Figure 5: Trajectory derived by an OLUCT algorithm in our PTSP setting. The starting point is displayed in red, the waypoints in green and the walls in grey.
The PTSP is a continuous navigation problem in which an agent must reach all the waypoints within a maze (Figure 5). The state of the agent is i.e. the 2D position, orientation and velocity. The action space is which consists of the increment, decrement or no-change of the orientation. The reward is when a waypoint is reached for the first time, for a wall crash and otherwise. The simulation terminates when the agent reaches all the waypoints or a time limit. The walls cannot be crossed and the orientation is flipped when a crash occurs. We introduce a misstep probability
which is the probability for another action to be undertaken instead of the current one. A Gaussian noise of standard deviation is added to each component of the resulting state from a transition. The simulation settings are: ; ; ; (initial tree budget); that applies no orientation variation; (simulation horizon for ); ; . The provided map is the one depicted in Figure 5 with three waypoints. The different decision criteria parameters were tuned to: ; ; . We reserve the development of SDM-OLTA in the continuous case for future work. We generated episodes for each transition misstep probability and recorded the same performance measures as in the 1D track case. The results are presented in Figure 6.
OLUCT, SDSD-OLTA and RDV-OLTA achieved a comparable loss for every , which shows that our method is applicable to larger-scale problems than the 1D track environment. SDV-OLTA reached a lower level of performance. Plain OLTA still incurred the highest loss since it is highly sensitive to the stochasticity of the environment. In terms of both computational cost and number of calls to the generative model, the same trade-off between performance and computational cost is observed. Plain OLTA and SDV-OLTA considerably lowered the number of calls at the cost of performance, while SDSD-OLTA and RDV-OLTA achieved a better compromise. The number of calls to the generative model and the computational cost are quite similar, meaning that, even with the higher dimensionality of the PTSP compared to the 1D track, the cost incurred by the decision criteria computation is negligible in comparison to the one incurred by the re-planning procedure. Notice that SDV-OLTA achieved a good cost-performance trade-off in the 1D track environment but not in the PTSP, relative to the other algorithms. This is explained by the decision criteria’s sensitivity to parameter tuning and by the problem-dependent relevance of such a criterion. For the sake of completeness, we also ran experiments on the continuous 1D track and the discrete PTSP. The results are available in the Appendix of this paper. We chose to only illustrate the discrete 1D track and the continuous PTSP for the theoretical interest of the first one and the complexity of the second one.
We introduced OLTA, a new class of tree-search algorithms performing open loop control by re-using subsequent sub-trees of a main tree built with the OLUCT algorithm. A decision criterion based on the analysis of the current sub-tree allows the agent to efficiently determine whether the latter can be exploited. In practice, OLTA can achieve the same level of performance as OLUCT provided that the decision criterion is well designed. Furthermore, the computational cost is strongly lowered by decreasing the number of calls to the generative model. This saving is the main interest of the approach and can be exploited in two ways: it decreases the energy consumption, which is relevant for critical systems with low resources such as Unmanned Vehicles or Satellites; and it allows a system to re-allocate the computational effort to tasks other than controlling the robot. We emphasize the fact that this method is generic and can be combined with any tree-search algorithm other than OLUCT. Open questions include building decision criteria that are not problem-dependent, e.g. by making more restrictive hypotheses on the considered class of MDPs, but also applying the method to other benchmarks and other open loop planners.
This research was supported by the Occitanie region, France.
- Auer et al.  Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
- Bellemare et al.  Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. arXiv preprint arXiv:1707.06887, 2017.
- Bubeck and Munos  Sébastien Bubeck and Rémi Munos. Open loop optimistic planning. In COLT, 2010.
- De Maesschalck et al.  Roy De Maesschalck, Delphine Jouan-Rimbaud, and Désiré L. Massart. The Mahalanobis distance. Chemometrics and intelligent laboratory systems, 50(1):1–18, 2000.
- Enzenberger et al.  Markus Enzenberger, Martin Muller, Broderick Arneson, and Richard Segal. Fuego - an open-source framework for board games and Go engine based on Monte Carlo tree search. IEEE Transactions on Computational Intelligence and AI in Games, 2(4):259–270, 2010.
- Heusner  Manuel Heusner. UCT for pac-man. Bachelor thesis, Univ. of Basel, 2011.
- Kaelbling et al.  Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1):99–134, 1998.
- Keller and Helmert  Thomas Keller and Malte Helmert. Trial-based heuristic tree search for finite horizon MDPs. In ICAPS, 2013.
- Kocsis and Szepesvári  Levente Kocsis and Csaba Szepesvári. Bandit based Monte Carlo planning. In ECML, volume 6, pages 282–293. Springer, 2006.
- Perez et al. [2012a] Diego Perez, Philipp Rohlfshagen, and Simon M. Lucas. Monte Carlo tree search: Long-term versus short-term planning. In Computational Intelligence and Games (CIG), 2012 IEEE Conference on, pages 219–226. IEEE, 2012.
- Perez et al. [2012b] Diego Perez, Philipp Rohlfshagen, and Simon M. Lucas. The physical travelling salesman problem: WCCI 2012 competition. In Evolutionary Computation (CEC), 2012 IEEE Congress on, pages 1–8. IEEE, 2012.
- Puterman  Martin L. Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
- Rachelson and Lagoudakis  Emmanuel Rachelson and Michail G. Lagoudakis. On the locality of action domination in sequential decision making. In ISAIM, 2010.
- Silver and Veness  David Silver and Joel Veness. Monte Carlo planning in large POMDPs. In Advances in neural information processing systems, pages 2164–2172, 2010.
- Silver et al.  David Silver, Richard S. Sutton, and Martin Müller. Sample-based learning and search with permanent and transient memories. In Proceedings of the 25th international conference on Machine Learning, pages 968–975. ACM, 2008.
- Silver et al.  David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
- Sutton and Barto [1998] Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. MIT Press, Cambridge, 1998.
- Sutton [1991] Richard S. Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 2(4):160–163, 1991.
- Weinstein and Littman [2012] Ari Weinstein and Michael L. Littman. Bandit-based planning and learning in continuous-action Markov decision processes. In ICAPS, 2012.
- Wilson et al. [2014] Mark A. Wilson, James McMahon, and David W. Aha. Bounded expectations for discrepancy detection in goal-driven autonomy. In AI and Robotics: Papers from the AAAI Workshop, 2014.
- Wilson et al. [2016] Mark A. Wilson, James McMahon, Artur Wolek, David W. Aha, and Brian H. Houston. Toward goal reasoning for autonomous underwater vehicles: Responding to unexpected agents. In Goal Reasoning: Papers from the IJCAI Workshop, 2016.
We provide several additional experiments in settings similar to those presented earlier. They differ essentially in the transition from a discrete to a continuous state space and vice versa. In the paper, we chose to present mainly the two extreme cases, the 1D track and the continuous PTSP, the former for its theoretical interest and the latter for its complexity.
Continuous 1D track
To test the algorithm in a more complex setting than the discrete 1D track, we extended the latter to the continuous case. An illustration of the environment is provided in Figure 7.
The width of the track is , the action space is still with a magnitude of for each action. The agent starts at the middle state . The reward is zero everywhere except for the two terminal states whose values are and where it is . In addition to the transition misstep probability presented earlier, Gaussian noise is added to the resulting state after each transition. The simulations are performed with the following settings: ; ; (initial tree budget); ; (simulation horizon for ); ; ; and the generative model is the true model. The decision criteria parameters were selected empirically and set to the following values: ; ; . As in the continuous PTSP case, we leave the development of SDM-OLTA in the continuous case for future work. We generated episodes for each transition misstep probability. The results are presented in Figure 8. Again, a logarithm is applied for display purposes.
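To make the transition model described above concrete, the following is a minimal sketch of a generative model for a continuous 1D track. It is not the implementation used in the experiments: the function name `track_step`, the choice of applying the opposite action on a misstep, and every numeric default (`step_size`, `half_width`, `left_reward`, `right_reward`, and the values of `q` and `sigma` in the usage loop) are hypothetical placeholders chosen for illustration.

```python
import random


def track_step(state, action, q, sigma, rng,
               step_size=1.0, half_width=5.0,
               left_reward=0.0, right_reward=1.0):
    """Sample one transition of a continuous 1D track (illustrative only).

    `action` is -1 or +1. All default values are hypothetical placeholders,
    not the settings used in the experiments.
    """
    # Misstep: with probability q the opposite action is applied (assumption).
    if rng.random() < q:
        action = -action
    # Nominal displacement plus Gaussian noise on the resulting state.
    next_state = state + action * step_size + rng.gauss(0.0, sigma)
    # Reaching either end of the track terminates the episode.
    if next_state <= -half_width:
        return -half_width, left_reward, True
    if next_state >= half_width:
        return half_width, right_reward, True
    return next_state, 0.0, False


# Usage: roll out a fixed open loop plan (always +1) from the middle state.
rng = random.Random(0)
state, done = 0.0, False
for _ in range(1000):
    state, reward, done = track_step(state, +1, q=0.1, sigma=0.05, rng=rng)
    if done:
        break
```

Because of the added noise, the state reached after executing a plan almost never coincides exactly with a state stored in the tree, which is what distinguishes this setting from the discrete track.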
As in the discrete case, OLTA achieves a loss comparable to that of vanilla OLUCT. In particular, SDV-OLTA performs as well as OLUCT over the whole range of misstep probabilities. In this setting, SDSD-OLTA and RDV-OLTA achieve a loss intermediate between those of Plain-OLTA and OLUCT. In terms of computational cost, two behaviours are observed. For SDV-OLTA, the computational gain is significant for low transition misstep probabilities, and the cost catches up with that of OLUCT as the misstep probability increases. This allows the algorithm to achieve the same score as OLUCT. For SDSD-OLTA and RDV-OLTA, the computational gain appears to be constant over the whole range of transition misstep probabilities. However, their lower performance indicates that, as for Plain-OLTA, the decision criteria do not adapt well to increasing stochasticity, causing the algorithms to discard fewer trees than needed.
Notice that the computational cost achieved by the SDSD-OLTA algorithm is greater for than for . This is due to the fact that its criterion computes the distance between the current state of the agent and the empirical mean of the state distribution, normalised by the variance. The difference arises because the distribution is mono-modal for and bi-modal otherwise. In the latter case, the variance increases, so the normalisation decreases the value of the computed distance. Additionally, the empirical mean no longer corresponds to the mean of a mode but to a point between the two mode means, which acts in the opposite way by increasing the computed distance. In this setting, the interaction of the two mechanisms results in fewer sub-tree approvals for . In the discrete case, this does not occur since the current state lies exactly on the mean for because no Gaussian noise is added to the state transition. As a result, the distance is always zero.
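The two opposing mechanisms discussed above can be illustrated with a small numerical sketch. It is only an approximation of the SDSD criterion under stated assumptions: the function name `normalised_distance` is hypothetical, the distance is divided by the empirical standard deviation (the exact normalisation used by the criterion may differ), and the sample values are invented for illustration.

```python
from statistics import fmean, pstdev


def normalised_distance(current_state, sampled_states):
    """Distance from the current state to the empirical mean of the sampled
    states, divided by their empirical standard deviation (assumed form)."""
    mean = fmean(sampled_states)
    spread = pstdev(sampled_states)
    if spread == 0.0:
        # Degenerate distribution: every sample is the same state.
        return 0.0 if current_state == mean else float("inf")
    return abs(current_state - mean) / spread


# Invented samples of the state reached after one action.
unimodal = [0.9, 1.0, 1.1]                   # a single mode around 1.0
bimodal = [-1.1, -1.0, -0.9, 0.9, 1.0, 1.1]  # a second mode from missteps
discrete = [1.0, 1.0, 1.0]                   # no noise, no misstep

d_uni = normalised_distance(1.05, unimodal)
d_bi = normalised_distance(1.05, bimodal)
d_disc = normalised_distance(1.0, discrete)
print(d_uni, d_bi, d_disc)
```

In the bimodal sample the spread is much larger, which shrinks the distance, while the empirical mean moves between the two modes, which enlarges it. For these particular invented values the second effect dominates; with other values the balance can differ. The degenerate sample reproduces the discrete case, where the distance is zero.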
Discrete Physical Travelling Salesman Problem
We restricted the PTSP to the discrete case. The resulting problem is a grid-world navigation problem as illustrated in Figure 9.
As in the continuous case, the state of the agent is characterized by , respectively the position in the 2D grid-world, the orientation and the velocity. In our case, we set the velocity to 1 so that the agent can only reach adjacent cells. The action space is , each action being the direction of the next adjacent cell reached by the agent. The reward is set to when a waypoint is reached for the first time and to elsewhere. We did not penalize crashes in the discrete setting because, given the agility provided by the action space, this would result in the agent being stuck in cells far away from the walls. We introduce the same misstep probability as in the continuous PTSP, namely the probability that another action is undertaken instead of the selected one. The simulations are performed with the following settings: ; ; (initial tree budget); that applies no orientation variation; (simulation horizon for ); ; . The provided map is the one depicted in Figure 9, with six waypoints. The decision criteria parameters were selected empirically and set as follows: ; ; ; . Additionally, we gave Plain OLTA the ability to discard a sub-tree if the recommended action was not available, i.e. leading to a wall. We generated episodes for each transition misstep probability. The results are presented in Figure 10.
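The grid-world dynamics described above can be sketched as follows. This is an illustrative reconstruction, not the code used in the experiments: the name `grid_step`, the four-direction action set, the rule that a move into a wall leaves the agent in place, and the reward values 1.0 and 0.0 are all assumptions, and the orientation component of the state is omitted for brevity.

```python
import random

# Hypothetical four-direction action set; the actual action space may differ.
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


def grid_step(pos, action, visited, waypoints, walls, q, rng):
    """Sample (next_pos, visited, reward) for one move on the grid."""
    # Misstep: with probability q another action is undertaken instead.
    if rng.random() < q:
        action = rng.choice([a for a in ACTIONS if a != action])
    dx, dy = ACTIONS[action]
    target = (pos[0] + dx, pos[1] + dy)
    # Crashes are not penalized; here a blocked move leaves the agent
    # where it was (assumption).
    if target in walls:
        target = pos
    # A waypoint is rewarded only the first time it is reached.
    if target in waypoints and target not in visited:
        return target, visited | {target}, 1.0
    return target, visited, 0.0


# Usage on a tiny invented map: one waypoint to the right, one wall below.
rng = random.Random(0)
pos, visited, reward = grid_step((0, 0), "right", frozenset(),
                                 waypoints={(1, 0)}, walls={(0, 1)},
                                 q=0.0, rng=rng)
```

Under these assumptions, a check such as `target in walls` is also what a Plain OLTA variant could use to discard a sub-tree whose recommended action leads to a wall.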
As in the continuous case, OLTA achieves a loss comparable to that of vanilla OLUCT. In particular, SDV-OLTA and SDSD-OLTA perform very similarly over most of the range of misstep probabilities. SDM-OLTA, Plain OLTA and RDV-OLTA achieve poorer performance, which remains comparable given the high variance of the losses. In terms of computational cost, all the variations of OLTA outperform OLUCT with an approximately constant gain. For each of them, this gain comes with an increase in the achieved loss, so that each algorithm attains a different compromise between performance and computational cost gain.