Simplified Belief-Dependent Reward MCTS Planning with Guaranteed Tree Consistency

05/29/2021 ∙ by Ori Sztyglic, et al. ∙ Technion

Partially Observable Markov Decision Processes (POMDPs) are notoriously hard to solve. Most state-of-the-art online solvers leverage ideas from Monte Carlo Tree Search (MCTS). These solvers rapidly converge to the most promising branches of the belief tree, avoiding the suboptimal sections. Most of these algorithms are designed to utilize straightforward access to the state reward and assume the belief-dependent reward is nothing but an expectation over the state reward. Thus, they are inapplicable to the more general and essential setting of belief-dependent rewards. One example of such a reward is differential entropy approximated using a set of weighted particles of the belief. Such an information-theoretic reward introduces a significant computational burden. In this paper, we embed the paradigm of simplification into the MCTS algorithm. In particular, we present Simplified Information-Theoretic Particle Filter Tree (SITH-PFT), a novel variant of the MCTS algorithm that considers information-theoretic rewards but avoids the need to calculate them completely. We replace the costly calculation of information-theoretic rewards with adaptive upper and lower bounds. These bounds are easy to calculate and are tightened only when our algorithm demands it. Crucially, we guarantee precisely the same belief tree and solution that would be obtained by an MCTS that explicitly calculates the original information-theoretic rewards. Our approach is general; namely, any bounds that converge to the reward can be easily plugged in to achieve substantial speedup without any loss in performance.




1 Introduction

1.1 POMDPs

Figure 1: Illustration of our approach. The circles denote the belief nodes, and the rectangles represent the belief-action nodes. Rollouts, emanating from each belief node, are indicated by dashed lines ending in triangles. (a) The simulation starts from the root of the tree, but at some belief node it cannot continue because the bounds of the child nodes (colored red) overlap. (b) One of the red belief-action nodes is chosen, and resimplification is triggered from it down the tree to the leaves (shaded green area in the tree). The beliefs and rollouts inside the green area (colored light brown) undergo resimplification if so decided. This procedure results in tighter bounds. (c) After the bounds have become tighter, nothing prevents SITH-PFT from continuing down from that node, guaranteeing Tree Consistency. If needed, additional resimplifications can be commenced.

POMDPs have proven to be a celebrated mathematical framework for planning under uncertainty [Kurniawati et al. (2008); Silver and Veness (2010); Ye et al. (2017); Sunberg and Kochenderfer (2018); Garg et al. (2019)]. During the planning session, an agent is given some goal and attempts to find the optimal action to execute. Since the agent operates under uncertainty, it maintains a belief over the state and reasons about its evolution while planning. The standard way to represent a general belief distribution is by a set of weighted particles. In a finite horizon setting, the agent performs planning with a fixed number of steps ahead of time. Equipped with motion and observation models, the agent has to consider every possible realization of the future observations for every available reactive action sequence (policy) of the length of the horizon. In a sampled form, this abundance of possible realizations of action observation pairs constitutes a belief tree. Building the full belief tree is intractable since each node in the tree repeatedly branches with all possible actions and all possible observations. The number of nodes grows exponentially with the horizon. Additionally, the number of possible states grows exponentially with the state space dimension, and consequently, an adequate representation of the belief requires more particles. Those last two problems are known as the curse of history and the curse of dimensionality respectively. MCTS based algorithms tackle those problems by (a) building the belief tree incrementally and revealing only the “promising” parts of the tree, and (b) representing the belief as a fixed size set of weighted state samples (particles). An inherent part of MCTS based algorithms is the Upper Confidence Bound (UCB) technique Kocsis and Szepesvári (2006) designed to balance exploration and exploitation while building the belief tree. 
This technique assumes that calculating the reward over the belief node does not pose any computational difficulty. Information-theoretic rewards violate this assumption.

1.2 Related Work

Incorporation of information-theoretic reward into POMDP is a long-standing effort. Earlier attempts such as Dressel and Kochenderfer (2017) tackled offline solvers. Monte Carlo Tree Search made a significant breakthrough in overcoming the curse of history. However, when the reward is a general function of the belief, the origin of the computational burden shifts towards the reward calculation. Moreover, a belief-dependent reward requires the complete set of belief particles at each node in the belief tree. Therefore, algorithms such as POMCP Silver and Veness (2010), and its numerous predecessors are inapplicable since they simulate a single particle at a time down the tree when expanding it. DESPOT based algorithms behave similarly Ye et al. (2017), with DESPOT-α as an exception Garg et al. (2019). DESPOT-α simulates a complete set of particles. However, this algorithm depends on α-vectors. In particular, as in other DESPOT-like algorithms, the belief tree is determinized. Therefore, sibling belief nodes have identical particles and are distinct solely by the weights. DESPOT-α leverages this property and uses the α-vectors to efficiently approximate the lower bound of the value function of the sibling belief nodes without expanding them. Since DESPOT-like methods are based on gap heuristic search Kochenderfer et al. (2022), this lower bound is an essential part of the exploration strategy. In other words, the DESPOT-α tree is built using α-vectors, such that they are an indispensable part of the algorithm. Note that an integral part of this approach is that the reward is state dependent, and the reward over the belief is merely an expectation over the state reward. DESPOT-α does not support belief-dependent rewards since they contradict the application of the α-vectors. The only approach posing no restrictions on the structure of the belief-dependent reward and not suffering from limiting assumptions is Particle Filter Tree (PFT). The idea behind PFT is to apply MCTS over the Belief-MDP. Sunberg and Kochenderfer (2018) augmented PFT with Double Progressive Widening and coined the name PFT-DPW. PFT-DPW utilizes the UCB strategy and maintains a complete belief particle set at each belief tree node. Recently, Fischer and Tas (2020) presented Information Particle Filter Tree (IPFT), a method to incorporate information-theoretic rewards into PFT. The IPFT planner is remarkably fast. It simulates small subsets of particles sampled from the root of the belief tree and averages entropies calculated over these subsets. However, differential entropy estimated from a small-sized particle set can be significantly biased. This bias is unpredictable and unbounded, and therefore severely impairs the performance of the algorithm. In other words, speed comes at the expense of quality. Oftentimes the policy defined by this algorithm is suboptimal.

Fischer and Tas (2020) provide guarantees solely for the asymptotic case, i.e., when the number of state samples (particles) tends to infinity. Asymptotically their algorithm behaves precisely as PFT-DPW in terms of running speed and performance. Yet, in practice the performance of IPFT in terms of optimality can degrade severely compared to PFT-DPW. Moreover, Fischer and Tas (2020) do not provide any reliable comparison of IPFT against PFT-DPW with an information-theoretic reward. Prompted by this insight, we chose PFT-DPW as our baseline approach, which we aim to accelerate. In contrast to IPFT, our approach explicitly guarantees an identical solution to PFT-DPW with an information-theoretic reward, for any size of the particle set representing the belief and serving as input to PFT-DPW.

The computational burden incurred by the complexity of POMDP planning inspired many research works to focus on approximations of the problem, e.g., Hoerger et al. (2019). Typically, approximation based planners show asymptotical guarantees, e.g., the convergence of the algorithms. Recently, the novel paradigm of simplification has appeared in literature Zhitnikov and Indelman (2021); Sztyglic and Indelman (2021); Elimelech and Indelman (2018). The simplification is concerned with carefully replacing the nonessential elements of the decision making problem and quantifying the impact of this relaxation. Specifically, simplification methods are accompanied by stringent guarantees.

1.3 Contribution

We provide a novel algorithmic framework based on converging bounds on a belief-dependent reward. Our method is guaranteed to yield the same action and belief tree as the most general algorithm suitable for such belief-dependent rewards (PFT-DPW). The proposed technique is applicable with any bounds that converge to the reward. In this paper, we focus on information-theoretic rewards, in particular, differential entropy.

2 Background

2.1 POMDPs with belief-dependent rewards

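A POMDP is a tuple $\langle \mathcal{X}, \mathcal{A}, \mathcal{Z}, T, O, \rho, \gamma, b_0 \rangle$, where $\mathcal{X}, \mathcal{A}, \mathcal{Z}$ are the state, action, and observation spaces with $x \in \mathcal{X}$, $a \in \mathcal{A}$, $z \in \mathcal{Z}$ the momentary state, action, and observation, respectively; $T(x, a, x') = \mathbb{P}(x' \mid x, a)$ is the stochastic transition model from the past state $x$ to the subsequent state $x'$ through action $a$; $O(z, x) = \mathbb{P}(z \mid x)$ is the stochastic observation model; $\gamma$ is the discount factor; $b_0$ is the belief over the initial state (prior); and $\rho$ is the reward operator. Let $h_k = \{b_0, a_0, z_1, \dots, a_{k-1}, z_k\}$ denote the history of actions and observations obtained by the agent up to time instance $k$, together with the prior belief. The posterior belief is given by $b_k(x_k) = \mathbb{P}(x_k \mid h_k)$. The policy is a mapping from the belief space to the action space, $a_k = \pi_k(b_k)$. The policy for $L$ consecutive steps ahead is denoted by $\pi_{k:k+L-1}$, abbreviated $\pi$. The decision making goal is to find the optimal policy maximizing

$$V^{\pi}(b_k) = \mathbb{E}_{z_{k+1:k+L}}\left[\sum_{i=k}^{k+L-1} \gamma^{i-k}\, \rho\big(b_i, \pi_i(b_i), z_{i+1}, b_{i+1}\big)\right], \qquad b_{i+1} = \psi\big(b_i, \pi_i(b_i), z_{i+1}\big), \tag{1}$$

where $\psi$ is the Bayesian belief update method. The Bellman form of (1) is $V^{\pi}(b_k) = Q^{\pi}(b_k, \pi_k(b_k))$, where

$$Q^{\pi}(b_k, a_k) = \mathbb{E}_{z_{k+1}}\left[\rho(b_k, a_k, z_{k+1}, b_{k+1}) + \gamma V^{\pi}(b_{k+1})\right]. \tag{2}$$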


In our generalized formulation, the reward is a function of two subsequent beliefs, an action, and an observation. Specifically, our reward is

$$\rho(b, a, z', b') = (1-\lambda)\, \mathbb{E}_{x}\left[r(x, a)\right] + \lambda\, R(b, a, z', b'), \tag{3}$$

where $r(x, a)$ is the state and action dependent reward, and $\mathbb{E}_{x}$ is the expectation with regard to the state. $R(b, a, z', b')$ is an information-theoretic reward weighted by $\lambda$, which in general can depend on consecutive beliefs and the elements relating them (e.g., information gain). Since such rewards do not commonly have a closed-form expression for nonparametric beliefs represented by a set of samples, one has to consider an estimator $\hat{R}$ of $R$ (e.g., Boers et al. (2010)). As shall be seen, our chosen estimator also requires the previous belief $b$, the chosen action $a$, and the received observation $z'$. Depending on the estimation method, the inputs can vary. Using the structure of (3),

$$Q^{\pi}(b, a) = (1-\lambda)\, Q^{x}(b, a) + \lambda\, Q^{I}(b, a), \tag{4}$$

where $Q^{x}$ is induced by the state dependent rewards and $Q^{I}$ by the information-theoretic rewards. They are constituted by elements of the form $\mathbb{E}_{x}[r(x, a)]$ and $\hat{R}(b, a, z', b')$, respectively. The element $\mathbb{E}_{x}[r(x, a)]$ is easy to calculate and thus outside our focus, whereas $\hat{R}(b, a, z', b')$ is computationally expensive. From here on, for the sake of clarity, we will use the notations $Q(b_k, a)$ and $Q(h_k a)$ interchangeably.

2.2 MCTS over Belief-MDP (PFT)

In this section, we outline the UCB based MCTS over the Belief-MDP. The algorithm constructs the policy tree by executing multiple simulations. Each simulation adds a single belief node to the belief tree or terminates at a terminal state or action. To steer towards deeper and more beneficial simulations, MCTS chooses the action at each belief node according to the following rule

$$a^{*} = \arg\max_{a \in \mathcal{A}} \mathrm{UCB}(ha), \qquad \mathrm{UCB}(ha) \triangleq \hat{Q}(ha) + c \sqrt{\frac{\log N(h)}{N(ha)}}, \tag{5}$$

where $N(h)$ is the visitation count of the belief node defined by the history $h$, $N(ha)$ is the visitation count of the belief-action node, $c$ is the exploration parameter, and $\hat{Q}(ha)$ is the approximation of the belief-action value function for node $ha$ obtained by simulations. When the action is selected, a question arises whether to open a new branch in terms of observation and posterior belief or to continue through one of the existing branches. In continuous spaces, this is resolved by the Progressive Widening technique Sunberg and Kochenderfer (2018). If a new branch is expanded, an observation $z$ is created from a state $x$ drawn from the belief $b$.
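The selection rule and the widening test described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the dictionary-based node statistics, and the convention of trying unvisited actions first are assumptions made here, and the widening parameters `k_o` and `alpha_o` follow the usual progressive-widening convention rather than anything stated in this paper.

```python
import math

def ucb_select(q_hat, n_h, n_ha, c):
    """Pick the action maximizing Q_hat(ha) + c * sqrt(log N(h) / N(ha))."""
    best_action, best_score = None, -math.inf
    for action, q in q_hat.items():
        if n_ha[action] == 0:
            return action  # assumed convention: unvisited actions are tried first
        score = q + c * math.sqrt(math.log(n_h) / n_ha[action])
        if score > best_score:
            best_action, best_score = action, score
    return best_action

def should_widen(num_children, n_ha, k_o, alpha_o):
    """Progressive widening: open a new observation branch while the number
    of children does not exceed k_o * N(ha) ** alpha_o."""
    return num_children <= k_o * n_ha ** alpha_o

# Hypothetical statistics of a belief node with two actions.
q_hat = {"left": 1.0, "right": 0.5}
n_ha = {"left": 10, "right": 1}
explore = ucb_select(q_hat, 11, n_ha, c=1.0)  # rarely visited action gets a bonus
exploit = ucb_select(q_hat, 11, n_ha, c=0.0)  # nullified exploration constant
```

With a nullified exploration constant the rule reduces to a greedy choice over the value estimates, which is how the final action is returned at the end of planning.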

3 Approach

SITH-PFT (Alg. 1) follows the same algorithmic baseline as PFT. We adhere to the conventional notation of Sunberg and Kochenderfer (2018) and denote by $G_{PF(m)}(b, a, z)$ a generative model receiving as input the belief $b$, an action $a$, and an observation $z$, and producing the posterior belief $b'$ and the mean reward over the state. For the belief update, we use a particle filter based on $m$ belief samples. Instead of calculating the immediate information-theoretic rewards and the corresponding $Q^{I}$ function estimates, we calculate low-cost lower and upper bounds $l$, $u$ over the information-theoretic rewards and the corresponding bounds $\underline{Q}^{I}$, $\overline{Q}^{I}$ over the $Q^{I}$ function. These bounds are adaptive and can be tightened on demand. We call the process of tightening “resimplification”.
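For readers unfamiliar with particle-filter belief updates, the following is a minimal bootstrap-style sketch of one update step on a weighted particle set. It is not the filter used in the paper: resampling is omitted, and the `transition` and `likelihood` callables as well as the 1D models in the usage lines are hypothetical placeholders.

```python
import math

def particle_filter_update(particles, weights, action, observation,
                           transition, likelihood):
    """One particle-filter step (propagate, re-weight, normalize), no resampling."""
    # Propagate every particle through the motion model.
    propagated = [transition(x, action) for x in particles]
    # Re-weight each particle by the likelihood of the received observation.
    unnormalized = [w * likelihood(observation, x)
                    for w, x in zip(weights, propagated)]
    total = sum(unnormalized)
    if total <= 0.0:
        raise ValueError("observation has zero likelihood under all particles")
    return propagated, [w / total for w in unnormalized]

# Hypothetical 1D models: deterministic motion and a Gaussian-shaped likelihood.
move = lambda x, a: x + a
gauss = lambda z, x: math.exp(-0.5 * (z - x) ** 2)
new_particles, new_weights = particle_filter_update(
    [0.0, 1.0, 2.0], [1 / 3] * 3, 1.0, 3.0, move, gauss)
```

The posterior keeps the full particle set with updated weights, which is what a belief-dependent reward needs at every node of the tree.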

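1:procedure Plan(belief: $b$)
2:     $h \gets \emptyset$
3:     for $i \in 1:n$ do
4:         Simulate($b$, $d_{max}$, $h$)
5:     end for
6:     return Action Selection($b$, $h$)    ▷ called with nullified exploration constant $c$
7:end procedure
8:procedure Simulate(belief: $b$, depth: $d$, history: $h$)
9:     if $d = 0$ then
10:         return 0
11:     end if
12:     $a \gets$ Action Selection($b$, $h$)
13:     if $|C(ha)| \le k_o N(ha)^{\alpha_o}$ then
14:         $z \gets$ sample $x$ from $b$, generate $z$ from $(x, a)$
15:         $b', r \gets G_{PF(m)}(b, a, z)$
16:         Calculate initial $l, u$ for $b'$ based on minimal simp. level
17:         $C(ha) \gets C(ha) \cup \{(b', r, l, u)\}$
18:         $R, L, U \gets (r, l, u) + \gamma\,$Rollout($b'$, $haz$, $d-1$)
19:     else
20:         $(b', r, l, u) \gets$ sample uniformly from $C(ha)$
21:         $R, L, U \gets (r, l, u) + \gamma\,$Simulate($b'$, $haz$, $d-1$)
22:     end if
23:     if deepest resimplification depth $< d$ then    ▷ accounting for bounds updated deeper in the tree, see Section 3.4.3
24:         reconstruct $\underline{Q}^{I}(ha)$, $\overline{Q}^{I}(ha)$
25:     end if
26:     $N(h) \gets N(h) + 1$
27:     $N(ha) \gets N(ha) + 1$
28:     $Q^{x}(ha) \gets Q^{x}(ha) + \frac{R - Q^{x}(ha)}{N(ha)}$
29:     $\underline{Q}^{I}(ha) \gets \underline{Q}^{I}(ha) + \frac{L - \underline{Q}^{I}(ha)}{N(ha)}$
30:     $\overline{Q}^{I}(ha) \gets \overline{Q}^{I}(ha) + \frac{U - \overline{Q}^{I}(ha)}{N(ha)}$
31:     return $R, L, U$
32:end procedure
Algorithm 1 SITH-PFT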

In turn, these bounds induce bounds over UCB. As we discuss in detail next, an essential aspect of our approach is using these bounds to achieve exactly the same action selection as UCB without exactly calculating the $Q$ function and UCB. To this end we present a novel action selection method (Alg. 2). Crucially, by tightening the bounds only to the minimal extent needed (Alg. 3), we guarantee the same tree connectivity and the same calculated optimal action as PFT-DPW, but faster. We devote the subsequent section to the bounds and explain how they pertain to SITH-PFT.

3.1 Information theoretic bounds

In the setting of a continuous state space and a nonparametric belief represented by $n_x$ weighted particles $\{(x^i, w^i)\}_{i=1}^{n_x}$, the estimation of differential entropy is not a simple task. Typically, the complexity of such estimators is quadratic in the number of particles Fischer and Tas (2020); Boers et al. (2010). We use the estimator of Boers et al. (2010) as a reward function and utilize the bounds over it developed by Sztyglic and Indelman (2021). The bounds can be tightened on demand incrementally without overhead. Namely, after enough bounds-tightening iterations they coincide with the reward itself, and the entire calculation is time-equivalent to calculating the original reward. We define the bounds over the minus differential entropy estimator $-\hat{\mathcal{H}}$ for $b'$ as (see supplementary 6.1 for the full terms)

$$l^{s}(b, a, z', b') \le -\hat{\mathcal{H}}(b, a, z', b') \le u^{s}(b, a, z', b'), \tag{6}$$

where $s \in \{1, \dots, n_s\}$ is the discrete level of simplification. Higher levels of simplification correspond to tighter bounds, and lower levels correspond to looser bounds. Each simplification level is associated with two sets of particle indices, one for each of the consecutive beliefs $b$ and $b'$, which are each represented as a set of weighted particles. We keep track of the indices of the particles that were chosen for the bounds calculation. Each subsequent level (low to high) defines larger sets of indices. Sometimes the bounds are not close enough to select the same action as UCB. In this case, our modified action selection routine triggers the resimplification process. When resimplification is carried out, new indices are drawn from the particles not yet selected in each of the two beliefs and added to the corresponding index sets. This operation promotes the simplification level from $s$ to $s+1$ and defines the index sets of level $s+1$. Importantly, increasing the simplification level is done incrementally (as introduced by Sztyglic and Indelman (2021)). Thus, when we refine the bounds (Alg. 3 lines 3, 12, 18) from the lowest simplification level all the way to the highest (worst-case scenario), the time complexity is equivalent to that of calculating $\hat{\mathcal{H}}$. When $s = n_s$, it holds that $l^{s} = u^{s} = -\hat{\mathcal{H}}$. Importantly, by caching the shared calculations of the two bounds, we never repeat the calculation of these values and obtain maximal speedup. The immediate bounds (6) induce bounds over $\hat{Q}^{I}$. In MCTS, the approximation $\hat{Q}$ is a mean over simulations. Each simulation yields a sum of discounted cumulative rewards. Therefore, if we replace the reward with the bounds from (6), we get corresponding discounted cumulative upper and lower bounds. Averaging the simulations in the same manner (Alg. 1 lines 29-30) yields

$$\underline{Q}^{I}(ha) \le \hat{Q}^{I}(ha) \le \overline{Q}^{I}(ha). \tag{7}$$
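To make the idea of incrementally tightened, cached bounds concrete, here is a toy Python sketch. It deliberately does not implement the entropy bounds of Sztyglic and Indelman (2021): the "reward" is a hypothetical weighted sum of expensive terms, each known a priori to lie in `[f_min, f_max]`, and each refinement evaluates exactly one more term. The class name and all parameters are illustrative assumptions.

```python
class IncrementalBounds:
    """Toy adaptive bounds on sum_i w_i * f(i), assuming f_min <= f(i) <= f_max."""

    def __init__(self, weights, term, f_min, f_max):
        self.weights, self.term = weights, term
        self.f_min, self.f_max = f_min, f_max
        self.level = 0              # number of terms evaluated exactly so far
        self.partial = 0.0          # cached exact part, never recomputed
        self.rest = sum(weights)    # total weight of the unevaluated terms

    def refine(self):
        """Promote the level by one: evaluate a single additional term."""
        if self.level < len(self.weights):
            w = self.weights[self.level]
            self.partial += w * self.term(self.level)
            self.level += 1
            self.rest = self.rest - w if self.level < len(self.weights) else 0.0

    def bounds(self):
        return (self.partial + self.rest * self.f_min,
                self.partial + self.rest * self.f_max)

values = [0.5, 0.0, 1.0, 0.25]  # hypothetical expensive terms in [0, 1]
ib = IncrementalBounds([0.25] * 4, lambda i: values[i], 0.0, 1.0)
history = [ib.bounds()]
for _ in values:
    ib.refine()
    history.append(ib.bounds())
exact = sum(0.25 * v for v in values)
```

In this toy setting, refining from the lowest level to the highest costs exactly as many term evaluations as computing the exact value once, mirroring the worst-case argument above.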


3.2 UCB bounds

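Since the MCTS tree is built upon (5), using (4) and (7) we denote the UCB upper and lower bounds as

$$\overline{\mathrm{UCB}}(ha) \triangleq (1-\lambda)\, \hat{Q}^{x}(ha) + \lambda\, \overline{Q}^{I}(ha) + c \sqrt{\frac{\log N(h)}{N(ha)}}, \tag{8}$$

$$\underline{\mathrm{UCB}}(ha) \triangleq (1-\lambda)\, \hat{Q}^{x}(ha) + \lambda\, \underline{Q}^{I}(ha) + c \sqrt{\frac{\log N(h)}{N(ha)}}. \tag{9}$$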


3.3 Guaranteed belief tree consistency

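1:procedure Action Selection($b$, $h$)
2:     while true do
3:         Status, $a \gets$ Select Best($b$, $h$)
4:         if Status then
5:              break
6:         else
7:              for all $b' \in C(ha)$ do
8:                  Resimplify($b'$, $haz$)
9:              end for
10:              reconstruct $\underline{Q}^{I}(ha)$, $\overline{Q}^{I}(ha)$
11:         end if
12:     end while
13:     return $a$
14:end procedure
15:procedure Select Best($b$, $h$)
16:     Status $\gets$ true
17:     $\tilde{a} \gets \arg\max_{a} \underline{\mathrm{UCB}}(ha)$
18:     gap $\gets 0$
19:     child-to-resimplify $\gets \tilde{a}$
20:     for all $ha$ children of $b$ do
21:         if $a \ne \tilde{a}$ and $\underline{\mathrm{UCB}}(h\tilde{a}) < \overline{\mathrm{UCB}}(ha)$ then
22:              Status $\gets$ false
23:              if $G(ha) >$ gap then
24:                  gap $\gets G(ha)$
25:                  child-to-resimplify $\gets a$
26:              end if
27:         end if
28:     end for
29:     return Status, child-to-resimplify
30:end procedure
Algorithm 2 Action Selection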

In this section, we define the Tree Consistency and explain and prove the equivalence of our algorithm to PFT-DPW.

Definition 1 (Tree consistent algorithms).

Consider two algorithms, constructing a belief tree. Assume every common sampling operation for the two algorithms uses the same seed. The two algorithms are tree consistent if two belief trees constructed by the algorithms are identical in terms of actions, observations, and visitation counts.
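Definition 1 can be checked mechanically. The sketch below assumes a hypothetical tree representation in which every node is a dictionary holding its visitation count `"n"` and a `"children"` mapping keyed by the action or observation label of the outgoing edge; neither the representation nor the function name comes from the paper.

```python
def trees_consistent(t1, t2):
    """True if both trees agree on visitation counts and on the action and
    observation labels of every branch, recursively."""
    if t1["n"] != t2["n"]:
        return False
    if t1["children"].keys() != t2["children"].keys():
        return False
    return all(trees_consistent(t1["children"][k], t2["children"][k])
               for k in t1["children"])

def leaf(n):
    return {"n": n, "children": {}}

# Root belief -> action "up" -> observation "z1" (hypothetical labels).
tree_a = {"n": 3, "children": {"up": {"n": 3, "children": {"z1": leaf(2)}}}}
tree_b = {"n": 3, "children": {"up": {"n": 3, "children": {"z1": leaf(2)}}}}
tree_c = {"n": 3, "children": {"up": {"n": 3, "children": {"z1": leaf(1)}}}}
same = trees_consistent(tree_a, tree_b)
different = trees_consistent(tree_a, tree_c)
```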

Our approach leans on a specific action selection procedure inside the tree, which differs from that of PFT. At every belief node we mark as the candidate action the one that maximizes the lower bound,

$$\tilde{a} = \arg\max_{a \in \mathcal{A}} \underline{\mathrm{UCB}}(ha). \tag{10}$$

If $\underline{\mathrm{UCB}}(h\tilde{a}) \ge \overline{\mathrm{UCB}}(ha)$ for all $a \ne \tilde{a}$, there is no overlap (Fig. 1 (c)) and we can announce that $\tilde{a}$ is identical to $a^{*}$, i.e., the action that would be returned by PFT using (5), and the tree consistency is not compromised. Otherwise, the bounds need to be tightened so that we can guarantee the tree consistency. We examine the siblings of $h\tilde{a}$ fulfilling $\underline{\mathrm{UCB}}(h\tilde{a}) < \overline{\mathrm{UCB}}(ha)$ (Fig. 1 (a)). Our next step is to tighten the bounds via resimplification (Fig. 1 (b)) until there is no overlap. When some sibling nodes have overlapping bounds, we strive to avoid tightening all of them at once since fewer resimplifications lead to a greater speedup.
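The overlap test above can be illustrated with a short Python sketch. The function name and the dictionary inputs (one lower and one upper UCB bound per action) are assumptions made for illustration; the logic is the candidate choice by the largest lower bound followed by a strict overlap check against every sibling.

```python
def select_candidate(lcb, ucb):
    """Return the candidate action and the siblings whose upper bound
    still exceeds the candidate's lower bound."""
    candidate = max(lcb, key=lcb.get)
    overlapping = [a for a in ucb
                   if a != candidate and lcb[candidate] < ucb[a]]
    return candidate, overlapping

# Hypothetical lower and upper UCB bounds of three belief-action nodes.
lcb = {"a1": 1.0, "a2": 0.2, "a3": 0.5}
ucb = {"a1": 1.5, "a2": 0.8, "a3": 1.2}
candidate, overlapping = select_candidate(lcb, ucb)

# After tightening a3, its upper bound drops below the candidate's lower bound.
ucb_tight = dict(ucb, a3=0.9)
_, overlapping_after = select_candidate(lcb, ucb_tight)
```

An empty overlap list means the candidate can be declared the selected action without computing any bound exactly.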

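1:procedure Resimplify($b$, $h$)
2:     if $b$ is a leaf then
3:         Refine($b$)
4:         Resimplify Rollout($b$, $h$)
5:         return
6:     end if
7:     $\tilde{a} \gets \arg\max_{a} N(ha) \cdot G(ha)$
8:     for all $b' \in C(h\tilde{a})$ do
9:         Resimplify($b'$, $h\tilde{a}z$)
10:     end for
11:     reconstruct $\underline{Q}^{I}(h\tilde{a})$, $\overline{Q}^{I}(h\tilde{a})$
12:     Refine($b$)
13:     Resimplify Rollout($b$, $h$)
14:     return
15:end procedure
16:procedure Resimplify Rollout($b$, $h$)
17:     $b^{rollout} \gets$ find weakest link in rollout
18:     Refine($b^{rollout}$)
19:end procedure
20:procedure Refine($b$)
21:     if (12) holds for $b$, refine its $l$, $u$ and promote its simplification level
22:end procedure
Algorithm 3 Resimplification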

Thus, among them we pick a single node that induces the biggest “gap”, denoted by $G$, between its bounds (see Alg. 2 lines 20-28), where

$$G(ha) \triangleq \overline{\mathrm{UCB}}(ha) - \underline{\mathrm{UCB}}(ha). \tag{11}$$

Further, we tighten the bounds down the branch of the chosen node (see Alg. 2 lines 7-9) for every member of $C(ha)$, the set of children of $ha$. Since the bounds converge to the actual information reward, we can guarantee that the algorithm will pick a single action after a finite number of “bounds-tightening” iterations (resimplifications); thus, tree consistency is assured. In the following section, we delve into the resimplification procedure.
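The loop of selecting a candidate, finding the overlapping sibling with the largest gap, and tightening only that sibling can be sketched as follows. `ToyNode` is a hypothetical stand-in for a belief-action node: its bounds shrink linearly toward a hidden exact value and coincide with it after `max_level` refinements, which plays the role of a converging, finite-time strategy. None of these names appear in the paper.

```python
class ToyNode:
    """Hypothetical node whose bounds collapse onto `value` at `max_level`."""

    def __init__(self, value, slack0, max_level):
        self.value, self.slack0, self.max_level = value, slack0, max_level
        self.level = 0

    @property
    def slack(self):
        return self.slack0 * (self.max_level - self.level) / self.max_level

    @property
    def lower(self):
        return self.value - self.slack

    @property
    def upper(self):
        return self.value + self.slack

    def resimplify(self):
        if self.level < self.max_level:
            self.level += 1

def action_selection(nodes):
    """Tighten one overlapping sibling at a time until a single action remains."""
    calls = 0
    while True:
        cand = max(nodes, key=lambda a: nodes[a].lower)
        overlapping = [a for a in nodes
                       if a != cand and nodes[cand].lower < nodes[a].upper]
        if not overlapping:
            return cand, calls
        worst = max(overlapping, key=lambda a: nodes[a].upper - nodes[a].lower)
        nodes[worst].resimplify()  # only the biggest-gap sibling is refined
        calls += 1

nodes = {"a1": ToyNode(1.0, 1.0, 4),
         "a2": ToyNode(0.75, 0.5, 4),
         "a3": ToyNode(0.0, 0.25, 4)}
action, calls = action_selection(nodes)
```

In this toy run the clearly inferior action is never refined, and the returned action matches the one obtained from the exact values.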

3.4 Resimplification

In this section, we explain how resimplification is done. The algorithmic scheme is formulated in a general manner. However, it is guided by a specific strategy meant to minimize the number of times we tighten the bounds (as mentioned in Sec. 3.3). We denote this strategy as Resimplification Strategy. We assume this strategy satisfies two conditions to guarantee tree consistency.

Assumption 1 (Convergence).

When using a converging strategy, each call to resimplify on the children of a belief-action node $ha$ tightens the bounds of $ha$ (unless they are already equal).

Assumption 2 (Finite-time).

When using a finite-time strategy, after a finite number of calls to resimplify on the children of a belief-action node $ha$, the lower and upper bounds of $ha$ coincide.

3.4.1 Resimplification algorithmic scheme

Consider a belief-action node $ha$ at some level of the tree, together with its lower and upper bounds. Assume the algorithm chooses it for bounds tightening, as described in Sec. 3.3 and Alg. 2 line 3. All tree nodes that have $ha$ as an ancestor contribute their immediate bounds to the calculation of the bounds of $ha$. Thus, to tighten these bounds, we can potentially choose any candidate nodes in the subtree of $ha$. Every child belief node of $ha$ is sent to the resimplification routine (Alg. 2 lines 7-9), which performs four tasks. First, it chooses the action (Alg. 3 line 7) that will participate in the subsequent resimplification call and sends all of its children belief nodes to the recursive call down the tree (Alg. 3 lines 8-10). Second, it refines the belief node according to the specific resimplification strategy (Alg. 3 lines 3, 12, 18). Third, it reconstructs the bounds of the chosen belief-action node once all of its children belief nodes have returned from the resimplification routine (Alg. 3 line 11). Fourth, it engages the rollout resimplification routine according to the specific resimplification strategy (Alg. 3 lines 4, 13). Upon completion of this resimplification call initiated at $ha$, we get tighter immediate bounds for some of the descendant belief nodes of $ha$ (including rollout nodes). Accordingly, the bounds of all descendant belief-action nodes of $ha$ are updated.

3.4.2 Specific resimplification strategy

Specifically, we decide to refine $l$, $u$ of a belief node $b$ with depth $d$ if

$$\gamma^{d_G - d}\, (u - l) \ge \frac{1}{d_G}\, G(ha), \tag{12}$$

where $G(ha)$ corresponds to the gap (11) of the belief-action node $ha$ that initially triggered resimplification in Alg. 2 line 24, and $d_G$ is the depth of that node. The explanation of the resimplification strategy (12) is rather simple. The right-hand side of (12) is the mean gap per depth level in the subtree with $ha$ as its root and spreading downwards to the leaves. Naturally, some of the nodes in this subtree have a gap above the mean, and some below. We wish to locate and refine all the ones above. As for the left-hand side of (12), the rewards are accumulated and discounted according to their depth. Thus, when comparing the node $ha$ with depth $d_G$ to a belief node with depth $d$, we must account for the proper relative discount factor. Note that the depth identified with the root is $d_{max}$, as seen in Alg. 1 line 4, and the leaves are distinguished by depth $0$.
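One reading of the refinement test described above, written as a small Python predicate: the node's own bound width, discounted by its distance from the triggering node, is compared against the mean gap per level. The argument names are assumptions, and depth is taken to count down toward the leaves (depth 0), as the text states.

```python
def should_refine(lower, upper, depth, trigger_gap, trigger_depth, gamma):
    """Refine when the discounted bound width reaches the mean gap per level.

    depth         -- remaining depth of the inspected belief node (leaves are 0)
    trigger_gap   -- gap of the belief-action node that triggered resimplification
    trigger_depth -- remaining depth of that triggering node (must be positive)
    """
    discounted_width = gamma ** (trigger_depth - depth) * (upper - lower)
    return discounted_width >= trigger_gap / trigger_depth

# Hypothetical numbers: gap 2.0 at depth 4, so the mean gap per level is 0.5.
narrow = should_refine(0.0, 1.0, depth=2, trigger_gap=2.0, trigger_depth=4, gamma=0.5)
wide = should_refine(0.0, 2.0, depth=2, trigger_gap=2.0, trigger_depth=4, gamma=0.5)
near = should_refine(0.0, 0.5, depth=4, trigger_gap=2.0, trigger_depth=4, gamma=0.5)
```

Nodes far below the triggering node need a wider bound interval to qualify, because their contribution to the gap is discounted more heavily.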

Figure 2: 2D Continuous Light Dark; panels (a) SITH-PFT, (b) PFT, (c) IPFT. The agent starts from an initial unknown location and is given an initial belief. The goal is to get to the goal location (circled in red) and execute the terminal action. Near the beacon (white light) the observations are less noisy. We consider a multi-objective function, accounting for the distance to the goal and the differential entropy approximation (with the minus sign for reward notation). Executing the terminal action inside the red circle gives the agent a large positive reward, whereas executing it outside yields a large negative reward.

For each rollout originating from a tree belief node, we find, locally within the rollout, the node with the biggest gap between its bounds among those fulfilling (12) and resimplify it (Alg. 3 lines 4, 13). To choose the action along which to continue resimplification down the tree, we take the action corresponding to the belief-action node with the largest gap weighted by its visitation count (Alg. 3 line 7). With this strategy, we aim to leave the belief tree at the lowest possible simplification levels while still guaranteeing tree consistency.

3.4.3 Reconstructing the bounds

If the action selection procedure triggered a resimplification, it modified the bounds throughout the tree. Since the resimplification works recursively, it reconstructs the belief-action node bounds on the way back from the recursion (Alg. 3 line 11). Similarly, the action dismissing procedure reconstructs the bounds $\underline{Q}^{I}$ and $\overline{Q}^{I}$ of the belief-action node at which the action dismissing is performed (Alg. 2 line 10). Moreover, on the way back from the simulation, we must update the ancestral belief-action nodes of the tree. Specifically, we are required to reconstruct each lower and upper bound located higher than the deepest starting point of the resimplification (Alg. 1 lines 23-25). Reconstruction is essentially a double loop. To reconstruct the bounds of $ha$, we first query all of its belief children nodes $haz$. We then query all belief-action nodes that are children of $haz$, i.e., $haza'$. The possibly modified immediate bounds $l$ and $u$ are taken from the $haz$ nodes, and the $\underline{Q}^{I}$, $\overline{Q}^{I}$ bounds are taken from the $haza'$ nodes. Importantly, each of the bounds is weighted according to the proper visitation count.
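The double loop can be sketched as a visitation-weighted average. This is an assumed form, not the paper's exact expression: rollout contributions are left out, the discount is applied once per level, and the dictionary layout of the nodes is hypothetical.

```python
def reconstruct_q_bounds(belief_children, gamma):
    """Rebuild the lower and upper bounds of a belief-action node from its children.

    belief_children -- list of dicts with keys "n" (visits), "l", "u" (immediate
                       bounds) and "actions", a list of dicts with keys "n",
                       "q_low", "q_up" (bounds of the grandchild action nodes).
    """
    n_total = sum(child["n"] for child in belief_children)
    q_low = q_up = 0.0
    for child in belief_children:           # outer loop: belief children
        n_actions = sum(ba["n"] for ba in child["actions"])
        future_low = future_up = 0.0
        for ba in child["actions"]:         # inner loop: their action children
            future_low += ba["n"] / n_actions * ba["q_low"]
            future_up += ba["n"] / n_actions * ba["q_up"]
        weight = child["n"] / n_total
        q_low += weight * (child["l"] + gamma * future_low)
        q_up += weight * (child["u"] + gamma * future_up)
    return q_low, q_up

children = [
    {"n": 1, "l": 0.0, "u": 1.0, "actions": []},
    {"n": 3, "l": 1.0, "u": 2.0, "actions": [
        {"n": 1, "q_low": 0.0, "q_up": 2.0},
        {"n": 1, "q_low": 2.0, "q_up": 4.0}]},
]
q_low, q_up = reconstruct_q_bounds(children, gamma=0.5)
```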

3.5 Guarantees

Assuming a converging and finite-time resimplification strategy, the following theorems are satisfied:

Theorem 1.

The SITH-PFT and PFT are Tree Consistent Algorithms.

Theorem 2.

The SITH-PFT provides the same solution as PFT.

Theorem 3.

The specific resimplification strategy from Sec. 3.4.2 is a converging and finite-time resimplification strategy.

See full proofs of the theorems and time complexity analysis using the specific bounds in the supplementary 6.2,6.2.5. Note other resimplification strategies are possible, see supplementary 6.2.6.

4 Experiments

In the continuous setting with information-theoretic rewards, many common POMDP benchmarks (e.g., rock sampling, laser tag) are inadequate. We turn to the challenging Continuous Light Dark Problem with a few modifications. We extend it to a 2D domain and place a single “light beacon” in the continuous world. The agent’s goal is to get to the goal location and execute the terminal action. Executing it within a small radius of the goal gives the agent a reward of 200, and executing it outside the radius yields a negative reward of -200. The agent can move in eight evenly spread directions. The multi-objective reward function accounts for the distance to the goal and the differential entropy approximation (with a minus sign). The motion model, the observation model, and the initial belief are specified with diagonal covariance matrices, and the observation model depends on the 2D location of the beacon. The implementation is built upon the JuliaPOMDP package collection Egorov et al. (2017). The code is attached alongside the supplementary manuscript. Extensive experiments confirm the advantage of our approach. We experiment with ten different configurations (rows of Table 1) that differ in the number of particles, the simulation depth, and the number of simulation iterations per planning session.
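The action set and the terminal reward of this benchmark can be sketched as below. Only the eight evenly spread directions and the reward magnitudes of 200 and -200 come from the text; the unit step length, the radius value, and the function names are assumptions made for illustration.

```python
import math

def eight_directions():
    """Unit vectors spaced 45 degrees apart."""
    return [(math.cos(k * math.pi / 4), math.sin(k * math.pi / 4))
            for k in range(8)]

def terminal_reward(position, goal, radius):
    """+200 for executing the terminal action within `radius` of the goal, else -200."""
    return 200.0 if math.dist(position, goal) <= radius else -200.0

directions = eight_directions()
inside = terminal_reward((0.5, 0.0), (0.0, 0.0), radius=1.0)   # hypothetical radius
outside = terminal_reward((3.0, 0.0), (0.0, 0.0), radius=1.0)
```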

(particles, depth, #iter.) Algorithm planning time [sec]
(50, 30, 200) PFT-DPW
(50, 50, 500) PFT-DPW
(100, 30, 200) PFT-DPW
(100, 50, 500) PFT-DPW
(200, 30, 200) PFT-DPW
(200, 50, 500) PFT-DPW
(400, 30, 200) PFT-DPW
(400, 50, 500) PFT-DPW
(600, 30, 200) PFT-DPW
(600, 50, 500) PFT-DPW
Table 1: Runtimes of SITH-PFT versus PFT-DPW. The rows are different configurations of the number of belief particles, the maximal tree depth, and the number of iterations per planning session. Reported values are averaged over 25 simulations of 10 planning sessions each and presented with the standard errors. In all simulations SITH-PFT and PFT-DPW declared identical actions as optimal and exhibited identical belief trees in terms of connectivity and visitation counts.

Each scenario comprises 10 planning sessions, i.e., the agent performs up to 10 planning and action-execution iterations. We repeat each of the experiments 25 times. In all configurations, we obtained a significant speedup while achieving exactly the same solution as PFT. Results are summarized in Table 1. An illustration can be found in Fig. 2. Note that SITH-PFT (Fig. 2(a)) yields a solution identical to that of PFT (Fig. 2(b)), while IPFT (Fig. 2(c)) demonstrates severely degraded behavior. We remind the reader that the purpose of our work is to speed up the PFT approach when coupled with an information-theoretic reward. Hence, due to space constraints and since the two algorithms produce identical belief trees and actions at the end of each planning session, we do not report the algorithms' identical performance (apart from planning time). For our simulations, we used an 8-core Intel(R) Xeon(R) CPU E5-1620 v4 working at 3.50GHz with 128 GB of RAM.

5 Conclusions

We presented a novel method to accelerate planning with information-theoretic rewards. Our approach is applicable with any bounds that converge to the reward. We provide thorough proofs that our method is entirely equivalent to PFT-DPW, yielding the same solution and belief tree in each planning step. Our experiments demonstrate that the technique is superior to PFT-DPW in terms of computation time. In the worst-case scenario, the computation time approaches that of the baseline. The limitation of our algorithm is that it relies on converging bounds, which are not trivial to derive and are specific to a particular reward function. In addition, it requires slightly more caching than the baseline.


This research was supported by the Israel Science Foundation (ISF) and by a donation from the Zuckerman Fund to the Technion Center for Machine Learning and Intelligent Systems (MLIS).


  • Boers et al. [2010] Y. Boers, H. Driessen, A. Bagchi, and P. Mandal. Particle filter based entropy. In 2010 13th International Conference on Information Fusion, pages 1–8, 2010. doi: 10.1109/ICIF.2010.5712013.
  • Dressel and Kochenderfer [2017] Louis Dressel and Mykel J. Kochenderfer. Efficient decision-theoretic target localization. In Laura Barbulescu, Jeremy Frank, Mausam, and Stephen F. Smith, editors, Proceedings of the Twenty-Seventh International Conference on Automated Planning and Scheduling, ICAPS 2017, Pittsburgh, Pennsylvania, USA, June 18-23, 2017, pages 70–78. AAAI Press, 2017.
  • Egorov et al. [2017] Maxim Egorov, Zachary N. Sunberg, Edward Balaban, Tim A. Wheeler, Jayesh K. Gupta, and Mykel J. Kochenderfer. POMDPs.jl: A framework for sequential decision making under uncertainty. Journal of Machine Learning Research, 18(26):1–5, 2017.
  • Elimelech and Indelman [2018] Khen Elimelech and Vadim Indelman. Simplified decision making in the belief space using belief sparsification. Intl. J. of Robotics Research, 12 2018. Conditionally accepted.
  • Fischer and Tas [2020] Johannes Fischer and Omer Sahin Tas. Information particle filter tree: An online algorithm for pomdps with belief-based rewards on continuous domains. In Intl. Conf. on Machine Learning (ICML), Vienna, Austria, 2020.
  • Garg et al. [2019] Neha P Garg, David Hsu, and Wee Sun Lee. DESPOT-α: Online POMDP planning with large state and observation spaces. In Robotics: Science and Systems (RSS), 2019.
  • Hoerger et al. [2019] Marcus Hoerger, Hanna Kurniawati, and Alberto Elfes. Multilevel monte-carlo for solving pomdps online. In Proc. International Symposium on Robotics Research (ISRR), 2019.
  • Kochenderfer et al. [2022] M. Kochenderfer, T. Wheeler, and K. Wray. Algorithms for Decision Making. MIT Press, 2022.
  • Kocsis and Szepesvári [2006] Levente Kocsis and Csaba Szepesvári. Bandit based monte-carlo planning. In European conference on machine learning, pages 282–293. Springer, 2006.
  • Kurniawati et al. [2008] H. Kurniawati, D. Hsu, and W. S. Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Robotics: Science and Systems (RSS), volume 2008, 2008.
  • Silver and Veness [2010] David Silver and Joel Veness. Monte-Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems (NIPS), pages 2164–2172, 2010.
  • Sunberg and Kochenderfer [2018] Zachary Sunberg and Mykel Kochenderfer. Online algorithms for POMDPs with continuous state, action, and observation spaces. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 28, 2018.
  • Sztyglic and Indelman [2021] Ori Sztyglic and Vadim Indelman. Online POMDP planning via simplification. arXiv preprint arXiv:2105.05296, 2021.
  • Ye et al. [2017] Nan Ye, Adhiraj Somani, David Hsu, and Wee Sun Lee. DESPOT: Online POMDP planning with regularization. JAIR, 58:231–266, 2017.
  • Zhitnikov and Indelman [2021] Andrey Zhitnikov and Vadim Indelman. Probabilistic loss and its online characterization for simplified decision making under uncertainty. arXiv preprint arXiv:2105.05789, 2021.

6 Supplementary

6.1 Information theoretic bounds

In this paper we consider the differential entropy approximation by Boers et al. [2010]. The approximation is w.r.t. belief and assumes the form (see full expression in Sec. 6.2.5). Further, we consider bounds over this approximation developed by Sztyglic and Indelman [2021] upholding (6). Specifically,


where .

6.2 Proofs

6.2.1 Assumptions

For the following proofs (Secs. 6.2.2 and 6.2.3), assume we are using a converging and finite-time resimplification strategy that satisfies Assumptions 1 and 2.

6.2.2 Proof for Theorem 1


We provide a proof by induction on the structure of the belief tree.
Base: Consider an initial given belief node . No actions were taken and no observations were made. Thus, both the PFT tree and the SITH-PFT tree contain a single identical belief node, and the claim holds.
Induction hypothesis: Assume we are given two identical trees with nodes, generated by PFT and SITH-PFT. The trees uphold the terms of Definition 1.
Induction step: Assume by contradiction that in the next simulation (which, by definition, expands the belief tree by one belief node) different nodes were added to the two trees, so that the trees now differ.
Two different scenarios are possible:

Case 1.

The same action-observation sequence was chosen in both trees, but different nodes were added.

Case 2.

Different action-observation sequences were chosen in the two trees, and thus we got different tree structures.

Case 1 is not possible. Since the induction hypothesis holds, the last action was taken from the same node, denoted , which is shared by and identical in both trees. Next, the same observation model is sampled for a new observation, and a new belief node is added with a rollout emanating from it. The new belief node and the rollout are identical in both trees since both algorithms use the same random seed and the same observation and motion models.

Case 2 must be true since we showed Case 1 is false. There are two possible scenarios such that different action-observation sequences were chosen:

Case 2.1.

At some point of the action-observation sequence, different observations were chosen.

Case 2.2.

At some point of the action-observation sequence, PFT chose action while SITH-PFT chose a different action, , or even got stuck without picking any action.

Case 2.1 is not possible. If new observations were made, they are identical in both trees, by the same argument that ruled out Case 1. If existing observations are drawn (i.e., some observation branch down the tree is chosen), the same observations are drawn, since they are drawn with the same random seed and from the same “pool” of observations. The “pool” is the same because the induction hypothesis holds.

Case 2.2 must be true since we showed Case 2.1 is false, i.e., when both algorithms are at the identical node, denoted as , PFT chooses action , while SITH-PFT chooses a different action, , or even gets stuck without picking any action. Specifically, PFT chooses action and SITH-PFT’s candidate action is .
Two different scenarios are possible:

Case 2.2.1.

The bounds over were tight enough and was chosen such that .

Case 2.2.2.

SITH-PFT is stuck in an infinite loop. This can happen if the bounds over , and at least one of its sibling nodes , are not tight enough, while all of the tree nodes are already at the maximal simplification level. Hence, resimplification is triggered over and over without changing anything.

Case 2.2.1 is not possible since the bounds are analytical (always true) and converge to the actual reward () for the maximal simplification level.

Case 2.2.2 is not possible. If the bounds are not tight enough to make a decision, resimplification is triggered. Each time, some node (a sibling of , or possibly itself) is chosen in Select Best to undergo resimplification. According to Assumption 1 and Assumption 2, after some finite number of iterations it holds for all of the sibling nodes (including ) that , and some action can be picked. If different actions have identical values, we choose one by the same rule UCB uses to pick among actions with identical values (e.g., lower index/random).

Now, since Case 2.2.2 is false, after some finite number of resimplification iterations, SITH-PFT will stop with bounds tight enough to make a decision. And since Case 2.2.1 is false, it holds that . Thus we get a contradiction and the proof is complete. ∎
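
The argument above can be illustrated with a small sketch. It is not the SITH-PFT implementation: `BoundedAction`, its linearly shrinking `slack`, and `select_with_bounds` are hypothetical names, and the sketch only assumes what the proof assumes, namely that the bounds always contain the exact value and coincide with it at the maximal simplification level. Under those assumptions, selection from bounds returns the same action as selection from exact values, with the same tie-breaking rule.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BoundedAction:
    """Hypothetical belief-action node carrying interval bounds on its value."""
    exact: float        # value the exact (PFT-style) computation would return
    level: int = 0      # current simplification level
    max_level: int = 4  # at this level the bounds collapse onto `exact`
    slack: float = 1.0  # assumed half-width of the interval at level 0

    def bounds(self) -> Tuple[float, float]:
        # The half-width shrinks with the level and is exactly 0 at max_level.
        half = self.slack * (self.max_level - self.level) / self.max_level
        return self.exact - half, self.exact + half

    def refine(self) -> None:
        # One resimplification step: move to the next (tighter) level.
        self.level = min(self.level + 1, self.max_level)


def select_with_bounds(actions: List[BoundedAction]) -> int:
    """Pick an action using bounds only, refining when intervals overlap."""
    while True:
        lower = [a.bounds()[0] for a in actions]
        upper = [a.bounds()[1] for a in actions]
        # Candidate: largest lower bound; ties go to the lowest index.
        best = max(range(len(actions)), key=lambda i: (lower[i], -i))
        rivals = [i for i in range(len(actions))
                  if i != best and upper[i] > lower[best]]
        if not rivals:
            return best  # no sibling can still beat the candidate
        for i in [best] + rivals:
            actions[i].refine()


def select_exact(actions: List[BoundedAction]) -> int:
    """Reference choice made from the exact values, same tie-breaking rule."""
    return max(range(len(actions)), key=lambda i: (actions[i].exact, -i))
```

For example, with exact values 2.0, 2.3, and 0.0, the first two actions are refined until their intervals separate, while the third is never refined because its upper bound stays below the candidate's lower bound throughout.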

6.2.3 Proof for Theorem 2:


Since the same tree is built according to Theorem 1, the only modification now is the final criterion at the end of the planning session at the root of the tree: . Note that we can set the exploration constant of UCB to zero, so that UCB is just the function. Thus, if the bounds are not tight enough at the root to decide on an action, resimplification will be repeatedly called until SITH-PFT can make a decision. The action will be identical to the one chosen by UCB in PFT, by arguments similar to those in the proof of Theorem 1 (Sec. 6.2.2). Note that additional final criteria for action selection could be introduced, but it would not matter since tree consistency is kept according to Theorem 1 and the bounds converge to the actual immediate rewards and estimations. ∎

6.2.4 Proof for Theorem 3

We now prove that the resimplification strategy described in Section 3.4.2 is a converging and finite-time resimplification strategy.

Proof: Converging resimplification strategy.

Consider the condition for refinement of the bounds (12). Since is the mean gap over all the nodes that are descendants of , some of the nodes are above this mean gap and some are below it (accounting for the discount factor). We refine all the ones that are above. Further, for each descendant rollout, we refine one rollout node that is above the mean gap. If each time we refine all descendant belief nodes that are above the mean gap and one rollout node per descendant rollout (if it satisfies (12)), then after one iteration the mean gap must decrease, since there exists a node above the mean gap that got tighter. If there is no node above the mean gap, all the values are the same throughout the sub-tree, and those values must be zero since the immediate bounds converge. Thus, the mean gap (and consequently also ) gets smaller in each iteration unless it is already zero. ∎

Proof: Finite-time resimplification strategy.

Similarly to the previous proof, in each iteration there exists a node above the mean gap that is chosen for refinement. There are no nodes above the gap only if all the values throughout the sub-tree are zero. This happens after a finite number of iterations since there is a finite number of nodes and a finite number of simplification levels. Since the bounds converge, at the maximal simplification level it holds . Thus, after all nodes in the sub-tree got to the maximal simplification level, it holds (and consequently also ). ∎
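
The two proofs above can be checked on a toy model. The sketch below is a simplified stand-in, not the strategy of Section 3.4.2: it ignores rollouts and the discount factor, replaces condition (12) with the assumed rule "refine every node whose gap is at least the mean gap", and models each node's gap as `width * (max_level - level)`, which is zero at the maximal level. All names are hypothetical. Integer arithmetic is used so that the comparison with the mean is exact.

```python
from typing import List, Tuple


def resimplify_until_converged(widths: List[int],
                               max_level: int) -> Tuple[List[int], List[int]]:
    """Refine nodes at or above the mean gap until every gap is zero.

    widths[i] is an assumed non-negative per-level gap of node i, so its gap
    at level l is widths[i] * (max_level - l). Returns the final levels and
    the total gap recorded at the start of every iteration.
    """
    levels = [0] * len(widths)
    totals = []
    while True:
        gaps = [w * (max_level - l) for w, l in zip(widths, levels)]
        total = sum(gaps)
        totals.append(total)
        if total == 0:
            return levels, totals  # all bounds have converged
        n = len(gaps)
        for i, g in enumerate(gaps):
            # g >= total / n, written without division to stay exact.
            if g * n >= total:
                levels[i] += 1
```

With `widths = [4, 1, 1]` and `max_level = 3`, the wide node is refined first, the total gap strictly decreases in every iteration, and the loop stops after six refinement rounds, which is below the trivial limit of (number of nodes) × (number of levels) = 9.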

6.2.5 Time complexity analysis

We turn to analyze the time complexity of our method using the chosen bounds (6.1). We assume the significant bottleneck is querying the motion and observation models. Assume the belief is approximated by a set of weighted particles,


Consider the Boers et al. [2010] differential entropy approximation for belief at time ,


Denote the time complexity to query the observation and motion models a single time as respectively. It is clear from (14), (15) (term a), and (16) (term b) that:


Since we share calculation between the bounds, the bounds’ time complexity, for some level of simplification , based on Sztyglic and Indelman [2021], is:


where is the size of the particle subset that is currently used for the bounds calculations, e.g., ( is as in (6.1)), and denotes the immediate upper and lower bounds using simplification level . Further, recall that the simplification levels are discrete, finite, and satisfy


Now, assume we wish to tighten and move from simplification level to . Since the bounds are updated incrementally (as introduced by Sztyglic and Indelman [2021]), when moving from simplification level to the only additional data we are missing are the new values of the observation and motion models for the newly added particles. Thus, we get that the time complexity of moving from one simplification level to another is:


where denotes the time complexity of updating the bounds from one simplification level to the following one. Note the first term from (18), , is not present in (20). This term has nothing to do with simplification level and it is calculated linearly over all particles . Thus, it is calculated once at the beginning (initial/lowest simplification level).

We can now deduce using (18) and (20)


Finally, using (17), (18), (19), (20), and (21), we conclude that if, at the end of a planning session, a node’s simplification level was , then the time saved for that node is


This is as expected: if we had to resimplify all the way to the maximal level, we get , and substituting this into (22) shows that no time was saved.

To conclude, the total speedup of the algorithm is dependent on how many belief nodes’ bounds were not resimplified to the maximal level. The more nodes we had at the end of a planning session with lower simplification levels, the more speedup we get according to (22).
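
The accounting in this section can be made concrete with a small cost model. The formulas below are assumptions, chosen to be consistent with the surrounding text rather than copied from (17)-(22): the full estimator is taken to need N observation-model queries and N·N motion-model queries, the bounds at a level that uses a subset of `n_s` particles are taken to need N observation-model queries and `n_s`·N motion-model queries, and moving to the next level only pays for the newly added particles. The function names and the symbols `n`, `n_s`, `t_obs`, `t_mot` are hypothetical.

```python
def full_reward_cost(n: int, t_obs: float, t_mot: float) -> float:
    """Assumed cost of the full estimator over n particles."""
    return n * t_obs + n * n * t_mot


def bounds_cost(n: int, n_s: int, t_obs: float, t_mot: float) -> float:
    """Assumed cost of both bounds when the subset holds n_s <= n particles."""
    return n * t_obs + n_s * n * t_mot


def refine_cost(n: int, n_s: int, n_next: int, t_mot: float) -> float:
    """Assumed cost of growing the subset from n_s to n_next particles.

    The observation-model term is paid once at the lowest level, so only the
    motion-model queries for the newly added particles appear here.
    """
    return (n_next - n_s) * n * t_mot


def saved_cost(n: int, n_s: int, t_obs: float, t_mot: float) -> float:
    """Time saved for a node that ended the planning session at subset size n_s."""
    return full_reward_cost(n, t_obs, t_mot) - bounds_cost(n, n_s, t_obs, t_mot)
```

Under these assumptions, a node that stops at a quarter of the particles saves three quarters of the motion-model queries, and a node refined all the way to the full particle set saves nothing, which matches the conclusion above.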

6.2.6 Additional resimplification strategies

We note that the proofs of Theorems 1 and 2 depend on our resimplification strategy (Sec. 3.4.2) only through Assumptions 1 and 2. That is, additional strategies can be introduced as long as they satisfy these assumptions. To clarify, a simple example of a converging and finite-time resimplification strategy would be to refine the bounds of all nodes (belief tree nodes and rollout nodes) that are descendants of the belief-action node that was chosen for resimplification in the Select Best procedure. Naturally, there will always be a node that got tightened (unless all bounds are already equal); thus, Assumption 1 is satisfied. Further, after a finite time, all nodes in the sub-tree reach the maximal level of simplification, and the bounds converge. Thus, Assumption 2 is satisfied. Note that using this brute-force strategy can result in many unnecessary resimplifications. So, the potential speed-up may decrease, but in the worst case SITH-PFT will still yield the same time complexity as PFT.
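
A minimal sketch of this brute-force strategy follows. It is an illustration under simplifying assumptions, not code from the paper: each descendant node is represented only by its current simplification level, and `brute_force_resimplify` and `calls_until_converged` are hypothetical names.

```python
from typing import List, Tuple


def brute_force_resimplify(levels: List[int],
                           max_level: int) -> Tuple[List[int], bool]:
    """Raise every non-maximal node by one level; report whether any changed."""
    new_levels = [min(level + 1, max_level) for level in levels]
    return new_levels, new_levels != levels


def calls_until_converged(levels: List[int],
                          max_level: int) -> Tuple[List[int], int]:
    """Count the resimplification calls that still tighten some node."""
    calls = 0
    while True:
        levels, changed = brute_force_resimplify(levels, max_level)
        if not changed:
            return levels, calls  # every node is at the maximal level
        calls += 1
```

Every call tightens at least one node until all are at the maximal level, and the number of effective calls equals the distance of the least refined node from the maximal level. The price is that nodes whose bounds were already sufficient are refined as well.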