Online solver based on Monte Carlo tree search for POMDPs with continuous state, action, and observation spaces.
Online solvers for partially observable Markov decision processes have been applied to problems with large discrete state spaces, but continuous state, action, and observation spaces remain a challenge. This paper begins by investigating double progressive widening (DPW) as a solution to this challenge. However, we prove that this modification alone is not sufficient because the belief representations in the search tree collapse to a single particle, causing the algorithm to converge to a policy that is suboptimal regardless of the computation time. The main contribution of the paper is to propose a new algorithm, POMCPOW, that incorporates DPW and weighted particle filtering to overcome this deficiency and attack continuous problems. Simulation results show that these modifications allow the algorithm to be successful where previous approaches fail.
The partially observable Markov decision process (POMDP) is a flexible mathematical framework for representing sequential decision problems [Littman, Cassandra, and Kaelbling1995, Thrun, Burgard, and Fox2005]. Once a problem has been formalized as a POMDP, a wide range of solution techniques can be used to solve it. In a POMDP, at each step in time, an agent selects an action causing the state to change stochastically to a new value based only on the current state and action. The agent seeks to maximize the expectation of the reward, which is a function of the state and action. However, the agent cannot directly observe the state, and makes decisions based only on observations that are stochastically generated by the state.
Many offline methods have been developed to solve small and moderately sized POMDPs [Kurniawati, Hsu, and Lee2008]. Solving larger POMDPs generally requires the use of online methods [Silver and Veness2010, Somani et al.2013, Kurniawati and Yadav2016]. One widely used online algorithm is partially observable Monte Carlo planning (POMCP) [Silver and Veness2010], which is an extension to Monte Carlo tree search that implicitly uses an unweighted particle filter to represent beliefs in the search tree.
POMCP and other online methods can accommodate continuous state spaces, and there has been recent work on solving problems with continuous action spaces [Seiler, Kurniawati, and Singh2015]. However, there has been less progress on problems with continuous observation spaces. This paper presents two similar algorithms which address the challenge of solving POMDPs with continuous state, action, and observation spaces. The first is based on POMCP and is called partially observable Monte Carlo planning with observation widening (POMCPOW). The second solves the belief-space MDP and is called particle filter trees with double progressive widening (PFT-DPW).
There are two challenges that make tree search difficult in continuous spaces. The first is that, since the probability of sampling the same real number twice from a continuous random variable is zero, the width of the planning trees explodes on the first step, causing them to be too shallow to be useful (see Fig. 1). POMCPOW and PFT-DPW resolve this issue with a technique called double progressive widening (DPW) [Couëtoux et al.2011]. The second issue is that, even when DPW is applied, the belief representations used by current solvers collapse to a single state particle, resulting in overconfidence. As a consequence, the solutions obtained resemble QMDP policies, and there is no incentive for information gathering. POMCPOW and PFT-DPW overcome this issue by using the observation model to weight the particles used to represent beliefs.
This paper proceeds as follows: Section 2 provides an overview of previous online POMDP approaches. Section 3 provides a brief introduction to POMDPs and Monte Carlo tree search. Section 4 presents several algorithms for solving POMDPs on continuous spaces and discusses theoretical and practical aspects of their behavior. Section 5 then gives experimental validation of the algorithms.
Considerable progress has been made in solving large POMDPs. Initially, exact offline solutions to problems with only a few discrete states, actions, and observations were sought by using value iteration and taking advantage of the convexity of the value function [Kaelbling, Littman, and Cassandra1998], although solutions to larger problems were also explored using Monte Carlo simulation and interpolation between belief states [Thrun1999]. Many effective offline planners for discrete problems use point-based value iteration, where a selection of points in the belief space is used for value function approximation [Kurniawati, Hsu, and Lee2008]. Offline solutions for problems with continuous state and observation spaces have also been proposed [Bai, Hsu, and Lee2014, Brechtel, Gindele, and Dillmann2013].
There are also various solution approaches that are applicable to specific classes of POMDPs, including continuous problems. For example, Platt et al. [2010] simplify planning in large domains by assuming that the most likely observation will always be received, which can provide an acceptable approximation in some problems with unimodal observation distributions. Morere, Marchant, and Ramos [2016] solve a monitoring problem with continuous spaces with a Gaussian process belief update. Hoey and Poupart [2005] propose a method for partitioning large observation spaces without information loss, but demonstrate the method only on small state and action spaces that have a modest number of conditional plans. Other methods involve motion-planning techniques [Melchior and Simmons2007, Prentice and Roy2009, Bry and Roy2011]. In particular, Agha-Mohammadi, Chakravorty, and Amato [2011] present a method to take advantage of the existence of a stabilizing controller in belief space planning. Van den Berg, Patil, and Alterovitz [2012] perform local optimization with respect to uncertainty on a pre-computed path, and Indelman, Carlone, and Dellaert [2015] devise a hierarchical approach that handles uncertainty in both the robot’s state and the surrounding environment.
General purpose online algorithms for POMDPs have also been proposed. Many early online algorithms focused on point-based belief tree search with heuristics for expanding the trees [Ross et al.2008]. The introduction of POMCP [Silver and Veness2010] caused a pivot toward the simple and fast technique of using the same simulations for decision-making and using beliefs implicitly represented as unweighted collections of particles. Determinized sparse partially observable tree (DESPOT) is a similar approach that attempts to achieve better performance by analyzing only a small number of random outcomes in the tree [Somani et al.2013]. Adaptive belief tree (ABT) was designed specifically to accommodate changes in the environment without having to replan from scratch [Kurniawati and Yadav2016].
These methods can all easily handle continuous state spaces [Goldhoorn et al.2014], but they must be modified to extend to domains with continuous action or observation spaces. Though DESPOT has demonstrated effectiveness on some large problems, it uses unweighted particle beliefs in its search tree, so it struggles with continuous information gathering problems, as will be shown in Section 5. ABT has been extended to use generalized pattern search for selecting locally optimal continuous actions, an approach which is especially effective in problems where high precision is important [Seiler, Kurniawati, and Singh2015], but it also uses unweighted particle beliefs. Continuous observation Monte Carlo tree search (COMCTS) constructs observation classification trees to automatically partition the observation space in a POMCP-like approach; however, it did not perform much better than a Monte Carlo rollout approach in experiments [Pas2012].
Although research has yielded effective solution techniques for many classes of problems, there remains a need for simple, general purpose online POMDP solvers that can handle continuous spaces, especially continuous observation spaces.
This section reviews mathematical formulations for sequential decision problems and some existing solution approaches. The discussion assumes familiarity with Markov decision processes [Kochenderfer2015], particle filtering [Thrun, Burgard, and Fox2005], and Monte Carlo tree search [Browne et al.2012], but reviews some details for clarity.
The Markov decision process (MDP) and partially observable Markov decision process (POMDP) can represent a wide range of sequential decision making problems. In a Markov decision process, an agent takes actions that affect the state of the system and seeks to maximize the expected value of the rewards it collects [Kochenderfer2015]. Formally, an MDP is defined by the 5-tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{T}$ is the transition model, $\mathcal{R}$ is the reward function, and $\gamma$ is the discount factor. The transition model can be encoded as a set of probabilities; specifically, $\mathcal{T}(s' \mid s, a)$ denotes the probability that the system will transition to state $s'$ given that action $a$ is taken in state $s$. In continuous problems, $\mathcal{T}$ is defined by probability density functions.
In a POMDP, the agent cannot directly observe the state. Instead, the agent only has access to observations that are generated probabilistically based on the actions and latent true states. A POMDP is defined by the 7-tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \mathcal{O}, \mathcal{Z}, \gamma)$, where $\mathcal{S}$, $\mathcal{A}$, $\mathcal{T}$, $\mathcal{R}$, and $\gamma$ have the same meaning as in an MDP. Additionally, $\mathcal{O}$ is the observation space, and $\mathcal{Z}$ is the observation model. $\mathcal{Z}(o \mid s, a, s')$ is the probability or probability density of receiving observation $o$ in state $s'$ given that the previous state and action were $s$ and $a$.
Information about the state may be inferred from the entire history of previous actions and observations and the initial information, $b_0$. Thus, in a POMDP, the agent’s policy is a function mapping each possible history, $h_t = (b_0, a_0, o_1, a_1, o_2, \ldots, a_{t-1}, o_t)$, to an action. In some cases, each state’s probability can be calculated based on the history. This distribution is known as a belief, with $b_t(s)$ denoting the probability of state $s$.
The belief is a sufficient statistic for optimal decision making. That is, there exists a policy, $\pi^*$, such that, when $a_t = \pi^*(b_t)$, the expected cumulative reward or “value function” is maximized for the POMDP. Given the POMDP model, each subsequent belief can be calculated using Bayes’ rule [Kaelbling, Littman, and Cassandra1998, Kochenderfer2015]. However, the exact update is computationally intensive, so approximate approaches such as particle filtering are usually used in practice [Thrun, Burgard, and Fox2005].
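For reference, the exact Bayesian update can be written out as follows. This is the standard form under assumed notation: $b$ is the current belief, $a$ the action taken, $o$ the observation received, $b'$ the updated belief, and $\mathcal{T}$ and $\mathcal{Z}$ the transition and observation models. In discrete state spaces the integrals become sums.

```latex
b'(s') = \frac{\int_{\mathcal{S}} \mathcal{Z}(o \mid s, a, s') \, \mathcal{T}(s' \mid s, a) \, b(s) \, ds}
              {\int_{\mathcal{S}} \int_{\mathcal{S}} \mathcal{Z}(o \mid s, a, s'') \, \mathcal{T}(s'' \mid s, a) \, b(s) \, ds \, ds''}
```

The normalizing denominator integrates over every pair of states, which is why the exact update is expensive; a particle filter evaluates $\mathcal{Z}$ only at sampled states.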
For many problems, it can be difficult to explicitly determine or represent the probability distributions $\mathcal{T}$ or $\mathcal{Z}$. Some solution approaches, however, only require samples from the state transitions and observations. A generative model, $G$, stochastically generates a new state, a reward, and (in the partially observable case) an observation, given the current state and action; that is, $s', r = G(s, a)$ for an MDP, or $s', o, r = G(s, a)$ for a POMDP. A generative model implicitly defines $\mathcal{T}$ and $\mathcal{Z}$, even when they cannot be explicitly represented.
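As an illustration of what a generative model looks like in code, here is a minimal Python sketch for a toy one-dimensional problem. Everything in it is an assumption made for the example: the function name `generate`, the constant `LIGHT`, the noise model, and the step cost are hypothetical and are not taken from the experiments in this paper.

```python
import random

LIGHT = 10  # hypothetical location where observations are most accurate


def generate(s, a, rng):
    """Toy generative model G(s, a) -> (s', o, r) for a POMDP.

    The state is an integer position, the action is an integer move,
    and the observation is a noisy reading of the new position.
    """
    sp = s + a                          # deterministic transition (assumed)
    sigma = abs(sp - LIGHT) + 1e-3      # noise grows with distance from LIGHT (assumed)
    o = rng.gauss(sp, sigma)            # continuous observation
    r = -1.0                            # constant step cost (assumed)
    return sp, o, r
```

A planner that relies only on `generate` never needs the densities $\mathcal{T}$ or $\mathcal{Z}$ in closed form; it only draws samples such as `generate(3, 1, random.Random(0))`.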
Every POMDP is equivalent to an MDP where the state space of the MDP is the space of possible beliefs. The reward function of this “belief MDP” is the expectation of the state-action reward function with respect to the belief. The Bayesian update of the belief serves as a generative model for the belief space MDP.
Monte Carlo Tree Search (MCTS) is an effective and widely studied algorithm for online decision-making [Browne et al.2012]. It works by incrementally creating a policy tree consisting of alternating layers of state nodes and action nodes using a generative model $G$ and estimating the state-action value function, $Q(s, a)$, at each of the action nodes. The Upper Confidence Tree (UCT) version expands the tree by selecting nodes that maximize the upper confidence bound
$$UCB(s, a) = Q(s, a) + c \sqrt{\frac{\log N(s)}{N(s, a)}},$$
where $N(s, a)$ is the number of times the action node has been visited, $N(s) = \sum_{a \in \mathcal{A}} N(s, a)$, and $c$ is a problem-specific parameter that governs the amount of exploration in the tree [Browne et al.2012].
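The UCT selection rule can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the experiments; the function name `ucb_select` and the dictionary layout are assumptions, and the score is the standard bound $Q(s,a) + c\sqrt{\log N(s) / N(s,a)}$ with unvisited actions given infinite priority.

```python
import math


def ucb_select(children, c):
    """Pick the action with the largest upper confidence bound.

    children maps each action to a pair (N(s, a), Q(s, a)):
    the visit count and the current value estimate of that action node.
    """
    total = sum(n for n, _ in children.values())  # N(s)

    def score(action):
        n, q = children[action]
        if n == 0:
            return math.inf  # untried actions are always preferred
        return q + c * math.sqrt(math.log(total) / n)

    return max(children, key=score)
```

With `c = 0` the rule is purely greedy; larger values of `c` shift visits toward rarely tried actions.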
In cases where the action and state spaces are large or continuous, the MCTS algorithm will produce trees that are very shallow. In fact, if the action space is continuous, the UCT algorithm will never try the same action twice (observe that $\lim_{N(s,a) \to 0} UCB(s, a) = \infty$, so untried actions are always favored). Moreover, if the state space is continuous and the transition probability density is finite, the probability of sampling the same state twice from $G$ is zero. Because of this, simulations will never pass through the same state node twice and a tree below the first layer of state nodes will never be constructed.
In progressive widening, the number of children of a node is artificially limited to $k N^{\alpha}$, where $N$ is the number of times the node has been visited and $k$ and $\alpha$ are hyper-parameters (see Appendix B) [Couëtoux et al.2011]. Originally, progressive widening was applied to the action space and was found to be especially effective when a set of preferred actions was tried first [Browne et al.2012]. The term double progressive widening refers to progressive widening in both the state and action space. When the number of state nodes is greater than $k N^{\alpha}$, instead of simulating a new state transition, one of the previously generated states is chosen with probability proportional to the number of times it has been previously generated.
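The widening test described above can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `progressive_widen` and its argument layout are hypothetical, a new child is added while the number of children is at most $k N^{\alpha}$, and generation counts are only incremented when a child is first created.

```python
import random


def progressive_widen(children, counts, n_visits, k, alpha, sample_new, rng):
    """Return a child of a node, adding a new one only while widening allows it.

    children: list of child values already in the tree
    counts:   counts[i] is how many times children[i] has been generated
    n_visits: number of times the parent node has been visited (N)
    """
    if len(children) <= k * n_visits ** alpha:
        child = sample_new()  # simulate a new action or state transition
        children.append(child)
        counts.append(1)
        return child
    # Limit reached: reuse an existing child, chosen in proportion to its count.
    return rng.choices(children, weights=counts, k=1)[0]
```

Because the limit grows with the visit count, frequently visited nodes become wider over time while rarely visited nodes stay narrow, which lets the tree grow deeper.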
A conceptually straightforward way to solve a POMDP using MCTS is to apply it to the corresponding belief MDP. Indeed, many tree search techniques have been applied to POMDP problems in this way [Ross et al.2008]. However, when the Bayesian belief update is used, this approach is computationally expensive. POMCP and its successors, DESPOT and ABT, can tackle problems many times larger than their predecessors because they use state trajectory simulations, rather than full belief trajectories, to build the tree.
Each of the nodes in a POMCP tree corresponds to a history proceeding from the root belief and terminating with an action or observation. In the search phase of POMCP tree construction, state trajectories are simulated through this tree. At each action node, the rewards from the simulations that pass through the node are used to estimate the $Q$ function. This simple approach has been shown to work well for large discrete problems [Silver and Veness2010]. However, when the action or observation space is continuous, the tree degenerates and does not extend beyond a single layer of nodes because each new simulation produces a new branch.
This section presents several MCTS algorithms for POMDPs including the new POMCPOW and PFT-DPW approaches.
The three algorithms in this section share a common structure. For all algorithms, the entry point for the decision making process is the Plan procedure, which takes the current belief, $b$, as an input (Plan differs slightly for PFT-DPW in Algorithm 3). The algorithms also share the same ActionProgWiden function to control progressive widening of the action space. These components are listed in Listing 1. The difference between the algorithms is in the Simulate function.
The following variables are used in the listings and text: $h$ represents a history $(b, a_1, o_1, \ldots, a_k, o_k)$, and $ha$ and $hao$ are shorthand for histories with $a$ and $(a, o)$ appended to the end, respectively; $d$ is the depth to explore, with $d_{max}$ the maximum depth; $C$ is a list of the children of a node (along with the reward in the case of PFT-DPW); $N$ is a count of the number of visits; and $M$ is a count of the number of times that a history has been generated by the model. The list of states associated with a node is denoted $B$, and $W$ is a list of weights corresponding to those states. Finally, $Q(ha)$ is an estimate of the value of taking action $a$ after observing history $h$. $C$, $N$, $M$, $B$, $W$, and $Q$ are all implicitly initialized to $0$ or $\emptyset$. The Rollout procedure runs a simulation with a default rollout policy, which can be based on the history or the fully observed state, for $d$ steps and returns the discounted reward.
The first algorithm that we consider is POMCP with double progressive widening (POMCP-DPW). In this algorithm, listed in Algorithm 1, the number of new children sampled from any node in the tree is limited by DPW using the parameters $k_a$, $\alpha_a$, $k_o$, and $\alpha_o$. In the case where the simulated observation is rejected (line 14), the tree search is continued with an observation selected in proportion to the number of times, $M$, it has been previously simulated (line 15) and a state is sampled from the associated belief (line 16).
This algorithm obtained remarkably good solutions for a very large autonomous freeway driving POMDP with multiple vehicles (up to 40 continuous fully observable state dimensions and 72 continuous correlated partially observable state dimensions) [Sunberg, Ho, and Kochenderfer2017]. To our knowledge, that is the first work applying progressive widening to POMCP, and it does not contain a detailed description of the algorithm or any theoretical or experimental analysis other than the driving application.
This algorithm may converge to the optimal solution for POMDPs with discrete observation spaces; however, on continuous observation spaces, POMCP-DPW is suboptimal. In particular, it finds a QMDP policy, that is, the solution under the assumption that the problem becomes fully observable after one time step [Littman, Cassandra, and Kaelbling1995, Kochenderfer2015]. In fact, for a modified version of POMCP-DPW, it is easy to prove analytically that it will converge to such a policy. This is expressed formally in Theorem 1 below. A complete description of the modified algorithm and problem requirements including the definitions of polynomial exploration, the regularity hypothesis for the problem, and exponentially sure convergence are given in Appendix C.
Let $Q_{MDP}(s, a)$ be the optimal state-action value function assuming full observability starting by taking action $a$ in state $s$. The QMDP value at belief $b$, $Q_{MDP}(b, a)$, is the expected value of $Q_{MDP}(s, a)$ when $s$ is distributed according to $b$.
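Written out, the definition above reads as follows. The notation is assumed for illustration: $Q_{MDP}(s, a)$ is the fully observable value, $b$ is a belief, and $\{s_i\}_{i=1}^{m}$ is an unweighted particle approximation of $b$.

```latex
Q_{MDP}(b, a) = \mathbb{E}_{s \sim b}\left[ Q_{MDP}(s, a) \right]
              = \int_{\mathcal{S}} b(s) \, Q_{MDP}(s, a) \, ds
              \approx \frac{1}{m} \sum_{i=1}^{m} Q_{MDP}(s_i, a),
\qquad
\pi_{QMDP}(b) = \operatorname*{arg\,max}_{a \in \mathcal{A}} Q_{MDP}(b, a)
```

Because the expectation is taken over values that already assume full observability, no action is ever credited for reducing uncertainty.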
Theorem 1. If a bounded-horizon POMDP meets the following conditions: 1) the state and observation spaces are continuous with a finite observation probability density function, and 2) the regularity hypothesis is met, then modified POMCP-DPW will produce a value function estimate, $\hat{Q}$, that converges to the QMDP value for the problem. Specifically, there exists a constant $C$ such that, after $n$ iterations, the error $|\hat{Q}(b, a) - Q_{MDP}(b, a)|$ is bounded by a quantity that depends on $C$ and $n$, exponentially surely in $n$, for every action $a$.
A proof of this theorem that leverages work by Auger, Couëtoux, and Teytaud [2013] is given in Appendix C, but we provide a brief justification here. The key is that belief nodes will contain only a single state particle (see Fig. 2). Because the observation space is continuous with a finite density function, the generative model will (with probability one) produce a unique observation each time it is queried. Thus, for every generated history $h$, only one state will ever be inserted into $B(h)$ (line 9, Algorithm 1), and therefore $h$ is merely an alias for that state. Since each belief node corresponds to a state, the solver is actually solving the fully observable MDP at every node except the root node, leading to a QMDP solution.
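The particle collapse is easy to reproduce numerically. The following Python sketch is a toy demonstration, not part of the proof: the two-state problem, the Gaussian observation noise, and the function name `particles_per_observation` are all assumptions. It groups simulated states by the exact observation they generated, which is how an unweighted belief node collects particles.

```python
import random
from collections import defaultdict


def particles_per_observation(n_sims, rng):
    """Count how many state particles end up in each observation node."""
    beliefs = defaultdict(list)
    for _ in range(n_sims):
        s = rng.choice([0.0, 1.0])   # hidden state (toy two-state problem)
        o = rng.gauss(s, 1.0)        # continuous observation with finite density
        beliefs[o].append(s)         # particles are grouped by exact observation
    return [len(particles) for particles in beliefs.values()]
```

In such a run every observation is expected to be distinct, so every belief node holds exactly one particle no matter how many simulations are performed.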
As a result of Theorem 1, the action chosen by modified POMCP-DPW will match a QMDP policy (a policy of actions that maximize the QMDP value) with high precision exponentially surely (see Corollary 1 of Auger, Couëtoux, and Teytaud [2013]). For many problems this is a very useful solution (indeed, a useful online QMDP tree search algorithm could be created by deliberately constructing a tree with a single root belief node and fully observable state nodes below it), but since it neglects the value of information, a QMDP policy is suboptimal for problems where information gathering is important [Littman, Cassandra, and Kaelbling1995, Kochenderfer2015].
Although Theorem 1 is only theoretically applicable to the modified version of POMCP-DPW, it helps explain the behavior of other solvers. Modified POMCP-DPW, POMCP-DPW, DESPOT, and ABT all share the characteristic that a belief node can only contain two states if they generated exactly the same observation. Since this is an event with zero probability for a continuous observation space, these solvers exhibit suboptimal, often QMDP-like, behavior. The experiments in Section 5 show this for POMCP-DPW and DESPOT, and this is presumably the case for ABT as well.
In order to address the suboptimality of POMCP-DPW, we now propose a new algorithm, POMCPOW, shown in Algorithm 2. In this algorithm, the belief updates are weighted, but they also expand gradually as more simulations are added. Furthermore, since the richness of the belief representation is related to the number of times the node is visited, beliefs that are more likely to be reached by the optimal policy have more particles. At each step, the simulated state is inserted into the weighted particle collection that represents the belief (line 10), and a new state is sampled from that belief (line 16). A simple illustration of the tree is shown in Figure 2 to contrast with a POMCP-DPW tree. Because the resampling in line 16 can be efficiently implemented with binary search, the computational complexity is $O(n d \log n)$ for $n$ iterations and search depth $d$.
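The weighted particle collection and its binary-search sampling step can be sketched as below. This is a minimal illustration rather than the authors' implementation: the class name `WeightedBelief` and the choice to store cumulative weights are assumptions made for the example. In POMCPOW the weight passed to `insert` would be the observation likelihood of the simulated state.

```python
import bisect
import random


class WeightedBelief:
    """Weighted particle collection with O(1) insertion and O(log n) sampling."""

    def __init__(self):
        self.states = []
        self.cum_weights = []  # running sum of weights, nondecreasing

    def insert(self, state, weight):
        total = self.cum_weights[-1] if self.cum_weights else 0.0
        self.states.append(state)
        self.cum_weights.append(total + weight)

    def sample(self, rng):
        # Draw u uniformly in [0, total weight) and locate it by binary search.
        u = rng.random() * self.cum_weights[-1]
        i = bisect.bisect_right(self.cum_weights, u)
        return self.states[min(i, len(self.states) - 1)]
```

Unlike the unweighted case, a node built this way can hold many states that generated different observations, each contributing in proportion to how well it explains the node's observation.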
Another algorithm that one might consider for solving continuous POMDPs online is MCTS-DPW on the equivalent belief MDP. Since the Bayesian belief update is usually computationally intractable, a particle filter is used. This new approach will be referred to as particle filter trees with double progressive widening (PFT-DPW). It is shown in Algorithm 3, where $G_{PF(m)}(b, a)$ is a particle filter belief update performed with a simulated observation and $m$ state particles, which approximates the belief MDP generative model. The authors are not aware of any mention of this algorithm in prior literature, but it is very likely that MCTS with particle filters has been used before without double progressive widening under another name.
PFT-DPW is fundamentally different from POMCP and POMCPOW because it relies on simulating approximate belief trajectories instead of state trajectories. This distinction also allows it to be applied to problems where the reward is a function of the belief rather than the state such as pure information-gathering problems [Dressel and Kochenderfer2017, Araya et al.2010].
The primary shortcoming of this algorithm is that the number of particles in the filter, $m$, must be chosen a priori and is static throughout the tree. Each time a new belief node is created, an $O(m)$ particle filter update is performed. If $m$ is too small, the beliefs may miss important states, but if $m$ is too large, constructing the tree is expensive. Fortunately, the experiments in Section 5 show that it is often easy to choose $m$ in practice; for all the problems we studied, the same value of $m$ resulted in good performance.
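One step of the approximate belief-MDP generative model used by PFT-DPW can be sketched as follows. This is a hedged illustration, not the listing from Algorithm 3: the function name `pf_belief_step`, the `generate(s, a, rng) -> (s', o, r)` and `obs_likelihood(o, a, s')` signatures, and the fallback for all-zero weights are assumptions.

```python
import random


def pf_belief_step(belief, a, generate, obs_likelihood, rng):
    """Approximate belief transition: returns (new particle belief, mean reward).

    belief is a list of m state particles; the new belief also has m particles.
    """
    m = len(belief)
    # Simulate an observation from a state drawn from the current belief.
    _, o, _ = generate(rng.choice(belief), a, rng)
    propagated, rewards, weights = [], [], []
    for s in belief:
        sp, _, r = generate(s, a, rng)
        propagated.append(sp)
        rewards.append(r)
        weights.append(obs_likelihood(o, a, sp))  # importance weight of sp
    mean_reward = sum(rewards) / m
    if sum(weights) <= 0.0:
        return propagated, mean_reward  # no particle explains o; skip resampling
    return rng.choices(propagated, weights=weights, k=m), mean_reward
```

Each call costs a number of model evaluations proportional to the number of particles, which is the source of the trade-off in choosing the particle count.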
It is important to note that, while POMCP, POMCP-DPW, and DESPOT only require a generative model of the problem, both POMCPOW and PFT-DPW require a way to query the relative likelihood of different observations ($\mathcal{Z}$ in line 11). One may object that this will limit the application of POMCPOW to a small class of POMDPs, but we think it will be an effective tool in practice for two reasons.
First, this requirement is no more stringent than the requirement for a standard importance resampling particle filter, and such filters are used widely, at least in the field of robotics that the authors are most familiar with. Moreover, if the observation model is complex, an approximate model may be sufficient.
Second, given the implications of Theorem 1, it is difficult to imagine a tree-based decision-making algorithm or a robust belief updater that does not require some way of measuring whether a state belongs to a belief or history. The observation model is a straightforward and standard way of specifying such a measure. Finally, in practice, except for the simplest of problems, using POMCP or DESPOT to repeatedly observe and act in an environment already requires more than just a generative model. For example, the authors of the original paper describing POMCP [Silver and Veness2010] use heuristic particle reinvigoration in lieu of an observation model and importance sampling.
Numerical simulation experiments were conducted to evaluate the performance of POMCPOW and PFT-DPW compared to other solvers. The open source code for the experiments is built on the POMDPs.jl framework [Egorov et al.2017] and is hosted at https://github.com/zsunberg/ContinuousPOMDPTreeSearchExperiments.jl. In all experiments, the solvers were limited to a fixed amount of computation time per step. Belief updates were accomplished with a particle filter independent of the planner, and no part of the tree was saved for re-use on subsequent steps. Hyperparameter values are shown in Appendix B.
Table 1: Mean reward for each solver on Laser Tag (D, D, D), Light Dark (D, D, C), Sub Hunt (D, D, C), and VDP Tag (C, C, C).
The three C or D characters after each problem name indicate whether the state, action, and observation spaces are continuous or discrete, respectively. For continuous problems, solvers with a superscript D were run on a version of the problem with discretized action and observation spaces, but they interacted with continuous simulations of the problem.
The Laser Tag benchmark is taken directly from the work of Somani et al. [2013] and is included for the sake of calibration. DESPOT outperforms the other methods. The score for DESPOT differs slightly from that reported by Somani et al. [2013], likely because of bounds implementation differences. POMCP performs much better than reported by Somani et al. [2013] because this implementation uses a state-based rollout policy.
In the Light Dark domain, the state is an integer, and the agent chooses how to move deterministically from a discrete action space. The goal is to reach the origin. If a designated action is taken at the origin, a reward is given and the problem terminates; if it is taken at another location, a penalty is given. There is a cost at each step before termination. The agent receives a more accurate observation in the “light” region. Specifically, observations are continuous.
Table 1 shows the mean reward over the simulations for each solver, and Fig. 3 shows an example experiment. The optimal strategy involves moving toward the light region and localizing before proceeding to the origin. QMDP and solvers predicted to behave like QMDP attempt to move directly to the origin, while POMCPOW and PFT-DPW perform better. In this one-dimensional case, discretization allows POMCP to outperform all other methods and DESPOT to perform well, but in subsequent problems where the observation space has more dimensions, discretization does not provide the same performance improvement (see Appendix A).
In the Sub Hunt domain, the agent is a submarine attempting to track and destroy an enemy sub. The state and action spaces are discrete so that QMDP can be used to solve the problem for comparison. The agent and the target each occupy a cell of a 20 by 20 grid. The target is either aware or unaware of the agent and seeks to reach a particular edge of the grid that is unknown to the agent. The target stochastically moves either two steps towards the goal or one step forward and one to the side. The agent has six actions: move three steps north, south, east, or west; engage the other submarine; or ping with active sonar. If the agent chooses to engage and the target is unaware and within a range of 2, a hit with reward 100 is scored. The problem ends when a hit is scored or the target reaches its goal edge.
An observation consists of 8 sonar returns at equally spaced angles. Each return gives a normally distributed estimate of the range to the target if the target is within that beam, and a measurement with higher variance if it is not. The range of the sensors depends on whether the agent decides to use active sonar. If the agent does not use active sonar, it can only detect the other submarine within a radius of 3, but pinging with active sonar will detect it at any range. However, active sonar alerts the target to the presence of the agent, and when the target is aware, the hit probability when engaging drops.
Table 1 shows the mean reward over the simulations for each solver. The optimal strategy includes using the active sonar, but previous approaches have difficulty determining this because of the reduced engagement success rate. The PFT-DPW approach has the best score, followed closely by POMCPOW. All other solvers have similar performance to QMDP.
The final experimental problem is called Van Der Pol tag and has continuous state, action, and observation spaces. In this problem, an agent moves through 2D space to try to tag a target that has a random unknown initial position. The agent always travels at the same speed, but chooses a direction of travel and whether to take an accurate observation. The observation again consists of 8 beams that give measurements to the target. Normally, these measurements are too noisy to be useful, but if the agent chooses an accurate measurement, at a cost, the observation has low noise. The agent is blocked if it comes into contact with one of the barriers that extend in each of the cardinal directions (see Fig. 4), while the target can move freely through them. There is a cost for each step, and a reward for tagging the target (coming within a threshold distance of it).
The target moves following a two-dimensional form of the Van Der Pol oscillation, defined by a pair of coupled differential equations. Gaussian noise is added to the position at the end of each step. Runge-Kutta fourth order integration is used to propagate the state.
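The target propagation can be sketched in Python as follows. This is an assumption-laden illustration: it uses one common two-dimensional (Liénard) form of the Van der Pol oscillator, $\dot{x} = \mu (x - x^3/3 - y)$, $\dot{y} = x / \mu$, and the default `mu`, the step size `dt`, and the noise level `sigma` are hypothetical values chosen for the example, not parameters confirmed by the text.

```python
import random


def vdp(x, y, mu):
    """Right-hand side of a two-dimensional Van der Pol system (assumed form)."""
    return mu * (x - x ** 3 / 3.0 - y), x / mu


def rk4_step(x, y, dt, mu=2.0):
    """Advance (x, y) by one classical fourth-order Runge-Kutta step."""
    k1x, k1y = vdp(x, y, mu)
    k2x, k2y = vdp(x + 0.5 * dt * k1x, y + 0.5 * dt * k1y, mu)
    k3x, k3y = vdp(x + 0.5 * dt * k2x, y + 0.5 * dt * k2y, mu)
    k4x, k4y = vdp(x + dt * k3x, y + dt * k3y, mu)
    x_new = x + dt * (k1x + 2.0 * k2x + 2.0 * k3x + k4x) / 6.0
    y_new = y + dt * (k1y + 2.0 * k2y + 2.0 * k3y + k4y) / 6.0
    return x_new, y_new


def target_step(x, y, dt, sigma, rng, mu=2.0):
    """Propagate the target, then add Gaussian noise to its position."""
    x, y = rk4_step(x, y, dt, mu)
    return x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma)
```

Each Runge-Kutta step evaluates the dynamics four times, which illustrates why state transitions in this problem are more expensive than in the grid-based domains.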
This problem has several challenging features that might be faced in real-world applications. First, the state transitions are more computationally expensive because of the numerical integration. Second, the continuous state space and obstacles make it difficult to construct a good heuristic rollout policy, so random rollouts are used. Table 1 shows the mean reward over the simulations of this problem for each solver. Since a POMCPOW iteration requires less computation than a PFT-DPW iteration, POMCPOW simulates more random rollouts and thus performs slightly better.
In this paper, we have proposed a new general-purpose online POMDP algorithm that is able to solve problems with continuous state, action, and observation spaces. This is a qualitative advance in capability over previous solution techniques, with the only major new requirement being explicit knowledge of the observation distribution.
This study has yielded several insights into the behavior of tree search algorithms for POMDPs. We explained why POMCP-DPW and other solvers are unable to choose costly information-gathering actions in continuous spaces, and showed that POMCPOW and PFT-DPW are both able to overcome this challenge. The computational experiments carried out for this work used only small toy problems (though they are quite large compared to many of the POMDPs previously studied in the literature), but other recent research [Sunberg, Ho, and Kochenderfer2017] shows that POMCP-DPW is effective in very large and complex realistic domains and thus provides clear evidence that POMCPOW will also be successful.
The theoretical properties of the algorithms remain to be proven. In addition, better ways for choosing continuous actions would provide an improvement. The techniques that others have studied for handling continuous actions such as generalized pattern search [Seiler, Kurniawati, and Singh2015] and hierarchical optimistic optimization [Mansley, Weinstein, and Littman2011] are complementary to this work, and the combination of these approaches will likely yield powerful tools for solving real problems.
Toyota Research Institute (“TRI”) provided funds to assist the authors with their research, but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.
The authors would also like to thank Zongzhang Zhang for his especially helpful comments and Auke Wiggers for catching several pseudocode mistakes.
Discretization is perhaps the most straightforward way to deal with continuous observation spaces. The results in Table 1 show that this approach is only sometimes effective. Figure 5 shows the performance at different discretization granularities for the Light Dark and Sub Hunt problems.
Since the Light Dark domain has only a single observation dimension, it is easy to discretize. In fact, POMCP with fine discretization outperforms POMCPOW. However, discretization is only effective at certain granularities, and this is highly dependent on the solver and possibly hyperparameters. In the Sub Hunt problem, with its high-dimensional observation, discretization is not effective at any granularity. In Van Der Pol tag, both the action and observation spaces must be discretized. Due to the high dimensionality of the observation space, similar to Sub Hunt, no discretization that resulted in good performance was found.
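As a minimal illustration of what "granularity" means in this comparison, the following Python sketch bins each coordinate of an observation with a fixed width. The function names, the bin widths, and the sample values are hypothetical and are not taken from the experiments; the sketch only shows why a uniform grid becomes large as the observation dimension grows.

```python
import math

def discretize(observation, bin_width):
    """Map each coordinate of a continuous observation to an integer bin index."""
    return tuple(math.floor(x / bin_width) for x in observation)

def num_cells(bins_per_dim, num_dims):
    """Number of distinct discrete observations on a uniform grid."""
    return bins_per_dim ** num_dims

# Two made-up one-dimensional observations.
coarse = (discretize((1.23,), 1.0), discretize((1.87,), 1.0))
fine = (discretize((1.23,), 0.5), discretize((1.87,), 0.5))
print(coarse)  # both observations share a bin on the coarse grid
print(fine)    # the finer grid separates them
print(num_cells(10, 1), num_cells(10, 8))  # grid size is exponential in dimension
```

With a coarse grid, distinct observations are merged and information is lost; with a fine grid in many dimensions, almost every sampled observation lands in its own cell, which brings back the problem that discretization was meant to avoid.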
Table 2: Hyperparameters used in the experiments for the Laser Tag, Light Dark, Sub Hunt, and VDP Tag problems.
For problems with discrete actions, all actions are considered, and $k_a$ and $\alpha_a$ are not needed.
Hyperparameters for POMCPOW and PFT-DPW were chosen using the cross entropy method [Mannor, Rubinstein, and Gat2003], but exact tuning was not a high priority and some parameters were re-used across solvers, so the parameters may not be perfectly optimized. The values used in the experiments are shown in Table 2. There are not enough experiments to draw broad conclusions about the hyperparameters, but it appears that performance is most sensitive to the exploration constant, $c$.
The values for the observation widening parameters, $k_o$ and $\alpha_o$, were similar for all the problems in this work. A small $\alpha_o$ essentially limits the number of observations to a static number $k_o$, resulting in behavior reminiscent of sparse UCT [Browne et al.2012], preventing unnecessary widening and allowing the tree to grow deep. This seems to work well in practice with the branching factor ($k_o$) set to values between and , and suggests that it may be sufficient to limit the number of children to a fixed number rather than do progressive widening in a real implementation.
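The effect of a small widening exponent can be checked with a short Python sketch. It assumes the widening test has the form "add a child while the number of children is at most $k N^{\alpha}$", where $N$ is the visit count of the parent; the function names and the parameter values below are hypothetical and are not the values used in the experiments.

```python
def should_widen(num_children, num_visits, k, alpha):
    """Progressive widening test: allow a new child while |C| <= k * N^alpha."""
    return num_children <= k * num_visits ** alpha

def count_children(total_visits, k, alpha):
    """Count how many children a node accumulates over a number of visits."""
    children = 0
    for n in range(1, total_visits + 1):
        if should_widen(children, n, k, alpha):
            children += 1
    return children

# With a tiny exponent the child count saturates almost immediately,
# while a larger exponent keeps widening as visits accumulate.
print(count_children(1000, k=4, alpha=0.01))
print(count_children(1000, k=4, alpha=0.5))
```

Under these assumptions, the small-exponent case behaves like a fixed cap on the number of observation children, which is the behavior described above.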
A version of Monte Carlo tree search with double progressive widening has been proven to converge to the optimal value function on fully observable MDPs by Auger, Couetoux, and Teytaud (2013). We utilize this proof to show that POMCP-DPW converges to a solution that is sometimes suboptimal.
First we establish some preliminary definitions taken directly from Auger, Couetoux, and Teytaud (2013).
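The Regularity hypothesis is the assumption that for any $\Delta > 0$, there is a nonzero probability of sampling an action that is optimal with precision $\Delta$. More precisely, there is a $\theta > 0$ and a $p > 1$ (which remain the same during the whole simulation) such that for all $\Delta > 0$,
$$Q(ha) \geq Q^*(h) - \Delta \quad \text{with probability at least } \min(1, \theta \Delta^p).$$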
We say that some property depending on an integer $n$ is exponentially sure in $n$ if there exist positive constants $C$, $h$, and $\eta$ such that the probability that the property holds is at least
$$1 - C \exp(-h n^{\eta}).$$
In order for the proof from Auger, Couetoux, and Teytaud (2013) to apply, the following four minor modifications to the POMCP-DPW algorithm must be made:
1. Instead of the usual logarithmic exploration, use polynomial exploration, that is, select actions based on the criterion
$$Q(ha) + \sqrt{\frac{N(h)^{e_d}}{N(ha)}}$$
as opposed to the traditional criterion
$$Q(ha) + c \sqrt{\frac{\log N(h)}{N(ha)}},$$
and create a new node for progressive widening when $\lfloor N^{\alpha} \rfloor > \lfloor (N-1)^{\alpha} \rfloor$ rather than when the number of children is at most $k N^{\alpha}$.
2. Instead of performing rollout simulations, keep creating new single-child nodes until the maximum depth is reached.
3. In line 15, instead of selecting an observation randomly, select the observation that has been visited least proportionally to how many times it has been visited.
4. Use the depth-dependent coefficient values in Table 1 from Auger, Couetoux, and Teytaud (2013) instead of choosing static values.
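The difference between the two exploration rules and the two widening rules can be sketched in a few lines of Python. The formulas below are an assumption based on the polynomial exploration scheme of Auger, Couetoux, and Teytaud (2013): a bonus of $\sqrt{N(h)^{e_d}/N(ha)}$ in place of $c\sqrt{\log N(h)/N(ha)}$, and widening whenever $\lfloor N^{\alpha} \rfloor$ increases. The function names and the numeric values are hypothetical.

```python
import math

def ucb_score(q, n_parent, n_child, c):
    """Traditional logarithmic exploration: Q + c * sqrt(log N(h) / N(ha))."""
    return q + c * math.sqrt(math.log(n_parent) / n_child)

def polynomial_score(q, n_parent, n_child, e_d):
    """Polynomial exploration: Q + sqrt(N(h)^e_d / N(ha)), with depth-dependent e_d."""
    return q + math.sqrt(n_parent ** e_d / n_child)

def widens_at(n, alpha):
    """Widen on the n-th visit exactly when floor(n^alpha) increases."""
    return math.floor(n ** alpha) > math.floor((n - 1) ** alpha)

# Visits at which a new child is created for alpha = 0.5.
print([n for n in range(1, 17) if widens_at(n, 0.5)])
# Exploration bonuses for the same counts under the two rules.
print(ucb_score(0.0, 16, 4, c=1.0), polynomial_score(0.0, 16, 4, e_d=0.5))
```

In this sketch the widening schedule depends only on the visit count, not on a comparison with the current number of children, which is the form the convergence proof relies on.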
This version of the algorithm will be referred to as “modified POMCP-DPW”. The algorithm with these changes is listed in Algorithm 4.
We now define the “QMDP value” that POMCP-DPW converges to (this is repeated from the main text of the paper) and prove a preliminary lemma.
Let $Q_{MDP}(s, a)$ be the optimal state-action value function assuming full observability starting by taking action $a$ in state $s$. The QMDP value at belief $b$, $Q_{MDP}(b, a)$, is the expected value of $Q_{MDP}(s, a)$ when $s$ is distributed according to $b$.
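To make the definition concrete, the following Python sketch computes QMDP values for a small tiger-style problem. The problem, its rewards, and all names are hypothetical and are not one of the benchmark problems in this paper; the fully observable values are obtained with plain value iteration, and the QMDP value is their expectation under the belief.

```python
STATES = ["tiger_left", "tiger_right", "done"]
ACTIONS = ["open_left", "open_right", "listen"]

def transition(s, a):
    """Deterministic toy dynamics: opening a door ends the episode, listening does not."""
    if s == "done" or a != "listen":
        return [(1.0, "done")]
    return [(1.0, s)]

def reward(s, a):
    if s == "done":
        return 0.0
    if a == "listen":
        return -1.0
    opened_tiger = (a == "open_left") == (s == "tiger_left")
    return -100.0 if opened_tiger else 10.0

def value_iteration(gamma, iters=50):
    """Fully observable state-action values Q_MDP(s, a)."""
    q = {s: {a: 0.0 for a in ACTIONS} for s in STATES}
    for _ in range(iters):
        v = {s: max(q[s].values()) for s in STATES}
        q = {s: {a: reward(s, a) + gamma * sum(p * v[s2] for p, s2 in transition(s, a))
                 for a in ACTIONS} for s in STATES}
    return q

def qmdp_value(belief, q, a):
    """Q_MDP(b, a): expectation of Q_MDP(s, a) with s distributed according to b."""
    return sum(p * q[s][a] for s, p in belief.items())

q = value_iteration(gamma=0.95)
belief = {"tiger_left": 0.5, "tiger_right": 0.5}
for a in ACTIONS:
    print(a, qmdp_value(belief, q, a))
```

In this toy problem listening reveals nothing, yet its QMDP value is high because the computation assumes the state will be known at the next step. This is the sense in which the QMDP value ignores the cost of remaining uncertain.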
Lemma 1. If POMCP-DPW or modified POMCP-DPW is applied to a POMDP with a continuous observation space and observation probability density functions that are finite everywhere, then each history node in the tree will have only one corresponding state, that is, $|B(h)| = M(h) = 1$.
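The intuition behind this lemma can be checked numerically. The Python sketch below uses a hypothetical one-dimensional model with Gaussian noise (not one of the paper's problems) and groups sampled next states by the observation they generate, in the way a search tree would group them under observation nodes. With a continuous observation every sample gets its own node; rounding the observation to an integer is included only as a contrast.

```python
import random
from collections import defaultdict

def simulate_children(num_samples, discretized=False, seed=0):
    """Group sampled next states by the observation they generate."""
    rng = random.Random(seed)
    children = defaultdict(list)  # observation -> states stored under that node
    for _ in range(num_samples):
        s_next = rng.gauss(0.0, 1.0)        # hypothetical next state
        obs = s_next + rng.gauss(0.0, 0.5)  # noisy continuous observation
        key = round(obs) if discretized else obs
        children[key].append(s_next)
    return children

continuous = simulate_children(1000)
discrete = simulate_children(1000, discretized=True)
print(len(continuous), max(len(v) for v in continuous.values()))
print(len(discrete), max(len(v) for v in discrete.values()))
```

Because a density that is finite everywhere assigns zero probability to any single observation value, repeated observations essentially never occur, so each observation node ends up holding a single state.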
We are now ready to restate and prove the theorem from the text.
We prove that modified POMCP-DPW functions exactly as the Polynomial UCT (PUCT) algorithm defined by Auger, Couetoux, and Teytaud (2013) applied to an augmented fully observable MDP, and hence converges to the QMDP value. We show this by proposing incremental changes to Algorithm 4 that do not change its function and that result in an algorithm identical to PUCT.
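Before listing the changes, we define the “augmented fully observable MDP” as follows: For a POMDP $\mathcal{P} = (\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \mathcal{O}, \mathcal{Z}, \gamma)$, and belief $b$, the augmented fully observable MDP, $\mathcal{M}_b$, is the MDP defined by $(\mathcal{S}_b, \mathcal{A}, \mathcal{T}_b, \mathcal{R}, \gamma)$, where
$$\mathcal{S}_b = \mathcal{S} \cup \{b\}$$
and, for all $x, x' \in \mathcal{S}_b$,
$$\mathcal{T}_b(x' \mid x, a) = \begin{cases} \mathcal{T}(x' \mid x, a) & \text{if } x \in \mathcal{S} \\ \int_{\mathcal{S}} b(s)\, \mathcal{T}(x' \mid s, a)\, ds & \text{if } x = b. \end{cases}$$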
This is simply the fully observable MDP augmented with a special state representing the current belief. It is clear that the value function for this problem is the same as the QMDP value for the POMDP, $Q_{MDP}(b, a)$. Thus, by showing that modified POMCP-DPW behaves exactly as PUCT applied to $\mathcal{M}_b$, we show that it estimates the QMDP values.
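One way to picture the augmented MDP is through its generative model. The Python sketch below wraps a hypothetical generative model `G` of the underlying MDP so that a special belief state first draws an ordinary state from the belief and then steps it. The names `G`, `G_b`, `BELIEF`, and `sample_belief`, as well as the toy dynamics, are assumptions made for illustration and are not taken from the paper's algorithms.

```python
import random

BELIEF = object()  # sentinel standing for the special belief state

def make_augmented_generator(G, sample_belief):
    """Build a generative model for the augmented MDP from G and a belief sampler."""
    def G_b(x, a, rng):
        # From the belief state, draw an ordinary state first; otherwise use x directly.
        s = sample_belief(rng) if x is BELIEF else x
        return G(s, a, rng)
    return G_b

def G(s, a, rng):
    """Toy deterministic model: move by a, pay the distance from the origin."""
    s_next = s + a
    return s_next, -abs(s_next)

def sample_belief(rng):
    """Toy belief with two equally likely states."""
    return rng.choice([-1.0, 1.0])

G_b = make_augmented_generator(G, sample_belief)
rng = random.Random(0)
print(G_b(2.0, 1.0, rng))     # ordinary state: identical to G
print(G_b(BELIEF, 0.0, rng))  # belief state: steps a state drawn from the belief
```

After the first step from the belief state, the process only ever visits ordinary states, so everything below the root is a search on the fully observable MDP.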
Consider the following modifications to Algorithm 4 that do not change its behavior when the observation space is continuous:
1. Eliminate the state count $M$. Justification: By Lemma 1, its value will be 1 for every node.
2. Remove $B$ and replace it with a mapping $H$ from each node to a state of $\mathcal{M}_b$; define $H(b) = b$. Justification: By Lemma 1, $B$ always contains only a single state, so $H$ contains the same information.
3. Generate states and rewards with $G_b$, the generative model of $\mathcal{M}_b$, instead of $G$. Justification: Since the state transition model for the fully observable MDP is the same as that of the POMDP, these are equivalent for all $s \in \mathcal{S}$.
4. Remove the $s$ argument of Simulate. Justification: The sampling in line 3 is done implicitly in $G_b$ if $h = b$, and is redundant in other cases because $h$ can be mapped to $s$ through $H$.
The result of these changes is shown in Algorithm 5. It is straightforward to verify that this algorithm is equivalent to PUCT applied to $\mathcal{M}_b$. Each observation-terminated history, $h$, corresponds to a PUCT “decision node”, $z$, and each action-terminated history, $ha$, corresponds to a PUCT “chance node”, $w$. In other words, the observations have no meaning in the tree other than making up the histories, which are effectively just keys or aliases for the state nodes.
Since PUCT is guaranteed by Theorem 1 of Auger, Couetoux, and Teytaud (2013) to converge to the optimal value function of $\mathcal{M}_b$ exponentially surely, modified POMCP-DPW is guaranteed to converge to the QMDP value exponentially surely, and the theorem is proven.
One may object that multiple histories may map to the same state through $H$, and thus the history nodes in a modified POMCP-DPW tree are not equivalent to state nodes in the PUCT tree. In fact, the PUCT algorithm does not check to see if a state has previously been generated by the model, so it may also contain multiple decision nodes that correspond to the same state. Though this is not explicitly stated by the authors, it is clear from the algorithm description, and the proof still holds.