1 Introduction
The concept of planning has a long tradition in the AI literature [Fikes & Nilsson, 1971; Chapman, 1987; McAllester & Rosenblitt, 1991; Penberthy & Weld, 1992; Blum & Furst, 1997]. Classical planning is generally concerned with agents which operate in environments that are fully observable, deterministic, finite, static, and discrete. While these techniques are able to solve increasingly large state-space problems, the basic assumptions of classical planning (full observability, static environment, deterministic actions) make these techniques unsuitable for most robotic applications.
Planning under uncertainty aims to improve robustness by explicitly reasoning about the type of uncertainty that can arise. The Partially Observable Markov Decision Process (POMDP) [Åström, 1965; Sondik, 1971; Monahan, 1982; White, 1991; Lovejoy, 1991b; Kaelbling et al., 1998; Boutilier et al., 1999] has emerged as possibly the most general representation for (single-agent) planning under uncertainty. The POMDP supersedes other frameworks in terms of representational power simply because it combines the most essential features for planning under uncertainty.
First, POMDPs handle uncertainty in both action effects and state observability, whereas many other frameworks handle neither of these, and some handle only stochastic action effects. To handle partial state observability, plans are expressed over information states, instead of world states, since the latter are not directly observable. The space of information states is the space of all beliefs a system might have regarding the world state. Information states are easily calculated from the measurements of noisy and imperfect sensors. In POMDPs, information states are typically represented by probability distributions over world states.
Second, many POMDP algorithms form plans by optimizing a value function. This is a powerful approach to plan optimization, since it allows one to numerically trade off between alternative ways to satisfy a goal, compare actions with different costs/rewards, and plan for multiple interacting goals. While value function optimization is used in other planning approaches, for example Markov Decision Processes (MDPs) [Bellman, 1957], POMDPs are unique in expressing the value function over information states, rather than world states.
Finally, whereas classical and conditional planners produce a sequence of actions, POMDPs produce a full policy for action selection, which prescribes the choice of action for any possible information state. By producing a universal plan, POMDPs alleviate the need for replanning, and allow fast execution. Naturally, the main drawback of optimizing a universal plan is the computational complexity of doing so. This is precisely what we seek to alleviate with the work described in this paper.
Most known algorithms for exact planning in POMDPs operate by optimizing the value function over all possible information states (also known as beliefs). These algorithms can run into the well-known curse of dimensionality, where the dimensionality of the planning problem is directly related to the number of states [Kaelbling et al., 1998]. But they can also suffer from the lesser-known curse of history, where the number of belief-contingent plans increases exponentially with the planning horizon. In fact, exact POMDP planning is known to be PSPACE-complete, whereas propositional planning is only NP-complete [Littman, 1996]. As a result, many POMDP domains with only a few states, actions and sensor observations are computationally intractable.

A commonly used technique for speeding up POMDP solving involves selecting a finite set of belief points and performing value backups on this set [Sondik, 1971; Cheng, 1988; Lovejoy, 1991a; Hauskrecht, 2000; Zhang & Zhang, 2001]. While the usefulness of belief point updates is well acknowledged, how and when these backups should be applied has not been thoroughly explored.
This paper describes a class of Point-Based Value Iteration (PBVI) POMDP approximations where the value function is estimated based strictly on point-based updates. In this context, the choice of points is an integral part of the algorithm, and our approach interleaves value backups with steps of belief point selection. One of the key contributions of this paper is the presentation and analysis of a set of heuristics for selecting informative belief points. These range from a naive version that combines point-based value updates with random belief point selection, to a sophisticated algorithm that combines the standard point-based value update with an estimate of the error bound between the approximate and exact solutions to select belief points. Empirical and theoretical evaluation of these techniques reveals the importance of taking distance between points into consideration when selecting belief points. The result is an approach which exhibits good performance with very few belief points (sometimes fewer than the number of states), thereby overcoming the curse of history.
The PBVI class of algorithms has a number of important properties, which are discussed at greater length in the paper:

Theoretical guarantees. We present a bound on the error of the value function obtained by point-based approximation, with respect to the exact solution. This bound applies to a number of point-based approaches, including our own PBVI, Perseus [Spaan & Vlassis, 2005], and others.

Scalability. We are able to handle problems an order of magnitude larger (in the size of the state space) than problems solved by more traditional POMDP techniques. The empirical performance is evaluated extensively in realistic robot tasks, including a search-for-missing-person scenario.

Wide applicability. The approach makes few assumptions about the nature or structure of the domain. The PBVI framework does assume known discrete state/action/observation spaces and a known model (i.e., state-to-state transitions, observation probabilities, costs/rewards), but no additional specific structure (e.g., constrained policy class, factored model).

Anytime performance. An anytime solution can be achieved by gradually alternating phases of belief point selection and phases of pointbased value updates. This allows for an effective tradeoff between planning time and solution quality.
While PBVI has many important properties, there are a number of other recent POMDP approaches which exhibit competitive performance [Braziunas & Boutilier, 2004; Poupart & Boutilier, 2004; Smith & Simmons, 2004; Spaan & Vlassis, 2005]. We provide an overview of these techniques in the later part of the paper. We also provide a comparative evaluation of these algorithms and PBVI using standard POMDP domains, in an effort to guide practitioners in their choice of algorithm. One of the algorithms, Perseus [Spaan & Vlassis, 2005], is most closely related to PBVI both in design and in performance. We therefore provide a direct comparison of the two approaches using a realistic robot task, in an effort to shed further light on the comparative strengths and weaknesses of these two approaches.
The paper is organized as follows. Section 2 begins by exploring the basic concepts in POMDP solving, including representation, inference, and exact planning. Section 3 presents the general anytime PBVI algorithm and its theoretical properties. Section 4 discusses novel strategies to select good belief points. Section 6 presents an empirical comparison of POMDP algorithms using standard simulation problems. Section 7 pursues the empirical evaluation by tackling complex robot domains and directly comparing PBVI with Perseus. Finally, Section 5 surveys a number of existing POMDP approaches that are closely related to PBVI.
2 Review of POMDPs
Partially Observable Markov Decision Processes provide a general planning and decision-making framework for acting optimally in partially observable domains. They are well-suited to a great number of real-world problems where decision-making is required despite prevalent uncertainty. They generally assume a complete and correct world model, with stochastic state transitions, imperfect state tracking, and a reward structure. Given this information, the goal is to find an action strategy which maximizes expected reward gains. This section first establishes the basic terminology and essential concepts pertaining to POMDPs, and then reviews optimal techniques for POMDP planning.
2.1 Basic POMDP Terminology
Formally, a POMDP is defined by six distinct quantities, denoted $(S, A, Z, T, O, R)$. The first three of these are:

States. The state of the world is denoted $s$, with the finite set of all states denoted by $S$. The state at time $t$ is denoted $s_t$, where $t$ is a discrete time index. The state is not directly observable in POMDPs, where an agent can only compute a belief over the state space $S$.

Observations. To infer a belief regarding the world's state $s$, the agent can take sensor measurements. The set of all measurements, or observations, is denoted $Z$. The observation at time $t$ is denoted $z_t$. Observation $z_t$ is usually an incomplete projection of the world state $s_t$, contaminated by sensor noise.

Actions. To act in the world, the agent is given a finite set of actions, denoted $A$. Actions stochastically affect the state of the world. Choosing the right action as a function of history is the core problem in POMDPs.
Throughout this paper, we assume that states, actions and observations are discrete and finite. For mathematical convenience, we also assume that actions and observations are alternated over time.
To fully define a POMDP, we have to specify the probabilistic laws that describe state transitions and observations. These laws are given by the following distributions:

The state transition probability distribution,
$$T(s, a, s') := p(s_{t+1} = s' \mid s_t = s, a_t = a), \qquad (1)$$
is the probability of transitioning to state $s'$, given that the agent is in state $s$ and selects action $a$, for any $(s, a, s')$. Since $T$ is a conditional probability distribution, we have $\sum_{s' \in S} T(s, a, s') = 1$. As our notation suggests, $T$ is time-invariant.
The observation probability distribution,
$$O(s, a, z) := p(z_t = z \mid s_t = s, a_{t-1} = a), \qquad (2)$$
is the probability that the agent will perceive observation $z$ upon executing action $a$ and reaching state $s$. This conditional probability is defined for all $(s, a, z)$ triplets, for which $\sum_{z \in Z} O(s, a, z) = 1$. The probability function $O$ is also time-invariant.
Finally, the objective of POMDP planning is to optimize action selection, so the agent is given a reward function describing its performance:

The reward function, $R(s, a)$, assigns a numerical value quantifying the utility of performing action $a$ when in state $s$. We assume the reward is bounded, $R_{\min} \le R(s, a) \le R_{\max}$. The goal of the agent is to collect as much reward as possible over time. More precisely, it wants to maximize the sum:
$$E\left[\sum_{t=0}^{\infty} \gamma^t\, r_t\right], \qquad (3)$$
where $r_t$ is the reward at time $t$, $E[\cdot]$ is the mathematical expectation, and $\gamma \in [0, 1)$ is a discount factor, which ensures that the sum in Equation 3 is finite.
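To make the objective of Equation 3 concrete, the following sketch computes a truncated discounted return. The reward trace and discount value are illustrative toy numbers, not taken from the paper's domains:

```python
# Discounted return of Equation 3, truncated to a finite trace.
# The reward sequence and gamma below are illustrative toy values.

def discounted_return(rewards, gamma=0.95):
    """Sum of gamma^t * r_t over a finite reward trace."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# With a constant reward of 1, the infinite sum converges to 1/(1 - gamma);
# a long finite trace gets arbitrarily close to that limit.
approx = discounted_return([1.0] * 1000, gamma=0.9)
exact_limit = 1.0 / (1.0 - 0.9)
```

Because $\gamma < 1$ geometrically down-weights later rewards, truncating the sum after $T$ steps introduces an error of at most $\gamma^T R_{\max} / (1 - \gamma)$.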
These items together, the states $S$, actions $A$, observations $Z$, reward $R$, and the probability distributions $T$ and $O$, define the probabilistic world model that underlies each POMDP.
2.2 Belief Computation
POMDPs are instances of Markov processes, which implies that the current world state, $s_t$, is sufficient to predict the future, independent of the past $\{s_0, s_1, \ldots, s_{t-1}\}$. The key characteristic that sets POMDPs apart from many other probabilistic models (such as MDPs) is the fact that the state $s_t$ is not directly observable. Instead, the agent can only perceive observations $\{z_1, \ldots, z_t\}$, which convey incomplete information about the world's state.
Given that the state is not directly observable, the agent can instead maintain a complete trace of all observations and all actions it ever executed, and use this to select its actions. The action/observation trace is known as a history. We formally define
$$h_t := \{a_0, z_1, \ldots, a_{t-1}, z_t\} \qquad (4)$$
to be the history at time $t$.
This history trace can get very long as time goes on. A well-known fact is that this history does not need to be represented explicitly, but can instead be summarized via a belief distribution [Åström, 1965], which is the following posterior probability distribution:
$$b_t(s) := p(s_t = s \mid h_t, b_0). \qquad (5)$$
This of course requires knowing the initial state probability distribution:
$$b_0(s) := p(s_0 = s), \qquad (6)$$
which defines the probability that the domain is in state $s$ at time $t = 0$. It is common either to specify this initial belief as part of the model, or to give it only to the runtime system which tracks beliefs and selects actions. For our work, we will assume that this initial belief (or a set of possible initial beliefs) is available to the planner.
Because the belief distribution $b_t$ is a sufficient statistic for the history, it suffices to condition the selection of actions on $b_t$, instead of on the ever-growing sequence of past observations and actions. Furthermore, the belief $b_t$ at time $t$ is calculated recursively, using only the belief one time step earlier, $b_{t-1}$, along with the most recent action $a_{t-1}$ and observation $z_t$.
We define the belief update equation, $\tau(\cdot)$, as:
$$b_t(s') = \tau(b_{t-1}, a_{t-1}, z_t)(s') = \frac{O(s', a_{t-1}, z_t) \sum_{s \in S} T(s, a_{t-1}, s')\, b_{t-1}(s)}{p(z_t \mid b_{t-1}, a_{t-1})}, \qquad (7)$$
where the denominator is a normalizing constant.
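As a concrete illustration, the belief update of Equation 7 can be sketched in a few lines of Python. The model arrays `T[s, a, s']` and `O[s', a, z]` and the two-state toy numbers below are assumptions for illustration only:

```python
# Bayes-filter belief update of Equation 7 on a toy 2-state POMDP.
# T[s, a, s'] and O[s', a, z] are assumed model arrays.
import numpy as np

def belief_update(b, a, z, T, O):
    """tau(b, a, z): posterior over states after doing a and seeing z."""
    # Prediction step: push the belief through the transition model.
    predicted = T[:, a, :].T @ b                 # p(s' | b, a)
    # Correction step: weight by the observation likelihood.
    unnormalized = O[:, a, z] * predicted
    return unnormalized / unnormalized.sum()     # divide by p(z | b, a)

# Two states, one action, two observations (illustrative numbers).
T = np.array([[[0.9, 0.1]], [[0.2, 0.8]]])      # T[s, a, s']
O = np.array([[[0.8, 0.2]], [[0.3, 0.7]]])      # O[s', a, z]
b0 = np.array([0.5, 0.5])
b1 = belief_update(b0, a=0, z=0, T=T, O=O)
```

Note that the update is just a matrix-vector product followed by a pointwise reweighting and renormalization; its cost is quadratic in $|S|$.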
This equation is equivalent to the decades-old Bayes filter [Jazwinski, 1970], and is commonly applied in the context of hidden Markov models [Rabiner, 1989], where it is known as the forward algorithm. Its continuous generalization forms the basis of Kalman filters [Kalman, 1960].

It is interesting to consider the nature of belief distributions. Even for finite state spaces, the belief is a continuous quantity. It is defined over a simplex describing the space of all distributions over the state space $S$. For very large state spaces, calculating the belief update (Eqn 7) can be computationally challenging. Recent research has led to efficient techniques for belief state computation that exploit structure of the domain [Dean & Kanazawa, 1988; Boyen & Koller, 1998; Poupart & Boutilier, 2000; Thrun et al., 2000]. However, by far the most complex aspect of POMDP planning is the generation of a policy for action selection, which is described next. For example, in robotics, calculating beliefs over very large state spaces is easily done in real-time [Burgard et al., 1999]. In contrast, calculating optimal action selection policies exactly appears to be infeasible for environments with more than a few dozen states [Kaelbling et al., 1998], not directly because of the size of the state space, but because of the complexity of the optimal policies. Hence we assume throughout this paper that the belief can be computed accurately, and instead focus on the problem of finding good approximations to the optimal policy.
2.3 Optimal Policy Computation
The central objective of the POMDP perspective is to compute a policy for selecting actions. A policy is of the form:
$$a_t = \pi(b_t), \qquad (8)$$
where $b_t$ is a belief distribution and $a_t$ is the action chosen by the policy $\pi$.
Of particular interest is the notion of optimal policy, which is a policy that maximizes the expected future discounted cumulative reward:
$$\pi^* = \operatorname*{argmax}_{\pi}\; E\left[\sum_{t=0}^{\infty} \gamma^t\, r_t \,\middle|\, b_0, \pi\right]. \qquad (9)$$
There are two distinct but interdependent reasons why computing an optimal policy is challenging. The more widely-known reason is the so-called curse of dimensionality: in a problem with $n$ physical states, $\pi$ is defined over all belief states in an $(n-1)$-dimensional continuous space. The less-well-known reason is the curse of history: POMDP solving is in many ways like a search through the space of possible POMDP histories. It starts by searching over short histories (through which it can select the best short policies), and gradually considers increasingly long histories. Unfortunately the number of distinct possible action-observation histories grows exponentially with the planning horizon.
The two curses—dimensionality and history—often act independently: planning complexity can grow exponentially with horizon even in problems with only a few states, and problems with a large number of physical states may still only have a small number of relevant histories. Which curse is predominant depends both on the problem at hand, and the solution technique. For example, the belief point methods that are the focus of this paper specifically target the curse of history, leaving themselves vulnerable to the curse of dimensionality. Exact algorithms on the other hand typically suffer far more from the curse of history. The goal is therefore to find techniques that offer the best balance between both.
We now describe a straightforward approach to finding optimal policies, due to Sondik (1971). The overall idea is to apply multiple iterations of dynamic programming, to compute increasingly accurate values for each belief state $b$. Let $V$ be a value function that maps belief states to values in $\mathbb{R}$. Beginning with the initial value function:
$$V_0(b) = \max_{a \in A} \sum_{s \in S} R(s, a)\, b(s), \qquad (10)$$
the $t$-th value function is constructed from the $(t-1)$-th by the following recursive equation:
$$V_t(b) = \max_{a \in A} \left[ \sum_{s \in S} R(s, a)\, b(s) + \gamma \sum_{z \in Z} p(z \mid b, a)\, V_{t-1}\big(\tau(b, a, z)\big) \right], \qquad (11)$$
where $\tau(b, a, z)$ is the belief updating function defined in Equation 7. This value function update maximizes the expected sum of all (possibly discounted) future payoffs the agent receives in the next $t$ time steps, for any belief state $b$. Thus, it produces a policy that is optimal under the planning horizon $t$. The optimal policy can also be directly extracted from the previous-step value function:
$$\pi_t^*(b) = \operatorname*{argmax}_{a \in A} \left[ \sum_{s \in S} R(s, a)\, b(s) + \gamma \sum_{z \in Z} p(z \mid b, a)\, V_{t-1}\big(\tau(b, a, z)\big) \right]. \qquad (12)$$
Sondik (1971) showed that the value function at any finite horizon $t$ can be expressed by a set of $\alpha$ vectors: $\Gamma_t = \{\alpha_0, \alpha_1, \ldots, \alpha_m\}$. Each $\alpha$ vector represents an $|S|$-dimensional hyperplane, and defines the value function over a bounded region of the belief:
$$V_t(b) = \max_{\alpha \in \Gamma_t} \sum_{s \in S} \alpha(s)\, b(s). \qquad (13)$$
In addition, each $\alpha$ vector is associated with an action, defining the best immediate policy assuming optimal behavior for the following $t-1$ steps (as defined respectively by the sets $\{\Gamma_{t-1}, \ldots, \Gamma_0\}$).
The horizon-$t$ solution set, $\Gamma_t$, can be computed as follows. First, we rewrite Equation 11 as:
$$V_t(b) = \max_{a \in A} \left[ \sum_{s \in S} R(s, a)\, b(s) + \gamma \sum_{z \in Z} \max_{\alpha \in \Gamma_{t-1}} \sum_{s \in S} \sum_{s' \in S} T(s, a, s')\, O(s', a, z)\, \alpha(s')\, b(s) \right]. \qquad (14)$$
Notice that in this representation of $V_t$, the nonlinearity in the term $p(z \mid b, a)$ from Equation 11 cancels out the nonlinearity in the term $\tau(b, a, z)$, leaving a linear function of $b(s)$ inside the max operator.
The value $V_t(b)$ cannot be computed directly for each belief $b$ (since there are infinitely many beliefs), but the corresponding set $\Gamma_t$ can be generated through a sequence of operations on the set $\Gamma_{t-1}$.
The first operation is to generate intermediate sets $\Gamma_t^{a,*}$ and $\Gamma_t^{a,z}$, $\forall a \in A, \forall z \in Z$ (Step 1):
$$\Gamma_t^{a,*} \leftarrow \alpha^{a,*}(s) = R(s, a)$$
$$\Gamma_t^{a,z} \leftarrow \alpha_i^{a,z}(s) = \gamma \sum_{s' \in S} T(s, a, s')\, O(s', a, z)\, \alpha_i(s'), \quad \forall \alpha_i \in \Gamma_{t-1}, \qquad (15)$$
where each $\alpha^{a,*}$ and $\alpha_i^{a,z}$ is once again an $|S|$-dimensional hyperplane.
Next we create $\Gamma_t^a$ ($\forall a \in A$), the cross-sum over observations^1, which includes one $\alpha^{a,z}$ from each $\Gamma_t^{a,z}$ (Step 2):
$$\Gamma_t^a = \Gamma_t^{a,*} \oplus \Gamma_t^{a,z_1} \oplus \Gamma_t^{a,z_2} \oplus \cdots \qquad (16)$$
^1The symbol $\oplus$ denotes the cross-sum operator. A cross-sum operation is defined over two sets, $A = \{a_1, a_2, \ldots\}$ and $B = \{b_1, b_2, \ldots\}$, and produces a third set, $\{a_1 + b_1, a_1 + b_2, \ldots, a_2 + b_1, a_2 + b_2, \ldots\}$.
Finally we take the union of the $\Gamma_t^a$ sets (Step 3):
$$\Gamma_t = \bigcup_{a \in A} \Gamma_t^a. \qquad (17)$$
This forms the pieces of the backup solution at horizon $t$. The actual value function $V_t$ is extracted from the set $\Gamma_t$ as described in Equation 13.
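Steps 1-3 above (Eqns 15-17) translate almost directly into code. The following sketch performs one exact, unpruned backup; the model arrays and toy problem sizes are illustrative assumptions, and in practice the returned set would be pruned as discussed below:

```python
# One exact value-iteration backup (Equations 15-17) for a generic
# finite POMDP. Model arrays T[s,a,s'], O[s',a,z], R[s,a] are assumed.
import itertools
import numpy as np

def exact_backup(Gamma, T, O, R, gamma=0.95):
    """Return the (unpruned) vector set Gamma_t from Gamma_{t-1}."""
    A, Z = R.shape[1], O.shape[2]
    new_Gamma = []
    for a in range(A):
        # Step 1: projections Gamma^{a,*} and Gamma^{a,z}.
        alpha_star = R[:, a]
        proj = [[gamma * (T[:, a, :] * O[:, a, z]) @ alpha
                 for alpha in Gamma] for z in range(Z)]
        # Step 2: cross-sum over observations -- pick one alpha per z.
        for choice in itertools.product(*proj):
            new_Gamma.append(alpha_star + sum(choice))
    # Step 3 (union over actions) is already accumulated above.
    return new_Gamma

# Toy model: 2 states, 2 actions, 2 observations (illustrative numbers).
T = np.full((2, 2, 2), 0.5)              # uniform toy transitions
O = np.full((2, 2, 2), 0.5)              # uninformative toy observations
R = np.array([[1.0, 0.0], [0.0, 1.0]])   # R[s, a]
Gamma1 = exact_backup([np.zeros(2), np.ones(2)], T, O, R)
```

Note that the output size is $|A|\,|\Gamma_{t-1}|^{|Z|}$ (here $2 \cdot 2^2 = 8$ vectors), which is exactly the exponential growth that motivates pruning and, later, point-based updates.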
Using this approach, bounded-time POMDP problems with finite state, action, and observation spaces can be solved exactly given a choice of the horizon $T$. If the environment is such that the agent might not be able to bound the planning horizon in advance, the policy is an approximation to the optimal one whose quality improves in expectation with the planning horizon (assuming $\gamma < 1$).
As mentioned above, the value function $V_t$ can be extracted directly from the set $\Gamma_t$. An important aspect of this algorithm (and of all optimal finite-horizon POMDP solutions) is that the value function is guaranteed to be a piecewise linear, convex, and continuous function of the belief [Sondik, 1971]. The piecewise-linearity and continuity properties are a direct result of the fact that $V_t$ is composed of finitely many linear $\alpha$ vectors. The convexity property is a result of the maximization operator (Eqn 13). It is worth pointing out that the intermediate sets $\Gamma_t^{a,*}$, $\Gamma_t^{a,z}$ and $\Gamma_t^a$ also represent functions of the belief which are composed entirely of linear segments. This property holds for the intermediate representations because they incorporate the expectation over observation probabilities (Eqn 15).
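A tiny numerical illustration of Equation 13: the value function is the upper surface of a set of linear functions of the belief, hence piecewise linear and convex. The $\alpha$ vectors below are made-up numbers purely to show the max-of-linear form:

```python
# Evaluating a piecewise-linear convex value function (Equation 13):
# V(b) is the max over alpha vectors of alpha . b. Toy vectors only.
import numpy as np

def value(b, Gamma):
    """V(b) = max over alpha in Gamma of alpha . b."""
    return max(float(alpha @ b) for alpha in Gamma)

Gamma = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.6, 0.6])]
# Sample V along the belief segment b = (p, 1-p).
vals = [value(np.array([p, 1 - p]), Gamma) for p in (0.0, 0.5, 1.0)]
```

The midpoint value never exceeds the average of the endpoint values, which is the convexity property in miniature.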
In the worst case, the exact value update procedure described could require time doubly exponential in the planning horizon $T$ [Kaelbling et al., 1998]. To better understand the complexity of the exact update, let $|S|$ be the number of states, $|A|$ the number of actions, $|Z|$ the number of observations, and $|\Gamma_{t-1}|$ the number of $\alpha$ vectors in the previous solution set. Then Step 1 creates $|A|\,|Z|\,|\Gamma_{t-1}|$ projections and Step 2 generates $|A|\,|\Gamma_{t-1}|^{|Z|}$ cross-sums. So, in the worst case, the new solution requires:
$$|\Gamma_t| = |A|\,|\Gamma_{t-1}|^{|Z|} \qquad (18)$$
vectors to represent the value function at horizon $t$; these can be computed in time $O(|S|^2\,|A|\,|\Gamma_{t-1}|^{|Z|})$.
It is often the case that a vector in $\Gamma_t$ will be completely dominated by another vector over the entire belief simplex:
$$\alpha_i \cdot b < \alpha_j \cdot b, \quad \forall b \in \Delta. \qquad (19)$$
Similarly, a vector may be fully dominated by a set of other vectors (e.g., in Fig. 1, one vector is dominated by the combination of two others). Such a vector can then be pruned away without affecting the solution. Finding dominated vectors can be expensive. Checking whether a single vector is dominated requires solving a linear program with $|S|$ variables and $|\Gamma_t|$ constraints. Nonetheless it can be time-effective to apply pruning after each iteration to prevent an explosion of the solution size. In practice, $|\Gamma_t|$ often appears to grow singly exponentially in $t$, given clever mechanisms for pruning unnecessary linear functions. This enormous computational complexity has long been a key impediment toward applying POMDPs to practical problems.

2.4 Point-Based Value Backup
Exact POMDP solving, as outlined above, optimizes the value function over all beliefs. Many approximate POMDP solutions, including the PBVI approach proposed in this paper, gain computational advantage by applying value updates at specific (and few) belief points, rather than over all beliefs [Cheng, 1988; Zhang & Zhang, 2001; Poon, 2001]. These approaches differ significantly (and to great consequence) in how they select the belief points, but once a set of points is selected, the procedure for updating their value is standard. We now describe the procedure for updating the value function at a set of known belief points.
As in Section 2.3, the value function update is implemented as a sequence of operations on a set of $\alpha$ vectors. If we assume that we are only interested in updating the value function at a fixed set of belief points, $B = \{b_0, b_1, \ldots, b_q\}$, then it follows that the value function will contain at most one $\alpha$ vector for each belief point. The point-based value function is therefore represented by the corresponding set $\Gamma_t = \{\alpha_0, \alpha_1, \ldots, \alpha_q\}$.
Given a solution set $\Gamma_{t-1}$, we simply modify the exact backup operator (Eqn 14) such that only one $\alpha$ vector per belief point is maintained. The point-based backup now gives an $\alpha$ vector which is valid over a region around its belief point $b$. It assumes that the other belief points in that region have the same action choice and lead to the same facets of $V_{t-1}$ as the point $b$. This is the key idea behind all algorithms presented in this paper, and the reason for the large computational savings associated with this class of algorithms.
To obtain solution set $\Gamma_t$ from the previous set $\Gamma_{t-1}$, we begin once again by generating intermediate sets $\Gamma_t^{a,*}$ and $\Gamma_t^{a,z}$ (exactly as in Eqn 15) (Step 1):
$$\Gamma_t^{a,*} \leftarrow \alpha^{a,*}(s) = R(s, a)$$
$$\Gamma_t^{a,z} \leftarrow \alpha_i^{a,z}(s) = \gamma \sum_{s' \in S} T(s, a, s')\, O(s', a, z)\, \alpha_i(s'), \quad \forall \alpha_i \in \Gamma_{t-1}. \qquad (20)$$
Next, whereas performing an exact value update requires a cross-sum operation (Eqn 16), by operating over a finite set of points, we can instead use a simple summation. We construct $\Gamma_t^{a,b}$, $\forall a \in A, \forall b \in B$ (Step 2):
$$\Gamma_t^{a,b} = \Gamma_t^{a,*} + \sum_{z \in Z} \operatorname*{argmax}_{\alpha \in \Gamma_t^{a,z}} \sum_{s \in S} \alpha(s)\, b(s). \qquad (21)$$
Finally, we find the best action for each belief point (Step 3):
$$\alpha_b = \operatorname*{argmax}_{\Gamma_t^{a,b},\, \forall a \in A} \sum_{s \in S} \Gamma_t^{a,b}(s)\, b(s), \qquad (22)$$
$$\Gamma_t = \bigcup_{b \in B} \alpha_b. \qquad (23)$$
While these operations preserve only the best $\alpha$ vector at each belief point $b \in B$, an estimate of the value function at any belief in the simplex (including $b \notin B$) can be extracted from the set $\Gamma_t$ just as before:
$$V_t(b) = \max_{\alpha \in \Gamma_t} \sum_{s \in S} \alpha(s)\, b(s). \qquad (24)$$
To better understand the complexity of updating the value of a set of points $B$, let $|S|$ be the number of states, $|A|$ the number of actions, $|Z|$ the number of observations, and $|\Gamma_{t-1}|$ the number of $\alpha$ vectors in the previous solution set. As with an exact update, Step 1 creates $|A|\,|Z|\,|\Gamma_{t-1}|$ projections (in time $|S|^2\,|A|\,|Z|\,|\Gamma_{t-1}|$). Steps 2 and 3 then reduce this set to at most $|B|$ components (in time $|S|\,|A|\,|Z|\,|\Gamma_{t-1}|\,|B|$). Thus, a full point-based value update takes only polynomial time, and even more crucially, the size of the solution set $\Gamma_t$ remains constant at every iteration. The point-based value backup algorithm is summarized in Table 1.
Γ_t = BACKUP(B, Γ_{t-1})  1
For each action a ∈ A  2
For each observation z ∈ Z  3
For each solution vector α_i ∈ Γ_{t-1}  4
α_i^{a,z}(s) = γ Σ_{s'∈S} T(s,a,s') O(s',a,z) α_i(s'), ∀s ∈ S  5
End  6
Γ_t^{a,z} = ∪_i α_i^{a,z}  7
End  8
End  9
Γ_t = ∅  10
For each belief point b ∈ B  11
α_b = argmax_{a∈A} [ Σ_{s∈S} R(s,a) b(s) + Σ_{z∈Z} max_{α∈Γ_t^{a,z}} α·b ]  12
If (α_b ∉ Γ_t)  13
Γ_t = Γ_t ∪ {α_b}  14
End  15
Return Γ_t  16
Note that the algorithm as outlined in Table 1 includes a trivial pruning step (lines 13-14), whereby we refrain from adding to $\Gamma_t$ any vector already included in it. As a result, it is often the case that $|\Gamma_t| < |B|$. This situation arises whenever multiple nearby belief points support the same $\alpha$ vector. This pruning step can be computed rapidly (without solving linear programs) and is clearly advantageous in terms of reducing the set $\Gamma_t$.
The point-based value backup is found in many POMDP solvers, and in general serves to improve estimates of the value function. It is also an integral part of the PBVI framework.
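For concreteness, the point-based backup of Table 1 can be sketched as follows. The projection and maximization steps mirror the table; the model arrays and the toy problem at the bottom are illustrative assumptions:

```python
# Point-based backup of Table 1: same projections as the exact update,
# but only the best alpha per belief point is kept, so the output has
# at most |B| vectors. Model arrays T[s,a,s'], O[s',a,z], R[s,a] assumed.
import numpy as np

def point_based_backup(B, Gamma, T, O, R, gamma=0.95):
    """One value backup over the belief set B (rows are beliefs)."""
    A, Z = R.shape[1], O.shape[2]
    new_Gamma = []
    for b in B:
        best_alpha, best_value = None, -np.inf
        for a in range(A):
            # Immediate reward plus the best projection per observation.
            alpha_ab = R[:, a].astype(float)
            for z in range(Z):
                proj = [gamma * (T[:, a, :] * O[:, a, z]) @ alpha
                        for alpha in Gamma]
                alpha_ab += max(proj, key=lambda v: v @ b)
            if alpha_ab @ b > best_value:
                best_alpha, best_value = alpha_ab, alpha_ab @ b
        # Trivial pruning (lines 13-14 of Table 1): skip duplicates.
        if not any(np.array_equal(best_alpha, g) for g in new_Gamma):
            new_Gamma.append(best_alpha)
    return new_Gamma

# Toy model: 2 states, 2 actions, 2 observations, 3 belief points.
T = np.full((2, 2, 2), 0.5)
O = np.full((2, 2, 2), 0.5)
R = np.array([[1.0, 0.0], [0.0, 1.0]])
B = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
G1 = point_based_backup(B, [np.zeros(2)], T, O, R)
```

Here the uniform belief supports the same vector as one of the corner beliefs, so the duplicate is pruned and the returned set is smaller than $|B|$.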
3 Anytime PointBased Value Iteration
We now describe the algorithmic framework for our new class of fast approximate POMDP algorithms called Point-Based Value Iteration (PBVI). PBVI-class algorithms offer an anytime solution to large-scale discrete POMDP domains. The key to achieving an anytime solution is to interleave two main components: the point-based update described in Table 1 and steps of belief set selection. The approximate value function we find is guaranteed to have bounded error (compared to the optimal) for any discrete POMDP domain.
The current section focuses on the overall anytime algorithm and its theoretical properties, independent of the belief point selection process. Section 4 then discusses in detail various novel techniques for belief point selection.
The overall PBVI framework is simple. We start with a (small) initial set of belief points, to which we apply a first series of backup operations. The set of belief points is then grown, a new series of backup operations is applied to all belief points (old and new), and so on, until a satisfactory solution is obtained. By interleaving value backup iterations with expansions of the belief set, PBVI offers a range of solutions, gradually trading off computation time and solution quality.
The full algorithm is presented in Table 2. The algorithm accepts as input an initial belief point set ($B_0$), an initial value function ($V_0$), the number of desired expansions ($N$), and the planning horizon ($T$). A common choice for $B_0$ is the initial belief $b_0$; alternately, a larger set could be used, especially in cases where sample trajectories are available. The initial value function, $V_0$, is typically set to be purposefully low (e.g., a single vector with $\alpha_0(s) = \frac{1}{1-\gamma} \min_{s,a} R(s,a),\ \forall s \in S$). When we do this, we can show that the point-based solution is always a lower bound on the exact solution [Lovejoy, 1991a]. This follows from the simple observation that failing to compute an $\alpha$ vector can only lower the value function.
For problems with a finite horizon, we run $T$ value backups between each expansion of the belief set. In infinite-horizon problems, we select the horizon $T$ so that
$$\gamma^T \left( R_{\max} - R_{\min} \right) < \epsilon,$$
where $R_{\max} = \max_{s,a} R(s,a)$ and $R_{\min} = \min_{s,a} R(s,a)$.
The complete algorithm terminates once a fixed number of expansions ($N$) have been completed. Alternately, the algorithm could terminate once the value function approximation reaches a given performance criterion. This is discussed further below.
The algorithm uses the BACKUP routine described in Table 1. We can assume for the moment that the EXPAND subroutine (line 8) selects belief points at random. This performs reasonably well for small problems where it is easy to achieve good coverage of the entire belief simplex. However, it scales poorly to larger domains where exponentially many points are needed to guarantee good coverage of the belief simplex. More sophisticated approaches to selecting belief points are presented in Section 4. Overall, the PBVI framework described here offers a simple yet flexible approach to solving large-scale POMDPs.

V = PBVI-MAIN(B_0, V_0, N, T)  1
B = B_0  2
V = V_0  3
For N expansions  4
For T iterations  5
V = BACKUP(B, V)  6
End  7
B_new = EXPAND(B, V)  8
B = B ∪ B_new  9
End  10
Return V  11
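The main loop of Table 2 can be sketched as follows, here with the random EXPAND heuristic mentioned above and a stand-in `backup_fn` in place of the full BACKUP routine of Table 1. Everything below is an illustrative skeleton, not the paper's implementation:

```python
# Sketch of the PBVI-MAIN loop of Table 2: alternate T point-based
# backups with an EXPAND step (here: random points on the simplex).
import numpy as np

rng = np.random.default_rng(0)

def expand_random(B):
    """Double |B| by sampling random points on the belief simplex."""
    new_points = rng.dirichlet(np.ones(B.shape[1]), size=len(B))
    return np.vstack([B, new_points])

def pbvi_main(b0, V0, N, T, backup_fn):
    B, V = np.atleast_2d(b0), V0
    for _ in range(N):            # N belief-set expansions
        for _ in range(T):        # T value backups per expansion
            V = backup_fn(B, V)
        B = expand_random(B)
    return B, V

# Stand-in backup that keeps one vector per belief point, as PBVI does;
# a real run would plug in the BACKUP routine of Table 1 here.
def dummy_backup(B, V):
    return [b.copy() for b in B]

B, V = pbvi_main(np.array([0.5, 0.5]), [np.zeros(2)],
                 N=3, T=2, backup_fn=dummy_backup)
```

Doubling the belief set at each expansion (as here) keeps the number of expansion rounds logarithmic in the final number of belief points.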
For any belief set $B$ and horizon $T$, the algorithm in Table 2 will produce an estimate of the value function, denoted $V_T^B$. We now show that the error between $V_T^B$ and the optimal value function $V^*$ is bounded. The bound depends on how densely $B$ samples the belief simplex $\Delta$; with denser sampling, $V_T^B$ converges to $V_T^*$, the horizon-$T$ optimal solution, which in turn has bounded error with respect to $V^*$, the optimal infinite-horizon solution. So, cutting off the PBVI iterations at any sufficiently large horizon $T$, we can show that the difference between $V_T^B$ and the optimal infinite-horizon $V^*$ is not too large. The overall error in PBVI is bounded, according to the triangle inequality, by:
$$\|V_T^B - V^*\|_\infty \le \|V_T^B - V_T^*\|_\infty + \|V_T^* - V^*\|_\infty.$$
The second term is bounded by $\gamma^T\, \|V_0 - V^*\|_\infty$ [Bertsekas & Tsitsiklis, 1996]. The remainder of this section states and proves a bound on the first term, which we denote $\eta_T = \|V_T^B - V_T^*\|_\infty$.
Begin by assuming that $H$ denotes an exact value backup (so that $V_t = H V_{t-1}$), and $\tilde{H}$ denotes the PBVI backup. Now define $\epsilon(b')$ to be the error introduced at a specific belief $b'$ by performing one iteration of point-based backup:
$$\epsilon(b') = \left| \tilde{H} V(b') - H V(b') \right|.$$
Next define $\epsilon$ to be the maximum total error introduced by doing one iteration of point-based backup:
$$\epsilon = \|\tilde{H} V - H V\|_\infty = \max_{b' \in \Delta} \epsilon(b').$$
Finally, define the density $\delta_B$ of a set of belief points $B$ to be the maximum distance from any belief in the simplex to a belief in set $B$. More precisely:
$$\delta_B = \max_{b' \in \Delta} \min_{b \in B} \|b - b'\|_1.$$
Now we can prove the following lemma:
Lemma 1.
The error introduced in PBVI when performing one iteration of value backup over $B$, instead of over $\Delta$, is bounded by
$$\epsilon \le \frac{(R_{\max} - R_{\min})\, \delta_B}{1 - \gamma}.$$
Proof: Let $b' \in \Delta$ be the point where PBVI makes its worst error in the value update, and $b \in B$ be the closest (1-norm) sampled belief to $b'$. Let $\alpha$ be the vector that is maximal at $b$, and $\alpha'$ be the vector that would be maximal at $b'$. By failing to include $\alpha'$ in its solution set, PBVI makes an error of at most $\alpha' \cdot b' - \alpha \cdot b'$. On the other hand, since $\alpha$ is maximal at $b$, then $\alpha \cdot b \ge \alpha' \cdot b$. So,
$$\begin{aligned}
\epsilon &\le \alpha' \cdot b' - \alpha \cdot b' \\
&\le \alpha' \cdot b' - \alpha \cdot b' + (\alpha \cdot b - \alpha' \cdot b) \\
&= (\alpha' - \alpha) \cdot (b' - b) \\
&\le \|\alpha' - \alpha\|_\infty\, \|b' - b\|_1 \\
&\le \frac{R_{\max} - R_{\min}}{1 - \gamma}\, \delta_B.
\end{aligned}$$
The last inequality holds because each $\alpha$ vector represents the reward achievable starting from some state and following some sequence of actions and observations. Therefore the components of any $\alpha$ vector must fall between $\frac{R_{\min}}{1-\gamma}$ and $\frac{R_{\max}}{1-\gamma}$. ∎
Lemma 1 states a bound on the approximation error introduced by one iteration of pointbased value updates within the PBVI framework. We now look at the bound over multiple value updates.
Theorem 3.1.
For any belief set $B$ and any horizon $T$, the error $\eta_T = \|V_T^B - V_T^*\|_\infty$ of the PBVI algorithm is bounded by
$$\eta_T \le \frac{(R_{\max} - R_{\min})\, \delta_B}{(1 - \gamma)^2}.$$
Proof:
$$\begin{aligned}
\eta_t &= \|V_t^B - V_t^*\|_\infty \\
&\le \|\tilde{H} V_{t-1}^B - H V_{t-1}^B\|_\infty + \|H V_{t-1}^B - H V_{t-1}^*\|_\infty \\
&\le \epsilon + \gamma\, \|V_{t-1}^B - V_{t-1}^*\|_\infty \;=\; \epsilon + \gamma\, \eta_{t-1} \\
&\le \frac{\epsilon}{1 - \gamma} \quad \text{(unrolling the recurrence, with } \eta_0 = 0\text{)} \\
&\le \frac{(R_{\max} - R_{\min})\, \delta_B}{(1 - \gamma)^2} \quad \text{(by Lemma 1).} \qquad \blacksquare
\end{aligned}$$
The bound described in this section depends on how densely $B$ samples the belief simplex $\Delta$. In the case where not all beliefs are reachable, PBVI does not need to sample all of $\Delta$ densely, but can replace $\Delta$ by the set of reachable beliefs $\bar{\Delta}$ (Fig. 2). The error bounds and convergence results then hold on $\bar{\Delta}$. We simply need to redefine the density $\delta_B$ over $\bar{\Delta}$ in Lemma 1.
As a side note, it is worth pointing out that because PBVI makes no assumption regarding the initial value function $V_0$, the point-based solution is not guaranteed to improve with the addition of belief points. Nonetheless, the theorem presented in this section shows that the bound on the error between $V^B$ (the point-based solution) and $V^*$ (the optimal solution) is guaranteed to decrease (or stay the same) with the addition of belief points. In cases where $V_0$ is initialized pessimistically (e.g., $\alpha_0(s) = \frac{1}{1-\gamma}\min_{s,a} R(s,a)$, as suggested above), then $V^B$ will improve (or stay the same) with each value backup and addition of belief points.
This section has thus far skirted the issue of belief point selection; however, the bound presented here clearly argues in favor of dense sampling over the belief simplex. While randomly selecting points according to a uniform distribution may eventually accomplish this, it is generally inefficient, in particular for high-dimensional cases. Furthermore, it does not take advantage of the fact that the error bound holds for dense sampling over reachable beliefs. Thus we seek more efficient ways to generate belief points than at random over the entire simplex. This is the issue explored in the next section.

4 Belief Point Selection
In Section 3, we outlined the prototypical PBVI algorithm, while conveniently avoiding the question of how and when belief points should be selected. There is a clear trade-off between including fewer beliefs (which would favor fast planning over good performance), versus including many beliefs (which would slow down planning, but ensure a better bound on performance). This brings up the question of how many belief points should be included. However, the number of points is not the only consideration. It is likely that some collections of belief points (e.g., those frequently encountered) are more likely to produce a good value function than others. This brings up the question of which beliefs should be included.
A number of approaches have been proposed in the literature. For example, some exact value function approaches use linear programs to identify points where the value function needs to be further improved [ChengCheng1988, LittmanLittman1996, Zhang ZhangZhang Zhang2001]; however, this is typically very expensive. The value function can also be approximated by learning the value at regular points, using a fixed-resolution [LovejoyLovejoy1991a] or variable-resolution [Zhou HansenZhou Hansen2001] grid. This is less expensive than solving LPs, but scales poorly as the number of states increases. Alternately, one can use heuristics to generate grid points [HauskrechtHauskrecht2000, PoonPoon2001]. This tends to be more scalable, though significant experimentation is required to establish which heuristics are most useful.
This section presents five heuristic strategies for selecting belief points, from fast and naive random sampling, to increasingly more sophisticated stochastic simulation techniques. The most effective strategy we propose is one that carefully selects points that are likely to have the largest impact in reducing the error bound (Theorem 3.1).
Most of the strategies we consider focus on selecting reachable beliefs, rather than getting uniform coverage over the entire belief simplex. Therefore it is useful to begin this discussion by looking at how reachability is assessed.
While some exact POMDP value iteration solutions are optimal for any initial belief, PBVI (and other related techniques) assume a known initial belief b₀. As shown in Figure 2, we can use the initial belief to build a tree of reachable beliefs. In this representation, each path through the tree corresponds to a sequence in belief space, and increasing depth corresponds to an increasing plan horizon. When selecting a set of belief points for PBVI, including all reachable beliefs would guarantee optimal performance (conditioned on the initial belief), but at the expense of computational tractability, since the set of reachable beliefs, Δ̄, can grow exponentially with the planning horizon. Therefore, it is best to select a subset B which is sufficiently small for computational tractability, but sufficiently large for good value function approximation.²

²All strategies discussed below assume that the belief point set, B, approximately doubles in size on each belief expansion. This ensures that the number of rounds of value iteration is logarithmic (in the final number of belief points needed). Alternately, each strategy could be used (with very little modification) to add a fixed number of new belief points, but this may require many more rounds of value iteration. Since value iteration is much more expensive than belief computation, it seems appropriate to double the size of B at each expansion.
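The interleaving of value iteration rounds with doubling belief expansions can be sketched as follows. This is only an illustrative skeleton, not the paper's implementation; the function names (`value_backup`, `expand`) are placeholders for the point-based backup and the belief expansion strategies discussed in this section.

```python
# Hedged sketch of the anytime PBVI loop: each round runs several
# point-based value backups on the current belief set, then roughly
# doubles the set with one belief expansion. `expand` and `value_backup`
# are illustrative placeholders, not the paper's actual code.

def pbvi_anytime(b0, expand, value_backup, n_rounds=5, n_backups=10):
    """b0: known initial belief; expand(B, V): returns B plus ~|B| new
    reachable beliefs; value_backup(B, V): one round of backups over B."""
    B = [b0]          # belief set, seeded with the known initial belief
    V = None          # current value function (e.g., a set of alpha-vectors)
    for _ in range(n_rounds):
        for _ in range(n_backups):      # value iteration on the fixed set B
            V = value_backup(B, V)
        B = expand(B, V)                # belief set roughly doubles in size
    return B, V
```

Because each expansion doubles |B|, reaching a final set of n points takes only O(log n) expansion rounds, which keeps the number of (expensive) value iteration phases small.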
In domains where the initial belief is not known (or not unique), it is still possible to use reachability analysis by sampling a few initial beliefs (or using a set of known initial beliefs) to seed multiple reachability trees.
We now discuss five strategies for selecting belief points, each of which can be used within the PBVI framework to perform expansion of the belief set.
4.1 Random Belief Selection (RA)
The first strategy is also the simplest. It consists of sampling belief points from a uniform distribution over the entire belief simplex. To sample over the simplex, we cannot simply sample each b(s) independently over [0, 1] (this would violate the constraint that Σ_s b(s) = 1). Instead, we use the algorithm described in Table 3 [see Devroye, 1986, for more details, including a proof of uniform coverage].
B_new = EXPAND(B, Γ):
    B_new = B
    Foreach b ∈ B
        s = number of states
        For i = 1 to s − 1
            d_i = rand(0, 1)
        End
        Sort d in ascending order
        For i = 1 to s
            b_new(s_i) = d_i − d_{i−1}    (with d_0 = 0 and d_s = 1)
        End
        B_new = B_new ∪ {b_new}
    End
    Return B_new
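The sampling step in Table 3 can be sketched directly: draw s − 1 uniform numbers, sort them, and take the gaps between successive values. This is a minimal stand-alone version; the function name is ours.

```python
import random

def sample_belief(n_states, rng=random):
    """Sample one belief uniformly from the (n_states - 1)-simplex:
    draw n_states - 1 uniforms, sort them, and take successive differences.
    The gaps between the sorted draws (padded with 0 and 1) are uniformly
    distributed over the simplex, and sum to 1 by construction."""
    cuts = sorted(rng.random() for _ in range(n_states - 1))
    cuts = [0.0] + cuts + [1.0]
    return [cuts[i + 1] - cuts[i] for i in range(n_states)]
```

Note that normalizing independently drawn uniforms would not give uniform coverage of the simplex; the sorted-gaps construction above does.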
This random point selection strategy, unlike the other strategies presented below, does not focus on reachable beliefs. For this reason, we do not necessarily advocate this approach. However, we include it because it is an obvious choice, it is by far the simplest to implement, and it has been used in related work by Hauskrecht (2000) and Poon (2001). In smaller domains (e.g., 20 states), it performs reasonably well, since the belief simplex is relatively low-dimensional. In large domains (e.g., 100+ states), it cannot provide good coverage of the belief simplex with a reasonable number of points, and therefore exhibits poor performance. This is demonstrated in the experimental results presented in Section 6.
All of the remaining belief selection strategies make use of the belief tree (Figure 2) to focus on reachable beliefs, rather than trying to cover the entire belief simplex.
4.2 Stochastic Simulation with Random Action (SSRA)
To generate points along the belief tree, we use a technique called stochastic simulation. It involves running single-step forward trajectories from belief points already in B. Simulating a single-step forward trajectory for a given b ∈ B requires selecting an action–observation pair (a, z), and then computing the new belief τ(b, a, z) using the Bayesian update rule (Eqn 7). In the case of Stochastic Simulation with Random Action (SSRA), the action selected for forward simulation is picked (uniformly) at random from the full action set. Table 4 summarizes the belief expansion procedure for SSRA. First, a state s is drawn from the belief distribution b. Second, an action a is drawn at random from the full action set A. Next, a posterior state s′ is drawn from the transition model T(s, a, s′). Finally, an observation z is drawn from the observation model O(s′, a, z). Using the triple (b, a, z), we can calculate the new belief b′ = τ(b, a, z) (according to Equation 7), and add b′ to the set of belief points B_new.
B_new = EXPAND(B, Γ):
    B_new = B
    Foreach b ∈ B
        s = rand(b)             (sample a state from b)
        a = rand(A)             (sample a random action)
        s′ = rand(T(s, a, ·))   (sample a posterior state)
        z = rand(O(s′, a, ·))   (sample an observation)
        b′ = τ(b, a, z)         (see Eqn 7)
        B_new = B_new ∪ {b′}
    End
    Return B_new
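The single-step forward simulation of Table 4, together with the Bayes update it relies on, can be sketched as follows. This is a hedged illustration: the model encoding (lists indexed as `T[s][a][s']` and `O[a][s'][z]`) and all function names are our own conventions, not the paper's.

```python
import random

def belief_update(b, a, z, T, O):
    """Standard Bayes filter (the role of Eqn 7 in the text):
    b'(s') ∝ O[a][s'][z] * sum_s T[s][a][s'] * b(s), then normalize.
    Returns None if the (a, z) pair has zero probability under b."""
    n = len(b)
    bp = [O[a][sp][z] * sum(T[s][a][sp] * b[s] for s in range(n))
          for sp in range(n)]
    norm = sum(bp)
    return [x / norm for x in bp] if norm > 0 else None

def ssra_expand_one(b, T, O, n_actions, n_obs, rng=random):
    """One SSRA forward simulation: sample s ~ b, a random action,
    s' from the transition model, z from the observation model,
    then return the updated belief."""
    s = rng.choices(range(len(b)), weights=b)[0]
    a = rng.randrange(n_actions)
    sp = rng.choices(range(len(b)), weights=T[s][a])[0]
    z = rng.choices(range(n_obs), weights=O[a][sp])[0]
    return belief_update(b, a, z, T, O)
```

The sampled (s, s′) pair is only used to generate a plausible (a, z); the belief update itself marginalizes over all states, as the text describes.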
This strategy is better than picking points at random (as described above), because it restricts B_new to the belief tree (Fig. 2). However, this belief tree is still very large, especially when the branching factor is high due to a large number of actions and observations. By being more selective about which paths in the belief tree are explored, one can hope to restrict the belief set even further.
A similar technique for stochastic simulation was discussed by Poon (2001); however, the belief set was initialized differently (not using b₀), and therefore the stochastic simulations were not restricted to the set of reachable beliefs.
4.3 Stochastic Simulation with Greedy Action (SSGA)
The procedure for generating points using Stochastic Simulation with Greedy Action (SSGA) is based on the well-known ε-greedy exploration strategy used in reinforcement learning [Sutton BartoSutton Barto1998]. This strategy is similar to the SSRA procedure, except that rather than choosing an action randomly, SSGA chooses the greedy action (i.e., the current best action at the given belief b) with probability 1 − ε, and a random action with probability ε (we use ε = 0.1). Once the action is selected, we perform a single-step forward simulation as in SSRA to yield a new belief point. Table 5 summarizes the belief expansion procedure for SSGA.

B_new = EXPAND(B, Γ):
    B_new = B
    Foreach b ∈ B
        s = rand(b)               (sample a state from b)
        If rand(0, 1) ≤ ε
            a = rand(A)           (random action)
        Else
            a = greedy action at b
        End
        s′ = rand(T(s, a, ·))
        z = rand(O(s′, a, ·))
        b′ = τ(b, a, z)           (see Eqn 7)
        B_new = B_new ∪ {b′}
    End
    Return B_new
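The ε-greedy action choice at the heart of SSGA can be sketched as below. The representation of the value function as a list of (action, vector) pairs is an assumption for illustration; the greedy action is the one attached to the alpha-vector maximizing the dot product with the belief.

```python
import random

def ssga_action(b, alpha_vectors, n_actions, epsilon=0.1, rng=random):
    """Epsilon-greedy action choice for SSGA: with probability epsilon,
    pick a uniformly random action; otherwise pick the action attached to
    the best alpha-vector at b. alpha_vectors: list of (action, values)
    pairs, one value hyperplane each (illustrative encoding)."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    # greedy: action of the alpha-vector maximizing alpha . b
    best_action, _ = max(alpha_vectors,
                         key=lambda av: sum(v * p for v, p in zip(av[1], b)))
    return best_action
```

With epsilon = 0 this reduces to the purely greedy policy at b; with epsilon = 1 it reduces to SSRA's random action choice.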
A similar technique, featuring stochastic simulation with greedy actions, was outlined by Hauskrecht (2000). However, in that case the belief set included all extreme points of the belief simplex, and stochastic simulation was done from those extreme points, rather than from the initial belief.
4.4 Stochastic Simulation with Exploratory Action (SSEA)
The error bound in Section 3 suggests that PBVI performs best when its belief set B is uniformly dense in the set of reachable beliefs. The belief point strategies proposed thus far ignore this insight. The next approach we propose gradually expands B by greedily choosing new reachable beliefs that improve the worst-case density.
Unlike SSRA and SSGA, which select a single action to simulate the forward trajectory for a given b ∈ B, Stochastic Simulation with Exploratory Action (SSEA) performs a one-step forward simulation with each action, thus producing one candidate belief b_a per action. However, it does not accept all of these new beliefs; rather, it calculates the distance between each b_a and its closest neighbor in B, and keeps only the point that is farthest from any point already in B. We use the L1 norm to calculate distance between belief points, to be consistent with the error bound in Theorem 3.1. Table 6 summarizes the SSEA expansion procedure.
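The farthest-point selection step can be sketched in a few lines. This is a minimal illustration under the assumption that the per-action candidate beliefs have already been generated by forward simulation.

```python
def ssea_pick(candidates, B):
    """SSEA selection sketch: among candidate successor beliefs (one per
    action), keep the one whose L1 distance to its nearest neighbor in the
    current belief set B is largest, i.e., the candidate that most improves
    the worst-case density of B."""
    def l1(b1, b2):
        return sum(abs(x - y) for x, y in zip(b1, b2))
    return max(candidates, key=lambda c: min(l1(c, b) for b in B))
```

The inner min finds each candidate's nearest existing neighbor; the outer max picks the candidate that is least well covered by B.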
4.5 Greedy Error Reduction (GER)
While the SSEA strategy above is able to improve the worst-case density of reachable beliefs, it does not directly minimize the expected error. And while we would like to directly minimize the error, all we can measure is a bound on the error (Lemma 1). We therefore propose a final strategy which greedily adds the candidate beliefs that most effectively reduce this error bound. Our empirical results, presented below, show that this strategy is the most successful one discovered thus far.
To understand how we expand the belief set B in the GER strategy, it is useful to reconsider the belief tree, which we reproduce in Figure 3. Each node in the tree corresponds to a specific belief. We can divide these nodes into three sets. Set 1 includes those belief points already in B. Set 2 contains those belief points that are immediate descendants of the points in B (i.e., the nodes in the grey zone). These are the candidates from which we will select the new points to be added to B; we call this set the envelope (denoted E). Set 3 contains all other reachable beliefs.
We need to decide which belief should be removed from the envelope E and added to the set of active belief points B. Every point that is added to B will improve our estimate of the value function. The new point will reduce the error bounds (as defined in Section 3) for points that were already in B; however, the error bound for the new point itself might be quite large. This means that the largest error bound over points in B will not decrease monotonically; however, for any particular point already in B (such as the initial belief b₀) the error bound will be decreasing.
To find the point which will most reduce our error bound, we can look at the analysis of Lemma 1, which bounds the amount of additional error that a single point-based backup introduces. Write b′ for the new belief which we are considering adding, and b for some belief which is already in B. Write α for the value hyperplane at b, and ε(b′) for the error at b′. As the lemma points out, if α′ denotes the (unknown) hyperplane that would be optimal at b′, we have

    ε(b′) ≤ (α′ − α) · (b′ − b).

When evaluating this error, we need to minimize over all b ∈ B. Also, since we do not know what α′ will be until we have done some backups at b′, we make a conservative assumption and choose the worst-case value of α′: R_max/(1 − γ) in states where b′ puts more mass than b, and R_min/(1 − γ) elsewhere. Thus, we can evaluate:

    ε(b′) = min_{b ∈ B} Σ_{s ∈ S} { (R_max/(1 − γ) − α(s)) (b′(s) − b(s))   if b′(s) ≥ b(s)
                                    (R_min/(1 − γ) − α(s)) (b′(s) − b(s))   otherwise }        (25)
While one could simply pick the candidate in E which currently has the largest error bound,³ this would ignore reachability considerations. Rather, we evaluate the error at each b ∈ B by weighting the error of the fringe nodes by their reachability probability:

    ε̄(b) = max_{a ∈ A} Σ_{z ∈ Z} P(z | b, a) ε(τ(b, a, z)),

³We tried this; however, it did not perform as well empirically as the weighted criterion we suggest here, because it does not consider the probability of reaching that belief.
noting that τ(b, a, z) ∈ E, and that ε(τ(b, a, z)) can be evaluated according to Equation 25.
Using this weighted error bound, we find the existing point b ∈ B with the largest error. We can now directly reduce its error by adding to our set one of its descendants. We select the next-step belief which maximizes error bound reduction:

    b_new = τ(b, a′, z′),                                                  (27)

    where (a′, z′) = argmax_{a ∈ A, z ∈ Z} P(z | b, a) ε(τ(b, a, z)).      (28)
Table 7 summarizes the GER approach to belief point selection.
B_new = EXPAND(B, Γ):
    B_new = B
    E = envelope of B_new
    For i = 1 to |B|
        b = argmax_{b ∈ B_new} ε̄(b)                       (point with largest weighted error bound)
        (a′, z′) = argmax_{a, z} P(z | b, a) ε(τ(b, a, z))
        b_new = τ(b, a′, z′)
        B_new = B_new ∪ {b_new}; update E
    End
    Return B_new
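The worst-case error evaluation of Equation 25 can be sketched directly. This is a hedged illustration: beliefs and hyperplanes are plain lists, and the pairing of each existing belief with its dominant alpha-vector is an assumed encoding.

```python
def worst_case_error(b_new, B_alphas, r_max, r_min, gamma):
    """Sketch of the worst-case error bound of Equation 25. For each
    existing (b, alpha) pair, sum over states a term that assumes the
    unknown new hyperplane takes its extreme value: R_max/(1-gamma) in
    states where b_new puts more mass than b, R_min/(1-gamma) where it
    puts less. Then minimize over the existing points.
    B_alphas: list of (belief, alpha_vector) pairs."""
    hi, lo = r_max / (1.0 - gamma), r_min / (1.0 - gamma)
    def bound(b, alpha):
        total = 0.0
        for s in range(len(b_new)):
            diff = b_new[s] - b[s]
            total += (hi - alpha[s]) * diff if diff >= 0 else (lo - alpha[s]) * diff
        return total
    return min(bound(b, a) for b, a in B_alphas)
```

A candidate identical to an existing point gets a bound of zero, and the bound grows with the L1 distance to the nearest existing point, which is what makes it usable as a greedy selection score.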
The complexity of adding one new point with GER is polynomial in S (the number of states), A (the number of actions), Z (the number of observations), and B (the number of beliefs already selected). A value backup (for one point) is considerably more expensive, and each point typically needs to be updated several times. As we point out in the empirical results below, belief selection (even with GER) takes minimal time compared to value backup.
This concludes our presentation of belief selection techniques for the PBVI framework. In summary, there are three factors to consider when picking a belief point: (1) how likely is it to occur? (2) how far is it from the belief points already selected? (3) what is the current approximate value for that point? The simplest heuristic (RA) accounts for none of these, some of the others (SSRA, SSGA, SSEA) account for one each, and GER incorporates all three factors.
4.6 Belief Expansion Example
We consider a simple example, shown in Figure 4, to illustrate the difference between the various belief expansion techniques outlined above. This 1D POMDP [LittmanLittman1996] has four states, one of which is the goal (indicated by the star). The two actions, left and right, have the expected (deterministic) effect. The goal state is fully observable (observation = goal), while the other three states are aliased (observation = none). A reward is received for being in the goal state; otherwise the reward is zero. We assume a discounted setting with discount factor γ. The initial distribution is uniform over the non-goal states, and the system resets to that distribution whenever the goal is reached.
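The 1D POMDP described above is small enough to write down directly. This is an illustrative encoding only: the state indices (in particular, placing the goal at an interior position so that either action can reach it from the uniform initial belief) and the name `none` for the aliased observation are our own assumptions.

```python
# Minimal encoding of the 1D POMDP described above: 4 states on a line,
# deterministic left/right moves, goal fully observable, other states
# aliased. Indices and the goal's position are assumptions for illustration.
N_STATES = 4
GOAL = 2                      # assumed position of the goal (starred) state
LEFT, RIGHT = 0, 1

def transition(s, a):
    """Deterministic left/right moves, clamped at the ends of the line."""
    return max(s - 1, 0) if a == LEFT else min(s + 1, N_STATES - 1)

def observe(s):
    """The goal state is fully observable; all others alias to 'none'."""
    return "goal" if s == GOAL else "none"

def initial_belief():
    """Uniform over the non-goal states, as stated above."""
    b = [1.0 / (N_STATES - 1)] * N_STATES
    b[GOAL] = 0.0
    return b
```

Starting from this initial belief, one step of either action reaches the goal from exactly one of the three equally likely states, which matches the 1/3 goal-observation probability discussed below.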
The belief set is always initialized to contain the initial belief b₀. Figure 5 shows part of the belief tree, including the original belief set (the top node) and its envelope (the leaf nodes). We now consider what each belief expansion method might do.
The Random heuristic can pick any belief point (with equal probability) from the entire belief simplex. It does not directly expand any branches of the belief tree, but it will eventually place samples near them.
Stochastic Simulation with Random Action has an equal chance of picking either action. Then, regardless of which action was picked, there is a 2/3 chance of seeing observation none, and a 1/3 chance of seeing observation goal. As a result, SSRA selects among the four corresponding action–observation successor beliefs with these probabilities.
Stochastic Simulation with Greedy Action first needs to know the policy at b₀. A few iterations of point-based updates (Section 2.4) applied to this initial (single-point) belief set reveal the greedy action at b₀.⁴ As a result, expansion of the belief b₀ greedily selects that action with probability 1 − ε (plus its share of the random choice), and selects the other action only through the random choice, with probability ε/2. Combining this with the observation probabilities above, we can tell with what probability SSGA will expand each successor belief.

⁴This may not be obvious to the reader, but it follows directly from the repeated application of equations 20–23.