# Anytime Point-Based Approximations for Large POMDPs

The Partially Observable Markov Decision Process has long been recognized as a rich framework for real-world planning and control problems, especially in robotics. However exact solutions in this framework are typically computationally intractable for all but the smallest problems. A well-known technique for speeding up POMDP solving involves performing value backups at specific belief points, rather than over the entire belief simplex. The efficiency of this approach, however, depends greatly on the selection of points. This paper presents a set of novel techniques for selecting informative belief points which work well in practice. The point selection procedure is combined with point-based value backups to form an effective anytime POMDP algorithm called Point-Based Value Iteration (PBVI). The first aim of this paper is to introduce this algorithm and present a theoretical analysis justifying the choice of belief selection technique. The second aim of this paper is to provide a thorough empirical comparison between PBVI and other state-of-the-art POMDP methods, in particular the Perseus algorithm, in an effort to highlight their similarities and differences. Evaluation is performed using both standard POMDP domains and realistic robotic tasks.


## 1 Introduction

The concept of planning has a long tradition in the AI literature [Fikes & Nilsson, 1971; Chapman, 1987; McAllester & Rosenblitt, 1991; Penberthy & Weld, 1992; Blum & Furst, 1997]. Classical planning is generally concerned with agents which operate in environments that are fully observable, deterministic, finite, static, and discrete. While these techniques are able to solve increasingly large state-space problems, the basic assumptions of classical planning (full observability, a static environment, deterministic actions) make them unsuitable for most robotic applications.

Planning under uncertainty aims to improve robustness by explicitly reasoning about the type of uncertainty that can arise. The Partially Observable Markov Decision Process (POMDP) [Åström, 1965; Sondik, 1971; Monahan, 1982; White, 1991; Lovejoy, 1991b; Kaelbling et al., 1998; Boutilier et al., 1999] has emerged as possibly the most general representation for (single-agent) planning under uncertainty. The POMDP supersedes other frameworks in terms of representational power simply because it combines the most essential features for planning under uncertainty.

First, POMDPs handle uncertainty in both action effects and state observability, whereas many other frameworks handle neither of these, and some handle only stochastic action effects. To handle partial state observability, plans are expressed over information states, instead of world states, since the latter are not directly observable. The space of information states is the space of all beliefs a system might have regarding the world state. Information states are easily calculated from the measurements of noisy and imperfect sensors. In POMDPs, information states are typically represented by probability distributions over world states.

Second, many POMDP algorithms form plans by optimizing a value function. This is a powerful approach to plan optimization, since it allows one to numerically trade off between alternative ways to satisfy a goal, compare actions with different costs/rewards, and plan for multiple interacting goals. While value function optimization is used in other planning approaches, for example Markov Decision Processes (MDPs) [Bellman, 1957], POMDPs are unique in expressing the value function over information states rather than world states.

Finally, whereas classical and conditional planners produce a sequence of actions, POMDPs produce a full policy for action selection, which prescribes the choice of action for any possible information state. By producing a universal plan, POMDPs alleviate the need for re-planning and allow fast execution. Naturally, the main drawback of optimizing a universal plan is the computational complexity of doing so. This is precisely what we seek to alleviate with the work described in this paper.

Most known algorithms for exact planning in POMDPs operate by optimizing the value function over all possible information states (also known as beliefs). These algorithms can run into the well-known curse of dimensionality, where the dimensionality of the planning problem is directly related to the number of states [Kaelbling et al., 1998]. But they can also suffer from the lesser-known curse of history, where the number of belief-contingent plans increases exponentially with the planning horizon. In fact, exact POMDP planning is known to be PSPACE-complete, whereas propositional planning is only NP-complete [Littman, 1996]. As a result, many POMDP domains with only a few states, actions and sensor observations are computationally intractable.

A commonly used technique for speeding up POMDP solving involves selecting a finite set of belief points and performing value backups on this set [Sondik, 1971; Cheng, 1988; Lovejoy, 1991a; Hauskrecht, 2000; Zhang & Zhang, 2001]. While the usefulness of belief point updates is well acknowledged, how and when these backups should be applied has not been thoroughly explored.

This paper describes a class of Point-Based Value Iteration (PBVI) POMDP approximations where the value function is estimated based strictly on point-based updates. In this context, the choice of points is an integral part of the algorithm, and our approach interleaves value backups with steps of belief point selection. One of the key contributions of this paper is the presentation and analysis of a set of heuristics for selecting informative belief points. These range from a naive version that combines point-based value updates with random belief point selection, to a sophisticated algorithm that combines the standard point-based value update with an estimate of the error bound between the approximate and exact solutions to select belief points. Empirical and theoretical evaluation of these techniques reveals the importance of taking distance between points into consideration when selecting belief points. The result is an approach which exhibits good performance with very few belief points (sometimes less than the number of states), thereby overcoming the curse of history.

The PBVI class of algorithms has a number of important properties, which are discussed at greater length in the paper:

• Theoretical guarantees. We present a bound on the error of the value function obtained by point-based approximation, with respect to the exact solution. This bound applies to a number of point-based approaches, including our own PBVI, Perseus [Spaan & Vlassis, 2005], and others.

• Scalability. We are able to handle problems with state spaces an order of magnitude larger than those solved by more traditional POMDP techniques. The empirical performance is evaluated extensively in realistic robot tasks, including a search-for-missing-person scenario.

• Wide applicability. The approach makes few assumptions about the nature or structure of the domain. The PBVI framework does assume known discrete state/action/observation spaces and a known model (i.e., state-to-state transitions, observation probabilities, costs/rewards), but no additional specific structure (e.g., constrained policy class, factored model).

• Anytime performance. An anytime solution can be achieved by gradually alternating phases of belief point selection and phases of point-based value updates. This allows for an effective trade-off between planning time and solution quality.

While PBVI has many important properties, there are a number of other recent POMDP approaches which exhibit competitive performance [Braziunas & Boutilier, 2004; Poupart & Boutilier, 2004; Smith & Simmons, 2004; Spaan & Vlassis, 2005]. We provide an overview of these techniques in the later part of the paper. We also provide a comparative evaluation of these algorithms and PBVI using standard POMDP domains, in an effort to guide practitioners in their choice of algorithm. One of the algorithms, Perseus [Spaan & Vlassis, 2005], is most closely related to PBVI both in design and in performance. We therefore provide a direct comparison of the two approaches using a realistic robot task, in an effort to shed further light on the comparative strengths and weaknesses of these two approaches.

The paper is organized as follows. Section 2 begins by exploring the basic concepts in POMDP solving, including representation, inference, and exact planning. Section 3 presents the general anytime PBVI algorithm and its theoretical properties. Section 4 discusses novel strategies to select good belief points. Section 5 surveys a number of existing POMDP approaches that are closely related to PBVI. Section 6 presents an empirical comparison of POMDP algorithms using standard simulation problems. Finally, Section 7 pursues the empirical evaluation by tackling complex robot domains and directly comparing PBVI with Perseus.

## 2 Review of POMDPs

Partially Observable Markov Decision Processes provide a general planning and decision-making framework for acting optimally in partially observable domains. They are well-suited to a great number of real-world problems where decision-making is required despite prevalent uncertainty. They generally assume a complete and correct world model, with stochastic state transitions, imperfect state tracking, and a reward structure. Given this information, the goal is to find an action strategy which maximizes expected reward gains. This section first establishes the basic terminology and essential concepts pertaining to POMDPs, and then reviews optimal techniques for POMDP planning.

### 2.1 Basic POMDP Terminology

Formally, a POMDP is defined by six distinct quantities, denoted (S, A, Z, T, O, R). The first three of these are:

• States. The state of the world is denoted s, with the finite set of all states denoted by S. The state at time t is denoted st, where t is a discrete time index. The state is not directly observable in POMDPs, where an agent can only compute a belief over the state space S.

• Observations. To infer a belief regarding the world’s state s, the agent can take sensor measurements. The set of all measurements, or observations, is denoted Z. The observation at time t is denoted zt. Observation zt is usually an incomplete projection of the world state st, contaminated by sensor noise.

• Actions. To act in the world, the agent is given a finite set of actions, denoted A. Actions stochastically affect the state of the world. Choosing the right action as a function of history is the core problem in POMDPs.

Throughout this paper, we assume that states, actions and observations are discrete and finite. For mathematical convenience, we also assume that actions and observations are alternated over time.

To fully define a POMDP, we have to specify the probabilistic laws that describe state transitions and observations. These laws are given by the following distributions:

• The state transition probability distribution,

 T(s,a,s′) := Pr(st=s′∣st−1=s,at−1=a)∀t, (1)

is the probability of transitioning to state s′, given that the agent is in state s and selects action a, at any time t. Since T is a conditional probability distribution, we have ∑s′∈S T(s,a,s′) = 1, ∀(s,a). As our notation suggests, T is time-invariant.

• The observation probability distribution,

 O(s,a,z) := Pr(zt=z ∣ st=s, at−1=a) ∀t, (2)

is the probability that the agent will perceive observation z upon executing action a and arriving in state s. This conditional probability is defined for all (s,a,z) triplets, and satisfies ∑z∈Z O(s,a,z) = 1, ∀(s,a). The probability function O is also time-invariant.

Finally, the objective of POMDP planning is to optimize action selection, so the agent is given a reward function describing its performance:

• The reward function. R(s,a) assigns a numerical value quantifying the utility of performing action a when in state s. We assume the reward is bounded, Rmin ≤ R(s,a) ≤ Rmax. The goal of the agent is to collect as much reward as possible over time. More precisely, it wants to maximize the sum:

 E[∑_{t=t0}^{T} γ^{t−t0} rt], (3)

where rt is the reward at time t, E[⋅] is the mathematical expectation, and γ ∈ [0,1) is a discount factor, which ensures that the sum in Equation 3 is finite.

These items together, the states S, actions A, observations Z, reward R, and the probability distributions T and O, define the probabilistic world model that underlies each POMDP.

### 2.2 Belief Computation

POMDPs are instances of Markov processes, which implies that the current world state, st, is sufficient to predict the future, independent of the past states. The key characteristic that sets POMDPs apart from many other probabilistic models (such as MDPs) is the fact that the state is not directly observable. Instead, the agent can only perceive observations zt, which convey incomplete information about the world’s state.

Given that the state is not directly observable, the agent can instead maintain a complete trace of all observations and all actions it ever executed, and use this to select its actions. The action/observation trace is known as a history. We formally define

 ht := {a0,z1,…,zt−1,at−1,zt} (4)

to be the history at time t.

This history trace can get very long as time goes on. A well-known fact is that this history does not need to be represented explicitly, but can instead be summarized via a belief distribution [Åström, 1965], which is the following posterior probability distribution:

 bt(s) := Pr(st=s∣zt,at−1,zt−1,…,a0,b0). (5)

This of course requires knowing the initial state probability distribution:

 b0(s) := Pr(s0=s), (6)

which defines the probability that the domain is in state s at time t = 0. It is common either to specify this initial belief as part of the model, or to give it only to the runtime system which tracks beliefs and selects actions. For our work, we will assume that this initial belief (or a set of possible initial beliefs) is available to the planner.

Because the belief distribution bt is a sufficient statistic for the history, it suffices to condition the selection of actions on bt, instead of on the ever-growing sequence of past observations and actions. Furthermore, the belief bt at time t is calculated recursively, using only the belief one time step earlier, bt−1, along with the most recent action at−1 and observation zt.

We define the belief update equation, τ(b,a,z), as:

 bt(s′) = τ(bt−1,at−1,zt) = [O(s′,at−1,zt) ∑s∈S T(s,at−1,s′) bt−1(s)] / Pr(zt ∣ bt−1,at−1), (7)

where the denominator is a normalizing constant.
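As a concrete illustration, the belief update of Equation 7 can be sketched in a few lines of Python. The nested-list layout `T[s][a][s2]`, `O[s2][a][z]` and the toy numbers below are assumptions made for this sketch, not part of the paper's model specification.

```python
def belief_update(b, a, z, T, O):
    """Bayes-filter belief update tau(b, a, z) of Eqn 7.

    Assumed layout: T[s][a][s2] = Pr(s2 | s, a); O[s2][a][z] = Pr(z | s2, a).
    """
    n = len(b)
    unnorm = [O[s2][a][z] * sum(T[s][a][s2] * b[s] for s in range(n))
              for s2 in range(n)]
    norm = sum(unnorm)  # Pr(z | b, a), the normalizing constant
    if norm == 0.0:
        raise ValueError("observation has zero probability under (b, a)")
    return [p / norm for p in unnorm]

# Toy 2-state, 1-action, 2-observation model (illustrative numbers only).
T = [[[0.9, 0.1]], [[0.2, 0.8]]]   # T[s][a][s']
O = [[[0.8, 0.2]], [[0.3, 0.7]]]   # O[s'][a][z]
b1 = belief_update([0.5, 0.5], a=0, z=0, T=T, O=O)
```

Note that the normalizer falls out of the update itself: Pr(zt ∣ bt−1, at−1) is simply the sum of the unnormalized entries.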

This equation is equivalent to the decades-old Bayes filter [Jazwinski, 1970], and is commonly applied in the context of hidden Markov models [Rabiner, 1989], where it is known as the forward algorithm. Its continuous generalization forms the basis of Kalman filters [Kalman, 1960].

It is interesting to consider the nature of belief distributions. Even for finite state spaces, the belief is a continuous quantity. It is defined over a simplex describing the space of all distributions over the state space S. For very large state spaces, calculating the belief update (Eqn 7) can be computationally challenging. Recent research has led to efficient techniques for belief state computation that exploit structure of the domain [Dean & Kanazawa, 1988; Boyen & Koller, 1998; Poupart & Boutilier, 2000; Thrun et al., 2000]. However, by far the most complex aspect of POMDP planning is the generation of a policy for action selection, which is described next. For example in robotics, calculating beliefs over very large state spaces can be done in real-time [Burgard et al., 1999]. In contrast, calculating optimal action selection policies exactly appears to be infeasible for environments with more than a few dozen states [Kaelbling et al., 1998], not directly because of the size of the state space, but because of the complexity of the optimal policies. Hence we assume throughout this paper that the belief can be computed accurately, and instead focus on the problem of finding good approximations to the optimal policy.

### 2.3 Optimal Policy Computation

The central objective of the POMDP perspective is to compute a policy for selecting actions. A policy is of the form:

 π(b) ⟶ a, (8)

where b is a belief distribution and a is the action chosen by the policy π.

Of particular interest is the notion of an optimal policy, which is a policy that maximizes the expected future discounted cumulative reward:

 π∗(bt0) = argmaxπ Eπ[∑_{t=t0}^{T} γ^{t−t0} rt ∣ bt0]. (9)

There are two distinct but interdependent reasons why computing an optimal policy is challenging. The more widely-known reason is the so-called curse of dimensionality: in a problem with n physical states, the policy is defined over all belief states in an (n−1)-dimensional continuous space. The less-well-known reason is the curse of history: POMDP solving is in many ways like a search through the space of possible POMDP histories. It starts by searching over short histories (through which it can select the best short policies), and gradually considers increasingly long histories. Unfortunately the number of distinct possible action-observation histories grows exponentially with the planning horizon.

The two curses—dimensionality and history—often act independently: planning complexity can grow exponentially with horizon even in problems with only a few states, and problems with a large number of physical states may still only have a small number of relevant histories. Which curse is predominant depends both on the problem at hand, and the solution technique. For example, the belief point methods that are the focus of this paper specifically target the curse of history, leaving themselves vulnerable to the curse of dimensionality. Exact algorithms on the other hand typically suffer far more from the curse of history. The goal is therefore to find techniques that offer the best balance between both.

We now describe a straightforward approach to finding optimal policies, due to Sondik [1971]. The overall idea is to apply multiple iterations of dynamic programming, to compute increasingly accurate values for each belief state b. Let Vt be a value function that maps belief states to values in ℝ. Beginning with the initial value function:

 V0(b) = maxa∑s∈SR(s,a)b(s), (10)

then the t-th value function Vt is constructed from the (t−1)-th by the following recursive equation:

 Vt(b) = maxa[∑s∈SR(s,a)b(s)+γ∑z∈ZPr(z∣a,b)Vt−1(τ(b,a,z))], (11)

where τ is the belief updating function defined in Equation 7. This value function update maximizes the expected sum of all (possibly discounted) future pay-offs the agent receives in the next t time steps, for any belief state b. Thus, it produces a policy that is optimal under the planning horizon t. The optimal policy can also be directly extracted from the previous-step value function:

 π∗t(b) = argmaxa[∑s∈SR(s,a)b(s)+γ∑z∈ZPr(z∣a,b)Vt−1(τ(b,a,z))]. (12)

Sondik [1971] showed that the value function at any finite horizon t can be expressed by a finite set of α-vectors: Γt = {α0, α1, …, αm}. Each α-vector represents an |S|-dimensional hyper-plane, and defines the value function over a bounded region of the belief:

 Vt(b) = maxα∈Γt∑s∈Sα(s)b(s). (13)

In addition, each α-vector is associated with an action, defining the best immediate policy assuming optimal behavior for the following t−1 steps (as defined respectively by the sets Γt−1, …, Γ0).
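Because each α-vector carries an action, both evaluating Equation 13 and reading off the greedy action reduce to a max over dot products. The sketch below assumes Γ is stored as (action, vector) pairs; this representation and the toy vectors are illustrative choices, not the paper's notation.

```python
def value(b, Gamma):
    """V(b) = max over alpha-vectors of alpha . b (Eqn 13)."""
    return max(sum(a_s * b_s for a_s, b_s in zip(vec, b)) for _, vec in Gamma)

def greedy_action(b, Gamma):
    """Return the action attached to the maximizing alpha-vector at belief b."""
    return max(Gamma, key=lambda av: sum(x * y for x, y in zip(av[1], b)))[0]

# Two alpha-vectors tagged with (hypothetical) actions 0 and 1.
Gamma = [(0, [1.0, 0.0]), (1, [0.0, 2.0])]
```

At the uniform belief [0.5, 0.5] the second vector dominates, so the policy picks its action; at the corner [1, 0] the first vector wins instead, illustrating the piecewise-linear structure.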

The t-horizon solution set, Γt, can be computed as follows. First, we rewrite Equation 11 as:

 Vt(b) = maxa∈A[∑s∈SR(s,a)b(s)+γ∑z∈Zmaxα∈Γt−1∑s∈S∑s′∈ST(s,a,s′)O(s′,a,z)α(s′)b(s)]. (14)

Notice that in this representation of Vt(b), the nonlinearity in the term Pr(z∣a,b) from Equation 11 cancels out the nonlinearity in the denominator of τ(b,a,z), leaving a function of b(s) that is linear inside the max operator.

The value Vt(b) cannot be computed directly for each belief b (since there are infinitely many beliefs), but the corresponding set Γt can be generated through a sequence of operations on the set Γt−1.

The first operation is to generate the intermediate sets Γa,∗t and Γa,zt (Step 1):

 Γa,∗t ← αa,∗(s)=R(s,a) (15) Γa,zt ← αa,zi(s)=γ∑s′∈ST(s,a,s′)O(s′,a,z)αi(s′),∀αi∈Γt−1

where each αa,∗ and αa,zi is once again an |S|-dimensional hyper-plane.

Next we create Γat (for each a ∈ A), the cross-sum over observations, which includes one αa,z from each Γa,zt (Step 2). (The symbol ⊕ denotes the cross-sum operator: defined over two sets A and B, it produces the set {a + b ∣ a ∈ A, b ∈ B}.)

 Γat = Γa,∗t+Γa,z1t⊕Γa,z2t⊕… (16)

Finally we take the union of the Γat sets (Step 3):

 Γt = ∪a∈AΓat. (17)

This forms the pieces of the backup solution at horizon t. The actual value function Vt is extracted from the set Γt as described in Equation 13.
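A minimal sketch of the exact backup (Steps 1-3) makes the blow-up concrete: before pruning, the returned set has |A|·|Γt−1|^|Z| vectors. The model layout (`T[s][a][s2]`, `O[s2][a][z]`, `R[s][a]`) and the toy numbers are assumptions of this sketch.

```python
from itertools import product

def exact_backup(Gamma_prev, n_states, actions, observations, T, O, R, gamma):
    """One exact value backup (Steps 1-3, Eqns 15-17); returns unpruned Gamma_t."""
    S = range(n_states)
    Gamma_t = []
    for a in actions:
        base = [R[s][a] for s in S]                       # Gamma_t^{a,*} (Eqn 15)
        proj = [[[gamma * sum(T[s][a][s2] * O[s2][a][z] * alpha[s2] for s2 in S)
                  for s in S]
                 for alpha in Gamma_prev]                 # Gamma_t^{a,z} (Eqn 15)
                for z in observations]
        # Cross-sum: choose one projected vector per observation (Eqn 16).
        for choice in product(*proj):
            Gamma_t.append([base[s] + sum(v[s] for v in choice) for s in S])
    return Gamma_t                                        # union over actions (Eqn 17)

# Toy model: 2 states, 1 action, 2 observations -> |A| * |Gamma|^|Z| = 4 vectors.
T = [[[1.0, 0.0]], [[0.0, 1.0]]]
O = [[[0.5, 0.5]], [[0.5, 0.5]]]
R = [[1.0], [0.0]]
G = exact_backup([[1.0, 0.0], [0.0, 1.0]], 2, [0], [0, 1], T, O, R, 0.9)
```

Even this tiny example doubles the solution set in one backup; with |Z| realistic observations the growth is what makes exact updates intractable.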

Using this approach, bounded-time POMDP problems with finite state, action, and observation spaces can be solved exactly given a choice of the horizon T. If the environment is such that the agent might not be able to bound the planning horizon in advance, the policy is an approximation to the optimal one, whose quality improves in expectation with the planning horizon (assuming γ < 1).

As mentioned above, the value function Vt can be extracted directly from the set Γt. An important aspect of this algorithm (and of all optimal finite-horizon POMDP solutions) is that the value function is guaranteed to be a piecewise-linear, convex, and continuous function of the belief [Sondik, 1971]. The piecewise-linearity and continuity properties are a direct result of the fact that Γt is composed of finitely many linear α-vectors. The convexity property is a result of the maximization operator (Eqn 13). It is worth pointing out that the intermediate sets Γa,∗t, Γa,zt and Γat also represent functions of the belief which are composed entirely of linear segments. This property holds for the intermediate representations because they incorporate the expectation over observation probabilities (Eqn 15).

In the worst case, the exact value update procedure described could require time doubly exponential in the planning horizon t [Kaelbling et al., 1998]. To better understand the complexity of the exact update, let |S| be the number of states, |A| the number of actions, |Z| the number of observations, and |Γt−1| the number of α-vectors in the previous solution set. Then Step 1 creates |A||Z||Γt−1| projections and Step 2 generates |A||Γt−1|^|Z| cross-sums. So, in the worst case, the new solution requires:

 |Γt| = O(|A| |Γt−1|^|Z|) (18)

α-vectors to represent the value function at horizon t; these can be computed in time O(|S|^2 |A| |Γt−1|^|Z|).

It is often the case that a vector αi in Γt will be completely dominated by another vector αj over the entire belief simplex:

 αi⋅b<αj⋅b,∀b. (19)

Similarly, a vector may be fully dominated by a set of other vectors (e.g., in Fig. 1, a vector can be dominated by the weighted combination of two others). Such a vector can then be pruned away without affecting the solution. Finding dominated vectors can be expensive: checking whether a single vector is dominated requires solving a linear program with |S| variables and |Γt| constraints. Nonetheless it can be time-effective to apply pruning after each iteration to prevent an explosion of the solution size. In practice, |Γt| often appears to grow singly exponentially in t, given clever mechanisms for pruning unnecessary linear functions. This enormous computational complexity has long been a key impediment toward applying POMDPs to practical problems.
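The full dominance test needs a linear program per vector, but the cheap special case of Eqn 19, domination by a single other vector, can be checked directly and is a common first pass before any LP-based pruning. The function below is an illustrative sketch, not the paper's pruning routine.

```python
def prune_pointwise(Gamma, eps=1e-12):
    """Drop any vector strictly pointwise-dominated by a single other vector.

    This is a cheap O(|Gamma|^2 |S|) pre-filter; vectors dominated only by a
    *combination* of others still require a linear program to detect.
    """
    kept = []
    for i, alpha in enumerate(Gamma):
        dominated = any(
            j != i and all(beta[s] >= alpha[s] + eps for s in range(len(alpha)))
            for j, beta in enumerate(Gamma))
        if not dominated:
            kept.append(alpha)
    return kept
```

For example, [1, 1] is pointwise-dominated by [2, 2] and would be dropped, while [0, 3] survives because neither remaining vector beats it at every state.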

### 2.4 Point-Based Value Backup

Exact POMDP solving, as outlined above, optimizes the value function over all beliefs. Many approximate POMDP solutions, including the PBVI approach proposed in this paper, gain computational advantage by applying value updates at specific (and few) belief points, rather than over all beliefs [Cheng, 1988; Zhang & Zhang, 2001; Poon, 2001]. These approaches differ significantly (and to great consequence) in how they select the belief points, but once a set of points is selected, the procedure for updating their value is standard. We now describe the procedure for updating the value function at a set of known belief points.

As in Section 2.3, the value function update is implemented as a sequence of operations on a set of α-vectors. If we assume that we are only interested in updating the value function at a fixed set of belief points, B, then it follows that the value function will contain at most one α-vector for each belief point. The point-based value function is therefore represented by the corresponding set Γt = {αb ∣ b ∈ B}.

Given a solution set Γt−1, we simply modify the exact backup operator (Eqn 14) such that only one α-vector per belief point is maintained. The point-based backup now gives an α-vector which is valid over a region around b. It assumes that the other belief points in that region have the same action choice and lead to the same facets of Vt−1 as the point b. This is the key idea behind all algorithms presented in this paper, and the reason for the large computational savings associated with this class of algorithms.

To obtain solution set Γt from the previous set Γt−1, we begin once again by generating the intermediate sets Γa,∗t and Γa,zt (exactly as in Eqn 15) (Step 1):

 Γa,∗t ← αa,∗(s)=R(s,a) (20) Γa,zt ← αa,zi(s)=γ∑s′∈ST(s,a,s′)O(s′,a,z)αi(s′),∀αi∈Γt−1.

Next, whereas performing an exact value update requires a cross-sum operation (Eqn 16), by operating over a finite set of points, we can instead use a simple summation. We construct Γat (Step 2):

 Γat ← αab=Γa,∗t+∑z∈Zargmaxα∈Γa,zt(∑s∈Sα(s)b(s)),∀b∈B. (21)

Finally, we find the best action for each belief point (Step 3):

 αb = argmax_{αab : a∈A} ∑s∈S αab(s) b(s), ∀b ∈ B (22)
 Γt = ∪b∈B αb (23)

While these operations preserve only the best α-vector at each belief point b ∈ B, an estimate of the value function at any belief in the simplex (including b ∉ B) can be extracted from the set Γt just as before:

 Vt(b) = maxα∈Γt∑s∈Sα(s)b(s). (24)

To better understand the complexity of updating the value of a set of points B, let |S| be the number of states, |A| the number of actions, |Z| the number of observations, and |Γt−1| the number of α-vectors in the previous solution set. As with an exact update, Step 1 creates |A||Z||Γt−1| projections (in time O(|S|^2 |A||Z||Γt−1|)). Steps 2 and 3 then reduce this set to at most |B| components (in time O(|S||A||Z||Γt−1||B|)). Thus, a full point-based value update takes only polynomial time, and even more crucially, the size of the solution set Γt remains constant at every iteration. The point-based value backup algorithm is summarized in Table 1.
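The point-based backup of Eqns 20-23, including the trivial duplicate-pruning step, can be sketched as follows. As before, the model layout (`T[s][a][s2]`, `O[s2][a][z]`, `R[s][a]`) and the toy problem are assumptions of this sketch rather than the paper's notation.

```python
def point_based_backup(Gamma_prev, B, n_states, actions, observations,
                       T, O, R, gamma):
    """One point-based backup (Eqns 20-23): at most one alpha-vector per b in B."""
    S = range(n_states)
    dot = lambda v, b: sum(v[s] * b[s] for s in S)
    Gamma_t = []
    for b in B:
        best = None
        for a in actions:
            alpha_ab = [R[s][a] for s in S]                   # alpha^{a,*} term
            for z in observations:
                proj = [[gamma * sum(T[s][a][s2] * O[s2][a][z] * alpha[s2]
                                     for s2 in S) for s in S]
                        for alpha in Gamma_prev]
                pick = max(proj, key=lambda v: dot(v, b))     # argmax in Eqn 21
                alpha_ab = [alpha_ab[s] + pick[s] for s in S]
            if best is None or dot(alpha_ab, b) > dot(best, b):
                best = alpha_ab                               # best action (Eqn 22)
        if best not in Gamma_t:                               # trivial pruning step
            Gamma_t.append(best)
    return Gamma_t

# Same toy model as before; two corner beliefs yield two distinct vectors.
T = [[[1.0, 0.0]], [[0.0, 1.0]]]
O = [[[0.5, 0.5]], [[0.5, 0.5]]]
R = [[1.0], [0.0]]
G = point_based_backup([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]],
                       2, [0], [0, 1], T, O, R, 0.9)
```

Contrast with the exact backup: the argmax at each belief replaces the cross-sum, so the output size is capped at |B| regardless of |Z|.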

Note that the algorithm as outlined in Table 1 includes a trivial pruning step (lines 13-14), whereby we refrain from adding to Γt any vector already included in it. As a result, it is often the case that |Γt| < |B|. This situation arises whenever multiple nearby belief points support the same vector. This pruning step can be computed rapidly (without solving linear programs) and is clearly advantageous in terms of reducing the set Γt.

The point-based value backup is found in many POMDP solvers, and in general serves to improve estimates of the value function. It is also an integral part of the PBVI framework.

## 3 Anytime Point-Based Value Iteration

We now describe the algorithmic framework for our new class of fast approximate POMDP algorithms called Point-Based Value Iteration (PBVI). PBVI-class algorithms offer an anytime solution to large-scale discrete POMDP domains. The key to achieving an anytime solution is to interleave two main components: the point-based update described in Table 1 and steps of belief set selection. The approximate value function we find is guaranteed to have bounded error (compared to the optimal) for any discrete POMDP domain.

The current section focuses on the overall anytime algorithm and its theoretical properties, independent of the belief point selection process. Section 4 then discusses in detail various novel techniques for belief point selection.

The overall PBVI framework is simple. We start with a (small) initial set of belief points, to which we apply a first series of backup operations. The set of belief points is then grown, a new series of backup operations is applied to all belief points (old and new), and so on, until a satisfactory solution is obtained. By interleaving value backup iterations with expansions of the belief set, PBVI offers a range of solutions, gradually trading off computation time and solution quality.

The full algorithm is presented in Table 2. The algorithm accepts as input an initial belief point set (B0), an initial value function, the number of desired expansions (N), and the planning horizon (T). A common choice for B0 is the initial belief b0; alternately, a larger set could be used, especially in cases where sample trajectories are available. The initial value function is typically set to be purposefully low (e.g., a single α-vector with all entries equal to Rmin/(1−γ)). When we do this, we can show that the point-based solution is always a lower bound on the exact solution [Lovejoy, 1991a]. This follows from the simple observation that failing to compute an α-vector can only lower the value function.

For problems with a finite horizon T, we run T value backups between each expansion of the belief set. In infinite-horizon problems, we select the horizon T so that

 γ^T (Rmax − Rmin) < ϵ,

where Rmax = max_{s,a} R(s,a) and Rmin = min_{s,a} R(s,a).
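This cutoff horizon has a closed form: T is the smallest integer with γ^T (Rmax − Rmin) < ϵ. A small sketch, assuming 0 < γ < 1 (the function name is ours):

```python
import math

def horizon_for_tolerance(r_max, r_min, gamma, eps):
    """Smallest integer T with gamma**T * (r_max - r_min) < eps (0 < gamma < 1)."""
    span = r_max - r_min
    if span < eps:
        return 0
    # gamma**T * span < eps  <=>  T > log_gamma(eps / span)
    return math.floor(math.log(eps / span, gamma)) + 1
```

For instance, with γ = 0.95, rewards spanning [0, 1], and ϵ = 0.01, this yields a horizon of 90 backups.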

The complete algorithm terminates once a fixed number of expansions (N) have been completed. Alternately, the algorithm could terminate once the value function approximation reaches a given performance criterion. This is discussed further below.

The algorithm uses the BACKUP routine described in Table 1. We can assume for the moment that the EXPAND subroutine (line 8) selects belief points at random. This performs reasonably well for small problems where it is easy to achieve good coverage of the entire belief simplex. However, it scales poorly to larger domains where exponentially many points are needed to guarantee good coverage of the belief simplex. More sophisticated approaches to selecting belief points are presented in Section 4. Overall, the PBVI framework described here offers a simple yet flexible approach to solving large-scale POMDPs.

For any belief set B and horizon t, the algorithm in Table 2 will produce an estimate of the value function, denoted VBt. We now show that the error between VBt and the optimal value function is bounded. The bound depends on how densely B samples the belief simplex Δ; with denser sampling, VBt converges to V∗t, the t-horizon optimal solution, which in turn has bounded error with respect to V∗, the optimal infinite-horizon solution. So, cutting off the PBVI iterations at any sufficiently large horizon t, we can show that the difference between VBt and the optimal infinite-horizon V∗ is not too large. The overall error in PBVI is bounded, according to the triangle inequality, by:

 ‖V_B^t − V*‖_∞ ≤ ‖V_B^t − V*_t‖_∞ + ‖V*_t − V*‖_∞.

The second term is bounded by γ^t ‖V*_0 − V*‖_∞ [Bertsekas & Tsitsiklis, 1996]. The remainder of this section states and proves a bound on the first term, which we denote ε_t.

Begin by assuming that H denotes an exact value backup, and H̃ denotes the point-based (PBVI) backup. Now define ε(b) to be the error introduced at a specific belief b by performing one iteration of point-based backup:

 ε(b) = |H̃V_B(b) − H V_B(b)|.

Next, define ε to be the maximum total error introduced by doing one iteration of point-based backup:

 ε = ‖H̃V_B − H V_B‖_∞ = max_{b ∈ Δ} ε(b).

Finally, define the density δ_B of a set of belief points B to be the maximum distance from any belief in the simplex Δ to its nearest belief in the set B. More precisely:

 δ_B = max_{b′ ∈ Δ} min_{b ∈ B} ‖b − b′‖_1.
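Computing δ_B exactly requires a maximization over the whole simplex, but a lower bound is easy to obtain by probing with any finite set of test beliefs. A small illustrative sketch (the function names are ours):

```python
def l1_distance(b1, b2):
    """1-norm distance between two beliefs of equal length."""
    return sum(abs(x - y) for x, y in zip(b1, b2))

def density_lower_bound(belief_set, probe_beliefs):
    """Lower bound on delta_B = max_{b' in Delta} min_{b in B} ||b - b'||_1,
    obtained by replacing Delta with a finite probe set."""
    return max(min(l1_distance(b, probe) for b in belief_set)
               for probe in probe_beliefs)
```

With B = {(1, 0)} and probes {(0, 1), (0.5, 0.5)}, the bound is 2, achieved at the opposite corner of the simplex.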

Now we can prove the following lemma:

###### Lemma 1.

The error introduced in PBVI when performing one iteration of value backup over B, instead of over Δ, is bounded by

 ε ≤ (R_max − R_min) δ_B / (1 − γ).

Proof: Let b′ ∈ Δ be the point where PBVI makes its worst error in the value update, and let b ∈ B be the closest (1-norm) sampled belief to b′. Let α be the vector that is maximal at b, and let α′ be the vector that would be maximal at b′. By failing to include α′ in its solution set, PBVI makes an error of at most α′·b′ − α·b′. On the other hand, since α is maximal at b, we have α·b ≥ α′·b. So,

 ε ≤ α′·b′ − α·b′
   = α′·b′ − α·b′ + (α′·b − α′·b)        (add zero)
   ≤ α′·b′ − α·b′ + α·b − α′·b          (α is maximal at b)
   = (α′ − α)·(b′ − b)                   (re-arrange the terms)
   ≤ ‖α′ − α‖_∞ ‖b′ − b‖_1              (Hölder inequality)
   ≤ ‖α′ − α‖_∞ δ_B                      (by definition of δ_B)
   ≤ (R_max − R_min) δ_B / (1 − γ)      (bound on α-vector entries)

The last inequality holds because each α-vector represents the discounted reward achievable starting from some state and following some sequence of actions and observations; therefore every entry of an α-vector must fall between R_min/(1 − γ) and R_max/(1 − γ). ∎

Lemma 1 states a bound on the approximation error introduced by one iteration of point-based value updates within the PBVI framework. We now look at the bound over multiple value updates.

###### Theorem 3.1.

For any belief set B and any horizon t, the error ε_t = ‖V_B^t − V*_t‖_∞ of the PBVI algorithm is bounded by

 ε_t ≤ (R_max − R_min) δ_B / (1 − γ)².

Proof:

 ε_t = ‖V_B^t − V*_t‖_∞
     = ‖H̃ V_B^{t−1} − H V*_{t−1}‖_∞                                    (by definition of H̃)
     ≤ ‖H̃ V_B^{t−1} − H V_B^{t−1}‖_∞ + ‖H V_B^{t−1} − H V*_{t−1}‖_∞    (triangle inequality)
     ≤ (R_max − R_min) δ_B / (1 − γ) + ‖H V_B^{t−1} − H V*_{t−1}‖_∞     (by Lemma 1)
     ≤ (R_max − R_min) δ_B / (1 − γ) + γ ‖V_B^{t−1} − V*_{t−1}‖_∞       (contraction of exact value backup)
     = (R_max − R_min) δ_B / (1 − γ) + γ ε_{t−1}                         (by definition of ε_{t−1})
     ≤ (R_max − R_min) δ_B / (1 − γ)²                                    (sum of a geometric series) ∎

The bound described in this section depends on how densely B samples the belief simplex Δ. In the case where not all beliefs are reachable, PBVI does not need to sample all of Δ densely, but can replace Δ by the set of reachable beliefs Δ̄ (Fig. 2). The error bounds and convergence results then hold on Δ̄. We simply need to re-define δ_B over Δ̄ in Lemma 1.

As a side note, it is worth pointing out that because PBVI makes no assumption regarding the initial value function V_0, the point-based solution is not guaranteed to improve with the addition of belief points. Nonetheless, the theorem presented in this section shows that the bound on the error between V_B^t (the point-based solution) and V* (the optimal solution) is guaranteed to decrease (or stay the same) with the addition of belief points. In cases where V_0 is initialized pessimistically (e.g., V_0(s) = R_min/(1 − γ), as suggested above), V_B^t will improve (or stay the same) with each value backup and each addition of belief points.

This section has thus far skirted the issue of belief point selection; however, the bound presented here clearly argues in favor of dense sampling over the belief simplex. While randomly selecting points according to a uniform distribution may eventually accomplish this, it is generally inefficient, in particular for high-dimensional problems. Furthermore, it does not take advantage of the fact that the error bound holds for dense sampling over reachable beliefs only. Thus we seek more efficient ways to generate belief points than sampling at random over the entire simplex. This is the issue explored in the next section.

## 4 Belief Point Selection

In Section 3, we outlined the prototypical PBVI algorithm while conveniently avoiding the question of how and when belief points should be selected. There is a clear trade-off between including fewer beliefs (which favors fast planning over good performance) and including many beliefs (which slows down planning but ensures a better bound on performance). This raises the question of how many belief points should be included. However, the number of points is not the only consideration. It is likely that some collections of belief points (e.g., those frequently encountered) are more likely to produce a good value function than others. This raises the question of which beliefs should be included.

A number of approaches have been proposed in the literature. For example, some exact value function approaches use linear programs to identify points where the value function needs to be further improved [Cheng, 1988; Littman, 1996; Zhang & Zhang, 2001], however this is typically very expensive. The value function can also be approximated by learning the value at regular points, using a fixed-resolution [Lovejoy, 1991a] or variable-resolution [Zhou & Hansen, 2001] grid. This is less expensive than solving LPs, but can scale poorly as the number of states increases. Alternately, one can use heuristics to generate grid points [Hauskrecht, 2000; Poon, 2001]. This tends to be more scalable, though significant experimentation is required to establish which heuristics are most useful.

This section presents five heuristic strategies for selecting belief points, from fast and naive random sampling, to increasingly more sophisticated stochastic simulation techniques. The most effective strategy we propose is one that carefully selects points that are likely to have the largest impact in reducing the error bound (Theorem 3.1).

Most of the strategies we consider focus on selecting reachable beliefs, rather than getting uniform coverage over the entire belief simplex. Therefore it is useful to begin this discussion by looking at how reachability is assessed.

While some exact POMDP value iteration solutions are optimal for any initial belief, PBVI (and other related techniques) assumes a known initial belief b_0. As shown in Figure 2, we can use the initial belief to build a tree of reachable beliefs. In this representation, each path through the tree corresponds to a sequence in belief space, and increasing depth corresponds to an increasing plan horizon. When selecting a set of belief points for PBVI, including all reachable beliefs would guarantee optimal performance (conditioned on the initial belief), but at the expense of computational tractability, since the set of reachable beliefs, Δ̄, can grow exponentially with the planning horizon. Therefore, it is best to select a subset B ⊂ Δ̄ which is sufficiently small for computational tractability, but sufficiently large for good value function approximation.²

²All strategies discussed below assume that the belief point set, B, approximately doubles in size on each belief expansion. This ensures that the number of rounds of value iteration is logarithmic in the final number of belief points needed. Alternately, each strategy could be used (with very little modification) to add a fixed number of new belief points, but this may require many more rounds of value iteration. Since value iteration is much more expensive than belief computation, it seems appropriate to double the size of B at each expansion.

In domains where the initial belief is not known (or not unique), it is still possible to use reachability analysis by sampling a few initial beliefs (or using a set of known initial beliefs) to seed multiple reachability trees.

We now discuss five strategies for selecting belief points, each of which can be used within the PBVI framework to perform expansion of the belief set.

### 4.1 Random Belief Selection (Ra)

The first strategy is also the simplest. It consists of sampling belief points from a uniform distribution over the entire belief simplex. To sample uniformly over the simplex, we cannot simply sample each component b(s) independently over [0, 1] (this would violate the constraint that Σ_s b(s) = 1). Instead, we use the algorithm described in Table 3 [see Devroye, 1986, for more details, including a proof of uniform coverage].
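Table 3 is not reproduced here, but a standard way to sample uniformly from the simplex (consistent with the spacings method in Devroye, 1986) is to sort n − 1 uniform draws and take the gaps between consecutive values. A sketch, with names of our own choosing:

```python
import random

def random_belief(n_states, rng=random):
    """Uniform sample from the (n_states - 1)-dimensional belief simplex.

    Sort n-1 draws from U(0, 1); the gaps between consecutive order
    statistics (padded with 0 and 1) are nonnegative and sum to one.
    """
    cuts = sorted(rng.random() for _ in range(n_states - 1))
    edges = [0.0] + cuts + [1.0]
    return [edges[i + 1] - edges[i] for i in range(n_states)]
```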

This random point selection strategy, unlike the other strategies presented below, does not focus on reachable beliefs. For this reason, we do not necessarily advocate this approach. However, we include it because it is an obvious choice, it is by far the simplest to implement, and it has been used in related work by Hauskrecht (2000) and Poon (2001). In smaller domains (e.g., 20 states), it performs reasonably well, since the belief simplex is relatively low-dimensional. In large domains (e.g., 100+ states), it cannot provide good coverage of the belief simplex with a reasonable number of points, and therefore exhibits poor performance. This is demonstrated in the experimental results presented in Section 6.

All of the remaining belief selection strategies make use of the belief tree (Figure 2) to focus on reachable beliefs, rather than trying to cover the entire belief simplex.

### 4.2 Stochastic Simulation with Random Action (Ssra)

To generate points along the belief tree, we use a technique called stochastic simulation. It involves running single-step forward trajectories from belief points already in B. Simulating a single-step forward trajectory for a given b ∈ B requires selecting an action–observation pair (a, z), and then computing the new belief τ(b, a, z) using the Bayesian update rule (Eqn 7). In the case of Stochastic Simulation with Random Action (SSRA), the action used for forward simulation is picked (uniformly) at random from the full action set. Table 4 summarizes the belief expansion procedure for SSRA. First, a state s is drawn from the belief distribution b. Second, an action a is drawn at random from the full action set. Next, a posterior state s′ is drawn from the transition model T(s, a, ·). Finally, an observation z is drawn from the observation model O(s′, a, ·). Using the pair (a, z), we can calculate the new belief τ(b, a, z) (according to Equation 7) and add it to the set of belief points B.
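The single-step simulation in SSRA can be sketched as follows, assuming dense transition and observation models indexed as T[a][s][s'] and O[a][s'][z] (a layout of our own choosing, not the paper's tables):

```python
import random

def belief_update(b, a, z, T, O):
    """tau(b, a, z): Bayesian belief update (Equation 7).
    Returns None if observation z has zero probability under (b, a)."""
    n = len(b)
    unnorm = [O[a][s2][z] * sum(T[a][s][s2] * b[s] for s in range(n))
              for s2 in range(n)]
    total = sum(unnorm)
    return [p / total for p in unnorm] if total > 0.0 else None

def ssra_expand(b, T, O, n_actions, n_obs, rng=random):
    """One SSRA step: s ~ b, a uniform, s' ~ T(s, a, .), z ~ O(s', a, .),
    then return the posterior belief tau(b, a, z)."""
    n = len(b)
    s = rng.choices(range(n), weights=b)[0]
    a = rng.randrange(n_actions)
    s2 = rng.choices(range(n), weights=T[a][s])[0]
    z = rng.choices(range(n_obs), weights=O[a][s2])[0]
    return belief_update(b, a, z, T, O)
```

On a toy two-state problem with deterministic transitions and perfect observations, updating a uniform belief after observing state 0 collapses the belief onto that state.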

This strategy is better than picking points at random (as described above), because it restricts B to the belief tree (Fig. 2). However, this belief tree is still very large, especially when the branching factor is high due to large numbers of actions and observations. By being more selective about which paths in the belief tree are explored, one can hope to restrict the belief set even further.

A similar technique for stochastic simulation was discussed by Poon (2001); however, the belief set was initialized differently (not using b_0), and therefore the stochastic simulations were not restricted to the set of reachable beliefs.

### 4.3 Stochastic Simulation with Greedy Action (Ssga)

The procedure for generating points using Stochastic Simulation with Greedy Action (SSGA) is based on the well-known ε-greedy exploration strategy used in reinforcement learning [Sutton & Barto, 1998]. This strategy is similar to the SSRA procedure, except that rather than choosing an action randomly, SSGA chooses the greedy action (i.e., the current best action at the given belief b) with probability 1 − ε, and chooses a random action with probability ε. Once the action is selected, we perform a single-step forward simulation as in SSRA to yield a new belief point. Table 5 summarizes the belief expansion procedure for SSGA.
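The ε-greedy choice in SSGA amounts to one line of sampling logic. A minimal sketch (the default value of ε and the function name are illustrative assumptions, not from the paper):

```python
import random

def ssga_action(greedy_action, n_actions, eps=0.1, rng=random):
    """Return the greedy action with probability 1 - eps, otherwise a
    uniformly random action (which may coincide with the greedy one)."""
    if rng.random() >= eps:
        return greedy_action
    return rng.randrange(n_actions)
```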

A similar technique, featuring stochastic simulation using greedy actions, was outlined by Hauskrecht (2000). However, in that case the belief set included all extreme points of the belief simplex, and stochastic simulation was done from those extreme points, rather than from the initial belief.

### 4.4 Stochastic Simulation with Exploratory Action (Ssea)

The error bound in Section 3 suggests that PBVI performs best when its belief set B is uniformly dense in the set of reachable beliefs. The belief point strategies proposed thus far ignore this fact. The next approach we propose gradually expands B by greedily choosing new reachable beliefs that improve the worst-case density.

Unlike SSRA and SSGA, which select a single action to simulate the forward trajectory for a given b ∈ B, Stochastic Simulation with Exploratory Action (SSEA) does a one-step forward simulation with each action, thus producing one new candidate belief per action. However, it does not accept all of these new beliefs; rather, it calculates the distance between each candidate and its closest neighbor in B, and keeps only the candidate that is farthest away from any point already in B. We use the L1 norm to calculate distance between belief points, to be consistent with the error bound in Theorem 3.1. Table 6 summarizes the SSEA expansion procedure.
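SSEA's selection rule can be sketched as below, given a caller-supplied one-step simulator `simulate(b, a)` that returns a sampled successor belief τ(b, a, z) with z drawn from P(z | b, a); the simulator itself is assumed, not shown:

```python
def l1_distance(b1, b2):
    """1-norm distance between two beliefs of equal length."""
    return sum(abs(x - y) for x, y in zip(b1, b2))

def ssea_expand_one(b, belief_set, simulate, n_actions):
    """Forward-simulate each action once from b, then keep the candidate
    belief that is farthest (in 1-norm) from its nearest neighbour in B."""
    best, best_dist = None, -1.0
    for a in range(n_actions):
        cand = simulate(b, a)
        if cand is None:  # e.g. an impossible observation was sampled
            continue
        dist = min(l1_distance(cand, existing) for existing in belief_set)
        if dist > best_dist:
            best, best_dist = cand, dist
    return best
```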

### 4.5 Greedy Error Reduction (Ger)

While the SSEA strategy above improves the worst-case density of reachable beliefs, it does not directly minimize the expected error. And while we would like to minimize the error directly, all we can measure is a bound on the error (Lemma 1). We therefore propose a final strategy, which greedily adds the candidate beliefs that will most effectively reduce this error bound. Our empirical results, presented below, show that this strategy is the most successful one discovered thus far.

To understand how we expand the belief set in the GER strategy, it is useful to re-consider the belief tree, which we reproduce in Figure 3. Each node in the tree corresponds to a specific belief. We can divide these nodes into three sets. Set 1 includes those belief points already in B. Set 2 contains those belief points that are immediate descendants of the points in B (i.e., the nodes in the grey zone); these are the candidates from which we will select the new points to be added to B, and we call this set the envelope (denoted B̄). Set 3 contains all other reachable beliefs.

We need to decide which belief should be removed from the envelope B̄ and added to the set of active belief points B. Every point that is added to B will improve our estimate of the value function. The new point will reduce the error bounds (as defined in Section 3) for points that were already in B; however, the error bound for the new point itself might be quite large. This means that the largest error bound over points in B will not decrease monotonically; however, for any particular point in B (such as the initial belief b_0), the error bound will be decreasing.

To find the point which will most reduce our error bound, we can look at the analysis of Lemma 1, which bounds the amount of additional error that a single point-based backup introduces. Write b′ for the new belief which we are considering adding, and write b for some belief which is already in B. Write α for the value hyper-plane that is maximal at b, and write α′ for the (unknown) hyper-plane that would be maximal at b′. As the lemma points out, we have

 ε(b′) ≤ (α′ − α)·(b′ − b).

When evaluating this error, we should minimize over all b ∈ B. Also, since we do not know what α′ will be until we have done some backups at b′, we make a conservative assumption and choose the worst-case value of α′. Thus, we can evaluate:

 ε(b′) ≤ min_{b∈B} Σ_{s∈S} { (R_max/(1 − γ) − α(s)) (b′(s) − b(s))   if b′(s) ≥ b(s)
                             (R_min/(1 − γ) − α(s)) (b′(s) − b(s))   if b′(s) < b(s).

While one could simply pick the candidate b′ ∈ B̄ which currently has the largest error bound,³ this would ignore reachability considerations. Rather, we evaluate the error at each b ∈ B, weighing the error of the fringe nodes by their reachability probability:

³We tried this; however, it did not perform as well empirically as what we suggest in Equation 26, because it does not consider the probability of reaching that belief.

 ε(b) = max_{a∈A} Σ_{z∈Z} O(b, a, z) ε(τ(b, a, z))
      = max_{a∈A} Σ_{z∈Z} ( Σ_{s∈S} Σ_{s′∈S} T(s, a, s′) O(s′, a, z) b(s) ) ε(τ(b, a, z)),

noting that O(b, a, z) = Σ_{s∈S} Σ_{s′∈S} T(s, a, s′) O(s′, a, z) b(s), and that ε(τ(b, a, z)) can be evaluated according to Equation 25.
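The worst-case single-point bound of Equation 25 translates directly into code. A sketch under our own data layout (`alphas[i]` is assumed to be the α-vector maximal at `belief_set[i]`):

```python
def ger_point_bound(b_new, belief_set, alphas, r_max, r_min, gamma):
    """Worst-case bound on eps(b') for a candidate belief b_new:
    per state, weight the belief difference by R_max/(1-gamma) - alpha(s)
    when b_new(s) >= b(s) and by R_min/(1-gamma) - alpha(s) otherwise,
    then minimise the summed bound over all b already in B."""
    hi = r_max / (1.0 - gamma)
    lo = r_min / (1.0 - gamma)
    best = float("inf")
    for b, alpha in zip(belief_set, alphas):
        total = 0.0
        for s in range(len(b_new)):
            diff = b_new[s] - b[s]
            total += (hi - alpha[s]) * diff if diff >= 0.0 else (lo - alpha[s]) * diff
        best = min(best, total)
    return best
```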

Using Equation 26, we find the existing point b̃ ∈ B with the largest weighted error bound. We can now directly reduce its error by adding to our set one of its descendants. We select the next-step belief τ(b̃, ã, z̃) which maximizes error bound reduction:

 B = B ∪ {τ(b̃, ã, z̃)},                                                (27)
 where (b̃, ã) := argmax_{b∈B, a∈A} Σ_{z∈Z} O(b, a, z) ε(τ(b, a, z))    (28)
       z̃ := argmax_{z∈Z} O(b̃, ã, z) ε(τ(b̃, ã, z)).                    (29)

Table 7 summarizes the GER approach to belief point selection.

The complexity of adding one new point with GER is polynomial in |S|, |A|, |Z|, and |B| (where |S| is the number of states, |A| the number of actions, |Z| the number of observations, and |B| the number of beliefs already selected). A value backup for one point is also polynomial in these quantities, and each point typically needs to be updated several times. As we point out in the empirical results below, belief selection (even with GER) takes minimal time compared to value backup.

This concludes our presentation of belief selection techniques for the PBVI framework. In summary, there are three factors to consider when picking a belief point: (1) how likely is it to occur? (2) how far is it from the belief points already selected? (3) what is the current approximate value at that point? The simplest heuristic (RA) accounts for none of these; some of the others (SSRA, SSGA, SSEA) account for one each; and GER incorporates all three factors.

### 4.6 Belief Expansion Example

We consider a simple example, shown in Figure 4, to illustrate the difference between the various belief expansion techniques outlined above. This 1D POMDP [Littman, 1996] has four states, one of which is the goal (indicated by the star). The two actions, left and right, have the expected (deterministic) effect. The goal state is fully observable (observation = goal), while the other three states are aliased (observation = none). A positive reward is received for being in the goal state; otherwise the reward is zero. We assume a discount factor γ < 1. The initial distribution b_0 is uniform over the non-goal states, and the system resets to that distribution whenever the goal is reached.

The belief set is always initialized to contain the initial belief: B = {b_0}. Figure 5 shows part of the belief tree, including the original belief set (top node) and its envelope (leaf nodes). We now consider what each belief expansion method might do.

The Random heuristic can pick any belief point (with equal probability) from the entire belief simplex. It does not directly expand any branches of the belief tree, but it will eventually place samples near them.

Stochastic Simulation with Random Action has an equal chance of picking each action. Then, regardless of which action was picked, there is a 2/3 chance of seeing observation none and a 1/3 chance of seeing observation goal. As a result, SSRA will select among the four possible successor beliefs τ(b_0, a, z), for a ∈ {left, right} and z ∈ {none, goal}, with these probabilities.

Stochastic Simulation with Greedy Action first needs to know the policy at b_0. A few iterations of point-based updates (Section 2.4) applied to this initial (single-point) belief set reveal the greedy action at b_0.⁴ As a result, expansion of the belief b_0 will select the greedy action with probability (1 − ε) + ε/|A|, and the other action with probability ε/|A|. Combining this with the observation probabilities above, we can tell with what probabilities SSGA will expand b_0 into each of the successor beliefs τ(b_0, a, z).

⁴This may not be obvious to the reader, but it follows directly from the repeated application of Equations 20–23.