Bayes-CPACE: PAC Optimal Exploration in Continuous Space Bayes-Adaptive Markov Decision Processes

by Gilwoo Lee, et al.
University of Washington

We present the first PAC optimal algorithm for Bayes-Adaptive Markov Decision Processes (BAMDPs) in continuous state and action spaces, to the best of our knowledge. The BAMDP framework elegantly addresses model uncertainty by incorporating Bayesian belief updates into long-term expected return. However, computing an exact optimal Bayesian policy is intractable. Our key insight is to compute a near-optimal value function by covering the continuous state-belief-action space with a finite set of representative samples and exploiting the Lipschitz continuity of the value function. We prove the near-optimality of our algorithm and analyze a number of schemes that boost the algorithm's efficiency. Finally, we empirically validate our approach on a number of discrete and continuous BAMDPs and show that the learned policy has consistently competitive performance against baseline approaches.



1 Introduction

Addressing uncertainty is critical for robots that interact with the real world. Often, with good engineering and experience, we can obtain reasonable regimes for uncertainty, specifically model uncertainty, and prepare offline for various contingencies. However, we must still predict, refine, and act online. Thus, in this paper we focus on uncertainty over a set of scenarios, which requires the agent to balance exploration (uncertainty reduction) and exploitation (prior knowledge).

We can naturally express this objective as a Bayes-Adaptive Markov Decision Process [Kolter and Ng2009], which incorporates Bayesian belief updates into long-term expected return. The BAMDP framework formalizes the notion of uncertainty over multiple latent MDPs. This has widespread applications in navigation [Guilliard et al.2018], manipulation [Chen et al.2016], and shared autonomy [Javdani, Srinivasa, and Bagnell2015].

Although BAMDPs provide an elegant problem formulation for model uncertainty, Probably Approximately Correct (henceforth PAC) algorithms for continuous state and action space BAMDPs have been less explored, limiting possible applications in many robotics problems. In the discrete domain, there exist some efficient online, PAC optimal approaches [Kolter and Ng2009, Chen et al.2016] and approximate Monte-Carlo algorithms [Guez, Silver, and Dayan2012], but it is not straightforward to extend this line of work to the continuous domain. State-of-the-art approximation-based approaches for belief space planning in continuous spaces [Sunberg and Kochenderfer2017, Guez et al.2014] do not provide PAC optimality.

In this work, we present the first PAC optimal algorithm for BAMDPs in continuous state and action spaces, to the best of our knowledge. The key challenge for PAC optimal exploration in continuous BAMDPs is that the same state will not be visited twice, which often renders Monte-Carlo approaches computationally prohibitive, as discussed in [Sunberg and Kochenderfer2017]. However, if the value function satisfies certain smoothness properties, i.e. Lipschitz continuity, we can efficiently “cover” the reachable belief space. In other words, we leverage the following property:

A set of representative samples is sufficient to approximate a Lipschitz continuous value function of the reachable continuous state-belief-action space.

Our algorithm, Bayes-CPACE (Figure 1), maintains an approximate value function based on a set of visited samples, with bounded optimism in the approximation from Lipschitz continuity. At each timestep, it greedily selects an action that maximizes the value function. If the action lies in an underexplored region of state-belief-action space, the visited sample is added to the set of samples and the value function is updated. Our algorithm adopts C-PACE [Pazis and Parr2013], a PAC optimal algorithm for continuous MDPs, as our engine for exploring belief space.

Figure 1: The Bayes-CPACE algorithm for BAMDPs. The vertices of the belief simplex correspond to the latent MDPs constituting the BAMDP model, for which we can precompute the optimal Q-values. During an iteration of Bayes-CPACE, it executes its greedy policy from initial belief $b_0$, which either never escapes the known belief MDP $\mathcal{B}_K$ or leads to an unknown sample. Adding the unknown sample to the sample set may expand the known set $K$ and the known belief MDP $\mathcal{B}_K$. The algorithm terminates when the optimally reachable belief space is sufficiently covered.

We make the following contributions:

  1. We present a PAC optimal algorithm for continuous BAMDPs (Section 3).

  2. We show how BAMDPs can leverage the value functions of latent MDPs to reduce the sample complexity of policy search, without sacrificing PAC optimality (Definitions 3.3 and 3.4).

  3. We prove that Lipschitz continuity of latent MDP reward and transition functions is a sufficient condition for Lipschitz continuity of the BAMDP value function (Lemma 3.1).

  4. Through experiments, we show that Bayes-CPACE has competitive performance against state-of-the-art algorithms in discrete BAMDPs and promising performance in continuous BAMDPs (Section 4).

2 Preliminaries

In this section, we review the Bayes-Adaptive Markov Decision Process (BAMDP) framework. A BAMDP is a belief MDP with hidden latent variables that govern the reward and transition functions. The task is to compute an optimal policy that maps state and belief over the latent variables to actions. Since computing an exact optimal policy is intractable [Kurniawati, Hsu, and Lee2008], we state a more achievable property of an algorithm being Probably Approximately Correct. We review related work that addresses this problem, and contrast this objective with other formulations.

Bayes-Adaptive Markov Decision Process

The BAMDP framework assumes that a latent variable $\phi$ governs the reward and transition functions of the underlying Markov Decision Process [Ghavamzadeh et al.2015, Guez, Silver, and Dayan2012, Chen et al.2016]. A BAMDP is defined by a tuple $M = \langle X, A, P, P_0, R, \gamma \rangle$, where $X = S \times \Phi$ is the set of hyper-states (state $s \in S$, latent variable $\phi \in \Phi$), $A$ is the set of actions, $P(s' \mid s, \phi, a)$ is the transition function, $P_0(s, \phi)$ is the initial distribution over hyper-states, $R(s, \phi, a)$ represents the reward obtained when action $a$ is taken in hyper-state $(s, \phi)$, and $\gamma$ is the discount factor.

In this paper, we allow the spaces $S$ and $A$ to be continuous (footnote 1), but limit the set of latent variables $\Phi$ to be finite. For simplicity, we assume that the latent variable is constant throughout an episode (footnote 2).

Footnote 1: For simplicity of exposition, our notation assumes that the spaces are discrete. For the continuous case, all corresponding probabilities are replaced by probability density functions and all summation operators are replaced by integrals.

Footnote 2: It is straightforward to extend this to a deterministically-changing latent variable or incorporate an observation model. This requires augmenting observation into the state definition and computing belief evolution appropriately. This model is derived in [Chen et al.2016].

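We now introduce the notion of a Bayes estimator $\tau$. Since the latent variable $\phi$ is unknown, the agent maintains a belief distribution $b \in B$, where $B$ is a $|\Phi|$-dimensional probability simplex. The agent uses the Bayes estimator $b' = \tau(s, b, a, s')$ to update its current belief $b$ upon taking an action $a$ from state $s$ and transitioning to a state $s'$:

$$b'(\phi) = \eta \, b(\phi) \, P(s' \mid s, \phi, a),$$

where $\eta$ is the normalizing constant.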

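We reformulate the BAMDP as a belief MDP $\mathcal{B}$. We consider the pair $(s, b)$ to be the state of this MDP. The transition function is as follows:

$$P(s', b' \mid s, b, a) = \sum_{\phi \in \Phi} b(\phi) \, P(s' \mid s, \phi, a) \, P(b' \mid s, b, a, s'),$$

where $P(b' \mid s, b, a, s') = 1$ for the belief $b'$ computed by the Bayes estimator and zero everywhere else. The reward function is defined as $R(s, b, a) = \sum_{\phi \in \Phi} b(\phi) \, R(s, \phi, a)$.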
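As a small illustration of the belief update and the belief-averaged reward, the following sketch assumes a finite set of latent MDPs indexed by integers and a latent variable that stays fixed within an episode. The function names and the numbers in the usage lines are hypothetical and chosen only for this example.

```python
from typing import List, Sequence


def bayes_update(belief: Sequence[float],
                 likelihoods: Sequence[float]) -> List[float]:
    """Posterior over a finite set of latent MDPs.

    belief[i]      -- prior probability of latent MDP i
    likelihoods[i] -- probability of the observed transition under latent MDP i
    """
    unnormalized = [b * p for b, p in zip(belief, likelihoods)]
    total = sum(unnormalized)
    if total <= 0.0:
        raise ValueError("transition has zero probability under the current belief")
    return [u / total for u in unnormalized]


def belief_reward(belief: Sequence[float],
                  latent_rewards: Sequence[float]) -> float:
    """Expected reward of one (state, action) pair under the belief."""
    return sum(b * r for b, r in zip(belief, latent_rewards))


# Illustrative numbers only: two latent MDPs, uniform prior.
prior = [0.5, 0.5]
posterior = bayes_update(prior, [0.9, 0.3])
expected_r = belief_reward(posterior, [10.0, -100.0])
```

The update only needs the likelihood of the observed transition under each latent MDP, which is why a generative or analytic model of every latent MDP is enough to track the belief.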

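A policy $\pi$ maps the pair $(s, b)$ to an action $a$. The value of a policy $\pi$ is given by

$$V^\pi(s, b) = R(s, b, a) + \gamma \sum_{s'} P(s' \mid s, b, a) \, V^\pi(s', b'),$$

where $a = \pi(s, b)$, $P(s' \mid s, b, a) = \sum_{\phi} b(\phi) P(s' \mid s, \phi, a)$, and $b'$ is the belief returned by the Bayes estimator. The optimal Bayesian value function $V^*(s, b)$ satisfies the Bellman optimality equation

$$V^*(s, b) = \max_{a} \Big[ R(s, b, a) + \gamma \sum_{s'} P(s' \mid s, b, a) \, V^*(s', b') \Big].$$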

We now characterize what it means to efficiently explore the reachable continuous state-belief-action space. We extend [Kakade2003]’s definition of sample complexity for BAMDPs.

Definition 2.1 (Sample Complexity).

Let $\mathcal{A}$ be a learning algorithm and $\mathcal{A}_t$ be its policy at timestep $t$. The sample complexity of an algorithm is the number of steps $t$ such that $V^{\mathcal{A}_t}(s_t, b_t) < V^*(s_t, b_t) - \epsilon$.

In order to define PAC optimal exploration for continuous space, we need to use the notion of covering number of the reachable belief space.

Definition 2.2 (Covering Number).

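An $\epsilon$-cover of $\mathcal{B}$ is a set $C$ of state-belief-action tuples such that for any reachable query $(s, b, a)$, there exists a sample $(s', b', a') \in C$ such that $d\big((s, b, a), (s', b', a')\big) \le \epsilon$. We define the covering number $N_{\mathcal{B}}(\epsilon)$ to be the size of the largest minimal $\epsilon$-cover, i.e. the largest $C$ which will not remain a cover if any sample is removed.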
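To make the notion of a cover concrete, the sketch below greedily builds one for a finite list of points under the Euclidean metric. It is only an illustration: the function name is hypothetical, the points stand in for state-belief-action tuples, and the greedy construction returns some minimal cover, not necessarily the largest one that defines the covering number.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def greedy_cover(points: Sequence[Point], eps: float) -> List[Point]:
    """Return centers such that every point is within eps of some center.

    A point becomes a new center only if no existing center is within eps,
    so centers are pairwise more than eps apart and none can be removed
    without uncovering itself.
    """
    centers: List[Point] = []
    for p in points:
        if all(math.dist(p, c) > eps for c in centers):
            centers.append(p)
    return centers


# Eleven evenly spaced points on [0, 1], covered at radius 0.25.
grid = [(i / 10,) for i in range(11)]
cover = greedy_cover(grid, eps=0.25)
```

The size of such a cover shrinks as the radius grows, which is the quantity the sample complexity is expressed in.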

Using this definition, we now formalize the notion of PAC optimal exploration for BAMDPs.

Definition 2.3 (PAC-Bayes).

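A BAMDP algorithm $\mathcal{A}$ is called PAC-Bayes if, given any $\epsilon > 0$ and $0 < \delta < 1$, its sample complexity is polynomial in the relevant quantities $\big(N_{\mathcal{B}}(\epsilon), 1/\epsilon, 1/\delta, 1/(1-\gamma)\big)$, with probability at least $1 - \delta$.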

Comparison of PAC-Bayes vs PAC-Bayes-MDP

We shed some light on the important distinction between the concept of PAC-Bayes on a BAMDP (which we analyze) and the more commonly referenced PAC-Bayes on an MDP.

The concept of PAC-Bayes on an MDP with unknown transition and reward functions was first introduced by an online Bayesian exploration algorithm [Kolter and Ng2009], which is often referred to as BEB (Bayesian Exploration Bonus) for the reward bonus term it introduces. At timestep $t$, the algorithm forms a BAMDP using the uncertainty over the reward and transition functions of the single MDP being explored at that time. It is assumed that, even when the episode terminates and the problem resets, the same MDP continues to be explored using the knowledge gathered thus far. The problem addressed is different from ours; Bayes-CPACE produces a policy which is Bayes-optimal with respect to the uncertainty over multiple latent MDPs. We assume that a different latent MDP may be assigned upon reset.

POMDP-lite [Chen et al.2016] extends BEB’s concept of PAC-Bayes to a BAMDP over multiple latent MDPs. Crucially, however, the latent variable in this case cannot reset during the learning phase. The authors allude to this as a “one-shot game … (which) remains unchanged.” In other words, POMDP-lite is an online algorithm which is near-Bayes-optimal only for the current episode, and it does not translate to a BAMDP where a repeated game occurs.

Related Work

While planning in belief space offers a systematic way to deal with uncertainty [Sondik1978, Kaelbling, Littman, and Cassandra1998], it is very hard to solve in general. For a finite horizon problem, finding the optimal policy over the entire belief space is PSPACE-complete [Papadimitriou and Tsitsiklis1987]. For an infinite horizon problem, the problem is undecidable [Madani, Hanks, and Condon1999]. Intuitively, the intractability comes from the number of states in the belief MDP growing exponentially with $|\Phi|$. Point-based algorithms that sample the belief space have seen success in approximately solving POMDPs [Pineau, Gordon, and Thrun2003, Smith and Simmons2005]. Analysis by [Hsu, Lee, and Rong2008] shows that the success can be attributed to the ability to “cover” the optimally reachable belief space.

Offline BAMDP approaches compute a policy a priori for any reachable state and belief. When $S$ is discrete, this is a MOMDP [Ong et al.2010], and can be solved efficiently by representing the augmented belief space with samples and using a point-based solver such as SARSOP [Kurniawati, Hsu, and Lee2008]. A similar approach is used by the BEETLE algorithm [Poupart et al.2006, Spaan and Vlassis2005]. [Bai, Hsu, and Lee2014] presents an offline continuous state and observation POMDP solver, which implies it can solve a BAMDP. However, their approach uses a policy graph where nodes are actions, which makes it difficult to extend to continuous actions.

While offline approaches enjoy good performance, they are computationally expensive. Online approaches circumvent this by starting from the current belief and searching forward. The key is to do sparse sampling [Kearns, Mansour, and Ng2002] to prevent an exponential tree growth. [Wang et al.2005] apply Thompson sampling. BAMCP [Guez, Silver, and Dayan2012] applies Monte-Carlo tree search in belief space [Silver and Veness2010]. DESPOT [Somani et al.2013] improves on this by using lower bounds and determinized sampling techniques. Recently, [Sunberg and Kochenderfer2017] presented an online algorithm, POMCPOW, for continuous state, actions and observations which can be applied to BAMDP problems. Of course, online and offline approaches can be combined, e.g. by using the offline policy as a default rollout policy.

The aforementioned approaches aim for asymptotic guarantees. On the other hand, PAC-MDP [Strehl, Li, and Littman2009] approaches seek to bound the number of exploration steps before achieving near-optimal performance. This was originally formulated in the context of discrete MDPs with unknown transition and reward functions [Brafman and Tennenholtz2002, Strehl et al.2006] and extended to continuous spaces [Kakade, Kearns, and Langford2003, Pazis and Parr2013]. BOSS [Asmuth et al.2009] first introduced the notion of uncertainty over model parameters, albeit for a PAC-MDP style guarantee. The PAC-Bayes property for an MDP was formally introduced in  [Kolter and Ng2009], as discussed in the previous subsection.

There are several effective heuristic-based approaches [Dearden, Friedman, and Russell1998, Strens2000] to BAMDP that we omit for brevity. We refer the reader to [Ghavamzadeh et al.2015] for a comprehensive survey. We also compare with QMDP [Littman, Cassandra, and Kaelbling1995], which approximates the expected Q-value with respect to the current belief and greedily chooses an action.
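The QMDP baseline mentioned above can be sketched in a few lines: average the optimal Q-values of the latent MDPs under the current belief and act greedily. The function name, the action indexing, and the numbers in the usage lines are assumptions made for this illustration.

```python
from typing import Sequence


def qmdp_action(belief: Sequence[float],
                latent_q: Sequence[Sequence[float]]) -> int:
    """Greedy action under belief-averaged latent Q-values.

    latent_q[i][a] is the optimal Q-value of action a in latent MDP i,
    evaluated at the current state.
    """
    num_actions = len(latent_q[0])
    expected = [sum(b * q[a] for b, q in zip(belief, latent_q))
                for a in range(num_actions)]
    return max(range(num_actions), key=expected.__getitem__)


# Illustrative values: actions 0 and 1 are each good in exactly one latent
# MDP, action 2 is a mildly costly alternative in both.
q_table = [[10.0, -100.0, -1.0],
           [-100.0, 10.0, -1.0]]
uncertain_choice = qmdp_action([0.5, 0.5], q_table)
confident_choice = qmdp_action([0.99, 0.01], q_table)
```

Because the averaged value ignores how the belief will change after acting, this rule does not by itself assign value to information gathering.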

Algorithm                                 Continuous   PAC   Offline
SARSOP [Kurniawati, Hsu, and Lee2008]     no           no    yes
POMDP-lite [Chen et al.2016]              no           yes   no
POMCPOW [Sunberg and Kochenderfer2017]    yes          no    no
Bayes-CPACE (Us)                          yes          yes   yes

Table 1: Comparison of BAMDP algorithms

Table 1 compares the key features of Bayes-CPACE against a selection of prior work.

3 Bayes-CPACE: Continuous PAC Optimal Exploration in Belief Space

In this section, we present Bayes-CPACE, an offline PAC-Bayes algorithm that computes a near-optimal policy for a continuous state and action BAMDP. Bayes-CPACE is an extension of C-PACE [Pazis and Parr2013], a PAC optimal algorithm for continuous state and action MDPs. Efficient exploration of a continuous space is challenging because the same state-action pair cannot be visited more than once. C-PACE addresses this by assuming that the state-action value function is Lipschitz continuous, allowing the value of a state-action pair to be approximated with nearby samples. Similar to other PAC optimal algorithms [Strehl, Li, and Littman2009], C-PACE applies the principle of optimism in the face of uncertainty: the value of a state-action pair is approximated by averaging the value of nearby samples, inflated proportionally to their distances. Intuitively, this distance-dependent bonus term encourages exploration of regions that are far from previous samples until the optimistic estimate results in a near-optimal policy.

Our key insight is that C-PACE can be extended from continuous states to those augmented with finite-dimensional belief states. We derive sufficient conditions for Lipschitz continuity of the belief value function. We show that Bayes-CPACE is indeed PAC-Bayes and bound the sample complexity as a function of the covering number of the reachable belief space from initial belief $b_0$. In addition, we also present and analyze three practical strategies for improving the sample complexity and runtime of Bayes-CPACE.

Definitions and Assumptions

We assume all rewards lie in $[0, R_{\max}]$, which implies that all values lie in $[0, Q_{\max}]$ with $Q_{\max} = R_{\max}/(1-\gamma)$. We will first show that Assumption 3.1 and Assumption 3.2 are sufficient conditions for Lipschitz continuity of the value function (footnote 3). Subsequent proofs do not depend on these assumptions as long as the value function is Lipschitz continuous.

Footnote 3: For all proofs, refer to supplementary material.

Assumption 3.1 (Lipschitz Continuous Reward and Transition Functions).

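Given any two state-action pairs $(s_1, a_1)$ and $(s_2, a_2)$, there exists a distance metric $d(\cdot)$ and Lipschitz constants $L_R, L_P$ such that the following is true for every latent variable $\phi \in \Phi$:

$$\big| R(s_1, \phi, a_1) - R(s_2, \phi, a_2) \big| \le L_R \, d\big((s_1, a_1), (s_2, a_2)\big)$$

$$\sum_{s'} \big| P(s' \mid s_1, \phi, a_1) - P(s' \mid s_2, \phi, a_2) \big| \le L_P \, d\big((s_1, a_1), (s_2, a_2)\big)$$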


Assumption 3.2 (Belief Contraction).

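Given any two belief vectors $b_1, b_2$ and any tuple of $(s, a, s')$, the updated beliefs from the Bayes estimator $b_1' = \tau(s, b_1, a, s')$ and $b_2' = \tau(s, b_2, a, s')$ satisfy the following:

$$\lVert b_1' - b_2' \rVert_1 \le \lVert b_1 - b_2 \rVert_1$$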

Assumption 3.1 and Assumption 3.2 can be used to prove the following lemma.

Lemma 3.1 (Lipschitz Continuous Value Function).

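Given any two state-belief-action tuples $(s_1, b_1, a_1)$ and $(s_2, b_2, a_2)$, there exists a distance metric $d(\cdot)$ and a Lipschitz constant $L_Q$ such that the following is true:

$$\big| Q(s_1, b_1, a_1) - Q(s_2, b_2, a_2) \big| \le L_Q \, d\big((s_1, b_1, a_1), (s_2, b_2, a_2)\big)$$

The distance metric for state-belief-action tuples is a linear combination of the distance metric for state-action pairs used in Assumption 3.1 and the $L_1$ norm for belief

$$d\big((s_1, b_1, a_1), (s_2, b_2, a_2)\big) = \alpha \, d\big((s_1, a_1), (s_2, a_2)\big) + \lVert b_1 - b_2 \rVert_1$$

for an appropriate choice of $\alpha$, which is a function of $L_R$ and $L_P$.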
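A minimal sketch of such a combined metric is given below. It assumes a Euclidean distance on the concatenated state-action vector and treats the weight `alpha` as a given input; both choices, and the function name, are assumptions for illustration and not the specific metric or constant used in the analysis.

```python
import math
from typing import Sequence


def tuple_distance(s1: Sequence[float], b1: Sequence[float], a1: Sequence[float],
                   s2: Sequence[float], b2: Sequence[float], a2: Sequence[float],
                   alpha: float) -> float:
    """alpha * (state-action distance) + L1 distance between beliefs."""
    state_action = math.dist(tuple(s1) + tuple(a1), tuple(s2) + tuple(a2))
    belief_l1 = sum(abs(p - q) for p, q in zip(b1, b2))
    return alpha * state_action + belief_l1


# Two tuples that differ in both state and belief.
d_example = tuple_distance((0.0, 0.0), (1.0, 0.0), (0.0,),
                           (3.0, 4.0), (0.0, 1.0), (0.0,),
                           alpha=0.5)
```

Since the belief term is bounded by 2 for any pair of distributions, the weight controls how quickly state-action differences dominate the combined distance.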

Bayes-CPACE builds an optimistic estimator $\tilde{Q}$ for the value function using nearest neighbor function approximation from a collected sample set. Since the value function is Lipschitz continuous, the value for any query can be estimated by extrapolating the value of neighboring samples with a distance-dependent bonus. If the number of close neighbors is sufficiently large, the query is said to be “known” and the estimate can be bounded. Otherwise, the query is unknown and is added to the sample set. Once enough samples are added, the entire reachable space will be known and the estimate will be bounded with respect to the true optimal value function $Q^*$. We define these terms more formally below.

Definition 3.1 (Known Query).

Let $L_{\tilde{Q}}$ be the Lipschitz constant of the optimistic estimator. A state-belief-action query $(s, b, a)$ is said to be "known" if its $k$-th nearest neighbor in the sample set $C$ is within $\epsilon / L_{\tilde{Q}}$.
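The known-query test can be sketched as follows, under the reading that a query is known once its k-th nearest stored sample lies within a given radius (consistent with the earlier remark that the number of close neighbors must be sufficiently large). The function name and the parameters `k` and `radius` are illustrative inputs, not constants taken from the analysis.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def is_known(query: T, samples: Sequence[T],
             distance: Callable[[T, T], float],
             k: int, radius: float) -> bool:
    """True if at least k stored samples lie within `radius` of the query.

    This is equivalent to the k-th nearest neighbor being within `radius`.
    """
    close = sum(1 for c in samples if distance(query, c) <= radius)
    return close >= k


# One-dimensional toy example with absolute difference as the metric.
stored = [0.0, 0.1, 0.5]
abs_diff = lambda p, q: abs(p - q)
known_with_two = is_known(0.05, stored, abs_diff, k=2, radius=0.1)
known_with_three = is_known(0.05, stored, abs_diff, k=3, radius=0.1)
```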

We are now ready to define the estimator.

Definition 3.2 (Optimistic Value Estimate).

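Assume we have a set of samples $C$ where every element is a tuple $(s_i, b_i, a_i, r_i, s_i', b_i')$: starting from $(s_i, b_i)$, the agent took an action $a_i$, received a reward $r_i$, and transitioned to $(s_i', b_i')$. Given a state-belief-action query $(s, b, a)$, its $j$-th nearest neighbor $(s_j, b_j, a_j)$ from the sample set provides an optimistic estimate

$$x_j = L_{\tilde{Q}} \, d\big((s, b, a), (s_j, b_j, a_j)\big) + \tilde{Q}(s_j, b_j, a_j). \tag{1}$$

The value is the average of all the nearest neighbor estimates

$$\tilde{Q}(s, b, a) = \frac{1}{k} \sum_{j=1}^{k} \min\big(x_j, Q_{\max}\big), \tag{2}$$

where $Q_{\max}$ is the upper bound of the estimate. If there are fewer than $k$ neighbors, $Q_{\max}$ can be used in place of the corresponding $x_j$.

Note that the estimator is a recursive function. Given a sample set $C$, value iteration is performed to compute the estimate for each of the sample points,

$$\tilde{Q}(s_i, b_i, a_i) = r_i + \gamma \max_{a} \tilde{Q}(s_i', b_i', a), \tag{3}$$

where $\tilde{Q}(s_i', b_i', a)$ is approximated via (2) using its nearby samples. This estimate must be updated every time a new sample is added to the set.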
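The optimistic nearest-neighbor estimate and its value-iteration update can be sketched as below. Points are abstract stand-ins for state-belief-action tuples, `successors[i]` lists the query points reached from sample `i` (one per action), and all names and the fixed number of sweeps are assumptions made for this illustration.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def optimistic_q(query: T, xs: Sequence[T], qs: Sequence[float],
                 distance: Callable[[T, T], float],
                 k: int, lipschitz: float, q_max: float) -> float:
    """Average of k nearest-neighbor estimates, each inflated by distance
    and capped at q_max; missing neighbors count as q_max."""
    nearest = sorted((distance(query, x), q) for x, q in zip(xs, qs))[:k]
    estimates = [min(lipschitz * d + q, q_max) for d, q in nearest]
    estimates += [q_max] * (k - len(estimates))
    return sum(estimates) / k


def fit(xs: Sequence[T], rewards: Sequence[float],
        successors: Sequence[Sequence[T]],
        distance: Callable[[T, T], float],
        k: int, lipschitz: float, q_max: float,
        gamma: float, sweeps: int = 200) -> List[float]:
    """Repeated backups q_i = r_i + gamma * max over successor queries."""
    qs = [q_max] * len(xs)  # optimistic initialization
    for _ in range(sweeps):
        qs = [r + gamma * max(optimistic_q(n, xs, qs, distance, k, lipschitz, q_max)
                              for n in nxt)
              for r, nxt in zip(rewards, successors)]
    return qs


# Single self-looping sample with reward 1 and discount 0.5: fixed point is 2.
abs_diff = lambda p, q: abs(p - q)
fitted = fit([0.0], [1.0], [[0.0]], abs_diff, k=1, lipschitz=1.0, q_max=10.0, gamma=0.5)
far_query = optimistic_q(3.0, [0.0], fitted, abs_diff, k=1, lipschitz=1.0, q_max=10.0)
sparse_query = optimistic_q(3.0, [0.0], fitted, abs_diff, k=2, lipschitz=1.0, q_max=10.0)
```

In the toy example the far query is valued at the neighbor's value plus the distance bonus, and asking for two neighbors when only one exists pulls the estimate toward the upper bound, which is what drives exploration of sparsely sampled regions.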

We introduce two additional techniques that leverage the Q-values of the underlying latent MDPs to improve the sample complexity of Bayes-CPACE.

Definition 3.3 (Best-Case Upper Bound).

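We can replace the constant $Q_{\max}$ in Definition 3.2 with $Q_{\max}(s, b, a)$ computed as follows:

$$Q_{\max}(s, b, a) = \max_{\phi \,:\, b(\phi) > 0} Q^*_\phi(s, a).$$

In general, any admissible heuristic $U$ that satisfies $Q^*(s, b, a) \le U(s, b, a) \le Q_{\max}$ can be used. In practice, the Best-Case Upper Bound reduces exploration of actions which are suboptimal in all latent MDPs with nonzero probability.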
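A minimal sketch of this bound follows, assuming it is the largest optimal latent-MDP Q-value among latent MDPs that still have nonzero belief, which is how the surrounding text describes it. The function name and the example numbers are hypothetical.

```python
from typing import Sequence


def best_case_upper_bound(belief: Sequence[float],
                          latent_q: Sequence[float]) -> float:
    """Largest latent-MDP Q-value among latent MDPs with nonzero belief.

    latent_q[i] is the optimal Q-value of the queried (state, action) pair
    in latent MDP i.
    """
    return max(q for b, q in zip(belief, latent_q) if b > 0.0)


# The third latent MDP has been ruled out, so its large value is ignored.
bound = best_case_upper_bound([0.5, 0.5, 0.0], [3.0, 7.0, 100.0])
```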

We can also take advantage of $Q^*_\phi$, the optimal Q-values of the latent MDPs, whenever the belief distribution collapses. These exact values for the latent MDPs can be used to seed the initial estimates.

Definition 3.4 (Known Latent Initialization).

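Let $e_\phi$ be the belief distribution where $P(\phi) = 1$, i.e. a one-hot encoding. If there exists $e_\phi$ such that $\lVert b - e_\phi \rVert_1$ falls below a small threshold, then we can use the following estimate:

$$\tilde{Q}(s, b, a) = Q^*_\phi(s, a).$$

This extends Definition 3.1 for a known query to include any state-belief-action tuple where the belief is within that threshold of a one-hot vector.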
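The seeding rule can be sketched as follows. The closest one-hot vector in L1 distance puts all mass on the most likely latent MDP, and its distance to the belief is twice the remaining mass. The `threshold` argument is left as an input because its exact value is not reproduced here; the function names and the fallback callable are hypothetical.

```python
from typing import Callable, Sequence, Tuple


def closest_one_hot(belief: Sequence[float]) -> Tuple[int, float]:
    """Index of the nearest one-hot vector and its L1 distance to the belief."""
    phi = max(range(len(belief)), key=belief.__getitem__)
    l1 = 2.0 * (1.0 - belief[phi])  # (1 - b[phi]) plus the mass on all others
    return phi, l1


def seeded_estimate(belief: Sequence[float], latent_q: Sequence[float],
                    threshold: float, fallback: Callable[[], float]) -> float:
    """Use the exact latent Q-value when the belief has nearly collapsed."""
    phi, l1 = closest_one_hot(belief)
    if l1 <= threshold:
        return latent_q[phi]
    return fallback()


collapsed = seeded_estimate([0.98, 0.02], [4.0, 9.0], threshold=0.05, fallback=lambda: -1.0)
uncertain = seeded_estimate([0.6, 0.4], [4.0, 9.0], threshold=0.05, fallback=lambda: -1.0)
```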

We refer to Proposition 3.1 for how this reduces sample complexity.


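Algorithm 1 Bayes-CPACE
 1: Input: Bayes estimator $\tau$, initial belief $b_0$, BAMDP $M$, terminal condition $G$, horizon $T$
 2: Output: Action value estimate $\tilde{Q}$
 4: Initialize sample set $C \leftarrow \emptyset$
 5: while $G$ is false do
 6:     Initialize $M$ by resampling initial state and latent variable to $(s_0, \phi_0) \sim P_0$
 7:     Reset belief to $b_0$
 8:     for $t = 0, 1, \ldots, T-1$ do
 9:         Compute action $a_t \leftarrow \arg\max_a \tilde{Q}(s_t, b_t, a)$
10:         Execute $a_t$ on $M$ to receive $r_t, s_{t+1}$
11:         Invoke $\tau$ to get $b_{t+1} \leftarrow \tau(s_t, b_t, a_t, s_{t+1})$
12:         if $(s_t, b_t, a_t)$ is not known then
13:             Add $(s_t, b_t, a_t, r_t, s_{t+1}, b_{t+1})$ to $C$
14:             Find fixed point of $\tilde{Q}(s_i, b_i, a_i)$ for $c_i \in C$
17: function $\tilde{Q}(s, b, a)$
18:     Find closest one-hot vector $e_\phi \leftarrow \arg\min_{e_\phi} \lVert b - e_\phi \rVert_1$
19:     if $\lVert b - e_\phi \rVert_1$ is within the threshold of Definition 3.4 then
20:         Return $Q^*_\phi(s, a)$
21:     else
22:         Find $k$ nearest neighbors $\{(s_j, b_j, a_j)\}_{j=1}^{k}$ in sample set $C$
23:         for $j = 1, \ldots, k$ do
24:             $d_j \leftarrow d\big((s, b, a), (s_j, b_j, a_j)\big)$
25:             $x_j \leftarrow L_{\tilde{Q}} \, d_j + \tilde{Q}(s_j, b_j, a_j)$
27:     Return $\frac{1}{k} \sum_{j=1}^{k} \min\big(x_j, Q_{\max}(s, b, a)\big)$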

We describe our algorithm, Bayes-CPACE, in Algorithm 1. To summarize, at every timestep the algorithm computes a greedy action $a_t$ using its current value estimate $\tilde{Q}$, receives a reward $r_t$, and transitions to a new state-belief $(s_{t+1}, b_{t+1})$ (Lines 9-11). If the sample is not known, it is added to the sample set $C$ (Line 13). The value estimates for all samples are updated until the fixed point is reached (Line 14). The terminal condition $G$ is met when no more samples are added and value iteration has converged for a sufficient number of iterations. The algorithm invokes a subroutine for computing the estimated value function (Lines 17-27), which corresponds to the operations described in Definitions 3.2, 3.3, and 3.4.
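The control flow summarized above can be sketched with every problem-specific piece passed in as a callable. This is only a skeleton under stated assumptions: the terminal condition is replaced by a fixed number of episodes, and the toy environment, the constant value estimate, and all function names are hypothetical stand-ins used to make the loop runnable.

```python
from typing import Callable, List, Sequence, Tuple


def bayes_cpace_loop(reset: Callable, step: Callable, update_belief: Callable,
                     actions: Sequence, q_estimate: Callable,
                     is_known: Callable, refit: Callable,
                     horizon: int, episodes: int) -> List[Tuple]:
    samples: List[Tuple] = []
    for _ in range(episodes):            # stands in for "while the terminal condition is false"
        s, b = reset()                   # resample initial state and latent MDP, reset belief
        for _ in range(horizon):
            a = max(actions, key=lambda act: q_estimate(samples, s, b, act))
            r, s_next = step(s, a)
            b_next = update_belief(s, b, a, s_next)
            if not is_known(samples, s, b, a):
                samples.append((s, b, a, r, s_next, b_next))
                refit(samples)           # value iteration over the sample set
            s, b = s_next, b_next
    return samples


# Toy stand-ins: a four-state corridor with a single latent MDP.
def _reset():
    return 0, (1.0,)

def _step(s, a):
    s_next = min(s + 1, 3) if a == "right" else max(s - 1, 0)
    return (1.0 if s_next == 3 else 0.0), s_next

def _update(s, b, a, s_next):
    return b

def _q(samples, s, b, a):
    return 1.0 if a == "right" else 0.0

def _known(samples, s, b, a):
    return any(c[0] == s and c[2] == a for c in samples)

def _refit(samples):
    pass

collected = bayes_cpace_loop(_reset, _step, _update, ["left", "right"],
                             _q, _known, _refit, horizon=5, episodes=2)
```

In the toy run the first episode adds one sample per newly visited state-action pair and the second episode adds none, mirroring how the sample set stops growing once the greedy policy stays inside the known region.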

Analysis of Sample Complexity

We now prove that Bayes-CPACE is PAC-Bayes. Since we adopt the proof of C-PACE, we only state the main steps and defer the full proof to supplementary material. We begin with the concept of a known belief MDP.

Definition 3.5 (Known Belief MDP).

Let $\mathcal{B}$ be the original belief MDP. Let $K$ be the set of all known state-belief-action tuples. We define a known belief MDP $\mathcal{B}_K$ that is identical to $\mathcal{B}$ on $K$ (i.e. identical transition and reward functions) and for all other state-belief-action tuples, it transitions deterministically with a reward $\tilde{Q}(s, b, a)$ to an absorbing state with zero reward.

We can then bound the performance of a policy on $\mathcal{B}$ with its performance on $\mathcal{B}_K$ and the maximum penalty incurred by escaping it.

Lemma 3.2 (Generalized Induced Inequality, Lemma 8 in [Strehl and Littman2008]).

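We are given the original belief MDP $\mathcal{B}$, the known belief MDP $\mathcal{B}_K$, a policy $\pi$ and time horizon $T$. Let $P(E_K)$ be the probability of an escape event, i.e. the probability of sampling a state-belief-action tuple that is not in $K$ when executing $\pi$ on $\mathcal{B}$ from $(s, b)$ for $T$ steps. Let $V^\pi_{\mathcal{B}}(s, b, T)$ be the value of executing policy $\pi$ on $\mathcal{B}$. Then the following is true:

$$V^\pi_{\mathcal{B}}(s, b, T) \ge V^\pi_{\mathcal{B}_K}(s, b, T) - Q_{\max} \, P(E_K).$$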

We now show one of two things can happen: either the greedy policy escapes from the known MDP, or it remains in it and performs near optimally. We first show that it can only escape a certain number of times before the entire reachable space is known.

Lemma 3.3 (Full Coverage of Known Space, Lemma 4.5 in [Kakade, Kearns, and Langford2003]).

All reachable state-belief-action queries will become known after adding at most samples to .

Corollary 3.1 (Bounded Escape Probability).

At a given timestep, let . Then with probability , this can happen at most for timesteps.

We now show that when inside the known MDP, the greedy policy will be near optimal.

Lemma 3.4 (Near-optimality of Approximate Greedy (Theorem 3.12 of [Pazis and Parr2013])).

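Let $\tilde{Q}$ be an estimate of the value function that has bounded Bellman error $-\epsilon_- \le \tilde{Q} - \mathbb{B}\tilde{Q} \le \epsilon_+$, where $\mathbb{B}$ is the Bellman operator. Let $\tilde{\pi}$ be the greedy policy on $\tilde{Q}$. Then the policy $\tilde{\pi}$ is near-optimal:

$$V^{\tilde{\pi}}(s, b) \ge V^*(s, b) - \frac{\epsilon_- + \epsilon_+}{1 - \gamma}.$$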

Let be the approximation error caused by using a finite number of neighbors in (2) instead of the Bellman operator. Then Lemma 3.4 leads to the following corollary.

Corollary 3.2 (Near-optimality on Known Belief MDP).

If , i.e. the number of neighbors is large enough, then using Hoeffding’s inequality we can show . Then on the known belief MDP , the following can be shown with probability :

We now put together these ideas to state the main theorem.

Theorem 3.1 (Bayes-CPACE is PAC-Bayes).

Let be a belief MDP. At timestep , let be the greedy policy on , and let be the state-belief pair. With probability at least , , i.e. the algorithm is -close to the optimal policy for all but

steps when is used for the number of neighbors in (2).

Proof (sketch).

At time $t$, we can form a known belief MDP $\mathcal{B}_K$ from the samples collected so far. Either the policy leads to an escape event within the next $T$ steps or the agent stays within $\mathcal{B}_K$. Such an escape can happen only a bounded number of times with high probability (Corollary 3.1); when the escape probability is low, the greedy policy is near-optimal (Corollary 3.2). ∎

Analysis of Performance Enhancements

We can initialize estimates with exact Q-values for the latent MDPs. This makes the known space larger, thus reducing the covering number.

Proposition 3.1 (Known Latent Initialization).

Let be the covering number of the reduced space . Then the sample complexity reduces by a factor of .

It is also unnecessary to perform value iteration until convergence.
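To see why a finite number of sweeps can suffice, the sketch below assumes the backup is a contraction with factor equal to the discount in the max norm, so that the gap to the fixed point after `i` sweeps is at most `gamma**i * q_max` when all values lie in `[0, q_max]`. This is a generic contraction argument with hypothetical names, not the specific iteration count of the proposition that follows.

```python
import math


def sweeps_needed(gamma: float, q_max: float, tol: float) -> int:
    """Smallest i such that gamma**i * q_max <= tol, for 0 < gamma < 1."""
    if tol >= q_max:
        return 0
    return math.ceil(math.log(tol / q_max) / math.log(gamma))


# Illustrative numbers: discount 0.9, values in [0, 100], tolerance 1.
n_sweeps = sweeps_needed(gamma=0.9, q_max=100.0, tol=1.0)
```

The count grows only logarithmically as the tolerance shrinks, which is why truncating value iteration costs a small additive term in suboptimality.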

Proposition 3.2 (Approximate Value Iteration).

Let . Suppose the value iteration step (Line 14) is run only for iterations denoted by (instead of until convergence ). We can bound the difference between two functions as . This results in an added suboptimality term in Theorem 3.1:


One practical enhancement is to collect new samples in a batch with a fixed policy before performing value iteration. This requires two changes to the algorithm: 1) an additional loop that repeats Lines 8-14 a fixed number of times, and 2) performing Line 14 outside of that loop. This increases the sample complexity by a constant factor but has empirically reduced runtime by only performing value iteration when a large change is expected.

Proposition 3.3 (Batch Sample Update).

Suppose we collect new samples from rollouts with the greedy policy at time before performing value iteration. This increases the sample complexity only by a constant factor of .

Figure 2: (a) Value approximation for Tiger. (b) One optimal path taken by Bayes-CPACE for Light-Dark Tiger. (c) Benchmark results for QMDP, P-Lite, SARSOP, and BCPACE on Tiger, Chain, LDT, and LDT(cont.); LDT(cont.) has continuous state space. With greedy exploration, only best actions are tightly approximated (Figure 2(a)). Bayes-CPACE takes optimal actions for a continuous BAMDP (Figure 2(b)). Bayes-CPACE is competitive for both discrete and continuous BAMDPs (Figure 2(c)).

4 Experimental Results

We compare Bayes-CPACE with QMDP, POMDP-lite, and SARSOP for discrete BAMDPs and with QMDP for continuous BAMDPs. For discrete state spaces, we evaluate Bayes-CPACE on two widely used synthetic examples, Tiger [Kaelbling, Littman, and Cassandra1998] and Chain [Strens2000]. For both Bayes-CPACE and POMDP-lite, the parameters were tuned offline for best performance. For continuous state spaces, we evaluate on a variant of the Light-Dark problem [Platt Jr et al.2010].

While our analysis is applicable to BAMDPs with continuous state and action spaces, any approximation of the greedy selection of an action is not guaranteed to be PAC-Bayes. Thus, we limit our continuous BAMDP experiments to discrete action spaces and leave the continuous action case for future work.

Tiger: We start with the Tiger problem. The agent stands in front of two closed doors and can choose one of three actions: listen, open the left door, or open the right door. One of the doors conceals a tiger; opening this door results in a penalty of -100, while the other results in a reward of 10. Listening reports the correct location of the tiger with a fixed probability, at a cost of -1. As observed by [Chen et al.2016], this POMDP problem can be cast as a BAMDP problem with two latent MDPs.
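The two latent MDPs of Tiger and the belief update after listening can be sketched as below. The listening accuracy of 0.85 is the value commonly used for this benchmark and is an assumption here, since the description above does not fix it; the constant and function names are hypothetical.

```python
from typing import List, Sequence

LISTEN, OPEN_LEFT, OPEN_RIGHT = 0, 1, 2
LEFT, RIGHT = 0, 1          # latent variable: which door hides the tiger
ACCURACY = 0.85             # assumed probability that listening is correct


def tiger_reward(tiger: int, action: int) -> float:
    if action == LISTEN:
        return -1.0
    opened = LEFT if action == OPEN_LEFT else RIGHT
    return -100.0 if opened == tiger else 10.0


def belief_after_listen(belief: Sequence[float], heard: int) -> List[float]:
    """Bayes update of the belief over tiger location after one observation."""
    likelihood = [ACCURACY if tiger == heard else 1.0 - ACCURACY
                  for tiger in (LEFT, RIGHT)]
    unnormalized = [b * l for b, l in zip(belief, likelihood)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def expected_reward(belief: Sequence[float], action: int) -> float:
    return sum(b * tiger_reward(t, action) for t, b in zip((LEFT, RIGHT), belief))


# Hearing the tiger on the left twice makes opening the right door attractive.
b = [0.5, 0.5]
b = belief_after_listen(b, LEFT)
b = belief_after_listen(b, LEFT)
value_open_right = expected_reward(b, OPEN_RIGHT)
```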

Figure 2(c) shows that Bayes-CPACE is competitive with SARSOP and outperforms QMDP and POMDP-lite. This is not surprising since both Bayes-CPACE and SARSOP are offline solvers.

Figure 2(a) visualizes the estimated values. Because Bayes-CPACE explores greedily, exploration is focused on actions with high estimated value, either due to optimism from under-exploration or actual high value. As a result, suboptimal actions are not taken once Bayes-CPACE is confident that they have lower value than other actions. Because fewer samples have been observed for these suboptimal actions, their approximated values are not tight. Note also that the original problem explores a much smaller subset of the belief space, so for this visualization we have randomly sampled the initial belief rather than always initializing it to 0.5, forcing Bayes-CPACE to perform additional exploration.

Chain: The Chain problem consists of five states $s_1, \ldots, s_5$ and two actions $\{A, B\}$. Taking action $A$ in state $s_i$ transitions to $s_{i+1}$ with no reward; taking action $A$ in state $s_5$ transitions to $s_5$ with a reward of 10. Action $B$ transitions from any state to $s_1$ with a reward of 2. However, these actions are noisy: in the canonical version of Chain, the opposite action is taken with slip probability 0.2. In our variant, we allow the slip probability to be selected from a set of three values with uniform probability at the beginning of each episode. These three latent MDPs form a BAMDP. Figure 2(c) shows that Bayes-CPACE outperforms other algorithms.
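One step of the Chain dynamics can be sketched as follows, with the slip probability passed in as the latent parameter. States are indexed 0 to 4 and the two actions 0 (advance) and 1 (return to start); this encoding and the function name are assumptions for illustration.

```python
import random
from typing import Tuple

ADVANCE, RESET = 0, 1
LAST_STATE = 4  # five states, indexed 0..4


def chain_step(state: int, action: int, slip: float,
               rng: random.Random) -> Tuple[int, float]:
    """Return (next_state, reward); with probability `slip` the other action is taken."""
    if rng.random() < slip:
        action = 1 - action
    if action == ADVANCE:
        if state == LAST_STATE:
            return LAST_STATE, 10.0
        return state + 1, 0.0
    return 0, 2.0


rng = random.Random(0)
no_slip_advance = chain_step(0, ADVANCE, 0.0, rng)
no_slip_goal = chain_step(LAST_STATE, ADVANCE, 0.0, rng)
always_slip = chain_step(2, ADVANCE, 1.0, rng)
```

Each candidate slip probability defines one latent MDP, so the likelihood of an observed transition under each candidate is what the belief update needs.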

Light-Dark Tiger: We consider a variant of the Light-Dark problem, which we call Light-Dark Tiger (Figure 2(b)). In this problem, one of the two goal corners (top-right or bottom-right) contains a tiger. The agent receives a penalty of -100 if it enters the goal corner containing the tiger and a reward of 10 if it enters the other region. There are four actions (Up, Down, Left, Right), which move the agent one unit subject to Gaussian noise. The tiger location is unknown to the agent until the left wall is reached. As in the original Tiger problem, this POMDP can be formulated as a BAMDP with two latent MDPs.

We consider two cases, one with zero noise and another with nonzero noise. With zero noise, the problem is a discrete POMDP and the optimal solution is deterministic; the agent hits the left wall and goes straight to the goal location. When there is noise, the agent may not reach the left wall in the first step. Paths executed by Bayes-CPACE still take Left until the left wall is hit and then go to the goal (Figure 2(b)).

5 Discussion

We have presented the first PAC-Bayes algorithm for continuous BAMDPs whose value functions are Lipschitz continuous. While the practical implementation of Bayes-CPACE is limited to discrete actions, our analysis holds for both continuous and discrete state and action spaces. We believe that our analysis provides an important insight for the development of PAC efficient algorithms for continuous BAMDPs.

The BAMDP formulation is useful for real-world robotics problems where uncertainty over latent models is expected at test time. An efficient policy search algorithm must incorporate prior knowledge over the latent MDPs to take advantage of this formulation. As a step toward this direction, we have introduced several techniques that utilize the value functions of underlying latent MDPs without affecting PAC optimality.

One of the key assumptions Bayes-CPACE has made is that the cardinality of the latent state space is finite. This may not be true in many robotics applications in which latent variables are drawn from continuous distributions. In such cases, the true BAMDP can be approximated by sampling a set of latent variables, as introduced in  [Wang et al.2012]. In future work, we will investigate methods to select representative MDPs and to bound the gap between the optimal value function of the true BAMDP and the approximated one.

Although it is beyond the scope of this paper, we would like to make two remarks. First, Bayes-CPACE can easily be extended to allow parallel exploration, similar to how [Pazis and Parr2016] extended the original C-PACE to concurrently explore multiple MDPs. Second, since we have generative models for the latent MDPs, we may enforce exploration from arbitrary belief points. Of course, the key to efficient exploration of belief space lies in exploring just beyond the optimally reachable belief space, so “random” initialization is unlikely to be helpful. However, if we can approximate this space similarly to sampling-based kinodynamic planning algorithms [Li, Littlefield, and Bekris2016], this may lead to more structured search in belief space.

6 Acknowledgements

This work was partially funded by Kwanjeong Educational Foundation, NASA Space Technology Research Fellowships (NSTRF), the National Institutes of Health R01 (#R01EB019335), National Science Foundation CPS (#1544797), National Science Foundation NRI (#1637748), the Office of Naval Research, the RCTA, Amazon, and Honda.


  • [Asmuth et al.2009] Asmuth, J.; Li, L.; Littman, M. L.; Nouri, A.; and Wingate, D. 2009. A bayesian sampling approach to exploration in reinforcement learning. In Conference on Uncertainty in Artificial Intelligence.
  • [Bai, Hsu, and Lee2014] Bai, H.; Hsu, D.; and Lee, W. S. 2014. Integrated perception and planning in the continuous space: A pomdp approach. The International Journal of Robotics Research 33(9).
  • [Brafman and Tennenholtz2002] Brafman, R. I., and Tennenholtz, M. 2002. R-max - A general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research.
  • [Chen et al.2016] Chen, M.; Frazzoli, E.; Hsu, D.; and Lee, W. S. 2016. POMDP-lite for Robust Robot Planning under Uncertainty. In IEEE International Conference on Robotics and Automation.
  • [Dearden, Friedman, and Russell1998] Dearden, R.; Friedman, N.; and Russell, S. 1998. Bayesian q-learning. In AAAI Conference on Artificial Intelligence.
  • [Ghavamzadeh et al.2015] Ghavamzadeh, M.; Mannor, S.; Pineau, J.; Tamar, A.; et al. 2015. Bayesian reinforcement learning: A survey. Foundations and Trends® in Machine Learning 8(5-6):359–483.
  • [Guez et al.2014] Guez, A.; Heess, N.; Silver, D.; and Dayan, P. 2014. Bayes-adaptive simulation-based search with value function approximation. In Advances in Neural Information Processing Systems.
  • [Guez, Silver, and Dayan2012] Guez, A.; Silver, D.; and Dayan, P. 2012. Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search. In Advances in Neural Information Processing Systems.
  • [Guilliard et al.2018] Guilliard, I.; Rogahn, R. J.; Piavis, J.; and Kolobov, A. 2018. Autonomous thermalling as a partially observable markov decision process. In Robotics: Science and Systems.
  • [Hsu, Rong, and Lee2008] Hsu, D.; Rong, N.; and Lee, W. S. 2008. What makes some pomdp problems easy to approximate? In Advances in Neural Information Processing Systems.
  • [Javdani, Srinivasa, and Bagnell2015] Javdani, S.; Srinivasa, S.; and Bagnell, J. 2015. Shared autonomy via hindsight optimization. In Robotics: Science and Systems.
  • [Kaelbling, Littman, and Cassandra1998] Kaelbling, L. P.; Littman, M. L.; and Cassandra, A. R. 1998. Planning and acting in partially observable stochastic domains. Artificial intelligence 101(1-2):99–134.
  • [Kakade, Kearns, and Langford2003] Kakade, S.; Kearns, M. J.; and Langford, J. 2003. Exploration in metric state spaces. In International Conference on Machine Learning.
  • [Kakade2003] Kakade, S. M. 2003. On the sample complexity of reinforcement learning. Ph.D. Dissertation, University College London (University of London).
  • [Kearns and Singh2002] Kearns, M., and Singh, S. 2002. Near-optimal reinforcement learning in polynomial time. Machine learning 49(2-3):209–232.
  • [Kearns, Mansour, and Ng2002] Kearns, M.; Mansour, Y.; and Ng, A. Y. 2002. A sparse sampling algorithm for near-optimal planning in large markov decision processes. Machine learning 49(2-3):193–208.
  • [Kolter and Ng2009] Kolter, J. Z., and Ng, A. Y. 2009. Near-bayesian exploration in polynomial time. In International Conference on Machine Learning.
  • [Kurniawati, Hsu, and Lee2008] Kurniawati, H.; Hsu, D.; and Lee, W. S. 2008. Sarsop: Efficient point-based pomdp planning by approximating optimally reachable belief spaces. In Robotics: Science and Systems.
  • [Li, Littlefield, and Bekris2016] Li, Y.; Littlefield, Z.; and Bekris, K. E. 2016. Asymptotically optimal sampling-based kinodynamic planning. The International Journal of Robotics Research 35(5):528–564.
  • [Li2009] Li, L. 2009. A unifying framework for computational reinforcement learning theory. Ph.D. Dissertation, Rutgers University-Graduate School-New Brunswick.
  • [Littman, Cassandra, and Kaelbling1995] Littman, M. L.; Cassandra, A. R.; and Kaelbling, L. P. 1995. Learning policies for partially observable environments: Scaling up. In Machine Learning Proceedings. 362–370.
  • [Madani, Hanks, and Condon1999] Madani, O.; Hanks, S.; and Condon, A. 1999. On the undecidability of probabilistic planning and infinite-horizon partially observable markov decision problems. In AAAI Conference on Artificial Intelligence.
  • [Ong et al.2010] Ong, S. C.; Png, S. W.; Hsu, D.; and Lee, W. S. 2010. Planning under uncertainty for robotic tasks with mixed observability. The International Journal of Robotics Research 29(8):1053–1068.
  • [Papadimitriou and Tsitsiklis1987] Papadimitriou, C. H., and Tsitsiklis, J. N. 1987. The complexity of markov decision processes. Mathematics of operations research 12(3):441–450.
  • [Pazis and Parr2013] Pazis, J., and Parr, R. 2013. Pac optimal exploration in continuous space markov decision processes. In AAAI Conference on Artificial Intelligence.
  • [Pazis and Parr2016] Pazis, J., and Parr, R. 2016. Efficient pac-optimal exploration in concurrent, continuous state mdps with delayed updates. In AAAI Conference on Artificial Intelligence.
  • [Pineau, Gordon, and Thrun2003] Pineau, J.; Gordon, G.; and Thrun, S. 2003. Point-based value iteration: An anytime algorithm for pomdps. In International Joint Conference on Artificial Intelligence.
  • [Platt Jr et al.2010] Platt Jr, R.; Tedrake, R.; Kaelbling, L.; and Lozano-Perez, T. 2010. Belief space planning assuming maximum likelihood observations. In Robotics: Science and Systems.
  • [Poupart et al.2006] Poupart, P.; Vlassis, N.; Hoey, J.; and Regan, K. 2006. An analytic solution to discrete bayesian reinforcement learning. In International Conference on Machine Learning.
  • [Silver and Veness2010] Silver, D., and Veness, J. 2010. Monte-carlo planning in large pomdps. In Advances in Neural Information Processing Systems.
  • [Smith and Simmons2005] Smith, T., and Simmons, R. 2005. Point-based pomdp algorithms: Improved analysis and implementation. In UAI.
  • [Somani et al.2013] Somani, A.; Ye, N.; Hsu, D.; and Lee, W. S. 2013. Despot: Online pomdp planning with regularization. In Advances in Neural Information Processing Systems.
  • [Sondik1978] Sondik, E. J. 1978. The optimal control of partially observable markov processes over the infinite horizon: Discounted costs. Operations research 26(2):282–304.
  • [Spaan and Vlassis2005] Spaan, M. T., and Vlassis, N. 2005. Perseus: Randomized point-based value iteration for pomdps. Journal of Artificial Intelligence Research 24:195–220.
  • [Strehl and Littman2008] Strehl, A. L., and Littman, M. L. 2008. Online linear regression and its application to model-based reinforcement learning. In Advances in Neural Information Processing Systems.
  • [Strehl et al.2006] Strehl, A. L.; Li, L.; Wiewiora, E.; Langford, J.; and Littman, M. L. 2006. Pac model-free reinforcement learning. In International Conference on Machine Learning.
  • [Strehl, Li, and Littman2009] Strehl, A. L.; Li, L.; and Littman, M. L. 2009. Reinforcement learning in finite mdps: Pac analysis. Journal of Machine Learning Research 10(Nov):2413–2444.
  • [Strens2000] Strens, M. 2000. A bayesian framework for reinforcement learning. In International Conference on Machine Learning.
  • [Sunberg and Kochenderfer2017] Sunberg, Z., and Kochenderfer, M. J. 2017. Online algorithms for pomdps with continuous state, action, and observation spaces. preprint arXiv:1709.06196.
  • [Wang et al.2005] Wang, T.; Lizotte, D.; Bowling, M.; and Schuurmans, D. 2005. Bayesian sparse sampling for on-line reward optimization. In International Conference on Machine Learning.
  • [Wang et al.2012] Wang, Y.; Won, K. S.; Hsu, D.; and Lee, W. S. 2012. Monte carlo bayesian reinforcement learning. In International Conference on Machine Learning.

7 Supplementary Material

Proof of Lemma 3.1

The proof has three key components. First, we show that the reward and transition functions are Lipschitz continuous. Second, we show that the Q value is Lipschitz continuous across tuples that differ only in belief. Finally, we combine these results to show that the Q value is Lipschitz continuous in state-belief-action space. For notational simplicity, let .
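As an illustration of why this property is useful, a Lipschitz-continuous Q value lets a finite set of samples bound the Q value at any nearby query point. The sketch below is a simplified stand-in and not the Bayes-CPACE estimator itself: it assumes each state-belief-action tuple has been flattened into a vector, uses the Euclidean norm in place of the paper's distance metric, and the names `lipschitz_upper_bound`, `lipschitz`, and `q_max` are hypothetical.

```python
import numpy as np


def lipschitz_upper_bound(query, samples, q_values, lipschitz, q_max):
    """Tightest upper bound on Q(query) implied by the sampled Q values.

    If |Q(x) - Q(y)| <= lipschitz * d(x, y), then for every sample x_i
    Q(query) <= Q(x_i) + lipschitz * d(query, x_i). The bound is capped
    at q_max, the largest value Q can take.
    """
    dists = np.linalg.norm(samples - query, axis=1)  # assumed metric
    return float(min(q_max, np.min(q_values + lipschitz * dists)))
```

Queries close to a sample inherit a tight bound, while queries far from every sample fall back to `q_max`, which is what makes unexplored regions look attractive to an optimistic learner.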

Lipschitz continuity for reward and transition functions

We begin by showing that the reward as a function of the state-belief-action is Lipschitz continuous. For any two tuples and , the following is true:


where we have used Assumption 3.1 for the 4th inequality.

Similarly, the state transition as a function of the state-belief-action can also be shown to be Lipschitz continuous:


where we have used Assumption 3.1 for the 4th inequality.

Lipschitz continuity for fixed state-action Q value

We’ll use the following inequality. For two positive bounded functions and ,


First let’s assume the following is true:


We will derive the value of (if it exists) by expanding the expression for the action value function.

Let be the deterministic belief update. We have the following:


where we have used (6), (7), (8), (9), and Assumption 3.2 for the 2nd, 3rd, 4th, 5th and last inequalities, respectively.

Applying the above inequality to (9), we can solve for :


We can now use the Lipschitz constant from (11) in (9):


Lipschitz continuous Q value

We can now show that the Q value is Lipschitz continuous in state-belief-action space. For any two tuples and satisfying Assumption 3.1 and Assumption 3.2, the following is true:


where the 2nd inequality follows from taking similar steps as in (10), the 3rd inequality follows from (12) and Assumption 3.2.

Now, with , define the distance metric in state-belief-action space to be the following:


From (13) and (14), we can derive the following:

Proof of Corollary 3.1

Supp. Lemma 7.1 (Lemma 56 in [Li2009]).

Let be a sequence of independent Bernoulli trials, each with a success probability at least , for some constant Then for any and , with probability at least , if .

After at most non-overlapping trajectories of length , happens for at least times with probability at least . Then, from Lemma 3.3, all reachable state-actions will have become known, making . Setting , we can have at most steps in which .
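The role of Supp. Lemma 7.1 in this argument can be checked numerically. The sketch below assumes the bound takes the form m ≥ (2/μ)(k + ln(1/δ)), where μ is the minimum success probability, k is the required number of successes, and δ is the failure probability. That form and these symbol names are our recollection of the cited lemma, not something recoverable from the text above, so treat them as assumptions. The helper names are hypothetical.

```python
import math

import numpy as np


def trials_needed(mu, k, delta):
    """Assumed sufficient number of trials: (2 / mu) * (k + ln(1 / delta))."""
    return math.ceil((2.0 / mu) * (k + math.log(1.0 / delta)))


def empirical_success_rate(mu, k, delta, runs, rng):
    """Fraction of simulated runs that see at least k successes."""
    m = trials_needed(mu, k, delta)
    # Each run is m independent Bernoulli(mu) trials, the worst case
    # allowed by "success probability at least mu".
    successes = rng.binomial(m, mu, size=runs)
    return float(np.mean(successes >= k))
```

In simulation, the observed fraction should be at least 1 - δ. In the proof, a "success" corresponds to a trajectory in which the escape event occurs, so the same count bounds the number of trajectories needed before all reachable state-actions become known.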

Proof of Corollary 3.2

This follows from Lemmas 3.13 and 3.14 of [Pazis and Parr2013] to get , and applying our Lemma 3.3.

Proof of Theorem 3.1

Supp. Lemma 7.2 (Lemma 2 in [Kearns and Singh2002]).