1 Introduction
Addressing uncertainty is critical for robots that interact with the real world. Often, with good engineering and experience, we can obtain reasonable regimes for uncertainty, specifically model uncertainty, and prepare offline for various contingencies. However, we must still predict, refine, and act online. Thus, in this paper we focus on uncertainty over a set of scenarios, which requires the agent to balance exploration (uncertainty reduction) and exploitation (prior knowledge).
We can naturally express this objective as a Bayes-Adaptive Markov Decision Process (BAMDP) [Kolter and Ng2009], which incorporates Bayesian belief updates into long-term expected return. The BAMDP framework formalizes the notion of uncertainty over multiple latent MDPs. This has widespread applications in navigation [Guilliard et al.2018], manipulation [Chen et al.2016], and shared autonomy [Javdani, Srinivasa, and Bagnell2015].
Although BAMDPs provide an elegant problem formulation for model uncertainty, Probably Approximately Correct (henceforth PAC) algorithms for continuous state and action space BAMDPs have been less explored, limiting possible applications in many robotics problems. In the discrete domain, there exist some efficient online, PAC optimal approaches [Kolter and Ng2009, Chen et al.2016] and approximate Monte-Carlo algorithms [Guez, Silver, and Dayan2012], but it is not straightforward to extend this line of work to the continuous domain. State-of-the-art approximation-based approaches for belief space planning in continuous spaces [Sunberg and Kochenderfer2017, Guez et al.2014] do not provide PAC optimality.
In this work, we present the first PAC optimal algorithm for BAMDPs in continuous state and action spaces, to the best of our knowledge. The key challenge for PAC optimal exploration in continuous BAMDPs is that the same state will not be visited twice, which often renders Monte-Carlo approaches computationally prohibitive, as discussed in [Sunberg and Kochenderfer2017]. However, if the value function satisfies certain smoothness properties, i.e. Lipschitz continuity, we can efficiently “cover” the reachable belief space. In other words, we leverage the following property:
A set of representative samples is sufficient to approximate a Lipschitz continuous value function of the reachable continuous state-belief-action space.
Our algorithm, Bayes-CPACE (Figure 1), maintains an approximate value function based on a set of visited samples, with bounded optimism in the approximation from Lipschitz continuity. At each timestep, it greedily selects an action that maximizes the value function. If the action lies in an underexplored region of state-belief-action space, the visited sample is added to the set of samples and the value function is updated. Our algorithm adopts C-PACE [Pazis and Parr2013], a PAC optimal algorithm for continuous MDPs, as our engine for exploring belief space.
We make the following contributions:

We present a PAC optimal algorithm for continuous BAMDPs (Section 3).

We prove that Lipschitz continuity of latent MDP reward and transition functions is a sufficient condition for Lipschitz continuity of the BAMDP value function (Lemma 3.1).

Through experiments, we show that Bayes-CPACE has competitive performance against state-of-the-art algorithms in discrete BAMDPs and promising performance in continuous BAMDPs (Section 4).
2 Preliminaries
In this section, we review the Bayes-Adaptive Markov Decision Process (BAMDP) framework. A BAMDP is a belief MDP with hidden latent variables that govern the reward and transition functions. The task is to compute an optimal policy that maps state and belief over the latent variables to actions. Since computing an exact optimal policy is intractable [Kurniawati, Hsu, and Lee2008], we state a more achievable property of an algorithm being Probably Approximately Correct. We review related work that addresses this problem, and contrast this objective with other formulations.
Bayes-Adaptive Markov Decision Process
The BAMDP framework assumes that a latent variable $\phi$ governs the reward and transition functions of the underlying Markov Decision Process [Ghavamzadeh et al.2015, Guez, Silver, and Dayan2012, Chen et al.2016]. A BAMDP is defined by a tuple $\langle X, A, \Omega, P_0, R, \gamma \rangle$, where $X = S \times \Phi$ is the set of hyper-states (state $s \in S$, latent variable $\phi \in \Phi$), $A$ is the set of actions, $\Omega(s, \phi, a, s', \phi') = P(s', \phi' \mid s, \phi, a)$ is the transition function, $P_0$ is the initial distribution over hyper-states, $R(s, \phi, a)$ represents the reward obtained when action $a$ is taken in hyper-state $(s, \phi)$, and $\gamma$ is the discount factor.
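In this paper, we allow the spaces to be continuous, but limit the set of latent variables $\Phi$ to be finite. (Footnote 1: For simplicity of exposition, our notation assumes that the spaces are discrete. For the continuous case, all corresponding probabilities are replaced by probability density functions and all summation operators are replaced by integrals.) For simplicity, we assume that the latent variable is constant throughout an episode. (Footnote 2: It is straightforward to extend this to a deterministically changing latent variable or to incorporate an observation model. This requires augmenting the observation into the state definition and computing the belief evolution appropriately. This model is derived in [Chen et al.2016].)
We now introduce the notion of a Bayes estimator $\tau$. Since the latent variable $\phi$ is unknown, the agent maintains a belief distribution $b \in B$, where $B$ is a $|\Phi|$-dimensional probability simplex. The agent uses the Bayes estimator $b' = \tau(s, b, a, s')$ to update its current belief $b$ upon taking an action $a$ from state $s$ and transitioning to a state $s'$:
$$b'(\phi') = \eta \sum_{\phi \in \Phi} b(\phi)\, P(s', \phi' \mid s, \phi, a),$$
where $\eta$ is the normalizing constant.
We reformulate the BAMDP as a belief MDP. We consider the pair $(s, b)$ to be the state of this MDP. The transition function is as follows:
$$P(s', b' \mid s, b, a) = \sum_{\phi} b(\phi) \sum_{\phi'} P(s', \phi' \mid s, \phi, a)\, P(b' \mid s, b, a, s'),$$
where $P(b' \mid s, b, a, s') = 1$ for the belief $b'$ computed by the Bayes estimator and zero everywhere else. The reward function is defined as $R(s, b, a) = \sum_{\phi} b(\phi)\, R(s, \phi, a)$.
A policy $\pi$ maps the pair $(s, b)$ to an action $a$. The value of a policy $\pi$ is given by
$$V^{\pi}(s, b) = R(s, b, a) + \gamma \sum_{s'} P(s' \mid s, b, a)\, V^{\pi}(s', b'),$$
where $a = \pi(s, b)$, $b' = \tau(s, b, a, s')$, and $P(s' \mid s, b, a) = \sum_{\phi} b(\phi) \sum_{\phi'} P(s', \phi' \mid s, \phi, a)$. The optimal Bayesian value function $V^*(s, b)$ satisfies the Bellman optimality equation
$$V^*(s, b) = \max_{a \in A} \Big[ R(s, b, a) + \gamma \sum_{s'} P(s' \mid s, b, a)\, V^*(s', b') \Big].$$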
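As a minimal sketch of the Bayes estimator for a finite set of latent variables that stay constant within an episode, the following Python snippet reweights the current belief by the likelihood of the observed transition under each latent MDP and renormalizes. The function name `bayes_update` and the example likelihood values are illustrative assumptions, not taken from the paper.

```python
from typing import List, Sequence


def bayes_update(belief: Sequence[float], likelihoods: Sequence[float]) -> List[float]:
    """Posterior over latent variables after one observed transition.

    belief[i]      -- prior probability of latent variable i
    likelihoods[i] -- P(s' | s, latent i, a) for the observed (s, a, s')
    """
    if len(belief) != len(likelihoods):
        raise ValueError("belief and likelihoods must have the same length")
    unnormalized = [b * l for b, l in zip(belief, likelihoods)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observed transition has zero probability under the belief")
    # Dividing by the total plays the role of the normalizing constant.
    return [u / total for u in unnormalized]


# Hypothetical example: two latent MDPs, uniform prior, and a transition
# that is three times more likely under the first latent MDP.
posterior = bayes_update([0.5, 0.5], [0.9, 0.3])
```

Because the latent variable is assumed constant, the sum over previous latent values collapses to a single term per latent variable, which is why the update is a simple elementwise product.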
We now characterize what it means to efficiently explore the reachable continuous state-belief-action space. We extend the definition of sample complexity from [Kakade2003] to BAMDPs.
Definition 2.1 (Sample Complexity).
Let $\mathcal{A}$ be a learning algorithm and $\mathcal{A}_t$ be its policy at timestep $t$. The sample complexity of the algorithm is the number of timesteps $t$ such that $V^{\mathcal{A}_t}(s_t, b_t) < V^*(s_t, b_t) - \epsilon$.
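To make the definition concrete, the sketch below counts the timesteps at which a policy's value falls more than a tolerance below the optimal value. The function name, the argument names, and the numbers in the example are hypothetical; in practice neither value sequence is directly observable, so this is only an illustration of what is being counted.

```python
def sample_complexity(policy_values, optimal_values, epsilon):
    """Number of timesteps whose policy value is more than epsilon below optimal."""
    return sum(
        1
        for value, optimal in zip(policy_values, optimal_values)
        if value < optimal - epsilon
    )


# Hypothetical value trace of a learner that improves over four timesteps.
count = sample_complexity([0.2, 0.8, 0.95, 1.0], [1.0, 1.0, 1.0, 1.0], epsilon=0.1)
```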
In order to define PAC optimal exploration in continuous spaces, we need the notion of the covering number of the reachable belief space.
Definition 2.2 (Covering Number).
An $\epsilon$-cover of the reachable state-belief-action space is a set $C$ of state-belief-action tuples such that for any reachable query $(s, b, a)$, there exists a sample $(s', b', a') \in C$ such that $d\big((s, b, a), (s', b', a')\big) \le \epsilon$ under a distance metric $d$. We define the covering number $N(\epsilon)$ to be the size of the largest minimal cover, i.e. the largest $C$ which will not remain a cover if any sample is removed.
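The following sketch builds one minimal cover greedily from a finite list of query points: a point is added only if no existing cover element is within the covering radius. The Euclidean metric and the function name `greedy_cover` are assumptions made for illustration. Since every retained point is farther than the radius from all other retained points, removing any of them leaves that point uncovered, so the result is a minimal cover and its size is a lower bound on the covering number as defined above.

```python
import math


def greedy_cover(points, epsilon):
    """Return a subset of `points` such that every point is within epsilon of it."""
    cover = []
    for point in points:
        # Keep the point only if it is not already covered.
        if all(math.dist(point, center) > epsilon for center in cover):
            cover.append(point)
    return cover


# Hypothetical one-dimensional queries.
cover = greedy_cover([(0.0,), (0.05,), (0.3,), (0.32,), (1.0,)], epsilon=0.1)
```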
Using this definition, we now formalize the notion of PAC optimal exploration for BAMDPs.
Definition 2.3 (PAC-Bayes).
A BAMDP algorithm is called PAC-Bayes if, given any $\epsilon > 0$ and $0 < \delta < 1$, its sample complexity is polynomial in the relevant quantities $\big(N(\epsilon), 1/\epsilon, 1/\delta, 1/(1-\gamma)\big)$, with probability at least $1 - \delta$.
Comparison of PAC-Bayes vs PAC-Bayes-MDP
We shed some light on the important distinction between the concept of PAC-Bayes on a BAMDP (which we analyze) and the more commonly referenced PAC-Bayes on an MDP.
The concept of PAC-Bayes on an MDP with unknown transition and reward functions was first introduced by an online Bayesian exploration algorithm [Kolter and Ng2009], which is often referred to as BEB (Bayesian Exploration Bonus) for the reward bonus term it introduces. At timestep $t$, the algorithm forms a BAMDP using the uncertainty over the reward and transition functions of the single MDP being explored at that time. It is assumed that, even when the episode terminates and the problem resets, the same MDP continues to be explored using the knowledge gathered thus far. The problem addressed is different from ours; Bayes-CPACE produces a policy which is Bayes-optimal with respect to the uncertainty over multiple latent MDPs. We assume that a different latent MDP may be assigned upon reset.
POMDP-lite [Chen et al.2016] extends BEB’s concept of PAC-Bayes to a BAMDP over multiple latent MDPs. Crucially, however, the latent variable in this case cannot reset during the learning phase. The authors allude to this as a “one-shot game … (which) remains unchanged.” In other words, POMDP-lite is an online algorithm which is near-Bayes-optimal only for the current episode, and it does not translate to a BAMDP where a repeated game occurs.
Related Work
While planning in belief space offers a systematic way to deal with uncertainty [Sondik1978, Kaelbling, Littman, and Cassandra1998], it is very hard to solve in general. For a finite horizon problem, finding the optimal policy over the entire belief space is PSPACE-complete [Papadimitriou and Tsitsiklis1987]. For an infinite horizon problem, the problem is undecidable [Madani, Hanks, and Condon1999]. Intuitively, the intractability comes from the number of states in the belief MDP growing exponentially with $|\Phi|$. Point-based algorithms that sample the belief space have seen success in approximately solving POMDPs [Pineau, Gordon, and Thrun2003, Smith and Simmons2005]. Analysis by [Hsu, Rong, and Lee2008] shows that the success can be attributed to the ability to “cover” the optimally reachable belief space.
Offline BAMDP approaches compute a policy a priori for any reachable state and belief. When $S$ is discrete, this is a MOMDP [Ong et al.2010], and can be solved efficiently by representing the augmented belief space with samples and using a point-based solver such as SARSOP [Kurniawati, Hsu, and Lee2008]. A similar approach is used by the BEETLE algorithm [Poupart et al.2006, Spaan and Vlassis2005]. [Bai, Hsu, and Lee2014] present an offline continuous state and observation POMDP solver, which implies it can solve a BAMDP. However, their approach uses a policy graph where nodes are actions, which makes it difficult to extend to continuous actions.
While offline approaches enjoy good performance, they are computationally expensive. Online approaches circumvent this by starting from the current belief and searching forward. The key is to do sparse sampling [Kearns, Mansour, and Ng2002] to prevent exponential tree growth. [Wang et al.2005] apply Thompson sampling. BAMCP [Guez, Silver, and Dayan2012] applies Monte-Carlo tree search in belief space [Silver and Veness2010]. DESPOT [Somani et al.2013] improves on this by using lower bounds and determinized sampling techniques. Recently, [Sunberg and Kochenderfer2017] presented an online algorithm, POMCPOW, for continuous states, actions, and observations, which can be applied to BAMDP problems. Of course, online and offline approaches can be combined, e.g. by using the offline policy as a default rollout policy.
The aforementioned approaches aim for asymptotic guarantees. On the other hand, PAC-MDP [Strehl, Li, and Littman2009] approaches seek to bound the number of exploration steps before achieving near-optimal performance. This was originally formulated in the context of discrete MDPs with unknown transition and reward functions [Brafman and Tennenholtz2002, Strehl et al.2006] and extended to continuous spaces [Kakade, Kearns, and Langford2003, Pazis and Parr2013]. BOSS [Asmuth et al.2009] first introduced the notion of uncertainty over model parameters, albeit for a PAC-MDP style guarantee. The PAC-Bayes property for an MDP was formally introduced in [Kolter and Ng2009], as discussed in the previous subsection.
There are several effective heuristic-based approaches [Dearden, Friedman, and Russell1998, Strens2000] to BAMDPs that we omit for brevity. We refer the reader to [Ghavamzadeh et al.2015] for a comprehensive survey. We also compare with QMDP [Littman, Cassandra, and Kaelbling1995], which approximates the expected Q-value with respect to the current belief and greedily chooses an action.
Algorithm | Continuous State/Action | PAC | Offline
SARSOP [Kurniawati, Hsu, and Lee2008] | - | - | ✓
POMDP-lite [Chen et al.2016] | - | ✓ | -
POMCPOW [Sunberg and Kochenderfer2017] | ✓ | - | -
Bayes-CPACE (Us) | ✓ | ✓ | ✓
Table 1 compares the key features of Bayes-CPACE against a selection of prior work.
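The QMDP baseline described above can be sketched in a few lines: weight each latent MDP's Q-values by the current belief and pick the action with the highest expectation. The function name `qmdp_action` and the Q-value table in the example are illustrative assumptions; the per-latent Q-values are taken as given, for instance from solving each latent MDP separately.

```python
def qmdp_action(belief, q_values):
    """Greedy action under belief-weighted Q-values.

    belief[i]      -- probability of latent MDP i
    q_values[i][a] -- Q-value of action a in latent MDP i at the current state
    """
    n_actions = len(q_values[0])
    expected = [
        sum(b * q[a] for b, q in zip(belief, q_values)) for a in range(n_actions)
    ]
    best = max(range(n_actions), key=lambda a: expected[a])
    return best, expected


# Hypothetical Q-values for two latent MDPs and three actions.
best_action, expected_q = qmdp_action(
    [0.5, 0.5], [[10.0, -100.0, -1.0], [-100.0, 10.0, -1.0]]
)
```

Because the expectation is taken over Q-values that assume the latent variable is known, this rule attaches no value to information gathering, which is the main contrast with Bayes-optimal exploration.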
3 Bayes-CPACE: Continuous PAC Optimal Exploration in Belief Space
In this section, we present Bayes-CPACE, an offline PAC-Bayes algorithm that computes a near-optimal policy for a continuous state and action BAMDP. Bayes-CPACE is an extension of C-PACE [Pazis and Parr2013], a PAC optimal algorithm for continuous state and action MDPs. Efficient exploration of a continuous space is challenging because the same state-action pair cannot be visited more than once. C-PACE addresses this by assuming that the state-action value function is Lipschitz continuous, allowing the value of a state-action pair to be approximated with nearby samples. Similar to other PAC optimal algorithms [Strehl, Li, and Littman2009], C-PACE applies the principle of optimism in the face of uncertainty: the value of a state-action pair is approximated by averaging the value of nearby samples, inflated proportionally to their distances. Intuitively, this distance-dependent bonus term encourages exploration of regions that are far from previous samples until the optimistic estimate results in a near-optimal policy.
Our key insight is that C-PACE can be extended from continuous states to those augmented with finite-dimensional belief states. We derive sufficient conditions for Lipschitz continuity of the belief value function. We show that Bayes-CPACE is indeed PAC-Bayes and bound the sample complexity as a function of the covering number of the belief space reachable from the initial belief $b_0$. In addition, we also present and analyze three practical strategies for improving the sample complexity and runtime of Bayes-CPACE.
Definitions and Assumptions
We assume all rewards lie in $[0, R_{\max}]$, which implies $0 \le Q_{\max}, V_{\max} \le \frac{R_{\max}}{1-\gamma}$. We will first show that Assumption 3.1 and Assumption 3.2 are sufficient conditions for Lipschitz continuity of the value function. (Footnote 3: For all proofs, refer to supplementary material.) Subsequent proofs do not depend on these assumptions as long as the value function is Lipschitz continuous.
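Assumption 3.1 (Lipschitz Continuous Reward and Transition Functions).
Given any two state-action pairs $(s_1, a_1)$ and $(s_2, a_2)$, there exists a distance metric $d(\cdot)$ and Lipschitz constants $L_R, L_P$ such that the following is true for every latent variable $\phi$:
$$|R(s_1, \phi, a_1) - R(s_2, \phi, a_2)| \le L_R\, d_{s_1 a_1, s_2 a_2},$$
$$\sum_{s'} |P(s' \mid s_1, \phi, a_1) - P(s' \mid s_2, \phi, a_2)| \le L_P\, d_{s_1 a_1, s_2 a_2},$$
where $d_{s_1 a_1, s_2 a_2} = d\big((s_1, a_1), (s_2, a_2)\big)$.
Assumption 3.2 (Belief Contraction).
Given any two belief vectors $b_1, b_2$ and any tuple $(s, a, s')$, the updated beliefs from the Bayes estimator, $b_1' = \tau(s, b_1, a, s')$ and $b_2' = \tau(s, b_2, a, s')$, satisfy the following:
$$\|b_1' - b_2'\|_1 \le \|b_1 - b_2\|_1.$$
Lemma 3.1 (Lipschitz Continuous Value Function).
Given any two state-belief-action tuples $(s_1, b_1, a_1)$ and $(s_2, b_2, a_2)$, there exists a distance metric $d(\cdot)$ and a Lipschitz constant $L_Q$ such that the following is true:
$$|Q^*(s_1, b_1, a_1) - Q^*(s_2, b_2, a_2)| \le L_Q\, d_{s_1 b_1 a_1, s_2 b_2 a_2},$$
where $d_{s_1 b_1 a_1, s_2 b_2 a_2} = d\big((s_1, b_1, a_1), (s_2, b_2, a_2)\big)$.
The distance metric for state-belief-action tuples is a linear combination of the distance metric for state-action pairs used in Assumption 3.1 and the $L_1$ norm for belief,
$$d_{s_1 b_1 a_1, s_2 b_2 a_2} = \alpha\, d_{s_1 a_1, s_2 a_2} + \|b_1 - b_2\|_1,$$
for an appropriate choice of $\alpha$, which is a function of the problem constants.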
Bayes-CPACE builds an optimistic estimator for the value function using nearest neighbor function approximation from a collected sample set. Since the value function is Lipschitz continuous, the value for any query can be estimated by extrapolating the value of neighboring samples with a distance-dependent bonus. If the number of close neighbors is sufficiently large, the query is said to be “known” and the estimate can be bounded. Otherwise, the query is unknown and is added to the sample set. Once enough samples are added, the entire reachable space will be known and the estimate will be bounded with respect to the true optimal value function $Q^*$. We define these terms more formally below.
Definition 3.1 (Known Query).
Let be the Lipschitz constant of the optimistic estimator. A statebeliefaction query is said to be "known" if its nearest neighbor in the sample set is within .
We are now ready to define the estimator.
Definition 3.2 (Optimistic Value Estimate).
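Assume we have a set of samples $C$, where every element is a tuple $(s_i, b_i, a_i, r_i, s_i', b_i')$: starting from $(s_i, b_i)$, the agent took an action $a_i$, received a reward $r_i$, and transitioned to $(s_i', b_i')$. Given a state-belief-action query $(s, b, a)$, its $i$-th nearest neighbor from the sample set provides an optimistic estimate
$$x_i = L_Q\, d\big((s, b, a), (s_i, b_i, a_i)\big) + \tilde{Q}(s_i, b_i, a_i), \tag{1}$$
where $L_Q$ is the Lipschitz constant of the estimator. The value is the average of all the nearest neighbor estimates
$$\tilde{Q}(s, b, a) = \frac{1}{k} \sum_{i=1}^{k} \min(x_i, Q_{\max}), \tag{2}$$
where $Q_{\max}$ is the upper bound of the estimate and $k$ is the number of neighbors. If there are fewer than $k$ neighbors, $Q_{\max}$ can be used in place of the corresponding $x_i$.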
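Note that the estimator is a recursive function. Given a sample set $C$, value iteration is performed to compute the estimate for each of the sample points $(s_i, b_i, a_i, r_i, s_i', b_i')$,
$$\tilde{Q}(s_i, b_i, a_i) = r_i + \gamma \max_{a'} \tilde{Q}(s_i', b_i', a'), \tag{3}$$
where $\tilde{Q}(s_i', b_i', a')$ is approximated via (2) using its nearby samples. This estimate must be updated every time a new sample is added to the set.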
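The optimistic estimate and the value iteration over samples can be sketched as follows. This is a simplified illustration under several assumptions that are not from the paper: state-belief-action tuples are flattened into numeric vectors, actions are encoded as numbers, the distance is Euclidean, and all function and parameter names are hypothetical.

```python
import math


def optimistic_q(query, keys, q_stored, k, lipschitz, q_max):
    """Average of the k nearest optimistic estimates, capped at q_max.

    keys[i]     -- flattened (s, b, a) vector of the i-th stored sample
    q_stored[i] -- current value estimate stored for that sample
    """
    order = sorted(range(len(keys)), key=lambda i: math.dist(query, keys[i]))[:k]
    estimates = [
        min(lipschitz * math.dist(query, keys[i]) + q_stored[i], q_max) for i in order
    ]
    # Missing neighbors default to the upper bound, which keeps the estimate optimistic.
    estimates += [q_max] * (k - len(estimates))
    return sum(estimates) / k


def value_iteration(keys, rewards, next_sb, actions, k, lipschitz, q_max, gamma, sweeps=100):
    """Repeated Bellman backups over the stored samples."""
    q_stored = [q_max] * len(keys)  # optimistic initialization
    for _ in range(sweeps):
        q_stored = [
            r + gamma * max(
                optimistic_q(sb + (a,), keys, q_stored, k, lipschitz, q_max)
                for a in actions
            )
            for r, sb in zip(rewards, next_sb)
        ]
    return q_stored


# Hypothetical single self-looping sample: state 0.0, action 0.0, reward 0.5.
q = value_iteration(
    keys=[(0.0, 0.0)], rewards=[0.5], next_sb=[(0.0,)], actions=[0.0],
    k=1, lipschitz=1.0, q_max=2.0, gamma=0.5,
)
```

In this toy case the backup reduces to q = 0.5 + 0.5 q, so the stored value decreases from the optimistic initial value toward 1.0.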
We introduce two additional techniques that leverage the Q-values of the underlying latent MDPs to improve the sample complexity of Bayes-CPACE.
Definition 3.3 (BestCase Upper Bound).
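We can replace the constant $Q_{\max}$ in Definition 3.2 with $Q_{\max}(s, b, a)$ computed as follows:
$$Q_{\max}(s, b, a) = \max_{\phi :\, b(\phi) > 0} Q^*_{\phi}(s, a),$$
where $Q^*_{\phi}$ is the optimal Q-function of the latent MDP $\phi$.
In general, any admissible heuristic $U$ that satisfies $Q^*(s, b, a) \le U(s, b, a) \le Q_{\max}$ can be used. In practice, the Best-Case Upper Bound reduces exploration of actions which are suboptimal in all latent MDPs with nonzero probability.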
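A minimal sketch of the best-case upper bound: take the largest latent-MDP Q-value among the latent variables that the belief still supports. The function name and the numbers are hypothetical, and the per-latent optimal Q-values are assumed to be available.

```python
def best_case_upper_bound(belief, latent_q):
    """Largest latent Q-value among latent MDPs with nonzero belief.

    belief[i]   -- probability of latent MDP i
    latent_q[i] -- optimal Q-value of the queried (s, a) in latent MDP i
    """
    supported = [q for b, q in zip(belief, latent_q) if b > 0.0]
    if not supported:
        raise ValueError("belief assigns zero probability to every latent MDP")
    return max(supported)


# The third latent MDP has the largest Q-value but zero probability,
# so it does not contribute to the bound.
bound = best_case_upper_bound([0.7, 0.3, 0.0], [1.0, 2.0, 9.0])
```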
We can also take advantage of the exact Q-values of the latent MDPs whenever the belief distribution collapses. These exact values can be used to seed the initial estimates.
Definition 3.4 (Known Latent Initialization).
Let be the belief distribution where
, i.e. a onehot encoding. If there exists
such that , then we can use the following estimate:(4) 
This extends Definition 3.1 for a known query to include any statebeliefaction tuple where the belief is within of a onehot vector.
We refer to Proposition 3.1 for how this reduces sample complexity.
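The check behind known latent initialization can be sketched as a search for a one-hot vector close to the current belief. The use of the L1 distance, the function name `collapsed_latent`, and the `tolerance` parameter are assumptions for illustration; the specific threshold used in the definition is not reproduced here.

```python
def collapsed_latent(belief, tolerance):
    """Index of the latent variable whose one-hot vector is within `tolerance`
    (in L1 distance) of `belief`, or None if the belief has not collapsed."""
    for i in range(len(belief)):
        l1 = sum(abs(b - (1.0 if j == i else 0.0)) for j, b in enumerate(belief))
        if l1 <= tolerance:
            return i
    return None


# Hypothetical beliefs over two latent MDPs.
nearly_certain = collapsed_latent([0.98, 0.02], tolerance=0.05)
still_uncertain = collapsed_latent([0.6, 0.4], tolerance=0.05)
```

When an index is returned, the exact Q-value of that latent MDP can stand in for the nearest-neighbor estimate at the query.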
Algorithm
We describe our algorithm, Bayes-CPACE, in Algorithm 1. To summarize, at every timestep the algorithm computes a greedy action using its current value estimate, receives a reward, and transitions to a new state-belief pair (Lines 9–11). If the sample is not known, it is added to the sample set (Line 13). The value estimates for all samples are updated until the fixed point is reached (Line 14). The terminal condition is met when no more samples are added and value iteration has converged for a sufficient number of iterations. The algorithm invokes a subroutine for computing the estimated value function (Lines 17–27), which corresponds to the operations described in Definitions 3.2, 3.3, and 3.4.
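The exploration loop summarized above can be sketched as follows. This is a simplified illustration rather than a transcription of Algorithm 1: the callables `step`, `q_estimate`, and `refit`, the Euclidean distance, the numeric action encoding, and the rule that a query is known when its k-th nearest stored neighbor lies within `radius` are all assumptions.

```python
import math


def kth_neighbor_distance(query, keys, k):
    """Distance from `query` to its k-th nearest stored key (infinite if fewer than k)."""
    dists = sorted(math.dist(query, key) for key in keys)
    return dists[k - 1] if len(dists) >= k else math.inf


def explore_episode(initial_sb, actions, step, q_estimate, refit, k, radius, horizon):
    """Greedy exploration that stores a sample whenever the query is not known.

    step(sb, a)       -> (reward, next_sb), with the belief already updated
    q_estimate(sb, a) -> current optimistic value of taking a in sb
    refit(samples)    -> recompute stored value estimates after a new sample
    """
    samples = []  # (sb, a, reward, next_sb) tuples
    keys = []     # flattened (s, b, a) vectors used for neighbor lookups
    sb = initial_sb
    for _ in range(horizon):
        action = max(actions, key=lambda a: q_estimate(sb, a))  # greedy action
        reward, next_sb = step(sb, action)
        query = sb + (action,)
        if kth_neighbor_distance(query, keys, k) > radius:      # query is unknown
            keys.append(query)
            samples.append((sb, action, reward, next_sb))
            refit(samples)
        sb = next_sb
    return samples


# Hypothetical toy dynamics: the state advances by 0.3 per step and the belief is fixed.
collected = explore_episode(
    (0.0, 0.5), [0.0, 1.0],
    step=lambda sb, a: (0.0, (sb[0] + 0.3, sb[1])),
    q_estimate=lambda sb, a: a,
    refit=lambda samples: None,
    k=1, radius=0.2, horizon=4,
)
```

Passing the estimator and the refit step as callables keeps the loop independent of how the value function is represented, so a nearest-neighbor estimator with value iteration can be plugged in.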
Analysis of Sample Complexity
We now prove that Bayes-CPACE is PAC-Bayes. Since we adopt the proof of C-PACE, we only state the main steps and defer the full proof to supplementary material. We begin with the concept of a known belief MDP.
Definition 3.5 (Known Belief MDP).
Let be the original belief MDP. Let be the set of all known statebeliefaction tuples. We define a known belief MDP that is identical to on (i.e. identical transition and reward functions) and for all other statebeliefaction tuples, it transitions deterministically with a reward to an absorbing state with zero reward.
We can then bound the performance of a policy on with its performance on and the maximum penalty incurred by escaping it.
Lemma 3.2 (Generalized Induced Inequality, Lemma 8 in [Strehl and Littman2008]).
We are given the original belief MDP , the known belief MDP , a policy and time horizon . Let be the probability of an escape event, i.e. the probability of sampling a statebeliefaction tuple that is not in when executing on from for steps. Let be the value of executing policy on . Then the following is true:
We now show one of two things can happen: either the greedy policy escapes from the known MDP, or it remains in it and performs near optimally. We first show that it can only escape a certain number of times before the entire reachable space is known.
Lemma 3.3 (Full Coverage of Known Space, Lemma 4.5 in [Kakade, Kearns, and Langford2003]).
All reachable statebeliefaction queries will become known after adding at most samples to .
Corollary 3.1 (Bounded Escape Probability).
At a given timestep, let . Then with probability , this can happen at most for timesteps.
We now show that when inside the known MDP, the greedy policy will be near optimal.
Lemma 3.4 (Near-optimality of Approximate Greedy, Theorem 3.12 of [Pazis and Parr2013]).
Let be an estimate of the value function that has bounded Bellman error , where is the Bellman operator. Let be the greedy policy on . Then the policy is nearoptimal:
Let be the approximation error caused by using a finite number of neighbors in (2) instead of the Bellman operator. Then Lemma 3.4 leads to the following corollary.
Corollary 3.2 (Near-optimality on Known Belief MDP).
If , i.e. the number of neighbors is large enough, then using Hoeffding’s inequality we can show . Then on the known belief MDP , the following can be shown with probability :
We now put together these ideas to state the main theorem.
Theorem 3.1 (Bayes-CPACE is PAC-Bayes).
Let be a belief MDP. At timestep , let be the greedy policy on , and let be the statebelief pair. With probability at least , , i.e. the algorithm is close to the optimal policy for all but
steps when is used for the number of neighbors in (2).
Proof (sketch).
At time , we can form a known belief MDP from the samples collected so far. Either the policy leads to an escape event within the next steps or the agent stays within . Such an escape can happen at most times with high probability; when the escape probability is low, is optimal. ∎
Analysis of Performance Enhancements
We can initialize estimates with exact Q-values for the latent MDPs. This makes the known space larger, thus reducing the covering number.
Proposition 3.1 (Known Latent Initialization).
Let be the covering number of the reduced space . Then the sample complexity reduces by a factor of .
It is also unnecessary to perform value iteration until convergence.
Proposition 3.2 (Approximate Value Iteration).
One practical enhancement is to collect new samples in a batch with a fixed policy before performing value iteration. This requires two changes to the algorithm: 1) an additional loop to repeat (Lines 8–14) times, and 2) perform (Line 14) outside of the loop. This increases the sample complexity by a constant factor but has empirically reduced runtime by only performing value iteration when a large change is expected.
Proposition 3.3 (Batch Sample Update).
Suppose we collect new samples from rollouts with the greedy policy at time before performing value iteration. This increases the sample complexity only by a constant factor of .
4 Experimental Results
We compare Bayes-CPACE with QMDP, POMDP-lite, and SARSOP for discrete BAMDPs and with QMDP for continuous BAMDPs. For discrete state spaces, we evaluate Bayes-CPACE on two widely used synthetic examples, Tiger [Kaelbling, Littman, and Cassandra1998] and Chain [Strens2000]. For both Bayes-CPACE and POMDP-lite, the parameters were tuned offline for best performance. For continuous state spaces, we evaluate on a variant of the Light-Dark problem [Platt Jr et al.2010].
While our analysis is applicable to BAMDPs with continuous state and action spaces, an algorithm that only approximates the greedy selection of an action is not guaranteed to be PAC-Bayes. Thus, we limit our continuous BAMDP experiments to discrete action spaces and leave the continuous action case for future work.
Tiger: We start with the Tiger problem. The agent stands in front of two closed doors and can choose one of three actions: listen, open the left door, or open the right door. One of the doors conceals a tiger; opening this door results in a penalty of 100, while the other results in a reward of 10. Listening informs the agent of the correct location of the tiger with probability , with a cost of 1. As observed by [Chen et al.2016], this POMDP problem can be cast as a BAMDP problem with two latent MDPs.
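To illustrate how Tiger maps onto a BAMDP with two latent MDPs, the sketch below encodes the rewards stated above and computes the belief-weighted reward of each action. The dictionary layout, the latent and action labels, and the function name are illustrative assumptions.

```python
# Rewards per latent MDP (tiger location) and action, as described in the text.
TIGER_REWARD = {
    "tiger_left": {"open_left": -100.0, "open_right": 10.0, "listen": -1.0},
    "tiger_right": {"open_left": 10.0, "open_right": -100.0, "listen": -1.0},
}


def expected_reward(belief, action):
    """Belief-weighted immediate reward of `action`.

    belief maps each latent label to its probability.
    """
    return sum(p * TIGER_REWARD[latent][action] for latent, p in belief.items())


uniform = {"tiger_left": 0.5, "tiger_right": 0.5}
confident = {"tiger_left": 0.05, "tiger_right": 0.95}
```

Under the uniform belief, opening either door has an expected reward of -45 while listening costs 1, so gathering information is clearly preferable. Once the belief in one location is high enough, opening the opposite door becomes the better immediate choice.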
Table 1(c) shows that Bayes-CPACE is competitive with SARSOP and performs better than QMDP and POMDP-lite. This is not surprising since both Bayes-CPACE and SARSOP are offline solvers.
Figure 1(a) visualizes the estimated values. Because Bayes-CPACE explores greedily, exploration is focused on actions with high estimated value, either due to optimism from under-exploration or actual high value. As a result, suboptimal actions are not taken once Bayes-CPACE is confident that they have lower value than other actions. Because fewer samples have been observed for these suboptimal actions, their approximated values are not tight. Note also that the original problem explores a much smaller subset of the belief space, so for this visualization we have randomly initialized the initial belief rather than always initializing it to 0.5, forcing Bayes-CPACE to perform additional exploration.
Chain: The Chain problem consists of five states $\{1, \dots, 5\}$ and two actions $\{A, B\}$. Taking action $A$ in state $i < 5$ transitions to state $i+1$ with no reward; taking action $A$ in state $5$ transitions to state $5$ with a reward of 10. Action $B$ transitions from any state to state $1$ with a reward of 2. However, these actions are noisy: in the canonical version of Chain, the opposite action is taken with slip probability 0.2. In our variant, we allow the slip probability to be selected uniformly at random from a set of three values at the beginning of each episode. These three latent MDPs form a BAMDP. Table 1(c) shows that Bayes-CPACE outperforms other algorithms.
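The Chain dynamics can be sketched as a small transition function. The state labels 1 to 5, the action labels 'A' and 'B', and the function names are assumptions made for illustration; the slip probability is left as a parameter because it is the latent variable of the BAMDP.

```python
import random

N_STATES = 5


def chain_step(state, action, slipped):
    """One Chain transition; returns (next_state, reward).

    'A' advances along the chain and 'B' resets to the first state.
    If `slipped` is True, the opposite action is executed instead.
    """
    if slipped:
        action = "B" if action == "A" else "A"
    if action == "A":
        if state == N_STATES:
            return N_STATES, 10.0  # staying at the end of the chain pays 10
        return state + 1, 0.0      # advancing pays nothing
    return 1, 2.0                  # resetting pays 2


def sample_step(state, action, slip_probability, rng):
    """Sample a noisy transition under the given slip probability."""
    return chain_step(state, action, rng.random() < slip_probability)


# Hypothetical rollout of one step with the canonical slip probability.
next_state, reward = sample_step(1, "A", 0.2, random.Random(0))
```

Separating the deterministic transition from the slip sampling makes it straightforward to evaluate the transition likelihood under each candidate slip probability when updating the belief.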
Light-Dark Tiger: We consider a variant of the Light-Dark problem, which we call Light-Dark Tiger (Figure 1(b)). In this problem, one of the two goal corners (top-right or bottom-right) contains a tiger. The agent receives a penalty of 100 if it enters the goal corner containing the tiger and a reward of 10 if it enters the other region. There are four actions (Up, Down, Left, Right), each of which moves the agent one unit with Gaussian noise. The tiger location is unknown to the agent until the left wall is reached. As in the original Tiger problem, this POMDP can be formulated as a BAMDP with two latent MDPs.
We consider two cases, one with zero noise and another with nonzero noise. With zero noise, the problem is a discrete POMDP and the optimal solution is deterministic; the agent hits the left wall and goes straight to the goal location. When there is noise, the agent may not reach the left wall in the first step. Paths executed by Bayes-CPACE still take Left until the left wall is hit and then go to the goal (Figure 1(b)).
5 Discussion
We have presented the first PAC-Bayes algorithm for continuous BAMDPs whose value functions are Lipschitz continuous. While the practical implementation of Bayes-CPACE is limited to discrete actions, our analysis holds for both continuous and discrete states and actions. We believe that our analysis provides an important insight for the development of PAC efficient algorithms for continuous BAMDPs.
The BAMDP formulation is useful for real-world robotics problems where uncertainty over latent models is expected at test time. An efficient policy search algorithm must incorporate prior knowledge over the latent MDPs to take advantage of this formulation. As a step in this direction, we have introduced several techniques that utilize the value functions of underlying latent MDPs without affecting PAC optimality.
One of the key assumptions Bayes-CPACE makes is that the cardinality of the latent state space is finite. This may not be true in many robotics applications in which latent variables are drawn from continuous distributions. In such cases, the true BAMDP can be approximated by sampling a set of latent variables, as introduced in [Wang et al.2012]. In future work, we will investigate methods to select representative MDPs and to bound the gap between the optimal value function of the true BAMDP and the approximated one.
Although it is beyond the scope of this paper, we would like to make two remarks. First, Bayes-CPACE can easily be extended to allow parallel exploration, similar to how [Pazis and Parr2016] extended the original C-PACE to concurrently explore multiple MDPs. Second, since we have generative models for the latent MDPs, we may enforce exploration from arbitrary belief points. Of course, the key to efficient exploration of belief space lies in exploring just beyond the optimally reachable belief space, so “random” initialization is unlikely to be helpful. However, if we can approximate this space similarly to sampling-based kinodynamic planning algorithms [Li, Littlefield, and Bekris2016], this may lead to more structured search in belief space.
6 Acknowledgements
This work was partially funded by Kwanjeong Educational Foundation, NASA Space Technology Research Fellowships (NSTRF), the National Institutes of Health R01 (#R01EB019335), National Science Foundation CPS (#1544797), National Science Foundation NRI (#1637748), the Office of Naval Research, the RCTA, Amazon, and Honda.
References

[Asmuth et al.2009] Asmuth, J.; Li, L.; Littman, M. L.; Nouri, A.; and Wingate, D. 2009. A bayesian sampling approach to exploration in reinforcement learning. In Conference on Uncertainty in Artificial Intelligence.
[Bai, Hsu, and Lee2014] Bai, H.; Hsu, D.; and Lee, W. S. 2014. Integrated perception and planning in the continuous space: A pomdp approach. The International Journal of Robotics Research 33(9).
[Brafman and Tennenholtz2002] Brafman, R. I., and Tennenholtz, M. 2002. R-max - A general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research 3:213–231.
[Chen et al.2016] Chen, M.; Frazzoli, E.; Hsu, D.; and Lee, W. S. 2016. POMDP-lite for Robust Robot Planning under Uncertainty. In IEEE International Conference on Robotics and Automation.
 [Dearden, Friedman, and Russell1998] Dearden, R.; Friedman, N.; and Russell, S. 1998. Bayesian qlearning. In AAAI Conference on Artificial Intelligence.
 [Ghavamzadeh et al.2015] Ghavamzadeh, M.; Mannor, S.; Pineau, J.; Tamar, A.; et al. 2015. Bayesian reinforcement learning: A survey. Foundations and Trends® in Machine Learning 8(56):359–483.
 [Guez et al.2014] Guez, A.; Heess, N.; Silver, D.; and Dayan, P. 2014. Bayesadaptive simulationbased search with value function approximation. In Advances in Neural Information Processing Systems.
 [Guez, Silver, and Dayan2012] Guez, A.; Silver, D.; and Dayan, P. 2012. Efficient BayesAdaptive Reinforcement Learning using SampleBased Search. In Advances in Neural Information Processing Systems.
 [Guilliard et al.2018] Guilliard, I.; Rogahn, R. J.; Piavis, J.; and Kolobov, A. 2018. Autonomous thermalling as a partially observable markov decision process. In Robotics: Science and Systems.
 [Hsu, Rong, and Lee2008] Hsu, D.; Rong, N.; and Lee, W. S. 2008. What makes some pomdp problems easy to approximate? In Advances in Neural Information Processing Systems.
 [Javdani, Srinivasa, and Bagnell2015] Javdani, S.; Srinivasa, S.; and Bagnell, J. 2015. Shared autonomy via hindsight optimization. In Robotics: Science and Systems.
 [Kaelbling, Littman, and Cassandra1998] Kaelbling, L. P.; Littman, M. L.; and Cassandra, A. R. 1998. Planning and acting in partially observable stochastic domains. Artificial intelligence 101(12):99–134.
 [Kakade, Kearns, and Langford2003] Kakade, S.; Kearns, M. J.; and Langford, J. 2003. Exploration in metric state spaces. In International Conference on Machine Learning.
 [Kakade2003] Kakade, S. M. 2003. On the sample complexity of reinforcement learning. Ph.D. Dissertation, University College London (University of London).
 [Kearns and Singh2002] Kearns, M., and Singh, S. 2002. Near-optimal reinforcement learning in polynomial time. Machine learning 49(2–3):209–232.
 [Kearns, Mansour, and Ng2002] Kearns, M.; Mansour, Y.; and Ng, A. Y. 2002. A sparse sampling algorithm for near-optimal planning in large markov decision processes. Machine learning 49(2–3):193–208.
 [Kolter and Ng2009] Kolter, J. Z., and Ng, A. Y. 2009. Near-bayesian exploration in polynomial time. In International Conference on Machine Learning.
 [Kurniawati, Hsu, and Lee2008] Kurniawati, H.; Hsu, D.; and Lee, W. S. 2008. Sarsop: Efficient point-based pomdp planning by approximating optimally reachable belief spaces. In Robotics: Science and Systems.
 [Li, Littlefield, and Bekris2016] Li, Y.; Littlefield, Z.; and Bekris, K. E. 2016. Asymptotically optimal sampling-based kinodynamic planning. The International Journal of Robotics Research 35(5):528–564.
 [Li2009] Li, L. 2009. A unifying framework for computational reinforcement learning theory. Ph.D. Dissertation, Rutgers University-Graduate School-New Brunswick.
 [Littman, Cassandra, and Kaelbling1995] Littman, M. L.; Cassandra, A. R.; and Kaelbling, L. P. 1995. Learning policies for partially observable environments: Scaling up. In Machine Learning Proceedings. 362–370.
 [Madani, Hanks, and Condon1999] Madani, O.; Hanks, S.; and Condon, A. 1999. On the undecidability of probabilistic planning and infinite-horizon partially observable markov decision problems. In AAAI Conference on Artificial Intelligence.
 [Ong et al.2010] Ong, S. C.; Png, S. W.; Hsu, D.; and Lee, W. S. 2010. Planning under uncertainty for robotic tasks with mixed observability. The International Journal of Robotics Research 29(8):1053–1068.
 [Papadimitriou and Tsitsiklis1987] Papadimitriou, C. H., and Tsitsiklis, J. N. 1987. The complexity of markov decision processes. Mathematics of operations research 12(3):441–450.
 [Pazis and Parr2013] Pazis, J., and Parr, R. 2013. Pac optimal exploration in continuous space markov decision processes. In AAAI Conference on Artificial Intelligence.
 [Pazis and Parr2016] Pazis, J., and Parr, R. 2016. Efficient pac-optimal exploration in concurrent, continuous state mdps with delayed updates. In AAAI Conference on Artificial Intelligence.
 [Pineau, Gordon, and Thrun2003] Pineau, J.; Gordon, G.; and Thrun, S. 2003. Point-based value iteration: An anytime algorithm for pomdps. In International Joint Conference on Artificial Intelligence.
 [Platt Jr et al.2010] Platt Jr, R.; Tedrake, R.; Kaelbling, L.; and Lozano-Perez, T. 2010. Belief space planning assuming maximum likelihood observations. In Robotics: Science and Systems.
 [Poupart et al.2006] Poupart, P.; Vlassis, N.; Hoey, J.; and Regan, K. 2006. An analytic solution to discrete bayesian reinforcement learning. In International Conference on Machine Learning.
 [Silver and Veness2010] Silver, D., and Veness, J. 2010. Monte-carlo planning in large pomdps. In Advances in Neural Information Processing Systems.
 [Smith and Simmons2005] Smith, T., and Simmons, R. 2005. Point-based pomdp algorithms: Improved analysis and implementation. In UAI.
 [Somani et al.2013] Somani, A.; Ye, N.; Hsu, D.; and Lee, W. S. 2013. Despot: Online pomdp planning with regularization. In Advances in Neural Information Processing Systems.
 [Sondik1978] Sondik, E. J. 1978. The optimal control of partially observable markov processes over the infinite horizon: Discounted costs. Operations research 26(2):282–304.
 [Spaan and Vlassis2005] Spaan, M. T., and Vlassis, N. 2005. Perseus: Randomized point-based value iteration for pomdps. Journal of Artificial Intelligence Research 24:195–220.
 [Strehl and Littman2008] Strehl, A. L., and Littman, M. L. 2008. Online linear regression and its application to model-based reinforcement learning. In Advances in Neural Information Processing Systems.
 [Strehl et al.2006] Strehl, A. L.; Li, L.; Wiewiora, E.; Langford, J.; and Littman, M. L. 2006. Pac model-free reinforcement learning. In International Conference on Machine Learning.
 [Strehl, Li, and Littman2009] Strehl, A. L.; Li, L.; and Littman, M. L. 2009. Reinforcement learning in finite mdps: Pac analysis. Journal of Machine Learning Research 10(Nov):2413–2444.
 [Strens2000] Strens, M. 2000. A bayesian framework for reinforcement learning. In International Conference on Machine Learning.
 [Sunberg and Kochenderfer2017] Sunberg, Z., and Kochenderfer, M. J. 2017. Online algorithms for pomdps with continuous state, action, and observation spaces. preprint arXiv:1709.06196.
 [Wang et al.2005] Wang, T.; Lizotte, D.; Bowling, M.; and Schuurmans, D. 2005. Bayesian sparse sampling for online reward optimization. In International Conference on Machine Learning.
 [Wang et al.2012] Wang, Y.; Won, K. S.; Hsu, D.; and Lee, W. S. 2012. Monte carlo bayesian reinforcement learning. In International Conference on Machine Learning.
7 Supplementary Material
Proof of Lemma 3.1
The proof has a few key components. Firstly, we show that the reward and transition functions are Lipschitz continuous. Secondly, we show that Q values that differ only in belief are Lipschitz continuous. Finally, we put these together to show that the Q value in state-belief-action space is Lipschitz continuous. For notational simplicity, let .
Lipschitz continuity for reward and transition functions
We begin by showing that the reward as a function of the state-belief-action is Lipschitz continuous. For any two tuples and , the following is true:
(6)  
where we have used Assumption 3.1 for the 4th inequality.
Similarly, the state transition as a function of the state-belief-action can also be shown to be Lipschitz continuous:
(7)  
where we have used Assumption 3.1 for the 4th inequality.
Lipschitz continuity for fixed state-action Q value
We’ll use the following inequality. For two positive bounded functions and ,
(8) 
First let’s assume the following is true:
(9) 
We will derive the value of (if it exists) by expanding the expression for the action value function.
Let be the deterministic belief update. We have the following:
(10)  
where we have used (6), (7), (8), (9), and Assumption 3.2 for the 2nd, 3rd, 4th, 5th and last inequalities, respectively.
Applying the above inequality to (9), we can solve for :
(11)  
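The self-consistency argument in (9)-(11) can be summarized generically. The sketch below uses placeholder symbols that are assumptions for illustration, not the notation of the main text: $L_B$ for the sought Lipschitz constant in belief, $d_B$ for a metric on beliefs, $c_0 \ge 0$ for the terms collected from the reward and transition bounds, and $\rho \ge 0$ for the factor that multiplies $L_B$ after one expansion of the action value function.

```latex
% Hypothesis (cf. (9)):
|Q(s,b,a) - Q(s,b',a)| \le L_B \, d_B(b,b')

% One expansion of the action value function (cf. (10)) gives a bound of the form
|Q(s,b,a) - Q(s,b',a)| \le (c_0 + \rho L_B)\, d_B(b,b')

% The hypothesis is self-consistent whenever
c_0 + \rho L_B \le L_B
\quad\Longleftrightarrow\quad
L_B \ge \frac{c_0}{1 - \rho},
\qquad \text{which requires } \rho < 1 .
```

Equivalently, iterating $L_{k+1} = c_0 + \rho L_k$ from $L_0 = 0$ converges to $c_0 / (1 - \rho)$ only when $\rho < 1$, which is why the constant is qualified with "if it exists".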
Lipschitz continuous Q value
We can now show that the Q value is Lipschitz continuous in state-belief-action space. For any two tuples and satisfying Assumption 3.1 and Assumption 3.2, the following is true:
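A sketch of how the pieces can combine, under assumed notation: $L_{SA}$ is a hypothetical constant bounding the change in Q when only the state-action pair changes, $L_B$ bounds the change when only the belief changes, and $d_{SA}$, $d_B$ are the corresponding metrics. Inserting the intermediate tuple $(s', b, a')$ and applying the triangle inequality gives a bound of this shape:

```latex
\begin{align*}
|Q(s,b,a) - Q(s',b',a')|
  &\le |Q(s,b,a) - Q(s',b,a')| + |Q(s',b,a') - Q(s',b',a')| \\
  &\le L_{SA}\, d_{SA}\big((s,a),(s',a')\big) + L_B\, d_B(b,b') .
\end{align*}
```

If the joint metric on tuples dominates both component distances, the right-hand side is at most $(L_{SA} + L_B)$ times the joint distance.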
Proof of Corollary 3.1
Supp. Lemma 7.1 (Lemma 56 in [Li2009]).
Let be a sequence of independent Bernoulli trials, each with a success probability at least , for some constant Then for any and , with probability at least , if .
After at most non-overlapping trajectories of length , happens at least times with probability at least . Then, from Lemma 3.3, all reachable state-actions will have become known, making . Setting , we can have at most steps in which .
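The role of the Bernoulli-trial lemma can be checked numerically. The following is a minimal sketch, assuming the sample-size condition takes the form m ≥ (2/μ)(k + ln(1/δ)), where μ is the lower bound on the success probability, k the required number of successes, and δ the allowed failure probability. This form and the names `required_trials` and `prob_fewer_than_k` are assumptions made for illustration; consult [Li2009] for the exact statement. The code computes the exact binomial probability of seeing fewer than k successes and compares it with δ.

```python
import math


def required_trials(mu: float, k: int, delta: float) -> int:
    """Smallest integer m with m >= (2 / mu) * (k + ln(1 / delta)).

    This is the assumed form of the sample-size condition, not a quoted one.
    """
    return math.ceil((2.0 / mu) * (k + math.log(1.0 / delta)))


def prob_fewer_than_k(m: int, mu: float, k: int) -> float:
    """Exact P(X < k) for X ~ Binomial(m, mu), summed term by term."""
    return sum(
        math.comb(m, i) * mu**i * (1.0 - mu) ** (m - i)
        for i in range(k)
    )


if __name__ == "__main__":
    # Each row: (success probability, required successes, failure budget).
    for mu, k, delta in [(0.2, 10, 0.1), (0.05, 3, 0.01), (0.5, 25, 0.05)]:
        m = required_trials(mu, k, delta)
        failure = prob_fewer_than_k(m, mu, k)
        print(f"mu={mu}, k={k}, delta={delta}: m={m}, P(X<k)={failure:.3e}")
```

In the worst case, where every trial succeeds with probability exactly μ, the computed failure probability stays below δ, which is the property the corollary relies on when counting trajectories.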
Proof of Corollary 3.2
This follows by using Lemmas 3.13 and 3.14 of [Pazis and Parr2013] to get , and then applying our Lemma 3.3.
Proof of Theorem 3.1
Supp. Lemma 7.2 (Lemma 2 in [Kearns and Singh2002]).
If