I Introduction
Imperfect robot control, sensor noise, and unexpected environment changes all contribute to uncertainty and pose significant challenges for robust robot planning. Robots must explore in order to gain information and reduce uncertainty; at the same time, they must exploit that information to achieve task objectives. The partially observable Markov decision process (POMDP) [8, 25] provides a principled general framework for balancing exploration and exploitation optimally. It has found application in many robotic tasks, ranging from navigation [20] and manipulation [7, 12] to human-robot interaction [14]. However, solving POMDPs exactly is computationally intractable in the worst case [16]. While there has been rapid progress on efficient approximate POMDP algorithms in recent years (e.g., [23, 13, 22, 24, 21]), it remains a challenge to scale up to very large POMDPs with complex dynamics.
The complexity of a POMDP lies in the system dynamics, partial observability, and, particularly, the confluence of the two. We introduce POMDP-lite, a factored model that restricts partial observability to state variables that are constant or change deterministically. While this may appear restrictive, POMDP-lite is expressive enough to model a variety of interesting robotic tasks:

Unknown goals. An assistive agent helps a human cook one of several dishes, without knowing the human's intention in advance [6].

These tasks all require the robot to gather information about the unknown quantities from noisy observations while achieving the task objective at the same time. They in fact belong to a special case, in which the hidden variables remain constant throughout. We mainly focus on this special case here.
Interestingly, the famous Tiger problem, which appeared in a seminal paper on POMDPs [8], also belongs to this special case after a small modification. In Tiger, an agent stands in front of two closed doors. A tiger is behind one of the doors. The agent's objective is to open the door without the tiger. In the POMDP model, the state is the unknown tiger position. The agent has three actions: open the left door (OL), open the right door (OR), and listen (LS). OL and OR produce no observation. LS produces a noisy observation, tiger-left (TL) or tiger-right (TR), each correct with a fixed probability. Listening incurs a small cost. If the agent opens the door with no tiger behind it, it receives a reward; otherwise, it incurs a large penalty. To perform well, the agent must decide on the optimal number of listening actions before taking an open action. While Tiger is a toy problem, it captures the essence of robust planning under uncertainty: trading off gathering information against exploiting that information to achieve the task objective. The original Tiger is a repeated game: once the agent opens a door, the game resets, with the tiger placed behind either door with equal probability. We change it into a one-shot game: the game terminates once the agent opens a door. The one-shot game has a single state variable, the tiger position, which remains unchanged during the game, and thus admits a POMDP-lite model. The repeated game is a POMDP, but not a POMDP-lite.

A POMDP-lite is equivalent to a set of Markov decision processes (MDPs) indexed by a hidden parameter. The key idea behind the equivalence transformation is to combine a POMDP state and an observation to form an expanded MDP state, and to capture both POMDP state-transition uncertainty and observation uncertainty in the MDP transition dynamics. In the one-shot Tiger example, we form two MDPs indexed by the tiger position, left (L) or right (R) (Fig. 1). An MDP state is a pair consisting of a POMDP state and an observation. For example, in the MDP with the tiger on the left, the state (L, TL) represents that the true tiger position is L and the agent receives the observation TL. If the agent takes the action LS from this state, then with the probability of an incorrect observation, it transits to the new state (L, TR), i.e., it receives observation TR. See Section III for details of the general construction.
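To make the construction concrete, the following sketch builds the transition tables of the two one-shot Tiger MDPs. The observation accuracy value (0.85) is an assumed placeholder, since the exact value is elided in the text; the state and action encodings are illustrative, not the paper's implementation.

```python
# Sketch of the one-shot Tiger equivalence transformation (Section III).
# An expanded MDP state is (tiger_position, last_observation); observation
# uncertainty is folded into the MDP transition function.

OBS_ACC = 0.85  # assumed probability that LS reports the correct side

def tiger_mdp_transitions(theta):
    """Build the transition table of the MDP indexed by theta in {'L', 'R'}.

    Returns a dict mapping (state, action) -> list of (next_state, prob).
    States are (theta, obs) pairs; 'none' means no observation yet.
    'DONE' is a terminal state reached by opening either door.
    """
    correct = 'TL' if theta == 'L' else 'TR'
    wrong = 'TR' if theta == 'L' else 'TL'
    trans = {}
    for obs in ('none', 'TL', 'TR'):
        s = (theta, obs)
        # Listening keeps the tiger in place but yields a noisy observation.
        trans[(s, 'LS')] = [((theta, correct), OBS_ACC),
                            ((theta, wrong), 1 - OBS_ACC)]
        # Opening a door ends the one-shot game.
        trans[(s, 'OL')] = [('DONE', 1.0)]
        trans[(s, 'OR')] = [('DONE', 1.0)]
    return trans

mdp_left = tiger_mdp_transitions('L')
# From ('L', 'TL'), listening reaches ('L', 'TR') with probability 1 - OBS_ACC.
```

The same helper called with `'R'` produces the second MDP of the collection.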
The equivalence enables us to develop an online algorithm for POMDP-lite through model-based Bayesian reinforcement learning (RL). If the hidden parameter value were known, our problem would simply become an MDP, which has well-established algorithms. To gather information on the unknown hidden parameter, the robot must explore. It maintains a belief, i.e., a probability distribution over the hidden parameter, and follows the internal reward approach to model-based Bayesian RL [11, 26], which modifies the MDP reward function in order to encourage exploration. At each time step, the online algorithm solves an internal reward MDP to choose an action and then updates the belief to incorporate the new observation received. Our algorithm is simple to implement. It performs well on large-scale POMDP-lite tasks and outperforms state-of-the-art general-purpose POMDP algorithms. Furthermore, it is near-Bayesian-optimal under suitable conditions.

II Related Work
POMDP planning has a huge literature (see, e.g., [8, 25, 23, 13, 22, 24, 21]). Our brief review focuses on online search algorithms. At each time step, an online algorithm performs a lookahead search and computes a best action for the current belief only [19]. After the robot executes the action, the algorithm updates the belief based on the observation received. The process then repeats at the new belief for the next time step. Online search algorithms scale up by focusing on the current belief only, rather than all possible beliefs that the robot may encounter. Further, since online algorithms recompute a best action from scratch at each step, they naturally handle unexpected environment changes without additional overhead. POMCP [22] and DESPOT [24] are among the fastest online POMDP algorithms available today. Both employ the idea of sampling future contingencies. POMCP performs Monte Carlo tree search (MCTS). It has low overhead and scales up to very large POMDPs, but it has extremely poor worst-case performance, because MCTS is sometimes overly greedy. DESPOT samples a fixed number of future contingencies deterministically in advance and performs heuristic search on the resulting search tree. This substantially improves the worst-case performance bound. It is also more flexible and easily incorporates domain knowledge. DESPOT has been successfully implemented for real-time autonomous driving in a crowd [2]. It is also a crucial component of a system that won the Humanitarian Robotics and Automation Technology Challenge (HRATC) 2015 on a demining task.

Instead of solving the general POMDP, we take a different approach and identify a structural property that enables simpler and more efficient algorithms through model-based Bayesian RL. Like POMDP-lite, the mixed observability Markov decision process (MOMDP) [15] is also a factored model. However, it places no restriction on partially observable state variables. It is in fact equivalent to the general POMDP, as every POMDP can be represented as a MOMDP and vice versa. The hidden goal Markov decision process (HGMDP) [6] and the hidden parameter Markov decision process (HiP-MDP) [5] are related to POMDP-lite. They both restrict partial observability to static hidden variables. The work on HGMDP relies on a myopic heuristic for planning and is unlikely to perform well on tasks that require exploration. The work on HiP-MDP focuses mainly on learning the hidden structure from data.
There are several approaches to Bayesian RL [1, 17, 27, 11]. The internal reward approach is among the most successful: it is simple and performs well in practice. Internal reward methods can be further divided into two main categories, PAC-MDP and Bayesian-optimal. PAC-MDP algorithms are near-optimal with respect to the true MDP [9, 27, 26]. They provide strong theoretical guarantees, but may over-explore empirically. Bayesian-optimal algorithms are near-optimal with respect to the optimal Bayesian policy; they simply try to achieve high expected total reward. In particular, the Bayesian Exploration Bonus (BEB) [11] algorithm achieves lower sample complexity than the PAC-MDP algorithms. However, BEB requires a Dirichlet prior on the hidden parameters. Our algorithm is inspired by BEB, but constructs the exploration bonus differently, allowing an arbitrary discrete prior, a very useful feature in practice.
III POMDP-lite
III-A Definition
POMDP-lite is a special class of POMDP with a "deterministic assumption" on its partially observable variables: each partially observable variable is either static or has deterministic dynamics. Formally, we define a POMDP-lite as a tuple $(X, \Theta, A, O, T, Z, R, \gamma)$, where $X$ is a set of fully observable states; $\theta \in \Theta$ is the hidden parameter, which has a finite number of possible values, $\Theta = \{\theta_1, \theta_2, \ldots, \theta_N\}$; and the state space is the cross product of the fully observable states and the hidden parameter, $S = X \times \Theta$. $A$ is a set of actions, and $O$ is a set of observations. The transition function $T(s, a, s') = p(s' \mid s, a)$ specifies the probability of reaching state $s' = (x', \theta')$ when the agent takes action $a$ in state $s = (x, \theta)$, where $\theta' = \theta$ or $\theta'$ is a deterministic function of $(s, a)$, according to the "deterministic assumption". The observation function $Z(s', a, o) = p(o \mid s', a)$ specifies the probability of receiving observation $o$ after taking action $a$ and reaching state $s'$. The reward function $R(s, a)$ specifies the reward received when the agent takes action $a$ in state $s$. Finally, $\gamma \in (0, 1)$ is the discount factor.
In a POMDP-lite, the state is not fully known and the agent maintains a belief $b$, which is a probability distribution over the states. At each step, the agent takes an action $a$ and receives a new observation $o$; the belief is then updated according to Bayes' rule,
$$b'(s') = \eta \, Z(s', a, o) \sum_{s \in S} T(s, a, s') \, b(s),$$
where $\eta$ is a normalizing constant. The solution to a POMDP-lite is a policy $\pi$ that maps belief states to actions, i.e., $a = \pi(b)$. The value of a policy $\pi$ is the expected total discounted reward with respect to the initial belief $b_0$:
$$V_\pi(b_0) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\Big|\, b_0, \pi\Big],$$
where $s_t$ and $a_t$ denote the state and action at time $t$. An optimal policy $\pi^*$ has the highest value in all belief states, i.e., $V_{\pi^*}(b) = \max_\pi V_\pi(b)$, and the corresponding optimal value function $V^*$ satisfies Bellman's equation:
$$V^*(b) = \max_{a \in A} \Big\{ R(b, a) + \gamma \sum_{o \in O} p(o \mid b, a) \, V^*(b') \Big\}.$$
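The Bayes-rule belief update over the hidden parameter can be sketched in a few lines. The function name and the Tiger-style usage, with an assumed observation accuracy of 0.85, are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the Bayes-rule belief update over the hidden parameter.
# The caller supplies, for each value of theta, the likelihood of the
# observed outcome under that theta.

def belief_update(belief, likelihood):
    """Update a discrete belief over theta given per-theta likelihoods.

    belief: dict theta -> prior probability.
    likelihood: dict theta -> p(observed outcome | theta).
    Returns the normalized posterior (Bayes' rule).
    """
    posterior = {th: belief[th] * likelihood[th] for th in belief}
    norm = sum(posterior.values())
    if norm == 0.0:
        raise ValueError("observation has zero probability under all theta")
    return {th: p / norm for th, p in posterior.items()}

# One-shot Tiger: uniform prior over the tiger position, then hear 'TL'
# once, with an assumed observation accuracy of 0.85.
b0 = {'L': 0.5, 'R': 0.5}
b1 = belief_update(b0, {'L': 0.85, 'R': 0.15})
```

A second consistent 'TL' observation would push the belief further toward 'L', mirroring the repeated-listening behavior discussed for Tiger.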
III-B Equivalent Transformation to a Set of MDPs
In this section, we show an important property of the POMDP-lite model: it is equivalent to a collection of MDPs indexed by the hidden parameter $\theta$. An MDP model with parameter $\theta$ is a tuple $M_\theta = (S_\theta, A, T_\theta, R_\theta, \gamma)$, where $S_\theta$ is a set of states, $A$ is a set of actions, $T_\theta$ is the transition function, $R_\theta$ is the reward function, and $\gamma$ is the discount factor.
Theorem 1
Let $P$ be a POMDP-lite model with hidden parameter $\theta \in \Theta = \{\theta_1, \ldots, \theta_N\}$. Then $P$ is equivalent to the collection of MDPs indexed by $\theta$, $\{M_\theta \mid \theta \in \Theta\}$.
[Proof of Theorem 1] To show the equivalence between $P$ and $\{M_\theta\}$, we first reduce $\{M_\theta\}$ to $P$. This direction is easy: we can simply treat $\theta$ as part of the state in a POMDP-lite model, and the remaining parts carry over without change. The more interesting direction is to reduce $P$ to $\{M_\theta\}$.
Let us first consider the case in which the value of $\theta$ remains constant. Given $\theta$, the POMDP-lite model becomes an MDP model with parameter $\theta$, $M_\theta = (S_\theta, A, T_\theta, R_\theta, \gamma)$, which consists of the following elements. $S_\theta$ is the state space, $S_\theta = X \times (O \cup \{o_\emptyset\})$, in which $o_\emptyset$ simply means that no observation is received. $A$ is the set of actions, identical to the actions of the POMDP-lite model. The transition function $T_\theta\big((x, o), a, (x', o')\big) = p(x' \mid x, \theta, a)\, p(o' \mid x', \theta, a)$ specifies the probability of reaching state $(x', o')$ after taking action $a$ in state $(x, o)$, where $p(x' \mid x, \theta, a)$ and $p(o' \mid x', \theta, a)$ are the transition and observation probability functions of the POMDP-lite model. The reward function $R_\theta\big((x, o), a\big) = R\big((x, \theta), a\big)$ specifies the reward received when the agent takes action $a$ in state $(x, o)$. Finally, $\gamma$ is the discount factor. The graphical model in Fig. 2 shows the relationship between the POMDP-lite model and the corresponding MDP model with parameter $\theta$. Since the hidden parameter has a finite number of values, a POMDP-lite can be reduced to a collection of MDPs indexed by $\theta$.
Next, we show that a simple extension allows us to handle the case in which the value of the hidden variable changes deterministically. The key intuition is that the deterministic dynamics of the hidden variable introduce no additional uncertainty into the model: given the initial value $\theta_0$ of the hidden variable and the history up to any time step $t$, $h_t = (a_0, o_1, a_1, o_2, \ldots, a_{t-1}, o_t)$, the value of the hidden variable at time $t$ can be predicted by a deterministic function, $\theta_t = g(\theta_0, h_t)$. Thus, given the initial value $\theta_0$ and the deterministic function $g$, a POMDP-lite model can be reduced to an MDP model $M_{\theta_0} = (S_{\theta_0}, A, T_{\theta_0}, R_{\theta_0}, \gamma)$. Compared with the static case, the state here is further augmented by the history, i.e., $s = (x, o, h)$; the value of $\theta_t$ is fully captured by $\theta_0$ and $h_t$, since $\theta_t = g(\theta_0, h_t)$. The rest of the MDP model is similar to the static case. In particular, the set of actions is identical to that of the POMDP-lite model; the transition function is $T_{\theta_0}\big((x, o, h), a, (x', o', h')\big) = p(x' \mid x, \theta_t, a)\, p(o' \mid x', \theta_{t+1}, a)$ with $h' = (h, a, o')$; the reward function is $R_{\theta_0}\big((x, o, h), a\big) = R\big((x, \theta_t), a\big)$; and $\gamma$ is the discount factor. Since $\theta_0$ has a finite number of values, a POMDP-lite can be reduced to a collection of MDPs indexed by $\theta_0$.
III-C Algorithm
In this section, we present an efficient model-based Bayesian RL algorithm for POMDP-lite. The solution to the Bayesian RL problem is a policy $\pi$ that maps a belief-state pair $(b, s)$ to actions, i.e., $a = \pi(b, s)$. The value of a policy $\pi$ for a belief $b$ and state $s$ is given by Bellman's equation:
$$V_\pi(b, s) = \bar{R}(b, s, a) + \gamma \sum_{s' \in S} \bar{T}(b, s, a, s')\, V_\pi(b', s'), \qquad a = \pi(b, s),$$
where $\bar{R}(b, s, a) = \sum_{\theta} b(\theta) R_\theta(s, a)$ is the mean reward function and $\bar{T}(b, s, a, s') = \sum_{\theta} b(\theta) T_\theta(s, a, s')$ is the mean transition function. The expression uses the fact that the belief update is deterministic, i.e., $b' = \tau(b, s, a, s')$. The optimal Bayesian value function is
$$V^*(b, s) = \max_{a \in A} \Big\{ \bar{R}(b, s, a) + \gamma \sum_{s' \in S} \bar{T}(b, s, a, s')\, V^*(b', s') \Big\}, \qquad (1)$$
and the optimal Bayesian action maximizes the right-hand side. Like the optimal policy in the original POMDP-lite problem, the optimal Bayesian policy chooses actions not only based on how they affect the next state but also based on how they affect the next belief.
However, computing the optimal Bayesian policy is intractable. Instead of exploring by updating the belief at each step, our algorithm explores by explicitly modifying the reward function: each state-action pair receives a reward bonus based on how much information it can reveal. The reward bonus is motivated by the observation that the belief is updated whenever some information about the hidden parameter is revealed; we therefore use the divergence between two consecutive beliefs to measure the amount of information gained. The reward bonus is defined formally as follows:
Definition 1
When the belief is updated from $b$ to $b'$, we measure the information gain by the L1 divergence between $b$ and $b'$, i.e., $\|b - b'\|_1 = \sum_{\theta} |b(\theta) - b'(\theta)|$. Based on this, the reward bonus for a state-action pair $(s, a)$ is defined as the expected divergence between the current belief $b$ and the next belief $b'$:
$$R_B(b, s, a) = \beta \sum_{s' \in S} \bar{T}(b, s, a, s')\, \|b - b'\|_1,$$
where $\beta$ is a constant tuning factor and $b' = \tau(b, s, a, s')$ is the updated belief after observing $s'$.
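A minimal sketch of the bonus computation, assuming the divergence is the L1 distance between beliefs as in Definition 1; the transition-table representation and the Tiger-style example values (0.85 accuracy, uniform prior) are illustrative assumptions.

```python
# Sketch of the reward bonus of Definition 1: the expected L1 divergence
# between the current belief and the next belief, scaled by a tuning
# factor beta. transition[theta] maps (s, a) -> {s': prob} for the MDP
# indexed by theta.

def l1(b1, b2):
    return sum(abs(b1[th] - b2[th]) for th in b1)

def reward_bonus(belief, transition, s, a, beta=1.0):
    # Mean transition: T_bar(s, a, s') = sum_theta b(theta) T_theta(s, a, s').
    mean_next = {}
    for th, b in belief.items():
        for s2, p in transition[th][(s, a)].items():
            mean_next[s2] = mean_next.get(s2, 0.0) + b * p
    bonus = 0.0
    for s2, p_s2 in mean_next.items():
        if p_s2 == 0.0:
            continue
        # Deterministic belief update b' = tau(b, s, a, s') via Bayes' rule.
        post = {th: belief[th] * transition[th][(s, a)].get(s2, 0.0)
                for th in belief}
        norm = sum(post.values())
        post = {th: v / norm for th, v in post.items()}
        bonus += p_s2 * l1(belief, post)
    return beta * bonus

# Tiger-style example: listening from the start state, accuracy 0.85 (assumed).
T = {'L': {('s0', 'LS'): {'TL': 0.85, 'TR': 0.15}},
     'R': {('s0', 'LS'): {'TL': 0.15, 'TR': 0.85}}}
bonus = reward_bonus({'L': 0.5, 'R': 0.5}, T, 's0', 'LS')
```

An uninformative action (identical transition distributions under every θ) would yield a bonus of zero, so the bonus rewards exactly the actions that move the belief.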
At each time step, our algorithm solves an internal reward MDP whose reward function is the mean reward plus the reward bonus. It chooses the action greedily with respect to the following value function:
$$\tilde{V}(b, s) = \max_{a \in A} \Big\{ \bar{R}(b, s, a) + R_B(b, s, a) + \gamma \sum_{s' \in S} \bar{T}(b, s, a, s')\, \tilde{V}(b, s') \Big\}, \qquad (2)$$
where $R_B(b, s, a)$ is the reward bonus term of Definition 1. The other parts are identical to Equation 1, except that the belief is not updated in this equation. We can solve the internal reward MDP using standard value iteration, whose time complexity is polynomial in the sizes of the state and action spaces. In this work, however, we are more interested in problems with large state spaces, so we use UCT [10], an online MDP solver, to achieve online performance. The details of our algorithm are described in Algorithm 1.
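The per-step planning computation can be sketched as follows, using plain value iteration in place of UCT for readability. The data structures and the toy sanity check are illustrative assumptions, not the paper's implementation.

```python
# Sketch of one planning step of the algorithm: build the internal reward
# MDP under the current belief (mean dynamics plus the exploration bonus,
# with the belief frozen as in Equation 2) and solve it. Plain value
# iteration stands in for UCT here because it is compact.

def plan_step(belief, states, actions, transition, reward, bonus,
              gamma=0.95, iters=200, start=None):
    """Return the greedy action at `start` in the internal reward MDP."""
    def q(s, a, V):
        # Modified reward: mean reward R_bar plus exploration bonus R_B.
        r = sum(belief[th] * reward[th][(s, a)] for th in belief)
        r += bonus(belief, s, a)
        # Mean transition T_bar(s, a, s') under the frozen belief.
        fut = sum(belief[th] * p * V[s2]
                  for th in belief
                  for s2, p in transition[th][(s, a)].items())
        return r + gamma * fut

    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: max(q(s, a, V) for a in actions) for s in states}
    s0 = states[0] if start is None else start
    return max(actions, key=lambda a: q(s0, a, V))

# Sanity check: with a point-mass belief and zero bonus, the internal
# reward MDP is just the known MDP, so the planner maximizes its reward.
toy_T = {'A': {('s0', 'good'): {'s0': 1.0}, ('s0', 'bad'): {'s0': 1.0}}}
toy_R = {'A': {('s0', 'good'): 1.0, ('s0', 'bad'): 0.0}}
best = plan_step({'A': 1.0}, ['s0'], ['good', 'bad'], toy_T, toy_R,
                 bonus=lambda b, s, a: 0.0)
```

In the full online loop, the chosen action is executed, the belief is updated by Bayes' rule on the observed outcome, and the procedure repeats.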
IV Analysis
Although our algorithm is greedy, it performs suboptimally in only a polynomial number of time steps. In this section, we present theoretical results that bound the sample complexity of our algorithm. Unless stated otherwise, the proofs of the lemmas in this section are deferred to the appendix. For a clean analysis, we assume the reward function is bounded in $[0, 1]$.
IV-A Sample Complexity
The sample complexity measures the number of samples an algorithm needs before it performs near-optimally. We start with a definition of the sample complexity of a state-action pair $(s, a)$.
Definition 2
Given the initial belief $b_0$, target accuracy $\epsilon$, and reward bonus tuning factor $\beta$, we define the sample complexity function of $(s, a)$ as $m(\epsilon, \beta)$ such that, if $(s, a)$ has been visited more than $m(\epsilon, \beta)$ times starting from belief $b_0$, the corresponding reward bonus of visiting $(s, a)$ at the new belief $b$ is less than $\epsilon$, i.e., $R_B(b, s, a) < \epsilon$. We declare $(s, a)$ known if it has been sampled more than $m(\epsilon, \beta)$ times, and cease to update the belief when sampling known state-action pairs.
The following assumption is needed for our theorem to hold in general. It essentially says that the earlier the agent tries a state-action pair, the more information it gains from it. We give a concrete example satisfying the assumption in Lemma 1.
Assumption 1
The reward bonus monotonically decreases for every state-action pair $(s, a)$ over time steps $t$, i.e., $R_B(b_{t+1}, s, a) \leq R_B(b_t, s, a)$ for all $t$.
Now, we present our central theoretical result, which bounds the sample complexity of our algorithm with respect to the optimal Bayesian policy.
Theorem 2
Let the sample complexity of each state-action pair $(s, a)$ be $m(\epsilon, \beta)$. Let $A_t$ denote the policy followed by the algorithm at time $t$, and let $s_t$ and $b_t$ be the corresponding state and belief. Then, with probability at least $1 - \delta$, $V_{A_t}(b_t, s_t) \geq V^*(b_t, s_t) - O(\epsilon)$, i.e., the algorithm is $\epsilon$-close to the optimal Bayesian policy, for all but a number of time steps polynomial in $|S|$, $|A|$, $m(\epsilon, \beta)$, $1/\epsilon$, $1/(1-\gamma)$, and $\ln(1/\delta)$.
In other words, our algorithm acts suboptimally for only a polynomial number of time steps.
Although our algorithm was primarily designed for discrete priors, Theorem 2 applies to many prior distributions. We apply it to two simple special classes for which we can provide concrete sample complexity bounds. First, we show that, for an independent Dirichlet prior, the reward bonus monotonically decreases and the sample complexity of a state-action pair is bounded by a polynomial function; this case therefore satisfies Assumption 1.
Lemma 1 (Independent Dirichlet Prior)
Let $n(s, a)$ be the number of times $(s, a)$ has been visited. For a known reward function and an independent Dirichlet prior over the transition dynamics of each $(s, a)$ pair, the reward bonus monotonically decreases at the rate of $O(1/n(s, a))$, and the sample complexity function is $m(\epsilon, \beta) = O(\beta/\epsilon)$.
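A quick numeric check consistent with Lemma 1: under a Dirichlet posterior with balanced counts (an illustrative assumption), the expected L1 change of the mean transition vector after one more observation, the quantity the bonus is proportional to, decays on the order of one over the visit count.

```python
# Numeric check of the decay rate in Lemma 1. expected_l1_change computes
# E||b - b'||_1 for a Dirichlet posterior: the expected L1 distance between
# the mean transition vector before and after one more observation, where
# the observation is drawn from the posterior predictive distribution.

def expected_l1_change(alpha):
    a0 = sum(alpha)
    total = 0.0
    for i, ai in enumerate(alpha):
        p_i = ai / a0  # posterior predictive probability of outcome i
        # L1 change of the Dirichlet mean if outcome i is observed once more.
        change = sum(abs((aj + (1 if j == i else 0)) / (a0 + 1) - aj / a0)
                     for j, aj in enumerate(alpha))
        total += p_i * change
    return total

# Two outcomes observed equally often: counts (1 + n, 1 + n) after 2n visits.
bonuses = [expected_l1_change([1 + n, 1 + n]) for n in range(50)]
# The sequence decreases monotonically, on the order of 1 / (visit count).
```

For balanced two-outcome counts the expectation works out to exactly $1/(2n + 3)$, matching the $O(1/n)$ rate stated in the lemma.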
The strength of our algorithm lies in its ability to handle discrete priors. We use a very simple example, a discrete prior over unknown deterministic MDPs, to show this advantage, stated in the following lemma. The intuition behind the lemma is simple: after sampling a state-action pair once, the agent knows its effect exactly.
Lemma 2 (Discrete Prior over Deterministic MDPs)
Let the prior be a discrete distribution over deterministic MDPs. Then the sample complexity function is $m(\epsilon, \beta) = 1$: a single visit to a state-action pair reveals its effect, after which the bonus for that pair is zero.
IV-B Proof of Theorem 2
The key intuition behind the proof that our algorithm quickly achieves near-optimality is that, at each time step, the algorithm is optimistic with respect to the Bayesian policy, and the value of this optimism decays to zero given enough samples.
The proof of Theorem 2 follows the standard arguments of previous PAC-MDP results. We first show that the value of the policy our algorithm follows is close to the value of acting according to the optimal Bayesian policy, assuming the probability of escaping the known state-action set is small. We then use the Hoeffding bound to show that this "escape probability" can be large for only a polynomial number of time steps.
We begin with the following lemmas. The first lemma essentially says that solving the internal reward MDP using the current mean of the belief state, with the additional exploration bonus of Definition 1, leads to a value function that is optimistic with respect to the Bayesian policy.
Lemma 3 (Optimistic)
Let $\tilde{V}$ be the value function computed by our algorithm and $V^*$ the optimal Bayesian value function. If the bonus factor $\beta$ is sufficiently large, then $\tilde{V}(b, s) \geq V^*(b, s)$ for all $b$ and $s$.
The following definition generalizes the "known state-action MDP" of [27] to the Bayesian setting. It is an MDP whose dynamics (transition function and reward function) equal those of the mean MDP for pairs in the known set. For all other pairs, the value of taking those pairs in the known state-action MDP equals the current value estimate.
Definition 3
Given the current belief $b$, a value estimate $\tilde{V}(s, a)$ for each state-action pair, and a set $K$ of known state-action pairs, we define the known state-action MDP, $\hat{M}_K$, as follows. $\hat{M}_K$ includes an additional absorbing state $z$: under all actions from $z$, the agent returns to $z$ with probability 1 and receives reward 0. For all $(s, a) \in K$, the transition function of $\hat{M}_K$ is the mean transition function $\bar{T}(b, s, a, \cdot)$ and the reward function is the mean reward plus the bonus, $\bar{R}(b, s, a) + R_B(b, s, a)$. For all $(s, a) \notin K$, the agent transits to $z$ with probability 1 and receives reward $(1 - \gamma)\, \tilde{V}(s, a)$.
Our final lemma shows that the internal reward MDP and the known state-action MDP have low error on the set of known pairs.
Lemma 4 (Accuracy)
Fix the history up to time step $t$. Let $b_t$ be the belief, $s_t$ the state, $K$ the set of known state-action pairs, $\hat{M}_K$ the known state-action MDP, and $\pi_t$ the greedy policy with respect to the current belief $b_t$. Then the value of $\pi_t$ in $\hat{M}_K$ is within $\epsilon/(1 - \gamma)$ of $\tilde{V}(b_t, s_t)$.
Now, we are ready to prove Theorem 2.
[Proof of Theorem 2] Let $\hat{M}_K$ be as described in Lemma 4. Let $H = \frac{1}{1-\gamma}\ln\frac{1}{\epsilon(1-\gamma)}$; then truncating a value function at $H$ steps changes it by at most $\epsilon$ (see Lemma 2 of [9]). Let $W$ denote the event that a pair not in $K$ is generated when executing $\pi_t$ starting from $(b_t, s_t)$ for $H$ time steps. We have
The first inequality follows from the fact that the value of $\pi_t$ in the mean MDP equals its value in $\hat{M}_K$ unless $W$ occurs, and the resulting difference can be bounded since we can limit the reward bonus and still maintain optimism. The second inequality follows from the definition of $H$ above, the third from Lemma 4, and the last from Lemma 3 together with the fact that $\pi_t$ is precisely the optimal policy of the internal reward MDP at time $t$. Now, suppose the probability of $W$ is small; then $V_{\pi_t}(b_t, s_t) \geq V^*(b_t, s_t) - O(\epsilon)$. Otherwise, by the Hoeffding inequality, the escape probability can be large for no more than a polynomial number of time steps with probability $1 - \delta$, where the $\tilde{O}(\cdot)$ notation suppresses logarithmic factors.
V Experiments
To evaluate our algorithm experimentally, we compare it with several state-of-the-art algorithms from the POMDP literature. POMCP [22] and DESPOT [24] are two successful online POMDP planners that scale to very large POMDPs. QMDP [28] is a myopic offline solver widely used for its efficiency. SARSOP [13] is a state-of-the-art offline POMDP solver, which helps to calibrate the best performance achievable on POMDPs of moderate size. Mean MDP is a common myopic approximation of Bayesian planning, which does not explore. For SARSOP, POMCP, and DESPOT, we used the software provided by the authors, with a slight modification to POMCP to make it strictly follow the 1-second time limit for planning. For our algorithm and Mean MDP, an MDP must be solved at each step; we use the online MDP solver UCT [10] with parameter settings similar to those used in POMCP. The reward bonus scalar $\beta$ used by our algorithm is typically much smaller than the value required by Theorem 2, a common trend for internal reward algorithms. We tuned $\beta$ offline before using it for planning.
[Table I: for each task, the numbers of states, actions, and observations, and the average total rewards of QMDP, SARSOP, POMCP, DESPOT, Mean MDP, and POMDP-lite.]
We first apply the algorithms to two benchmark problems from the POMDP literature, which demonstrate the scalability of our algorithm on larger POMDPs. In RockSample [23], a robot moves in an $n \times n$ grid that contains $k$ rocks, each of which is equally likely to be good or bad initially. At each step, the robot can move to an adjacent cell, sense a rock, or sample a rock. The robot can only sample a rock when it is in the cell containing that rock. Sampling a rock yields a positive reward if the rock is good and a penalty otherwise. Moving and sampling produce no observation. Sensing produces an observation in {good, bad}, with accuracy decreasing exponentially as the robot's distance to the rock increases. The robot reaches the terminal state when it passes the east edge of the map. The discount factor is 0.95. The hidden parameter is the joint property of the rocks; it remains constant, so the problem can be modeled as a POMDP-lite.
In Battleship [22], ships are placed at random into a grid, subject to the constraint that no ship may be placed adjacent or diagonally adjacent to another ship. The ships have different sizes. The goal is to find and sink all ships. Initially, the agent does not know the configuration of the ships. At each step, the agent can fire upon one cell of the grid, and receives an observation indicating whether a ship was hit. There is a penalty per step and a terminal reward for hitting every cell of every ship. It is illegal to fire twice on the same cell. The hidden parameter is the configuration of the ships, which remains constant, so this problem can also be modeled as a POMDP-lite.
The results for RockSample and Battleship are shown in Table I. All algorithms, except QMDP and SARSOP (offline algorithms), run in real time with 1 second per step. The results for SARSOP were replicated from [15]; the other results are from our own tests and were averaged over 1,000 runs. A dash means the problem size is too large for the algorithm. RS is short for RockSample, and BS for Battleship. As Table I shows, our algorithm achieves performance similar to the state-of-the-art offline solvers when the problem is small. However, as the problem size increases, the offline solvers fail, and our algorithm outperforms the other online algorithms.
Finally, we present a robot arm grasping task, which originates from the Amazon Picking Challenge. A V-REP [18] simulation view is shown in Fig. (a). The goal of the robot arm is to grasp the cup out of the shelf quickly and robustly. The robot knows its configuration exactly, and its movement is deterministic. However, due to sensor limitations, the initial position of the cup is uncertain. The gripper has a tactile sensor inside each finger, which gives a positive reading when the inner part of the finger touches the cup. The robot needs to move around to localize the cup and then grasp it as soon as possible. This task can be modeled as a general POMDP; however, modeling it as a POMDP-lite lets our algorithm achieve much better performance. We now introduce our planning model for this task. We restrict the movement of the gripper to a plane, as shown in Fig. (b), and divide the plane into three regions relative to the gripper. If the cup is in region 1, the gripper can touch the cup by moving along the x-axis or the y-axis. If the cup is in region 2, the gripper can touch the cup only by moving along the y-axis. If the cup is in region 3, the gripper cannot sense the cup by moving in a single direction. The gripper can move along the x- or y-axis with a fixed step size. Each movement incurs a cost that depends on the region containing the cup. The gripper can close or open its fingers, at a small cost. Picking the cup yields a large reward if the pick succeeds and a penalty otherwise.
We compare our algorithm with POMCP, DESPOT, and Mean MDP, since QMDP and SARSOP do not support continuous state spaces. All algorithms are tested via model evaluation and V-REP simulation. In model evaluation, we use the planning model itself to evaluate the policy. In V-REP simulation, we compute the best action using the planning model and then execute it in the simulator; the next state and observation are obtained from V-REP. The results for both settings are reported in Table II. The time allotted for online planning is 1 second per step for all algorithms. We ran repeated trials for model evaluation and for V-REP simulation. As the table shows, our algorithm achieves a higher average return and success rate than the other algorithms in both settings.
[Table II: results for the grasping task under model evaluation and V-REP simulation (continuous states, 7 actions, 6 observations); average total rewards and success rates of POMCP, DESPOT, Mean MDP, and POMDP-lite.]
VI Conclusion
We have introduced POMDP-lite, a subclass of POMDP with hidden variables that are either static or change deterministically. A POMDP-lite is equivalent to a set of MDPs indexed by a hidden parameter. By exploiting this equivalence, we have developed a simple online algorithm for POMDP-lite through model-based Bayesian reinforcement learning. Preliminary experiments suggest that the algorithm outperforms state-of-the-art general-purpose POMDP solvers on very large POMDP-lite models, making it a promising tool for large-scale robot planning under uncertainty.
Currently, we are implementing and experimenting with the algorithm on a Kinova Mico robot for object manipulation. An interesting and important direction for future work is to extend the approach to large observation and action spaces.
References

[1] J. Asmuth, L. Li, M. L. Littman, A. Nouri, and D. Wingate, "A Bayesian sampling approach to exploration in reinforcement learning," in Proc. Conf. on Uncertainty in Artificial Intelligence, 2009.
[2] H. Bai, S. Cai, D. Hsu, and W. Lee, "Intention-aware online POMDP planning for autonomous driving in a crowd," in Proc. IEEE Int. Conf. on Robotics & Automation, 2015.
[3] H. Bai, D. Hsu, and W. Lee, "Planning how to learn," in Proc. IEEE Int. Conf. on Robotics & Automation, 2013.
[4] T. Bandyopadhyay, K. Won, E. Frazzoli, D. Hsu, W. Lee, and D. Rus, "Intention-aware motion planning," in Algorithmic Foundations of Robotics X—Proc. Int. Workshop on the Algorithmic Foundations of Robotics (WAFR), 2012.
[5] F. Doshi-Velez and G. Konidaris, "Hidden parameter Markov decision processes: A semiparametric regression approach for discovering latent task parametrizations," arXiv preprint arXiv:1308.3513, 2013.
[6] A. Fern, S. Natarajan, K. Judah, and P. Tadepalli, "A decision-theoretic model of assistance," in Proc. AAAI Conf. on Artificial Intelligence, 2007.
[7] K. Hsiao, L. Kaelbling, and T. Lozano-Pérez, "Grasping POMDPs," in Proc. IEEE Int. Conf. on Robotics & Automation, 2007.
[8] L. Kaelbling, M. Littman, and A. Cassandra, "Planning and acting in partially observable stochastic domains," Artificial Intelligence, vol. 101, no. 1–2, pp. 99–134, 1998.
[9] M. Kearns and S. Singh, "Near-optimal reinforcement learning in polynomial time," Machine Learning, vol. 49, no. 2–3, pp. 209–232, 2002.
[10] L. Kocsis and C. Szepesvári, "Bandit based Monte-Carlo planning," in Machine Learning: ECML 2006. Springer, 2006, pp. 282–293.
[11] J. Z. Kolter and A. Y. Ng, "Near-Bayesian exploration in polynomial time," in Proc. Int. Conf. on Machine Learning, 2009.
[12] M. Koval, N. Pollard, and S. Srinivasa, "Pre- and post-contact policy decomposition for planar contact manipulation under uncertainty," Int. J. Robotics Research, 2015.
[13] H. Kurniawati, D. Hsu, and W. Lee, "SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces," in Proc. Robotics: Science & Systems, 2008.
[14] S. Nikolaidis, R. Ramakrishnan, K. Gu, and J. Shah, "Efficient model learning from joint-action demonstrations for human-robot collaborative tasks," in Proc. ACM/IEEE Int. Conf. on Human-Robot Interaction, 2015.
[15] S. C. Ong, S. W. Png, D. Hsu, and W. S. Lee, "Planning under uncertainty for robotic tasks with mixed observability," Int. J. Robotics Research, vol. 29, no. 8, pp. 1053–1068, 2010.
[16] C. H. Papadimitriou and J. N. Tsitsiklis, "The complexity of Markov decision processes," Mathematics of Operations Research, vol. 12, no. 3, pp. 441–450, 1987.
[17] P. Poupart, N. Vlassis, J. Hoey, and K. Regan, "An analytic solution to discrete Bayesian reinforcement learning," in Proc. Int. Conf. on Machine Learning, 2006.
[18] E. Rohmer, S. P. Singh, and M. Freese, "V-REP: A versatile and scalable robot simulation framework," in Proc. IEEE/RSJ Int. Conf. on Intelligent Robots & Systems, 2013, pp. 1321–1326.
[19] S. Ross, J. Pineau, S. Paquet, and B. Chaib-Draa, "Online planning algorithms for POMDPs," J. Artificial Intelligence Research, vol. 32, no. 1, pp. 663–704, 2008.
[20] N. Roy and S. Thrun, "Coastal navigation with mobile robots," in Advances in Neural Information Processing Systems, vol. 12. The MIT Press, 1999, pp. 1043–1049.
[21] K. Seiler, H. Kurniawati, and S. Singh, "An online and approximate solver for POMDPs with continuous action space," in Proc. IEEE Int. Conf. on Robotics & Automation, 2015.
[22] D. Silver and J. Veness, "Monte-Carlo planning in large POMDPs," in Advances in Neural Information Processing Systems, 2010.
[23] T. Smith and R. Simmons, "Point-based POMDP algorithms: Improved analysis and implementation," in Proc. Conf. on Uncertainty in Artificial Intelligence, 2005.
[24] A. Somani, N. Ye, D. Hsu, and W. Lee, "DESPOT: Online POMDP planning with regularization," in Advances in Neural Information Processing Systems, 2013.
[25] E. Sondik, "The optimal control of partially observable Markov processes," Ph.D. dissertation, Stanford University, Stanford, CA, USA, 1971.
[26] J. Sorg, S. P. Singh, and R. L. Lewis, "Variance-based rewards for approximate Bayesian reinforcement learning," in Proc. Conf. on Uncertainty in Artificial Intelligence, 2010.
[27] A. L. Strehl, L. Li, and M. L. Littman, "Incremental model-based learners with formal learning-time guarantees," in Proc. Conf. on Uncertainty in Artificial Intelligence, 2006.
[28] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics. The MIT Press, 2005.
Technical Proofs
VI-A Proof of Lemma 1
Let $\alpha = (\alpha_1, \ldots, \alpha_K)$ denote the Dirichlet counts for $(s, a)$, and let $\alpha_0 = \sum_i \alpha_i$ denote the total count. According to the definition of the Dirichlet distribution, the mean transition probability to outcome $i$ is $\alpha_i / \alpha_0$. If outcome $i$ is observed, the mean for each outcome $j$ changes from $\alpha_j / \alpha_0$ to $(\alpha_j + \mathbb{1}[j = i]) / (\alpha_0 + 1)$, so the L1 change of the mean is $2(\alpha_0 - \alpha_i) / (\alpha_0 (\alpha_0 + 1))$. The reward bonus term can therefore be written as
$$R_B = \beta \sum_i \frac{\alpha_i}{\alpha_0} \cdot \frac{2(\alpha_0 - \alpha_i)}{\alpha_0 (\alpha_0 + 1)} \leq \frac{2\beta}{\alpha_0 + 1},$$
which decreases at the rate of $O(1/n)$, since $\alpha_0$ grows by one with each visit. As for the sample complexity, if $n \geq 2\beta/\epsilon$, we have $R_B \leq \epsilon$, so $m(\epsilon, \beta) = O(\beta/\epsilon)$.
VI-B Proof of Lemma 3
We first introduce some notation used in the proof. Denote a $t$-step history as $h_t = (b_0, s_0, a_0, \ldots, b_{t-1}, s_{t-1}, a_{t-1})$, where $b_i$ and $s_i$ are the belief and state at time step $i$. The following defines a divergence on the reward function and the transition function.
Definition 4
Denote by $\bar{R}(b, \cdot, \cdot)$ the mean reward function and by $\bar{T}(b, \cdot, \cdot, \cdot)$ the mean transition function given belief $b$, i.e., $\bar{R}(b, s, a) = \sum_\theta b(\theta) R_\theta(s, a)$ and $\bar{T}(b, s, a, s') = \sum_\theta b(\theta) T_\theta(s, a, s')$. Suppose the belief changes from $b$ to $b'$. The divergences of the reward and transition functions are
$$d_R(b, b') = \max_{s, a} \big| \bar{R}(b, s, a) - \bar{R}(b', s, a) \big|, \qquad d_T(b, b') = \max_{s, a} \sum_{s'} \big| \bar{T}(b, s, a, s') - \bar{T}(b', s, a, s') \big|.$$
Based on Definition 4, we introduce the regret of a $t$-step history when the belief is not updated at each step.
Definition 5
Given a $t$-step history $h_t$, if the belief is not updated, the regret of action $a_i$ is the value lost at step $i$ by using the stale belief $b_0$ in place of $b_i$. Define the total regret of the history as the discounted sum of the per-step regrets.
The following definition measures the extra value contributed by the reward bonus term when we use the internal reward.
Definition 6
Given a policy's $t$-step history $h_t$, the reward bonus for action $a_i$ is $R_B(b_0, s_i, a_i)$. Define the total extra value from the reward bonus as $\sum_{i=0}^{t-1} \gamma^i R_B(b_0, s_i, a_i)$.
In the next lemma, we bound the regret by the extra value from the reward bonus.
Lemma 5
We begin the proof by showing that the divergence of the reward function is bounded by the reward bonus when $\beta$ is chosen properly.
The first inequality follows from the fact that the reward function is bounded in $[0, 1]$ and the triangle inequality. The second inequality also follows from the triangle inequality, i.e., $\|b_0 - b_i\|_1 \leq \sum_{j=0}^{i-1} \|b_j - b_{j+1}\|_1$. The third inequality follows from our monotonicity assumption on the reward bonus (Assumption 1).
Similarly, we can show that the divergence of the transition function is bounded by the reward bonus.
Finally, we show that the total regret can be bounded by the extra value from the reward bonus:
The first inequality above follows from the fact that the divergences of the reward and transition functions are bounded by the reward bonus for a suitable value of $\beta$; the second follows from the definition of the total extra value.
Now, we are ready to prove Lemma 3.
[Proof of Lemma 3] Let $H = \frac{1}{1-\gamma}\ln\frac{1}{\epsilon(1-\gamma)}$; then truncating a value function at $H$ steps changes it by at most $\epsilon$ (see Lemma 2 of [9]), where $b_0$ and $s_0$ are the initial belief and state. Consider some state $s$, and let $b_t$ be the new belief formed by updating $b_0$ after $t$ steps; then
(3) 
The first inequality in Equation 3 follows: