1 Introduction
The de facto model for decision making under uncertainty is the partially-observable Markov decision process (POMDP) [Lit96, PT87], which has been applied in diverse areas ranging from planning [RN10, KLM96] to robotics [KGFP09, KLC98]. One of the classical and fundamental payoff functions for POMDPs is the discounted-sum payoff, which aggregates the rewards of the transitions as a discounted sum. The traditional objective in POMDPs has been to obtain policies that maximize the expected discounted-sum payoff.
One crucial drawback of the traditional objective (that asks for expectation maximization) is that it allows for undesirable events that can happen with low probability. For example, consider a policy that achieves a high payoff with some probability and a very low payoff otherwise, and a different policy that achieves an intermediate payoff with certainty. If the very low payoff is undesirable, then the first policy, even when it is better for expected payoff, allows undesirable events with significant probability, and hence the second policy is preferable. Hence, there has been recent interest in objectives where, instead of maximizing the expected payoff, the goal is to maximize the probability that the payoff is above a threshold [HYV16].
A drawback of the approach to maximize the probability that the payoff exceeds a threshold is that it ignores the optimization aspect of maximizing the expectation. In this work we consider an objective for POMDPs where both aspects are present. More precisely, we consider a “guaranteed payoff optimization (GPO)” problem for POMDPs, where given a threshold $t$, the goal is to maximize the expectation while ensuring that the payoff is at least $t$.
As a concrete motivation for the GPO problem, consider planning under uncertainty (e.g., self-driving cars) where certain events are catastrophic (e.g., crashes), and in the model they are assigned low payoffs. Such catastrophic events must be avoided even at the expense of expected payoff. That is, policies must maximize the expected payoff while ensuring the avoidance of catastrophic events. Hence, for planning in safety-critical applications the GPO problem is natural.
In this work, our main contributions are as follows:

We study the GPO problem for POMDPs, and present a practical solution approach for the problem. In particular, given a POMDP with the GPO problem, we present a transformation to a different POMDP where it suffices to solve the traditional expectation objective. Our solution approach first constructs a representation of all policies that satisfy the threshold constraint (item a) of the GPO problem), and then we extend the partially-observable Monte Carlo planning (POMCP) approach to obtain optimal policies w.r.t. expectation among these policies.

We present experimental results on several classical POMDP examples from the literature to show how our approach can efficiently solve the GPO problem for POMDPs.
Related Works.
Works studying POMDPs with discounted sum range from theoretical results (see, e.g., [PT87, Lit96]) to practical tools (e.g. [KHL08, SV10]). Recent works focus on extracting policies which ensure that, with a given probability bound, the obtained discounted-sum payoff is above a threshold (see, e.g., [HYV16]). The problem of ensuring the payoff is above a given threshold while optimizing the expectation has been considered for fully-observable MDPs with the long-run average and stochastic shortest path objectives [BFRR14, RRS15], and also with probabilistic thresholds for long-run average payoff [CKK15]. As for POMDPs, we mention constrained POMDPs [UH10, PMP15], where the aim is to maximize the expected payoff while ensuring that the expectation of some other quantity is bounded. In contrast, our constraints are hard, i.e. they must hold always, not just on average. The work probably closest to ours is [STW16], which also considers maximizing expected payoff among all policies satisfying a given constraint, but there are two key differences from our work: they consider finite-horizon POMDPs, while we consider infinite-horizon ones, and more importantly, their constraints are state-based, i.e. their policy must ensure that the execution of the POMDP does not go through certain “violating” states. In contrast, our “threshold constraint” is execution-based: whether an execution yields payoff at least $t$ cannot be determined solely by looking at the set of states appearing in the execution, but the whole infinite execution has to be considered. This requires very different techniques. To the best of our knowledge, the GPO problem has never been considered for POMDPs with discounted sum.
2 Preliminaries
2.1 POMDPs
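We denote by $\mathcal{D}(X)$ the set of all probability distributions on a finite set $X$, i.e. all functions $f\colon X \to [0,1]$ such that $\sum_{x \in X} f(x) = 1$. For $f \in \mathcal{D}(X)$ we denote by $\mathrm{Supp}(f)$ the support of $f$, i.e. the set $\{x \in X \mid f(x) > 0\}$.
Definition 1.
POMDPs. A POMDP is defined as a tuple $P = (S, A, \delta, r, Z, O, \lambda)$ where $S$ is a finite set of states, $A$ is a finite alphabet of actions, $\delta\colon S \times A \to \mathcal{D}(S)$ is a probabilistic transition function that given a state and an action gives the probability distribution over the successor states, $r\colon S \times A \to \mathbb{R}$ is a reward function, $Z$ is a finite set of observations, $O\colon S \to \mathcal{D}(Z)$ is a probabilistic observation function that maps every state to a distribution over observations, and $\lambda \in \mathcal{D}(S)$ is the initial belief. We abbreviate $\delta(s,a)(s')$ by $\delta(s' \mid s, a)$.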
Remark 1 (Deterministic observation function).
Deterministic observation functions of type $O\colon S \to Z$ are sufficient in POMDPs (see the remark in [CCGK14]). Informally, the probabilistic aspect of the observation function can be encoded into the transition function and, by letting the product of the states and observations be the new state space, we obtain a deterministic observation function. Thus, without loss of generality, we will always consider observation functions of type $O\colon S \to Z$, which greatly simplifies the notation.
Plays & Histories.
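A play (or an infinite path) in a POMDP is an infinite sequence $\rho = s_0 a_1 s_1 a_2 s_2 \ldots$ of states and actions such that $s_0 \in \mathrm{Supp}(\lambda)$ and for all $i \geq 0$ we have $\delta(s_i, a_{i+1})(s_{i+1}) > 0$. We write $\mathrm{Plays}(P)$ for the set of all plays. A finite path (or just path) is a finite prefix of a play ending with a state, i.e. a sequence from $(S \cdot A)^* \cdot S$. A history is a finite sequence of actions and observations $h = a_1 o_1 \ldots a_i o_i \in (A \cdot Z)^*$ such that there is a path $w = s_0 a_1 s_1 \ldots a_i s_i$ with $o_j = O(s_j)$ for each $1 \leq j \leq i$. We write $h = H(w)$ to indicate that history $h$ corresponds to a path $w$. The length of a path (or history) $w$, denoted by $len(w)$, is the number of actions in $w$, and the length of a play $\rho$ is $len(\rho) = \infty$.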
Beliefs.
A belief is a distribution on states (i.e. an element of $\mathcal{D}(S)$) indicating the probability of being in each particular state given the current history. The initial belief $\lambda$ is given as part of the POMDP. Then, in each step, when the history observed so far is $h$, the current belief is $b_h$, an action $a$ is played and an observation $o$ is received, the updated belief $b_{hao}$ for history $hao$ can be computed by a standard formula [Cas98].
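The belief update mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the observation function is deterministic (as in Remark 1), and the dictionary-based encoding together with the names `update_belief`, `delta` and `obs_of` are hypothetical choices made for this sketch, not notation from the formal model.

```python
from collections import defaultdict


def update_belief(belief, action, obs, delta, obs_of):
    """Bayesian belief update for a POMDP with deterministic observations.

    belief : dict state -> probability (the current belief)
    delta  : dict (state, action) -> dict successor -> probability
    obs_of : dict state -> observation (deterministic observation function)
    """
    unnormalised = defaultdict(float)
    for s, p in belief.items():
        for s2, q in delta[(s, action)].items():
            # Only successors that emit the received observation stay possible.
            if obs_of[s2] == obs:
                unnormalised[s2] += p * q
    total = sum(unnormalised.values())
    if total == 0.0:
        raise ValueError("observation is impossible under this belief and action")
    return {s2: p / total for s2, p in unnormalised.items()}
```

The support of the returned belief depends only on the support of the input belief, the action and the observation, not on the probabilities themselves.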
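Infinite-horizon Discounted Payoff.
Given a play $\rho = s_0 a_1 s_1 a_2 s_2 \ldots$ and a discount factor $0 \leq \gamma < 1$, the infinite-horizon discounted payoff of $\rho$ is:
$$\mathrm{Disc}_\gamma(\rho) = \sum_{i=0}^{\infty} \gamma^{i} \cdot r(s_i, a_{i+1}).$$
We also define a discounted payoff of a finite path $w = s_0 a_1 s_1 \ldots a_n s_n$ as
$$\mathrm{Disc}_\gamma(w) = \sum_{i=0}^{n-1} \gamma^{i} \cdot r(s_i, a_{i+1}).$$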
Policies.
A policy is a blueprint for selecting actions based on the past history of observations and actions. Formally, it is a function $\pi$ which assigns to a history a probability distribution over the actions, i.e. $\pi(h)(a)$ is the probability of selecting action $a$ after observing history $h$ (we often abbreviate $\pi(h)(a)$ to $\pi(a \mid h)$).
Consistent Plays.
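A play or a path is consistent with a policy $\pi$ if it can be obtained by extending its finite prefixes using $\pi$. Formally, a play or path $w$ is consistent with $\pi$ if for each proper prefix $w'$ of $w$ that ends in a state there is an action $a$ such that $w'a$ is also a prefix of $w$ and $\pi(a \mid H(w')) > 0$. A history $h$ is consistent with $\pi$ if there is a path $w$ consistent with $\pi$ such that $h = H(w)$.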
Expected Value of Policies.
Given a POMDP $P$, a policy $\pi$, a discount factor $\gamma$, and an initial belief $\lambda$, the expected value of $\pi$ from $\lambda$ is the expected value of the infinite-horizon discounted sum under policy $\pi$ when starting in a state sampled from $\lambda$: $\mathrm{eVal}(\pi, \lambda) = \mathbb{E}^{\pi}_{\lambda}[\mathrm{Disc}_\gamma]$. This definition can be formalized by a standard construction of a probability measure induced by $\pi$ over the set of all plays, which also gives rise to the expectation operator $\mathbb{E}^{\pi}_{\lambda}$ (see, e.g., [Put05]).
Worst-Case Value of Policies.
The worst-case value of a policy $\pi$ from belief $\lambda$ is $\mathrm{wVal}(\pi, \lambda) = \inf_{\rho} \mathrm{Disc}_\gamma(\rho)$, where the infimum is taken over the set of all plays $\rho$ that are consistent with $\pi$ and start in a state sampled from $\lambda$.
Example 1.
Figure 1 shows a toy POMDP: A mining robot has to mine ore, which can be of two types, each modelled by its own state. The exact type is unknown, but one of the two types is more likely to occur according to the initial belief. The goal is to reach the “ore mined” state, in which a lump-sum reward is received. The robot can use several mining modes: a safe mode, which succeeds only with some probability and does not do anything if it fails, or two type-specific mining modes, which succeed if applied on the correct type but result in a catastrophic failure if used on a wrong type. It can also use a sensor to accurately determine the type (after which a type-specific action can be safely used), at a cost of a one-step delay.
An exhaustive analysis of possible policies reveals that the expected value is maximized by any policy which selects in the first step (we then have ). However, the worstcase value of such a policy is , as it can result in entering after the first step. On the other hand, a policy which plays in the first step has .
Main Computational Questions.
The standard POMDP planning problem asks to compute (or approximate) the policy maximizing the expected value. In online POMDP planning, instead of computing the whole policy we have to compute, in each time step, the best action in the current situation. In other words, we must compute a good local approximation of a (near-)optimal policy [RPPCd08]. In contrast, in the threshold planning problem we are asked to compute a policy maximizing the worst-case value and thus provide strict guarantees on the performance of the system [ZP96]. In this paper, we combine these two approaches and study the guaranteed payoff optimization (GPO) problem, where we are given a POMDP $P$ and a threshold $t$ and we have to compute a policy $\pi$ such that

a) $\pi$ satisfies a threshold constraint: $\mathrm{wVal}(\pi, \lambda)$ is at least $t$.

b) Let $\mathrm{gVal}(t)$ denote the best expected value obtainable while ensuring a worst-case payoff of at least $t$, i.e. $\mathrm{gVal}(t) = \sup\{\mathrm{eVal}(\pi', \lambda) \mid \mathrm{wVal}(\pi', \lambda) \geq t\}$. Among all policies that satisfy item a), $\pi$ has maximal expected value, i.e. $\mathrm{eVal}(\pi, \lambda) = \mathrm{gVal}(t)$.
To efficiently tackle the GPO problem we aim to compute, in an online fashion, a local approximation of the policy $\pi$ above. However, we do not relax requirement a). Approximations notwithstanding, the online planning algorithm we seek is such that, given $t$, the discounted payoff of every single play that can be produced by the algorithm is at least $t$.
Example 2.
Take the POMDP in Figure 1 and a threshold . As shown in Example 1, a policy playing in the first step satisfies . However, there are better (w.r.t. the expected value) policies satisfying this constraint. The best such policy is a policy which twice plays and then plays . This policy satisfies and . (Also note that the optimal policy to maximize the expected payoff plays at the very start. However, with nonzero probability, this strategy violates the worstcase threshold .)
3 Policies for GPO Problem
We first show the GPO problem is different from the classical expectation maximization.
Example 3 (Beliefs are not sufficient for GPO.).
It is known that beliefs form a sufficient statistic of history for achieving the optimal expected value, i.e. there is always a deterministic belief-based policy (that is, a policy $\pi$ such that for each history $h$ the distribution $\pi(h)$ is Dirac and determined solely by the belief $b_h$ after observing $h$) with optimal expected value [Son71]. However, beliefs are not a sufficient statistic for the GPO problem, as witnessed by Example 2: suppose that we use the best policy $\pi$ from that example and consider the histories $h_1 = a\,o$ and $h_2 = a\,o\,a\,o$, where $a$ is the action that $\pi$ plays twice at the start and $o$ is the observation received in both ore-type states. The beliefs $b_{h_1}$ and $b_{h_2}$ are identical, and yet $\pi(h_1) \neq \pi(h_2)$, i.e. $\pi$ is not belief-based.
Overview of Policy Representation.
We show (in Corollary 1) that a sufficient statistic for solving the GPO problem is a tuple $(b_h, \mathrm{rem}_t(h))$, where $b_h$ is the belief after history $h$ and $\mathrm{rem}_t(h)$ is the “remaining” distance to the threshold which we need to accumulate in the future. Formally, $\mathrm{rem}_t(h) = (t - \mathrm{Disc}_\gamma(h)) / \gamma^{len(h)}$, where $\mathrm{Disc}_\gamma(h)$ is the discounted payoff accumulated along $h$.
This is similar to other (PO)MDP planning problems that work with thresholds [Whi93, HYV16]. However, we prove more: we obtain a precise local characterization of policies that satisfy the threshold constraint. More precisely, we show that for each history $h$, there is a set $\mathrm{Allow}_t(h)$ of allowed actions such that a policy $\pi$ satisfies $\mathrm{wVal}(\pi, \lambda) \geq t$ if and only if for each history $h$ it holds $\mathrm{Supp}(\pi(h)) \subseteq \mathrm{Allow}_t(h)$. We show that the function $\mathrm{Allow}_t$ can be finitely represented and, for any history $h$, its value can be computed algorithmically. This permits us to split the solution of the GPO problem into two separate parts: (1) we compute the function $\mathrm{Allow}_t$, and (2) we use it to restrict a standard online planning algorithm so that it always returns an action allowed for the current history.
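The "remaining distance to the threshold" from the overview above can be tracked with very little code. The following is a minimal sketch, assuming that the first reward is undiscounted and that the remaining distance is rescaled by the discount accumulated so far; the function names are hypothetical and chosen only for illustration.

```python
def discounted_payoff(rewards, gamma):
    """Discounted sum of the rewards collected along a finite path.

    The reward of the first step is weighted by gamma**0 (an assumed convention).
    """
    return sum(gamma ** i * r for i, r in enumerate(rewards))


def remaining_threshold(t, rewards, gamma):
    """Remaining distance to threshold t, rescaled to the current time step."""
    return (t - discounted_payoff(rewards, gamma)) / gamma ** len(rewards)


def step_remaining(rem, reward, gamma):
    """One-step online update: subtract the reward just obtained, then rescale."""
    return (rem - reward) / gamma
```

Starting from `rem = t` and calling `step_remaining` after every step yields the same number as recomputing `remaining_threshold` from scratch, which is what makes the statistic cheap to maintain online.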
Allowed Actions $\mathrm{Allow}_t(h)$.
Intuitively, an action $a$ should be allowed after some history $h$ only if the payoff we are guaranteed to accumulate using $a$ in the current step (i.e. $\min_{s \in \mathrm{Supp}(b_h)} r(s,a)$) plus the best payoff which we can guarantee from the next step onward is at least $\mathrm{rem}_t(h)$. To formalize the “best payoff guaranteed from the next step on” we define the future value of any history $h$ as
$$\mathrm{fVal}(h) = \sup_{\pi} \mathrm{wVal}(\pi, b_h),$$
where the worst-case value is taken in $P(b_h)$, a POMDP identical to $P$ except for having initial belief $b_h$, and the supremum is taken over all policies in $P(b_h)$.
Belief Supports Suffice for the Worst Case.
The crucial observation is that the future value of a history $h$ is determined only by the support of $b_h$.
Lemma 1.
If histories $h, h'$ in a POMDP $P$ are such that $\mathrm{Supp}(b_h) = \mathrm{Supp}(b_{h'})$, then $\mathrm{fVal}(h) = \mathrm{fVal}(h')$.
Intuitively, this is because the worst-case value of a policy (and thus also a future value of a history) does not depend on any transition probabilities. In a slight abuse of notation, we sometimes treat $\mathrm{fVal}$ as a function from $2^S$ to $\mathbb{R}$, i.e. $\mathrm{fVal}(B)$, for $B \subseteq S$, is equal to $\mathrm{fVal}(h)$ for all histories $h$ such that $\mathrm{Supp}(b_h) = B$.
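$\mu$ as an Approximation of $\mathrm{fVal}$.
Since computing $\mathrm{fVal}$ exactly can be inefficient in practice, we often need to work with approximations of $\mathrm{fVal}$, without relaxing the threshold constraint. We thus introduce a notion of a $\mu$-allowed action. Let $\mu\colon 2^S \to \mathbb{R}$ be a function assigning numbers to belief supports. We say that an action $a$ is $\mu$-allowed for $t$ after history $h$, and write it $a \in \mathrm{Allow}^{\mu}_{t}(h)$, if for all states $s \in \mathrm{Supp}(b_h)$ and all observations $o$ such that $hao$ is a history it holds that
$$r(s,a) + \gamma \cdot \mu(\mathrm{Supp}(b_{hao})) \geq \mathrm{rem}_t(h). \qquad (1)$$
If $\mu$ is the function $\mathrm{fVal}$, we write simply $\mathrm{Allow}_t(h)$. We typically aim at computing a lower bound on $\mathrm{fVal}$, i.e. a function $\mu$ such that $\mu(B) \leq \mathrm{fVal}(B)$ for each $B \subseteq S$. Then, as shown below, playing $\mu$-allowed actions still guarantees that the threshold is eventually surpassed.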
Correctness of the Approximation.
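The correctness of the definition is summarized in the following proposition. We say that a policy $\pi$ is $\mu$-safe for $t$ if for each history $h$ consistent with $\pi$ it holds that $\mathrm{Supp}(\pi(h)) \subseteq \mathrm{Allow}^{\mu}_{t}(h)$.
Proposition 1.
Let $\mu$ be a function such that $\mu(B) \leq \mathrm{fVal}(B)$ for each $B \subseteq S$. Then any policy $\pi$ that is $\mu$-safe for $t$ satisfies $\mathrm{wVal}(\pi, \lambda) \geq t$. Moreover a policy $\pi$ is $\mathrm{fVal}$-safe for $t$ if and only if $\mathrm{wVal}(\pi, \lambda) \geq t$.
Corollary 1.
Assume that there is a policy $\pi$ with $\mathrm{wVal}(\pi, \lambda) \geq t$. Then there is also a policy $\pi^*$ such that $\mathrm{wVal}(\pi^*, \lambda) \geq t$ and $\mathrm{eVal}(\pi^*, \lambda) = \mathrm{gVal}(t)$, and moreover, $\pi^*$ is belief-and-payoff-based, i.e. for all histories $h, h'$ such that $b_h = b_{h'}$ and $\mathrm{rem}_t(h) = \mathrm{rem}_t(h')$ it holds $\pi^*(h) = \pi^*(h')$.
From (1) we see that to compute $\mathrm{Allow}_t(h)$ we have to keep track of $\mathrm{rem}_t(h)$ (which can be easily done online) and to compute $\mathrm{fVal}$ (or a suitable under-approximation thereof). In the next section we show how to do the latter.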
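The check expressed by condition (1) can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: it assumes a deterministic observation function, represents belief supports as `frozenset`s, and takes the bound on future values as a dictionary `mu` from supports to numbers. All names (`successors_by_obs`, `allowed_actions`, `delta`, `obs_of`, `mu`) are hypothetical.

```python
def successors_by_obs(support, action, delta, obs_of):
    """Group the possible successors of a belief support by the observation they emit."""
    groups = {}
    for s in support:
        for s2, p in delta[(s, action)].items():
            if p > 0:
                groups.setdefault(obs_of[s2], set()).add(s2)
    return {o: frozenset(g) for o, g in groups.items()}


def allowed_actions(support, rem, actions, delta, reward, obs_of, mu, gamma):
    """Return the actions satisfying condition (1) for the given support and rem value.

    An action is kept only if, for every state of the support and every
    observation that may follow, the immediate reward plus the discounted
    bound mu on the next support reaches the remaining threshold.
    """
    allowed = []
    for a in actions:
        succ = successors_by_obs(support, a, delta, obs_of)
        if all(reward[(s, a)] + gamma * mu[nxt] >= rem
               for s in support for nxt in succ.values()):
            allowed.append(a)
    return allowed
```

Note that the check uses only the belief support and the remaining threshold, never the belief probabilities, in line with Lemma 1.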
Example 4.
Consider the POMDP from Figure 1 with a threshold . Then , , , and . Initially, for the empty history, we have and therefore the only allowed actions are and because for all we have Suppose that is played and that the next observation witnessed is (thus, the belief is the same as before). We have . In this case, the only allowed action is because for all and and are still not allowed (since we have not accumulated any payoff and have the same belief as before). Hence, is played and consequently we obtain a payoff of (because of discounting). We remark that is, as required, above the threshold .
4 Computing Future Values
The threshold constraint in the GPO problem is global, i.e. it talks about all runs compatible with a policy. Hence, solving the GPO problem is unlikely to be amenable to purely online methods, which compute only local approximations of policies. In this section we show how to compute future values in an offline preprocessing step. Although this requires a global analysis of a POMDP, the preprocessing step can be done efficiently since computation of future values only requires working with belief supports rather than beliefs.
Belief Supports & Valid Belief Supports $\mathcal{B}_{\mathrm{val}}(P)$.
A belief support $B \subseteq S$ is valid if either $B = \mathrm{Supp}(\lambda)$ or there is a history $h$ such that $B = \mathrm{Supp}(b_h)$. Only valid supports can be encountered during the planning process and thus we only need to compute future values thereof. We denote by $\mathcal{B}_{\mathrm{val}}(P)$ the set of valid belief supports of POMDP $P$; the set $\mathcal{B}_{\mathrm{val}}(P)$ can be computed by a simple iterative procedure.
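One possible form of the "simple iterative procedure" is a breadth-first exploration starting from the support of the initial belief. The sketch below is an assumption about how such a procedure could look, not a description of the original tool; it assumes deterministic observations and uses hypothetical names throughout.

```python
from collections import deque


def valid_supports(initial_support, actions, delta, obs_of):
    """Compute all belief supports reachable from the initial one.

    Each support is a frozenset of states; a successor support collects all
    states reachable under one action that emit the same observation.
    """
    start = frozenset(initial_support)
    seen = {start}
    queue = deque([start])
    while queue:
        support = queue.popleft()
        for a in actions:
            groups = {}
            for s in support:
                for s2, p in delta[(s, a)].items():
                    if p > 0:
                        groups.setdefault(obs_of[s2], set()).add(s2)
            for group in groups.values():
                nxt = frozenset(group)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return seen
```

The procedure terminates because there are only finitely many subsets of states, although in the worst case their number is exponential in the number of states.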
Observable Rewards.
We present efficient computation of future values under the assumption that rewards are observable. This holds for many real-world applications, see, e.g., the examples in [HYV16, CCGK15]. Formally, POMDP $P$ has observable rewards if $r(s,a) = r(s',a)$ for each action $a$ whenever $O(s) = O(s')$. From a theoretical point of view, observability of rewards is necessary: if the rewards of a given POMDP are not observable, the computation of future values is at least as hard as solving the target discounted-sum problem, a long-standing open problem in automata theory related to other open problems in algebra [BHO15]. However, for POMDPs with unobservable rewards we can at least obtain an under-approximation of $\mathrm{fVal}$, and hence our framework is also applicable to them.
Lemma 2.
If rewards in $P$ are observable, then for each $B \in \mathcal{B}_{\mathrm{val}}(P)$, each $s, s' \in B$ and each action $a$ it holds $r(s,a) = r(s',a)$.
We thus define $r(B,a)$ as $r(s,a)$ for some $s \in B$.
Future Value Characterization.
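We start by providing a characterization of future values. A successor of a belief support $B$ under action $a$ and observation $o$ is the belief support $\mathrm{succ}(B,a,o) = \{s' \in S \mid O(s') = o \text{ and } \delta(s,a)(s') > 0 \text{ for some } s \in B\}$. Consider the following system of $|\mathcal{B}_{\mathrm{val}}(P)|$ equations with variables $x_B$, $B \in \mathcal{B}_{\mathrm{val}}(P)$:
$$x_B = \max_{a \in A}\ \min_{o \in Z,\ \mathrm{succ}(B,a,o) \neq \emptyset} \big( r(B,a) + \gamma \cdot x_{\mathrm{succ}(B,a,o)} \big). \qquad (2)$$
(Each $x_B$ appears on the LHS of exactly one equation in the system.)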
Proposition 2.
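The system (2) has a unique solution $(\bar{x}_B)_{B \in \mathcal{B}_{\mathrm{val}}(P)}$, and it satisfies $\bar{x}_B = \mathrm{fVal}(B)$ for each $B \in \mathcal{B}_{\mathrm{val}}(P)$.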
Game Perspective for the Worst Case.
Hence, it suffices to find a solution to system (2). But the form of the system is identical to the one characterizing optimal values in 2-player zero-sum discounted games [ZP96]. These games can be imagined as fully-observable MDPs in which the outcomes of actions are not resolved by a random choice but by a malicious adversary. The system (2) per se corresponds to a game where elements of $\mathcal{B}_{\mathrm{val}}(P)$ are the states, actions are the same as in $P$, and possible effects of actions are given by the function $\mathrm{succ}$.
Algorithms to Compute Future Values.
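Hence, to compute future values in practice we can employ one of several efficient algorithms for solving discounted-sum games (e.g. [Bre16]). A simple yet efficient approach is to use the standard value iteration for games: we compute a sequence of functions $\mathrm{fVal}^{(0)}, \mathrm{fVal}^{(1)}, \ldots$ of type $\mathcal{B}_{\mathrm{val}}(P) \to \mathbb{R}$, starting from an initial function $\mathrm{fVal}^{(0)}$, and for $i \geq 0$ we inductively define
$$\mathrm{fVal}^{(i+1)}(B) = \max_{a \in A}\ \min_{o \in Z,\ \mathrm{succ}(B,a,o) \neq \emptyset} \big( r(B,a) + \gamma \cdot \mathrm{fVal}^{(i)}(\mathrm{succ}(B,a,o)) \big).$$
From [ZP96] it follows there is always $j$ such that for all $i \geq j$ we have $\mathrm{fVal}^{(i)} = \mathrm{fVal}^{(j)}$, i.e. $\mathrm{fVal}^{(j)}$ is the solution to (2), and moreover $j$ admits an upper bound that depends on $d$, where $d$ is the denominator of $\gamma$ in its reduced form. Hence, the value iteration converges in at most exponentially many steps. (Footnote 1: the number $d$ can be exponential in the bit-size of $\gamma$.)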
Theorem 1.
Future values of all valid belief supports in $P$ can be computed in time exponential in the size of $P$.
Although the theoretical bound is exponential, there are several reasons for the method to work well in practice: (1.) In a concrete instance, the number of valid supports can be significantly smaller than exponential. (2.) Reaching the fixed point of the value iteration may also require significantly fewer steps than the theoretical upper bound suggests. (3.) One can show that for each $i \geq 0$, $\mathrm{fVal}^{(i)} \leq \mathrm{fVal}$. Hence, even if reaching the fixed point takes too much time, we can set up a suitable timeout after which the value iteration is stopped, say at iteration $i$. Then, by Proposition 1 any policy that is $\mathrm{fVal}^{(i)}$-safe for $t$ has worst-case value at least $t$. (4.) Value iteration is a simple and standard algorithm for which efficient implementations exist (see, e.g., [LDK95, SV05]).
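The value iteration over belief supports, including the timeout mentioned in point (3.), can be sketched as below. Several things here are assumptions made for the illustration: the initial function is the constant $r_{\min}/(1-\gamma)$ (a pessimistic lower bound, which keeps every iterate an under-approximation), `supports` is closed under successors, and when rewards are not observable the minimum reward over the support is used. The names are hypothetical.

```python
def future_values(supports, actions, delta, reward, obs_of, gamma,
                  max_iters=10_000, tol=1e-12):
    """Game-style value iteration on belief supports (max over actions, min over observations).

    supports : iterable of frozensets, assumed closed under successors
    Returns a dict support -> value; stopping early (small max_iters) still
    yields a lower bound because the iteration starts from a pessimistic value.
    """
    supports = list(supports)
    r_min = min(reward.values())
    val = {B: r_min / (1.0 - gamma) for B in supports}  # assumed initial lower bound
    for _ in range(max_iters):
        new = {}
        for B in supports:
            best = float("-inf")
            for a in actions:
                # Successor supports of B under a, grouped by observation.
                groups = {}
                for s in B:
                    for s2, p in delta[(s, a)].items():
                        if p > 0:
                            groups.setdefault(obs_of[s2], set()).add(s2)
                r = min(reward[(s, a)] for s in B)
                worst = min(r + gamma * val[frozenset(g)] for g in groups.values())
                best = max(best, worst)
            new[B] = best
        if all(abs(new[B] - val[B]) <= tol for B in supports):
            return new
        val = new
    return val
```

Calling the function with a small `max_iters` plays the role of the timeout: the returned numbers can be used as the function $\mu$ in condition (1), subject to the existence caveat for safe policies discussed in the text.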
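Important note on $\mu$-safe policies:
generally, $\mu(\mathrm{Supp}(\lambda)) \geq t$ does not guarantee that a $\mu$-safe policy exists, which is necessary to apply Proposition 1. The following lemma resolves this.
Lemma 3.
For any $i \geq 0$ the following holds for the functions $\mathrm{fVal}^{(i)}$ produced by game value iteration: if $\mathrm{fVal}^{(i)}(\mathrm{Supp}(\lambda)) \geq t$, then there exists a policy which is $\mathrm{fVal}^{(i)}$-safe for $t$.
In particular, if $\mathrm{fVal}(\mathrm{Supp}(\lambda)) \geq t$ then an $\mathrm{fVal}$-safe policy for $t$ exists, irrespective of the way in which $\mathrm{fVal}$ is computed.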
Figure 2: [...] as accumulated payoff. The vertical bars show the mean and standard deviation per worst-case threshold. (We have plotted at least 100 data points per worst-case threshold for the RockSample benchmark; 1000 for Example 1; 20 for the hallway benchmark.)
5 Solving the GPO problem
We solve the GPO problem by modifying the partially-observable Monte Carlo planning (POMCP) algorithm [SV10].
POMCP.
POMCP is an online planning method which in each decision epoch aims to select the best action given the current history $h$. In each epoch, POMCP performs a number of finite-horizon simulations starting from belief $b_h$ in order to compute a local approximation of the optimal expected value function: each simulation extends history $h$ by selecting actions according to certain rules until the horizon is reached. The payoff of the produced path is then evaluated, and the result is used to update the optimal value approximation. After all the simulations have finished, the best action according to the estimated values is played, a new observation is received, and the process continues as above.
POMCP data structure.
POMCP stores the information gained in past simulations in a search tree, in which each node corresponds to some history $h$ and contains belief $b_h$, the number of times the history has been observed in previous simulations, and an approximation of the optimal expected value from $b_h$. The search tree is used to guide simulations: each step in which the current history corresponds to an internal node of the tree is treated as a multi-armed bandit with parameters determined by numbers stored in children of this node, which balances exploration of new branches and exploitation of previous simulations (akin to the UCT algorithm for MDPs [KS06]). Once the simulation runs out of the scope of the search tree, it enters a rollout phase, where a fixed policy (e.g. selecting actions at random) is used to extend paths.
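The bandit-style choice made at internal nodes of the search tree can be illustrated with the UCB1 rule that underlies UCT. The sketch below is a generic illustration, not code from POMCP or from the tool described here; the node layout (a dictionary from actions to visit counts and value estimates) and the exploration constant `c` are assumptions.

```python
import math


def ucb1_select(children, total_visits, c=1.0):
    """Pick an action at a tree node using the UCB1 rule.

    children     : dict action -> (visit_count, value_estimate)
    total_visits : number of times the node itself has been visited
    c            : exploration constant (larger values explore more)
    """
    # Untried actions are explored first.
    for action, (count, _) in children.items():
        if count == 0:
            return action

    def score(action):
        count, value = children[action]
        return value + c * math.sqrt(math.log(total_visits) / count)

    return max(children, key=score)
```

In a threshold-aware variant, `children` would simply be restricted to the actions allowed for the current history before this rule is applied.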
GPOMCP: Adapting POMCP for GPO.
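We propose an augmentation of POMCP, which we call GPOMCP (guaranteed POMCP), specified as follows: First we enrich the nodes of the search tree so that a node corresponding to a history $h$ additionally includes the set $\mathrm{Supp}(b_h)$ and the number $\mathrm{rem}_t(h)$. When adding a new node to a search tree by extending history $h$ with action $a$ and observation $o$, these attributes for the new node are updated as follows: $\mathrm{Supp}(b_{hao}) = \mathrm{succ}(\mathrm{Supp}(b_h), a, o)$ and $\mathrm{rem}_t(hao) = (\mathrm{rem}_t(h) - r(\mathrm{Supp}(b_h), a)) / \gamma$, where $r(\mathrm{Supp}(b_h), a)$ is the (observable) reward of playing $a$ in the current belief support. Note that updating $\mathrm{Supp}(b_h)$ to $\mathrm{Supp}(b_{hao})$ requires just discrete set operations; as a matter of fact, the function $\mathrm{succ}$ is computed already during the offline computation of future values, after which it can be stored and used to efficiently update belief supports during GPOMCP execution. In particular, updating $\mathrm{Supp}(b_h)$ is independent of updating $b_h$, which is important so as not to compromise the threshold constraints with issues of belief precision and particle deprivation.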
GPOMCP: playing safe.
The execution of GPOMCP then proceeds in almost the same way as in POMCP, with a crucial exception: Whenever GPOMCP is to select a (real or simulated) action it selects only among those in $\mathrm{Allow}_t(h)$, where $h$ is the current history. Note that checking whether an action is allowed is easy for histories within the search tree, since the necessary information ($\mathrm{Supp}(b_h)$ and $\mathrm{rem}_t(h)$) is stored in nodes of the tree. Outside the scope of the search tree, we need to update the current belief support and remaining payoff online, as the simulation proceeds. This somewhat increases the complexity of rollouts, as current belief supports must be kept updated (POMCP only keeps track of the current state and of payoff won so far); however, as noted above, updating belief supports is easier than updating beliefs. Moreover, this increase in complexity is only an issue in the initial steps of the algorithm, where rollout steps dominate over tree traversal. Previous sections yield the following result:
Theorem 2.
For each threshold $t \leq \mathrm{fVal}(\mathrm{Supp}(\lambda))$ the following holds: for each play $\rho$ resulting from using GPOMCP on $P$ ad infinitum it holds $\mathrm{Disc}_\gamma(\rho) \geq t$. This holds independently of how precisely the algorithm approximates beliefs.
So unless it is impossible to satisfy the threshold constraint at all, it can be surely satisfied by using GPOMCP.
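To make the "playing safe" idea concrete, the following sketch shows a rollout that only ever picks allowed actions while tracking the belief support and the remaining threshold. It is a simplified illustration under explicit assumptions (deterministic and reward-revealing observations, uniformly random choice among allowed actions, a dictionary `mu` of lower bounds on future values); it is not the GPOMCP implementation, and all names are hypothetical.

```python
import random


def successors_by_obs(support, action, delta, obs_of):
    """Group the possible successors of a belief support by observation."""
    groups = {}
    for s in support:
        for s2, p in delta[(s, action)].items():
            if p > 0:
                groups.setdefault(obs_of[s2], set()).add(s2)
    return {o: frozenset(g) for o, g in groups.items()}


def allowed_actions(support, rem, actions, delta, reward, obs_of, mu, gamma):
    """Actions whose guaranteed one-step payoff plus discounted bound reaches rem."""
    return [a for a in actions
            if all(reward[(s, a)] + gamma * mu[nxt] >= rem
                   for s in support
                   for nxt in successors_by_obs(support, a, delta, obs_of).values())]


def safe_rollout(state, support, rem, depth, actions, delta, reward, obs_of,
                 mu, gamma, rng):
    """Simulate `depth` steps, choosing uniformly among allowed actions only."""
    payoff, discount = 0.0, 1.0
    for _ in range(depth):
        choices = allowed_actions(support, rem, actions, delta, reward, obs_of, mu, gamma)
        if not choices:
            raise RuntimeError("no allowed action: threshold not guaranteed by mu")
        a = rng.choice(choices)
        r = reward[(state, a)]  # assumed observable, so equal across the support
        nxt_states = list(delta[(state, a)])
        weights = [delta[(state, a)][s2] for s2 in nxt_states]
        nxt = rng.choices(nxt_states, weights=weights)[0]
        # Discrete updates only: belief probabilities are never needed here.
        support = successors_by_obs(support, a, delta, obs_of)[obs_of[nxt]]
        rem = (rem - r) / gamma
        payoff += discount * r
        discount *= gamma
        state = nxt
    return payoff
```

Because the support and remaining-threshold updates are purely discrete, the safety of the rollout does not depend on how accurately beliefs are approximated.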
Convergence.
Another question is the one of convergence. An algorithm is said to be convergent in the limit if, assuming precise belief representation, the local approximation of the optimal value converges to the true optimal value (in our case to $\mathrm{gVal}(t)$) as the number of simulations and their depth increase. The limit convergence of GPOMCP can be proved by a straightforward adaptation of the limit convergence proof of POMCP [SV10]: we map executions of GPOMCP on POMDP $P$ to the executions of UCT on a tree-shaped MDP, whose states are histories of $P$ (with the empty history as root) and where finite paths correspond to extending histories in $P$ by playing allowed actions.
6 Experiments
We tested our algorithm on two classical sets of benchmarks. The first, Hallway, was introduced in [LCK95]. In a hallway POMDP, a robot navigates a grid world with walls and traps. We have considered variants in which traps cause non-recoverable damage and another in which they just “spin” the robot, making it more uncertain about its current location in the grid. Additionally, we have run our algorithm on RockSample POMDPs. The latter correspond to the classical scenario described first in [SS04]. (We use a slight adaptation with a single imprecise sensing action.) Our experimental results are summarized in Figure 2 and Table 1.
Test Environment Specifications:
WorstCase vs. Expected Payoff.
In Figure 2 we have plotted the results of running our GPOMCP algorithm on several benchmarks. In all three plots, the trade-off between worst-case guarantees and expected payoff is clearly visible: in the left plot, the expected payoff stays roughly constant over a range of worst-case thresholds and then drops for larger threshold values. In the center plot, the expected payoff takes one value for a particular worst-case threshold, stays roughly constant (with a slightly negative slope) over an intermediate range of thresholds, and then drops for larger threshold values. Finally, in the right plot, the expected payoff steadily decreases for increasing worst-case threshold values.
Latency.
In Table 1 we show the latency — the amount of time it takes to determine, at each epoch, which action to play next — of GPOMCP on three of the benchmarks we considered. (Though we have run the tool on several others, these are the biggest.) Observe that, even for relatively big POMDPs, the average latency is in the order of seconds. Also, note that the preprocessing step is not too costly.
Table 1: Latency of GPOMCP.
Benchmark  No. states  No. actions  No. observations  Preprocessing time  Avg. latency

tiger  s  s
RockSample  s  s
hallway  s  s
Tool Availability.
Our implementation of the GPOMCP algorithm can be fetched from https://github.com/gaperez64/GPOMCP.
7 Discussion
In this work we have given a practical solution for the GPO problem. Our algorithm, GPOMCP, makes it possible to obtain a policy which ensures a worst-case discounted-sum payoff value while optimizing the expected payoff. We have implemented GPOMCP and evaluated its performance on classical families of benchmarks. Our experiments show that our approach is efficient despite the exact GPO problem being fundamentally more complicated.
Acknowledgements
The research leading to these results was supported by the Austrian Science Fund (FWF) NFN Grant no. S11407-N23 (RiSE/SHiNE); two ERC Starting grants (279307: Graph Games, 279499: inVEST); the Vienna Science and Technology Fund (WWTF) through project ICT15-003; and the People Programme (Marie Curie Actions) of the European Union’s Seventh Framework Programme (FP7/2007-2013) under REA grant agreement no. [291734].
References
 [BFRR14] Véronique Bruyère, Emmanuel Filiot, Mickael Randour, and Jean-François Raskin. Meet Your Expectations With Guarantees: Beyond Worst-Case Synthesis in Quantitative Games. In Ernst W. Mayr and Natacha Portier, editors, STACS, volume 25 of LIPIcs, pages 199–213. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2014.
 [BHO15] U. Boker, T. A. Henzinger, and J. Otop. The Target Discounted-Sum Problem. In LICS, pages 750–761, July 2015.
 [Bre16] Romain Brenguier. A solver for Mean Payoff Games, based on gain and bias equations and the Z3 SMT solver. https://github.com/romainbrenguier/MeanPayoffSolver, 2016. Accessed date: 2016-08-07.
 [Cas98] A.R. Cassandra. Exact and approximate algorithms for partially observable Markov decision processes. Brown University, 1998.
 [CCGK14] Krishnendu Chatterjee, Martin Chmelik, Raghav Gupta, and Ayush Kanodia. Optimal Cost Almost-sure Reachability in POMDPs. CoRR, abs/1411.3880, 2014.
 [CCGK15] K. Chatterjee, M. Chmelik, R. Gupta, and A. Kanodia. Optimal Cost Almost-sure Reachability in POMDPs. In AAAI. AAAI Press, 2015.
 [CKK15] Krishnendu Chatterjee, Zuzana Komárková, and Jan Kretínský. Unifying Two Views on Multiple Mean-Payoff Objectives in Markov Decision Processes. In LICS, pages 244–256. IEEE Computer Society, 2015.
 [HYV16] Ping Hou, William Yeoh, and Pradeep Varakantham. Solving Risk-Sensitive POMDPs With and Without Cost Observations. In Dale Schuurmans and Michael P. Wellman, editors, AAAI, pages 3138–3144. AAAI Press, 2016.
 [KGFP09] H. Kress-Gazit, G. E. Fainekos, and G. J. Pappas. Temporal-Logic-Based Reactive Mission and Motion Planning. IEEE Transactions on Robotics, 25(6):1370–1381, 2009.
 [KHL08] H. Kurniawati, D. Hsu, and W.S. Lee. SARSOP: Efficient Point-Based POMDP Planning by Approximating Optimally Reachable Belief Spaces. In Robotics: Science and Systems, pages 65–72, 2008.
 [KLC98] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1):99–134, 1998.
 [KLM96] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
 [KS06] Levente Kocsis and Csaba Szepesvári. Bandit Based Monte-Carlo Planning. In Johannes Fürnkranz, Tobias Scheffer, and Myra Spiliopoulou, editors, ECML, volume 4212 of LNCS, pages 282–293. Springer, 2006.
 [LCK95] M. L. Littman, A. R. Cassandra, and L. P Kaelbling. Learning Policies for Partially Observable Environments: Scaling Up. In ICML, pages 362–370, 1995.
 [LDK95] Michael L. Littman, Thomas L. Dean, and Leslie Pack Kaelbling. On the Complexity of Solving Markov Decision Problems. In Philippe Besnard and Steve Hanks, editors, UAI, pages 394–402. Morgan Kaufmann, 1995.
 [Lit96] M. L. Littman. Algorithms for Sequential Decision Making. PhD thesis, Brown University, 1996.

 [PMP15] Pascal Poupart, Aarti Malhotra, Pei Pei, Kee-Eung Kim, Bongseok Goh, and Michael Bowling. Approximate Linear Programming for Constrained Partially Observable Markov Decision Processes. In AAAI, pages 3342–3348. AAAI Press, 2015.
 [PT87] C. H. Papadimitriou and J. N. Tsitsiklis. The complexity of Markov Decision Processes. Mathematics of Operations Research, 12:441–450, 1987.
 [Put05] M. L. Puterman. Markov Decision Processes. Wiley-Interscience, 2005.
 [RN10] Stuart J. Russell and Peter Norvig. Artificial Intelligence - A Modern Approach (3. internat. ed.). Pearson Education, 2010.
 [RPPCd08] Stéphane Ross, Joelle Pineau, Sébastien Paquet, and Brahim Chaib-draa. Online Planning Algorithms for POMDPs. J. Artif. Intell. Res. (JAIR), 32:663–704, 2008.
 [RRS15] Mickael Randour, Jean-François Raskin, and Ocan Sankur. Variations on the Stochastic Shortest Path Problem. In Deepak D’Souza, Akash Lal, and Kim Guldstrand Larsen, editors, VMCAI, volume 8931 of LNCS, pages 1–18. Springer, 2015.
 [Son71] E. J. Sondik. The Optimal Control of Partially Observable Markov Processes. Stanford University, 1971.
 [SS04] T. Smith and R. Simmons. Heuristic search value iteration for POMDPs. In UAI, pages 520–527. AUAI Press, 2004.
 [STW16] Pedro Henrique de Rodrigues Quemel e Assis Santana, Sylvie Thiébaux, and Brian C. Williams. RAO*: An Algorithm for Chance-Constrained POMDP’s. In AAAI, pages 3308–3314. AAAI Press, 2016.
 [SV05] Matthijs T. J. Spaan and Nikos A. Vlassis. Perseus: Randomized Point-based Value Iteration for POMDPs. J. Artif. Intell. Res. (JAIR), 24:195–220, 2005.
 [SV10] David Silver and Joel Veness. Monte-Carlo Planning in Large POMDPs. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 2164–2172. Curran Associates, Inc., 2010.
 [UH10] Aditya Undurti and Jonathan P. How. An online algorithm for constrained POMDPs. In Robotics and Automation (ICRA), 2010 IEEE International Conference on, pages 3966–3973. IEEE, 2010.
 [Whi93] D.J. White. Minimizing a Threshold Probability in Discounted Markov Decision Processes. Journal of Mathematical Analysis and Applications, 173(2):634–646, March 1993.
 [ZP96] U. Zwick and M. Paterson. The Complexity of Mean Payoff Games on Graphs. Theoretical Computer Science, 158(1–2):343–359, 1996.
Appendix A Examples of Section 2
We present a detailed analysis of all possible policies and identify the policy that maximizes the expected payoff. First, observe that a policy is uniquely determined if the first action performed is in the set . The remaining case is to perform action times for some (if the transition to succeeds before all actions have been performed, the policy is still uniquely determined), and then perform some action in the set . Alternatively, one can simply perform until is reached. The expected payoffs for each of these cases are computed below.

: performed first

: performed first

: performed first

: performed until transition to is successful

: performed times, then

: performed times, then

: performed times, then
It is hence clear that in Example 1 the expected payoff is optimized for . In Example 2, however, if we introduce a threshold , this policy does not work: if the initial state is , the payoff is . Among the policies listed above, , , and do not satisfy the imposed worst-case condition, as the payoff may be . If returns us to the initial state at least three times, the total payoff is at most , so and also do not satisfy the condition for . Hence, the policies satisfying the worst-case condition are and for . It is easily verified from the above that optimizes the expected payoff with , and the worst case is achieved if both fail with .
Appendix B On the assumption of observable rewards (Section 4)
If the rewards of a given POMDP are not observable, the computation of future values is at least as hard as solving the target discounted-sum problem, a long-standing open problem in automata theory related to other open problems in algebra [BHO15].
Under-approximation of .
For POMDPs with non-observable rewards, there is a straightforward way of obtaining an under-approximation of . Following the value iteration algorithm for discounted-sum games outlined in Section 4 and detailed in [HM15], it is possible to obtain the exact future values. Furthermore, it is easy to see that the functions generated by the algorithm get ever closer to the actual future values. Hence, stopping the iteration at any yields the desired under-approximation. (Note that for this argument to be valid, the reward function must assign a nonnegative value to every transition. However, this assumption is no loss of generality since, for any given POMDP, the threshold and the rewards of all the transitions can be “shifted and scaled” so that the assumption holds.)
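The monotonicity argument can be illustrated with a minimal sketch of value iteration for the worst-case discounted sum on a finite weighted arena. The dictionary encoding of the arena, the function name, and the two-state example are assumptions made for illustration only; the sketch is not the algorithm of [HM15] verbatim. Starting from the all-zero function and assuming nonnegative weights, each iterate is bounded by the next, so stopping after any number of steps gives a lower bound on the limit.

```python
from typing import Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable
# arena[s][a] lists the (successor, weight) pairs the adversary may pick from
# after action a is played in state s (hypothetical encoding).
Arena = Dict[State, Dict[Action, List[Tuple[State, float]]]]


def worst_case_value_iteration(
    arena: Arena, discount: float, steps: int
) -> Dict[State, float]:
    """Apply V(s) <- max_a min_(t, w) [w + discount * V(t)] `steps` times, from V = 0."""
    values = {s: 0.0 for s in arena}
    for _ in range(steps):
        values = {
            s: max(
                min(w + discount * values[t] for t, w in successors)
                for successors in actions.values()
            )
            for s, actions in arena.items()
        }
    return values


# Hypothetical two-state arena with nonnegative weights.
arena: Arena = {
    "p": {"a": [("p", 1.0), ("q", 0.0)], "b": [("q", 2.0)]},
    "q": {"a": [("q", 1.0)]},
}
v5 = worst_case_value_iteration(arena, discount=0.5, steps=5)
v50 = worst_case_value_iteration(arena, discount=0.5, steps=50)
```

In this example the early iterate `v5` lies below `v50` in every state, which is the property the under-approximation relies on.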
Appendix C Formal Proof of Lemma 1 and Theorem 1
In this section we argue that, for POMDPs with observable rewards, we can reduce the computation of a policy with worst-case value above a given threshold to the computation of a policy, with the same property, in a full-observation discounted-sum game. This will give us access to the theoretical tools developed for that kind of game by the formal verification community. The idea is simple: we will construct a weighted arena in which states correspond to subsets of states from the POMDP with the same observation, and the new transitions model transitions with nonzero probability in the POMDP. This subset construction captures the fact that in a POMDP, after any history, any one from a set of possible states with the same observation could be the actual state of the system. The assumption that the POMDP has observable rewards will then allow us to weight the transitions of the arena without losing information about the original POMDP.
We observe that this reduction, and the fact that the policy we are looking for in the original POMDP can be directly obtained from the constructed discounted-sum game, imply that the probabilities of the POMDP do not really matter when considering the worst-case value. Thus, Lemma 1 follows.
Given a POMDP with observable rewards, we construct the weighted arena where:

is a finite set of states;

is the set of initial states;

includes transitions of the form if and for any ;

is a weight function of the form determined by as follows: for any .
A play or infinite path in a weighted arena is a sequence of states and actions s.t. and for all we have . We denote by the set of all plays. A (finite) path is a finite prefix of a play ending in a state. Since the game has full observation, a history in a weighted arena is simply a path. The discounted sum of a play is defined as for POMDPs but using the weight function instead of . The definitions for policy and worst-case value are then identical. (For clarity, we write instead of when referring to the worst-case value in .)
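The subset construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's formal definition: the POMDP is encoded by hypothetical dictionaries `succ` (successors with nonzero probability), `obs` (observation of each state), and `reward`, and "observable rewards" is taken to mean that the reward of a transition is determined by the action and the observations of its endpoints. Only the support of the transition function is used, in line with the remark that probabilities do not matter for the worst-case value.

```python
from collections import defaultdict, deque


def split_by_observation(states, obs):
    """Partition `states` into frozensets of states sharing an observation."""
    groups = defaultdict(set)
    for s in states:
        groups[obs[s]].add(s)
    return {frozenset(g) for g in groups.values()}


def build_arena(initial, actions, succ, obs, reward):
    """Return (initial_supports, edges) with edges[(q, a)] = [(q2, weight), ...].

    succ[(s, a)] is the set of states reachable from s under a with nonzero
    probability; reward[(s, a, t)] is the reward of that transition.
    """
    init = split_by_observation(initial, obs)
    edges = {}
    seen, queue = set(init), deque(init)
    while queue:
        q = queue.popleft()
        for a in actions:
            targets = {t for s in q for t in succ.get((s, a), ())}
            out = []
            for q2 in split_by_observation(targets, obs):
                # With observable rewards, all transitions from q to q2 under a
                # must carry the same reward; it becomes the edge weight.
                weights = {
                    reward[(s, a, t)]
                    for s in q
                    for t in succ.get((s, a), ())
                    if t in q2
                }
                if len(weights) != 1:
                    raise ValueError("rewards are not observable")
                out.append((q2, weights.pop()))
                if q2 not in seen:
                    seen.add(q2)
                    queue.append(q2)
            if out:
                edges[(q, a)] = out
    return init, edges


# Hypothetical four-state POMDP: s0 and s1 share observation "x".
obs = {"s0": "x", "s1": "x", "s2": "y", "s3": "z"}
succ = {
    ("s0", "a"): {"s2"},
    ("s1", "a"): {"s2", "s3"},
    ("s2", "a"): {"s2"},
    ("s3", "a"): {"s3"},
}
reward = {
    ("s0", "a", "s2"): 1.0,
    ("s1", "a", "s2"): 1.0,
    ("s1", "a", "s3"): 0.0,
    ("s2", "a", "s2"): 0.0,
    ("s3", "a", "s3"): 0.0,
}
init, edges = build_arena({"s0", "s1"}, ["a"], succ, obs, reward)
```

Only subsets reachable from the initial supports are generated, which keeps the sketch small; the arena can still be exponential in the number of POMDP states in the worst case.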
From histories of the POMDP to histories in the game.
We now define a mapping from observation-action sequences to state-action sequences in the constructed weighted arena. For a history from we let where and for all we have .
Claim 1.
The function is a bijective function from histories in to paths in .
Proof.
Clearly is injective. We argue that it is also surjective, and hence bijective. Consider a path from . We have that where for any and for all . It remains to show that there is a path in s.t. , to conclude that is a valid history in . By construction of we have that, for all , for all states there is s.t. . The result follows by induction. ∎
It follows that there are bijective mappings from policies in to policies in , and from plays in to plays in . For a policy in , let us denote by the corresponding policy in ; for a play in , for the play in .
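The mapping from histories to paths can be pictured as tracking the set of states compatible with what has been observed so far. The sketch below assumes a history is given as a list alternating observations and actions, and reuses a hypothetical encoding of the POMDP by `succ` (nonzero-probability successors) and `obs`; these names and the example are illustrative and do not come from the paper.

```python
def history_to_path(history, initial, succ, obs):
    """Map [o0, a0, o1, ..., on] to [q0, a0, q1, ..., qn].

    Each q_i is the frozenset of states that are possible after the first
    i steps and carry observation o_i.
    """
    if len(history) % 2 == 0:
        raise ValueError("a history must end with an observation")
    q = frozenset(s for s in initial if obs[s] == history[0])
    if not q:
        raise ValueError("first observation is incompatible with the initial states")
    path = [q]
    for i in range(1, len(history), 2):
        action, observation = history[i], history[i + 1]
        q = frozenset(
            t
            for s in q
            for t in succ.get((s, action), ())
            if obs[t] == observation
        )
        if not q:
            raise ValueError("sequence is not a history of the POMDP")
        path += [action, q]
    return path


# Hypothetical POMDP: s0 and s1 share observation "x".
obs = {"s0": "x", "s1": "x", "s2": "y", "s3": "z"}
succ = {
    ("s0", "a"): {"s2"},
    ("s1", "a"): {"s2", "s3"},
    ("s2", "a"): {"s2"},
    ("s3", "a"): {"s3"},
}
path = history_to_path(["x", "a", "y", "a", "y"], {"s0", "s1"}, succ, obs)
```

Because every set on the path consists of states with a single observation and the actions are copied unchanged, the history can be read back from the path, which is the intuition behind injectivity.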
Lemma 4.
For any policy in and for any policy in , if then .
Proof.
First, note that since has observable rewards, then for all histories we have that for any two paths s.t. the following holds:
Furthermore, by construction of we also have that
Thus, for the result to follow, it suffices to show that for any policy in and corresponding in , if then is also bijective when restricted to plays consistent with and in the respective structures. We proceed by induction. Note that for any history in with only one observation and consistent with we have that is consistent with since no choice has been made by the policies. Conversely, for any path in with only one element, and consistent with , is consistent with for the same reason. Hence, for some , is a bijective function from histories in to paths in , all of length at most . Consider a history in consistent with and let us write . By the induction hypothesis, we know is consistent with . Observe that:

and therefore since is consistent with ;

by definition of a history, there is some path in with ; and

by construction of and definition of we have that and .
It follows that is also consistent with . To show the other direction, we now take a path in consistent with and write . It follows from the induction hypothesis that is consistent with . Since , we have that . Also, for any we have . Hence the claim holds and the result follows by induction. ∎
It follows from the above arguments that computing the worst-case value can be done in exponential time for POMDPs with discounted sum and observable rewards. This is, in fact, a tight complexity result. Indeed, safety and reachability games with partial observation are EXP-hard [CD10] even if the objective is observable. One can easily reduce either of them to a discounted-sum objective in a POMDP by placing rewards or costs on target (or unsafe) transitions (depending on the game we reduce from) and asking for a nonnegative worst-case value. Therefore, deciding a threshold problem for the worst-case value in POMDPs with discounted sum is EXP-complete.
Theorem 3.
The worst-case threshold problem for POMDPs with discounted sum and observable rewards is EXP-complete.
Appendix D Formal Proof of Proposition 1
Assume we are given a POMDP with observable rewards and we have constructed the corresponding weighted arena