# Optimizing Expectation with Guarantees in POMDPs (Technical Report)

A standard objective in partially-observable Markov decision processes (POMDPs) is to find a policy that maximizes the expected discounted-sum payoff. However, such policies may still permit unlikely but highly undesirable outcomes, which is problematic especially in safety-critical applications. Recently, there has been a surge of interest in POMDPs where the goal is to maximize the probability to ensure that the payoff is at least a given threshold, but these approaches do not consider any optimization beyond satisfying this threshold constraint. In this work we go beyond both the "expectation" and "threshold" approaches and consider a "guaranteed payoff optimization (GPO)" problem for POMDPs, where we are given a threshold t and the objective is to find a policy σ such that a) each possible outcome of σ yields a discounted-sum payoff of at least t, and b) the expected discounted-sum payoff of σ is optimal (or near-optimal) among all policies satisfying a). We present a practical approach to tackle the GPO problem and evaluate it on standard POMDP benchmarks.


## 1 Introduction

The de facto model for decision making under uncertainty is the partially-observable Markov decision process (POMDP) [Lit96, PT87], which has been applied in diverse areas ranging from planning [RN10] and reinforcement learning [KLM96] to robotics [KGFP09, KLC98]. One of the classical and fundamental payoff functions for POMDPs is the discounted-sum payoff, which aggregates the rewards of the transitions as a discounted sum. The traditional objective in POMDPs has been to obtain policies that maximize the expected discounted-sum payoff.

One crucial drawback of the traditional objective (which asks for expectation maximization) is that it allows for undesirable events that happen with low probability. For example, consider a policy that achieves a high payoff with some probability and a low payoff otherwise, and a different policy that always achieves an intermediate payoff. If payoffs below this intermediate value are undesirable, then the first policy, though better for expected payoff, allows undesirable events with significant probability, and hence the second policy is preferable. Hence, there has been recent interest in objectives where, instead of maximizing the expected payoff, the goal is to maximize the probability that the payoff is above a threshold [HYV16].

A drawback of the approach to maximize the probability that the payoff exceeds a threshold is that it ignores the optimization aspect of maximizing the expectation. In this work we consider an objective for POMDPs where both aspects are present. More precisely, we consider a “guaranteed payoff optimization (GPO)” problem for POMDPs, where given a threshold $t$, the goal is to maximize the expectation while ensuring that the payoff is at least $t$.

As a concrete motivation for the GPO problem, consider planning under uncertainty (e.g., self-driving cars) where certain events are catastrophic (e.g., crashes), and in the model they are assigned low payoffs. Such catastrophic events must be avoided even at the expense of expected payoff. That is, policies must maximize the expected payoff, ensuring the avoidance of catastrophic events. Hence, for planning in safety-critical applications the GPO problem is natural.

In this work, our main contributions are as follows:

1. We study the GPO problem for POMDPs, and present a practical solution approach for the problem. In particular, given a POMDP with the GPO problem, we present a transformation to a different POMDP where it suffices to solve the traditional expectation objective. Our solution approach first constructs a representation of all policies that satisfy item a) of the GPO problem, and then we extend the partially-observable Monte Carlo planning (POMCP) approach to obtain optimal policies w.r.t. expectation among these policies.

2. We present experimental results on several classical POMDP examples from the literature to show how our approach can efficiently solve the GPO problem for POMDPs.

#### Related Works.

Works studying POMDPs with discounted sum range from theoretical results (see, e.g., [PT87, Lit96]) to practical tools (e.g. [KHL08, SV10]). Recent works focus on extracting policies which ensure that, with a given probability bound, the obtained discounted-sum payoff is above a threshold (see, e.g., [HYV16]). The problem of ensuring that the payoff is above a given threshold while optimizing the expectation has been considered for fully-observable MDPs with long-run average and stochastic shortest path objectives [BFRR14, RRS15], and also with probabilistic thresholds for long-run average payoff [CKK15]. As for POMDPs, we mention constrained POMDPs [UH10, PMP15], where the aim is to maximize the expected payoff while ensuring that the expectation of some other quantity is bounded. In contrast, our constraints are hard, i.e. they must always hold, not just on average. The work probably closest to ours is [STW16], which also considers maximizing expected payoff among all policies satisfying a given constraint, but there are two key differences from our work: they consider finite-horizon POMDPs, while we consider infinite-horizon ones, and more importantly, their constraints are state-based, i.e. their policy must ensure that the execution of the POMDP does not go through certain “violating” states. In contrast, our “threshold constraint” is execution-based: whether an execution yields payoff at least $t$ cannot be determined solely by looking at the set of states appearing in the execution; the whole infinite execution has to be considered. This requires very different techniques. To the best of our knowledge, the GPO problem has never been considered for POMDPs with discounted sum.

## 2 Preliminaries

Throughout this work, we follow standard (PO)MDP notations from [Put05, Lit96].

### 2.1 POMDPs

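We denote by $\mathcal{D}(X)$ the set of all probability distributions on a finite set $X$, i.e. all functions $f: X \to [0,1]$ such that $\sum_{x \in X} f(x) = 1$. For $f \in \mathcal{D}(X)$ we denote by $\mathrm{Supp}(f)$ the support of $f$, i.e. the set $\{x \in X \mid f(x) > 0\}$.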

###### Definition 1.

POMDPs. A POMDP is defined as a tuple $P = (S, A, \delta, r, Z, O, \lambda)$ where $S$ is a finite set of states, $A$ is a finite alphabet of actions, $\delta: S \times A \to \mathcal{D}(S)$ is a probabilistic transition function that given a state and an action gives the probability distribution over the successor states, $r: S \times A \to \mathbb{R}$ is a reward function, $Z$ is a finite set of observations, $O: S \to \mathcal{D}(Z)$ is a probabilistic observation function that maps every state to a distribution over observations, and $\lambda \in \mathcal{D}(S)$ is the initial belief. We abbreviate $\delta(s,a)(s')$ by $\delta(s' \mid s,a)$.

###### Remark 1 (Deterministic observation function).

Deterministic observation functions of type $O: S \to Z$ are sufficient in POMDPs (see the corresponding remark in [CCGK14]). Informally, the probabilistic aspect of the observation function can be encoded into the transition function and, by letting the product of the states and observations be the new state space, we obtain a deterministic observation function. Thus, without loss of generality, we will always consider observation functions of type $O: S \to Z$, which greatly simplifies the notation.

#### Plays & Histories.

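A play (or an infinite path) in a POMDP is an infinite sequence $\rho = s_0 a_0 s_1 a_1 s_2 a_2 \ldots$ of states and actions such that $s_0 \in \mathrm{Supp}(\lambda)$, where $\lambda$ is the initial belief, and for all $i \ge 0$ we have $\delta(s_{i+1} \mid s_i, a_i) > 0$, where $\delta$ is the transition function. We write $\mathsf{Plays}(P)$ for the set of all plays. A finite path (or just path) is a finite prefix of a play ending with a state, i.e. a sequence from $(S \cdot A)^{*} \cdot S$. A history is a finite sequence of actions and observations $h = a_0 o_1 \ldots a_{i-1} o_i \in (A \cdot Z)^{*}$ such that there is a path $w = s_0 a_0 s_1 \ldots a_{i-1} s_i$ with $o_j = O(s_j)$ for each $1 \le j \le i$. We write $h = H(w)$ to indicate that history $h$ corresponds to a path $w$. The length of a path (or history) $w$, denoted by $\mathit{len}(w)$, is the number of actions in $w$, and the length of a play $\rho$ is $\mathit{len}(\rho) = \infty$.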

#### Beliefs.

A belief is a distribution on states (i.e. an element of $\mathcal{D}(S)$) indicating the probability of being in each particular state given the current history. The initial belief $\lambda$ is given as part of the POMDP. Then, in each step, when the history observed so far is $h$, the current belief is $b_h$, an action $a \in A$ is played and an observation $o \in Z$ is received, the updated belief $b_{h'}$ for history $h' = hao$ can be computed by a standard formula [Cas98].
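The standard belief update can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in this work: it assumes deterministic observations (as in Remark 1), and the names `update_belief`, `delta`, and `obs_of` are hypothetical. Here `delta[(s, a)]` maps each successor state to its transition probability, and `obs_of[s]` is the observation emitted in state `s`.

```python
from typing import Dict, Hashable, Tuple

State = Hashable
Action = Hashable
Obs = Hashable
Belief = Dict[State, float]


def update_belief(
    belief: Belief,
    action: Action,
    obs: Obs,
    delta: Dict[Tuple[State, Action], Dict[State, float]],
    obs_of: Dict[State, Obs],
) -> Belief:
    """Bayesian belief update for a POMDP with deterministic observations."""
    unnormalized: Belief = {}
    for s, p in belief.items():
        for s2, q in delta[(s, action)].items():
            # Keep only successors that emit the received observation.
            if obs_of[s2] == obs:
                unnormalized[s2] = unnormalized.get(s2, 0.0) + p * q
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation is impossible under this belief and action")
    # Normalize so that the result is again a probability distribution.
    return {s2: w / total for s2, w in unnormalized.items()}
```

The support of the returned belief depends only on the support of the input belief, the action, and the observation, not on the actual probabilities.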

#### Infinite-horizon Discounted Payoff.

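Given a play $\rho = s_0 a_0 s_1 a_1 \ldots$ and a discount factor $0 \le \gamma < 1$, the infinite-horizon discounted payoff of $\rho$ is:

$$\mathsf{Disc}_\gamma(\rho) = \sum_{i=0}^{\infty} \gamma^{i} \, r(s_i, a_i).$$

We also define a discounted payoff of a finite path $w = s_0 a_0 \ldots s_n$ as $\mathsf{Disc}_\gamma(w) = \sum_{i=0}^{\mathit{len}(w)-1} \gamma^{i} \, r(s_i, a_i)$.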
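For a finite path, the discounted sum is truncated after the last action. A minimal Python sketch of this computation, where the function name `discounted_payoff` is hypothetical and the path is represented only by its sequence of rewards:

```python
from typing import Sequence


def discounted_payoff(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_i gamma**i * rewards[i] for the rewards along a finite path."""
    total = 0.0
    # Horner-style evaluation: process the rewards from the last to the first.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# A path that collects a lump-sum reward of 100 in its third step.
example = discounted_payoff([0.0, 0.0, 100.0], gamma=0.5)
```

With `gamma=0.5` the delayed reward of 100 is worth 25 at the start of the path, which shows how a one-step delay directly lowers the payoff.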

#### Policies.

A policy is a blueprint for selecting actions based on the past history of observations and actions. Formally, it is a function $\sigma$ which assigns to a history a probability distribution over the actions, i.e. $\sigma(h)(a)$ is the probability of selecting action $a$ after observing history $h$ (we often abbreviate $\sigma(h)(a)$ to $\sigma(a \mid h)$).

#### Consistent Plays.

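A play or a path is consistent with a policy $\sigma$ if it can be obtained by extending its finite prefixes using $\sigma$. Formally, a path $w = s_0 a_0 \ldots s_n$ is consistent with $\sigma$ if for each $0 \le i < n$ the action $a_i$ satisfies $\sigma(a_i \mid H(s_0 a_0 \ldots s_i)) > 0$ and $\delta(s_{i+1} \mid s_i, a_i) > 0$; a play is consistent with $\sigma$ if all of its finite prefixes are. A history $h$ is consistent with $\sigma$ if there is a path $w$ consistent with $\sigma$ such that $H(w) = h$.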

#### Expected Value eValP of Policies.

Given a POMDP $P$, a policy $\sigma$, a discount factor $\gamma$, and an initial belief $\lambda$, the expected value of $\sigma$ from $\lambda$ is the expected value of the infinite-horizon discounted sum under policy $\sigma$ when starting in a state sampled from $\lambda$: $\mathsf{eVal}^{P}(\sigma) = \mathbb{E}^{\sigma}_{\lambda}[\mathsf{Disc}_\gamma]$. This definition can be formalized by a standard construction of a probability measure induced by $\sigma$ over the set of all plays, which also gives rise to the expectation operator $\mathbb{E}^{\sigma}_{\lambda}$ (see, e.g., [Put05]).

#### Worst-Case Value wValP of Policies.

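The worst-case value of a policy $\sigma$ from belief $\lambda$ is $\mathsf{wVal}^{P}(\sigma) = \inf_{\rho} \mathsf{Disc}_\gamma(\rho)$, where the infimum is taken over the set of all plays $\rho$ that are consistent with $\sigma$ and start in a state sampled from $\lambda$.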

###### Example 1.

Figure 1 shows a toy POMDP: A mining robot has to mine ore, which can be of two types (states and ). The exact type is unknown, but is more likely to occur (initial belief ). The goal is to reach the “ore mined” () state, in which a lump-sum reward is received. The robot can use several mining modes: safe mode (action ), which succeeds with probability and does not do anything if it fails, or type-specific mining modes ( and ) which succeed if applied on the correct type but result in a catastrophic failure if used on a wrong type. It can also use a sensor to accurately determine the type (after which a type-specific action can be safely used), at a cost of a one-step delay.

An exhaustive analysis of possible policies reveals that the expected value is maximized by any policy which selects in the first step (we then have ). However, the worst-case value of such a policy is , as it can result in entering after the first step. On the other hand, a policy which plays in the first step has .

#### Main Computational Questions.

The standard POMDP planning problem asks to compute (or approximate) the policy maximizing the expected value. In online POMDP planning, instead of computing the whole policy we have to compute, in each time step, the best action in the current situation. In other words, we must compute a good local approximation of a (near-)optimal policy [RPPCd08]. In contrast, in the threshold planning problem we are asked to compute a policy maximizing the worst-case value and thus provide strict guarantees on the performance of the system [ZP96]. In this paper, we combine these two approaches and study the guaranteed payoff optimization (GPO) problem, where we are given a POMDP $P$ and a threshold $t$ and we have to compute a policy $\sigma$ such that

1. $\sigma$ satisfies a threshold constraint: $\mathsf{wVal}^{P}(\sigma)$ is at least $t$.

2. Let denote the best expected value obtainable while ensuring a worst-case payoff of at least , i.e. . Among all policies that satisfy item a), has -maximal expected value, i.e.

To efficiently tackle the GPO problem we aim to compute, in an online fashion, a local approximation of the policy $\sigma$ above. However, we do not relax requirement a). Approximations notwithstanding, the online planning algorithm we seek is such that, given $t$, the discounted payoff of every single play that can be produced by the algorithm is at least $t$.

###### Example 2.

Take the POMDP in Figure 1 and a threshold . As shown in Example 1, a policy playing in the first step satisfies . However, there are better (w.r.t. the expected value) policies satisfying this constraint. The best such policy is a policy which twice plays and then plays . This policy satisfies and . (Also note that the optimal policy to maximize the expected payoff plays at the very start. However, with non-zero probability, this strategy violates the worst-case threshold .)

## 3 Policies for GPO Problem

We first show the GPO problem is different from the classical expectation maximization.

###### Example 3 (Beliefs are not sufficient for GPO.).

It is known that beliefs form a sufficient statistic of history for achieving the optimal expected value, i.e. there is always a deterministic belief-based policy — that is, a policy such that for each history the distribution is Dirac and determined solely by the belief after observing — with optimal expected value [Son71]. However, beliefs are not a sufficient statistic for the GPO problem, as witnessed by Example 2: suppose that we use policy and consider histories and , where is the observation received in and . The beliefs and are identical, and yet , i.e. is not belief-based.

#### Overview of Policy Representation.

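We show (in Corollary 1) that a sufficient statistic for solving the GPO problem is a tuple $(b_h, \mathsf{rem}^{t}_{\gamma}(h))$, where $b_h$ is the belief after history $h$ and $\mathsf{rem}^{t}_{\gamma}(h)$ is the “remaining” distance to the threshold which we need to accumulate in the future. Formally,

$$\mathsf{rem}^{t}_{\gamma}(h) = \big(t - \min\{\mathsf{Disc}_\gamma(w) \mid H(w) = h\}\big) \,/\, \gamma^{\mathit{len}(h)}.$$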

This is similar to other (PO)MDP planning problems that work with thresholds [Whi93, HYV16]. However, we prove more: we obtain a precise local characterization of policies that satisfy the threshold constraint. More precisely, we show that for each history $h$ there is a set $\mathsf{Allow}^{t}_{\gamma}(h)$ of allowed actions such that a policy $\sigma$ satisfies $\mathsf{wVal}(\sigma) \ge t$ if and only if for each history $h$ consistent with $\sigma$ it holds $\mathrm{Supp}(\sigma(h)) \subseteq \mathsf{Allow}^{t}_{\gamma}(h)$. We show that the function $\mathsf{Allow}^{t}_{\gamma}$ can be finitely represented and, for any history $h$, its value can be computed algorithmically. This permits us to split the solution of the GPO problem into two separate parts: 1.) we compute the function $\mathsf{Allow}^{t}_{\gamma}$, and 2.) we use it to restrict a standard online planning algorithm so that it always returns an action allowed for the current history.

#### Allowed Actions Allowtγ.

Intuitively, an action $a$ should be allowed after some history $h$ only if the payoff we are guaranteed to accumulate using $a$ in the current step (i.e. $\min_{s \in \mathrm{Supp}(b_h)} r(s,a)$) plus the best discounted payoff which we can guarantee from the next step onward is at least $\mathsf{rem}^{t}_{\gamma}(h)$. To formalize the “best payoff guaranteed from the next step on” we define the future value of any history $h$ as

$$\mathsf{fVal}(h) = \sup_{\sigma} \mathsf{wVal}^{P[b_h]}(\sigma),$$

where $P[b_h]$ is a POMDP identical to $P$ except for having initial belief $b_h$ and the supremum is taken over all policies in $P[b_h]$.

#### Belief Supports Suffice for the Worst Case.

The crucial observation is that the future value of a history $h$ is determined only by the support of $b_h$.

###### Lemma 1.

If histories $h, h'$ in a POMDP $P$ are such that $\mathrm{Supp}(b_h) = \mathrm{Supp}(b_{h'})$, then $\mathsf{fVal}(h) = \mathsf{fVal}(h')$.

Intuitively, this is because the worst-case value of a policy (and thus also a future value of a history) does not depend on any transition probabilities. In a slight abuse of notation, we sometimes treat $\mathsf{fVal}$ as a function from $2^{S}$ to $\mathbb{R}$, i.e. $\mathsf{fVal}(B)$, for $B \subseteq S$, is equal to $\mathsf{fVal}(h)$ for all histories $h$ such that $\mathrm{Supp}(b_h) = B$.

#### Ψ as an Approximation of fVal.

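Since computing $\mathsf{fVal}$ exactly can be inefficient in practice, we often need to work with approximations of $\mathsf{fVal}$, without relaxing the threshold constraint. We thus introduce a notion of a $\Psi$-allowed action. Let $\Psi: 2^{S} \to \mathbb{R}$ be a function assigning numbers to belief supports. We say that an action $a$ is $\Psi$-allowed for $t$ after history $h$, and write it $a \in \Psi\text{-}\mathsf{Allow}^{t}_{\gamma}(h)$, if for all states $s \in \mathrm{Supp}(b_h)$ and all observations $o$ such that $hao$ is a history it holds that

$$r(s,a) + \gamma \cdot \Psi(\mathrm{Supp}(b_{hao})) \ge \mathsf{rem}^{t}_{\gamma}(h). \tag{1}$$

If $\Psi$ is the function $\mathsf{fVal}$, we write simply $\mathsf{Allow}^{t}_{\gamma}(h)$. We typically aim at computing a lower bound on $\mathsf{fVal}$, i.e. a function $\Psi$ such that $\Psi(B) \le \mathsf{fVal}(B)$ for each belief support $B$. Then, as shown below, playing $\Psi$-allowed actions still guarantees that the threshold is eventually surpassed.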
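Condition (1) translates into a direct membership test on belief supports. The following Python sketch is illustrative only and rests on several assumptions: observations are deterministic, `trans[(s, a)]` is the set of states reachable from `s` under `a` with positive probability, `psi` is a dictionary from belief supports (frozensets of states) to numbers, and all function and parameter names are hypothetical.

```python
from typing import Dict, FrozenSet, Hashable, List, Sequence, Set, Tuple

State = Hashable
Action = Hashable
Obs = Hashable
Support = FrozenSet[State]


def successor_supports(
    support: Support,
    action: Action,
    trans: Dict[Tuple[State, Action], Set[State]],
    obs_of: Dict[State, Obs],
) -> Dict[Obs, Support]:
    """Group the possible successor states of a support by their observation."""
    by_obs: Dict[Obs, Set[State]] = {}
    for s in support:
        for s2 in trans[(s, action)]:
            by_obs.setdefault(obs_of[s2], set()).add(s2)
    return {o: frozenset(states) for o, states in by_obs.items()}


def psi_allowed_actions(
    support: Support,
    rem: float,
    actions: Sequence[Action],
    trans: Dict[Tuple[State, Action], Set[State]],
    reward: Dict[Tuple[State, Action], float],
    obs_of: Dict[State, Obs],
    psi: Dict[Support, float],
    gamma: float,
) -> List[Action]:
    """Return the actions satisfying condition (1) for the given support and rem."""
    allowed = []
    for a in actions:
        successors = successor_supports(support, a, trans, obs_of).values()
        # (1) must hold for every current state and every possible observation.
        if all(
            reward[(s, a)] + gamma * psi[b2] >= rem
            for s in support
            for b2 in successors
        ):
            allowed.append(a)
    return allowed
```

The test uses only the current belief support and the remaining payoff, never the belief probabilities, so it is unaffected by how precisely beliefs are represented.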

#### Correctness of the Approximation.

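The correctness of the definition is summarized in the following proposition. We say that a policy $\sigma$ is $\Psi$-safe for $t$ if for each history $h$ consistent with $\sigma$ it holds that $\mathrm{Supp}(\sigma(h)) \subseteq \Psi\text{-}\mathsf{Allow}^{t}_{\gamma}(h)$.

###### Proposition 1.

Let $\Psi$ be a function such that $\Psi(B) \le \mathsf{fVal}(B)$ for each belief support $B$. Then any policy $\sigma$ that is $\Psi$-safe for $t$ satisfies $\mathsf{wVal}(\sigma) \ge t$. Moreover, a policy $\sigma$ is $\mathsf{fVal}$-safe for $t$ if and only if $\mathsf{wVal}(\sigma) \ge t$.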

###### Corollary 1.

Assume that there is a policy with . Then there is also a policy such that and , and moreover, is belief-and-payoff based, i.e. for all histories such that it holds .

From (1) we see that to compute $\mathsf{Allow}^{t}_{\gamma}(h)$ we have to keep track of $\mathsf{rem}^{t}_{\gamma}(h)$ (which can be easily done online) and to compute $\mathsf{fVal}$ (or a suitable under-approximation thereof). In the next section we show how to do the latter.

###### Example 4.

Consider the POMDP from Figure 1 with a threshold . Then , , , and . Initially, for the empty history, we have and therefore the only allowed actions are and because for all we have Suppose that is played and that the next observation witnessed is (thus, the belief is the same as before). We have . In this case, the only allowed action is because for all and and are still not allowed (since we have not accumulated any payoff and have the same belief as before). Hence, is played and consequently we obtain a payoff of (because of discounting). We remark that is, as required, above the threshold .

## 4 Computing Future Values

The threshold constraint in the GPO problem is global, i.e. it talks about all runs compatible with a policy. Hence, solving the GPO problem is unlikely to be amenable to purely online methods, which compute only local approximations of policies. In this section we show how to compute future values in an offline pre-processing step. Although this requires a global analysis of a POMDP, the pre-processing step can be done efficiently since computation of future values only requires working with belief supports rather than beliefs.

#### Belief Supports & Valid Belief Supports VBelSup.

A belief support $B \subseteq S$ is valid if either $B = \mathrm{Supp}(\lambda)$ or there is a history $h$ such that $B = \mathrm{Supp}(b_h)$. Only valid supports can be encountered during the planning process and thus we only need to compute future values thereof. We denote by $\mathsf{VBelSup}$ the set of valid belief supports of POMDP $P$; the set $\mathsf{VBelSup}$ can be computed by a simple iterative procedure.
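One way to realize such an iterative procedure is a breadth-first exploration from the initial support. The sketch below is an assumption about how this could look, not the procedure used in this work; it assumes deterministic observations, `trans[(s, a)]` is the set of states reachable with positive probability, and all names are hypothetical.

```python
from collections import deque
from typing import Dict, FrozenSet, Hashable, Iterable, Sequence, Set, Tuple

State = Hashable
Action = Hashable
Obs = Hashable
Support = FrozenSet[State]


def valid_belief_supports(
    initial_support: Iterable[State],
    actions: Sequence[Action],
    trans: Dict[Tuple[State, Action], Set[State]],
    obs_of: Dict[State, Obs],
) -> Set[Support]:
    """Collect all belief supports reachable from the initial support."""
    start = frozenset(initial_support)
    seen = {start}
    queue = deque([start])
    while queue:
        support = queue.popleft()
        for a in actions:
            # Split the possible successors according to the observation they emit.
            by_obs: Dict[Obs, Set[State]] = {}
            for s in support:
                for s2 in trans[(s, a)]:
                    by_obs.setdefault(obs_of[s2], set()).add(s2)
            for states in by_obs.values():
                successor = frozenset(states)
                if successor not in seen:
                    seen.add(successor)
                    queue.append(successor)
    return seen
```

The loop terminates because there are finitely many subsets of states, and in many models only a small fraction of them is reachable.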

#### Observable Rewards.

We present efficient computation of future values under the assumption that rewards are observable. This holds for many real-world applications, see, e.g. examples in [HYV16, CCGK15]. Formally, POMDP $P$ has observable rewards if $r(s,a) = r(s',a)$ whenever $O(s) = O(s')$. From a theoretical point of view, observability of rewards is necessary: if the rewards of a given POMDP are not observable, the computation of future values is at least as hard as solving the target discounted-sum problem, a long-standing open problem in automata theory related to other open problems in algebra [BHO15]. However, for POMDPs with unobservable rewards we can at least obtain an under-approximation of $\mathsf{fVal}$, and hence our framework is also applicable to them.

###### Lemma 2.

If rewards in $P$ are observable, then for each $B \in \mathsf{VBelSup}$, $a \in A$ and each $s, s' \in B$ it holds $r(s,a) = r(s',a)$.

We thus define $r(B,a)$ as $r(s,a)$ for some $s \in B$.

#### Future Value Characterization.

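We start by providing a characterization of future values. A successor of a belief support $B$ under action $a$ and observation $o$ is the belief support $\Delta(B,a,o) = \{ s' \in S \mid O(s') = o \text{ and } \delta(s' \mid s,a) > 0 \text{ for some } s \in B \}$. Consider the following system of max-min equations with variables $x_B$, $B \in \mathsf{VBelSup}$:

$$x_B = \max_{a \in A} \; \min_{\substack{o \in Z \\ \Delta(B,a,o) \neq \emptyset}} r(B,a) + \gamma \cdot x_{\Delta(B,a,o)}. \tag{2}$$

(Each $x_B$ appears on the LHS of exactly one equation in the system.)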

###### Proposition 2.

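The system (2) has a unique solution $(\bar{x}_B)_{B \in \mathsf{VBelSup}}$, and it satisfies $\bar{x}_B = \mathsf{fVal}(B)$ for each $B \in \mathsf{VBelSup}$.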

#### Game Perspective for the Worst Case.

Hence, it suffices to find a solution to system (2). But the form of the system is identical to the one characterizing optimal values in 2-player zero-sum discounted games [ZP96]. These games can be imagined as fully-observable MDPs in which the outcomes of actions are not resolved by a random choice but by a malicious adversary. The system (2) per se corresponds to a game where elements of $\mathsf{VBelSup}$ are the states, actions are the same as in $P$, and possible effects of actions are given by the function $\Delta$.

#### Algorithms to Compute Future Values.

Hence, to compute future values in practice we can employ one of several efficient algorithms for solving discounted-sum games (e.g. [Bre16]). A simple yet efficient approach is to use the standard value iteration for games: we compute a sequence of functions $f^{(0)}, f^{(1)}, f^{(2)}, \ldots$ of type $\mathsf{VBelSup} \to \mathbb{R}$, starting from an initial function $f^{(0)}$, and for $i \ge 1$ we inductively define

$$f^{(i)}(B) = \max_{a \in A} \; \min_{\substack{o \in Z \\ \Delta(B,a,o) \neq \emptyset}} r(B,a) + \gamma \cdot f^{(i-1)}(\Delta(B,a,o)).$$

From [ZP96] it follows that there is always an $i$ such that $f^{(i)}$ is the solution to (2), and moreover that this $i$ can be bounded in terms of the denominator of $\gamma$ in its reduced form. Hence, the value iteration converges in at most exponentially many steps (the denominator can be exponential in the bitsize of the input).

###### Theorem 1.

Future values of all valid belief supports in $P$ can be computed in time exponential in the size of $P$.

Although the theoretical bound is exponential, there are several reasons for the method to work well in practice: (1.) In a concrete instance, the number of valid supports can be significantly smaller than exponential. (2.) Reaching the fixed point of the value iteration may also require a significantly smaller number of steps than the theoretical upper bound suggests. (3.) One can show that for each $i$, $f^{(i)} \le \mathsf{fVal}$. Hence, even if reaching the fixed point takes too much time, we can set up a suitable timeout after which the value iteration is stopped, say at iteration $i$. Then, by Proposition 1, any policy that is $f^{(i)}$-safe for $t$ has worst-case value at least $t$. (4.) Value iteration is a simple and standard algorithm for which efficient implementations exist (see, e.g., [LDK95, SV05]).
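The value iteration on belief supports, stopped after a fixed number of iterations as in point (3.), can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated here: rewards are observable, observations are deterministic, every state has at least one successor under every action, and the iteration starts from the constant function $r_{\min}/(1-\gamma)$, where $r_{\min}$ is the smallest reward. That initialization is an assumption of this sketch, chosen because it makes the iterates a nondecreasing sequence of lower bounds. All names are hypothetical.

```python
from typing import Dict, FrozenSet, Hashable, Sequence, Set, Tuple

State = Hashable
Action = Hashable
Obs = Hashable
Support = FrozenSet[State]


def successor_supports(
    support: Support,
    action: Action,
    trans: Dict[Tuple[State, Action], Set[State]],
    obs_of: Dict[State, Obs],
) -> Dict[Obs, Support]:
    """Group the possible successor states of a support by their observation."""
    by_obs: Dict[Obs, Set[State]] = {}
    for s in support:
        for s2 in trans[(s, action)]:
            by_obs.setdefault(obs_of[s2], set()).add(s2)
    return {o: frozenset(states) for o, states in by_obs.items()}


def game_value_iteration(
    supports: Sequence[Support],
    actions: Sequence[Action],
    trans: Dict[Tuple[State, Action], Set[State]],
    reward: Dict[Tuple[State, Action], float],
    obs_of: Dict[State, Obs],
    gamma: float,
    iterations: int,
) -> Dict[Support, float]:
    """Run a fixed number of max-min value iteration steps on belief supports."""
    r_min = min(reward.values())
    # Assumed initialization: a uniform lower bound on every discounted payoff.
    f = {b: r_min / (1.0 - gamma) for b in supports}
    for _ in range(iterations):
        new_f = {}
        for b in supports:
            # Observable rewards: any state of the support gives r(B, a).
            representative = next(iter(b))
            best = float("-inf")
            for a in actions:
                succs = successor_supports(b, a, trans, obs_of).values()
                worst = min(f[b2] for b2 in succs)  # adversarial observation
                best = max(best, reward[(representative, a)] + gamma * worst)
            new_f[b] = best
        f = new_f
    return f
```

`supports` must be closed under successors, for example the set of all valid belief supports. Stopping early yields a function that can be used as $\Psi$ in condition (1).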

#### Important note on Ψ:

generally, $\Psi \le \mathsf{fVal}$ does not guarantee that a $\Psi$-safe policy exists, which is necessary to apply Proposition 1. The following lemma resolves this.

###### Lemma 3.

For any $i$ the following holds for the functions $f^{(i)}$ produced by game value iteration: if $f^{(i)}(\mathrm{Supp}(\lambda)) \ge t$, then there exists a policy which is $f^{(i)}$-safe for $t$.

In particular, if $\mathsf{fVal}(\mathrm{Supp}(\lambda)) \ge t$ then an $\mathsf{fVal}$-safe policy for $t$ exists, irrespective of the way in which $\mathsf{fVal}$ is computed.

## 5 Solving the GPO problem

We solve the GPO problem by modifying the partially-observable Monte Carlo planning (POMCP) algorithm [SV10].

#### POMCP.

POMCP is an online planning method which in each decision epoch aims to select the best action given the current history $h$. In each epoch, POMCP performs a number of finite-horizon simulations starting from belief $b_h$ in order to compute a local approximation of the optimal expected value function: each simulation extends history $h$ by selecting actions according to certain rules until the horizon is reached. The payoff of the produced path is then evaluated, and the result is used to update the optimal value approximation. After all the simulations have finished, the best action according to the estimated values is played, a new observation is received, and the process continues as above.

#### POMCP data-structure.

POMCP stores the information gained in past simulations in a search tree, in which each node corresponds to some history $h$ and contains the belief $b_h$, the number of times the history has been observed in previous simulations, and an approximation of the optimal expected value from $b_h$. The search tree is used to guide simulations: each step in which the current history corresponds to an internal node of the tree is treated as a multi-armed bandit with parameters determined by numbers stored in children of this node, which balances exploration of new branches and exploitation of previous simulations (akin to the UCT algorithm for MDPs [KS06]). Once the simulation runs out of the scope of the search tree, it enters a rollout phase, where a fixed policy (e.g. selecting actions at random) is used to extend paths.

#### G-POMCP: Adapting POMCP for GPO.

We propose an augmentation of POMCP, which we call G-POMCP (guaranteed POMCP), specified as follows: First we enrich the nodes of the search tree so that a node corresponding to a history $h$ additionally includes the set $\mathrm{Supp}(b_h)$ and the number $\mathsf{rem}^{t}_{\gamma}(h)$. When adding a new node to a search tree by extending history $h$ with action $a$ and observation $o$, these attributes for the new node are updated as follows: $\mathrm{Supp}(b_{hao}) = \Delta(\mathrm{Supp}(b_h), a, o)$ and $\mathsf{rem}^{t}_{\gamma}(hao) = (\mathsf{rem}^{t}_{\gamma}(h) - r(\mathrm{Supp}(b_h), a)) / \gamma$. Note that updating $\mathrm{Supp}(b_h)$ to $\mathrm{Supp}(b_{hao})$ requires just discrete set operations; as a matter of fact, the function $\Delta$ is computed already during the off-line computation of future values, after which it can be stored and used to efficiently update belief supports during G-POMCP execution. In particular, updating $\mathrm{Supp}(b_h)$ is independent of updating $b_h$, which is important so as not to compromise the threshold constraints with issues of belief precision and particle deprivation.
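The bookkeeping for a new node can be sketched as below. The recurrence for the remaining payoff follows from the definition of $\mathsf{rem}^{t}_{\gamma}$ when rewards are observable: dividing by $\gamma$ once more and subtracting the reward just collected gives $(\mathsf{rem} - r)/\gamma$. The sketch assumes observable rewards and deterministic observations, and the names `extend_node`, `trans`, and `obs_of` are hypothetical.

```python
from typing import Dict, FrozenSet, Hashable, Set, Tuple

State = Hashable
Action = Hashable
Obs = Hashable
Support = FrozenSet[State]


def extend_node(
    support: Support,
    rem: float,
    action: Action,
    obs: Obs,
    trans: Dict[Tuple[State, Action], Set[State]],
    reward: Dict[Tuple[State, Action], float],
    obs_of: Dict[State, Obs],
    gamma: float,
) -> Tuple[Support, float]:
    """Compute the belief support and remaining payoff of a child node."""
    # Successor support: states reachable under the action that emit `obs`.
    successor = frozenset(
        s2 for s in support for s2 in trans[(s, action)] if obs_of[s2] == obs
    )
    if not successor:
        raise ValueError("observation cannot follow this action from this support")
    # Observable rewards: every state in the support yields the same reward.
    step_reward = reward[(next(iter(support)), action)]
    return successor, (rem - step_reward) / gamma
```

At the root the remaining payoff is the threshold itself, since the empty history has length 0 and no payoff has been accumulated yet.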

#### G-POMCP: playing safe.

The execution of G-POMCP then proceeds in almost the same way as in POMCP, with a crucial exception: whenever G-POMCP is to select a (real or simulated) action, it selects only among those in $\mathsf{Allow}^{t}_{\gamma}(h)$, where $h$ is the current history. Note that checking whether an action is allowed is easy for histories within the search tree, since the necessary information ($\mathrm{Supp}(b_h)$ and $\mathsf{rem}^{t}_{\gamma}(h)$) is stored in nodes of the tree. Out of the scope of the search tree, we need to update the current belief support and remaining payoff online, as the simulation proceeds. While this somewhat increases the complexity of rollouts, as current belief supports must be kept updated (POMCP only keeps track of the current state and of payoff won so far), as noted above, updating belief supports is easier than updating beliefs. Moreover, this increase in complexity is only an issue in the initial steps of the algorithm, where rollout steps dominate over tree traversal. Previous sections yield the following result:

###### Theorem 2.

For each threshold $t \le \mathsf{fVal}(\mathrm{Supp}(\lambda))$ the following holds: for each play $\rho$ resulting from using G-POMCP on $P$ ad infinitum it holds $\mathsf{Disc}_\gamma(\rho) \ge t$. This holds independently of how precisely the algorithm approximates beliefs.

So unless it is impossible to satisfy the threshold constraint at all, it can be surely satisfied by using G-POMCP.
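The restriction to allowed actions can be illustrated on the bandit step inside the search tree. The sketch below applies the UCB1 rule used by UCT, but only over a precomputed list of allowed actions. It is a simplified illustration rather than the tool's code: the function name, the dictionaries `visits` and `values`, and the exploration constant are all hypothetical.

```python
import math
from typing import Dict, Hashable, Sequence

Action = Hashable


def select_allowed_action(
    allowed: Sequence[Action],
    visits: Dict[Action, int],
    values: Dict[Action, float],
    exploration: float = 1.0,
) -> Action:
    """Pick an action by UCB1, considering only the allowed actions."""
    if not allowed:
        raise ValueError("no allowed action: the threshold cannot be guaranteed")
    # Try every allowed action at least once before comparing estimates.
    for a in allowed:
        if visits.get(a, 0) == 0:
            return a
    total = sum(visits[a] for a in allowed)

    def ucb(a: Action) -> float:
        bonus = exploration * math.sqrt(math.log(total) / visits[a])
        return values[a] + bonus

    return max(allowed, key=ucb)
```

An action outside `allowed` is never returned, no matter how high its estimated value is, which is why the guarantee does not depend on the quality of the value estimates.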

#### Convergence.

Another question is the one of convergence. An algorithm is said to be convergent in the limit if, assuming precise belief representation, the local approximation of the optimal value converges to the true optimal value (in our case to the best expected value obtainable while ensuring a worst-case payoff of at least $t$) as the number of simulations and their depth increase. The limit convergence of G-POMCP can be proved by a straightforward adaptation of the limit convergence proof of POMCP [SV10]: we map executions of G-POMCP on POMDP $P$ to the executions of UCT on a tree-shaped MDP whose states are histories of $P$ (with the empty history as root) and where finite paths correspond to extending histories in $P$ by playing allowed actions.

## 6 Experiments

We tested our algorithm on two classical sets of benchmarks. The first, Hallway, was introduced in [LCK95]. In a Hallway POMDP, a robot navigates a gridworld with walls and traps. We considered one variant in which traps cause non-recoverable damage and another in which they just “spin” the robot, making it more uncertain about its current location in the grid. Additionally, we ran our algorithm on RockSample POMDPs, which correspond to the classical scenario first described in [SS04]. (We use a slight adaptation with a single imprecise sensing action.) Our experimental results are summarized in Figure 2 and Table 1.

#### Test Environment Specifications:

CPU: -Core Intel Xeon, GHz, cores; Memory: KB of L Cache, MB of L Cache, GB; OS: Mac OS X .

#### Worst-Case vs. Expected Payoff.

In Figure 2 we have plotted the results of running our G-POMCP algorithm on several benchmarks. In all three graphics, the trade-off between worst-case guarantees and expected payoff is clearly visible: In the left figure, the expected payoff stays around for worst-case thresholds between and ; then drops to for threshold values above . In the center figure, the expected payoff is when the worst-case threshold is ; stays around for thresholds between and (with a slightly negative slope); then drops to for threshold values above . Finally, in the right figure, the expected payoff steadily decreases for increasing worst-case threshold values. In particular, for threshold the expected payoff is while for threshold it is .

#### Latency.

In Table 1 we show the latency — the amount of time it takes to determine, at each epoch, which action to play next — of G-POMCP on three of the benchmarks we considered. (Though we have run the tool on several others, these are the biggest.) Observe that, even for relatively big POMDPs, the average latency is in the order of seconds. Also, note that the pre-processing step is not too costly.

#### Tool Availability.

Our implementation of the G-POMCP algorithm can be fetched from https://github.com/gaperez64/GPOMCP.

## 7 Discussion

In this work we have given a practical solution for the GPO problem. Our algorithm, G-POMCP, allows one to obtain a policy which ensures a worst-case discounted-sum payoff value while optimizing the expected payoff. We have implemented G-POMCP and evaluated its performance on classical families of benchmarks. Our experiments show that our approach is efficient despite the exact GPO problem being fundamentally more complicated.

## Acknowledgements

The research leading to these results was supported by the Austrian Science Fund (FWF) NFN Grant no. S11407-N23 (RiSE/SHiNE); two ERC Starting grants (279307: Graph Games, 279499: inVEST); the Vienna Science and Technology Fund (WWTF) through project ICT15-003; and the People Programme (Marie Curie Actions) of the European Union’s Seventh Framework Programme (FP7/2007-2013) under REA grant agreement no. [291734].

## References

• [BFRR14] Véronique Bruyère, Emmanuel Filiot, Mickael Randour, and Jean-François Raskin. Meet Your Expectations With Guarantees: Beyond Worst-Case Synthesis in Quantitative Games. In Ernst W. Mayr and Natacha Portier, editors, STACS, volume 25 of LIPIcs, pages 199–213. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2014.
• [BHO15] U. Boker, T. A. Henzinger, and J. Otop. The Target Discounted-Sum Problem. In LICS, pages 750–761, July 2015.
• [Bre16] Romain Brenguier. A solver for Mean Payoff Games, based on gain and bias equations and the Z3 SMT solver. Accessed date: 2016-08-07.
• [Cas98] A.R. Cassandra. Exact and approximate algorithms for partially observable Markov decision processes. Brown University, 1998.
• [CCGK14] Krishnendu Chatterjee, Martin Chmelik, Raghav Gupta, and Ayush Kanodia. Optimal Cost Almost-sure Reachability in POMDPs. CoRR, abs/1411.3880, 2014.
• [CCGK15] K. Chatterjee, M. Chmelik, R. Gupta, and A. Kanodia. Optimal Cost Almost-sure Reachability in POMDPs. In AAAI. AAAI Press, 2015.
• [CKK15] Krishnendu Chatterjee, Zuzana Komárková, and Jan Kretínský. Unifying Two Views on Multiple Mean-Payoff Objectives in Markov Decision Processes. In LICS, pages 244–256. IEEE Computer Society, 2015.
• [HYV16] Ping Hou, William Yeoh, and Pradeep Varakantham. Solving Risk-Sensitive POMDPs With and Without Cost Observations. In Dale Schuurmans and Michael P. Wellman, editors, AAAI, pages 3138–3144. AAAI Press, 2016.
• [KGFP09] H. Kress-Gazit, G. E. Fainekos, and G. J. Pappas. Temporal-Logic-Based Reactive Mission and Motion Planning. IEEE Transactions on Robotics, 25(6):1370–1381, 2009.
• [KHL08] H. Kurniawati, D. Hsu, and W.S. Lee. SARSOP: Efficient Point-Based POMDP Planning by Approximating Optimally Reachable Belief Spaces. In Robotics: Science and Systems, pages 65–72, 2008.
• [KLC98] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1):99–134, 1998.
• [KLM96] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
• [KS06] Levente Kocsis and Csaba Szepesvári. Bandit Based Monte-Carlo Planning. In Johannes Fürnkranz, Tobias Scheffer, and Myra Spiliopoulou, editors, ECML, volume 4212 of LNCS, pages 282–293. Springer, 2006.
• [LCK95] M. L. Littman, A. R. Cassandra, and L. P Kaelbling. Learning Policies for Partially Observable Environments: Scaling Up. In ICML, pages 362–370, 1995.
• [LDK95] Michael L. Littman, Thomas L. Dean, and Leslie Pack Kaelbling. On the Complexity of Solving Markov Decision Problems. In Philippe Besnard and Steve Hanks, editors, UAI, pages 394–402. Morgan Kaufmann, 1995.
• [Lit96] M. L. Littman. Algorithms for Sequential Decision Making. PhD thesis, Brown University, 1996.
• [PMP15] Pascal Poupart, Aarti Malhotra, Pei Pei, Kee-Eung Kim, Bongseok Goh, and Michael Bowling. Approximate Linear Programming for Constrained Partially Observable Markov Decision Processes. In AAAI, pages 3342–3348. AAAI Press, 2015.
• [PT87] C. H. Papadimitriou and J. N. Tsitsiklis. The complexity of Markov Decision Processes. Mathematics of Operations Research, 12:441–450, 1987.
• [Put05] M. L. Puterman. Markov Decision Processes. Wiley-Interscience, 2005.
• [RN10] Stuart J. Russell and Peter Norvig. Artificial Intelligence - A Modern Approach (3. internat. ed.). Pearson Education, 2010.
• [RPPCd08] Stéphane Ross, Joelle Pineau, Sébastien Paquet, and Brahim Chaib-draa. Online Planning Algorithms for POMDPs. J. Artif. Intell. Res. (JAIR), 32:663–704, 2008.
• [RRS15] Mickael Randour, Jean-François Raskin, and Ocan Sankur. Variations on the Stochastic Shortest Path Problem. In Deepak D’Souza, Akash Lal, and Kim Guldstrand Larsen, editors, VMCAI, volume 8931 of LNCS, pages 1–18. Springer, 2015.
• [Son71] E. J. Sondik. The Optimal Control of Partially Observable Markov Processes. Stanford University, 1971.
• [SS04] T. Smith and R. Simmons. Heuristic search value iteration for POMDPs. In UAI, pages 520–527. AUAI Press, 2004.
• [STW16] Pedro Henrique de Rodrigues Quemel e Assis Santana, Sylvie Thiébaux, and Brian C. Williams. RAO*: An Algorithm for Chance-Constrained POMDP’s. In AAAI, pages 3308–3314. AAAI Press, 2016.
• [SV05] Matthijs T. J. Spaan and Nikos A. Vlassis. Perseus: Randomized Point-based Value Iteration for POMDPs. J. Artif. Intell. Res. (JAIR), 24:195–220, 2005.
• [SV10] David Silver and Joel Veness. Monte-Carlo Planning in Large POMDPs. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 2164–2172. Curran Associates, Inc., 2010.
• [UH10] Aditya Undurti and Jonathan P How. An online algorithm for constrained POMDPs. In Robotics and Automation (ICRA), 2010 IEEE International Conference on, pages 3966–3973. IEEE, 2010.
• [Whi93] D.J. White. Minimizing a Threshold Probability in Discounted Markov Decision Processes. Journal of Mathematical Analysis and Applications, 173(2):634–646, March 1993.
• [ZP96] U. Zwick and M. Paterson. The Complexity of Mean Payoff Games on Graphs. Theoretical Computer Science, 158(1–2):343–359, 1996.

## Appendix A Examples of Section 2

We present a detailed analysis of all possible policies and identify the one that optimizes the expected payoff. First, observe that a policy is uniquely determined if the first performed action is in the set . The remaining case is to perform action times for some (if we successfully make transition to before performing all actions , policy is still uniquely determined), and then perform some action in the set . Alternatively, it is possible to just perform until is successfully reached. The expected payoffs for each of the cases listed above are computed below.

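• : performed first

$$\mathit{eVal}^{P}(\sigma_{1}) = 0.9 \cdot [1 \cdot 0 + \gamma \cdot 100] = 45.$$

• : performed first

$$\mathit{eVal}^{P}(\sigma_{2}) = 0.1 \cdot [1 \cdot 0 + \gamma \cdot 100] = 5.$$

• : performed first

$$\mathit{eVal}^{P}(\sigma_{\mathit{sense}}) = 0 + \gamma \cdot 0 + \gamma^{2} \cdot 100 = 25.$$

• : performed until transition to is successful

$$\mathit{eVal}^{P}(\sigma_{\mathit{ms}}) = \sum_{k=0}^{\infty} \left(\tfrac{2}{5}\right)^{k} \cdot \tfrac{3}{5} \cdot \gamma^{k+1} \cdot 100 = 37.5.$$

• : performed times, then

$$\mathit{eVal}^{P}(\sigma^{n}_{1}) = \sum_{k=0}^{n-1} \left(\tfrac{2}{5}\right)^{k} \cdot \tfrac{3}{5} \cdot \gamma^{k+1} \cdot 100 + \left(\tfrac{2}{5}\right)^{n} \cdot [0.9 \cdot \gamma^{n+1} \cdot 100 + 0.1 \cdot 0] = 37.5 + \frac{7.5}{5^{n}}.$$

• : performed times, then

$$\mathit{eVal}^{P}(\sigma^{n}_{2}) = \sum_{k=0}^{n-1} \left(\tfrac{2}{5}\right)^{k} \cdot \tfrac{3}{5} \cdot \gamma^{k+1} \cdot 100 + \left(\tfrac{2}{5}\right)^{n} \cdot [0.1 \cdot \gamma^{n+1} \cdot 100 + 0.9 \cdot 0] = 37.5 - \frac{32.5}{5^{n}}.$$

• : performed times, then

$$\mathit{eVal}^{P}(\sigma^{n}_{\mathit{sense}}) = \sum_{k=0}^{n-1} \left(\tfrac{2}{5}\right)^{k} \cdot \tfrac{3}{5} \cdot \gamma^{k+1} \cdot 100 + \left(\tfrac{2}{5}\right)^{n} \cdot \gamma^{n+2} \cdot 100 = 37.5 - \frac{12.5}{5^{n}}.$$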
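The closed forms above can be checked numerically. The following is a minimal sketch, not part of the original report, and it rests on assumptions: the discount factor is taken to be γ = 1/2 (inferred from 0.9 · γ · 100 = 45), and the probabilities 3/5 and 2/5 are read as the success and failure probabilities of the repeated action. All function names are hypothetical.

```python
from fractions import Fraction as F

# Assumed constants, inferred from the numbers in the formulas above.
GAMMA = F(1, 2)                  # discount factor (0.9 * GAMMA * 100 = 45)
P_OK, P_FAIL = F(3, 5), F(2, 5)  # success / failure of the repeated action
REWARD = 100


def repeated_prefix(n):
    """Expected payoff gained when the repeated action succeeds within n tries."""
    return sum(P_FAIL**k * P_OK * GAMMA**(k + 1) * REWARD for k in range(n))


def e_val_ms():
    """Repeat until success: geometric series summed in closed form."""
    return P_OK * GAMMA * REWARD / (1 - P_FAIL * GAMMA)


def e_val_n1(n):
    """n failed tries, then the action that succeeds with probability 0.9."""
    return repeated_prefix(n) + P_FAIL**n * F(9, 10) * GAMMA**(n + 1) * REWARD


def e_val_n2(n):
    """n failed tries, then the action that succeeds with probability 0.1."""
    return repeated_prefix(n) + P_FAIL**n * F(1, 10) * GAMMA**(n + 1) * REWARD


def e_val_nsense(n):
    """n failed tries, then sensing, which delays the reward by one extra step."""
    return repeated_prefix(n) + P_FAIL**n * GAMMA**(n + 2) * REWARD


def closed_form(offset, n):
    """The right-hand sides above: 37.5 + offset / 5^n."""
    return F(75, 2) + offset / F(5)**n


if __name__ == "__main__":
    for n in range(4):
        print(n, float(e_val_n1(n)), float(e_val_n2(n)), float(e_val_nsense(n)))
```

Exact rational arithmetic is used so that the comparison with the closed forms involves no rounding. Under these assumptions, all three families approach 37.5 as n grows, and only the first one approaches it from above.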

It is hence clear that in Example 1 the expected payoff is optimized for . In Example 2 though, if we introduce a threshold , this policy does not work as if the initial state is , payoff is . Looking above at possible policies, , , and do not satisfy the imposed worst-case condition as we may have payoff . If returns us to the initial state for at least three times, total payoff is at most , so and also do not satisfy the condition for . Hence, policies satisfying the worst case condition are and for . It is easily verified from above that optimizes expected payoff with , and the worst case is achieved if both fail with .

## Appendix B On the assumption of observable rewards (Section 4)

If the rewards of a given POMDP are not observable, the computation of future values is at least as hard as solving the target discounted sum problem, a long-standing open problem in automata theory related to other open problems in algebra [BHO15].

#### Under-approximation of fVal.

For POMDPs with non-observable rewards, there is a straightforward way of obtaining an under-approximation of . Following the value iteration algorithm for discounted-sum games outlined in Section 4 and detailed in [HM15], it is possible to obtain the exact future values. Furthermore, it is easy to see that the functions generated by the algorithm get ever closer to the actual future values. Hence, stopping the iteration at any yields the desired under-approximation. (Note that for this argument to be valid, the reward function must assign to every transition a non-negative value. However, this assumption is no loss of generality since, for any given POMDP, the threshold and the rewards of all the transitions can be “shifted and scaled” so that the assumption holds.)
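The monotonicity argument can be illustrated on a small full-observation weighted arena. The sketch below is an illustration under assumptions, not the algorithm of the report: it runs a textbook max-min value iteration for worst-case discounted sum, starting from the all-zero function, on a hypothetical two-state arena with non-negative weights. The arena encoding and the function name are made up for this example.

```python
from fractions import Fraction


def worst_case_value_iteration(arena, gamma, steps):
    """Return the list of iterates V_0, ..., V_steps.

    arena maps each state to a dict {action: [(weight, successor), ...]}.
    The controller picks the action (max); the successor is resolved
    adversarially (min). Every state is assumed to have at least one
    action, and every action at least one outcome.
    """
    values = {q: 0 for q in arena}  # V_0 = 0, a lower bound if weights >= 0
    history = [dict(values)]
    for _ in range(steps):
        values = {
            q: max(
                min(w + gamma * values[q2] for (w, q2) in outcomes)
                for outcomes in arena[q].values()
            )
            for q in arena
        }
        history.append(dict(values))
    return history


# Hypothetical arena: in "a" the safe action loops with weight 1, the
# risky action may loop with weight 0 or move to "b" with weight 4.
arena = {
    "a": {"safe": [(1, "a")], "risky": [(0, "a"), (4, "b")]},
    "b": {"stay": [(2, "b")]},
}
history = worst_case_value_iteration(arena, Fraction(1, 2), 10)
```

In this example the worst-case values are 2 for state "a" and 4 for state "b" (fixed points of the update with γ = 1/2). Every iterate stays below them and each iterate dominates the previous one, which is the property that makes early stopping yield an under-approximation when weights are non-negative.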

## Appendix C Formal Proof of Lemma 1 and Theorem 1

In this section we argue that, for POMDPs with observable rewards, we can reduce the computation of a policy with worst-case value above a given threshold to the computation of a policy, with the same property, in a full-observation discounted-sum game. This will give us access to the theoretical tools developed for that kind of game by the formal verification community. The idea is simple: we will construct a weighted arena in which states correspond to subsets of states from the POMDP with the same observation, and the new transitions model transitions with non-zero probability in the POMDP. This subset construction captures the fact that in a POMDP, after any history, any one from a set of possible states with the same observation could be the actual state of the system. The assumption that the POMDP has observable rewards will then allow us to weight the transitions of the arena without losing information about the original POMDP.

We observe that this reduction, and the fact that the policy we are looking for in the original POMDP can be directly obtained from the constructed discounted-sum game, imply that the probabilities of the POMDP do not really matter when considering the worst-case value. Thus, Lemma 1 follows.

Given a POMDP with observable rewards, we construct the weighted arena where:

• is a finite set of states;

• is the set of initial states;

• includes transitions of the form if and for any ;

• is a weight function of the form determined by as follows: for any .
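The subset construction described above can be sketched as follows. This is one plausible reading of the construction, written under assumptions: arena states are sets of POMDP states sharing an observation, the initial arena states are obtained by grouping the support of the initial distribution by observation, and observable rewards are taken to mean that the reward of an action is the same for all states in such a set. The function names, the input encoding, and the toy POMDP are hypothetical.

```python
from collections import deque


def group_by_observation(states, obs):
    """Split a set of POMDP states into maximal subsets with equal observation."""
    groups = {}
    for s in states:
        groups.setdefault(obs(s), set()).add(s)
    return [frozenset(g) for g in groups.values()]


def build_weighted_arena(init_support, actions, succ, obs, reward):
    """Explore the reachable part of the arena by breadth-first search.

    succ(s, a) is the set of states reachable from s under a with non-zero
    probability; probabilities themselves are not needed.
    """
    initial = group_by_observation(init_support, obs)
    edges, weight = {}, {}
    seen, queue = set(initial), deque(initial)
    while queue:
        q = queue.popleft()
        for a in actions:
            post = set().union(*(succ(s, a) for s in q))
            rewards = {reward(s, a) for s in q}
            if len(rewards) != 1:
                raise ValueError("rewards are not observable")
            w = rewards.pop()
            for q2 in group_by_observation(post, obs):
                edges.setdefault((q, a), []).append(q2)
                weight[(q, a, q2)] = w
                if q2 not in seen:
                    seen.add(q2)
                    queue.append(q2)
    return initial, seen, edges, weight


# Toy POMDP (hypothetical): "l" and "r" look alike; "goal" and "trap" absorb.
SUCC = {"l": {"goal"}, "r": {"goal", "trap"}, "goal": {"goal"}, "trap": {"trap"}}
OBS = {"l": "start", "r": "start", "goal": "goal", "trap": "trap"}
REWARD = {"l": 0, "r": 0, "goal": 1, "trap": 0}

initial, states, edges, weight = build_weighted_arena(
    {"l", "r"}, ["go"],
    lambda s, a: SUCC[s], lambda s: OBS[s], lambda s, a: REWARD[s],
)
```

Only the support of the transition function is consulted, which matches the remark that the probabilities of the POMDP do not matter for the worst-case value. The number of arena states can be exponential in the number of POMDP states, so the sketch explores only the reachable part.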

A play or infinite path in a weighted arena is a sequence of states and actions s.t. and for all we have . We denote by the set of all plays. A (finite) path is a finite prefix of a play ending in a state. Since the game has full observation, a history in a weighted arena is simply a path. The discounted sum of a play is defined as for POMDPs but using the weight function instead of . The definitions for policy and worst-case value are then identical. (For clarity, we write instead of when referring to the worst-case value in .)

#### From histories of the POMDP to histories in the game.

We now define a mapping from observation-action sequences to state-action sequences in the constructed weighted arena. For a history from we let where and for all we have .

###### Claim 1.

The function is a bijective function from histories in to paths in .

###### Proof.

Clearly is injective. We will argue that it is also surjective, and hence bijective. Consider a path from . We have that where for any and for all . It remains to show that there is a path in s.t. , to conclude that is a valid history in . By construction of we have that, for all , for all states there is s.t. . The result follows by induction. ∎

It follows that there are bijective mappings from policies in to policies in , and from plays in to plays in . For a policy in , let us denote by the corresponding policy in ; for a play in , for the play in .

###### Lemma 4.

For any policy in and for any policy in , if then .

###### Proof.

First, note that since has observable rewards, then for all histories we have that for any two paths s.t. the following holds:

$$\sum_{i=0}^{n-1} \gamma^{i} r(s_i, a_i) = \sum_{i=0}^{n-1} \gamma^{i} r(s'_i, a_i).$$

Furthermore, by construction of we also have that

$$\sum_{i=0}^{n-1} \gamma^{i} w(q_i, a_i, q_{i+1}) = \sum_{i=0}^{n-1} \gamma^{i} r(s_i, a_i).$$

Thus, for the result to follow, it suffices for us to show that for any policy in and corresponding in , if then is also bijective when restricted to plays consistent with and in the respective structures. We proceed by induction. Note that for any history in with only one observation and consistent with we have that is consistent with since no choice has been made by the policies. Conversely, for any path in with only one element, and consistent with , is consistent with for the same reason. Hence, for some , is a bijective function from histories in to paths in , all of length at most . Consider a history in consistent with and let us write . By induction hypothesis, we know is consistent with . Observe that:

• and therefore since is consistent with ;

• by definition of a history, there is some path in with ; and

• by construction of and definition of we have that and .

It follows that is also consistent with . To show the other direction, we now take a path in consistent with and write . It follows from inductive hypothesis that is consistent with . Since , we have that . Also, for any we have . Hence the claim holds and the result follows by induction. ∎

It follows from the above arguments that computing the worst-case value can be done in exponential time for POMDPs with discounted sum and observable rewards. This is, in fact, a tight complexity result. Indeed, safety and reachability games with partial observation are EXP-hard [CD10] even if the objective is observable. One can easily reduce either of them to a discounted-sum objective in a POMDP by placing rewards or costs on target (or unsafe) transitions (depending on the game we reduce from) and asking for non-negative worst-case value. Therefore, deciding a threshold problem for the worst-case value in POMDPs with discounted sum is EXP-complete.

###### Theorem 3.

The worst-case threshold problem for POMDPs with discounted sum and observable rewards is EXP-complete.

## Appendix D Formal Proof of Proposition 1

Assume we are given a POMDP with observable rewards and we have constructed the corresponding weighted arena .

Recall the statement says:
Let be a function s.t. for each .

1. Then any