 # Expectation Optimization with Probabilistic Guarantees in POMDPs with Discounted-sum Objectives

Partially-observable Markov decision processes (POMDPs) with discounted-sum payoff are a standard framework to model a wide range of problems related to decision making under uncertainty. Traditionally, the goal has been to obtain policies that optimize the expectation of the discounted-sum payoff. A key drawback of the expectation measure is that even low probability events with extreme payoff can significantly affect the expectation, and thus the obtained policies are not necessarily risk-averse. An alternate approach is to optimize the probability that the payoff is above a certain threshold, which allows obtaining risk-averse policies, but ignores optimization of the expectation. We consider the expectation optimization with probabilistic guarantee (EOPG) problem, where the goal is to optimize the expectation ensuring that the payoff is above a given threshold with at least a specified probability. We present several results on the EOPG problem, including the first algorithm to solve it.


## 1 Introduction

POMDPs and Discounted-Sum Objectives. Decision making under uncertainty is a fundamental problem in artificial intelligence. Markov decision processes (MDPs) are the de facto model that allows both decision-making choices as well as stochastic behavior [Howard1960, Puterman2005]. The extension of MDPs with uncertainty about information gives rise to the model of partially-observable Markov decision processes (POMDPs) [Littman1996, Papadimitriou and Tsitsiklis1987]. POMDPs are used in a wide range of areas, such as planning [Russell and Norvig2010, Kaelbling et al.1996] and robotics [Kress-Gazit et al.2009, Kaelbling et al.1998], to name a few. In decision making under uncertainty, the objective is to optimize a payoff function. A classical and basic payoff function is the discounted-sum payoff, where every transition of the POMDP is assigned a reward, and for an infinite path (that consists of an infinite sequence of transitions) the payoff is the discounted sum of the rewards of the transitions.

Expectation Optimization and Drawback. Traditionally, POMDPs with discounted-sum payoff have been studied with the goal of obtaining policies that optimize the expected payoff. A key drawback of expectation optimization is that it is not robust with respect to risk: for example, a policy that achieves an extremely high payoff with small probability and payoff 0 with the remaining probability has a higher expectation than a policy that achieves a moderate payoff with high probability and payoff 0 otherwise. However, the second policy is more robust and less risk-prone, and is desirable in many scenarios.

Probability Optimization and Drawback. Due to the drawback of expectation optimization, there has been recent interest to study the optimization of the probability to ensure that the payoff is above a given threshold [Hou et al.2016]. While this ensures risk-averse policies, it ignores the expectation optimization.

Expectation Optimization with Probabilistic Guarantee. A formulation that retains the advantages of both the above optimization criteria, yet removes the associated drawbacks, is as follows: given a payoff threshold $\tau$ and a risk bound $\alpha$, the objective is expectation maximization over all policies that ensure the payoff is at least $\tau$ with probability at least $1-\alpha$. We study this expectation optimization with probabilistic guarantee (EOPG) problem for discounted-sum POMDPs.

Motivating Examples. We present some motivating examples for the EOPG formulation.

• Bad events avoidance. Consider planning under uncertainty (e.g., self-driving cars) where certain events are dangerous (e.g., the distance between two cars falling below a specified value), and it must be ensured that such events happen only with low probability. Thus, desirable policies maximize the expected payoff while ensuring the avoidance of bad events with a specified high probability.

• Gambling. In gambling, while the goal is to maximize the expected profit, a desirable risk-averse policy would ensure that the loss is less than a specified amount with high probability.

Thus, the EOPG problem for POMDPs with discounted-sum payoff is an important problem which we consider.

Previous Results. Several related problems have been considered; the two most relevant works are the following:

1. Chance-constrained (CC) problem. In the CC problem, the probability of reaching certain bad states of the POMDP must stay below a given threshold. That is, in the CC problem the probabilistic constraint is state-based (some states should be avoided) rather than based on the execution-wide discounted-sum payoff. This problem was considered in [Santana et al.2016], but only for deterministic policies. As already noted in [Santana et al.2016], randomized (or mixed) policies are more powerful.

2. Probability 1 bound. The special case of the EOPG problem with risk bound 0 has been considered in [Chatterjee et al.2017]. This formulation represents the case with no risk.

Our Contributions. Our main contributions are as follows:

1. Algorithm. We present a randomized algorithm for approximating (up to any given precision) the optimal solution to the EOPG problem. This is the first approach to solve the EOPG problem for discounted-sum POMDPs.

2. Practical approach. We present a practical approach where certain searches of our algorithm are only performed for a time bound. This gives an anytime algorithm which approximates the probabilistic guarantee and then optimizes the expectation.

3. Experimental results. We present experimental results of our algorithm on classical POMDPs.

Due to space constraints, details such as full proofs are deferred to the appendix.

Related Works. POMDPs with discounted-sum payoff have been widely studied, both for theoretical results [Papadimitriou and Tsitsiklis1987, Littman1996] as well as practical tools [Kurniawati et al.2008, Silver and Veness2010, Ye et al.2017]. Traditionally, expectation optimization has been considered, and recent works consider policies that optimize probabilities to ensure discounted-sum payoff above a threshold [Hou et al.2016]. Several problems related to the EOPG problem have been considered before: (a) for probability threshold 1 for long-run average and stochastic shortest path problems in fully-observable MDPs [Bruyère et al.2014, Randour et al.2015]; (b) for risk bound 0 for discounted-sum payoff for POMDPs [Chatterjee et al.2017]; and (c) for general probability threshold for long-run average payoff in fully-observable MDPs [Chatterjee et al.2015b]. The chance-constrained optimization was studied in [Santana et al.2016]. The general EOPG problem for POMDPs with discounted-sum payoff has not been studied before, although development of similar objectives was proposed for perfectly observable MDPs [Defourny et al.2008]. A related approach for POMDPs is called constrained POMDPs [Undurti and How2010, Poupart et al.2015], where the aim is to maximize the expected payoff ensuring that the expectation of some other quantity is bounded. In contrast, in the EOPG problem the constraint is probabilistic rather than an expectation constraint, and as mentioned before, the probabilistic constraint ensures risk-averseness as compared to the expectation constraint. Thus, the constrained POMDPs and the EOPG problem, though related, consider different optimization criteria.

## 2 Preliminaries

Throughout this work, we mostly follow standard (PO)MDP notation from [Puterman2005, Littman1996]. We denote by $\Delta(X)$ the set of all probability distributions on a finite set $X$, i.e. all functions $f\colon X\to[0,1]$ s.t. $\sum_{x\in X} f(x)=1$.

###### Definition 1

(POMDPs.) A POMDP is a tuple $M=(S,A,\delta,r,Z,O,\lambda)$, where $S$ is a finite set of states, $A$ is a finite alphabet of actions, $\delta\colon S\times A\to\Delta(S)$ is a probabilistic transition function that given a state and an action gives the probability distribution over the successor states, $r\colon S\times A\to\mathbb{R}$ is a reward function, $Z$ is a finite set of observations, $O\colon S\to\Delta(Z)$ is a probabilistic observation function that maps every state to a distribution over observations, and $\lambda\in\Delta(S)$ is the initial belief. We abbreviate $\delta(s,a)(s')$ and $O(s)(z)$ by $\delta(s'\mid s,a)$ and $O(z\mid s)$, respectively.
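As a concrete illustration, the tuple from the definition can be encoded as a small data structure. The field names and the toy two-state instance below are ours, purely for illustration, not from the paper:

```python
from dataclasses import dataclass

@dataclass
class POMDP:
    states: list          # finite set of states S
    actions: list         # finite alphabet of actions A
    delta: dict           # (s, a) -> {s': prob}, probabilistic transitions
    reward: dict          # (s, a) -> float, reward function
    observations: list    # finite set of observations Z
    obs_fn: dict          # s -> {z: prob}, probabilistic observation function
    init_belief: dict     # s -> prob, initial belief

# hypothetical two-state instance: one step earns reward 1, then nothing
toy = POMDP(
    states=["s0", "s1"],
    actions=["a"],
    delta={("s0", "a"): {"s1": 1.0}, ("s1", "a"): {"s1": 1.0}},
    reward={("s0", "a"): 1.0, ("s1", "a"): 0.0},
    observations=["z"],
    obs_fn={"s0": {"z": 1.0}, "s1": {"z": 1.0}},
    init_belief={"s0": 1.0},
)
```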

Plays & Histories. A play (or an infinite path) in a POMDP $M$ is an infinite sequence $\rho=s_0 a_1 s_1 a_2 s_2\ldots$ of states and actions s.t. $s_0\in\mathrm{supp}(\lambda)$ and for all $i\ge 0$ we have $\delta(s_{i+1}\mid s_i,a_{i+1})>0$. A finite path (or just path) is a finite prefix of a play ending with a state, i.e. a sequence from $(S\cdot A)^*\cdot S$. A history is a finite sequence $h=a_1 z_1 a_2 z_2\ldots a_k z_k$ of actions and observations s.t. there is a path $p=s_0 a_1 s_1\ldots a_k s_k$ with $O(z_i\mid s_i)>0$ for each $1\le i\le k$; we then say that history $h$ corresponds to the path $p$. The length of a path (or history) $p$, denoted by $|p|$, is the number of actions in $p$, and the length of a play is $\infty$.

Discounted Payoff. Given a play $\rho=s_0 a_1 s_1 a_2 s_2\ldots$ and a discount factor $\gamma\in(0,1)$, the finite-horizon discounted payoff of $\rho$ for horizon $N$ is $\mathrm{Disc}_{\gamma,N}(\rho)=\sum_{i=0}^{N-1}\gamma^i\, r(s_i,a_{i+1})$. The infinite-horizon discounted payoff of $\rho$ is $\mathrm{Disc}_{\gamma}(\rho)=\sum_{i=0}^{\infty}\gamma^i\, r(s_i,a_{i+1})$.
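Both payoff functions are straightforward to compute from a reward sequence; a minimal sketch (our own helper, not from the paper):

```python
def disc_payoff(rewards, gamma, N=None):
    """Discounted-sum payoff sum_i gamma^i * r_i.
    With N given, the finite-horizon variant over the first N rewards."""
    seq = rewards if N is None else rewards[:N]
    return sum(gamma**i * r for i, r in enumerate(seq))

# rewards 1, 1, 1 with gamma = 0.5 and horizon 3: 1 + 0.5 + 0.25 = 1.75
payoff = disc_payoff([1.0, 1.0, 1.0], 0.5, N=3)
```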

Policies. A policy (or strategy) is a blueprint for selecting actions based on the past history. Formally, it is a function $\sigma$ which assigns to a history $h$ a probability distribution over the actions, i.e. $\sigma(h)(a)$ is the probability of selecting action $a$ after observing history $h$ (we abbreviate $\sigma(h)(a)$ to $\sigma(a\mid h)$). A policy is deterministic if for each history the distribution selects a single action with probability 1. For deterministic $\sigma$ we write $\sigma(h)=a$ to indicate that $\sigma(a\mid h)=1$.

Beliefs. A belief is a distribution on states (i.e. an element of $\Delta(S)$) indicating the probability of being in each particular state given the current history. The initial belief $\lambda$ is given as a part of the POMDP. Then, in each step, when the history observed so far is $h$, the current belief is $b_h$, an action $a$ is played, and an observation $z$ is received, the updated belief $b_{haz}$ for history $haz$ can be computed by a standard Bayesian formula [Cassandra1998].
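The Bayesian update can be sketched as follows; this is the generic textbook update (the data layout is ours, not the paper's):

```python
def update_belief(belief, action, obs, delta, obs_fn):
    """Bayesian update: b'(s') is proportional to
    O(obs | s') * sum_s b(s) * delta(s' | s, action)."""
    new = {}
    for s, p in belief.items():
        for s2, t in delta.get((s, action), {}).items():
            new[s2] = new.get(s2, 0.0) + p * t * obs_fn.get(s2, {}).get(obs, 0.0)
    z = sum(new.values())  # normalization constant = P(obs | belief, action)
    if z == 0.0:
        raise ValueError("observation impossible under this belief and action")
    return {s: p / z for s, p in new.items()}

# hypothetical example: two distinguishable states; observing "x" pins the belief to "A"
posterior = update_belief(
    {"A": 0.5, "B": 0.5}, "go", "x",
    delta={("A", "go"): {"A": 1.0}, ("B", "go"): {"B": 1.0}},
    obs_fn={"A": {"x": 1.0}, "B": {"y": 1.0}},
)
```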

Expected Value of a Policy. Given a POMDP $M$, a policy $\sigma$, a horizon $N$, and a discount factor $\gamma$, the expected value of $\sigma$ from $\lambda$ is the expected value $\mathbb{E}^{\sigma}_{\lambda}[\mathrm{Disc}_{\gamma,N}]$ of the discounted sum under policy $\sigma$ when starting in a state sampled from the initial belief $\lambda$ of $M$ (and analogously $\mathbb{E}^{\sigma}_{\lambda}[\mathrm{Disc}_{\gamma}]$ for the infinite horizon).

Risk. The risk level of a policy $\sigma$ at threshold $\tau$ w.r.t. payoff function $\mathrm{Disc}_{\gamma,N}$ is the probability that the payoff of a play generated by $\sigma$ is below $\tau$, i.e.

$$rl(\sigma,\tau,\mathrm{Disc}_{\gamma,N})=\mathbb{P}^{\sigma}_{\lambda}(\mathrm{Disc}_{\gamma,N}<\tau).$$

EOPG Problem. We now define the problem of expectation optimization with probabilistic guarantees (the EOPG problem for short). We first define a finite-horizon variant, and then discuss the infinite-horizon version in Section 3. In the EOPG problem, we are given a threshold $\tau$, a risk bound $\alpha$, and a horizon $N$. A policy $\sigma$ is a feasible solution of the problem if $rl(\sigma,\tau,\mathrm{Disc}_{\gamma,N})\le\alpha$. The goal of the EOPG problem is to find a feasible solution maximizing $\mathbb{E}^{\sigma}_{\lambda}[\mathrm{Disc}_{\gamma,N}]$ among all feasible solutions, provided that feasible solutions exist.
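Since the risk of a fixed policy is just a tail probability of the payoff distribution, it can be estimated by simulation. The following Monte-Carlo sketch (our own helper, with a hypothetical two-outcome payoff distribution) illustrates the quantity being constrained:

```python
import random

def empirical_risk(sample_payoff, tau, trials=10000, seed=0):
    """Monte-Carlo estimate of rl(sigma, tau) = P(payoff < tau) for a fixed
    policy, where sample_payoff(rng) draws the payoff of one simulated play."""
    rng = random.Random(seed)
    below = sum(1 for _ in range(trials) if sample_payoff(rng) < tau)
    return below / trials

# hypothetical policy: payoff 10 with probability 0.9, payoff 0 otherwise;
# its true risk at threshold tau = 5 is 0.1
risk = empirical_risk(lambda rng: 10.0 if rng.random() < 0.9 else 0.0, tau=5.0)
```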

### Observable Rewards.

We solve the EOPG problem under the assumption that rewards in the POMDP are observable. This means that $r(s,a)=r(s',a)$ whenever the states $s$ and $s'$ can produce a common observation with positive probability, or if both $s$ and $s'$ have a positive probability under the initial belief. This is a natural assumption satisfied by many standard benchmarks [Hou et al.2016, Chatterjee et al.2015a]. At the end of Section 4 we discuss how our results could be extended to unobservable rewards.

Efficient Algorithms. A standard way of making POMDP planning more efficient is to design an algorithm that is online (i.e., it computes a local approximation of the optimal policy, selecting the best action for the current belief [Ross et al.2008]) and anytime (i.e., it computes better and better approximations of the optimal policy over its runtime, returning a solution together with some guarantee on its quality if forced to terminate early).

## 3 Relationship to CC-POMDPs

We present our first result showing that an approximate infinite-horizon (IH) EOPG problem can be reduced to a finite-horizon variant. While similar reductions are natural when dealing with discounted payoff, for the EOPG problem the reduction is somewhat subtle due to the presence of the risk constraint. We then show that the EOPG problem can be reduced to chance-constrained POMDPs and solved using the RAO* algorithm [Santana et al.2016], but we also present several drawbacks of this approach.

Formally, we define the IH-EOPG problem as follows: we are given $\tau$ and $\alpha$ as before and, in addition, an error term $\epsilon>0$. We say that an algorithm $\epsilon$-solves the IH-EOPG problem if, whenever the problem has a feasible solution (feasibility is defined as before, with $\mathrm{Disc}_{\gamma,N}$ replaced by $\mathrm{Disc}_{\gamma}$), the algorithm finds a policy $\pi$ s.t. $rl(\pi,\tau,\mathrm{Disc}_{\gamma})\le\alpha$ and $\mathbb{E}^{\pi}_{\lambda}[\mathrm{Disc}_{\gamma}]\ge rVal(\tau,\alpha)-\epsilon$, where

$$rVal(\tau,\alpha)=\sup\{\mathbb{E}^{\pi}_{\lambda}[\mathrm{Disc}_{\gamma}]\mid rl(\pi,\tau,\mathrm{Disc}_{\gamma})\le\alpha\}.$$

### Infinite to Finite Horizon.

Let $M$ be a $\gamma$-discounted POMDP, $\tau$ a payoff threshold, $\alpha$ a risk bound, and $\epsilon>0$ an error term. Let $N$ be a horizon such that $\gamma^N\cdot\frac{r_{\max}-r_{\min}}{1-\gamma}\le\epsilon$, where $r_{\max}$ and $r_{\min}$ are the maximal and minimal rewards appearing in $M$, respectively.

###### Lemma 1

If there exists a feasible solution of the IH-EOPG problem with threshold $\tau$ and risk bound $\alpha$, then the finite-horizon EOPG problem with horizon $N$, risk bound $\alpha$, and a threshold suitably adjusted for the truncation error also has a feasible solution. Moreover, if $\sigma$ is an optimal solution to this finite-horizon EOPG problem, then $\sigma$ is an $\epsilon$-optimal solution to the IH-EOPG problem.

The previous lemma effectively shows that to solve an approximate version of the EOPG problem, it suffices to solve its finite horizon version.
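To make the reduction concrete, a sufficient horizon can be computed from the discount factor and the reward range. The sketch below assumes a tail bound of the form $\gamma^N (r_{\max}-r_{\min})/(1-\gamma)\le\epsilon$, which is the standard truncation argument (the exact constant used in the paper may differ):

```python
import math

def horizon_for_error(gamma, r_max, r_min, eps):
    """Smallest N with gamma^N * (r_max - r_min) / (1 - gamma) <= eps, so that
    truncating the discounted sum at step N perturbs the payoff by at most eps
    (assumed form of the bound; see the lead-in)."""
    span = (r_max - r_min) / (1.0 - gamma)
    if span <= eps:
        return 0
    return math.ceil(math.log(eps / span) / math.log(gamma))

# gamma = 0.9, rewards in [0, 1], eps = 0.1: 0.9^N * 10 <= 0.1 first holds at N = 44
N = horizon_for_error(0.9, 1.0, 0.0, 0.1)
```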

### Chance-Constrained POMDPs.

In the chance-constrained (CC) optimization problem [Santana et al.2016], we are given a POMDP $M$, a finite-horizon bound $N$, a set $C$ of constraint-violating states, which is a subset of the set of states of $M$, and a risk bound. The goal is to optimize the expected finite-horizon payoff, i.e. the expectation of the following random variable:

$$\mathrm{Payoff}_N(s_0 a_1 s_1 a_2 s_2\ldots)=\sum_{i=0}^{N} r(s_i,a_{i+1}).$$

The optimization is subject to the constraint that the probability of entering a state from $C$ (the so-called execution risk) stays below the risk bound.

### From EOPGs to CC-POMDPs.

We sketch how the FH-EOPG problem relates to CC-POMDP optimization. In the EOPG problem, a constraint violation occurs when the finite-horizon discounted payoff in step $N$ is smaller than the threshold $\tau$. To formulate this in a CC-POMDP setting, we need to make the constraint violation a property of a state of the POMDP. Hence, we construct a new POMDP $M'$ with an extended state space: the states of $M'$ are triples of the form $(s,i,d)$, where $s$ is a state of the original POMDP $M$, $i$ is a time index, and $d$ is a number representing the discounted reward accumulated before reaching the state. The remaining components of $M'$ are then extended in a natural way from $M$. By solving the CC-POMDP problem for $M'$, where the set $C$ contains extended states $(s,N,d)$ with $d<\tau$, we obtain a policy in $M'$ which can be carried back to $M$, where it forms an optimal solution of the FH-EOPG problem.
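A minimal sketch of this product construction (the encoding is ours; the paper describes it only abstractly):

```python
def extended_step(delta, reward, gamma, ext_state, action):
    """One step of the product construction. Extended states are (s, i, d),
    with d the discounted reward accumulated so far; returns the successor
    distribution {(s', i + 1, d'): prob}."""
    s, i, d = ext_state
    d2 = d + gamma**i * reward[(s, action)]   # accumulate the discounted reward
    return {(s2, i + 1, d2): p for s2, p in delta[(s, action)].items()}

def violating(ext_state, horizon, tau):
    """Constraint-violating extended states: horizon reached, payoff below tau."""
    s, i, d = ext_state
    return i == horizon and d < tau

# hypothetical one-transition example: from ("s", 1, 1.0), reward 2 discounted
# by gamma^1 = 0.5 yields accumulated payoff 2.0
succ = extended_step({("s", "a"): {"t": 1.0}}, {("s", "a"): 2.0}, 0.5, ("s", 1, 1.0), "a")
```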

### Discussion of the CC-POMDP Approach.

It follows that we could, in principle, reduce the EOPG problem to CC-POMDP optimization and then solve the latter using the RAO* algorithm [Santana et al.2016]. However, there are several issues with this approach.

First, RAO* aims to find an optimal deterministic policy in CC-POMDPs. But as already mentioned in [Santana et al.2016], the optimal solution to the CC-POMDP (and thus also to EOPG) problem might require randomization, and deterministic policies may have arbitrarily worse expected payoff than randomized ones (it is well-known that randomization might be necessary for optimality in constrained (PO)MDPs, see [Feinberg and Shwartz1995, Kim et al.2011, Sprauel et al.2014]).

Second, although RAO* converges to an optimal constrained deterministic policy, it does not provide anytime guarantees about the risk of the policy it constructs. RAO* is an AO*-like algorithm that iteratively searches the belief space and in each step computes a greedy policy that is optimal on the already explored fragment of the belief space. During its execution, RAO* works with an under-approximation of the risk taken by the greedy policy: this is because the optimal risk to be taken in belief states that were not yet explored is under-approximated by an admissible heuristic. So if the algorithm is stopped prematurely, the actual risk taken by the current greedy policy can be much larger than indicated by the algorithm. This is illustrated in the following example.

###### Example 1

Consider the MDP in Figure 1, with a payoff threshold and risk bound chosen so that, in the chance-constrained reformulation, we seek an optimal policy for which the probability of hitting the constraint-violating states is small. Consider an execution of RAO* which explores only an initial fragment of the state space (since the MDP is perfectly observable, we work directly with states) and is then prematurely terminated. (The order in which unexplored nodes are visited is determined by a heuristic, and in general we cannot guarantee that the risky part of the MDP is explored early in RAO*'s execution.) At this moment, the risk taken by an optimal policy from the unexplored states is under-approximated using an admissible heuristic; with the myopic heuristic suggested in [Santana et al.2016], this under-approximation can be far below the true risk. Hence, at this moment the best greedy policy appears to satisfy the risk constraint, while any deterministic policy consistent with it in fact takes an overall risk exceeding the bound.

## 4 Risk-Aware POMCP

The previous example illustrates the main challenge in designing an online and anytime algorithm for the (finite-horizon) EOPG problem: we need to keep upper bounds on the minimal risk achievable in the POMDP. Initially, the upper bound is 1, and to decrease it, we need to discover a sufficiently large probability mass of paths that yield payoff above $\tau$.

We propose an algorithm for the EOPG problem based on the popular POMCP [Silver and Veness2010] planning algorithm: the risk-aware POMCP (or RAMCP for short). RAMCP addresses the aforementioned challenge by performing, in each decision step, a large number of simulations, using the POMCP heuristic to explore promising histories first. The key feature of RAMCP is that it extends POMCP with a new data structure, the so-called explicit tree, which contains those histories explored during simulations that have payoff above the required threshold $\tau$. The explicit tree allows us to keep track of the upper bound on the risk that needs to be taken from the initial belief. After the simulation phase concludes, RAMCP uses the explicit tree to construct a perfectly observable, tree-shaped constrained MDP [Altman1999] encoding the EOPG problem on the explored fragment of the history tree of the input POMDP. The optimal distribution on actions is then computed using a linear program for constrained MDP optimization [Altman1999]. In the rest of this section, we present details of the algorithm and formally state its properties. In the following, we fix a POMDP $M$, a horizon $N$, a threshold $\tau$, and a risk bound $\alpha$.

### RAMCP.

The main loop of RAMCP is pictured in Algorithm 1. In each decision step, RAMCP performs a search phase, followed by action selection, followed by playing the selected action (the latter two are performed within a single procedure). We describe the three phases separately.

### RAMCP: Search Phase.

The search phase is shown in Algorithms 1 and 2. In the following, we first introduce the data structures the algorithm works with, then the elements of these structures, and finally we sketch how the search phase executes.

Data Structures. In the search phase, RAMCP explores, by performing simulations, the history tree of the input POMDP $M$. Nodes of the tree are the histories of $M$ of length at most $N$. The tree is rooted in the empty history, and for each history $h$ of length less than $N$, each action $a$, and observation $z$, the node $h$ has a child $haz$. RAMCP works with two data structures that are both sub-trees of the history tree: a search tree and an explicit tree. Intuitively, the search tree corresponds to the standard POMCP search tree, while the explicit tree is a sub-tree of the search tree containing histories leading to payoff above $\tau$. The term "explicit" stems from the fact that we explicitly compute beliefs and transition probabilities for the nodes in the explicit tree. Initially (before the first search phase), both trees contain a single node: the empty history.

Elements of Data Structures. Each node $h$ of the search tree has these attributes: for each action $a$, the average expected payoff obtained from the node after playing $a$ during past simulations, and the number of times action $a$ was selected in node $h$ in past simulations; additionally, the number of times the node itself was visited during past simulations. Each node of the search tree also contains a particle-filter approximation of the corresponding belief. A node $h$ of the explicit tree additionally carries an upper bound on the risk from the belief of $h$, and, for each action $a$, an upper bound on the risk when playing $a$ from that belief. Also, each node of the explicit tree contains an exact representation of the corresponding belief, and each edge of the explicit tree is labelled by the probability of observing $z$ when playing action $a$ after history $h$, and by the associated reward, which is equal to $r(s,a)$ for any state $s$ in the support of the corresponding belief (here we use the fact that rewards are observable).

Execution of Search Phase. The simulation procedures are basically the same as in POMCP: within the search tree we choose actions heuristically via POMCP's UCB-style rule (governed by POMCP's exploration constant), and outside of it we choose actions uniformly at random. However, whenever a simulation succeeds in surpassing the threshold $\tau$, we add the observed history $h$ and all its prefixes to both the explicit and search trees. Note that adding $h$ to the explicit tree entails computing full Bayesian updates on the path from the root to $h$ so as to compute exact beliefs of the corresponding nodes. Risk bounds for the nodes corresponding to prefixes of $h$ are then updated using a standard dynamic-programming update, starting from the newly added leaf, whose risk is 0 (as it corresponds to a history after which $\tau$ is surpassed). We have the following:
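The risk upper bound maintained in the explicit tree follows a simple recurrence: a leaf whose history surpasses $\tau$ has risk 0, an unexplored successor pessimistically counts as risk 1, and an internal node takes the best available action. A recursive sketch with our own (hypothetical) node layout:

```python
def risk_bound(node):
    """Upper bound on the achievable risk from an explicit-tree node.
    Node layout (ours): {"surpassed": bool} for leaves already above the
    threshold; otherwise {"children": {action: {obs: (prob, child_or_None)}}},
    where None marks an unexplored successor (pessimistic risk 1)."""
    if node.get("surpassed"):
        return 0.0
    best = 1.0
    for action_succ in node.get("children", {}).values():
        r = sum(prob * (risk_bound(child) if child is not None else 1.0)
                for prob, child in action_succ.values())
        best = min(best, r)   # the best action determines the node's bound
    return best

# toy explicit tree: one action; obs z1 (prob 0.6) led to a threshold-surpassing
# leaf, obs z2 (prob 0.4) is unexplored, so the root's bound is 0.4
root = {"children": {"a": {"z1": (0.6, {"surpassed": True}), "z2": (0.4, None)}}}
```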

###### Lemma 2
1. At any point of the execution, the risk attribute of the root of the explicit tree is a sound upper bound: there exists a policy whose risk from the initial belief is at most its value.

2. As the number of simulations tends to infinity, the probability that the risk attribute of the root becomes equal to the minimal achievable risk before the timeout expires converges to 1.

Proof (sketch). For part (1.) we prove the following stronger statement: fix any point of the algorithm's execution, and consider the history at the root of the trees at this point. Then for any node $h$ of the explicit tree there exists a policy whose risk from the belief of $h$ (over the remaining horizon) is at most the risk attribute of $h$. The proof proceeds by a rather straightforward induction. The statement of the lemma then follows by instantiating this claim at the root.

For part (2.), the crucial observation is that as the number of simulations tends to infinity, with probability converging to 1 the explicit tree will at some point contain all histories of length $N$ (that have the current history as a prefix) whose payoff is above the required threshold. It can be easily shown that at such a point the risk attribute of the root equals the minimal achievable risk and will not change any further.

### RAMCP: Action Selection.

The action selection phase is sketched in Algorithm 3. If the current risk bound is 1, there is no real constraint and we select an action maximizing the expected payoff. Otherwise, to compute a distribution on actions to select, we construct and solve a certain constrained MDP.

Constructing Constrained MDP. RAMCP first computes a closure of the explicit tree. That is, we initialize the closure to the explicit tree itself; then, for each node $h$ and each action $a$ such that $h$ has a successor of the form $haz$ (in such a case, we say that $a$ is allowed in $h$), the algorithm checks if there exists a successor of the form $haz'$ that is not in the closure; all such "missing" successors of $h$ under $a$ are added to the closure. Such a tree defines a perfectly observable constrained MDP:

• the states of the constrained MDP are the nodes of the closure;

• for each internal node $h$ of the closure and each action $a$ allowed in $h$, there is a probability of transitioning from $h$ to each of its successors $haz$ under $a$, given by the corresponding edge label (these probabilities sum up to 1 for each $h$ and $a$ thanks to computing the closure). If $h$ is a leaf of the closure, playing any action in $h$ leads with probability 1 to a new sink state (which has self-loops under all actions).

• Rewards in the constrained MDP are given by the reward labels on the edges of the explicit tree; the self-loop on the sink and the state-action pairs at leaves of length $N$ have reward 0. Transitions from the other leaf nodes to the sink state have a reward equal to the POMCP payoff estimate of the corresponding node. That is, from nodes that were never explored explicitly we estimate the optimal payoff by previous POMCP simulations.

• We also have a constraint function assigning penalties to state-action pairs: it assigns penalty 1 to pairs $(h,a)$ such that $h$ is a leaf of the closure of length $N$ that does not belong to the explicit tree, and penalty 0 to all other state-action pairs.

Solving the MDP. Using a linear programming formulation of constrained MDPs [Altman1999], RAMCP computes a randomized policy in the constrained MDP maximizing the expected payoff under the constraint that the expected (discounted) sum of incurred penalties is at most $\alpha$. The resulting distribution on actions is the distribution used by this policy in the first step. An examination of the LP in [Altman1999] shows that each solution of the LP yields not only the policy, but also, for each action $a$ allowed in the root, a risk vector, i.e. a vector with one component per observation, bounding the risk to be taken after each observation that can follow $a$.
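To see what the LP buys over deterministic choices, consider its one-variable special case: mixing a safe action with a risky one at the root so that the risk constraint just binds. The helper and numbers below are ours, purely illustrative:

```python
def best_mix(v_safe, risk_safe, v_risky, risk_risky, alpha):
    """One-variable special case of the constrained-MDP LP [Altman1999]:
    maximize p*v_risky + (1-p)*v_safe  s.t.  p*risk_risky + (1-p)*risk_safe <= alpha.
    Since the risky action has higher value, the constraint binds at the optimum."""
    assert risk_safe <= alpha < risk_risky and v_risky >= v_safe
    p = (alpha - risk_safe) / (risk_risky - risk_safe)
    return p, p * v_risky + (1 - p) * v_safe

# safe action: value 1, risk 0; risky action: value 10, risk 0.5; bound alpha = 0.25
p, value = best_mix(1.0, 0.0, 10.0, 0.5, 0.25)
```

With these numbers the optimal mix plays the risky action with probability 0.5 for value 5.5, whereas the best deterministic choice respecting the bound is the safe action alone, with value 1; this mirrors why RAO*'s restriction to deterministic policies can be arbitrarily suboptimal.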

###### Remark 1 (Conservative risk minimization.)

Note that the constrained MDP might have no policy satisfying the penalty constraint. This happens when the risk attribute of the root is greater than $\alpha$. In such a case, the algorithm falls back to a policy that minimizes the risk, which means choosing an action minimizing the risk attribute (line 3 of SelectAction). In such a case, all risk vectors are set to zero, to enforce that in the following phases the algorithm behaves conservatively (i.e., keeps minimizing the risk).

###### Remark 2 (No feasible solution.)

When our algorithm fails to obtain a feasible solution, it "silently" falls back to a risk-minimizing policy. This might not be the preferred option for safety-critical applications. However, the algorithm exactly recognizes when it cannot guarantee meeting the original risk constraint: this happens exactly when, at the entry to the SelectAction procedure, the risk attribute in the root of the explicit tree exceeds $\alpha$. Thus, our algorithm has two desirable properties: (a) it can report that it has not obtained a feasible solution; (b) along with that, it presents a risk-minimizing policy.

Figure 2: Plots of results obtained from simulating (1.) the larger hallway POMDP benchmark (left), (2.) the MDP hallway benchmark (middle), and (3.) the smaller hallway POMDP benchmark (right). The horizontal axis represents a risk bound α.
###### Lemma 3

Assume that the original EOPG problem has a feasible solution. For a suitable exploration constant, as the number of simulations tends to infinity, the distribution on actions computed by SelectAction converges, with probability 1, to a distribution on actions used in the first step by some optimal solution to the EOPG problem.

Proof (sketch). Assuming the existence of a feasible solution, we show that at the time point in which the condition of Lemma 2 (2.) holds (an event whose probability converges to 1), the constrained MDP has a feasible solution. It then remains to prove that the optimal constrained payoff achievable in the constrained MDP converges to the optimal risk-constrained payoff achievable in $M$. Since rewards in the constrained MDP are in correspondence with rewards in $M$, it suffices to show that for each relevant leaf of the closure and each action $a$, the simulation-based payoff estimate converges with probability 1 to the optimal expected payoff for the remaining horizon achievable in $M$ after playing action $a$ from the corresponding belief. But since these estimates are updated by POMCP simulations, this follows (for a suitable exploration constant) from properties of POMCP (Theorem 1 in [Silver and Veness2010], see also [Kocsis and Szepesvári2006]).

RAMCP: Playing an Action. The action-playing phase is shown in Algorithm 3. An action $a$ is played in the actual POMDP, a new observation $z$ and reward are obtained, and the current history and belief are updated. Then, both tree data structures are pruned so that the node corresponding to the previous history extended by $az$ becomes the new root of the tree. After this, we proceed to the next decision step.

###### Theorem 1

Assume that an EOPG problem instance with risk bound $\alpha$ and threshold $\tau$ has a feasible solution. As the number of simulations per decision step tends to infinity, the probability that RAMCP returns a payoff smaller than $\tau$ converges to a number bounded by $\alpha$. For a suitable exploration constant, the expected return of a RAMCP execution converges to the optimal value of the EOPG problem.

Proof (sketch). The proof proceeds by induction on the horizon $N$, using Lemma 2 (2.) and Lemma 3.

RAMCP also provides the following anytime guarantee.

###### Theorem 2

Let $\beta$ be the value of the risk attribute of the root of the explicit tree after the end of the first search phase of RAMCP's execution. Then the probability that the remaining execution of RAMCP returns a payoff smaller than $\tau$ is at most $\beta$.

Unobservable Rewards. RAMCP could be adapted to work with unobservable rewards, at the cost of more computations. The difference is in the construction of the constraint function : for unobservable rewards, the same history of observations and actions might encompass both paths that have payoff above the threshold and paths that do not. Hence, we would need to compute the probability of paths corresponding to a given branch that satisfy the threshold condition. This could be achieved by maintaining beliefs over accumulated payoffs.
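The payoff-augmented beliefs mentioned above can be sketched as a joint distribution over state and accumulated discounted payoff (the encoding is ours, not from the paper):

```python
def update_payoff_belief(pb, delta, reward, obs_fn, gamma, step, action, obs):
    """One step of a joint belief over (state, accumulated payoff) pairs.
    pb maps (s, d) -> prob, with d the discounted payoff collected so far."""
    new = {}
    for (s, d), p in pb.items():
        d2 = d + gamma**step * reward[(s, action)]  # payoff becomes part of the state
        for s2, t in delta[(s, action)].items():
            w = p * t * obs_fn[s2].get(obs, 0.0)
            if w > 0.0:
                new[(s2, d2)] = new.get((s2, d2), 0.0) + w
    z = sum(new.values())
    return {k: v / z for k, v in new.items()}

# hypothetical: states "A" and "B" have different rewards but the same observation,
# so one observation history splits into two payoff values
pb2 = update_payoff_belief(
    {("A", 0.0): 0.5, ("B", 0.0): 0.5},
    {("A", "a"): {"C": 1.0}, ("B", "a"): {"C": 1.0}},
    {("A", "a"): 1.0, ("B", "a"): 0.0},
    {"C": {"z": 1.0}},
    gamma=0.5, step=0, action="a", obs="z",
)
```

This is what lets the algorithm compute, per branch, the probability mass of paths that meet the threshold even when rewards themselves are not observed.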

## 5 Experiments

We implemented RAMCP on top of the POMCP implementation in AI-Toolbox [AI-Toolbox2017] and tested on three sets of benchmarks. The first two are the classical Tiger [Kaelbling et al.1998] and Hallway [Smith and Simmons2004] benchmarks naturally modified to contain a risk taking aspect. In our variant of the Hallway benchmark, we again have a robot navigating a grid maze, oblivious to the exact heading and coordinates but able to sense presence of walls on neighbouring cells. Some cells of the maze are tasks. Whenever such a task cell is entered, the robot attempts to perform the task. When performing a task, there is a certain probability of a good outcome, after which a positive reward is gained, as well as a chance of a bad outcome, after which a negative penalty is incurred. There are different types of tasks in the maze with various expected rewards and risks of bad outcomes. Once a task is completed, it disappears from the maze. There are also “traps” that probabilistically spin the robot around.

As a third benchmark we consider an MDP variant of the Hallway benchmark. Since the Tiger benchmark is small, we present results for the larger benchmarks. Our implementation and the benchmarks are available on-line.

We ran RAMCP on the benchmarks with different risk thresholds, starting with unconstrained POMCP and progressively decreasing the risk bound until RAMCP no longer finds a feasible solution. For each risk bound we average outcomes of 1000 executions. In each execution, we used a timeout of 5 seconds in the first decision step and 0.1 seconds for the remaining steps. Intuitively, in the first step the agent is allowed a "pre-processing" phase before it starts its operation, trying to explore as much as possible. Once the agent performs the first action, it aims to select actions as fast as possible. We set the exploration constant proportionally to the difference between the largest and smallest undiscounted payoffs achievable in a given instance. The test configuration was CPU: Intel-i5-3470, 3.20GHz, 4 cores; 8GB RAM; OS: Linux Mint 18 64-bit.

### Discussion.

In Figure 2, we present results on three of the benchmarks: (1.) the larger Hallway POMDP benchmark; (2.) the perfectly observable version of our Hallway benchmark; and (3.) a smaller POMDP instance of the Hallway benchmark. In each figure, the horizontal axis represents the risk bound $\alpha$; the leftmost value is typically close to the risk achieved in unconstrained POMCP trials. For each considered $\alpha$ we plot the following quantities: average payoff (secondary, i.e. right, axis), empirical risk (the fraction of trials in which RAMCP returned payoff smaller than $\tau$, primary axis), and stated risk (the average risk attribute of the root of the explicit tree after the first search phase, primary axis). As expected, the stated risk approximates a lower bound on the empirical risk. Also, when the risk bound is decreased, the average payoff tends to decrease as well, since the risk bound constrains the agent's behaviour. This trend is somewhat violated in some datapoints: this is because, in particular for larger benchmarks, the timeout does not allow for enough exploration to converge to a tight approximation of the optimal policy. The main obstacle here is the usage of exact belief updates within the explicit tree, which is computationally expensive. An interesting direction for future work is to replace these updates with a particle-filter approximation (in line with POMCP) and thus increase search speed in exchange for weaker theoretical guarantees. Nonetheless, already the current version of RAMCP demonstrates the ability to perform the risk vs. expectation trade-off in POMDP planning.

### Comparison with deterministic policies.

As illustrated in Example 1, the difference in value between randomized and deterministic policies can be large. We ran experiments on the Hallway POMDP benchmarks to compute deterministic policies (by computing, in the action selection phase, an optimal deterministic policy in the constrained MDP , which entails solving an MILP problem). For instance, in benchmark (1.) with the deterministic policy yields expected payoff compared to achieved by the randomized policy. In benchmark (3.) with we have expected payoff for the deterministic vs. for the randomized policy.
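A toy one-shot instance (all numbers invented for illustration, not taken from the benchmarks) shows why randomization helps under a risk bound: a deterministic policy must commit to the safe action, while a randomized policy can mix in the risky, high-expectation action up to the allowed risk:

```python
# Hypothetical one-shot instance: action A pays 10 w.p. 0.5 and 0 otherwise,
# so it misses a payoff threshold of 1 with probability 0.5 ("risk" 0.5);
# action B pays 1 surely (risk 0).
ACTIONS = {
    "A": {"risk": 0.5, "expectation": 5.0},
    "B": {"risk": 0.0, "expectation": 1.0},
}

def best_deterministic(risk_bound):
    """Best expectation over single actions whose risk meets the bound."""
    return max(a["expectation"] for a in ACTIONS.values()
               if a["risk"] <= risk_bound)

def best_randomized(risk_bound):
    """Mix A and B: the mixture's risk is linear in the weight p on A, so
    the best feasible mixture puts as much weight on A as the bound allows."""
    p = min(1.0, risk_bound / ACTIONS["A"]["risk"])
    return p * ACTIONS["A"]["expectation"] + (1 - p) * ACTIONS["B"]["expectation"]
```

With risk bound 0.25, only B is deterministically feasible (expectation 1.0), whereas the randomized optimum plays A with probability 0.5 and achieves expectation 3.0.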

## 6 Conclusion

In this work, we studied the expected payoff optimization with probabilistic guarantees in POMDPs. We introduced an online algorithm with anytime risk guarantees for the EOPG problem, implemented this algorithm, and tested it on variants of classical benchmarks. Our experiments show that our algorithm, RAMCP, is able to perform risk-averse planning in POMDPs.

## Acknowledgements

The research presented in this paper was supported by the Vienna Science and Technology Fund (WWTF) grant ICT15-003; Austrian Science Fund (FWF): S11407-N23 (RiSE/SHiNE); and an ERC Start Grant (279307: Graph Games).

## References

• [AI-Toolbox2017] AI-Toolbox. AI-Toolbox [Computer Software].
• [Altman1999] Eitan Altman. Constrained Markov decision processes, volume 7. CRC Press, 1999.
• [Bruyère et al.2014] Véronique Bruyère, Emmanuel Filiot, Mickael Randour, and Jean-François Raskin. Meet your expectations with guarantees: Beyond worst-case synthesis in quantitative games. In STACS, pages 199–213. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2014.
• [Cassandra1998] A.R. Cassandra. Exact and approximate algorithms for partially observable Markov decision processes. PhD thesis, 1998.
• [Chatterjee et al.2015a] K. Chatterjee, M. Chmelik, R. Gupta, and A. Kanodia. Optimal cost almost-sure reachability in POMDPs. In AAAI. AAAI Press, 2015.
• [Chatterjee et al.2015b] Krishnendu Chatterjee, Zuzana Komárková, and Jan Kretínský. Unifying two views on multiple mean-payoff objectives in Markov decision processes. In LICS, pages 244–256. IEEE Computer Society, 2015.
• [Chatterjee et al.2017] Krishnendu Chatterjee, Petr Novotný, Guillermo A. Pérez, Jean-François Raskin, and Dorde Zikelic. Optimizing expectation with guarantees in POMDPs. In AAAI, pages 3725–3732. AAAI Press, 2017.
• [Defourny et al.2008] Boris Defourny, Damien Ernst, and Louis Wehenkel. Risk-aware decision making and dynamic programming. Technical report, 2008.
• [Feinberg and Shwartz1995] Eugene A. Feinberg and Adam Shwartz. Constrained Markov decision models with weighted discounted rewards. Mathematics of Operations Research, 20(2):302–320, 1995.
• [Hou et al.2016] Ping Hou, William Yeoh, and Pradeep Varakantham. Solving risk-sensitive POMDPs with and without cost observations. In AAAI, pages 3138–3144. AAAI Press, 2016.
• [Howard1960] H. Howard. Dynamic Programming and Markov Processes. MIT Press, 1960.
• [Kaelbling et al.1996] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
• [Kaelbling et al.1998] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1):99–134, 1998.
• [Kim et al.2011] Dongho Kim, Jaesong Lee, Kee-Eung Kim, and Pascal Poupart. Point-based value iteration for constrained POMDPs. In IJCAI, pages 1968–1974, 2011.
• [Kocsis and Szepesvári2006] Levente Kocsis and Csaba Szepesvári. Bandit Based Monte-Carlo Planning. In ECML, volume 4212 of LNCS, pages 282–293. Springer, 2006.
• [Kress-Gazit et al.2009] H. Kress-Gazit, G. E. Fainekos, and G. J. Pappas. Temporal-logic-based reactive mission and motion planning. IEEE Transactions on Robotics, 25(6):1370–1381, 2009.
• [Kurniawati et al.2008] H. Kurniawati, D. Hsu, and W.S. Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Robotics: Science and Systems, pages 65–72, 2008.
• [Littman1996] M. L. Littman. Algorithms for Sequential Decision Making. PhD thesis, Brown University, 1996.
• [Papadimitriou and Tsitsiklis1987] C. H. Papadimitriou and J. N. Tsitsiklis. The complexity of Markov decision processes. Math. Oper. Res., 12:441–450, 1987.
• [Poupart et al.2015] Pascal Poupart, Aarti Malhotra, Pei Pei, Kee-Eung Kim, Bongseok Goh, and Michael Bowling. Approximate linear programming for constrained partially observable Markov decision processes. In AAAI, pages 3342–3348. AAAI Press, 2015.
• [Puterman2005] M L Puterman. Markov Decision Processes. Wiley-Interscience, 2005.
• [Randour et al.2015] Mickael Randour, Jean-François Raskin, and Ocan Sankur. Variations on the stochastic shortest path problem. In VMCAI, volume 8931 of LNCS, pages 1–18. Springer, 2015.
• [Ross et al.2008] Stéphane Ross, Joelle Pineau, Sébastien Paquet, and Brahim Chaib-draa. Online planning algorithms for POMDPs. J. Artif. Intell. Res. (JAIR), 32:663–704, 2008.
• [Russell and Norvig2010] Stuart J. Russell and Peter Norvig. Artificial Intelligence - A Modern Approach (3. internat. ed.). Pearson Education, 2010.
• [Santana et al.2016] Pedro Santana, Sylvie Thiébaux, and Brian C. Williams. RAO*: An algorithm for chance-constrained POMDP’s. In AAAI, pages 3308–3314. AAAI Press, 2016.
• [Silver and Veness2010] David Silver and Joel Veness. Monte-Carlo planning in large POMDPs. In NIPS 23, pages 2164–2172. Curran Associates, Inc., 2010.
• [Smith and Simmons2004] T. Smith and R. Simmons. Heuristic search value iteration for POMDPs. In UAI, pages 520–527. AUAI Press, 2004.
• [Sprauel et al.2014] Jonathan Sprauel, Andrey Kolobov, and Florent Teichteil-Königsbuch. Saturated path-constrained mdp: Planning under uncertainty and deterministic model-checking constraints. In AAAI, pages 2367–2373, 2014.
• [Undurti and How2010] Aditya Undurti and Jonathan P How. An online algorithm for constrained POMDPs. In ICRA, pages 3966–3973. IEEE, 2010.
• [Ye et al.2017] Nan Ye, Adhiraj Somani, David Hsu, and Wee Sun Lee. DESPOT: Online POMDP planning with regularization. J. Artif. Intell. Res., 58:231–266, 2017.

## Appendix A Proof of Lemma 1

First, assume that there exists a feasible solution of the infinite-horizon problem. Due to the choice of , each play that satisfies also satisfies , so satisfies .

Assume that has the desired properties. First we show that . Due to the choice of , each play that satisfies also satisfies , from which the desired inequality easily follows.

Next, assume, for the sake of contradiction, that there is a policy such that and . Each play that satisfies also satisfies , from which it follows that . Moreover, , from which it follows that , a contradiction with the constrained optimality of .

## Appendix B Proof of Lemma 2

Before we proceed with the proof, we fix additional notation. For a history and we denote by the prefix of of length . Next, denotes the history obtained by removing the prefix of length from . We also denote .

Proof of part (1.). We prove a more general statement: Fix any point of the algorithm’s execution, and let be the length of at this point. Then for any that is at this point a node of the explicit tree, there exists a policy whose risk, when starting from belief and playing for steps, is at most ; formally, . The statement of the lemma then follows by plugging the root of into .

Before the very first call of procedure Simulate, the statement holds, as only contains an empty history with trivial upper risk bound .

Next, assume that the statement holds after each execution of the Simulate procedure. Then, in procedure PlayAction, is pruned so that it now contains a sub-tree of the original tree. At this point is incremented by one, but at the same time is set to , where is the common prefix of all histories that remain in after the pruning. Hence, for all such histories the term is unchanged by the PlayAction procedure and the statement still holds. It is therefore sufficient to prove that the validity of the statement is preserved whenever a new node is added to or the -attribute of some node in is changed inside procedure UpdateTrees.

Now when a new with is added to , it is a node that corresponds to a history such that all paths consistent with have reward at least . For such , the terms and evaluate to , and thus the statement holds for .

Finally, suppose that is changed on line 2. Let be the action realizing the minimum. For each child of in the explicit tree there is a policy surpassing the threshold with probability . By selecting action in and then continuing with when observation is received (if we receive an observation s.t. is not in , we can continue with an arbitrary policy), we get a policy with the desired property. This follows from the dynamic programming update on the previous line.
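The dynamic-programming update on risk attributes can be sketched as follows. This is a minimal sketch with hypothetical names; the actual algorithm additionally tracks payoff thresholds and obtains the observation probabilities from exact belief updates:

```python
class Node:
    """Explicit-tree node carrying a risk attribute: an upper bound on the
    minimal probability of the payoff missing the threshold from here on."""
    def __init__(self, risk=1.0):
        self.risk = risk
        self.children_by_action = {}   # action -> list of (P(obs), child)

def update_risk(node):
    """Backup step: the node's risk is the minimum over actions of the
    observation-probability-weighted risks of the corresponding children."""
    node.risk = min(
        sum(p_obs * child.risk for p_obs, child in children)
        for children in node.children_by_action.values()
    )
```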

Proof of part (2.). We again start by fixing some notation. Fix any point in the execution of the Search procedure. Let be the current root of . The safe sub-tree of rooted in is a sub-tree of satisfying the following property: a history belongs to if and only if is a prefix of and at the same time can be extended into a history of length such that (in particular, all such histories belong to ). That is, contains exactly those histories that lead to surpassing the current threshold.

We start with the following lemma, which will also be handy later.

###### Lemma 4

Let . Fix a concrete call of procedure Search, and let be the root of and in this call. Then, with probability , each node of the sub-tree of rooted in is visited in infinitely many calls of procedure Simulate.

###### Proof.

Assume, for the sake of contradiction, that with positive probability there exists a node of that is visited in only finitely many calls. Let be such a node of minimal length. It cannot be that , since is visited in each call of Simulate and, when , there are infinitely many such calls. So for some . Due to our assumptions, is visited in infinitely many calls of Simulate. This means that is eventually added to . Now assume that is selected on line 2 infinitely often with probability . Since there is a positive probability of observing after selecting for history , this would mean that is also visited infinitely often with probability 1, a contradiction. But is indeed selected infinitely often with probability , since is sampled according to POMCP simulations: POMCP is essentially the UCT algorithm applied to the history tree of a POMDP, and UCT, when run indefinitely, explores each node of the tree infinitely often (Theorem 4 in [Kocsis and Szepesvári2006]). ∎

We proceed with the following lemma.

###### Lemma 5

Fix a concrete call of procedure Search, and let be the root of and in this call. Then, as , the probability that becomes equal to before expires converges to 1.

###### Proof.

We prove a slightly different statement: if , then the probability that eventually becomes equal to is 1. Clearly, this entails the lemma, since

$$P\big(T_{\mathrm{exp}}\text{ becomes equal to }T_{\mathrm{safe}}(h_{\mathrm{root}})\big) \;=\; \sum_{i=0}^{\infty} P\big(T_{\mathrm{exp}}\text{ becomes equal to }T_{\mathrm{safe}}(h_{\mathrm{root}})\text{ in the }i\text{-th call of Simulate}\big).$$

(Here, $P$ denotes the probability measure over executions of our randomized algorithm.)

So let . From Lemma 4 it follows that, with probability , each node of the history tree is visited infinitely often. In particular, each node representing a history of length such that is visited, with probability 1, in at least one call of procedure Simulate. Hence, with probability 1, this node and all its predecessors are added to during the sub-call UpdateTrees(h). ∎

Lemma 2 then follows from the previous and the following lemma.

###### Lemma 6

Assume that during some call of Simulate it happens that becomes equal to . Then at this point it holds that , and will not change any further.

###### Proof.

Let . We again prove a somewhat stronger statement: given the assumptions of the lemma, for each it holds that . An easy induction shows that -attributes can never increase; hence, by part (1.) of Lemma 2, once this happens the attributes of all nodes in represent the minimal achievable risk at the given thresholds, and thus these attributes can never change again.

Denote . We proceed by backward induction on the depth of . Clearly, for each leaf of it holds , so the statement holds. Now let be any internal node of