# The Complexity of POMDPs with Long-run Average Objectives

We study the problem of approximation of optimal values in partially-observable Markov decision processes (POMDPs) with long-run average objectives. POMDPs are a standard model for dynamic systems with probabilistic and nondeterministic behavior in uncertain environments. In long-run average objectives rewards are associated with every transition of the POMDP and the payoff is the long-run average of the rewards along the executions of the POMDP. We establish strategy complexity and computational complexity results. Our main result shows that finite-memory strategies suffice for approximation of optimal values, and the related decision problem is recursively enumerable complete.


## 1 Introduction

In this work we present results related to strategy complexity and computational complexity of partially-observable Markov decision processes (POMDPs) with long-run average objectives.

POMDPs. Markov decision processes (MDPs) [17] are standard models for dynamic systems with probabilistic and nondeterministic behavior. MDPs provide the appropriate model to solve problems in sequential decision making such as control of probabilistic systems and probabilistic planning [27], where the nondeterminism represents the choice of actions (of the controller or planner) while the stochastic response of the system to control actions is represented by the probabilistic behavior. In perfect-observation (or perfect-information) MDPs, the controller observes the current state of the system precisely when resolving the nondeterministic choices among control actions. However, to model uncertainty in system observation the relevant model is partially observable MDPs (POMDPs), where the state space is partitioned according to observations that the controller can observe, i.e., the controller can only view the observation of the current state (the partition the state belongs to), but not the precise state [25]. POMDPs are widely used in several applications, such as computational biology [12], speech processing [24], image processing [11], software verification [5], and robot planning [21, 19, 20], to name a few. Even special cases of POMDPs, namely probabilistic automata or blind MDPs, where there is only one observation, are standard models in many applications [28, 26, 4].

Infinite-horizon objectives. A transition in a POMDP is the update of the current state to the next state given a control action, and a play is an infinite sequence of transitions. Transitions are associated with rewards and an objective function maps plays to real-valued rewards. Classical infinite-horizon objectives are [15, 27]: (a) the discounted sum of the rewards of the plays; and (b) the long-run average of the rewards of the plays. Another related class of infinite-horizon objectives is that of $\omega$-regular objectives: such an objective defines an $\omega$-regular set of plays and is the characteristic function of this set, assigning reward 1 to plays in the set and 0 otherwise.

Strategies. A strategy (or policy) is a recipe that describes how to extend finite prefixes of plays, i.e., given a sequence of past actions and observations, it describes the next action. A strategy is finite-memory if it only remembers a finite amount of information about the prefixes; formally, a finite-memory strategy can be described as a transducer with finite state space. A special class of finite-memory strategies is that of finite-recall strategies, which only remember the information of the last $n$ steps, for some finite $n$.

Values and approximation problems. Given a POMDP, an objective, and a starting state, the optimal value is the supremum over all strategies of the expected value of the objective.

• The approximation problem, given a POMDP, an objective function, a starting state and $\varepsilon > 0$, asks to compute the optimal value within an additive error of $\varepsilon$.

• The decision version of the approximation problem, given a POMDP, an objective function, a starting state, a number $\lambda$ and $\varepsilon > 0$, with the promise that the optimal value is either (a) at least $\lambda + \varepsilon$ or (b) at most $\lambda - \varepsilon$, asks to decide which is the case.

Previous results. The approximation problem for discounted-sum objectives is decidable for POMDPs (by reduction to finite-horizon objectives whose complexity has been well-studied [25, 2]). However, both for long-run average objectives and $\omega$-regular objectives the approximation problem is undecidable, even for blind MDPs [22]. Finer characterizations have been studied for probabilistic automata (or blind MDPs) with finite words and $\omega$-regular objectives specified in canonical form, such as parity or Rabin objectives [1, 16, 6, 7, 9, 14]. It has been established that even qualitative problems, which ask whether there is a strategy that ensures the objective with positive probability or probability 1, are $\Sigma_2^0$-complete for blind MDPs with $\omega$-regular objectives [7] (i.e., complete for the second level of the arithmetic hierarchy): first the undecidability of qualitative problems was established in [3], and the precise characterization was established in [7]. The $\Sigma_2^0$-completeness result is quite interesting because typically undecidable problems for automata on infinite words tend to be harder and lie in the analytical hierarchy (beyond the arithmetic hierarchy); for details see [6, 7].

Open questions. There are several open questions related to the approximation problem for POMDPs with long-run average objectives. The two most important ones are as follows. First, while for POMDPs with $\omega$-regular objectives there are characterizations in the arithmetic hierarchy, there are no such results for POMDPs with long-run average objectives. Second, given the importance of POMDPs with long-run average objectives, there are several algorithmic approaches based on belief state-space discretization (the belief state-space represents an infinite-state perfect-information MDP in which a belief state is a probability distribution over states of the POMDP); however, none of these approaches provides any theoretical guarantee.

Our contributions. Our main contributions are as follows:

• Strategy complexity. First, we show that for every blind MDP with long-run average objectives, for every $\varepsilon > 0$, there is a finite-recall strategy that achieves expected reward within $\varepsilon$ of the optimal value. Second, we show that for every POMDP with long-run average objectives, for every $\varepsilon > 0$, there is a finite-memory strategy that achieves expected reward within $\varepsilon$ of the optimal value. We also show that, in contrast to blind MDPs, finite-recall strategies do not suffice for $\varepsilon$-approximation in general POMDPs.

• Computational complexity. An important consequence of our above result is that the decision version of the approximation problem for POMDPs with long-run average objectives is recursively enumerable (R.E.)-complete. Our results on strategy complexity imply the recursively enumerable upper bound, by simply enumerating over all finite-memory strategies; and the recursively enumerable lower bound is a consequence of the undecidability result of [22].
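The enumeration idea behind the recursively enumerable upper bound can be illustrated on blind MDPs, where a pure finite-memory strategy is an eventually periodic action sequence. The Python sketch below is only an illustration under stated assumptions: the names `step`, `payoff_estimate` and `search`, the list-based encoding of transitions and rewards, the floating-point evaluation over a fixed number of cycles, and the bound `max_size` are hypothetical choices that do not come from the paper. A genuine semi-decision procedure would enumerate without bound and evaluate each strategy exactly.

```python
from itertools import product

def step(belief, action, q):
    """Push a belief one stage forward in a blind MDP.

    q[action][k][k2] is the probability of moving from state k to state k2.
    """
    new_belief = [0.0] * len(belief)
    for k, weight in enumerate(belief):
        for k2, prob in enumerate(q[action][k]):
            new_belief[k2] += weight * prob
    return new_belief

def payoff_estimate(belief, prefix, cycle, q, g, n_cycles=200):
    """Approximate long-run average reward of 'prefix, then cycle forever'.

    g[action][k] is the stage reward in state k. Rewards collected during
    the prefix are ignored: they do not affect the long-run average.
    """
    for action in prefix:
        belief = step(belief, action, q)
    total = 0.0
    for _ in range(n_cycles):
        for action in cycle:
            total += sum(w * g[action][k] for k, w in enumerate(belief))
            belief = step(belief, action, q)
    return total / (n_cycles * len(cycle))

def search(belief, q, g, threshold, max_size):
    """Return the first (prefix, cycle) whose estimate exceeds threshold.

    Strategies are enumerated by total length; None only means that no
    strategy of size at most max_size was found, not that none exists.
    """
    actions = range(len(q))
    for size in range(1, max_size + 1):
        for prefix_len in range(size):
            for prefix in product(actions, repeat=prefix_len):
                for cycle in product(actions, repeat=size - prefix_len):
                    if payoff_estimate(belief, prefix, cycle, q, g) > threshold:
                        return prefix, cycle
    return None

# Hypothetical toy blind MDP: two states that swap at every stage;
# action 0 pays 1 in state 0, action 1 pays 1 in state 1.
swap = [[0.0, 1.0], [1.0, 0.0]]
q = [swap, swap]
g = [[1.0, 0.0], [0.0, 1.0]]
found = search([0.25, 0.75], q, g, threshold=0.7, max_size=3)
```

In this toy model the search stops at the empty prefix followed by the cycle "action 1, then action 0". The bounded search and the numerical evaluation are simplifications made to keep the sketch short and terminating.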

Significance. To clarify the significance of our result, we mention some relationships between parity objectives (a canonical form for $\omega$-regular objectives) and long-run average objectives.

• First, there is a polynomial-time reduction of parity objectives to mean-reward objectives in perfect-information games [18]; however, no such polynomial-time reduction is known in the other direction.

• Second, for perfect-information MDPs there is a strongly polynomial-time reduction from parity objectives to long-run average objectives [10]; however, no such strongly polynomial-time reduction is known in the other direction. (The results of [10] show that the optimal value in perfect-information MDPs with parity objectives is the optimal value to reach the set of states from where the objective can be satisfied almost-surely, and the almost-sure set can be computed in strongly polynomial time. Since reachability objectives are a special case of long-run average objectives, this establishes a strongly polynomial-time reduction.)

• Third, in terms of topological complexity, parity objectives lie in the Boolean closure of the second levels of the Borel hierarchy [23, 32], whereas long-run average objectives are complete for the third level of the Borel hierarchy [8].

Thus in terms of analysis in related models (games and MDPs) and topological complexity, long-run average objectives are harder than parity objectives. In this light our results are significant and surprising in the following ways:

• First, there are examples of blind MDPs with parity objectives [3, 1] where there is an infinite-memory strategy to ensure the objective with probability 1, whereas for every finite-memory strategy the objective is satisfied with probability 0. Hence, finite-memory strategies do not guarantee any approximation for POMDPs with parity objectives. In contrast, we establish that for POMDPs with long-run average objectives finite-memory strategies suffice for $\varepsilon$-approximation, for all $\varepsilon > 0$.

• Second, for POMDPs with parity objectives, even qualitative problems are $\Sigma_2^0$-complete, whereas we show that for POMDPs with long-run average objectives the approximation problem is R.E.-complete (or $\Sigma_1^0$-complete).

• Finally, for POMDPs with parity objectives, algorithms based on belief-state discretization do not guarantee convergence to the value. In contrast, our result on the existence of finite-memory strategies for approximation suggests that algorithms based on appropriately chosen belief-state space discretization can ensure convergence to the optimal value for POMDPs with long-run average objectives.

In summary, though in related models and in terms of topological complexity the long-run average objectives are harder than parity objectives, quite surprisingly we show that for the approximation problem in POMDPs, in terms of strategy complexity, computational and algorithmic complexity, long-run average objectives are simpler than parity objectives.

## 2 Model and statement of results

### 2.1 Model

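For a finite set $C$, denote by $\Delta(C)$ the set of probability distributions over $C$.

Consider a POMDP $\Gamma = (K, A, S, q, g)$, with finite state space $K$, finite action set $A$, finite signal set $S$, transition function $q : K \times A \to \Delta(K \times S)$, and real-valued reward function $g$ defined on $K \times A$.
Given $p_1 \in \Delta(K)$, called the initial belief, the POMDP starting from $p_1$ is denoted by $\Gamma(p_1)$ and proceeds as follows:

• An initial state $k_1$ is drawn from $p_1$. The decision-maker knows $p_1$ but does not know $k_1$.

• At each stage $m \ge 1$, the decision-maker takes some action $a_m \in A$. He gets a stage reward $g_m := g(k_m, a_m)$, where $k_m$ is the state at stage $m$. Then, a pair $(k_{m+1}, s_m)$ is drawn from the probability $q(k_m, a_m)$: $k_{m+1}$ is the next state, and the decision-maker is informed of the signal $s_m$, but neither of the reward $g_m$ nor of the state $k_{m+1}$.

At stage $m$, the decision-maker remembers all the past actions and signals before stage $m$, which is called the history before stage $m$. Let $H_m := (A \times S)^{m-1}$ be the set of histories before stage $m$, with the convention $(A \times S)^0 = \{\emptyset\}$. A strategy is a mapping $\sigma : \cup_{m \ge 1} H_m \to A$. The set of strategies is denoted by $\Sigma$. The infinite sequence $(k_1, a_1, s_1, k_2, a_2, s_2, \dots)$ is called a play. For $\sigma \in \Sigma$ and $p_1 \in \Delta(K)$, define $\mathbb{P}_\sigma^{p_1}$ as the law induced by $\sigma$ and the initial belief $p_1$ on the set of plays $(K \times A \times S)^{\mathbb{N}}$, and $\mathbb{E}_\sigma^{p_1}$ as the expectation with respect to this law. For simplicity, denote $\mathbb{P}_\sigma^{k} := \mathbb{P}_\sigma^{\delta_k}$, for all $k \in K$, where $\delta_k$ stands for the Dirac measure over state $k$.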

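Let

$$\gamma_\infty^{p_1}(\sigma) := \mathbb{E}_\sigma^{p_1}\left(\liminf_{n \to +\infty} \frac{1}{n} \sum_{m=1}^{n} g_m\right),$$

and

$$v_\infty(p_1) = \sup_{\sigma \in \Sigma} \gamma_\infty^{p_1}(\sigma).$$

The term $\gamma_\infty^{p_1}(\sigma)$ is the long-term payoff given by strategy $\sigma$ and $v_\infty(p_1)$ is the optimal long-term payoff, called value, defined as the supremum payoff over all strategies. For simplicity, denote $\gamma_\infty^{k}(\sigma) := \gamma_\infty^{\delta_k}(\sigma)$ and $v_\infty(k) := v_\infty(\delta_k)$, for every $k \in K$.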

###### Remark 2.1.

In the literature, the concept of strategy that we defined is often called pure strategy, by contrast with mixed strategies, which use randomness. By Kuhn’s theorem, enlarging the set of pure strategies to mixed strategies does not change $v_\infty$ (see [33] and [13]), and thus does not change our results. Moreover, $v_\infty$ coincides with the limit of the value of the $n$-stage problem and of the $\lambda$-discounted problem, as well as with the uniform value and weighted uniform value (see [31, 29, 30, 33]).

###### Definition 2.2.

A POMDP is called a blind MDP if the signal set $S$ is a singleton.

Thus, in a blind MDP, the decision-maker does not get any relevant signal at any time, and a strategy is simply an infinite sequence of actions $(a_1, a_2, \dots) \in A^{\mathbb{N}}$.

### 2.2 Contribution

We start by defining several classes of strategies.

###### Definition 2.3.

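Let $p_1 \in \Delta(K)$ and $\varepsilon > 0$. A strategy $\sigma \in \Sigma$ is $\varepsilon$-optimal in $\Gamma(p_1)$ if it satisfies

$$\gamma_\infty^{p_1}(\sigma) \ge v_\infty(p_1) - \varepsilon.$$

###### Definition 2.4 (Finite memory strategy).

A strategy $\sigma$ is said to have finite memory if it can be modeled by a finite-state transducer. Formally, $\sigma = (\sigma_u, \sigma_a, M, m_0)$, where $M$ is a finite set of memory states, $m_0 \in M$ is the initial memory state, $\sigma_a : M \to A$ is the action selection function and $\sigma_u : M \times A \times S \to M$ is the memory update function.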
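As an illustration of Definition 2.4, the following Python sketch encodes a finite-memory strategy as a transducer and reads off the action prescribed after a history. The class name `FiniteMemoryStrategy`, its dictionary-based encoding, and the toy strategy built below (which remembers the first signal forever) are hypothetical and are not taken from the paper.

```python
class FiniteMemoryStrategy:
    """A strategy given by memory states, an action map and an update map."""

    def __init__(self, initial_memory, select_action, update_memory):
        self.initial_memory = initial_memory
        self.select_action = select_action  # dict: memory -> action
        self.update_memory = update_memory  # dict: (memory, action, signal) -> memory

    def action_after(self, history):
        """Action prescribed after a history [(a_1, s_1), ..., (a_{m-1}, s_{m-1})]."""
        memory = self.initial_memory
        for action, signal in history:
            memory = self.update_memory[(memory, action, signal)]
        return self.select_action[memory]

# Hypothetical example: three memory states that record the first signal.
ACTIONS = ["a", "b"]
SIGNALS = ["l", "r"]
update = {}
for action in ACTIONS:
    for signal in SIGNALS:
        update[("start", action, signal)] = "saw_" + signal
        update[("saw_l", action, signal)] = "saw_l"
        update[("saw_r", action, signal)] = "saw_r"
select = {"start": "a", "saw_l": "a", "saw_r": "b"}
strategy = FiniteMemoryStrategy("start", select, update)
```

Whatever the length of the history, the memory stays in a set of three states, and the prescribed action is determined by the very first signal received.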

###### Definition 2.5 (Finite recall strategy).

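A strategy $\sigma$ is said to have finite recall if it only needs the last portion of the history to make decisions. Formally, there exists $N \ge 1$ such that for all $m > N$ and all histories $(a_1, s_1, \dots, a_{m-1}, s_{m-1})$, the action $\sigma(a_1, s_1, \dots, a_{m-1}, s_{m-1})$ does not depend on $(a_1, s_1, \dots, a_{m-N-1}, s_{m-N-1})$.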

###### Definition 2.6 (Eventually periodic strategy).

Assume that $\Gamma$ is a blind MDP. A strategy $\sigma = (a_1, a_2, \dots)$ is eventually periodic if there exist $T \ge 1$ and $N \ge 1$ such that for all $m \ge N$, $a_{m+T} = a_m$.

###### Remark 2.7.

An eventually periodic strategy has finite recall, and a finite recall strategy has finite memory. But the converse is not true: finite memory strategies do not necessarily have finite recall.

Our main result is the following:

###### Theorem 2.8.

For every POMDP $\Gamma$, initial belief $p_1$ and $\varepsilon > 0$, there exists an $\varepsilon$-optimal strategy in $\Gamma(p_1)$ that has finite memory.

In addition, the result for the blind MDP case is stronger:

###### Theorem 2.9.

For every blind MDP $\Gamma$, initial belief $p_1$ and $\varepsilon > 0$, there exists an $\varepsilon$-optimal strategy in $\Gamma(p_1)$ that is eventually periodic, and thus has finite recall.

An important consequence of Theorem 2.8 is R.E.-completeness of the decision version of the approximation problem as mentioned in Section 1.

The rest of the paper is organized as follows. Section 3 recalls a result in [33] that plays an important role in the proof. Section 4 proves Theorem 2.9. The proof is also useful to understand the main ingredients of the proof of Theorem 2.8, which is presented in Section 5.

## 3 Beliefs and an ergodic property

The following concept plays a key role in the study of POMDPs and in our paper.

###### Definition 3.1.

Given a strategy $\sigma \in \Sigma$ and an initial belief $p_1 \in \Delta(K)$, the belief at stage $m \ge 1$ of the decision-maker is the random variable $p_m$ taking values in $\Delta(K)$, such that for all $k \in K$,

$$p_m(k) = \mathbb{P}_\sigma^{p_1}(k_m = k \mid h_m),$$

where $h_m$ is the (random) history up to stage $m$.

Conditional on signal $s_m$, the belief $p_m$ is updated at the end of stage $m$ into belief $p_{m+1}$, using Bayes rule. To avoid heavy notation, we omit the dependence of $p_m$ on $\sigma$ and $p_1$.
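The Bayes update of the belief can be sketched in Python as follows. This is a minimal illustration, assuming the transition function is stored as nested dictionaries; the function name `update_belief`, this encoding, and the small two-state model used as an example are hypothetical and do not come from the paper.

```python
from fractions import Fraction

def update_belief(belief, action, signal, q):
    """Bayes update of a belief after playing `action` and observing `signal`.

    belief: dict state -> probability.
    q: dict (state, action) -> dict (next_state, signal) -> probability.
    Returns the posterior over next states, or None if the signal has
    probability zero under the current belief.
    """
    unnormalized = {}
    for k, weight in belief.items():
        for (k_next, s), prob in q[(k, action)].items():
            if s == signal:
                unnormalized[k_next] = unnormalized.get(k_next, 0) + weight * prob
    total = sum(unnormalized.values())
    if total == 0:
        return None
    return {k: w / total for k, w in unnormalized.items()}

# Hypothetical model: from "x" the signal reveals whether the state moved
# to "y"; from "y" the state stays put and the signal is pure noise.
half = Fraction(1, 2)
q = {
    ("x", "a"): {("x", "l"): half, ("y", "r"): half},
    ("y", "a"): {("y", "l"): half, ("y", "r"): half},
}
prior = {"x": half, "y": half}
after_l = update_belief(prior, "a", "l", q)
after_r = update_belief(prior, "a", "r", q)
```

Exact rational arithmetic is used only to make the example easy to check by hand; floating-point probabilities would work the same way.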

###### Remark 3.2.

The belief synthesizes all the relevant information available to the player about the current state. Note that it is not known whether for each $\varepsilon > 0$, there exist $\varepsilon$-optimal strategies that are belief-stationary: these are strategies such that at each stage $m$, the action taken depends on the current belief $p_m$ only. Nonetheless, even if this result was known to hold, it would not be of much help in the proof of Theorem 2.8. Indeed, the set of beliefs is infinite, and computing $p_m$ at each stage may require memory that grows to infinity as $m$ goes to infinity. Thus, a belief-stationary strategy may not have finite memory.

For $p \in \Delta(K)$, the support of $p$, denoted by $\mathrm{supp}(p)$, is the set of $k \in K$ such that $p(k) > 0$. In [33], it was proven that $v_\infty$ is equal to the limit of the $n$-stage value, as $n$ grows to infinity. In this paper, we make use of the results in [33, Lemmas 33-34, p. 2004-2005], which are summarized below:

###### Lemma 3.3.

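(Venel and Ziliotto 2016) For any initial belief $p_1$ and $\varepsilon > 0$, there exist $m_\varepsilon^0 \ge 1$, $\sigma_\varepsilon \in \Sigma$ and a belief $p^* \in \Delta(K)$ (which depends on the history up to stage $m_\varepsilon^0$) such that:

1. There is an event of $\mathbb{P}_{\sigma_\varepsilon}^{p_1}$-probability at least $1 - \varepsilon$ on which

$$\left\| p_{m_\varepsilon^0} - p^* \right\|_1 \le \varepsilon.$$

2. There exists $\sigma \in \Sigma$ such that for all $k \in \mathrm{supp}(p^*)$,

$$\frac{1}{n} \sum_{m=1}^{n} g_m \to \gamma_\infty^{k}(\sigma) \quad \mathbb{P}_\sigma^{k}\text{-a.s.} \tag{1}$$

Moreover, $\gamma_\infty^{p^*}(\sigma) = v_\infty(p^*)$ and $\mathbb{E}_{\sigma_\varepsilon}^{p_1}\left(v_\infty(p^*)\right) \ge v_\infty(p_1) - \varepsilon$.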

###### Remark 3.4.

Note that $\mathbb{P}_\sigma^{k}$ represents the law on plays induced by the strategy $\sigma$, conditional on the fact that the initial state is $k$. This does not mean that we consider that the decision-maker knows $k$. In the same fashion, $\gamma_\infty^{k}(\sigma)$ is the reward given by the strategy $\sigma$, conditional on the fact that the initial state is $k$. Even though $\sigma$ is 0-optimal in $\Gamma(p^*)$, this does not imply that $\sigma$ is 0-optimal in $\Gamma(\delta_k)$: we may have $\gamma_\infty^{k}(\sigma) < v_\infty(k)$.
Another important point of this result is that all the average payoffs converge almost surely to a limit that only depends on the initial state $k$. This is a key ingredient in the proof of Theorems 2.8 and 2.9.

Intuitively, this result means that for any initial belief $p_1$, after a finite number of stages, we can get $\varepsilon$-close to a belief $p^*$ such that the optimal payoff from $p^*$ is in expectation almost the same as from $p_1$, and moreover from $p^*$ there exists a 0-optimal strategy $\sigma$ that induces a strong ergodic behavior on the state dynamics. Thus, there is a natural way to build a nearly optimal strategy in $\Gamma(p_1)$: first, apply the strategy $\sigma_\varepsilon$ for $m_\varepsilon^0$ stages, then apply $\sigma$. Since after $m_\varepsilon^0$ steps the current belief is $\varepsilon$-close to $p^*$ with probability higher than $1 - \varepsilon$, the reward from playing this strategy is at least the expectation of $v_\infty(p^*)$ up to an error of order $\varepsilon$, which is in turn at least $v_\infty(p_1)$ up to an error of order $\varepsilon$. Thus, this procedure yields a strategy that is optimal up to an error of order $\varepsilon$. Nonetheless, $\sigma$ may not have finite memory, and thus this strategy may not have finite memory either.

The idea of the proofs of both Theorems 2.8 and 2.9 is to turn $\sigma$ into a finite memory strategy, making use of the ergodic property in Lemma 3.3. We first consider the case of blind MDPs, as it introduces all key ideas needed to prove Theorem 2.8.

## 4 Blind MDPs

As described in Section 2.1, a blind MDP is a POMDP where there is only one possible signal, so that the decision-maker gets no information about the state at any stage. In this section, we assume that $\Gamma$ is a blind MDP. In this framework, the transition can be viewed as a function $q : K \times A \to \Delta(K)$, and a strategy is simply an infinite sequence of actions: $(a_1, a_2, \dots) \in A^{\mathbb{N}}$. Fix an initial belief $p_1 \in \Delta(K)$.

An important feature of this model is that, for every strategy, the belief process is deterministic. Consequently, Lemma 3.3 simplifies as follows:

###### Lemma 4.1.

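There exists a (deterministic) belief $p^* \in \Delta(K)$ such that:

1. $p^*$ can be (approximately) reached from $p_1$: for all $\varepsilon > 0$, there exist $m_\varepsilon^0 \ge 1$ and $\sigma_\varepsilon \in \Sigma$ such that under $\sigma_\varepsilon$, $\left\| p_{m_\varepsilon^0} - p^* \right\|_1 \le \varepsilon$.

2. There exists $\sigma \in \Sigma$ such that for all $k \in \mathrm{supp}(p^*)$,

$$\frac{1}{n} \sum_{m=1}^{n} g_m \to \gamma_\infty^{k}(\sigma) \quad \mathbb{P}_\sigma^{k}\text{-a.s.}, \tag{2}$$

and $\gamma_\infty^{p^*}(\sigma) = v_\infty(p^*) = v_\infty(p_1)$.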

From now on, $\varepsilon > 0$ is fixed, and let $p^*$, $\sigma$, $m_\varepsilon^0$ and $\sigma_\varepsilon$ be as in Lemma 4.1. We are going to prove that in the problem starting from $p^*$, there exists an eventually periodic $\varepsilon$-optimal strategy $\sigma'$. As we shall see, this property easily implies Theorem 2.9: indeed, playing $\sigma_\varepsilon$ until stage $m_\varepsilon^0$, and then switching to $\sigma'$, constitutes an eventually periodic $2\varepsilon$-optimal strategy.

The following concept plays a key role in the proof.

###### Definition 4.2.

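The $m$-shift of $\sigma = (a_1, a_2, \dots)$ is the strategy $\sigma^m := (a_m, a_{m+1}, \dots)$.

Define an equivalence relation over $\mathrm{supp}(p^*)$ as follows: $k \sim k'$ if and only if $\gamma_\infty^{k}(\sigma) = \gamma_\infty^{k'}(\sigma)$, and let $K_1, \dots, K_I$ be the associated partition: $\mathrm{supp}(p^*) = K_1 \sqcup \dots \sqcup K_I$. Denote, for $i \in \{1, \dots, I\}$, by $\gamma_\infty^{i}$ the common value of $\gamma_\infty^{k}(\sigma)$ for $k \in K_i$.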

###### Definition 4.3.

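The super-support at stage $m$ is the $I$-tuple $(B_m^1, \dots, B_m^I)$, where for each $i$, $B_m^i$ is the set of states that are reached with positive probability at stage $m$ under $\sigma$, starting from an initial state in $K_i$. Formally, $B_m^i := \{k \in K : \exists\, k_1 \in K_i,\ \mathbb{P}_\sigma^{k_1}(k_m = k) > 0\}$.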
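To make the definition concrete, the following Python sketch tracks the super-support of a blind MDP along a finite sequence of actions. The function name `super_supports`, the dictionary encoding of the transition function, and the two-state toy model are hypothetical and serve only as an illustration.

```python
def super_supports(q, classes, actions):
    """Super-supports at stages 1, 2, ..., len(actions) + 1 in a blind MDP.

    q[(k, a)] maps each next state to its probability; classes lists the
    sets K_1, ..., K_I; actions is a finite prefix of the strategy.
    """
    current = tuple(frozenset(block) for block in classes)
    history = [current]
    for action in actions:
        current = tuple(
            frozenset(
                k2
                for k in block
                for k2, prob in q[(k, action)].items()
                if prob > 0
            )
            for block in current
        )
        history.append(current)
    return history

# Hypothetical toy model: two states that swap at every stage.
q = {(k, a): {("y" if k == "x" else "x"): 1.0} for k in "xy" for a in "ab"}
stages = super_supports(q, [{"x"}, {"y"}], ["a", "b", "a"])
supports = [frozenset().union(*blocks) for blocks in stages]
```

In this toy model the support of the belief is the same at every stage, while the super-support alternates between two tuples, so it carries strictly more information than the support.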

Given a very large integer $n_0$ and starting from $p^*$, we are going to build an $\varepsilon$-optimal strategy $\sigma'$ such that, for every $s \ge 0$, the sequence of actions played between stages $s n_0 + 1$ and $(s+1) n_0$ depends only on the super-support at the start of the block. Since the set of possible super-supports is finite and the evolution of super-supports in time is deterministic, this type of strategy is eventually periodic. For this we prove the following crucial property of the super-support, which shows that from any initial state lying in $B_m^i$, the strategy $\sigma^m$ yields the same reward as the strategy $\sigma$ starting from an initial state in $K_i$.

###### Lemma 4.4.

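For all $m \ge 1$, $i \in \{1, \dots, I\}$ and $k \in B_m^i$,

$$\gamma_\infty^{k}(\sigma^m) = \gamma_\infty^{i}.$$

Consequently, $B_m^1, \dots, B_m^I$ are pairwise disjoint.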

###### Proof.

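Let $m \ge 1$ and $k \in B_m^i$. There exists $k' \in K_i$ such that $k$ is reached with positive probability from $k'$: $\mathbb{P}_\sigma^{k'}(k_m = k) > 0$. By Lemma 4.1, we have

$$\frac{1}{n} \sum_{m'=m}^{m+n-1} g_{m'} \to \gamma_\infty^{k'}(\sigma) = \gamma_\infty^{i} \quad \mathbb{P}_\sigma^{k'}\text{-a.s.},$$

thus in particular

$$\frac{1}{n} \sum_{m'=1}^{n} g_{m'} \to \gamma_\infty^{i} \quad \mathbb{P}_{\sigma^m}^{k}\text{-a.s.},$$

and $\gamma_\infty^{k}(\sigma^m) = \gamma_\infty^{i}$.

Consequently, if $k \in B_m^i \cap B_m^{i'}$, then $\gamma_\infty^{i} = \gamma_\infty^{i'}$, thus $i = i'$. ∎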

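The set $\{(B_m^1, \dots, B_m^I) : m \ge 1\}$ is a finite set, thus it can be written $\{(B_{m_j}^1, \dots, B_{m_j}^I) : 1 \le j \le J\}$, where $m_1 = 1$ and the tuples are pairwise distinct.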

#### Definition of the strategy σ′.

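Recall that $\varepsilon > 0$ is fixed. By Lemma 4.4, there exists $n_0 \ge 1$ such that for all $j \in \{1, \dots, J\}$, $i \in \{1, \dots, I\}$ and $k \in B_{m_j}^i$,

$$\mathbb{E}_{\sigma^{m_j}}^{k}\left(\frac{1}{n_0} \sum_{m=1}^{n_0} g_m\right) \ge \gamma_\infty^{i} - \varepsilon.$$

Define recursively a sequence $(j_s)_{s \ge 0}$ by: $j_0 = 1$, and $j_{s+1}$ is such that the super-support at stage $m_{j_{s+1}}$ equals the super-support at stage $m_{j_s} + n_0$. Define the strategy $\sigma'$ by: for all $s \ge 0$, play the sequence of actions $(a_{m_{j_s}}, \dots, a_{m_{j_s} + n_0 - 1})$ between stages $s n_0 + 1$ and $(s+1) n_0$. In words, during each time block $\{s n_0 + 1, \dots, (s+1) n_0\}$, the decision-maker plays as if he was playing $\sigma$ from stage $m_{j_s}$.

For ease of notation, write $B_j^i$ for $B_{m_j}^i$. The following lemma states the properties of this collection of partitions.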

###### Lemma 4.5.

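For all $s \ge 0$ and $i \in \{1, \dots, I\}$, we have that $\mathbb{P}_{\sigma'}^{p^*}$-almost surely

$$k_{s n_0 + 1} \in B_{j_s}^i \ \text{ if and only if } \ k_1 \in K_i.$$

Consequently, $(B_{j_s}^1, \dots, B_{j_s}^I)$ is a partition of the support of the belief at stage $s n_0 + 1$, and

$$\mathbb{P}_{\sigma'}^{p^*}\left(k_{s n_0 + 1} \in B_{j_s}^i\right) = p^*(K_i).$$

###### Proof.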

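By Lemma 4.4, $B_{j_s}^1, \dots, B_{j_s}^I$ are disjoint. Let us prove the result by induction on $s$. For $s = 0$, $m_{j_0} = 1$ and $B_{j_0}^i = K_i$, thus the result holds. Assume $s \ge 1$. Starting from a state in $B_{j_{s-1}}^i$ and playing $\sigma^{m_{j_{s-1}}}$ during $n_0$ stages leads the state to be in $B_{j_s}^i$. Consequently, if $k_{(s-1) n_0 + 1} \in B_{j_{s-1}}^i$, then $k_{s n_0 + 1} \in B_{j_s}^i$.

Conversely, assume $k_{s n_0 + 1} \in B_{j_s}^i$. By induction hypothesis, $k_{(s-1) n_0 + 1}$ lies in $B_{j_{s-1}}^{i'}$ for some $i'$. By what precedes, $k_{s n_0 + 1} \in B_{j_s}^{i'}$, and thus by Lemma 4.4, $i' = i$.

To summarize, we have proved that $k_{s n_0 + 1} \in B_{j_s}^i$ if and only if $k_{(s-1) n_0 + 1} \in B_{j_{s-1}}^i$. By induction hypothesis, we deduce that $k_{s n_0 + 1} \in B_{j_s}^i$ if and only if $k_1 \in K_i$. ∎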

To finish the proof of Theorem 2.9, the idea is that if at stage $s n_0 + 1$ the state is $k \in B_{j_s}^i$, and $\sigma^{m_{j_s}}$ is played, this induces an expected average reward of approximately $\gamma_\infty^{i}$ in this block. The key point is that this is true for any $k \in B_{j_s}^i$. Thus, if the belief at stage $s n_0 + 1$ is $p$, the expected average reward in this block is approximately $\sum_i p(B_{j_s}^i)\, \gamma_\infty^{i} = \sum_i p^*(K_i)\, \gamma_\infty^{i}$, and thus summing the reward on all the blocks yields approximately $\gamma_\infty^{p^*}(\sigma) = v_\infty(p_1)$. Here we have used the fundamental property stated in the previous lemma: under strategy $\sigma'$, the weight on $B_{j_s}^i$ is independent of $s$.

The above informal argument is the intuition behind the following lemma, and the proof is relegated to the appendix:

###### Lemma 4.6.

$$\mathbb{E}_{\sigma'}^{p^*}\left(\frac{1}{S n_0} \sum_{m=1}^{S n_0} g_m\right) \ge v_\infty(p_1) - \varepsilon.$$

###### Proof of Theorem 2.9.

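Since the strategy $\sigma'$ is eventually periodic, the average rewards converge almost surely. Thus, the above lemma yields

$$\gamma_\infty^{p^*}(\sigma') = \mathbb{E}_{\sigma'}^{p^*}\left(\lim_{n \to +\infty} \frac{1}{n} \sum_{m=1}^{n} g_m\right) \ge v_\infty(p_1) - \varepsilon.$$

Define the strategy $\sigma_0$ as follows: play $\sigma_\varepsilon$ until stage $m_\varepsilon^0$, then switch to $\sigma'$. Note that $\sigma_0$ is eventually periodic. We have

$$\gamma_\infty^{p_1}(\sigma_0) = \gamma_\infty^{p_{m_\varepsilon^0}}(\sigma') \ge \gamma_\infty^{p^*}(\sigma') - \varepsilon \ge v_\infty(p_1) - 2\varepsilon,$$

where the first inequality stems from Lemma 4.1, and the theorem is proved. ∎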

###### Remark 4.7.

The finite memory strategy we built mainly relies on the computation of the super-support. We would like to emphasize that it is crucial to consider super-support, and that keeping track of the support is not enough, as shown in the following example.

###### Example 4.8.

Consider two states, $x$ and $y$, two actions, $a$ and $b$, and an initial belief $p_1$ with full support that is not uniform. Regardless of the actions played, the state goes from $x$ to $y$ with probability 1, and from $y$ to $x$ with probability 1. The action $a$ is good in state $x$ (reward 1) and bad in state $y$ (reward 0), and action $b$ is good in state $y$ (reward 1) and bad in state $x$ (reward 0).

A strategy that only depends on the support of the belief plays either $a$ always, or $b$ always, and thus achieves a long-term payoff of $1/2$. By contrast, if at each stage $m$ the decision-maker knows not only the support of the belief but also whether the weight on state $x$ comes from state $x$ or $y$, then he knows whether the weight of the belief on $x$ is $p_1(x)$ or $p_1(y)$, and can play $a$ when this weight is the larger one and $b$ otherwise, securing a long-term payoff of $\max(p_1(x), p_1(y))$.
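The payoffs in this example can be checked with a short exact computation. The Python sketch below is an illustration only: the state names `x` and `y`, the action names `a` and `b`, the initial belief $(1/4, 3/4)$ and the helper `average_reward` are assumptions made for the sake of the example, since the concrete values are not recoverable from the text above.

```python
from fractions import Fraction

SWAP = {"x": "y", "y": "x"}          # the state flips at every stage
GOOD_STATE = {"a": "x", "b": "y"}    # action -> state in which it pays 1

def average_reward(initial_belief, action_at_stage, n_stages):
    """Expected average reward over n_stages for a blind strategy.

    action_at_stage maps the stage number (starting at 1) to an action.
    """
    belief = dict(initial_belief)
    total = Fraction(0)
    for m in range(1, n_stages + 1):
        # Expected stage reward = weight on the state where the action is good.
        total += belief[GOOD_STATE[action_at_stage(m)]]
        belief = {SWAP[k]: w for k, w in belief.items()}
    return total / n_stages

# Assumed initial belief: weight 1/4 on x and 3/4 on y.
belief = {"x": Fraction(1, 4), "y": Fraction(3, 4)}
always_a = average_reward(belief, lambda m: "a", 100)
always_b = average_reward(belief, lambda m: "b", 100)
alternating = average_reward(belief, lambda m: "b" if m % 2 == 1 else "a", 100)
```

With this assumed belief, both constant strategies average $1/2$, whereas alternating between $b$ at odd stages and $a$ at even stages, which is what tracking the origin of the weight allows, averages $3/4$.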

## 5 POMDPs

In this setting, the decision-maker is given a signal after each action which is related to the current state of the system. Thus, strategies can depend on past actions and received signals. We refer to Section 2.1 for the formal definition of a POMDP.
In comparison with the blind MDP case, the belief process is now random, since it depends on the signals obtained. In other words, the same sequence of actions may lead to different beliefs when different signals are received. Therefore, the concepts of m-shift and super-support introduced before need to depend on the history instead of on time. With this slight adjustment, the proof follows naturally. The finite recall property is lost in this proof because, whenever we have to “restart” the strategy, we still need to know how we got there in the first place. Therefore, instead of remembering the last steps of the history, one needs the first steps to keep track of the correct actions. This is explained further in Remark 5.6.

Fix an initial belief $p_1$ and $\varepsilon > 0$, and consider $m_\varepsilon^0$, $\sigma_\varepsilon$, $p^*$ and $\sigma$ as defined in Lemma 3.3. By Lemma 3.3, it is enough to prove that there exists a finite-memory $\varepsilon$-optimal strategy starting from $p^*$.

The counterpart of m-shift and super-support are formally defined as follows.

###### Definition 5.1.

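For $h_m \in H_m$, the $h_m$-shift of $\sigma$ is the strategy $\sigma[h_m]$ defined as follows: for all $m' \ge 1$ and $h_{m'} \in H_{m'}$, $\sigma[h_m](h_{m'}) := \sigma(h_m h_{m'})$, where $h_m h_{m'}$ denotes the concatenation of the two histories.

Define an equivalence relation over $\mathrm{supp}(p^*)$ as follows: $k \sim k'$ if and only if $\gamma_\infty^{k}(\sigma) = \gamma_\infty^{k'}(\sigma)$, and let $K_1, \dots, K_I$ be the associated partition: $\mathrm{supp}(p^*) = K_1 \sqcup \dots \sqcup K_I$. Denote, for $i \in \{1, \dots, I\}$, by $\gamma_\infty^{i}$ the common value of $\gamma_\infty^{k}(\sigma)$ for $k \in K_i$.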

###### Definition 5.2.

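The super-support after history $h_m$ is the $I$-tuple $(B_{h_m}^1, \dots, B_{h_m}^I)$, where for each $i$, $B_{h_m}^i$ is the set of states that are reached with positive probability after history $h_m$, under $\sigma$ and starting from an initial state in $K_i$. Formally, $B_{h_m}^i := \{k \in K : \exists\, k_1 \in K_i,\ \mathbb{P}_\sigma^{k_1}(h_m, k_m = k) > 0\}$.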

Informally, substituting this notation into Section 4 leads to the desired result: both sections have the same structure, and the proofs were made as similar as possible for the sake of understanding. Nonetheless, there are some additional technicalities in the POMDP case that must be addressed.

Given a very large integer and starting from , we are going to build an -optimal strategy such that the actions taken between stages and depends only on the signals received after stage and of . Since the set of possible super-supports is finite, this type of strategy requires only finite memory. For , define .

###### Lemma 5.3.

For all histories $h_m$, all $i \in \{1, \dots, I\}$ and all $k \in B_{h_m}^i$,

$$\gamma_\infty^{k}(\sigma[h_m]) = \gamma_\infty^{i}.$$

###### Proof.

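Let $k \in B_{h_m}^i$. There exists $k' \in K_i$ such that $k$ is reached with positive probability from $k'$ after history $h_m$: $\mathbb{P}_\sigma^{k'}(h_m, k_m = k) > 0$. We have

$$\frac{1}{n} \sum_{m'=m}^{m+n-1} g_{m'} \to \gamma_\infty^{k'}(\sigma) = \gamma_\infty^{i} \quad \mathbb{P}_\sigma^{k'}\text{-a.s.},$$

thus in particular the convergence holds on the event, of positive probability, that the history before stage $m$ is $h_m$ and the state at stage $m$ is $k$. Then,

$$\frac{1}{n} \sum_{m'=1}^{n} g_{m'} \to \gamma_\infty^{i} \quad \mathbb{P}_{\sigma[h_m]}^{k}\text{-a.s.},$$

and $\gamma_\infty^{k}(\sigma[h_m]) = \gamma_\infty^{i}$. ∎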

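The set of super-supports $\{(B_{h}^1, \dots, B_{h}^I) : h \in \cup_{m \ge 1} H_m\}$ is a finite set, thus it can be written as $\{(B_{h_j}^1, \dots, B_{h_j}^I) : 1 \le j \le J\}$ for some histories $h_1, \dots, h_J$, where the tuples $(B_{h_j}^1, \dots, B_{h_j}^I)$ are distinct.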

#### Definition of the strategy σ′.

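Recall that $\varepsilon > 0$ is fixed. By Lemma 5.3, there exists $n_0 \ge 1$ such that for all $j \in \{1, \dots, J\}$, $i \in \{1, \dots, I\}$ and $k \in B_{h_j}^i$,

$$\mathbb{E}_{\sigma[h_j]}^{k}\left(\frac{1}{n_0} \sum_{m=1}^{n_0} g_m\right) \ge \gamma_\infty^{i}(\sigma) - \varepsilon.$$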

Define the strategy by induction such that for all , for all , plays only according to the history between stage and , and on a variable that is updated at each stage according to history.
Set . Let and . Let , and update in the following way: define such that , where is the history between stages and . Note that takes always values in .
In words, during each time block $\{s n_0 + 1, \dots, (s+1) n_0\}$, the decision-maker plays as if he were playing $\sigma$ from history $h_{j_s}$.
For sake of notation, denote and .

Now we can state another property of the super-support.

###### Lemma 5.4.

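For all $s \ge 0$ and $i \in \{1, \dots, I\}$, we have that $\mathbb{P}_{\sigma'}^{p^*}$-almost surely

$$k_{s n_0 + 1} \in B_{j_s}^i \ \text{ if and only if } \ k_1 \in K_i.$$

Consequently, $(B_{j_s}^1, \dots, B_{j_s}^I)$ is a partition of the support of the belief at stage $s n_0 + 1$, and

$$\mathbb{P}_{\sigma'}^{p^*}\left(k_{s n_0 + 1} \in B_{j_s}^i\right) = p^*(K_i).$$

###### Proof.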

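By Lemma 5.3, the $B_{j_s}^i$, $i \in \{1, \dots, I\}$, are disjoint. Let us prove the result by induction on $s$. For $s = 0$, the history is empty and $B_{j_0}^i = K_i$, thus the result holds. Assume $s \ge 1$. Starting from a state in $B_{j_{s-1}}^i$, playing $\sigma[h_{j_{s-1}}]$ during $n_0$ stages and collecting history $h$ (the history between stages $(s-1) n_0 + 1$ and $s n_0$) leads the state to be in $B_{j_s}^i$, by definition of the index $j_s$. Consequently, if $k_{(s-1) n_0 + 1} \in B_{j_{s-1}}^i$, then $k_{s n_0 + 1} \in B_{j_s}^i$.

Conversely, assume $k_{s n_0 + 1} \in B_{j_s}^i$. By induction hypothesis, $k_{(s-1) n_0 + 1}$ lies in $B_{j_{s-1}}^{i'}$ for some $i'$. By what precedes, $k_{s n_0 + 1} \in B_{j_s}^{i'}$, and thus by Lemma 5.3, $i' = i$.

To summarize, we have proved that $k_{s n_0 + 1} \in B_{j_s}^i$ if and only if $k_{(s-1) n_0 + 1} \in B_{j_{s-1}}^i$. By induction hypothesis, we deduce that $k_{s n_0 + 1} \in B_{j_s}^i$ if and only if $k_1 \in K_i$. ∎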

To finish the proof, the idea is that if at stage $s n_0 + 1$ the state is $k \in B_{j_s}^i$, and $\sigma[h_{j_s}]$ is played, this induces an expected average reward of approximately $\gamma_\infty^{i}$ in this block. Thus, if the belief at stage $s n_0 + 1$ is $p$, the expected average reward in this block is approximately $\sum_i p(B_{j_s}^i)\, \gamma_\infty^{i}$, and thus summing the reward on all the blocks yields approximately $\gamma_\infty^{p^*}(\sigma)$. This is the intuition behind the following lemma, whose proof is relegated to the appendix:

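###### Lemma 5.5.

$$\mathbb{E}_{\sigma'}^{p^*}\left(\frac{1}{S n_0} \sum_{m=1}^{S n_0} g_m\right) \ge \gamma_\infty^{p^*}(\sigma) - \varepsilon.$$

###### Proof of Theorem 2.8.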

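Since the strategy $\sigma'$ has finite memory, the average rewards converge almost surely. It follows from Lemma 5.5 that

$$\gamma_\infty^{p^*}(\sigma') = \mathbb{E}_{\sigma'}^{p^*}\left(\lim_{n \to +\infty} \frac{1}{n} \sum_{m=1}^{n} g_m\right) \ge \gamma_\infty^{p^*}(\sigma) - \varepsilon.$$

Define the strategy $\sigma_0$ as follows: play $\sigma_\varepsilon$ until stage $m_\varepsilon^0$, then switch to $\sigma'$. Note that $\sigma_0$ has finite memory. We have

$$\gamma_\infty^{p_1}(\sigma_0) = \mathbb{E}_{\sigma_\varepsilon}^{p_1}\left(\gamma_\infty^{p_{m_\varepsilon^0}}(\sigma')\right) \ge \mathbb{E}_{\sigma_\varepsilon}^{p_1}\left(\gamma_\infty^{p^*}(\sigma')\right) - \varepsilon \ge \mathbb{E}_{\sigma_\varepsilon}^{p_1}\left(\gamma_\infty^{p^*}(\sigma)\right) - 2\varepsilon \ge v_\infty(p_1) - 3\varepsilon,$$

and the theorem is proved. ∎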

###### Remark 5.6.

Finite recall is not enough to obtain $\varepsilon$-optimality in the general POMDP case, as shown in the example below.

###### Example 5.7.

The idea described in this example is that one needs to know how one got to the current situation to decide what actions to take, and it is not enough to know the last steps. This POMDP has five states: an initial state from where a random initial signal is given, a dump state from which it is impossible to get out and where the reward is zero, and two sub-systems where the optimal strategy is trivial. The key idea is that there is an arbitrarily long sequence of actions and signals which can be obtained in both sub-systems, but the optimal strategies behave differently in each of them; therefore, forgetting the initial signal of the POMDP leads to at most half of the optimal value.

The following is a representation of : first under action and then action . Each state is followed by the corresponding reward, and the arrows include the probability for the corresponding transition along with the signal obtained. In this case, each transition allows only one signal.