1 Introduction
In this work we present results on the strategy complexity and computational complexity of partially observable Markov decision processes (POMDPs) with long-run average objectives.
POMDPs. Markov decision processes (MDPs) [17] are standard models for dynamic systems with probabilistic and nondeterministic behavior. MDPs provide the appropriate model for problems in sequential decision making such as control of probabilistic systems and probabilistic planning [27], where the nondeterminism represents the choice of actions (of the controller or planner) while the stochastic response of the system to control actions is represented by the probabilistic behavior. In perfect-observation (or perfect-information) MDPs, the controller observes the current state of the system precisely when resolving the nondeterministic choices among control actions. However, to model uncertainty in system observation the relevant model is partially observable MDPs (POMDPs), where the state space is partitioned according to observations that the controller can observe, i.e., the controller can only view the observation of the current state (the partition the state belongs to), but not the precise state [25]. POMDPs are widely used in several applications, such as computational biology [12], speech processing [24], image processing [11], software verification [5], and robot planning [21, 19, 20], to name a few. Even special cases of POMDPs, namely probabilistic automata or blind MDPs, where there is only one observation, are standard models in many applications [28, 26, 4].
Infinite-horizon objectives. A transition in a POMDP is the update of the current state to the next state given a control action, and a play is an infinite sequence of transitions. Transitions are associated with rewards, and an objective function maps plays to real-valued rewards. Classical infinite-horizon objectives are [15, 27]: (a) the discounted sum of the rewards of the plays; and (b) the long-run average of the rewards of the plays. Another related class of infinite-horizon objectives is that of $\omega$-regular objectives: such an objective defines an $\omega$-regular set of plays and is the characteristic function that assigns reward 1 to plays in this set and 0 otherwise.
Strategies. A strategy (or policy) is a recipe that describes how to extend finite prefixes of plays, i.e., given a sequence of past actions and observations, it describes the next action. A strategy is finite-memory if it remembers only a finite amount of information about the prefixes, i.e., formally, a finite-memory strategy can be described as a transducer with finite state space. A special class of finite-memory strategies is that of finite-recall strategies, which remember only the information of a fixed finite number of the most recent steps.
Values and approximation problems. Given a POMDP, an objective, and a starting state, the optimal value is the supremum over all strategies of the expected value of the objective.

The approximation problem, given a POMDP, an objective function, a starting state, and $\varepsilon > 0$, asks to compute the optimal value within an additive error of $\varepsilon$.

The decision version of the approximation problem, given a POMDP, an objective function, a starting state, a number $x$, and $\varepsilon > 0$, with the promise that the optimal value is either (a) at least $x + \varepsilon$ or (b) at most $x - \varepsilon$, asks to decide which is the case.
Previous results. The approximation problem for discounted-sum objectives is decidable for POMDPs (by reduction to finite-horizon objectives, whose complexity has been well studied [25, 2]). However, for both long-run average objectives and $\omega$-regular objectives the approximation problem is undecidable, even for blind MDPs [22]. Finer characterizations have been studied for probabilistic automata (or blind MDPs) with finite words and with $\omega$-regular objectives specified in canonical form, such as parity or Rabin objectives [1, 16, 6, 7, 9, 14]. It has been established that even qualitative problems, which ask whether there is a strategy that ensures the objective with positive probability or with probability 1, are complete for the second level of the arithmetic hierarchy for blind MDPs with $\omega$-regular objectives [7]: first the undecidability of qualitative problems was established in [3], and the precise characterization was established in [7]. The completeness result is quite interesting because typically undecidable problems for automata on infinite words tend to be harder and lie in the analytical hierarchy (beyond the arithmetic hierarchy); for details see [6, 7].
Open questions. There are several open questions related to the approximation problem for POMDPs with long-run average objectives. The two most important ones are as follows. First, while for POMDPs with $\omega$-regular objectives there are characterizations in the arithmetic hierarchy, there are no such results for POMDPs with long-run average objectives. Second, given the importance of POMDPs with long-run average objectives, there are several algorithmic approaches based on belief state-space discretization (the belief state space represents an infinite-state perfect-information MDP in which a belief state is a probability distribution over states of the POMDP); however, none of these approaches provides any theoretical guarantee.
Our contributions. Our main contributions are as follows:

Strategy complexity. First, we show that for every blind MDP with long-run average objectives and every $\varepsilon > 0$, there is a finite-recall strategy that achieves expected reward within $\varepsilon$ of the optimal value. Second, we show that for every POMDP with long-run average objectives and every $\varepsilon > 0$, there is a finite-memory strategy that achieves expected reward within $\varepsilon$ of the optimal value. We also show that, in contrast to blind MDPs, finite-recall strategies do not suffice for $\varepsilon$-approximation in general POMDPs.

Computational complexity. An important consequence of the above result is that the decision version of the approximation problem for POMDPs with long-run average objectives is recursively enumerable (R.E.)-complete. Our results on strategy complexity imply the recursively enumerable upper bound, by simply enumerating all finite-memory strategies; the recursively enumerable lower bound is a consequence of the undecidability result of [22].
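The enumeration argument behind the upper bound can be sketched as follows. This is only an illustration under simplifying assumptions, not the procedure of the paper: it is restricted to blind MDPs and to strategies that play a finite prefix and then repeat a finite cycle, and it assumes a hypothetical oracle `payoff(prefix, cycle)` returning the long-run average payoff of such a strategy (for a fixed finite-memory strategy this amounts to analyzing a finite Markov chain). The names `eventually_periodic_strategies` and `semi_decide` are ours.

```python
from itertools import count, product


def eventually_periodic_strategies(actions):
    """Enumerate all (prefix, cycle) pairs by increasing total length.

    Every pair with a non-empty cycle appears after finitely many steps,
    which is what a semi-decision procedure needs.
    """
    for total in count(1):
        for cycle_len in range(1, total + 1):
            prefix_len = total - cycle_len
            for word in product(actions, repeat=total):
                yield word[:prefix_len], word[prefix_len:]


def semi_decide(actions, payoff, threshold):
    """Return a strategy whose payoff exceeds `threshold`, if one exists.

    `payoff` is a hypothetical oracle (an assumption of this sketch).
    The loop halts only when a witness is found; otherwise it runs forever,
    as expected from a recursively enumerable (but undecidable) problem.
    """
    for prefix, cycle in eventually_periodic_strategies(actions):
        if payoff(prefix, cycle) > threshold:
            return prefix, cycle
```

Under the promise of the decision version, running `semi_decide` with the given number as `threshold` halts exactly in case (a), since a strategy within half the error margin of the optimal value then exceeds the threshold.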
Significance. To clarify the significance of our result, we mention some relationships between parity objectives (a canonical form for $\omega$-regular objectives) and long-run average objectives.

First, there is a polynomial-time reduction from parity objectives to mean-reward objectives in perfect-information games [18]; however, no such polynomial-time reduction is known in the other direction.

Second, for perfect-information MDPs there is a strongly polynomial-time reduction from parity objectives to long-run average objectives [10]; however, no such strongly polynomial-time reduction is known in the other direction. (The results of [10] show that the optimal value in perfect-information MDPs with parity objectives is the optimal value to reach the set of states from where the objective can be satisfied almost-surely, and the almost-sure set can be computed in strongly polynomial time. Since reachability objectives are a special case of long-run average objectives, this establishes a strongly polynomial-time reduction.)
Thus, in terms of analysis in related models (games and MDPs) and topological complexity, long-run average objectives are harder than parity objectives. In this light our results are significant and surprising in the following ways:

First, there are examples of blind MDPs with parity objectives [3, 1] where there is an infinite-memory strategy that ensures the objective with probability 1, whereas for every finite-memory strategy the objective is satisfied with probability 0. Hence, finite-memory strategies do not guarantee any approximation for POMDPs with parity objectives. In contrast, we establish that for POMDPs with long-run average objectives finite-memory strategies suffice for $\varepsilon$-approximation, for all $\varepsilon > 0$.

Second, for POMDPs with parity objectives, even qualitative problems are complete for the second level of the arithmetic hierarchy, whereas we show that for POMDPs with long-run average objectives the approximation problem is R.E.-complete (or $\Sigma^0_1$-complete).

Finally, for POMDPs with parity objectives, algorithms based on belief-state discretization do not guarantee convergence to the value. In contrast, our result on the existence of finite-memory strategies for approximation suggests that algorithms based on an appropriately chosen belief state-space discretization can ensure convergence to the optimal value for POMDPs with long-run average objectives.
In summary, though in related models and in terms of topological complexity long-run average objectives are harder than parity objectives, quite surprisingly we show that for the approximation problem in POMDPs, in terms of strategy complexity and of computational and algorithmic complexity, long-run average objectives are simpler than parity objectives.
2 Model and statement of results
2.1 Model
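For a finite set $C$, denote by $\Delta(C)$ the set of probability distributions over $C$.
Consider a POMDP $\Gamma = (K, A, S, q, g)$, with finite state space $K$, finite action set $A$, finite signal set $S$, transition function $q : K \times A \to \Delta(K \times S)$, and reward function $g : K \times A \to [0,1]$.
Given $p_1 \in \Delta(K)$, called the initial belief, the POMDP starting from $p_1$ is denoted by $\Gamma(p_1)$ and proceeds as follows:

An initial state $k_1$ is drawn from $p_1$. The decision-maker knows $p_1$ but does not know $k_1$.

At each stage $m \geq 1$, the decision-maker takes some action $a_m \in A$. He gets a stage reward $g_m := g(k_m, a_m)$, where $k_m$ is the state at stage $m$. Then, a pair $(k_{m+1}, s_m)$ is drawn from the probability $q(k_m, a_m)$: $k_{m+1}$ is the next state, and the decision-maker is informed of the signal $s_m$, but neither of the reward $g_m$ nor of the state $k_{m+1}$.
At stage $m$, the decision-maker remembers all the past actions and signals before stage $m$, which is called the history before stage $m$. Let $H_m := (A \times S)^{m-1}$ be the set of histories before stage $m$, with the convention $(A \times S)^0 := \{\emptyset\}$. A strategy is a mapping $\sigma : \bigcup_{m \geq 1} H_m \to A$. The set of strategies is denoted by $\Sigma$. The infinite sequence $(k_1, a_1, s_1, k_2, a_2, s_2, \dots)$ is called a play. For $p_1 \in \Delta(K)$ and $\sigma \in \Sigma$, define $\mathbb{P}^{p_1}_\sigma$ the law induced by $\sigma$ and the initial belief $p_1$ on the set of plays of the game $\Gamma(p_1)$, and $\mathbb{E}^{p_1}_\sigma$ the expectation with respect to this law. For simplicity, denote $\mathbb{P}^{k}_\sigma := \mathbb{P}^{\delta_k}_\sigma$, for all $k \in K$, where $\delta_k$ stands for the Dirac measure over state $k$.
Let
$$\gamma^{p_1}_\infty(\sigma) := \mathbb{E}^{p_1}_\sigma\left( \liminf_{n \to +\infty} \frac{1}{n} \sum_{m=1}^{n} g_m \right)$$
and
$$v_\infty(p_1) := \sup_{\sigma \in \Sigma} \gamma^{p_1}_\infty(\sigma).$$
The term $\gamma^{p_1}_\infty(\sigma)$ is the long-term payoff given by strategy $\sigma$ and $v_\infty(p_1)$ is the optimal long-term payoff, called value, defined as the supremum payoff over all strategies. For simplicity, denote $\gamma^{k}_\infty(\sigma) := \gamma^{\delta_k}_\infty(\sigma)$ and $v_\infty(k) := v_\infty(\delta_k)$, for every $k \in K$.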
Remark 2.1.
In the literature, the concept of strategy that we defined is often called a pure strategy, by contrast with mixed strategies, which use randomness. By Kuhn’s theorem, enlarging the set of pure strategies to mixed strategies does not change the value (see [33] and [13]), and thus does not change our results. Moreover, the value coincides with the limit of the values of the $n$-stage problem and of the discounted problem, as well as with the uniform value and the weighted uniform value (see [31, 29, 30, 33]).
Definition 2.2.
A POMDP is called a blind MDP if the signal set is a singleton.
Thus, in a blind MDP, the decision-maker does not get any relevant signal at any time, and a strategy is simply an infinite sequence of actions $(a_1, a_2, \dots)$.
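Since no informative signal is received in a blind MDP, the probability distribution over the current state evolves deterministically along the chosen action sequence. The following minimal sketch illustrates this; the dictionary encoding of the transition function and the names `blind_belief_update` and `belief_path` are assumptions made for the illustration.

```python
from fractions import Fraction


def blind_belief_update(belief, transition, action):
    """Push a distribution over states one stage forward under `action`.

    belief:     dict state -> probability
    transition: dict (state, action) -> dict next_state -> probability
    """
    new_belief = {}
    for state, weight in belief.items():
        for next_state, prob in transition[(state, action)].items():
            new_belief[next_state] = new_belief.get(next_state, 0) + weight * prob
    return new_belief


def belief_path(belief, transition, actions):
    """Distributions visited along a fixed action sequence (no randomness involved)."""
    path = [belief]
    for action in actions:
        belief = blind_belief_update(belief, transition, action)
        path.append(belief)
    return path
```

Exact rational arithmetic (`Fraction`) is used only to keep the toy computations free of rounding; floats would work equally well.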
2.2 Contribution
We start by defining several classes of strategies.
Definition 2.3.
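Let $p_1 \in \Delta(K)$ and $\varepsilon \geq 0$. A strategy $\sigma$ is $\varepsilon$-optimal in the POMDP starting from $p_1$ if its long-term payoff is at least the value at $p_1$ minus $\varepsilon$.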
Definition 2.4 (Finite memory strategy).
A strategy $\sigma$ is said to have finite memory if it can be modeled by a finite-state transducer. Formally, $\sigma = (\sigma_u, \sigma_a, M, m_0)$, where $M$ is a finite set of memory states, $m_0 \in M$ is the initial memory state, $\sigma_a$ is the action selection function and $\sigma_u$ is the memory update function.
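A finite-memory strategy can be sketched as a small transducer. The encoding below is an assumption made for illustration: the action is selected from the current memory state, and the memory is updated from the triple (memory, action, signal); the class and attribute names are ours.

```python
class FiniteMemoryStrategy:
    """Finite-state transducer: memory states, an action rule, an update rule."""

    def __init__(self, initial_memory, select_action, update_memory):
        self.initial_memory = initial_memory
        self.select_action = select_action  # dict: memory -> action
        self.update_memory = update_memory  # dict: (memory, action, signal) -> memory

    def actions_along(self, signals):
        """Actions played at stages 1..len(signals) when these signals are received."""
        memory = self.initial_memory
        actions = []
        for signal in signals:
            action = self.select_action[memory]
            actions.append(action)
            memory = self.update_memory[(memory, action, signal)]
        return actions
```

The strategy never inspects the full history: everything it needs is summarized in the current memory state, which ranges over a finite set.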
Definition 2.5 (Finite recall strategy).
A strategy is said to have finite recall if it only needs the last portion of the history to make decisions. Formally, there exists an integer $M \geq 1$ such that, at every stage, the action chosen depends only on the last $M$ actions and signals of the history.
Definition 2.6 (Eventually periodic strategy).
Assume that $\Gamma$ is a blind MDP. A strategy $\sigma = (a_1, a_2, \dots)$ is eventually periodic if there exist $T \geq 1$ and $N \geq 1$ such that for all $m \geq N$, $a_{m+T} = a_m$.
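An eventually periodic strategy is fully described by two finite words: a prefix and a cycle repeated forever. A minimal sketch, with stages numbered from 1 and with the hypothetical helper name `eventually_periodic`:

```python
def eventually_periodic(prefix, cycle):
    """Return the map stage -> action for the strategy prefix . cycle . cycle ...

    `prefix` may be empty; `cycle` must be non-empty. Stages start at 1.
    """
    if not cycle:
        raise ValueError("the cycle must contain at least one action")

    def action_at(stage):
        if stage <= len(prefix):
            return prefix[stage - 1]
        return cycle[(stage - len(prefix) - 1) % len(cycle)]

    return action_at
```

For instance, with prefix `["a", "b"]` and cycle `["c", "d", "e"]`, the action at stage $m + 3$ equals the action at stage $m$ for every $m \geq 3$.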
Remark 2.7.
An eventually periodic strategy has finite recall, and a finite-recall strategy has finite memory. But the converse is not true: finite-memory strategies do not necessarily have finite recall.
Our main result is the following:
Theorem 2.8.
For every POMDP $\Gamma$, initial belief $p_1$ and $\varepsilon > 0$, there exists an $\varepsilon$-optimal strategy in $\Gamma(p_1)$ that has finite memory.
In addition, the result for the blind MDP case is stronger:
Theorem 2.9.
For every blind MDP $\Gamma$, initial belief $p_1$ and $\varepsilon > 0$, there exists an $\varepsilon$-optimal strategy in $\Gamma(p_1)$ that is eventually periodic, and thus has finite recall.
3 Beliefs and an ergodic property
The following concept plays a key role in the study of POMDPs and in our paper.
Definition 3.1.
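Given a strategy $\sigma$ and an initial belief $p_1$, the belief at stage $m$ of the decision-maker is the random variable $p_m$ taking values in $\Delta(K)$, such that for all $k \in K$,
$$p_m(k) = \mathbb{P}^{p_1}_\sigma(k_m = k \mid h_m),$$
where $h_m$ is the (random) history up to stage $m$; that is, $p_m(k)$ is the conditional probability, under $\sigma$ and $p_1$, that the state at stage $m$ is $k$ given $h_m$.
Conditional on signal $s_m$, the belief $p_m$ is updated at the end of stage $m$ into belief $p_{m+1}$, using Bayes rule. To avoid heavy notation, we omit the dependence of $p_m$ on $\sigma$ and $p_1$.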
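The Bayesian update can be sketched as follows. The dictionary encoding of the transition function (a joint distribution over next state and signal for each state and action) and the name `bayes_update` are assumptions made for this illustration.

```python
from fractions import Fraction


def bayes_update(belief, q, action, signal):
    """Posterior over the next state after playing `action` and observing `signal`.

    belief: dict state -> probability
    q:      dict (state, action) -> dict (next_state, signal) -> probability
    """
    unnormalized = {}
    for state, weight in belief.items():
        for (next_state, seen), prob in q[(state, action)].items():
            if seen == signal:
                unnormalized[next_state] = (
                    unnormalized.get(next_state, 0) + weight * prob
                )
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("signal has probability zero under this belief and action")
    return {state: weight / total for state, weight in unnormalized.items()}
```

Each call needs only the previous belief, the action and the signal, but the resulting belief ranges over an infinite set, which is why tracking it exactly is not a finite-memory procedure in general.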
Remark 3.2.
The belief synthesizes all the relevant information available to the player about the current state. Note that it is not known whether for each $\varepsilon > 0$ there exist $\varepsilon$-optimal strategies that are belief-stationary, i.e., strategies such that at each stage the action taken depends on the current belief only. Nonetheless, even if this result were known to hold, it would not be of much help in the proof of Theorem 2.8. Indeed, the set of beliefs is infinite, and computing the belief at each stage $m$ may require memory that grows to infinity as $m$ goes to infinity. Thus, a belief-stationary strategy may not have finite memory.
For $p \in \Delta(K)$, the support of $p$, denoted by $\operatorname{supp}(p)$, is the set of $k \in K$ such that $p(k) > 0$. In [33], it was proven that the value is equal to the limit of the $n$-stage value, as $n$ grows to infinity. In this paper, we make use of the results in [33, Lemmas 33-34, pp. 2004-2005], which are summarized below:
Lemma 3.3.
(Venel and Ziliotto 2016) For any initial belief and , there exists and and a belief (which depends on the history up to stage ) such that:

There is an event with , where

There exists such that for all
(1) Moreover, and .
Remark 3.4.
Note that represents the law on plays induced by the strategy , conditional on the fact that the initial state is . This does not mean that we consider that the decision-maker knows . In the same fashion, is the reward given by the strategy , conditional on the fact that the initial state is . Even though is 0-optimal in , this does not imply that is optimal in :
we may have .
Another important point of this result is that all the average payoffs converge almost surely to a limit that only depends on the initial state . This is a key ingredient in the proof of Theorems 2.8 and 2.9.
Intuitively, this result means that for any initial belief , after a finite number of stages, we can get close to a belief such that the optimal payoff from is in expectation almost the same as from , and moreover from there exists a 0-optimal strategy that induces a strong ergodic behavior on the state dynamics. Thus, there is a natural way to build a optimal strategy in : first, apply the strategy for stages, then apply . Since after steps the current belief is close to with probability higher than , the reward from playing is at least the expectation of , which is higher than . Thus, this procedure yields a optimal strategy. Nonetheless, may not have finite memory, and thus this strategy may not have finite memory either.
4 Blind MDPs
As described in Section 2.1, a blind MDP is a POMDP where there is only one possible signal, so that the decision-maker gets no information about the state at any stage. In this section, we assume that $\Gamma$ is a blind MDP.
In this framework, the transition can be viewed as a function $q : K \times A \to \Delta(K)$, and a strategy is simply an infinite sequence of actions $(a_1, a_2, \dots)$. Fix an initial belief $p_1 \in \Delta(K)$.
An important feature of this model is that, for every strategy, the belief process is deterministic. Consequently, Lemma 3.3 simplifies as follows:
Lemma 4.1.
There exists a (deterministic) belief such that:

can be (approximately) reached from : for all , there exists and such that under ,

There exists such that for all
(2) and .
From now on, is fixed, and let , , and be as in Lemma 4.1. We are going to prove that in the problem starting from , there exists an eventually periodic optimal strategy . As we shall see, this property easily implies Theorem 2.9: indeed, playing until stage , and then switching to constitutes an eventually periodic optimal strategy.
The following concept plays a key role in the proof.
Definition 4.2.
The shift of is the strategy .
Define an equivalence relation over as follows: if and only if , and let , be the associated partition: . Denote for , .
Definition 4.3.
The supersupport at stage is the tuple , where for each , is the set of states that are reached with positive probability at stage under , starting from an initial state in . Formally, .
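The supersupport can be computed by pushing each block of the partition forward separately. The sketch below assumes that the partition of the initial support is given as input and that the transition function is encoded as a dictionary; the name `supersupport` and this encoding are ours.

```python
def supersupport(partition, transition, actions):
    """Tuple of reachable-state sets, one per block of the initial partition.

    partition:  list of sets of initial states
    transition: dict (state, action) -> dict next_state -> probability
    actions:    the actions played before the stage of interest
    """
    current = [frozenset(block) for block in partition]
    for action in actions:
        current = [
            frozenset(
                next_state
                for state in block
                for next_state, prob in transition[(state, action)].items()
                if prob > 0
            )
            for block in current
        ]
    return tuple(current)
```

Unlike the plain support, which is the union of the entries, the tuple remembers from which block of initial states each reachable state originates.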
Given a very large integer and starting from , we are going to build an optimal strategy such that the sequence of actions played between stages and depends on only. Since the set of possible supersupports is finite and the evolution of supersupports in time is deterministic, this type of strategy is eventually periodic. For this, we prove the following crucial property of the supersupport, which shows that from any initial state lying in , the strategy yields the same reward as the strategy starting from an initial state in .
Lemma 4.4.
For all and ,
Consequently, are disjoint.
Proof.
Let and . There exists such that is reached with positive probability from : . By Lemma 4.1, we have
thus in particular
and .
Consequently, if , then , thus . ∎
The set is a finite set, thus it can be written , where and each partition is different.
Definition of the strategy .
Let . By Lemma 4.4, there exists such that for all , and ,
Define recursively a sequence by: and is such that . Define the strategy by: for all , play the sequence of actions between stages and . In words, during each time block , the decisionmaker plays as if he was playing from stage .
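The reason why such a block strategy is eventually periodic can be sketched as follows: the index of the block to play evolves deterministically on a finite set, so it must enter a cycle. The inputs `next_index` (the deterministic index evolution) and `block_actions` (the actions played in each block) are hypothetical stand-ins for the objects defined above, and the function name is ours.

```python
def unroll_block_strategy(start, next_index, block_actions):
    """Split a block strategy into a finite prefix and a repeating cycle of actions.

    start:         index of the first block
    next_index:    dict index -> index of the following block
    block_actions: dict index -> sequence of actions played during that block
    """
    first_visit = {}
    order = []
    index = start
    # Follow the deterministic evolution until an index repeats.
    while index not in first_visit:
        first_visit[index] = len(order)
        order.append(index)
        index = next_index[index]
    loop_start = first_visit[index]
    prefix = [a for i in order[:loop_start] for a in block_actions[i]]
    cycle = [a for i in order[loop_start:] for a in block_actions[i]]
    return prefix, cycle
```

The loop terminates after at most as many iterations as there are indices, so the prefix and the cycle are both finite.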
For sake of notation, denote and . The following lemma states the properties of this collection of partitions.
Lemma 4.5.
For all , and we have that almost surely
Consequently, is a partition of , and
Proof.
By Lemma 4.4, are disjoint. Let us prove the result by induction on . For , and , thus the result holds. Assume . Starting from and playing during stages leads the state to be in . Consequently, if , then .
Conversely, assume . By induction hypothesis, lies in for some . By what precedes, , and thus by Lemma 4.4, .
To summarize, we have proved that if and only if . By induction hypothesis, we deduce that if and only if . ∎
To finish the proof of Theorem 2.9, the idea is that if at stage the state is , and is played, this induces an expected average reward of approximately in this block. The key point is that this is true for any . Thus, if the belief at stage is , the expected average reward in this block is approximately , and thus summing the rewards over all the blocks yields approximately . In the former inequality, we have used the fundamental property stated in the previous lemma: under strategy , the weight on is independent of .
The above informal argument is the intuition behind the following lemma, and the proof is relegated to the appendix:
Lemma 4.6.
Proof of Theorem 2.9.
Since the strategy is eventually periodic, the average rewards converge almost surely. Thus, the above lemma yields
Define the strategy as follows: play until stage , then switch to . Note that is eventually periodic. We have
where the first inequality stems from Lemma 4.1, and the theorem is proved. ∎
Remark 4.7.
The finite memory strategy we built mainly relies on the computation of the supersupport. We would like to emphasize that it is crucial to consider supersupport, and that keeping track of the support is not enough, as shown in the following example.
Example 4.8.
Consider two states, and , two actions, and , and initial belief . Regardless of the actions played, the state goes from to with probability 1, and to with probability 1. The action is good in state (reward 1) and bad in state (reward 0), and action is good in state (reward 1) and bad in state (reward 0).
A strategy that only depends on the support of the belief plays either always, or always, and thus achieves a long-term payoff of . By contrast, if at each stage , the decision-maker knows not only the support of the belief, but also whether the weight on state comes from state or , then he knows whether the weight of the belief on is or , and can play in the first case and in the second case, securing a long-term payoff of .
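The gap between the two kinds of strategies can be checked numerically. In the sketch below, the two states swap deterministically, action `"a"` pays 1 in the first state and action `"b"` pays 1 in the second; the initial weight of 1/4 on the first state is an assumption chosen for the illustration, and all names are ours.

```python
from fractions import Fraction


def average_reward(weight_on_first, policy, stages):
    """Expected average reward over `stages` stages in the two-state swap example.

    weight_on_first: initial probability of the first state
    policy:          function (stage, current weight on first state) -> "a" or "b"
    """
    weight = weight_on_first
    total = 0
    for stage in range(1, stages + 1):
        action = policy(stage, weight)
        # "a" is rewarded in the first state, "b" in the second one.
        total += weight if action == "a" else 1 - weight
        # The two states are exchanged with probability 1.
        weight = 1 - weight
    return total / stages


def always_a(stage, weight):
    """Support-based play: the support never changes, so the action is constant."""
    return "a"


def belief_aware(stage, weight):
    """Play the action that is rewarded in the currently more likely state."""
    return "a" if weight >= Fraction(1, 2) else "b"
```

With the assumed initial weight `Fraction(1, 4)` and an even number of stages, `always_a` averages 1/2 while `belief_aware` averages 3/4.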
5 POMDPs
In this setting, the decisionmaker is given a signal after each action which is related to the current state of the system. Thus, strategies can depend on past actions and received signals. We refer to Section 2.1 for the formal definition of a POMDP.
In comparison with the blind MDP case, the belief process is now random, since it depends on the signals obtained. In other words, the same sequence of actions may lead to different beliefs when different signals are received. Therefore, the concepts of $m$-shift and supersupport introduced before need to depend on the history instead of on time. With this slight adjustment, the proof follows naturally. The finite-recall property is lost in this proof because, whenever we have to “restart” the strategy, we still need to know how we got there in the first place. Therefore, instead of remembering the last steps in history, one needs the first steps to keep track of the correct actions. This is explained further in Remark 5.6.
Fix an initial belief and , and consider , , , and as defined in Lemma 3.3. By Lemma 3.3, it is enough to prove that there exists a finitememory optimal strategy starting from .
The counterparts of the $m$-shift and the supersupport are formally defined as follows.
Definition 5.1.
For , the shift of is the strategy defined as follows: for all , .
Define an equivalence relation over as follows: if and only if , and let , be the associated partition: . Denote for , .
Definition 5.2.
The supersupport after history is the tuple , where for each , is the set of states that are reached with positive probability after history , under and starting from an initial state in . Formally, .
Informally, replacing this notation in Section 4 leads to the desired result: both sections have the same structure, and the proofs were made as similar as possible for the sake of clarity. Nonetheless, there are some more technicalities in the POMDP case that must be addressed.
Given a very large integer and starting from , we are going to build an optimal strategy such that the actions taken between stages and depend only on the signals received after stage and on . Since the set of possible supersupports is finite, this type of strategy requires only finite memory. For , define .
Lemma 5.3.
For all and for all ,
Proof.
Let . There exists such that is reached with positive probability from : . We have
thus in particular it holds for all possible evolutions of . Then,
and . ∎
The set is a finite set, thus it can be written as , where the are distinct.
Definition of the strategy .
Let . By Lemma 5.3, there exists such that for all , and ,
Define the strategy by induction such that for all , for all , plays only according to the history between stage and , and on a variable that is updated at each stage according to history.
Set . Let and . Let , and update in the following way: define such that , where is the history between stages and . Note that takes always values in .
In words, during each time block , the decisionmaker plays as if he was playing from history .
For sake of notation, denote and .
Now we can state another property of the supersupport.
Lemma 5.4.
For all and we have that almost surely
Consequently, is a partition of , and
Proof.
By Lemma 5.3, the are disjoint. Let us prove by induction that they form a partition of . For , and , thus the result holds. Assume . Starting from , playing during stages and collecting history (the history between stages and ) leads the state to be in , by definition of the index . Consequently, if , then .
Conversely, assume . By induction hypothesis, lies in for some . By what precedes, , and thus by Lemma 5.3, .
To summarize, we have proved that if and only if . By induction hypothesis, we deduce that if and only if . ∎
To finish the proof, the idea is that if at stage the state is , and is played, this induces an expected average reward of approximately in this block. Thus, if the belief at stage is , the expected average reward in this block is approximately , and thus summing the rewards over all the blocks yields approximately . This is the intuition behind the following lemma, whose proof is relegated to the appendix:
Lemma 5.5.
Proof of Theorem 2.8.
Since the strategy has finite memory, the average rewards converge almost surely. It follows from Lemma 5.5 that
Define the strategy as follows: play until stage , then switch to . Note that has finite memory. We have
and the theorem is proved. ∎
Remark 5.6.
Finite recall is not enough to obtain optimality in the general POMDP case, as shown in the example below.
Example 5.7.
The idea described in this example is that one needs to know how one got to the current situation to decide which actions to take, and that it is not enough to know the last steps. This POMDP has five states: an initial state from which a random initial signal is given, a dump state from which it is impossible to get out and where the reward is zero, and two subsystems where the optimal strategy is trivial. The key idea is that there is an arbitrarily long sequence of actions and signals that can be obtained in both subsystems, but the optimal strategy behaves differently in each of them; therefore, forgetting the initial signal of the POMDP leads to at most half of the optimal value.
The following is a representation of : first under action and then action . Each state is followed by the corresponding reward, and the arrows include the probability for the corresponding transition along with the signal obtained. In this case, each transition allows only one signal.