Markov decision processes (MDPs) are a classical model for decision-making in stochastic environments [13, 1]. In this model, a stochastic model of the environment is formalized, and we aim to find strategies that maximize the expected performance of the system within that environment. This performance is in turn formalized by a function that maps each infinite path in the MDP to a value. One classical such function is the mean-payoff function, which maps an infinite path to the limit of the means of the payoffs obtained on its prefixes. While this measure is classical, alternatives have been studied in the literature; one of the most studied alternative notions is the discounted sum. The main drawback of the mean-payoff value is that it does not guarantee local stability of the values along the path: if the limit mean-value of an infinite path converges to value , it may be the case that, for arbitrarily long infixes of the infinite path, the mean-payoff of the infix is far away from this value. There have been several recent contributions [6, 3, 7, 4] in the literature to deal with these possible fluctuations from the mean-payoff value along a path. In this paper, we study the notion of window mean-payoff, introduced in [6, 7] for two-player games, in the context of MDPs, and we provide algorithms and establish computational complexities for the expected value of window mean-payoff objectives.
As introduced in , in a window mean-payoff objective, instead of the limit of the mean-payoffs along the whole sequence of payoffs, we consider payoffs over a local finite-length window sliding along the infinite sequence: the objective asks that the mean-payoff always reach a given threshold within the window length . This objective is clearly a strengthening of the mean-payoff objective: for all lengths and all infinite sequences of payoffs, satisfying the window mean-payoff objective for threshold implies that the sequence has a mean-payoff value of at least . It can also be shown that this additional stability property can always be met at the cost of a small degradation of mean-payoff performance in two-player games: whenever there exists a strategy that forces a mean-payoff value against any behavior of the adversary, then for every , there is a window length and a strategy that ensure that the window mean-payoff objective for threshold is eventually satisfied for windows of length (see Proposition 1 in ).
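To make the sliding-window condition concrete, the following minimal sketch (ours, not taken from the paper) checks, for a finite sequence of integer payoffs and a rational threshold, whether a good window exists at a given position, and whether a good window exists at every position. The names `good_window` and `direct_fixed_window` are our own choices:

```python
from fractions import Fraction

def good_window(payoffs, i, ell, threshold):
    """Does some window starting at position i, of length at most ell,
    have mean payoff at least `threshold`?  Comparing the running total
    against threshold * length avoids any division."""
    total = 0
    for j in range(i, min(i + ell, len(payoffs))):
        total += payoffs[j]
        if total >= Fraction(threshold) * (j - i + 1):
            return True
    return False

def direct_fixed_window(payoffs, ell, threshold):
    """Direct fixed window condition on a finite sequence: a good window
    of length at most ell must exist at every checkable position."""
    return all(good_window(payoffs, i, ell, threshold)
               for i in range(len(payoffs) - ell + 1))
```

For instance, the payoff sequence 0, 2 has no good window of length 1 at its first position for threshold 1, but its window of length 2 has mean 1, so the good window objective with window length 2 is satisfied there.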
In this paper, we study how to maximize the expected value of the window mean-payoff function in MDPs. The value of an infinite sequence of integer values for this function is defined as follows:
i.e., it returns the supremum of all window mean-payoff thresholds that are enforced by the sequence of payoffs over the window length . As in , we study natural variants of this measure: when the length of the window is fixed, or when it is left unspecified but required to be bounded; and when the window property needs to hold from the beginning of the path, or only eventually, from some position in the path (leading to a prefix-independent variant).
Our results are as follows. First, for the prefix-independent version of the measure and for a fixed window length , we provide an algorithm to compute the best expected value of with a time complexity that is polynomial in the size of the MDP and in (Theorem 15). It is worth noting that, since the main motivation for introducing window mean-payoff objectives is to ensure strong stability over reasonable periods of time, it is very natural to assume that is bounded polynomially by the size of the MDP . This in turn implies that our algorithm is fully polynomial for the most interesting cases. We also note that this complexity matches the complexity of computing the value of the function for two-player games , and we provide a relative hardness result: the problem of deciding the existence of a winning strategy in a window mean-payoff game can be reduced to the problem of deciding whether the maximal expected mean-payoff value of an MDP for is larger than or equal to a given threshold (Theorem 16).
Second, we consider the case where the length in the measure is not fixed but only required to be bounded. In that case, we provide an algorithm which is in NP ∩ coNP (Theorem 21). In addition, we show that a polynomial-time solution to our problem would also provide a polynomial-time solution to the value problem in mean-payoff games (Theorem 22), which is a long-standing open problem in the area .
Third, we consider the case where the good window property needs to be imposed from the start of the path. In that case, surprisingly, deciding whether there is a strategy to obtain an expected value above a threshold is provably harder than for two-player games unless P=PP. Indeed, while the threshold problem for the worst-case value can be solved in time polynomial in the size of the game and in , we show that for the expected value in an MDP, the problem is PP-hard even if is given in unary (Theorem 27). To solve the problem, we provide an algorithm that runs in time polynomial in the size of the MDP, polynomial in the largest payoff appearing in the MDP, and exponential in the length (Theorem 26).
Finally, while our main results concentrate on MDPs, we also systematically provide results for the special case of Markov chains.
As already mentioned, the window mean-payoff objective was introduced in  for two-player games. We show in this paper that the complexity of computing the maximal expected value of the window mean-payoff value function is closely related to the computation of the worst-case value of a game inside the end components of the MDP (see Theorems 12 and 19) for the prefix-independent version of our objective. For the non-prefix-independent version, surprisingly, computing the expected value for MDPs seems computationally more demanding than computing the mean-payoff value for games (unless P=PP). Window mean-payoff objectives were also considered in games with imperfect information in , and in combination with omega-regular constraints in .
Stability issues of the mean-payoff measure have been studied in several contributions. In , the authors study MDPs where the objective is to optimize both the expected mean-payoff performance and its stability. They propose alternative definitions to the classical notion of statistical variance. The notion of stability offered by window mean-payoff objectives, studied in this paper, is stronger than the one proposed there. The techniques needed to solve the two problems are also very different, as theirs mainly rely on solving sets of quadratic constraints.
In , window-stability objectives have been introduced. Those objectives are inspired by the window mean-payoff objectives of , but they differ in that they do not enjoy the so-called inductive window property, because of the stricter stability constraints that they impose. The authors considered window-stability objectives in the context of games (2 players) and graphs (1 player), but they did not consider the case of MDPs ( players).
Structure of the paper
Section 2 introduces the definitions and formal concepts used in this paper. In Section 3, we study the expected window mean-payoff problems in weighted Markov chains, while in Section 4, we study these problems for weighted Markov decision processes. We give algorithms to solve the problems, as well as hardness results for the problems we study. In Section 5, we show that in an MDP, the value of the bounded window mean-payoff problem equals the supremum of the fixed window mean-payoff problem over all window lengths and over all strategies. Section 6 defines window mean-cost instead of window mean-payoff, where the objective is to minimize the cost as opposed to maximizing the payoff. Finally, we conclude in Section 7 with a summary of the complexity and hardness results that we obtain for both Markov chains and MDPs. Some details related to some of the proofs, and an additional algorithm for the direct fixed window mean-payoff problem in Markov chains, have been moved to the appendix.
2.1 Weighted Markov Chains
Given a finite set , a (rational) probability distribution over is a function such that . We denote the set of probability distributions over by .
A finite weighted Markov chain (MC, for short) is a tuple , where is the finite set of states, is the initial state, is the set of edges, the function defines the weights of the edges, and is a function that assigns a probability distribution – on the set of outgoing edges from – to all states . A Markov chain is infinite if the set is not finite. Unless stated otherwise, we always consider a Markov chain to be finite.
Given a state , we define the support of to be , that is, every edge has a non-zero probability. Then, for a state , we define the set of infinite paths in starting from as , and the set of infinite paths in the whole Markov chain, denoted , is equal to . The set of finite paths from a state is denoted and corresponds to the set of all prefixes of the paths in . Similarly, the set of finite paths in the whole Markov chain is equal to . We denote by the last vertex of the finite path . Finally, the set of paths of length starting from can be defined as , where denotes the number of edges in . The set of finite paths of length in , that is , is equal to .
Given a weighted Markov chain and a path , we denote by the cylinder set generated by , defined as . The interesting property of these cylinder sets is that, for any path , we have: .
The bottom strongly connected components (BSCCs for short) of a Markov chain are the strongly connected components from which it is impossible to exit. That is, a BSCC ensures that . We denote by the set of BSCCs of the Markov chain . A nice property of these components is that every infinite path almost surely eventually ends up in one of the BSCCs. Formally, we have the following proposition:
For every state , we have: .
To compute the different expected values we are interested in, we will need some computations on Markov chains. More specifically, given a weighted Markov chain , we would like to compute or from a state , where and are two sets of states, denotes eventually reaching some state in , and denotes that, until some state in is reached, only states in are visited. In the following, and are denoted and respectively. The corresponding probabilities can be computed by solving a linear equation system (see ). By using the Gauss-Jordan method, we obtain an algorithm that computes these probabilities in time . We denote by the time needed to compute these probabilities.
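As a concrete illustration of this computation, the following sketch (our own encoding, not the paper's) solves the reachability linear system exactly by Gauss-Jordan elimination over rationals. States that cannot reach the target are first filtered out by a backward graph search and receive probability 0:

```python
from fractions import Fraction

def reach_probabilities(P, target):
    """Probability of eventually reaching `target` from each state of a
    finite Markov chain.  P maps each state to a dict of successor
    probabilities.  We solve x_s = sum_t P[s][t] * x_t with x_s = 1 on
    the target and x_s = 0 for states that cannot reach it."""
    # Backward search: states that can reach the target at all.
    pred = {s: set() for s in P}
    for s, succ in P.items():
        for t in succ:
            pred.setdefault(t, set()).add(s)
    reach, stack = set(target), list(target)
    while stack:
        t = stack.pop()
        for s in pred.get(t, ()):
            if s not in reach:
                reach.add(s)
                stack.append(s)
    states = [s for s in P if s in reach and s not in target]
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    # Augmented matrix for (I - A) x = b, where b[s] is the one-step
    # probability of entering the target from s.
    M = [[Fraction(0)] * (n + 1) for _ in range(n)]
    for s in states:
        i = idx[s]
        M[i][i] += 1
        for t, p in P[s].items():
            p = Fraction(p)
            if t in target:
                M[i][n] += p
            elif t in reach:
                M[i][idx[t]] -= p
    for col in range(n):  # Gauss-Jordan elimination
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [v - f * w for v, w in zip(M[r], M[col])]
    x = {s: Fraction(0) for s in P}
    x.update({s: Fraction(1) for s in target})
    x.update({s: M[idx[s]][n] for s in states})
    return x
```

Using exact rational arithmetic keeps the computed probabilities free of floating-point error, at a polynomial cost in the bit size of the transition probabilities.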
2.2 Weighted Markov Decision Processes
A finite weighted Markov decision process (MDP, for short) is a tuple , where is the set of states, is the initial state, is the set of actions, and is the set of edges. The function defines the weights of the edges, and is a partial function that assigns a probability distribution – on the set of outgoing edges from – to all states if action is taken in . Given a state , we denote by the set of actions . Then, given an action , we define . We assume that, for all , .
A strategy in is a function such that , for all . A finite-memory strategy can be seen as a tuple where:
is a set of modes;
selects an action such that, for all and , ;
is a mode update function;
selects the initial mode for any state .
The amount of memory used by such a strategy is defined to be . Once we fix a strategy in an MDP, we obtain an MC. If the strategy is not finite memory, we may obtain an infinite Markov chain. For , we denote by the Markov chain that we obtain by applying strategy to . A strategy is said to be memoryless if , that is, the choice of action only depends on the current state where the choice is made. Formally, a strategy is said to be memoryless if for all finite paths and in such that , we have . We denote by the set of all strategies, and by the set of all memoryless strategies. Note that this set is finite.
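The construction of the induced Markov chain under a memoryless strategy can be sketched directly (an illustration of ours; the `{state: {action: {successor: probability}}}` encoding is an assumption, not the paper's notation):

```python
def apply_memoryless_strategy(mdp, sigma):
    """Induced Markov chain of an MDP under a memoryless strategy.
    `mdp` maps each state to its enabled actions and their successor
    distributions; `sigma` maps each state to one enabled action.
    The result maps each state directly to a successor distribution."""
    return {s: dict(actions[sigma[s]]) for s, actions in mdp.items()}
```

For a finite-memory strategy, the same idea applies after taking the product of the MDP with the strategy's mode set, so the induced chain has one state per (state, mode) pair.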
We now define maximal end components in an MDP. To do this, we first define the notion of sub-MDP and the directed graph induced by it. For an MDP , a pair where and such that:
for every state s ;
for , we have ;
is called a sub-MDP of . A sub-MDP induces a directed graph whose vertices are and whose edges are . Then, an end component is a sub-MDP whose induced graph is strongly connected. Finally, a maximal end component (MEC, for short) is an end component that is included in no other end component. We denote by the set of all maximal end components of . Computing the set can be done in time (see ). Any infinite path will almost surely eventually end up in one maximal end component, whatever strategy is considered. That is, we have the following proposition:
For every strategy and every state , we have: .
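The MEC decomposition just described admits a standard iterative sketch (our own rendering, with our own encoding, not the cited algorithm verbatim): repeatedly compute the SCCs of the induced graph and delete every state-action pair whose support leaves the SCC of its state, until nothing changes.

```python
def sccs(graph):
    """Kosaraju's algorithm: map each node of a digraph
    {node: set of successors} to a representative of its SCC."""
    order, seen = [], set()
    def dfs1(u):
        seen.add(u)
        for v in graph.get(u, ()):
            if v not in seen:
                dfs1(v)
        order.append(u)
    for u in graph:
        if u not in seen:
            dfs1(u)
    rev = {}
    for u, succ in graph.items():
        for v in succ:
            rev.setdefault(v, set()).add(u)
    comp = {}
    def dfs2(u, root):
        comp[u] = root
        for v in rev.get(u, ()):
            if v not in comp:
                dfs2(v, root)
    for u in reversed(order):
        if u not in comp:
            dfs2(u, u)
    return comp

def max_end_components(mdp):
    """MEC decomposition sketch for an MDP encoded as
    {state: {action: {successor: probability}}}: alternate SCC
    computation with removal of actions whose support leaves their
    state's SCC (and of states left with no action) until stable."""
    enabled = {s: set(acts) for s, acts in mdp.items()}
    while True:
        graph = {s: {t for a in enabled[s] for t in mdp[s][a]}
                 for s in enabled}
        comp = sccs(graph)
        stable = True
        for s in list(enabled):
            for a in list(enabled[s]):
                if any(comp.get(t) != comp[s] for t in mdp[s][a]):
                    enabled[s].discard(a)
                    stable = False
            if not enabled[s]:
                del enabled[s]
                stable = False
        if stable:
            break
    mecs = {}
    for s in enabled:
        mecs.setdefault(comp[s], set()).add(s)
    return [(states, {s: enabled[s] for s in states})
            for states in mecs.values()]
```

Each returned pair consists of the states of a MEC together with the actions that remain enabled inside it.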
In several cases, we will need to compute the (optimal) expected value of the mean-payoff in an MDP . This can be done in polynomial time (see ). In the following, we will denote the complexity of computing the optimal expected value in by .
The size of the MDP , that is denoted , corresponds to the value where .
2.3 Weighted Two-Player Games
We introduce here the notion of a finite weighted two-player game.
A finite two-player game is a weighted graph where the set of vertices is partitioned into the vertices belonging to Player 1, that is , and the vertices belonging to Player 2, that is , and is the initial vertex. The set of edges is such that for all , there exists such that . The weight function is such that .
The strategies in a two-player game are analogous to the strategies in an MDP, except that Player 1 chooses an edge from the states in , while Player 2 chooses an edge from the states in (instead of choosing an action that leads to a probability distribution over the edges, as in an MDP). We denote by and the sets of all strategies of Players 1 and 2, respectively, available in the game .
Given an MDP , we denote by the two-player game resulting from where:
Note that any strategy of Player 1 in can be seen as a strategy in the MDP . Reciprocally, any strategy in the MDP can be seen as a strategy for Player 1 in . Therefore, we have .
In a two-player game, there is no randomness. Therefore, once the deterministic strategies are fixed, there is a unique path that can occur in the game from the initial state . For a game , given two strategies and , we denote by the path that occurs in the two-player game under strategies and from state . Then, if we consider a function that associates a value to any infinite path of a two-player game, we denote by the value .
In the two-player game resulting from an MDP , the set of paths from a state that may occur when Player 1 chooses a strategy from is defined as . Then, we may consider the function such that for where , we have with
That is, associates to a path its counterpart in the game where Player 1 opts for strategy .
2.4 Further Notations
For a Markov chain , a path and , by we refer to the sequence of states (that is also a sequence of edges), and by we refer to . Consider some measurable function associating a value to each infinite path starting from . For an interval , we denote by the set , and for , refers to . Since the set of paths forms a probability space, and is a random variable, we denote by the expected value of over the set of paths starting from .
Finally, for , we denote respectively by and the set of natural numbers and respectively.
2.5 Decision Problems
Consider a set , a payoff function , and . We first define the function (that stands for Window Total-Payoff) such that:
The value is the maximum total payoff one can ensure over a window of length starting from the initial vertex of the sequence of edges.
Similarly, we define (that stands for Window Mean-Payoff) such that:
The value is the maximum mean-payoff one can ensure over a window of length . For a given infinite path , a threshold , a position and , we say that the window is closed if . Otherwise, the window is open. We note that for .
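The two window values above can be computed on a finite weight sequence by one pass over its prefixes; this sketch is ours, with exact rational means via `Fraction`:

```python
from fractions import Fraction

def window_total_payoff(weights, ell):
    """WTP: the maximal total payoff over a prefix window of length at
    most ell of the given sequence of edge weights."""
    best, total = None, 0
    for k in range(1, min(ell, len(weights)) + 1):
        total += weights[k - 1]
        if best is None or total > best:
            best = total
    return best

def window_mean_payoff(weights, ell):
    """WMP: the maximal mean payoff over a prefix window of length at
    most ell; the mean of the length-k prefix is its total divided by k."""
    best, total = None, 0
    for k in range(1, min(ell, len(weights)) + 1):
        total += weights[k - 1]
        mean = Fraction(total, k)
        if best is None or mean > best:
            best = mean
    return best
```

With these names, a window at some position of a path is closed for a threshold exactly when `window_mean_payoff` applied to the weights read from that position is at least the threshold.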
Given a Markov chain with an initial state and a rational threshold , we define the following objectives.
Given , the good window objective
requires that there exists a window starting in the first position of and of size at most over which the mean-payoff is bounded below by the threshold . Again, for .
Given , the direct fixed window mean-payoff objective
requires that good windows of size at most exist in all positions along the path.
The direct bounded window mean-payoff objective
requires that there exists a bound such that the path satisfies the direct fixed objective for the length .
Given , the fixed window mean-payoff objective
is the prefix-independent version of the direct fixed window objective: it requires the existence of a suffix of the path satisfying it.
The bounded window mean-payoff objective
is the prefix-independent version of the direct bounded window objective.
2.6 Functions of Interest
For each of these objectives, we associate a value to every infinite path. We define the following functions, respectively for the fixed, direct fixed, bounded and direct bounded window mean-payoff problems:
It is now possible to study the complexity of finding the expected value of these functions in a Markov chain and in a Markov decision process. In the following, w.l.o.g. we consider only non-negative integer weights in the MCs and MDPs. Note that if the weights belong to , then one can multiply them by the LCM of their denominators to obtain integer weights. Among the resultant set of integer weights, if the minimum weight is negative, then we add its absolute value to the weight of each edge so as to obtain weights that are natural numbers.
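The two normalization steps can be sketched as follows (our own helper, assuming Python 3.9+ for `math.lcm`); thresholds must of course be rescaled and shifted by the same amounts:

```python
from fractions import Fraction
from math import lcm  # variadic lcm, available since Python 3.9

def normalize_weights(weights):
    """Rescale rational edge weights to natural numbers: multiply by the
    LCM of the denominators, then shift by the absolute value of the
    minimum if it is negative.  Returns the new weights together with
    the multiplier and the shift, so a threshold can be adjusted too."""
    fracs = [Fraction(w) for w in weights]
    m = lcm(*(f.denominator for f in fracs))
    ints = [int(f * m) for f in fracs]
    shift = -min(min(ints), 0)
    return [w + shift for w in ints], m, shift
```

The shift changes every window mean by the same constant, so comparisons against a threshold shifted by the same amount are preserved.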
2.7 A First Result
The first observation is that the functions and always produce the same result. That is stated in the following theorem.
Let be a Markov chain and let . Then:
It is easy to see that . Now, for every , a window mean-payoff value of can be ensured from the beginning of the path by considering an appropriately large window length. Since has been defined as the supremum of the window mean-payoff values that can be ensured along the path with arbitrarily large window lengths, the result follows. The detailed proof is given in Appendix B.
3 Expected Window Mean-payoff in Markov Chains
In this section, we study the fixed, direct fixed and bounded window mean-payoff problems on weighted Markov chains. We show that while the bounded window mean-payoff problem can be solved in polynomial time, the algorithm for the fixed window mean-payoff problem is polynomial in the value of the window length, that is, it is polynomial when the window length is given in unary, and pseudopolynomial if it is given in binary. Then, the algorithm for the direct fixed window mean-payoff problem is polynomial in the value of the window length and the weights appearing in the Markov chains. Therefore, it is pseudopolynomial when they are given in binary.
3.1 Fixed Window Mean-Payoff
We are interested in the expected value of for a given in a Markov chain. More specifically, consider a Markov chain and an initial state . Then, we study
In order to compute this value, first consider the bottom strongly connected components (BSCCs for short) of the Markov chain . In the preliminaries, we mentioned that any infinite path will eventually end up in one of them almost surely. Since the fixed window mean-payoff is prefix independent, the value of a path , that is , is only determined by the BSCC in which it ends up. Moreover, we have the following result:
Let be a BSCC. Let . Then, every path in will almost surely visit the sequence of states infinitely often. Formally:
Consider an infinite path . Then, because we are in a BSCC, will almost surely go infinitely often through . Moreover, in , there is a positive probability of taking the path , that is, visiting the states in in sequence. It follows that will almost surely reach the sequence of states infinitely often. ∎
In particular, in a BSCC , the sequence of states that minimizes is visited infinitely often by any infinite path almost surely. Hence, the following theorem:
The expected value of over paths that are entirely contained in a BSCC does not depend on the starting state of the paths considered. Moreover, if we define the expected value in a BSCC as the expected value starting from any state, then we have:
Before proving this theorem, let us look at the example of a weighted Markov chain in Figure 1, which illustrates the definition of for some BSCC . The probabilities are written next to each edge in black, and the weights in red. Let us consider the case where . In this Markov chain, there are two BSCCs: and . The value of is obvious since there is only one path of length in : with . Let us now consider . In the BSCC , there are eight paths of length .
, with ;
, with ;
, with ;
, with ;
, with ;
, with ;
, with ;
, with .
The minimum over these paths is achieved for paths , and . Therefore, . Moreover, since both BSCCs have the same probability () of being reached, by applying Formula 7, we obtain that the expected value of the fixed window mean-payoff in this Markov chain is equal to .
Let . We prove:
For all , we have ;
For all , we have .
Consider a path . We prove that . Let and let . Then, by definition of , we have for all , . Therefore, we have that . This implies that . Hence, for all , we have and therefore, .
Let us prove the second point. Let us denote by the sequence of states such that . Because is a BSCC and according to Proposition 4, we know that , which is equivalent to saying that the set of paths that reach the sequence of states infinitely often is such that . Let . Then . Hence, for all , we have and therefore, .
Concluding from points 1 and 2, we thus have, for all , . Therefore, the expected value in a BSCC does not depend on the starting state and is always equal to . This concludes the proof. ∎
To sum up, the fixed window mean-payoff of a path only depends on where it ends up. Moreover, a path almost surely ends up in a BSCC. Furthermore, the value of any path that ends up in a BSCC is, almost surely, equal to . Thus the expected value of in the Markov chain from the initial state is given by:
Let us focus on computing for a BSCC . Ideally, we would compute this value using dynamic programming. More specifically, for and , we would express as a function of for . However, this does not seem possible in some cases. For instance, consider the BSCC of Figure 2 with . It is clear that, from state , the best window mean-payoff is , obtained by only considering the first edge, while from state the best window mean-payoff is equal to , obtained by considering three edges. Therefore, from state , the edge is not interesting since . But, from state , is interesting since . The problem comes from the fact that whether the edge is interesting or not depends on the current mean of the path.
Now consider the decision problem where we have to decide whether for some . This is equivalent to deciding whether or not in the BSCC where every weight has been reduced by . Now, an interesting property is that, for every , . Therefore, we can look at the total payoff instead of the mean-payoff. In that case, whether an edge is interesting to consider or not does not depend on the current sum of the path; it only depends on the value of the edge (whether it is positive or negative). For example, let us consider the case where in the previous BSCC. The BSCC obtained by reducing every weight by 3 can be seen in Figure 3. Now, the edge is interesting from and since .
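The total-payoff view admits a simple fixed-point sketch (our own rendering of the inductive window idea of [6], under the assumption that in a BSCC the window must close along every outgoing path, hence the minimum over edges):

```python
NEG_INF = float('-inf')

def good_window_values(edges, ell):
    """C[s] after i rounds is the best total payoff guaranteed, over
    every path of length at most i from s, by closing the window at some
    prefix: we take the worst edge, and after it either stop (payoff 0
    added) or continue only if continuing guarantees a non-negative
    addition.  A window starting at s closes within ell steps on all
    paths iff the returned C[s] >= 0.  `edges` maps each state to a
    list of (successor, weight) pairs with threshold-reduced weights."""
    C = {s: NEG_INF for s in edges}  # length 0: nothing can close
    for _ in range(ell):
        C = {s: min(w + max(0, C[t]) for t, w in edges[s])
             for s in edges}
    return C
```

On the two-state cycle with weights -1 and 2, one step is not enough from the first state, but two steps close the window with total payoff 1, which the fixed point reflects.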
The following lemma generalizes this result:
Let be a BSCC.
For and , we define . Then
with for all .
To prove this lemma, we just use the fact that a path of length consists of an edge followed by a path of length . More specifically, let and , then: