
Conditional Value-at-Risk for Reachability and Mean Payoff in Markov Decision Processes

We present the conditional value-at-risk (CVaR) in the context of Markov chains and Markov decision processes with reachability and mean-payoff objectives. CVaR quantifies risk by means of the expectation of the worst p-quantile. As such it can be used to design risk-averse systems. We consider not only CVaR constraints, but also introduce their conjunction with expectation constraints and quantile constraints (value-at-risk, VaR). We derive lower and upper bounds on the computational complexity of the respective decision problems and characterize the structure of the strategies in terms of memory and randomization.


1. Introduction

Markov decision processes (MDP) are a standard formalism for modelling stochastic systems featuring non-determinism. The fundamental problem is to design a strategy resolving the non-deterministic choices so that the systems’ behaviour is optimized with respect to a given objective function, or, in the case of multi-objective optimization, to obtain the desired trade-off. The objective function (in the optimization phrasing) or the query (in the decision-problem phrasing) consists of two parts. First, a payoff is a measurable function assigning an outcome to each run of the system. It can be real-valued, such as the long-run average reward (also called mean payoff), or a two-valued predicate, such as reachability. Second, the payoffs for single runs are combined into an overall outcome of the strategy, typically in terms of expectation. The resulting objective function is then for instance the expected long-run average reward, or the probability to reach a given target state.

Risk-averse control aims to overcome one of the main disadvantages of the expectation operator, namely its ignorance towards the incurred risks, intuitively phrased as a question “How bad are the bad cases?”

While the standard deviation (or variance) quantifies the spread of the distribution, it does not focus on the bad cases and thus fails to capture the risk. There are a number of quantities used to deal with this issue:

  • The worst-case analysis (in the financial context known as discounted maximum loss) looks at the payoff of the worst possible run. While this makes sense in a fully non-deterministic environment and lies at the heart of verification, in the probabilistic setting it is typically unreasonably pessimistic, taking into account events happening with probability $0$, e.g., never tossing head on a fair coin.

  • The value-at-risk (VaR) denotes the worst $p$-quantile for some $p \in [0, 1]$. For instance, the value at the $0.5$-quantile is the median, and the $0.05$-quantile (the vigintile or ventile) is the value of the best run among the $5\%$ worst ones. As such it captures the “reasonably possible” worst case. See Fig. 1 for an example of VaR for two given probability density functions. There has been extensive effort spent recently on the analysis of MDP with respect to VaR and the re-formulated notions of quantiles, percentiles, thresholds, satisfaction view etc., see below. Although VaR is more realistic, it tends to ignore outliers too much, as seen in Fig. 1 on the right. VaR has been characterized as “seductive, but dangerous” and “not sufficient to control risk” (Beder, 1995).

  • The conditional value-at-risk (average value-at-risk, expected shortfall, expected tail loss) answers the question “What to expect in the bad cases?” It is defined as the expectation over all events worse than the value-at-risk, see Fig. 1. As such it describes the lossy tail, taking outliers into account, weighted respectively. In the degenerate cases, CVaR for $p = 1$ is the expectation and for $p = 0$ the (probabilistic) worst case. It is an established risk metric in finance, optimization and operations research, e.g. (Artzner et al., 1999; Rockafellar and Uryasev, 2000), and “is considered to be a more consistent measure of risk” (Rockafellar and Uryasev, 2000). Recently, it started permeating to areas closer to verification, e.g. robotics (Carpin et al., 2016).




Figure 1. Illustration of VaR and CVaR for some random variables.

Our contribution

In this paper, we investigate optimization of MDP with respect to CVaR as well as the respective trade-offs with expectation and VaR. We study the VaR and CVaR operators for the first time with the payoff functions of weighted reachability and mean payoff, which are fundamental in verification. Moreover, we cover both the single-dimensional and the multi-dimensional case.

Particularly, we define CVaR for MDP and show the peculiarities of the concept. Then we study the computational complexity and the strategy complexity for various settings, proving the following:

  • The single-dimensional case can be solved in polynomial time through linear programming, see Section 5.


  • The multi-dimensional case is NP-hard, even for CVaR-only constraints. Weighted reachability is NP-complete and we give PSPACE and EXPSPACE upper bounds for mean payoff with CVaR and expectation constraints, and with additional VaR constraints, respectively, see Section 6. (Note that already for the sole VaR constraints only an exponential algorithm is known; the complexity is an open question and not even NP-hardness is known (Randour et al., 2017; Chatterjee et al., 2015).)

  • We characterize the strategy requirements, both in terms of memory, ranging from memoryless, over constant-size to infinite memory, and the required degree of randomization, ranging from fully deterministic strategies to randomizing strategies with stochastic memory update.

While dealing with the CVaR operator, we encountered surprising behaviour, preventing us from trivially adapting the solutions to the expectation and VaR problems:

  • In contrast to, e.g., expectation and VaR, CVaR does not behave linearly w.r.t. stochastic combination of strategies.

  • A conjunction of CVaR constraints already is NP-hard, since it can force a strategy to play deterministically.
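The non-linearity under stochastic combination can be checked on a small pair of payoff distributions. The numbers below are assumptions chosen purely for illustration and are not taken from the paper; CVaR is evaluated at $p = 0.2$ as the average over the worst $20\%$ of the probability mass, and $X_\lambda$ denotes the payoff when each of the two strategies is chosen with probability $\tfrac12$.

```latex
\begin{align*}
X_1 &= 5 \text{ surely}: &
  \mathbb{E}[X_1] &= 5, &
  \mathrm{CVaR}_{0.2}(X_1) &= 5, \\
\mathbb{P}[X_2 = 0] &= 0.1,\ \mathbb{P}[X_2 = 10] = 0.9: &
  \mathbb{E}[X_2] &= 9, &
  \mathrm{CVaR}_{0.2}(X_2) &= \tfrac{0.1 \cdot 0 + 0.1 \cdot 10}{0.2} = 5, \\
\mathbb{P}[X_\lambda = 0] &= 0.05,\ \mathbb{P}[X_\lambda = 5] = 0.5,\ \mathbb{P}[X_\lambda = 10] = 0.45: &
  \mathbb{E}[X_\lambda] &= 7, &
  \mathrm{CVaR}_{0.2}(X_\lambda) &= \tfrac{0.05 \cdot 0 + 0.15 \cdot 5}{0.2} = 3.75.
\end{align*}
```

The expectation of the mixture is the average of the two expectations, whereas its CVaR ($3.75$) lies strictly below the average of the two CVaR values ($5$).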

1.1. Related work

Worst case

Risk-averse approaches optimizing the worst case together with expectation have been considered in beyond-worst-case and beyond-almost-sure analysis investigated in both the single-dimensional (Bruyère et al., 2017) and in the multi-dimensional (Clemente and Raskin, 2015) setup.


The decision problem related to VaR has been phrased in probabilistic verification mostly in the form “Is the probability that the payoff is higher than a given value threshold more than a given probability threshold?” The total reward gained attention both in the verification community (Ummels and Baier, 2013; Haase and Kiefer, 2015; Baier et al., 2017) and recently in the AI community (Gilbert et al., 2017; Li et al., 2017). Multi-dimensional percentile queries are considered for various objectives, such as mean-payoff, limsup, liminf, shortest path in (Randour et al., 2017); for the specifics of two-dimensional case and their interplay, see (Baier et al., 2014a). Quantile queries for more complex constraints have also been considered, namely their conjunctions (Filar et al., 1995a; Brázdil et al., 2014), conjunctions with expectations (Chatterjee et al., 2015) or generally Boolean expressions (Haase et al., 2017). Some of these approaches have already been practically applied and found useful by domain experts (Baier et al., 2014c; Baier et al., 2014b).


There is a body of work that optimizes CVaR in MDP. However, to the best of our knowledge, all the approaches (1) focus on the single-dimensional case, (2) disregard the expectation, and (3) treat neither reachability nor mean payoff. They focus on the discounted (Bäuerle and Ott, 2011), total (Carpin et al., 2016), or immediate (Kageyama et al., 2011) reward, as well as extend the results to continuous-time models (Huang and Guo, 2016; Miller and Yang, 2017). This work comes from the area of optimization and operations research, with the notable exception of (Carpin et al., 2016), which focuses on the total reward. Since the total reward generalizes weighted reachability, (Carpin et al., 2016) is related to our work the most. However, it provides only an approximation solution for the one-dimensional case, neglecting expectation and the respective trade-offs.

Further, CVaR is a topic of high interest in finance, e.g., (Rockafellar and Uryasev, 2000; Beder, 1995). The central difference is that there variations of portfolios (i.e. the objective functions) are considered while leaving the underlying random process (the market) unchanged. This is dual to our problem, since we fix the objective function and now search for an optimal random process (or the respective strategy).

Multi-objective expectation

In the last decade, MDP have been extensively studied generally in the setting of multiple objectives, which provides some of the necessary tools for our trade-off analysis. Multiple objectives have been considered for both qualitative payoffs, such as reachability and LTL (Etessami et al., 2008), as well as quantitative payoffs, such as mean payoff (Brázdil et al., 2014), discounted sum (Chatterjee et al., 2013), or total reward (Forejt et al., 2011). Variance has been introduced to the landscape in (Brázdil et al., 2013).

2. Preliminaries

Due to space constraints, some proofs and explanations are shortened or omitted when clear and can be found in the appendix.

2.1. Basic definitions

We mostly follow the definitions of (Chatterjee et al., 2015; Brázdil et al., 2014). $\mathbb{N}$, $\mathbb{Q}$, $\mathbb{R}$ are used to denote the sets of positive integers, rational and real numbers, respectively. For $n \in \mathbb{N}$, let $[n] = \{1, \dots, n\}$. Further, refers to , where

is the unit vector in dimension


We assume familiarity with basic notions of probability theory, e.g., probability space $(\Omega, \mathcal{F}, \mu)$, random variable $X$, or expected value $\mathbb{E}$. The set of all distributions over a countable set $C$ is denoted by $\mathcal{D}(C)$. Further, $d \in \mathcal{D}(C)$ is Dirac if $d(c) = 1$ for some $c \in C$. To ease notation, for functions yielding a distribution over some set $C$, we may write $f(\cdot, c)$ instead of $f(\cdot)(c)$ for $c \in C$.

Markov chains

A Markov chain (MC) is a tuple $\mathsf{M} = (S, \delta, \mu_0)$, where $S$ is a countable set of states (we allow the state set to be countable for the formal definition of strategies on MDP; when dealing with Markov chains in queries, we only consider finite state sets), $\delta : S \to \mathcal{D}(S)$ is a probabilistic transition function, and $\mu_0 \in \mathcal{D}(S)$ is the initial probability distribution. The SCCs and BSCCs of a MC are denoted by $\mathsf{SCC}$ and $\mathsf{BSCC}$, respectively (Puterman, 1994).

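A run in $\mathsf{M}$ is an infinite sequence $\rho = s_1 s_2 \cdots$ of states; we write $\rho_i$ to refer to the $i$-th state $s_i$. A path $\varrho$ in $\mathsf{M}$ is a finite prefix of a run $\rho$. Each path $\varrho$ in $\mathsf{M}$ determines the set $\mathsf{Cone}(\varrho)$ consisting of all runs that start with $\varrho$. To $\mathsf{M}$, we associate the usual probability space $(\Omega, \mathcal{F}, \mathbb{P})$, where $\Omega$ is the set of all runs in $\mathsf{M}$, $\mathcal{F}$ is the $\sigma$-field generated by all $\mathsf{Cone}(\varrho)$, and $\mathbb{P}$ is the unique probability measure such that $\mathbb{P}(\mathsf{Cone}(s_1 \cdots s_k)) = \mu_0(s_1) \cdot \prod_{i=1}^{k-1} \delta(s_i, s_{i+1})$. Furthermore, $\Diamond B$ ($\Diamond\Box B$) denotes the set of runs which eventually reach (eventually remain in) the set $B \subseteq S$, i.e. all runs where $\rho_i \in B$ for some $i$ (there exists an $i_0$ such that $\rho_i \in B$ for all $i \geq i_0$).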

Markov decision processes

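A Markov decision process (MDP) is a tuple $\mathcal{M} = (S, A, \mathsf{Av}, \Delta, s_0)$ where $S$ is a finite set of states, $A$ is a finite set of actions, $\mathsf{Av} : S \to 2^A \setminus \{\emptyset\}$ assigns to each state $s$ the set $\mathsf{Av}(s)$ of actions enabled in $s$ so that $\{\mathsf{Av}(s) \mid s \in S\}$ is a partitioning of $A$ (in other words, each action is associated with exactly one state), $\Delta : A \to \mathcal{D}(S)$ is a probabilistic transition function that given an action $a$ yields a probability distribution over the successor states, and $s_0 \in S$ is the initial state of the system.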

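A run $\rho$ of $\mathcal{M}$ is an infinite alternating sequence of states and actions $\rho = s_1 a_1 s_2 a_2 \cdots$ such that for all $i \geq 1$, we have $a_i \in \mathsf{Av}(s_i)$ and $\Delta(a_i, s_{i+1}) > 0$. Again, $\rho_i$ refers to the $i$-th state visited by this particular run. A path of length $k$ in $\mathcal{M}$ is a finite prefix $\varrho = s_1 a_1 \cdots a_{k-1} s_k$ of a run in $\mathcal{M}$.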

Strategies and plays.

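Intuitively, a strategy in an MDP is a “recipe” to choose actions based on the observed events. Usually, a strategy is defined as a function that given a finite path $\varrho$, representing the history of a play, gives a probability distribution over the actions enabled in the last state. We adopt the slightly different, though equivalent (Brázdil et al., 2014, Sec. 6) definition from (Chatterjee et al., 2015), which is more convenient for our setting.

Let $\mathsf{Mem}$ be a countable set of memory elements. A strategy is a triple $\sigma = (\sigma_u, \sigma_n, \alpha)$, where $\sigma_u : A \times S \times \mathsf{Mem} \to \mathcal{D}(\mathsf{Mem})$ and $\sigma_n : S \times \mathsf{Mem} \to \mathcal{D}(A)$ are memory update and next move functions, respectively, and $\alpha \in \mathcal{D}(\mathsf{Mem})$ is the initial memory distribution. We require that, for all $(s, m) \in S \times \mathsf{Mem}$, the distribution $\sigma_n(s, m)$ assigns positive values only to actions available at $s$, i.e. $\mathrm{supp}\, \sigma_n(s, m) \subseteq \mathsf{Av}(s)$.

A play of $\mathcal{M}$ determined by a strategy $\sigma$ is a Markov chain $\mathcal{M}^\sigma = (S^\sigma, \delta^\sigma, \mu_0^\sigma)$, where the set of states is $S^\sigma = S \times \mathsf{Mem} \times A$, the initial distribution $\mu_0^\sigma$ is zero except for $\mu_0^\sigma(s_0, m, a) = \alpha(m) \cdot \sigma_n(s_0, m, a)$, and the transition probability from $(s, m, a)$ to $(s', m', a')$ is $\Delta(a, s') \cdot \sigma_u(a, s', m, m') \cdot \sigma_n(s', m', a')$. Hence, $\mathcal{M}^\sigma$ starts in a location chosen randomly according to $\alpha$ and $\sigma_n$. In state $(s, m, a)$ the next action to be performed is $a$, hence the probability of entering $s'$ is $\Delta(a, s')$. The probability of updating the memory to $m'$ is $\sigma_u(a, s', m, m')$, and the probability of selecting $a'$ as the next action is $\sigma_n(s', m', a')$. Since these choices are independent, we obtain the product above.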
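The product construction of a play can be sketched in a few lines. The following is a minimal illustration, assuming that the transition probability of the play is the product of the MDP step, the memory update, and the next-move choice as described above; the function name `induced_chain`, the argument order of `sigma_u`, and the tiny example MDP are hypothetical and only serve the illustration.

```python
from fractions import Fraction


def induced_chain(states, memory, avail, delta, s0, alpha, sigma_n, sigma_u):
    """Build the play as a Markov chain over triples (state, memory, action).

    avail:   state -> list of enabled actions
    delta:   action -> {successor state: probability}
    alpha:   {memory element: probability}              (initial memory distribution)
    sigma_n: (state, mem) -> {action: probability}      (next move)
    sigma_u: (action, state, mem) -> {mem: probability} (memory update, assumed order)
    """
    # Initial distribution: draw the memory from alpha, then the first action.
    init = {}
    for m, p_mem in alpha.items():
        for a, p_act in sigma_n(s0, m).items():
            init[(s0, m, a)] = p_mem * p_act

    # Transition function: MDP step, then memory update, then next move.
    trans = {}
    for s in states:
        for m in memory:
            for a in avail[s]:
                row = {}
                for s2, p_step in delta[a].items():
                    for m2, p_upd in sigma_u(a, s2, m).items():
                        for a2, p_move in sigma_n(s2, m2).items():
                            key = (s2, m2, a2)
                            row[key] = row.get(key, 0) + p_step * p_upd * p_move
                trans[(s, m, a)] = row
    return init, trans


# Hypothetical MDP: in s0 either loop with "a" or move to the sink s1 with "b".
half = Fraction(1, 2)
states = ["s0", "s1"]
avail = {"s0": ["a", "b"], "s1": ["c"]}
delta = {"a": {"s0": Fraction(1)}, "b": {"s1": Fraction(1)}, "c": {"s1": Fraction(1)}}
memory = ["stay", "go"]
alpha = {"stay": half, "go": half}  # randomize once, in the initial memory


def sigma_n(s, m):
    if s == "s1":
        return {"c": Fraction(1)}
    return {"a": Fraction(1)} if m == "stay" else {"b": Fraction(1)}


def sigma_u(a, s, m):
    return {m: Fraction(1)}  # the memory never changes


init, trans = induced_chain(states, memory, avail, delta, "s0", alpha, sigma_n, sigma_u)
```

The sketch also creates rows for triples that the strategy never visits (for example `("s0", "stay", "b")`); they are harmless because the initial distribution assigns them probability zero.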

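Technically, $\mathcal{M}^\sigma$ induces a probability measure $\mathbb{P}^\sigma$ on $(S \times \mathsf{Mem} \times A)^\omega$. Since we mostly work with the corresponding runs in the original MDP, we overload $\mathbb{P}^\sigma$ to also refer to the probability measure obtained by projecting onto $(S \times A)^\omega$. Further, “almost surely” etc. refers to happening with probability 1 according to $\mathbb{P}^\sigma$. The expected value of a random variable $X$ is $\mathbb{E}^\sigma[X] = \int X \, d\mathbb{P}^\sigma$.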

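A convex combination of two strategies $\sigma_1$ and $\sigma_2$, written as $\sigma_\lambda = \lambda \sigma_1 + (1 - \lambda) \sigma_2$, can be obtained by defining the memory as the disjoint union of the memories of $\sigma_1$ and $\sigma_2$, randomly choosing one of the two strategies via the initial memory distribution and then following the chosen strategy. Clearly, we have that $\mathbb{P}^{\sigma_\lambda} = \lambda \mathbb{P}^{\sigma_1} + (1 - \lambda) \mathbb{P}^{\sigma_2}$.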

Strategy types.

A strategy may use infinite memory $\mathsf{Mem}$, and both $\sigma_u$ and $\sigma_n$ may randomize. The strategy is

  • deterministic-update, if $\alpha$ is Dirac and the memory update function gives a Dirac distribution for every argument;

  • deterministic, if it is deterministic-update and the next move function gives a Dirac distribution for every argument.

A stochastic-update strategy is a strategy that is not necessarily deterministic-update and a randomized strategy is a strategy that is not necessarily deterministic. We also classify the strategies according to the size of memory they use. Important subclasses are memoryless strategies, in which $\mathsf{Mem}$ is a singleton, $k$-memory strategies, in which $\mathsf{Mem}$ has exactly $k$ elements, and finite-memory strategies, in which $\mathsf{Mem}$ is finite.

End components.

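A tuple $(T, B)$ where $\emptyset \neq T \subseteq S$ and $\emptyset \neq B \subseteq \bigcup_{t \in T} \mathsf{Av}(t)$ is an end component of the MDP $\mathcal{M}$ if (i) for all actions $a \in B$, $\Delta(a, s') > 0$ implies $s' \in T$; and (ii) for all states $s, t \in T$ there is a path $\varrho = s_1 a_1 \cdots a_{k-1} s_k \in (T B)^{k-1} T$ with $s_1 = s$, $s_k = t$. An end component $(T, B)$ is a maximal end component (MEC) if $T$ and $B$ are maximal with respect to subset ordering. Given an MDP, the set of MECs is denoted by $\mathsf{MEC}$. By abuse of notation, $s \in M$ refers to all states of a MEC $M$, while $a \in M$ refers to the actions.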

Remark 1.

Computing the maximal end component (MEC) decomposition of an MDP, i.e. the computation of $\mathsf{MEC}$, is in P (Courcoubetis and Yannakakis, 1995).
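A simple (not asymptotically optimal) way to compute the MEC decomposition is to alternate between pruning actions that may leave the current candidate and splitting the candidate into strongly connected components. The following sketch assumes the MDP is given by two dictionaries; the names `mec_decomposition`, `sccs`, `avail`, and `delta`, as well as the example MDP, are hypothetical and not taken from the cited work.

```python
def sccs(cand, delta):
    """Strongly connected components of the graph induced by the candidate sub-MDP."""
    succ = {s: {t for a in cand[s] for t, p in delta[a].items() if p > 0 and t in cand}
            for s in cand}

    def reach(s):
        seen, stack = {s}, [s]
        while stack:
            u = stack.pop()
            for t in succ[u]:
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen

    r = {s: reach(s) for s in cand}
    comps, done = [], set()
    for s in cand:
        if s not in done:
            comp = {t for t in r[s] if s in r[t]}  # mutually reachable states
            comps.append(comp)
            done |= comp
    return comps


def mec_decomposition(avail, delta):
    """avail: state -> iterable of actions; delta: action -> {state: probability}."""
    mecs = []
    work = [{s: set(avail[s]) for s in avail}]  # candidates: state -> allowed actions
    while work:
        cand = work.pop()
        # Remove actions that may leave the candidate; drop states left without actions.
        changed = True
        while changed:
            changed = False
            for s in list(cand):
                kept = {a for a in cand[s]
                        if all(t in cand for t, p in delta[a].items() if p > 0)}
                if kept:
                    cand[s] = kept
                else:
                    del cand[s]
                    changed = True
        if not cand:
            continue
        comps = sccs(cand, delta)
        if len(comps) == 1:
            # Closed under its actions and strongly connected: an end component.
            actions = frozenset(a for acts in cand.values() for a in acts)
            mecs.append((frozenset(cand), actions))
        else:
            for comp in comps:
                work.append({s: set(cand[s]) for s in comp})
    return mecs


# Hypothetical example: s0 may loop (a) or gamble (b) into one of two sinks.
avail = {"s0": ["a", "b"], "s1": ["c"], "s2": ["d"]}
delta = {"a": {"s0": 1.0}, "b": {"s1": 0.5, "s2": 0.5}, "c": {"s1": 1.0}, "d": {"s2": 1.0}}
mecs = mec_decomposition(avail, delta)
```

Each split produces strictly smaller candidates, so the procedure terminates after polynomially many rounds; faster algorithms exist, but this version keeps the two ingredients (closure under actions, strong connectivity) visible.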

Remark 2.

For any MDP $\mathcal{M}$ and strategy $\sigma$, a run almost surely eventually stays in one MEC, i.e. $\mathbb{P}^\sigma[\bigcup_{M \in \mathsf{MEC}} \Diamond\Box M] = 1$ (Puterman, 1994).

2.2. Random variables on Runs

We introduce two standard random variables, assigning a value to each run of a Markov Chain or Markov Decision Process.

Weighted reachability.

Let $T \subseteq S$ be a set of target states and $r : T \to \mathbb{Q}$ be a reward function. Define the random variable $R^r$ as $R^r(\rho) = r(\rho_i)$ for $i = \min \{ j \mid \rho_j \in T \}$, if such an $i$ exists, and $0$ otherwise. Informally, $R^r$ assigns to each run the value of the first visited target state, or $0$ if none. $R^r$ is measurable and discrete, as $S$ is finite (Puterman, 1994). Whenever we are dealing with weighted reachability, we assume w.l.o.g. that all target states are absorbing, i.e. for any $s \in T$ we have $\delta(s, s) = 1$ for MC and $\Delta(a, s) = 1$ for all $a \in \mathsf{Av}(s)$ for MDP.

Mean payoff (also known as long-run average reward, and limit average reward). Again, let $r : S \to \mathbb{Q}$ be a reward function. The mean payoff of a run $\rho$ is the average reward obtained per step, i.e. $R^m(\rho) = \liminf_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} r(\rho_i)$. The $\liminf$ is necessary, since $\lim$ may not be defined in general. Further, $R^m$ is measurable (Puterman, 1994).
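For an ultimately periodic run the mean payoff is simply the average reward on the repeated cycle, since any finite prefix washes out in the limit. The small sketch below illustrates this with exact arithmetic; the function names and the reward sequences are hypothetical.

```python
from fractions import Fraction


def prefix_average(rewards):
    """Average reward over a finite prefix r(rho_1), ..., r(rho_n)."""
    return Fraction(sum(rewards), len(rewards))


def lasso_mean_payoff(prefix, cycle):
    """Mean payoff of the run prefix . cycle^omega; the prefix does not matter."""
    return Fraction(sum(cycle), len(cycle))


prefix, cycle = [10, 10, 10], [1, 3]
exact = lasso_mean_payoff(prefix, cycle)
approx = prefix_average(prefix + cycle * 10_000)  # long finite approximation
```

The finite-prefix averages approach the cycle average as the prefix grows, which is why the three initial rewards of 10 have no influence on the mean payoff.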

Remark 3.

There are several distinct definitions of “weighted reachability”. The one chosen here primarily serves as foundation for the more general mean payoff.

3. Introducing the Conditional Value-at-risk

In order to define our problem, we first introduce the general concept of conditional value-at-risk (CVaR), also known as average value-at-risk, expected shortfall, and expected tail loss. As already hinted, the CVaR of some real-valued random variable $X$ and probability $p \in [0, 1]$ intuitively is the expectation below the worst $p$-quantile of $X$.

Let $X : \Omega \to \mathbb{R}$ be a random variable over the probability space $(\Omega, \mathcal{F}, \mathbb{P})$. The associated cumulative distribution function (CDF) $F_X : \mathbb{R} \to [0, 1]$ of $X$ yields the probability of $X$ being less than or equal to the given value $r$, i.e. $F_X(r) = \mathbb{P}[X \leq r]$. $F_X$ is non-decreasing and right continuous with left limits (càdlàg).

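The value-at-risk $\mathrm{VaR}_p$ is the worst $p$-quantile, i.e. a value $v$ s.t. the probability of $X$ attaining a value less than or equal to $v$ is $p$:

$$\mathrm{VaR}_p(X) := \sup \{ r \in \mathbb{R} \mid F_X(r) \leq p \}$$

(An often used, mostly equivalent definition is $\inf \{ r \in \mathbb{R} \mid F_X(r) \geq p \}$. Unfortunately, this would lead to some complications later on. See Sec. A.1 for details.)

Then, with $v = \mathrm{VaR}_p(X)$, $\mathrm{CVaR}_p$ can be defined as (Rockafellar and Uryasev, 2000)

$$\mathrm{CVaR}_p(X) := \frac{1}{p} \int_{(-\infty, v]} x \, dF_X(x)$$

with the corner cases $\mathrm{CVaR}_0 := \mathrm{VaR}_0$ and $\mathrm{CVaR}_1 := \mathbb{E}$.

Unfortunately, this definition only works as intended for continuous $F_X$, as shown by the following example.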




Figure 2. Distribution showing peculiarities of CVaR.

Example 3.1.

Consider a random variable with a distribution as outlined in Fig. 2. For , we certainly have . On the other hand, for any , we get . Consequently, the integral remains constant and would actually decrease for increasing , not matching the intuition.

General definition.

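As seen in Ex. 3.1, the previous definition breaks down when $F_X$ is not continuous at the $p$-quantile and consequently $F_X(\mathrm{VaR}_p(X)) > p$. Thus, we handle the values at the threshold separately, similar to (Rockafellar and Uryasev, 2002).

Definition 3.2.

Let $X$ be some random variable and $p \in (0, 1)$. With $v = \mathrm{VaR}_p(X)$, the $\mathrm{CVaR}_p$ of $X$ is defined as

$$\mathrm{CVaR}_p(X) := \frac{1}{p} \left( \int_{(-\infty, v)} x \, dF_X(x) + \bigl(p - \mathbb{P}[X < v]\bigr) \cdot v \right)$$

which can be rewritten as

$$\mathrm{CVaR}_p(X) = \frac{1}{p} \Bigl( \mathbb{P}[X < v] \cdot \mathbb{E}[X \mid X < v] + \bigl(p - \mathbb{P}[X < v]\bigr) \cdot v \Bigr).$$

The corner cases again are $\mathrm{CVaR}_0 := \mathrm{VaR}_0$ and $\mathrm{CVaR}_1 := \mathbb{E}$.

Since the degenerate cases of $p = 0$ and $p = 1$ reduce to already known problems, we exclude them in the following.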

We demonstrate this definition on the previous example.

Example 3.3.

Again, consider the random variable from Ex. 3.1. For we have that . The right hand side of the definition captures the remaining discrete probability mass which we have to handle separately. Together with we get . For example, with , this yields the expected result .
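For discrete payoffs, the general definition can be evaluated by a single pass over the sorted values. The sketch below assumes that VaR is the supremum of all values whose CDF does not exceed $p$, and that CVaR adds the mass strictly below the VaR plus the missing share $p - \mathbb{P}[X < v]$ at the VaR itself; the function name `var_cvar` and the example distribution are hypothetical and unrelated to Fig. 2.

```python
from fractions import Fraction


def var_cvar(dist, p):
    """Return (VaR_p, CVaR_p) of a discrete distribution for 0 < p < 1.

    dist: {value: probability}, probabilities summing to 1 (ints or Fractions).
    """
    p = Fraction(p)
    items = sorted((Fraction(v), Fraction(pr)) for v, pr in dist.items() if pr > 0)
    below = Fraction(0)     # P[X < v] for the current candidate v
    tail_sum = Fraction(0)  # sum of x * P[X = x] over all x < v
    for v, pr in items:
        if below + pr > p:
            # F(v) > p while F(r) <= p for all r < v, hence v is the VaR.
            cvar = (tail_sum + (p - below) * v) / p
            return v, cvar
        below += pr
        tail_sum += v * pr
    raise ValueError("probabilities must sum to 1 and p must be below 1")


# Hypothetical payoff distribution: 5% disaster, 20% mediocre, 75% good.
dist = {0: Fraction(1, 20), 4: Fraction(1, 5), 10: Fraction(3, 4)}
var_10, cvar_10 = var_cvar(dist, Fraction(1, 10))
var_05, cvar_05 = var_cvar(dist, Fraction(1, 20))
```

For $p = 0.1$ the worst tenth of the probability mass consists of the disaster outcome ($5\%$) and half as much again of the mediocre one, giving a CVaR of $2$ with a VaR of $4$. For $p = 0.05$ the tail is exactly the disaster outcome, so the CVaR is $0$ while the VaR has already jumped to $4$.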

Remark 4.

Recall that $\mathbb{P}[X < r]$ can be expressed as the left limit of $F_X$, namely $\mathbb{P}[X < r] = \lim_{r' \to r^-} F_X(r')$. Hence, $\mathrm{CVaR}_p(X)$ solely depends on the CDF of $X$ and thus random variables with the same CDF also have the same CVaR.

We say that $F_1$ stochastically dominates $F_2$ for two CDF $F_1$ and $F_2$, if $F_1(r) \leq F_2(r)$ for all $r$. Intuitively, this means that a sample drawn from $F_1$ is likely to be larger or equal to a sample from $F_2$. All three investigated operators ($\mathbb{E}$, $\mathrm{CVaR}$, and $\mathrm{VaR}$) are monotone w.r.t. stochastic dominance, see Sec. A.1.

4. CVaR in MC and MDP: Problem statement

Now, we are ready to define our problem framework. First, we explain the types of building blocks for our queries, namely lower bounds on expectation, CVaR, and VaR. Formally, we consider the following types of constraints.

$$\mathbb{E}[X] \geq e, \qquad \mathrm{CVaR}_p(X) \geq c, \qquad \mathrm{VaR}_q(X) \geq v$$

$X$ is some real-valued random variable, assigning a payoff to each run. With these constraints, the classes of queries are denoted by

  • are the types of constraints,

  • is the type of the objective function, either weighted reachability or mean payoff , and

  • is the dimensionality of the query.

We use to denote the dimensions of the problem, iff . As usual, we assume that all quantities of the input, e.g., probabilities of distributions, are rational.

An instance of these queries is specified by an MDP , a -dimensional reward function , and constraints from , given by vectors and . This implies that in each dimension there is at most one constraint per type. The presented methods can easily be extended to the more general setting of multiple constraints of a particular type in one dimension. The decision problem is to determine whether there exists a strategy such that all constraints are met.

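Technically, this is defined as follows. Let $X$ be the $d$-dimensional random variable induced by the objective and reward function $r$, operating on the probability space of $\mathcal{M}^\sigma$. The strategy $\sigma$ is a witness to the query iff for each dimension $j \in [d]$ we have that $\mathbb{E}[X_j] \geq e_j$, $\mathrm{CVaR}_{p_j}(X_j) \geq c_j$, and $\mathrm{VaR}_{q_j}(X_j) \geq v_j$. Moreover, $\bot$ constraints are trivially satisfied.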

For completeness sake, we also consider queries, i.e. the corresponding problem on (finite state) Markov chains.


We introduce the following abbreviations. When dealing with an MDP , denotes relative to the probability space over runs induced by the strategy . When additionally the random variable (e.g., mean payoff) is clear from the context, we may write and instead of and , respectively. We also define analogous abbreviations for .

5. Single dimension

We show that all queries in one dimension are in P. Furthermore, our LP-based decision procedures directly yield a description of a witness strategy and allow for optimization objectives. We refer to the input constraints by for expectation, for CVaR, and for VaR. Further, we use for indices related to SCCs / MECs.

5.1. Weighted reachability

First, we show the simple result for Markov chains, providing some insight into the techniques used in the MDP case.

Theorem 5.1.

Single-dimensional weighted reachability queries on Markov chains are in P.

Proof.

Let $\mathsf{M}$ be a finite-state Markov chain, $r$ a reward function, and $T$ the target set. Recall that all $t \in T$ are absorbing, hence single-state BSCCs. We obtain the stationary distribution $\pi$ of $\mathsf{M}$ in polynomial time by, e.g., solving a linear equation system (Puterman, 1994). With $\pi$, we can directly compute the CDF of the weighted reachability payoff from the probabilities $\pi(t)$, $t \in T$, and immediately decide the query. ∎
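The proof idea can be sketched as follows: compute, for every target state, the probability of being absorbed there by solving one linear equation system, and read off the payoff distribution. This is a minimal illustration with exact rational arithmetic; it assumes that runs which never reach a target obtain payoff 0, and the names `solve`, `reach_distribution`, and the example chain are hypothetical.

```python
from fractions import Fraction


def solve(A, b):
    """Gauss-Jordan elimination over Fractions for a nonsingular square system."""
    n = len(A)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        pivot = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[pivot] = M[pivot], M[c]
        pv = M[c][c]
        M[c] = [x / pv for x in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                f = M[r][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [row[n] for row in M]


def reach_distribution(delta, init, reward):
    """delta: state -> {successor: prob}; init: {state: prob}; reward: {target: value}.

    Returns {payoff: probability}; payoff 0 is assumed when no target is reached.
    """
    targets = set(reward)
    can = set(targets)  # states from which some target is reachable
    changed = True
    while changed:
        changed = False
        for s, row in delta.items():
            if s not in can and any(p > 0 and t in can for t, p in row.items()):
                can.add(s)
                changed = True
    trans = [s for s in delta if s in can and s not in targets]
    idx = {s: i for i, s in enumerate(trans)}
    # (I - Q) h = b, where Q is the transition matrix restricted to `trans`.
    A = [[Fraction(int(s == u)) - Fraction(delta[s].get(u, 0)) for u in trans]
         for s in trans]
    dist = {}
    for t in targets:
        b = [Fraction(delta[s].get(t, 0)) for s in trans]
        h = solve(A, b) if trans else []
        prob = sum(Fraction(pr) * (1 if s == t else h[idx[s]] if s in idx else 0)
                   for s, pr in init.items())
        dist[reward[t]] = dist.get(reward[t], 0) + prob
    missing = 1 - sum(dist.values())
    if missing > 0:
        dist[Fraction(0)] = dist.get(0, 0) + missing
    return dist


half, quarter = Fraction(1, 2), Fraction(1, 4)
delta = {
    "s0": {"s1": half, "t1": half},
    "s1": {"s0": half, "t2": quarter, "sink": quarter},
    "t1": {"t1": 1}, "t2": {"t2": 1}, "sink": {"sink": 1},
}
dist = reach_distribution(delta, {"s0": 1}, {"t1": 10, "t2": 4})
expectation = sum(v * p for v, p in dist.items())
```

Once the distribution is available, expectation, VaR, and CVaR follow directly from their definitions, so each constraint of the query can be checked by comparison with its threshold.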

Let us consider the more complex case of MDP. We show a lower bound on the type of strategies necessary to realize queries with constraints on expectation and one of VaR or CVaR. We then continue to prove that this class of strategies is optimal. This characterization is used to derive a polynomial time decision procedure based on a linear program (LP) which immediately yields a witness strategy. Finally, when we deal with the mean payoff case in Sec. 5.2, we make use of the reasoning presented in this section.

Randomization is necessary for weighted reachability.

In the following example, we present a simple MDP on which all deterministic strategies fail to satisfy specific constraints, while a straightforward randomizing one succeeds in doing so.

Figure 3. MDP used to show various difficulties of

Example 5.2.

Consider the MDP outlined in Fig. 3. The only non-determinism is given by the choice in the initial state . Hence, any strategy is characterised by the choice in that particular state. Let now and denote the deterministic strategies playing and in , respectively. Clearly, achieves an expectation, , and of . On the other hand, obtains an expectation of with and equal to .

Thus, neither strategy satisfies the constraints , , and (or ). This is the case even when the strategy has arbitrary (deterministic) memory at its disposal, since in the first step there is nothing to remember. Yet, achieves , , and .

Hence strategies satisfying an expectation constraint together with either a CVaR or VaR constraint may necessarily involve randomization in general. We prove that (i) under mild assumptions randomization actually is sufficient, i.e. no memory is required, and (ii) fixed memory may additionally be required in general.

Definition 5.3.

Let $\mathcal{M}$ be an MDP with target set $T$ and reward function $r$. We say that $\mathcal{M}$ satisfies the attraction assumption if A1) the target set $T$ is reached almost surely for any strategy, or A2) for all target states $t \in T$ we have $r(t) \geq 0$.

Essentially, this definition implies that an optimal strategy never remains in a non-target MEC. This allows us to design memoryless strategies for the weighted reachability problem.

Theorem 5.4.

Memoryless randomizing strategies are sufficient for single-dimensional weighted reachability queries under the attraction assumption.

Proof.

Fix an MDP $\mathcal{M}$ and reward function $r$. We prove that for any strategy $\sigma$ there exists a memoryless, randomizing strategy $\sigma'$ achieving at least the expectation, VaR, and CVaR of $\sigma$.

All target states $t \in T$ form single-state MECs, as we assumed that all target states are absorbing. Consequently, $\sigma$ naturally induces a distribution over these $t$. Now, we apply (Etessami et al., 2008, Theorem 3.2) to obtain a strategy $\sigma'$ with $\mathbb{P}^{\sigma'}[\Diamond t] \geq \mathbb{P}^{\sigma}[\Diamond t]$ for all $t \in T$.

With A1), we have $\sum_{t \in T} \mathbb{P}^{\sigma}[\Diamond t] = 1$ and thus $\mathbb{P}^{\sigma'}[\Diamond t] = \mathbb{P}^{\sigma}[\Diamond t]$. Hence, $\sigma'$ obtains the same CDF for the weighted reachability objective. Under A2), the CDF of strategy $\sigma'$ stochastically dominates the CDF of the original strategy $\sigma$, concluding the proof. ∎

Theorem 5.5.

Two-memory stochastic strategies (i.e. with both randomization and stochastic update) are sufficient for single-dimensional weighted reachability queries.

The proof is a simple application of the following Thm. 5.10, as weighted reachability is a special case of mean payoff. Together with an example for the lower bound it can be found in Sec A.2.

  1. All variables , , are non-negative.

  2. Transient flow for :

  3. Switching to recurrent behaviour:

  4. -consistent split:

  5. Probability-consistent split:

  6. CVaR and expectation satisfaction:

Figure 4. LP used to decide weighted reachability queries given a guess of . , .

Inspired by (Chatterjee et al., 2015, Fig. 3), we use the optimality result from Thm. 5.4 to derive a decision procedure for weighted reachability queries under the attraction assumptions based on the LP in Fig. 4.

To simplify the LP, we make further assumptions – see Sec A.2.2 for details. First, all MECs, including non-target ones, consist of a single state. Second, all MECs from which is not reachable are considered part of and have (similar to the “cleaned-up MDP” from (Etessami et al., 2008)). Finally, we assume that the quantile-probabilities are equal, i.e. . The LP can easily be extended to account for different values by duplicating the variables and adding according constraints.

The central idea is to characterize randomizing strategies by the “flow” they achieve. To this end, Equality (2) essentially models Kirchhoff’s law, i.e. inflow and outflow of a state have to be equal. In particular, expresses the transient flow of the strategy as the expected total number of uses of action . Similarly, models the recurrent flow, which under our absorption assumption equals the probability of reaching . Equality (3) ensures that all transient behaviour eventually changes into recurrent one.

In order to deal with our query constraints, Constraints (4) and (5) extract the worst fraction of the recurrent flow, ensuring that the is at least . Note that equality is not guaranteed by the LP; if for all , we have . Finally, Inequality (6) enforces satisfaction of the constraints.

Theorem 5.6 ().

Let be an MDP with target states and reward function , satisfying the attraction assumption. Fix the constraint probability and thresholds . Then, we have that

  1. for any strategy satisfying the constraints, there is a such that the LP in Fig. 4 is feasible, and

  2. for any threshold , a solution of the LP in Fig. 4 induces a memoryless, randomizing strategy satisfying the constraints and .


First, we prove for a strategy satisfying the constraints that there exists a such that the LP is feasible. By Thm. 5.4, we may assume that is a memoryless randomizing strategy. From (Etessami et al., 2008, Theorem 3.2), we get an assignment to the ’s and ’s satisfying Equalities (1), (2), and (3) such that for all target states . Further, let be the value-at-risk of the strategy. By definition of , we have that .

Assume for now that , i.e. the probability of obtaining a value strictly smaller than is exactly . In this case, choose to be the next smaller reward, i.e. . We set for all , satisfying Constraints (4) and (5).

Otherwise, we have . Now, some non-zero fraction of the probability mass at contributes to the . Again, we set the values for according to Constraint (4). The only degrees of freedom are the values of where . There, we assign the values so that , satisfying Equality (5).

It remains to check Inequality (6). For expectation, we have . For CVaR, notice that, due to the already proven Constraints (4) and (5), the side of Inequality (6) is equal to and thus at least .

Second, we prove that a solution to the LP induces the desired strategy . Again by (Etessami et al., 2008, Theorem 3.2), we get a memoryless randomizing strategy such that for all states . Then . Further,

by definition. Now, we make a case distinction on for all . If this is true, we have , but . Consequently, and . Otherwise, we have and consequently . Inserting in the above equation immediately gives the result . ∎

The linear program requires knowing the value-at-risk beforehand, which in turn clearly depends on the chosen strategy. Yet, there are only linearly many values the random variable attains. Thus we can simply try to find a solution for all potential values of the value-at-risk, i.e. the rewards $r(t)$ of the target states $t \in T$, yielding a polynomial time solution.

Corollary 5.7.

Single-dimensional weighted reachability queries on MDP are in P.

Proof.

Under the attraction assumption, this follows directly from Thm. 5.6. In general, the reduction to mean payoff used by Thm. 5.5 and the respective result from Cor. 5.11 show the result. ∎

5.2. Mean payoff

In this section, we investigate the case of mean payoff. Again, the construction for MC is comparatively simple, yet instructive for the following MDP case.

Theorem 5.8.

Single-dimensional mean payoff queries on Markov chains are in P.

Proof sketch.

For each BSCC $B_i$, we obtain its expected mean payoff $r_i$ through, e.g., a linear equation system (Puterman, 1994). Almost all runs in $B_i$ achieve this mean payoff and thus the corresponding random variable is discrete. We reduce the problem to weighted reachability by using the known reformulation that the expected mean payoff equals $\sum_{B_i \in \mathsf{BSCC}} \mathbb{P}[\Diamond B_i] \cdot r_i$.

We replace each of these BSCCs by a representative $b_i$ to obtain $\mathsf{M}'$. Define the set of target states $T = \{ b_i \}$ and the reachability reward function $r'(b_i) = r_i$. By applying the approach of Thm. 5.1, we obtain the expectation, $\mathrm{VaR}$, and $\mathrm{CVaR}$ for reachability in $\mathsf{M}'$ which by construction coincide with the respective values for mean payoff in $\mathsf{M}$. ∎
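The first step of the proof sketch, the expected mean payoff of a single BSCC, can be illustrated by computing the stationary distribution of the BSCC and weighting the state rewards with it. The sketch below assumes the BSCC is given as a row-stochastic matrix of an irreducible chain with state rewards; the function names and the two-state example are hypothetical.

```python
from fractions import Fraction


def solve(A, b):
    """Gauss-Jordan elimination over Fractions for a nonsingular square system."""
    n = len(A)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        pivot = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[pivot] = M[pivot], M[c]
        pv = M[c][c]
        M[c] = [x / pv for x in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                f = M[r][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [row[n] for row in M]


def stationary(P):
    """Stationary distribution pi of an irreducible chain: pi P = pi, sum(pi) = 1."""
    n = len(P)
    # Balance equation for state i: sum_j pi_j * P[j][i] - pi_i = 0.
    A = [[Fraction(P[j][i]) - (1 if i == j else 0) for j in range(n)] for i in range(n)]
    A[n - 1] = [Fraction(1)] * n  # replace one redundant equation by normalisation
    b = [Fraction(0)] * (n - 1) + [Fraction(1)]
    return solve(A, b)


def bscc_mean_payoff(P, rewards):
    """Mean payoff obtained by almost all runs inside the BSCC."""
    return sum(p * r for p, r in zip(stationary(P), rewards))


P = [[Fraction(1, 2), Fraction(1, 2)],
     [Fraction(1, 4), Fraction(3, 4)]]
pi = stationary(P)
payoff = bscc_mean_payoff(P, [3, 0])
```

In the reduction, this value would become the reward of the representative target state that replaces the BSCC.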

For the MDP case, recall that simple expectation maximization of mean payoff can be reduced to weighted reachability (Ashok et al., 2017) and deterministic, memoryless strategies are optimal (Puterman, 1994). Yet, solving a conjunctive query involving either VaR or CVaR needs more powerful strategies than in the weighted reachability case of Thm. 5.4. Nevertheless, we show how to decide these queries in P.

Randomization and memory are necessary for mean payoff.

A simple modification of the MDP in Fig. 3 yields an MDP where both randomization and memory are required to satisfy the constraints of the following example.

Figure 5. Memory is necessary for mean payoff queries

Example 5.9.

Consider the MDP presented in Fig. 5. There, the same constraints as before, i.e. , , and (or ), can only be satisfied by strategies with both memory and randomization. Clearly, a pure strategy can only satisfy either of the two constraints again. But now a memoryless randomizing strategy also is insufficient, too, since any non-zero probability on action leads to almost all runs ending up on the right side of the MDP, hence yielding a and of . Instead, a stochastic strategy with can simply choose and play the corresponding action indefinitely, satisfying the constraints.

We prove that this bound actually is tight, i.e. that, given stochastic memory update, two memory elements are sufficient.

Theorem 5.10.

Two-memory stochastic strategies (i.e. with both randomization and stochastic update) are sufficient for .


Let be a strategy on an MDP with reward function . We construct a two-memory stochastic strategy achieving at least the expectation, VaR, and CVaR of .

First, we obtain a memoryless deterministic strategy which obtains the maximal possible mean payoff in each MEC (Puterman, 1994). We then apply the construction of (Brázdil et al., 2014, Proposition 4.2) (see also (Chatterjee et al., 2015, Lemma 5.7)), where the is our . (Technically, this can be ensured by choosing the constraints of the LP according to .)

Intuitively, this constructs a two-memory strategy on behaving as follows. Initially, remains in each MEC with the same probability as , i.e. by following a memoryless “searching” strategy and stochastically switching its memory state to “remain”. Once in the “remain” state, the behaviour of the optimal strategy is implemented.

Clearly, (i) both strategies remain in a particular MEC with the same probability, and (ii)  obtains at least as much value in each MEC as . Hence the CDF induced by stochastically dominates the one of , concluding the proof. ∎

This immediately gives us a polynomial time decision procedure.

Corollary 5.11.

is in P.

Furthermore, we can use results of (Chatterjee et al., 2015, Lemma 16) to trade the stochastic update for more memory.

Corollary 5.12.

Stochastic strategies with finite, deterministic memory are sufficient for .

Deterministic strategies may require exponential memory.

As sources of randomness are not always available, one might ask what can be hoped for when only determinism is allowed. As already shown in Ex. 5.2, randomization is required in general. But even if some deterministic strategy is sufficient, it may require memory exponential in the size of the input, even in an MDP with only 3 states. We show this in the following example.

Figure 6. Exponential memory is necessary for mean payoff when only deterministic update is allowed.

Example 5.13.

Consider the MDP outlined in Fig. 6 together with the constraints , , and (or ). Again, any optimal strategy needs a significant part of runs to go to the right side in order to satisfy the expectation constraint. Yet, any strategy can only “move” a small fraction of the runs there in each step. In particular, after steps, the right side is only reached with probability at most . When choosing , which needs bits to encode, a deterministic strategy requires memory elements to count the number of steps. The same holds true for any deterministic-update strategy.

On the other hand, a strategy with stochastic memory update can encode this counting by switching its state with a small probability after each step. For example, a strategy switching with probability from “play to “play satisfies the constraint.

5.3. Single constraint queries

In this section, we discuss an important sub-case of the single-dimensional case, namely queries with only a single constraint, i.e. . We show that deterministic memoryless strategies are sufficient in this case.

One might be tempted to use standard arguments and directly conclude this from the results of Thm. 5.4 as follows. Recall that this theorem shows that memoryless, randomizing strategies are sufficient, and that any such strategy can be written as a finite convex combination of memoryless, deterministic strategies. Most constraints, for example expectation or reachability, behave linearly under convex combination of strategies, e.g., . Consequently, for an optimal memoryless strategy, there is a deterministic witness, which in turn also is optimal.

Surprisingly, this assumption is not true for CVaR. On the contrary, the CVaR of a convex combination of strategies might be strictly worse than the CVaRs of either strategy, as shown in the following example. We prove a slightly weaker property of CVaR which eventually allows us to apply similar reasoning.

Example 5.14.

Recall the MDP in Fig. 3 and let . As previously shown, and , but the mixed strategy achieves instead of the convex combination .

For , we have . Yet, any non-trivial convex combination of the two strategies yields a less than . See Sec. A.1.3 for more details. With suitable constraints, this can effectively force an optimal strategy to choose between or . This observation is further exploited in the NP-hardness proof of the multi-dimensional case in Sec. 6.

Since CVaR considers the worst events, the CVaR of a combination intuitively cannot be better than the combination of the respective CVaRs. We prove this intuition in the general setting, where instead of a convex combination of strategies we consider a mixture of two random variables.

Lemma 5.15.

is convex in for fixed , i.e. for random variables and

The proof can be found in Sec. A.3. This result allows us to apply the ideas outlined in the beginning of the section.
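The lemma can be checked numerically on small examples. The sketch below reads it as the inequality CVaR(mixture) ≤ λ·CVaR(X1) + (1−λ)·CVaR(X2), where the mixture samples from X1 with probability λ and from X2 otherwise. This reading is an assumption based on the surrounding text, as is the convention that lower payoffs are worse. The distributions `safe` and `risky` and all function names are hypothetical and are not taken from Fig. 3.

```python
from fractions import Fraction


def cvar(outcomes, p):
    """Average payoff over the worst p-fraction of the mass; outcomes are (payoff, probability) pairs."""
    remaining, total = p, 0
    for payoff, prob in sorted(outcomes):
        take = min(prob, remaining)
        total += take * payoff
        remaining -= take
        if remaining <= 0:
            break
    return total / p


def mixture(lam, first, second):
    """Distribution that samples from `first` with probability lam and from `second` otherwise."""
    return ([(x, lam * q) for x, q in first]
            + [(x, (1 - lam) * q) for x, q in second])


# Hypothetical payoff distributions induced by two strategies.
safe = [(Fraction(5), Fraction(1))]
risky = [(Fraction(0), Fraction(1, 2)), (Fraction(10), Fraction(1, 2))]
p = Fraction(3, 4)

# For each mixing weight, record the CVaR of the mixture and the mixture of the CVaRs.
results = []
for k in range(11):
    lam = Fraction(k, 10)
    mixed = cvar(mixture(lam, safe, risky), p)
    bound = lam * cvar(safe, p) + (1 - lam) * cvar(risky, p)
    results.append((lam, mixed, bound))

for lam, mixed, bound in results:
    print(lam, mixed, bound, mixed <= bound)
```

In this instance the inequality is strict for every non-trivial weight, so mixing is never better than the weighted average of the two individual CVaRs. A check of this kind only illustrates the statement on one example and does not replace the proof.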

Theorem 5.16.

For any , deterministic memoryless strategies are sufficient for when .


This is known for  (Puterman, 1994) and  (Filar et al., 1995b).

For CVaR, observe that the convex combination of deterministic strategies cannot achieve a better CVaR than the best strategy involved in the combination (see Lem. 5.15). This immediately yields the result for through Thm. 5.4. For , we exploit the approach of Thm. 5.10. Recall that there we obtained a two-memory strategy . Both randomization and stochastic update are used solely to distribute the runs over all MECs accordingly. By the above reasoning, for each MEC it is sufficient to either almost surely remain there or leave it. This behaviour can be implemented by a deterministic memoryless strategy on the original MDP. ∎

6. Multiple Dimensions

In this section, we deal with multi-dimensional queries. We continue to use for indices related to MECs and further use for dimension indices. First, we show that the Markov chain case does not significantly change.

Theorem 6.1.

For any , is in P.


Similarly to the single-dimensional case, we decide each constraint in each dimension separately, using our previous results. The query is satisfied iff each of the constraints is satisfied. ∎
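The per-dimension check can be sketched as follows for a Markov chain whose runs end in finitely many outcomes, each carrying a payoff vector. This is an illustration under stated assumptions, not the procedure of the paper: constraints are taken to be lower bounds, lower payoffs are worse, only expectation and CVaR constraints are handled, and the names `satisfies`, `exp_bounds`, and `cvar_bounds` are hypothetical.

```python
from fractions import Fraction


def cvar(outcomes, p):
    """Average payoff over the worst p-fraction of the mass; outcomes are (payoff, probability) pairs."""
    remaining, total = p, 0
    for payoff, prob in sorted(outcomes):
        take = min(prob, remaining)
        total += take * payoff
        remaining -= take
        if remaining <= 0:
            break
    return total / p


def satisfies(outcomes, p, exp_bounds, cvar_bounds):
    """Check expectation and CVaR lower bounds separately in every dimension.

    outcomes: list of (probability, payoff_vector) pairs.
    """
    for i, (exp_bound, cvar_bound) in enumerate(zip(exp_bounds, cvar_bounds)):
        # Project the distribution onto dimension i.
        marginal = [(vector[i], prob) for prob, vector in outcomes]
        if sum(value * prob for value, prob in marginal) < exp_bound:
            return False
        if cvar(marginal, p) < cvar_bound:
            return False
    return True


# Hypothetical chain with two equally likely outcomes and two payoff dimensions.
outcomes = [(Fraction(1, 2), (0, 4)), (Fraction(1, 2), (10, 4))]
p = Fraction(1, 2)

print(satisfies(outcomes, p, exp_bounds=[5, 4], cvar_bounds=[0, 4]))
print(satisfies(outcomes, p, exp_bounds=[5, 4], cvar_bounds=[1, 4]))
```

Because a Markov chain leaves no choices to resolve, the dimensions do not interact, so the conjunction reduces to independent single-dimensional checks.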

6.1. NP-Hardness of reachability and mean payoff

For MDPs, on the other hand, multiple dimensions add significant complexity. In the following, we show that already the weighted reachability problem with multiple dimensions and only constraints, i.e. , is NP-hard. This result directly transfers to mean payoff, i.e. . Recall that in contrast and even , i.e. constraints on the expectation and ensuring that almost all runs achieve a given threshold, are in P (Chatterjee et al., 2015).

Theorem 6.2.

For any , is NP-hard (when the dimension is a part of the input).

Figure 7. Gadget for variable . Uniform transition probabilities are omitted for readability.

We prove hardness by reduction from 3-SAT. The core idea is to utilize observations from Fig. 3 and Ex. 5.14, namely that CVaR constraints can be used to enforce a deterministic choice.

Let be a set of clauses with variables and set the dimensions . By abuse of notation, refers to the dimension of clause and to the one of variable , respectively.

The gadget for the reduction is outlined in Fig. 7. Observe that, due to the structure of the MDP, we have that .

Overall, the reduction works as follows. Initially, a state , representing the variable , is chosen uniformly. In this state, the strategy is asked to give the valuation of through the actions or . As seen in Ex. 5.14, the structure of the shaded states can be used to enforce a deterministic choice between the two actions. Particularly, in dimension we require for . Since all other gadgets yield 0 in dimension and only half of the runs going through end up in the shaded area, this corresponds to Ex. 5.14, where .

Once in either state or , a state corresponding to a clause satisfied by this assignment is chosen uniformly. In the example gadget, we would have , and . We set the reward of to . Then a clause is satisfied under the assignment if the state is visited with positive probability, e.g. if . Clearly, a satisfying assignment exists iff a strategy satisfying these constraints exists. ∎

6.2. NP-completeness and strategies for reachability

For weighted reachability, we prove that the previously presented bound is tight, i.e. that the weighted reachability problem with multiple dimensions and constraints is NP-complete when is part of the input and otherwise. First, we show that the strategy bounds of the single-dimensional case directly transfer. Intuitively, this is the case since only the steady-state distribution over the target set is relevant, independent of the dimensionality.

Theorem 6.3.

Two-memory stochastic strategies (i.e. with both randomization and stochastic update) are sufficient for . Moreover, if for all and , then memoryless randomizing strategies are sufficient.


Follows directly from the reasoning used in the proofs of Thm. 5.10 and Thm. 5.4. ∎

  1. All variables , , are non-negative.

  2. -consistent split for :

  3. Probability-consistent split for :

  4. CVaR and expectation satisfaction for :