# Partial and Conditional Expectations in Markov Decision Processes with Integer Weights

The paper addresses two variants of the stochastic shortest path problem ('optimize the accumulated weight until reaching a goal state') in Markov decision processes (MDPs) with integer weights. The first variant optimizes partial expected accumulated weights, where paths not leading to a goal state are assigned weight 0, while the second variant considers conditional expected accumulated weights, where the probability mass is redistributed to paths reaching the goal. Both variants constitute useful approaches to the analysis of systems without guarantees on the occurrence of an event of interest (reaching a goal state), but have only been studied in structures with non-negative weights. Our main results are as follows. There are polynomial-time algorithms to check the finiteness of the supremum of the partial or conditional expectations in MDPs with arbitrary integer weights. If finite, then optimal weight-based deterministic schedulers exist. In contrast to the setting of non-negative weights, optimal schedulers can need infinite memory and their value can be irrational. However, the optimal value can be approximated up to an absolute error of ϵ in time exponential in the size of the MDP and polynomial in (1/ϵ).

### 1 Introduction

Stochastic shortest path (SSP) problems generalize the shortest path problem on graphs with weighted edges. The SSP problem is formalized using finite-state Markov decision processes (MDPs), which are a prominent model combining probabilistic and nondeterministic choices. In each state of an MDP, one is allowed to choose nondeterministically from a set of actions, each of which is augmented with a probability distribution over the successor states and a weight (cost or reward). The SSP problem asks for a policy to choose actions (here called a scheduler) maximizing or minimizing the expected accumulated weight until reaching a goal state. In the classical setting, one seeks an optimal proper scheduler, where proper means that a goal state is reached almost surely. Polynomial-time solutions exist, exploiting the fact that optimal memoryless deterministic schedulers exist (provided the optimal value is finite) and can be computed using linear programming techniques, possibly in combination with model transformations (see [5, 10, 1]). The restriction to proper schedulers, however, is often too restrictive. First, there are models that have no proper scheduler. Second, even if proper schedulers exist, the expectation of the accumulated weight of schedulers missing the goal with a positive probability should be taken into account as well. Important applications of this kind include the semantics of probabilistic programs (see e.g. [12, 14, 4, 7, 16]), where no guarantee for almost sure termination can be given and the analysis of program properties at termination time gives rise to stochastic shortest (longest) path problems in which the goal (halting configuration) is not reached almost surely. Other examples are fault-tolerance analysis (e.g., expected costs of repair mechanisms) in selected error scenarios that can appear with some positive, but small probability, or trade-off analysis with conjunctions of utility and cost constraints that are achievable with positive probability, but not almost surely (see e.g. [2]).

This motivates the switch to variants of classical SSP problems where the restriction to proper schedulers is relaxed. One option (e.g., considered in [8]) is to seek a scheduler optimizing the expectation of the random variable that assigns weight 0 to all paths not reaching the goal and the accumulated weight of the shortest prefix reaching the goal to all other paths. We refer to this expectation as partial expectation. Second, we consider the conditional expectation of the accumulated weight until reaching the goal under the condition that the goal is reached. In general, partial expectations describe situations in which some reward (positive or negative) is accumulated but only retrieved if a certain goal is met. In particular, partial expectations can be an appropriate replacement for the classical expected weight before reaching the goal if we want to include schedulers which miss the goal with some (possibly very small) probability. In contrast to conditional expectations, the resulting scheduler still has an incentive to reach the goal with a high probability, while schedulers maximizing the conditional expectation might reach the goal with a very small positive probability.

Previous work on partial or conditional expected accumulated weights was restricted to the case of non-negative weights. More precisely, partial expectations have been studied in the setting of stochastic multiplayer games with non-negative weights [8]. Conditional expectations in MDPs with non-negative weights have been addressed in [3]. In both cases, optimal values are achieved by weight-based deterministic schedulers that depend on the current state and the weight that has been accumulated so far, while memoryless schedulers are not sufficient. Both [8] and [3] prove the existence of a saturation point for the accumulated weight from which on optimal schedulers behave memorylessly and maximize the probability to reach a goal state. This yields exponential-time algorithms for computing optimal schedulers using an iterative linear programming approach. Moreover, [3] proves that the threshold problem for conditional expectations (“does there exist a scheduler such that the conditional expectation under this scheduler exceeds a given threshold?”) is PSPACE-hard even for acyclic MDPs.

The purpose of the paper is to study partial and conditional expected accumulated weights for MDPs with integer weights. The switch from non-negative to integer weights indeed causes several additional difficulties. We start with the following observation. While optimal partial or conditional expectations in non-negative MDPs are rational, they can be irrational in the general setting:

###### Example 1.

Consider the MDP depicted on the left in Figure 1. In the initial state , two actions are enabled. Action leads to with probability and weight . Action leads to the states and with probability from where we will return to with weight or , respectively. The scheduler choosing immediately leads to an expected weight of and is optimal among schedulers reaching the goal almost surely. As long as we choose in , the accumulated weight follows an asymmetric random walk increasing by or decreasing by with probability before we return to . It is well known that the probability to ever reach accumulated weight in this random walk is where is the golden ratio. Likewise, ever reaching accumulated weight has probability for all . Consider the scheduler choosing as soon as the accumulated weight reaches in . Its partial expectation is as the paths which never reach weight are assigned weight . The maximum is reached at . In Section 4, we prove that there are optimal schedulers whose decisions only depend on the current state and the weight accumulated so far. With this result we can conclude that the maximal partial expectation is indeed , an irrational number.

The conditional expectation of in is as reaches the goal with accumulated weight if it reaches the goal. So, the conditional expectation is not bounded. If we add a new initial state making sure that the goal is reached with positive probability as in the MDP , we can obtain an irrational maximal conditional expectation as well: The scheduler choosing in as soon as the weight reaches has conditional expectation . The maximum is obtained for ; the maximal conditional expectation is .
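The concrete numbers of this example did not survive in this copy of the text, so the following is only a sketch under a stated assumption: the random walk moves up by 1 or down by 2, each with probability 1/2, which is one walk whose hitting probabilities involve the golden ratio. Under that assumption, the probability `p` of ever gaining +1 is the least solution of p = 1/2 + 1/2·p³, and a scheduler that stops at accumulated weight `k` has partial expectation k·pᵏ. The function names are hypothetical.

```python
import math

def hitting_probability(iterations=200):
    """Least fixed point of p = 1/2 + 1/2 * p**3, approached from below.

    Assumption (not taken from the text): steps are +1 and -2, each with
    probability 1/2. After a -2 step the walk must climb three levels,
    each independently with probability p, hence the p**3 term.
    """
    p = 0.0
    for _ in range(iterations):
        p = 0.5 + 0.5 * p ** 3
    return p

def partial_expectation(k, p):
    """Stop at accumulated weight k: weight k with probability p**k, else 0."""
    return k * p ** k

p = hitting_probability()
phi = (1 + math.sqrt(5)) / 2
best_k = max(range(1, 50), key=lambda k: partial_expectation(k, p))
print(p, 1 / phi)                       # both approximately 0.618
print(best_k, partial_expectation(best_k, p))
```

Under the assumed step sizes, the search over stopping weights illustrates how a partial expectation of the form k·pᵏ first rises and then decays, which is why an intermediate stopping weight is optimal.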

Moreover, while the proposed algorithms of [8, 3] crucially rely on the monotonicity of the accumulated weights along the prefixes of paths, the accumulated weights of prefixes of paths can oscillate when there are positive and negative weights. As we will see later, this implies that the existence of saturation points is no longer ensured and optimal schedulers might require infinite memory (more precisely, a counter for the accumulated weight). These observations indicate why linear-programming techniques as used in the case of non-negative MDPs [8, 3] cannot be expected to be applicable in the general setting.

Contributions. We study the problem of maximizing the partial and conditional expected accumulated weight in MDPs with integer weights. Our first result is that the finiteness of the supremum of partial and conditional expectations in MDPs with integer weights can be checked in polynomial time (Section 3). For both variants we show that there are optimal weight-based deterministic schedulers if the supremum is finite (Section 4). Although the suprema might be irrational and optimal schedulers might need infinite memory, the suprema can be ϵ-approximated in time exponential in the size of the MDP and polynomial in (1/ϵ) (Section 5). By duality of maximal and minimal expectations, analogous results hold for the problem of minimizing the partial or conditional expected accumulated weight. (Note that we can multiply all weights by −1 and then apply the results for maximal partial resp. conditional expectations.)

Related work. Closest to our contribution is the above-mentioned work on partial expected accumulated weights in stochastic multiplayer games with non-negative weights in [8] and on computation schemes for maximal conditional expected accumulated weights in non-negative MDPs [3]. Conditional expected termination time in probabilistic push-down automata has been studied in [11], which can be seen as analogous considerations for a class of infinite-state Markov chains with non-negative weights. The recent work on notions of conditional value at risk in MDPs [15] also studies conditional expectations, but the considered random variables are limit averages and a notion of (non-accumulated) weight-bounded reachability.

### 2 Preliminaries

We give basic definitions and present our notation. More details can be found in textbooks, e.g. [17].

Notations for Markov decision processes. A Markov decision process (MDP) is a tuple where is a finite set of states, a finite set of actions, the initial state, is the transition probability function and the weight function. We require that for all . We write for the set of actions that are enabled in , i.e., iff . We assume that is non-empty for all and that all states are reachable from . We call a state absorbing if the only enabled action leads to the state itself with probability and weight . The paths of are finite or infinite sequences where states and actions alternate such that for all . If is finite, then denotes the accumulated weight of , its probability, and its last state. The size of , denoted , is the sum of the number of states plus the total sum of the logarithmic lengths of the non-zero probability values as fractions of co-prime integers and the weight values .

Scheduler. A (history-dependent, randomized) scheduler for is a function that assigns to each finite path a probability distribution over . is called memoryless if for all finite paths , with , in which case can be viewed as a function that assigns to each state a distribution over . is called deterministic if is a Dirac distribution for each path , in which case can be viewed as a function that assigns an action to each finite path . Scheduler is said to be weight-based if for all finite paths , with and . Thus, deterministic weight-based schedulers can be viewed as functions that assign actions to state-weight-pairs. By we denote the class of all schedulers, by the class of weight-based schedulers, by the class of weight-based, deterministic schedulers, and by the class of memoryless deterministic schedulers. Given a scheduler , is a -path iff is a path and for all .

Probability measure. We write or briefly to denote the probability measure induced by and . For details, see [17]. We will use LTL-like formulas to denote measurable sets of paths and also write to describe the set of infinite paths having a prefix with for and . Given a measurable set of infinite paths, we define and where ranges over all schedulers for . Throughout the paper, we suppose that the given MDP has a designated state . Then, and denote the maximal resp. minimal probability of reaching from . That is, and . Let , and .

Mean payoff. A well-known measure for the long-run behavior of a scheduler in an MDP is the mean payoff. Intuitively, the mean payoff is the amount of weight accumulated per step on average in the long run. Formally, we define the mean payoff as the following random variable on infinite paths : . The mean payoff of the scheduler starting in is then defined as the expected value . The maximal mean payoff is the supremum over all schedulers which is equal to the maximum over all -schedulers: . In strongly connected MDPs, the maximal mean payoff does not depend on the initial state.
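As a rough illustration of the intuition "weight accumulated per step on average in the long run", the sketch below computes the expected average weight over a finite number of steps in a two-state Markov chain, i.e. an MDP under one fixed memoryless scheduler. The chain, its weights, and the function name are hypothetical toy choices; the formal definition above is not reproduced here.

```python
def expected_average_weight(P, w, start, steps):
    """Expected accumulated weight after `steps` steps, divided by `steps`.

    P[s][t] is the transition probability from s to t and w[s] the weight
    collected when leaving s (toy values, not taken from the paper).
    """
    n = len(P)
    dist = [1.0 if s == start else 0.0 for s in range(n)]
    total = 0.0
    for _ in range(steps):
        total += sum(dist[s] * w[s] for s in range(n))
        dist = [sum(dist[s] * P[s][t] for s in range(n)) for t in range(n)]
    return total / steps

P = [[0.5, 0.5],
     [0.25, 0.75]]
w = [3.0, -3.0]

# The stationary distribution of this chain is (1/3, 2/3), so the long-run
# average weight per step is 3 * 1/3 - 3 * 2/3 = -1.
for steps in (10, 100, 10_000):
    print(steps, expected_average_weight(P, w, start=0, steps=steps))
```

A negative value such as this one is the situation assumed for end components in the later sections: in expectation, staying inside the component drives the accumulated weight downwards.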

End components, MEC-quotient. An end component of is a strongly connected sub-MDP. End components can be formalized as pairs where is a nonempty subset of and a function that assigns to each state a nonempty subset of such that the graph induced by is strongly connected. is called maximal if there is no end component with , and for all . The MEC-quotient of an MDP is the MDP arising from by collapsing all states that belong to the same maximal end component to a state . All actions enabled in some state in not belonging to are enabled in . Details and the formal construction can be found in [9]. We call an end component positively weight-divergent if there is a scheduler for such that for all and . In [1], it is shown that the existence of positively weight-divergent end components can be decided in polynomial time.

### 3 Partial and Conditional Expectations in MDPs

We define partial and conditional expectations in MDPs. We extend the definition of [8] by introducing partial expectations with bias which are closely related to conditional expectations. Afterwards, we sketch the computation of maximal partial expectations in MDPs with non-negative weights and in Markov chains.

Partial and conditional expectation. In the sequel, let be an MDP with a designated absorbing goal state . Furthermore, we collapse all states from which is not reachable to one absorbing state . Let . We define the random variable on infinite paths by

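$$\oplus^{b}\mathit{goal}(\zeta) = \begin{cases} \mathit{wgt}(\zeta) + b & \text{if } \zeta \vDash \Diamond \mathit{goal}, \\ 0 & \text{if } \zeta \nvDash \Diamond \mathit{goal}. \end{cases}$$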

We call the expectation of this random variable under a scheduler the partial expectation with bias of and write as well as . If , we sometimes drop the argument ; if is clear from the context, we drop the subscript. In order to maximize the partial expectation, intuitively one has to find the right balance between reaching with high probability and accumulating a high positive amount of weight before reaching . The bias can be used to shift this balance by additionally rewarding or penalizing a high probability to reach .

The conditional expectation of is defined as the expectation of under the condition that is reached. It is defined if . We write and where the supremum is taken over all schedulers with . We can express the conditional expectation as . The following proposition establishes a close connection between conditional expectations and partial expectations with bias.

###### Proposition 2.

Let be an MDP, a scheduler with , , and . Then we have . Further, if , then .

###### Proof.

The first claim follows from . The second claim follows by quantification over all schedulers. ∎
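The relation between the two notions can be checked on a toy example. The sketch below assumes a hypothetical outcome distribution for one fixed scheduler and uses only the definitions above: the partial expectation with bias adds the bias on goal-reaching paths, and the conditional expectation is the partial expectation divided by the probability of reaching the goal. All names and numbers are illustrative.

```python
from fractions import Fraction as F

# Hypothetical outcome distribution of one fixed scheduler: each entry is
# (probability, accumulated weight on reaching goal); the remaining
# probability mass belongs to paths that never reach goal.
outcomes = [(F(1, 2), 4), (F(1, 4), -2)]

def reach_probability(outcomes):
    return sum(p for p, _ in outcomes)

def partial_expectation(outcomes, bias=0):
    """Expectation of (weight + bias) on goal-reaching paths, 0 elsewhere."""
    return sum(p * (w + bias) for p, w in outcomes)

def conditional_expectation(outcomes):
    """Partial expectation divided by the probability of reaching goal."""
    return partial_expectation(outcomes) / reach_probability(outcomes)

ce = conditional_expectation(outcomes)   # (2 - 1/2) / (3/4) = 2
for theta in (F(1), F(2), F(3)):
    pe_shifted = partial_expectation(outcomes, bias=-theta)
    # The sign of the partial expectation with bias -theta matches the
    # sign of the conditional expectation minus theta.
    print(theta, pe_shifted, ce - theta)
```

Comparing the conditional expectation with a threshold thus reduces to comparing a biased partial expectation with 0, as long as the goal is reached with positive probability.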

In [3], it is shown that deciding whether for and is PSPACE-hard even for acyclic MDPs. We conclude:

###### Corollary 3.

Given an MDP , , and , deciding whether is PSPACE-hard.

Finiteness. We present criteria for the finiteness of and . Detailed proofs can be found in Appendix 0.A.1. By slightly modifying the construction from [1] which removes end components only containing -weight cycles, we obtain the following result.

###### Proposition 4.

Let be an MDP which does not contain positively weight-divergent end components and let . Then there is a polynomial time transformation to an MDP containing all states from and possibly an additional absorbing state such that

• all end components of have negative maximal expected mean payoff,

• for any scheduler for there is a scheduler for with and for any state in , and vice versa.

Hence, if there are no positively weight-divergent end components, we can restrict ourselves to MDPs in which all end components have negative maximal expected mean payoff. The following result is now analogous to the result in [1] for the classical SSP problem.

###### Proposition 5.

Let be an MDP and arbitrary. The optimal partial expectation is finite if and only if there are no positively weight-divergent end components in .

To obtain an analogous result for conditional expectations, we observe that the finiteness of the maximal partial expectation is necessary for the finiteness of the maximal conditional expectation. However, this is not sufficient. In [3], a critical scheduler is defined as a scheduler for which there is a path containing a positive cycle and for which . Given a critical scheduler, it is easy to construct a sequence of schedulers with unbounded conditional expectation (see Appendix 0.A.1 and [3]). On the other hand, if , then is finite if and only if is finite. We will show how we can restrict ourselves to this case if there are no critical schedulers:

So, let be an MDP with and suppose there are no critical schedulers for . Let be the set of all states reachable from while only choosing actions in . As there are no critical schedulers, does not contain positive cycles. So, there is a finite maximal weight among paths leading from to in . Consider the following MDP : It contains the MDP and a new initial state . For each and each , also contains a new state which is reachable from via an action with weight and probability . In , only action with the same probability distribution over successors and the same weight as in is enabled. So in , one has to decide immediately in which state to leave and one accumulates the maximal weight which can be accumulated in to reach this state in . In this way, we ensure that .

###### Proposition 6.

The constructed MDP satisfies .

We can rely on this reduction to an MDP in which is reached with positive probability for -approximations and the exact computation of the optimal conditional expectation. In particular, the values for are easy to compute by classical shortest path algorithms on weighted graphs. Furthermore, we can now decide the finiteness of the maximal conditional expectation.
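The maximal path weights used in this construction can be obtained with a standard shortest-path routine. The following is a minimal sketch, assuming the relevant part of the MDP has already been turned into a weighted directed graph without reachable positive cycles; the graph encoding and the function name are hypothetical. It is Bellman-Ford with the comparison flipped, so that it maximizes instead of minimizes.

```python
def max_path_weights(num_nodes, edges, source):
    """Maximal accumulated weight of a path from `source` to each node.

    Bellman-Ford with the comparison flipped (longest instead of shortest
    paths). Well-defined only if no positive cycle is reachable, which is
    the situation described above. Unreachable nodes get -inf.
    """
    neg_inf = float("-inf")
    best = [neg_inf] * num_nodes
    best[source] = 0
    for _ in range(num_nodes - 1):
        changed = False
        for u, v, w in edges:
            if best[u] != neg_inf and best[u] + w > best[v]:
                best[v] = best[u] + w
                changed = True
        if not changed:
            break
    # A further improvement would mean a reachable positive cycle.
    for u, v, w in edges:
        if best[u] != neg_inf and best[u] + w > best[v]:
            raise ValueError("positive cycle reachable from source")
    return best

# Hypothetical graph: nodes 0..2, edges as (from, to, weight).
edges = [(0, 1, 2), (1, 2, -1), (0, 2, 0), (2, 1, -3)]
print(max_path_weights(3, edges, source=0))
```

The final pass doubles as a check of the precondition: if it raises, the graph contains a positive cycle and no finite maximal weight exists.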

###### Proposition 7.

For an arbitrary MDP , is finite if and only if there are no positively weight-divergent end components and no critical schedulers.

Partial and conditional expectations in Markov chains. Markov chains with integer weights can be seen as MDPs with only one action enabled in every state. Consequently, there is only one scheduler for a Markov chain. Hence, we drop the superscripts in and .

###### Proposition 8.

The partial and conditional expectation in a Markov chain are computable in polynomial time.

###### Proof.

Let be the only action available in . Assume that all states from which is not reachable have been collapsed to an absorbing state . Then is the value of in the unique solution to the following system of linear equations with one variable for each state :

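$$x_{\mathit{goal}} = x_{\mathit{fail}} = 0, \qquad x_s = \mathit{wgt}(s,\alpha)\cdot p_s + \sum_{t} P(s,\alpha,t)\cdot x_t \quad \text{for } s \in S \setminus \{\mathit{goal}, \mathit{fail}\}.$$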

The existence of a unique solution follows from the fact that and are the only end components (see [17]). It is straight-forward to check that is this unique solution. The conditional expectation is obtained from the partial expectation by dividing by the probability to reach the goal. ∎
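The linear system from the proof can be solved exactly on a small example. The sketch below is an illustration, not the authors' implementation: it reads p_s as the probability of reaching the goal from s (computed by a second linear system), and the chain, its weights, and all function names are hypothetical.

```python
from fractions import Fraction as F

def solve(A, b):
    """Gauss-Jordan elimination over exact fractions (A assumed invertible)."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [v - factor * u for v, u in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]

def markov_chain_expectations(P, wgt, transient):
    """Return reachability probabilities p and partial expectations x."""
    zero = F(0)
    # Matrix (I - P) restricted to the states other than goal and fail.
    A = [[(F(1) if s == t else zero) - P[s].get(t, zero) for t in transient]
         for s in transient]
    # p_s = P(s, goal) + sum_t P(s, t) * p_t
    p = dict(zip(transient, solve(A, [P[s].get("goal", zero) for s in transient])))
    # x_s = wgt(s) * p_s + sum_t P(s, t) * x_t, with x_goal = x_fail = 0
    x = dict(zip(transient, solve(A, [wgt[s] * p[s] for s in transient])))
    return p, x

# Toy chain: s0 moves to s1 or fail; s1 moves to goal or back to s0.
P = {"s0": {"s1": F(1, 2), "fail": F(1, 2)},
     "s1": {"goal": F(1, 2), "s0": F(1, 2)}}
wgt = {"s0": 1, "s1": -3}
p, x = markov_chain_expectations(P, wgt, ["s0", "s1"])
print(p["s0"], x["s0"], x["s0"] / p["s0"])  # reach prob., partial, conditional
```

Exact rational arithmetic is used because the values are rational in Markov chains; the last printed value is the conditional expectation, obtained by dividing the partial expectation by the probability of reaching the goal.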

This result can be seen as a special case of the following result. Restricting ourselves to schedulers which reach the goal with maximal or minimal probability in an MDP without positively weight-divergent end components, linear programming allows us to compute the following two memoryless deterministic schedulers (see [8, 3]).

###### Proposition 9.

Let be an MDP without positively weight-divergent end components. There is a scheduler such that for each we have and where the supremum is taken over all schedulers with . Similarly, there is a scheduler maximizing the partial expectation among all schedulers reaching the goal with minimal probability. Both these schedulers and their partial expectations are computable in polynomial time.

These schedulers will play a crucial role for the approximation of the maximal partial expectation and the exact computation of maximal partial expectations in MDPs with non-negative weights.

Partial expectations in MDPs with non-negative weights. In [8], the computation of maximal partial expectations in stochastic multiplayer games with non-negative weights is presented. We adapt this approach to MDPs with non-negative weights. A key result is the existence of a saturation point, a bound on the accumulated weight above which optimal schedulers do not need memory.

In the sequel, let be arbitrary, let be an MDP with non-negative weights, , and assume that end components have negative maximal mean payoff (see Proposition 4). A saturation point for bias is a natural number such that there is a scheduler with which is memoryless and deterministic as soon as the accumulated weight reaches . I.e. for any two paths and , with and , .

Transferring the idea behind the saturation point for conditional expectations given in [3], we provide the following saturation point, which can be considerably smaller than the saturation point given in [8] for stochastic multiplayer games. Detailed proofs for this section are given in Appendix 0.A.2.

###### Proposition 10.

We define and . Then,

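$$p_R := \sup\left\{ \frac{PE^{Max}_{s,\alpha} - PE^{Max}_{s}}{p^{\max}_{s} - p^{\max}_{s,\alpha}} \;\middle|\; s \in S,\ \alpha \in Act(s) \setminus Act^{\max}(s) \right\} - R$$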

is an upper saturation point for bias in .

The saturation point is chosen such that, as soon as the accumulated weight exceeds , the scheduler is better than any scheduler deviating from for only one step. So, the proposition states that is then also better than any other scheduler.

As all values involved in the computation can be determined by linear programming, the saturation point is computable in polynomial time. This also means that the logarithmic length of is polynomial in the size of and hence itself is at most exponential in the size of .

###### Proposition 11.

Let and let be the least integer greater or equal to and let . The values form the unique solution to the following linear program in the variables (r ranges over integers):

Minimize under the following constraints:

For $r \geq p_R$: $x_{s,r} = p^{\max}_s \cdot (r + R) + E^{Max}_s$, for r

From a solution to the linear program, we can easily extract an optimal weight-based deterministic scheduler. This scheduler only needs finite memory because the accumulated weight increases monotonically along paths and as soon as the saturation point is reached provides the optimal decisions. As is exponential in the size of , the computation of the optimal partial expectation via this linear program runs in time exponential in the size of .

### 4 Existence of Optimal Schedulers

We prove that there are optimal weight-based deterministic schedulers for partial and conditional expectations. After showing that, if finite, is equal to , we take an analytic approach to show that there is indeed a weight-based deterministic scheduler maximizing the partial expectation. We define a metric on turning it into a compact space. Then, we prove that the function assigning the partial expectation to schedulers is upper semi-continuous. We conclude that there is a weight-based deterministic scheduler obtaining the maximum. Proofs to this section can be found in Appendix 0.B.

###### Proposition 12.

Let be an MDP with . Then we have .

###### Proof sketch.

We can assume that all end components have negative maximal expected mean payoff (see Proposition 4). Given a scheduler , we take the expected number of times that is visited with accumulated weight under for each state-weight pair , and the expected number of times that then chooses . These values are finite due to the negative maximal mean payoff in end components. We define the scheduler choosing in with probability when weight has been accumulated. Then, we show by standard arguments that we can replace all probability distributions that chooses by Dirac distributions to obtain a scheduler such that . ∎

It remains to show that the supremum is obtained by a weight-based deterministic scheduler. Given an MDP with arbitrary integer weights, we define the following metric on the set of weight-based deterministic schedulers, i.e. on the set of functions from : For two such schedulers and , we let where is the greatest natural number such that or if there is no greatest such natural number.

###### Lemma 13.

The metric space is compact.

Having defined this compact space of schedulers, we can rely on the analytic notion of upper semi-continuity.

###### Lemma 14 (Upper Semi-Continuity of Partial Expectations).

If is finite in , then the function assigning to a weight-based deterministic scheduler is upper semi-continuous.

The technical proof of this lemma can be found in Appendix 0.B. We arrive at the main result of this section.

###### Theorem 15 (Existence of Optimal Schedulers for Partial Expectations).

If is finite in an MDP , then there is a weight-based deterministic scheduler with .

###### Proof.

If is finite, then the map is upper semi-continuous. So, this map has a maximum because is a compact metric space. ∎

###### Corollary 16 (Existence of Optimal Schedulers for Conditional Expectations).

If is finite in an MDP , then there is a weight-based deterministic scheduler with .

###### Proof.

By Proposition 6, we can assume that . We know that and that there is a weight-based deterministic scheduler with . By Proposition 2, maximizes the conditional expectation as it reaches with positive probability. ∎

In MDPs with non-negative weights, the optimal decision in a state only depends on as soon as the accumulated weight exceeds a saturation point. In MDPs with arbitrary integer weights, it is possible that the optimal choice of action does not become stable for increasing values of accumulated weight as we see in the following example.

###### Example 17.

Let us first consider the MDP depicted in Figure 2. Let be a path reaching for the first time with accumulated weight . Consider a scheduler which chooses for the first times and then . In this situation, the partial expectation from this point on is:

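$$\frac{1}{2^{k+1}}(r-k) + \sum_{i=1}^{k} \frac{1}{2^{i}}(r-i) \;=\; \frac{1}{2^{k+1}} + \sum_{i=1}^{k+1} \frac{1}{2^{i}}(r-i) \;=\; \frac{k-r+4}{2^{k+1}} + r - 2.$$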

For , this partial expectation has its unique maximum for the choice . This already shows that an optimal scheduler needs infinite memory. No matter how much weight has been accumulated when reaching , the optimal scheduler has to count the times it chooses .
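The closed form in the displayed equation can be sanity-checked with exact arithmetic. The sketch below only verifies that the first and the last expression of the equation agree for small values of r and k; the function names are hypothetical.

```python
from fractions import Fraction as F

def direct(r, k):
    """First expression: (r - k) / 2^(k+1) plus the sum over i = 1..k."""
    return F(r - k, 2 ** (k + 1)) + sum(F(r - i, 2 ** i) for i in range(1, k + 1))

def closed_form(r, k):
    """Last expression: (k - r + 4) / 2^(k+1) + r - 2."""
    return F(k - r + 4, 2 ** (k + 1)) + r - 2

# Compare both expressions on a small grid of accumulated weights r and
# numbers of repetitions k.
for r in range(0, 8):
    for k in range(0, 12):
        assert direct(r, k) == closed_form(r, k)
print(direct(5, 3), closed_form(5, 3))
```

Since only the term (k − r + 4)/2^(k+1) depends on k, comparing schedulers for a fixed r amounts to comparing this single fraction.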

Furthermore, we can transfer the optimal scheduler for the MDP to the MDP . In state , we have to make a nondeterministic choice between two actions leading to the states and , respectively. In both of these states, action is enabled which behaves like the same action in the MDP except that it moves between the two states if is not reached. So, the action is only enabled every other step. As in , we want to choose after choosing times if we arrived in with accumulated weight . So, the choice in depends on the parity of : For or even, we choose . For odd , we choose . This shows that the optimal scheduler in the MDP needs specific information about the accumulated weight, in this case the parity, no matter how much weight has been accumulated.

In the example, the optimal scheduler has a periodic behavior when fixing a state and looking at optimal decisions for increasing values of accumulated weight. The question whether an optimal scheduler always has such a periodic behavior remains open.

### 5 Approximation

As the optimal values for partial and conditional expectation can be irrational, there is no hope to compute these values by linear programming as in the case of non-negative weights. In this section, we show how we can nevertheless approximate the values. The main result is the following.

###### Theorem 18.

Let be an MDP with and . The maximal partial expectation can be approximated up to an absolute error of in time exponential in the size of and polynomial in . If further, , also can be approximated up to an absolute error of in time exponential in the size of and polynomial in .

We first prove that upper bounds for and can be computed in polynomial time. Then, we show that there are -optimal schedulers for the partial expectation which become memoryless as soon as the accumulated weight leaves a sufficiently large weight window around . We compute the optimal partial expectation of such a scheduler by linear programming. The result can then be extended to conditional expectations.

Upper Bounds. Let be an MDP in which all end components have negative maximal mean payoff. Let be the minimal non-zero transition probability in and . Moving through the MEC-quotient, the probability to reach an accumulated weight of is bounded by as or is reached within steps with probability at least . It remains to show similar bounds inside an end component.

We will use the characterization of the maximal mean payoff in terms of super-harmonic vectors due to Hordijk and Kallenberg [13] to define a supermartingale controlling the growth of the accumulated weight in an end component under any scheduler. As the value vector for the maximal mean payoff in an end component is constant and negative in our case, the results of [13] yield:

###### Proposition 19 (Hordijk, Kallenberg).

Let be an end component with maximal mean payoff for some . Then there is a vector such that .

Furthermore, let be the vector (-t,…,-t) in . Then, is the solution to a linear program with variables, inequalities, and coefficients formed from the transition probabilities and weights in .

We will call the vector a super-potential because the expected accumulated weight after steps is at most when starting in state . Let be a scheduler for starting in some state . We define the following random variables on -runs in : let be the state after steps, let be the action chosen after steps, let be the accumulated weight after steps, and let be the history, i.e. the finite path after steps.

###### Lemma 20.

The sequence satisfies for all . (This means that is a super-martingale with respect to the history .)

###### Proof.

By Proposition 19, . ∎

We are going to apply the following theorem by Blackwell [6].

###### Theorem 21 (Blackwell [6]).

Let be random variables, and let . Assume that for all and that there is a such that . Then, .

We denote by . Observe that . We can rescale the sequence by defining . This ensures that , and for all . In this way, we arrive at the following conclusion, putting .

###### Corollary 22.

For any scheduler and any starting state in , we have .

###### Proof.

By Theorem 21,

Let be the set of maximal end components in . For each , let and be as in Corollary 22. Define , and . Then an accumulated weight of cannot be reached with a probability greater than because reaching accumulated weight would require reaching weight in some end component or reaching weight in the MEC-quotient and is a lower bound on the probability that none of this happens (under any scheduler).

###### Proposition 23.

Let be an MDP with . There is an upper bound for the partial expectation in computable in polynomial time.

###### Proof.

In any end component , the maximal mean payoff and the super-potential are computable in polynomial time. Hence, and , and in turn also and are also computable in polynomial time. When we reach accumulated weight for the first time, the actual accumulated weight is at most . So, we conclude that