# Multi-weighted Markov Decision Processes with Reachability Objectives

In this paper, we are interested in the synthesis of schedulers in double-weighted Markov decision processes that satisfy both a percentile constraint over a weighted reachability condition, and a quantitative constraint on the expected value of a random variable defined using a weighted reachability condition. This problem is inspired by the modelling of an electric-vehicle charging problem. We study the cartography of the problem, when one parameter varies, and show how a partial cartography can be obtained via two sequences of optimization problems. We discuss the completeness and feasibility of the method.

## 1 Introduction

##### Importing formal methods in connected fields.

Formal methods can help provide algorithmic solutions for control design. The electric-vehicle (EV) charging problem is an example of such an application area. This problem, usually presented as a control problem (see e.g. [5]), can actually be modelled using Markov decision processes (MDPs) [12, 16, 14]. Probabilities provide a way to model the non-flexible part of the energy network (consumption outside EV, for which large databases exist and from which precise statistics can be extracted); we can then express an upper bound on the peak load as a safety condition (encoded as a reachability condition in our finite-horizon model), the constraint on the charging of all vehicles as a quantitative reachability objective, and various optimization criteria (e.g. minimizing the ageing of distribution transformers, or the energy price) as optimization of random cost variables.

Due to the specific form of the constructed model (basically, its acyclicity), an ad hoc method could be implemented using the tool PRISM, yielding interesting practical results, as reported in [14] and in a forthcoming paper. However, the computability of an optimal strategy in a general MDP, as well as the corresponding decision problem, had remained unexplored.

##### Markov decision processes.

MDPs have been studied for a long time [18, 13]. An MDP is a finite-state machine on which a kind of game is played as follows. In each state, several decisions (a.k.a. actions) are available, each yielding a distribution over possible successor states. Once an action is selected, the next state is chosen probabilistically, following the distribution corresponding to the selected action. The game proceeds that way ad infinitum, generating an infinite play. Actions are chosen according to a strategy (also called a policy in the context of MDPs). Rewards and/or costs can be associated with each action or edge, and various rules for aggregating the individual rewards and costs encountered along a play can be applied to obtain various payoff functions. Examples of payoff functions include the following (a small code sketch after the list illustrates them):

• sum up all the encountered costs (or rewards) along a play, until reaching some target (finite if the target is reached, infinite otherwise)—this is the so-called truncated-sum payoff;

• sum up all the encountered costs (or rewards) along a play with a discount factor at each step—this is the so-called discounted-sum payoff;

• average over the encountered costs (or rewards) along a play—this is the so-called mean-payoff.
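To make the three aggregations concrete, here is a minimal sketch (names and representation are ours, not taken from the paper) evaluating them on a finite prefix of a play, with the target assumed to be hit at a known index:

```python
from typing import List, Optional

def truncated_sum(weights: List[float], target_hit: Optional[int]) -> float:
    """Sum of the weights up to the first visit to the target; infinite otherwise."""
    if target_hit is None:
        return float("inf")
    return sum(weights[:target_hit])

def discounted_sum(weights: List[float], discount: float) -> float:
    """Sum of the weights, the i-th one scaled by discount**i."""
    return sum(w * discount**i for i, w in enumerate(weights))

def mean_payoff(weights: List[float]) -> float:
    """Average of the weights along a (finite prefix of a) play; on infinite
    plays, the mean-payoff is defined as a limit of such averages."""
    return sum(weights) / len(weights)

w = [2.0, -1.0, 3.0, 0.0]                # weights along a play prefix
print(truncated_sum(w, target_hit=3))    # 4.0: target entered at step 3
print(discounted_sum(w, discount=0.5))   # 2.0 - 0.5 + 0.75 + 0.0 = 2.25
print(mean_payoff(w))                    # 1.0
```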

Those payoff functions have been extensively studied in the literature; discounted-sum and mean-payoff have been shown to admit optimal memoryless and deterministic strategies, which can be computed using linear programming, yielding a polynomial-time algorithm. Alternative methods, such as value iteration or policy improvement, can be used in practice. On the other hand, the shortest-path problem, which aims at minimizing the truncated sum, has been fully understood only recently [2]: one can decide in polynomial time whether the shortest path is finite, whether it is equal to $+\infty$ (if one cannot ensure reaching the target almost-surely), or whether it can be made smaller than any arbitrarily small number (if a negative loop can be enforced; note that in the context of stochastic systems, such a statement may be misleading, but it corresponds to a rough intuition), and corresponding strategies can be computed.

##### Multi-constrained problems in MDPs.

The paradigm of multi-constrained objectives in stochastic systems in general, and in MDPs in particular, has arisen recently. It makes it possible to express various (quantitative or qualitative) properties over the model, and to synthesize strategies accordingly. This new field of research is very rich and ambitious, with various types of objective combinations (see for instance [3, 19] for recent overviews). For recent developments on MDPs, one can cite:

• Pareto curves, or percentile queries, of multiple quantitative objectives: given several payoff functions, evaluate which tradeoff can be made between the probabilities, or the expectations, of the various payoff functions. In [9, 7, 10], solutions based on linear programming are provided for mean-payoff objectives. The percentile-query problem for various quantitative payoff functions is studied in [20].

• probability of conjunctions of objectives: given several payoff functions, evaluate the probability that all constraints are satisfied. This problem is studied in [15] for reachability (that is, for the truncated-sum payoff function), where a lower bound is proved already for a single payoff function.

• the “beyond worst-case” paradigm: satisfy both a safety constraint on all outcomes, and various performance criteria. Variations of this problem for various payoff functions have been studied in [11, 8, 6].

• conditional expectations [4] and conditional values-at-risk [17], which measure the likelihood of properties under some assumptions on the system, have recently been investigated.

##### Our contributions.

The general multi-constrained problem, arising from the EV-charging problem as modelled in [14], takes as input an MDP with two weights, $w_1$ and $w_2$, and requires the existence (and synthesis) of a strategy ensuring that some (absorbing) target state be reached, with a percentile constraint on the truncated sum of $w_1$ (whose lower bound is parameterized), and an expectation constraint on the truncated sum of $w_2$. The initial EV-charging problem corresponds to the instance of that problem where $w_1$ represents the energy that is used for charging and $w_2$ represents the ageing of the transformer.

As defined above, our problem integrates both the “beyond worst-case” paradigm of [8], and percentile queries as in [10] (mixing probabilities and expectations). While in [10] linear programs are used for solving percentile queries (heavily relying on the fact that mean-payoff objectives are tail objectives), we need different techniques, since the truncated-sum payoff is very much prefix-dependent; actually, the lower bound of [15] immediately applies here as well (even without a constraint on the expectation of $w_2$). We develop here a methodology to describe the cartography of the problem, that is, the set of values of the parameter for which the problem has a solution. Our approach is based on two sequences of optimization problems which, in some cases that we characterize, allow us to obtain the (almost) full picture. We then discuss computability issues.

## 2 Preliminary definitions

Let $S$ be a finite set. We write $\mathsf{Dist}(S)$ for the set of distributions over $S$, that is, the set of functions $d \colon S \to [0,1]$ such that $\sum_{s \in S} d(s) = 1$. A distribution $d$ over $S$ is Dirac if $d(s) = 1$ for some $s \in S$.
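As a small illustration of these two notions (the dictionary representation is our own choice, not the paper's):

```python
from typing import Dict, Hashable

def is_distribution(d: Dict[Hashable, float], tol: float = 1e-9) -> bool:
    """Values in [0, 1] summing to 1 (up to rounding)."""
    return all(0.0 <= p <= 1.0 for p in d.values()) and abs(sum(d.values()) - 1.0) <= tol

def is_dirac(d: Dict[Hashable, float]) -> bool:
    """A Dirac distribution puts probability 1 on a single element."""
    return is_distribution(d) and max(d.values()) == 1.0

assert is_distribution({"s0": 0.5, "s1": 0.5})
assert is_dirac({"goal": 1.0}) and not is_dirac({"s0": 0.5, "s1": 0.5})
```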

### 2.1 Definition of the model

In this paper, we mainly focus on doubly-weighted Markov decision processes, but the technical developments mainly rely on simply-weighted Markov decision processes. We therefore define the setting with an arbitrary number of weights.

Let $k \in \mathbb{N}_{>0}$. A $k$-weighted Markov decision process ($k$w-MDP) is a tuple $\mathcal{M} = (S, s_0, \mathsf{Goal}, E, (w_i)_{1 \le i \le k})$, where:

• $S$ is a finite set of states;

• $s_0 \in S$ is the initial state;

• $\mathsf{Goal} \in S$ is the target state;

• $E \subseteq S \times \mathsf{Dist}(S)$ is a finite set of stochastic edges;

• for each $1 \le i \le k$, the function $w_i \colon S \times S \to \mathbb{Q}$ assigns a rational weight to each transition of the complete graph with state space $S$.

A (finite, infinite) path in $\mathcal{M}$ from $s$ is a (finite, infinite) sequence $\rho = s_0 s_1 s_2 \ldots$ of states such that $s_0 = s$ and, for every $j$, there is an edge $(s_j, d) \in E$ such that $d(s_{j+1}) > 0$. Finite paths are equivalently called histories. We write $\mathsf{Paths}_{\mathcal{M}}(s)$ (resp. $\mathsf{Paths}^\infty_{\mathcal{M}}(s)$) for the set of finite paths (resp. infinite paths) in $\mathcal{M}$ from state $s$. Given a history $h = s_0 s_1 \ldots s_m$ and $1 \le i \le k$, the $i$-accumulated weight of $h$ after $j \le m$ steps is defined as $\mathsf{Acc}^j_{w_i}(h) = \sum_{l=0}^{j-1} w_i(s_l, s_{l+1})$. This notion extends straightforwardly to infinite paths.

A (randomized) strategy in $\mathcal{M}$ is a function $\sigma$ assigning, to every history, a distribution over $E$. A strategy $\sigma$ is said to be pure whenever the distributions it prescribes are Dirac. A path $\rho = s_0 s_1 \ldots$ is an outcome of $\sigma$ whenever, for every strict prefix $h = s_0 \ldots s_j$ of $\rho$, there exists an edge $(s_j, d) \in E$ such that $\sigma(h)(s_j, d) > 0$ and $d(s_{j+1}) > 0$. Basically, the outcomes of a strategy are the paths that are activated by the strategy. We write $\mathsf{Out}^\sigma(s)$ (resp. $\mathsf{Out}^\sigma_\infty(s)$) for the set of finite (resp. infinite) outcomes of $\sigma$ from state $s$.

Given a strategy $\sigma$ and a state $s$, we denote with $\mathbb{P}^\sigma_s$ the probability distribution, according to $\sigma$, over the infinite paths in $\mathcal{M}$, defined in the standard way using cylinders based on finite paths from $s$. If $f$ is a measurable function from $\mathsf{Paths}^\infty_{\mathcal{M}}(s)$ to $\mathbb{R} \cup \{+\infty\}$, we denote by $\mathbb{E}^\sigma_s(f)$ the expected value of $f$ w.r.t. the probability distribution $\mathbb{P}^\sigma_s$, that is, $\mathbb{E}^\sigma_s(f) = \int f \,\mathrm{d}\mathbb{P}^\sigma_s$. In all notations, we may omit the superscript $\sigma$ when it is clear from the context, and may omit the starting state $s$ when it is $s_0$, so that $\mathbb{P}^\sigma$ corresponds to $\mathbb{P}^\sigma_{s_0}$.

Consider the 2w-MDP of Figure 1. It has four states and five edges, labelled with their names. Weights label pairs of states (but are represented here only for pairs of states that may be activated). All edges but one have Dirac distributions, while the remaining edge offers a stochastic choice (represented by the small black square). For readability we do not write the exact distributions; in this example, they are assumed to be uniform.
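Since the exact edge names, distributions and weights of Figure 1 cannot be recovered from the text, the following sketch encodes a small hypothetical 2w-MDP in the same spirit and samples outcomes of a randomized strategy; all identifiers and numbers are illustrative assumptions, not the figure's actual data:

```python
import random

# Hypothetical 2w-MDP: each edge maps a name to (source, distribution over successors).
EDGES = {
    "a": ("s0", {"s1": 1.0}),              # Dirac edge
    "b": ("s1", {"s0": 0.5, "goal": 0.5}), # stochastic edge, uniform
    "c": ("s0", {"goal": 1.0}),            # Dirac edge
}
# Each pair of states carries a pair of weights (w1, w2).
W = {("s0", "s1"): (1, -1), ("s1", "s0"): (0, -1),
     ("s1", "goal"): (0, 5), ("s0", "goal"): (0, 2)}

def sample_run(strategy, s="s0", max_steps=10_000):
    """Play `strategy` until goal is reached; return (reached, acc_w1, acc_w2)."""
    acc1 = acc2 = 0
    for _ in range(max_steps):
        if s == "goal":
            return True, acc1, acc2
        src, dist = EDGES[strategy(s)]
        assert src == s, "strategy picked an edge not available in s"
        nxt = random.choices(list(dist), weights=list(dist.values()))[0]
        acc1 += W[(s, nxt)][0]
        acc2 += W[(s, nxt)][1]
        s = nxt
    return False, acc1, acc2

# A memoryless randomized strategy: from s0, pick edge a or c uniformly at random.
strat = lambda s: random.choice(["a", "c"]) if s == "s0" else "b"
runs = [sample_run(strat) for _ in range(50_000)]
print(sum(r for r, _, _ in runs) / len(runs))        # estimates the probability of reaching goal (~1.0)
print(sum(a1 >= 1 for _, a1, _ in runs) / len(runs)) # estimates P(accumulated w1-weight >= 1), here ~0.5
```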

### 2.2 Payoff functions

We are interested in quantitative reachability properties (also called truncated-sum in the literature), which we formalize as follows. Let $\rho = s_0 s_1 s_2 \ldots$ be a path. We use standard LTL-based notations for properties; for instance, for $T \subseteq S$, we write $\rho \models \mathsf{F}\,T$ (resp. $\rho \models \mathsf{F}^I\,T$, when $I$ is an interval of $\mathbb{N}$) when there is $j$ (resp. $j \in I$) such that $s_j \in T$, and $\rho \models \mathsf{G}\,T$ (resp. $\rho \models \mathsf{G}^I\,T$, when $I$ is an interval of $\mathbb{N}$) when $s_j \in T$ for every $j$ (resp. for every $j \in I$). We will often use expressions (in $\mathbb{N}$) such as ${\le}N$ or $(N_0; N]$ for defining intervals of $\mathbb{N}$.

If $1 \le i \le k$ and $\rho = s_0 s_1 \ldots$ is an infinite path, we define the $i$-th payoff function as $\mathsf{TS}^{\mathsf{Goal}}_{w_i}(\rho) = \mathsf{Acc}^j_{w_i}(\rho)$, where $j$ is the least index such that $s_j = \mathsf{Goal}$. If there is no such index, then $\mathsf{TS}^{\mathsf{Goal}}_{w_i}(\rho) = +\infty$. The function $\mathsf{TS}^{\mathsf{Goal}}_{w_i}$ is measurable, hence for every interval $I$ of $\mathbb{R} \cup \{+\infty\}$, the probability $\mathbb{P}^\sigma_s(\mathsf{TS}^{\mathsf{Goal}}_{w_i} \in I)$ (simply written $\mathbb{P}^\sigma_s(\mathsf{TS}^{\mathsf{Goal}}_{w_i} \bowtie \nu)$ when $I$ is given by a constraint ${\bowtie}\,\nu$) and the expectation $\mathbb{E}^\sigma_s(\mathsf{TS}^{\mathsf{Goal}}_{w_i})$ are well-defined. We write $\rho \models \mathsf{TS}^{\mathsf{Goal}}_{w_i} \bowtie \nu$ whenever $\mathsf{TS}^{\mathsf{Goal}}_{w_i}(\rho) \bowtie \nu$.

In the rest of the paper, we assume that $\mathsf{Goal}$ is a sink state, and that there is a single loop on $\mathsf{Goal}$ whose weights are all equal to $0$. This is w.l.o.g. since we will study payoff functions $\mathsf{TS}^{\mathsf{Goal}}_{w_i}$, which only consider the prefix up to the first visit to $\mathsf{Goal}$.

Consider again the example of Figure 1. Consider the strategy $\sigma$ which selects one of the two available edges uniformly at random in the initial state, and always selects the unique relevant edge in the other states. Then:

$$\mathbb{P}^{\sigma}_{s_0}(\mathsf{F}\,\mathsf{Goal}) = 1 \qquad \mathbb{P}^{\sigma}_{s_0}\big(\mathsf{TS}^{\mathsf{Goal}}_{w_1} \ge 1\big) = \tfrac{1}{2} \qquad \mathbb{E}^{\sigma}_{s_0}\big(\mathsf{TS}^{\mathsf{Goal}}_{w_2}\big) = \tfrac{1}{2}\cdot 5 + \tfrac{1}{2}\cdot\sum_{i=1}^{\infty}\frac{-i}{2^i} = 1 + \tfrac{1}{2}$$
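A quick numerical check of the series in the reconstructed computation above ($\sum_{i \ge 1} i/2^i = 2$, hence the value $\tfrac12 \cdot 5 - \tfrac12 \cdot 2 = \tfrac32$):

```python
from math import isclose

s = sum(-i / 2**i for i in range(1, 60))  # partial sums converge to -2
expectation = 0.5 * 5 + 0.5 * s
assert isclose(s, -2.0) and isclose(expectation, 1.5)
```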

## 3 The problem

The problem that we tackle in this paper arises from a recent study of an EV-charging problem [14]. The general problem we will define is a relaxed version of the original problem, combining several stochastic constraints (a percentile query over some payoff function and a constraint on the expectation of some payoff function) together with a worst-case obligation. While various payoff functions could be relevant, we focus on those payoff functions that were used for the EV-charging problem, that is, quantitative reachability (i.e., the truncated-sum payoff). We will see that the developed techniques are really specific to our choice of payoff functions.

In this paper, we focus on a combination of sure reachability of the goal state, of a percentile constraint on the proportion of paths having high value for the first payoff, and of a constraint on the expected value of the second payoff.

Let $\mathcal{M}$ be a 2w-MDP. Let $\nu_1, \nu_2 \in \mathbb{Q}$. For every $\varepsilon \in [0,1]$, we define the problem $\mathcal{P}(\varepsilon)$ as follows: there exists a strategy $\sigma$ such that

1. for all $\rho \in \mathsf{Out}^\sigma_\infty(s_0)$, it holds $\rho \models \mathsf{F}\,\mathsf{Goal}$;

2. $\mathbb{P}^\sigma\big(\mathsf{TS}^{\mathsf{Goal}}_{w_1} \ge \nu_1\big) \ge 1 - \varepsilon$;

3. $\mathbb{E}^\sigma\big(\mathsf{TS}^{\mathsf{Goal}}_{w_2}\big) < \nu_2$.

We aim at computing the values of $\varepsilon$ for which $\mathcal{P}(\varepsilon)$ has a solution. For the rest of this section, we assume that there is a strategy $\sigma$ all of whose outcomes reach $\mathsf{Goal}$ and such that $\mathbb{E}^\sigma(\mathsf{TS}^{\mathsf{Goal}}_{w_2}) < \nu_2$. Otherwise $\mathcal{P}(\varepsilon)$ trivially has no solution, for any $\varepsilon$. This can be decided using the algorithm recently developed in [2].

To illustrate the problem, we consider again the example given in Figure 1, with suitable values for $\nu_1$, $\nu_2$ and $\varepsilon$. The only way to satisfy the threshold constraint on $w_1$ is that at least half of the paths use the edge with a high $w_1$-weight, which impacts the expectation of $w_2$. The other paths have to go to the other branch, and then loop for some time (provided the play goes back to the initial state) in order to decrease the expectation of $w_2$, before it becomes possible to move towards $\mathsf{Goal}$ (so that the strategy is surely winning). This strategy uses both randomization (at the initial state) and memory (counting the number of times the loop is taken before moving on).

We call the cartography of our problem the function which associates, with every $\varepsilon \in [0,1]$, either true if $\mathcal{P}(\varepsilon)$ has a solution, or false otherwise. It is easily seen that the cartography is a threshold function, and can be characterized by an interval $I$ of the form $(\lambda, 1]$ or $[\lambda, 1]$ (which may be left-open or left-closed): $\mathcal{P}(\varepsilon)$ has a solution if and only if $\varepsilon \in I$. In what follows, we describe an algorithmic technique to approximate this interval, and, under additional conditions, to compute the bound $\lambda$. Whether the bound $\lambda$ belongs to the interval remains open in general.

##### Link with the electric-vehicle (EV) charging problem.

The (centralized) EV-charging problem consists in scheduling power loads within a time interval $[0, T]$ ($T$ being a fixed time bound) with uncertain exogenous loads, so as to minimize the impact of loading on the electric distribution network (measured through the ageing of the transformer, which depends on the temperature of its winding). Following standard models, time is discretized, and the instantaneous energy consumption at time $t$ can be written as the sum of the non-flexible load (consumption outside EV) and the flexible load, corresponding to the EV charging. The flexible loads at each time are controllable actions, while the non-flexible part is known, or statistically estimated using past databases.

A first constraint on the transformer is given by its capacity: at every time $t \le T$, the total load must remain below a constant bound. A second constraint represents the charge required for charging all vehicles on schedule: the flexible loads accumulated over $[0, T]$ must reach a given constant. The flexible load can thus be seen as a weight function $w_1$.

While greedy solutions can be used to solve the above constraints, the ageing of the transformer has not been taken into account so far. Using a standard model for the ageing of a transformer (see [5, 14] for details), it can be expressed as a weight function $w_2$, based on a discrete model in which states aggregate information on the system at the last two timepoints. Globally, a 2w-MDP $\mathcal{M}$ can be built, such that a controller for the EV-charging problem coincides with a solution to $\mathcal{P}(\varepsilon)$, for some bound $\nu_2$ on the expected ageing of the transformer.
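As a hedged illustration of the two scheduling constraints in discretized form (the capacity bound `p_max` and the required charge `e_req` are our stand-ins for the paper's unnamed constants):

```python
def schedule_feasible(nonflexible, flexible, p_max, e_req):
    """Check the two constraints over a horizon of T discrete time steps:
    transformer capacity at every step, and total delivered charge."""
    capacity_ok = all(nf + fl <= p_max for nf, fl in zip(nonflexible, flexible))
    charge_ok = sum(flexible) >= e_req
    return capacity_ok and charge_ok

# Toy horizon of 4 time steps.
print(schedule_feasible(nonflexible=[3, 5, 4, 2], flexible=[2, 0, 1, 3],
                        p_max=6, e_req=6))  # True: peak load 5 <= 6 and charge 6 >= 6
```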

## 4 Approximated cartography

We fix a 2w-MDP $\mathcal{M}$ and two thresholds $\nu_1, \nu_2 \in \mathbb{Q}$. We introduce two simpler optimization problems related to $\mathcal{P}(\varepsilon)$, from which we derive information on the good values of $\varepsilon$, i.e., those for which that problem has a solution. As we explain below, our approach is in general not complete. However, we observe that the true part of the cartography of our problem is an interval of the form $(\lambda, 1]$ or $[\lambda, 1]$; under some hypotheses, we prove that our approach allows us to approximate the bound $\lambda$ arbitrarily closely, but it may not be able to decide whether the interval is left-open or left-closed.

### 4.1 Optimization problems

Let $N$ be an integer. We write $\phi^+_N$ for the property $\mathsf{F}^{\le N}\,\mathsf{Goal} \land (\mathsf{TS}^{\mathsf{Goal}}_{w_1} \ge \nu_1)$ (which specifies that the target is reached in no more than $N$ steps, with a $w_1$-weight larger than or equal to $\nu_1$), and $\phi^-_N$ for the property $\mathsf{F}^{\le N}\,\mathsf{Goal} \land (\mathsf{TS}^{\mathsf{Goal}}_{w_1} < \nu_1)$ (which means that the target is reached in no more than $N$ steps, with a $w_1$-weight smaller than $\nu_1$). We write $\psi_N$ for the property $\mathsf{G}^{\le N}\,\lnot\mathsf{Goal}$ (the target is not reached during the first $N$ steps). By extension, we write $\phi^+$, $\phi^-$ and $\psi$ for the properties $\mathsf{F}\,\mathsf{Goal} \land (\mathsf{TS}^{\mathsf{Goal}}_{w_1} \ge \nu_1)$, $\mathsf{F}\,\mathsf{Goal} \land (\mathsf{TS}^{\mathsf{Goal}}_{w_1} < \nu_1)$, and $\mathsf{G}\,\lnot\mathsf{Goal}$. Finally, we may also (abusively) use such formulas to denote the sets of paths that satisfy them.

For every $N$ and every path $\rho$ of $\mathcal{M}$ of length at least $N$, it holds that $\rho$ satisfies exactly one of $\phi^+_N$, $\phi^-_N$ and $\psi_N$. Moreover, observe that $\phi^+ = \bigcup_N \phi^+_N$ and $\phi^- = \bigcup_N \phi^-_N$, both unions being increasing. As a consequence, for every state $s$ and every strategy $\sigma$, $\mathbb{P}^\sigma_s(\phi^+) = \lim_{N\to\infty}\mathbb{P}^\sigma_s(\phi^+_N)$.
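A small sketch of this three-way classification on a finite path prefix (representation ours): given the first $N+1$ states and the running $w_1$-accumulated weights, each path of length at least $N$ falls in exactly one class:

```python
def classify(states, acc_w1, nu1, N, goal="goal"):
    """Return 'phi+', 'phi-' or 'psi' for a path prefix of length >= N.
    acc_w1[i] is the accumulated w1-weight after i steps."""
    for i in range(N + 1):
        if states[i] == goal:  # first visit to the target
            return "phi+" if acc_w1[i] >= nu1 else "phi-"
    return "psi"               # target not reached within N steps

# s0 -> s1 -> goal, gaining w1-weight 1 on the first step:
print(classify(["s0", "s1", "goal"], [0, 1, 1], nu1=1, N=2))  # 'phi+'
```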

#### 4.1.1 First optimization problem

We define

$$\overline{\mathsf{val}}_N = \inf\Big\{\mathbb{P}^\sigma\big(\phi^-_N \lor \psi_N\big) \;\Big|\; \sigma \text{ s.t. } \mathbb{E}^\sigma\big(\mathsf{TS}^{\mathsf{Goal}}_{w_2}\big) < \nu_2\Big\}$$

and for every $N$, we fix a witnessing strategy $\sigma_N$ for $\overline{\mathsf{val}}_N$ up to some precision (i.e., $\mathbb{P}^{\sigma_N}(\phi^-_N \lor \psi_N)$ is arbitrarily close to $\overline{\mathsf{val}}_N$ and $\mathbb{E}^{\sigma_N}(\mathsf{TS}^{\mathsf{Goal}}_{w_2}) < \nu_2$).

Note that, since we assume that there is a strategy $\sigma$ whose outcomes all reach $\mathsf{Goal}$ and such that $\mathbb{E}^\sigma(\mathsf{TS}^{\mathsf{Goal}}_{w_2}) < \nu_2$, the domain of this optimization problem is non-empty. Note also that if $\sigma$ is a strategy such that $\mathbb{E}^\sigma(\mathsf{TS}^{\mathsf{Goal}}_{w_2}) < \nu_2$, then $\mathbb{P}^\sigma(\mathsf{F}\,\mathsf{Goal}) = 1$, since for every path $\rho$, $\mathsf{TS}^{\mathsf{Goal}}_{w_2}(\rho) = +\infty$ whenever $\rho \models \mathsf{G}\,\lnot\mathsf{Goal}$.

It is not hard to see that the sequence $(\overline{\mathsf{val}}_N)_N$ is non-increasing (see Appendix). We let $\overline{\mathsf{val}} = \lim_{N\to\infty}\overline{\mathsf{val}}_N$. We then have:

For every $\varepsilon < \overline{\mathsf{val}}$, $\mathcal{P}(\varepsilon)$ has no solution.

###### Proof.

Fix $\varepsilon < \overline{\mathsf{val}}$, and assume towards a contradiction that $\mathcal{P}(\varepsilon)$ has a solution.

Fix a winning strategy $\sigma_\varepsilon$ for $\mathcal{P}(\varepsilon)$. By the first winning constraint, there exists $N_\varepsilon$ such that any outcome $\rho \in \mathsf{Out}^{\sigma_\varepsilon}_\infty(s_0)$ satisfies $\rho \models \mathsf{F}^{\le N_\varepsilon}\,\mathsf{Goal}$ (thanks to König’s lemma). Furthermore, since $\mathbb{E}^{\sigma_\varepsilon}(\mathsf{TS}^{\mathsf{Goal}}_{w_2}) < \nu_2$, the strategy $\sigma_\varepsilon$ belongs to the domain of the optimization problem defining $\overline{\mathsf{val}}_{N_\varepsilon}$. Hence, we have

$$\overline{\mathsf{val}}_{N_\varepsilon} \le \mathbb{P}^{\sigma_\varepsilon}\big(\phi^-_{N_\varepsilon} \lor \psi_{N_\varepsilon}\big) = 1 - \mathbb{P}^{\sigma_\varepsilon}\big(\phi^+_{N_\varepsilon}\big) = 1 - \mathbb{P}^{\sigma_\varepsilon}\big(\phi^+\big) \le \varepsilon.$$

This contradicts $\varepsilon < \overline{\mathsf{val}} \le \overline{\mathsf{val}}_{N_\varepsilon}$. Hence, we deduce that for every $\varepsilon < \overline{\mathsf{val}}$, $\mathcal{P}(\varepsilon)$ has no solution. ∎

For every $N$ and every $\varepsilon > \overline{\mathsf{val}}_N$, $\mathcal{P}(\varepsilon)$ has a solution.

###### Proof.

Let $N$ be an integer, and $\varepsilon > \overline{\mathsf{val}}_N$. Let $\sigma_N$ be a strategy such that

$$\mathbb{P}^{\sigma_N}\big(\phi^-_N \lor \psi_N\big) < \varepsilon \quad\text{and}\quad \mathbb{E}^{\sigma_N}\big(\mathsf{TS}^{\mathsf{Goal}}_{w_2}\big) < \nu_2.$$

Let $\sigma_{\mathsf{Att}}$ be an attractor (memoryless) strategy on $\mathcal{M}$, that is, a strategy which enforces reaching $\mathsf{Goal}$; write $M$ for a positive upper bound on the accumulated weight $w_2$ when playing that strategy (from any state). For $k \ge N$, define $\sigma^k_N$ as: play $\sigma_N$ for the first $k$ steps, and if $\mathsf{Goal}$ is not reached, then play $\sigma_{\mathsf{Att}}$. We show that we can find $k$ large enough such that this strategy is a solution to $\mathcal{P}(\varepsilon)$.

The first condition is satisfied, since either the target state is reached during the first $k$ steps (i.e. while playing $\sigma_N$), or it will surely be reached by playing $\sigma_{\mathsf{Att}}$. Since $\sigma^k_N$ coincides with $\sigma_N$ on the first $k \ge N$ steps and $\mathbb{P}^{\sigma_N}(\phi^+_N) > 1 - \varepsilon$, it is the case that $\mathbb{P}^{\sigma^k_N}(\mathsf{TS}^{\mathsf{Goal}}_{w_1} \ge \nu_1) \ge 1 - \varepsilon$, which is the second condition for $\sigma^k_N$ being a solution to $\mathcal{P}(\varepsilon)$. Finally, thanks to the law of total expectation, we can write:

$$\begin{aligned}
\mathbb{E}^{\sigma^k_N}\big(\mathsf{TS}^{\mathsf{Goal}}_{w_2}\big) &= \mathbb{E}^{\sigma^k_N}\big(\mathsf{TS}^{\mathsf{Goal}}_{w_2} \mid \mathsf{F}^{\le k}\,\mathsf{Goal}\big)\cdot\mathbb{P}^{\sigma^k_N}\big(\mathsf{F}^{\le k}\,\mathsf{Goal}\big) + \mathbb{E}^{\sigma^k_N}\big(\mathsf{TS}^{\mathsf{Goal}}_{w_2} \mid \mathsf{G}^{\le k}\,\lnot\mathsf{Goal}\big)\cdot\mathbb{P}^{\sigma^k_N}\big(\mathsf{G}^{\le k}\,\lnot\mathsf{Goal}\big)\\
&\le \mathbb{E}^{\sigma^k_N}\big(\mathsf{Acc}^k_{w_2} \mid \mathsf{F}^{\le k}\,\mathsf{Goal}\big)\cdot\mathbb{P}^{\sigma^k_N}\big(\mathsf{F}^{\le k}\,\mathsf{Goal}\big) + \mathbb{E}^{\sigma^k_N}\big(\mathsf{Acc}^k_{w_2} + M \mid \mathsf{G}^{\le k}\,\lnot\mathsf{Goal}\big)\cdot\mathbb{P}^{\sigma^k_N}\big(\mathsf{G}^{\le k}\,\lnot\mathsf{Goal}\big)\\
&\qquad\text{(since the global impact of playing the strategy $\sigma_{\mathsf{Att}}$ is bounded by $M$)}\\
&= \mathbb{E}^{\sigma^k_N}\big(\mathsf{Acc}^k_{w_2}\big) + M\cdot\mathbb{P}^{\sigma^k_N}\big(\mathsf{G}^{\le k}\,\lnot\mathsf{Goal}\big)\\
&\qquad\text{(by linearity of expectation and the law of total expectation again)}\\
&= \mathbb{E}^{\sigma_N}\big(\mathsf{Acc}^k_{w_2}\big) + M\cdot\mathbb{P}^{\sigma_N}\big(\mathsf{G}^{\le k}\,\lnot\mathsf{Goal}\big)\\
&\qquad\text{(since $\sigma^k_N$ coincides with $\sigma_N$ on the first $k$ steps)}
\end{aligned}$$

Now, since $\mathbb{E}^{\sigma_N}(\mathsf{TS}^{\mathsf{Goal}}_{w_2})$ is finite, it is the case that $\mathbb{P}^{\sigma_N}(\mathsf{F}\,\mathsf{Goal}) = 1$. Hence:

• $\lim_{k\to\infty} \mathbb{E}^{\sigma_N}\big(\mathsf{Acc}^k_{w_2}\big) = \mathbb{E}^{\sigma_N}\big(\mathsf{TS}^{\mathsf{Goal}}_{w_2}\big)$, and

• $\lim_{k\to\infty} \mathbb{P}^{\sigma_N}\big(\mathsf{G}^{\le k}\,\lnot\mathsf{Goal}\big) = 0$.

Let $\eta = \nu_2 - \mathbb{E}^{\sigma_N}(\mathsf{TS}^{\mathsf{Goal}}_{w_2}) > 0$. One can choose $k$ large enough such that

$$\Big|\mathbb{E}^{\sigma_N}\big(\mathsf{Acc}^k_{w_2}\big) - \mathbb{E}^{\sigma_N}\big(\mathsf{TS}^{\mathsf{Goal}}_{w_2}\big)\Big| < \eta/2 \quad\text{and}\quad \mathbb{P}^{\sigma_N}\big(\mathsf{G}^{\le k}\,\lnot\mathsf{Goal}\big) < \eta/(2M).$$

We conclude that:

$$\mathbb{E}^{\sigma^k_N}\big(\mathsf{TS}^{\mathsf{Goal}}_{w_2}\big) < \mathbb{E}^{\sigma_N}\big(\mathsf{TS}^{\mathsf{Goal}}_{w_2}\big) + \eta/2 + M\cdot\eta/(2M) = \nu_2.$$

The strategy $\sigma^k_N$ therefore witnesses the fact that problem $\mathcal{P}(\varepsilon)$ has a solution. ∎

#### 4.1.2 Second optimization problem

We now define

$$\underline{\mathsf{val}}_N = \inf\Big\{\mathbb{P}^\sigma\big(\phi^-_N\big) \;\Big|\; \sigma \text{ s.t. } \mathbb{E}^\sigma\big(\mathsf{TS}^{\mathsf{Goal}}_{w_2}\big) < \nu_2\Big\}.$$

Notice that for any $N$, $\underline{\mathsf{val}}_N \le \overline{\mathsf{val}}_N$. For every $N$, we fix a witness strategy for $\underline{\mathsf{val}}_N$ up to some precision (so that its probability of $\phi^-_N$ is arbitrarily close to $\underline{\mathsf{val}}_N$ and its expected $\mathsf{TS}^{\mathsf{Goal}}_{w_2}$ is smaller than $\nu_2$).

This time, it can be observed that the sequence $(\underline{\mathsf{val}}_N)_N$ is non-decreasing. We let $\underline{\mathsf{val}} = \lim_{N\to\infty}\underline{\mathsf{val}}_N$. From the results and remarks above, we have $\underline{\mathsf{val}}_N \le \underline{\mathsf{val}} \le \overline{\mathsf{val}}$ for any $N$. From Lemma 4.1.1, we get:

For any $N$ and any $\varepsilon < \underline{\mathsf{val}}_N$, $\mathcal{P}(\varepsilon)$ has no solution.

While the status of $\varepsilon = \overline{\mathsf{val}}$ is in general unknown, we still have the following properties:

• If $\mathcal{P}(\overline{\mathsf{val}})$ has a solution, then the sequence $(\overline{\mathsf{val}}_N)_N$ is stationary and ultimately takes value $\overline{\mathsf{val}}$. The converse need not hold in general;

• $\underline{\mathsf{val}} = \overline{\mathsf{val}}$ neither implies that $\mathcal{P}(\overline{\mathsf{val}})$ has a solution, nor that it has no solution.

#### 4.1.3 Summary

Figure 2 summarizes the previous analysis.

The picture seems rather complete, since only the status of $\varepsilon = \overline{\mathsf{val}}$ remains uncertain. However, it remains to discuss two things. First, the limits $\overline{\mathsf{val}}$ and $\underline{\mathsf{val}}$ are a priori unknown, hence the cartography is not effective so far. The idea is then to use the sequences $(\overline{\mathsf{val}}_N)_N$ and $(\underline{\mathsf{val}}_N)_N$ to approximate the limits. We will therefore discuss cases where the two limits coincide (we then say that the approach is almost-complete), allowing for a converging scheme, and hence an algorithm to almost cover the interval $[0,1]$ with either red (there is no solution) or green (there is a solution), that is, to almost compute the full cartography of the problem. Second, we should discuss the effectiveness of the approach.
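The converging scheme can be summarized as follows, assuming two oracles `upper_val` and `lower_val` returning $\overline{\mathsf{val}}_N$ and $\underline{\mathsf{val}}_N$ (whether such oracles are effective is precisely the issue discussed below); this is a sketch of the idea, not the paper's algorithm:

```python
def approximate_cartography(upper_val, lower_val, delta, n_max=10_000):
    """Shrink the unknown zone around the cartography threshold:
    P(eps) has a solution for eps > upper_val(N) (green zone),
    and no solution for eps < lower_val(N) (red zone)."""
    lo = hi = None
    for n in range(1, n_max + 1):
        hi, lo = upper_val(n), lower_val(n)
        if hi - lo < delta:     # almost-complete up to precision delta
            break
    return lo, hi               # status inside [lo, hi] stays unresolved
```

When the two limits coincide, the gap `hi - lo` tends to $0$ and the loop terminates; otherwise, the unknown zone cannot be shrunk below $\overline{\mathsf{val}} - \underline{\mathsf{val}}$.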

## 5 Almost-completeness of the approach

In this section, we discuss the almost-completeness of our approach, and describe situations where one can show that $\underline{\mathsf{val}} = \overline{\mathsf{val}}$, which allows us to reduce the unknown part of the cartography to the singleton $\{\overline{\mathsf{val}}\}$.

The situations for completeness we describe below are conditions over cycles, either on weight $w_1$ or on weight $w_2$. When we assume that cycles have positive weights, we mean it for every cycle, except for the cycle containing $\mathsf{Goal}$, which we assumed is a self-loop with weight $0$.

### 5.1 When all cycles have a positive w2-weight

In this subsection, we assume that the $w_2$-weight of each cycle of $\mathcal{M}$ is positive (this is the case for instance when the $w_2$-weight of each edge is $1$, i.e., when $w_2$ counts the number of steps). We let $n$ be the number of states of $\mathcal{M}$.

There exists a constant $\kappa$ such that, for any strategy $\sigma$ satisfying $\mathbb{E}^\sigma(\mathsf{TS}^{\mathsf{Goal}}_{w_2}) < \nu_2$, and any $N > n$, it holds:

$$\mathbb{P}^\sigma\big(\phi^-_N \lor \phi^+_N\big) \ge 1 - \frac{n}{N-n}\cdot\kappa.$$
###### Proof.

Assuming otherwise, the impact of all runs that do not belong to $\phi^-_N \lor \phi^+_N$ would be too large for the constraint on $\mathbb{E}^\sigma(\mathsf{TS}^{\mathsf{Goal}}_{w_2})$. Indeed, applying the law of total expectation, we can write for every $N$:

$$\mathbb{E}^{\sigma}\big(\mathsf{TS}^{\mathsf{Goal}}_{w_2}\big) = \mathbb{E}^{\sigma}\big(\mathsf{TS}^{\mathsf{Goal}}_{w_2} \mid \mathsf{F}^{\le N}\,\mathsf{Goal}\big)\cdot\mathbb{P}^{\sigma}\big(\mathsf{F}^{\le N}\,\mathsf{Goal}\big) + \mathbb{E}^{\sigma}\big(\mathsf{TS}^{\mathsf{Goal}}_{w_2} \mid \mathsf{G}^{\le N}\,\lnot\mathsf{Goal}\big)\cdot\mathbb{P}^{\sigma}\big(\mathsf{G}^{\le N}\,\lnot\mathsf{Goal}\big)$$

Write $W_2$ for the minimal (possibly negative) $w_2$-weight appearing in $\mathcal{M}$, and $c_2$ for the minimal (positive by hypothesis) $w_2$-weight of cycles in $\mathcal{M}$. Noticing that, along any path, at most $n$ edges may be outside any cycle, we get

$$\mathbb{E}^{\sigma}\big(\mathsf{TS}^{\mathsf{Goal}}_{w_2} \mid \mathsf{F}^{\le N}\,\mathsf{Goal}\big) \ge n\cdot W_2$$

and

$$\mathbb{E}^{\sigma}\big(\mathsf{TS}^{\mathsf{Goal}}_{w_2} \mid \mathsf{G}^{\le N}\,\lnot\mathsf{Goal}\big) \ge n\cdot W_2 + \frac{N-n}{n}\cdot c_2.$$

We get:

$$\mathbb{E}^{\sigma}\big(\mathsf{TS}^{\mathsf{Goal}}_{w_2}\big) \ge n\cdot W_2\cdot\Big(\mathbb{P}^{\sigma}\big(\mathsf{F}^{\le N}\,\mathsf{Goal}\big)+\mathbb{P}^{\sigma}\big(\mathsf{G}^{\le N}\,\lnot\mathsf{Goal}\big)\Big) + \frac{N-n}{n}\cdot c_2\cdot\mathbb{P}^{\sigma}\big(\mathsf{G}^{\le N}\,\lnot\mathsf{Goal}\big)$$

Since the left-hand side is strictly smaller than $\nu_2$, we get

$$\mathbb{P}^{\sigma}\big(\mathsf{G}^{\le N}\,\lnot\mathsf{Goal}\big) = \mathbb{P}^{\sigma}(\psi_N) = 1 - \mathbb{P}^{\sigma}\big(\phi^+_N \lor \phi^-_N\big) \le \frac{n}{N-n}\cdot\frac{\nu_2 - n\cdot W_2}{c_2}.$$

The constant $\kappa = (\nu_2 - n\cdot W_2)/c_2$ thus satisfies the requirement. ∎

For any constant $\kappa$ satisfying Lemma 5.1, and any $N > n$, we have

$$0 \le \overline{\mathsf{val}}_N - \underline{\mathsf{val}}_N \le \frac{n}{N-n}\cdot\kappa.$$
###### Proof.

We already remarked that $\underline{\mathsf{val}}_N \le \overline{\mathsf{val}}_N$. Now, from Lemma 5.1, for every strategy $\sigma$ such that $\mathbb{E}^\sigma(\mathsf{TS}^{\mathsf{Goal}}_{w_2}) < \nu_2$, it holds for any $N > n$ that

$$\mathbb{P}^\sigma(\psi_N) \le \frac{n}{N-n}\cdot\kappa.$$

Hence, for any such strategy $\sigma$ and any $N > n$,

$$\mathbb{P}^\sigma\big(\phi^-_N \lor \psi_N\big) = \mathbb{P}^\sigma\big(\phi^-_N\big) + \mathbb{P}^\sigma(\psi_N) \le \mathbb{P}^\sigma\big(\phi^-_N\big) + \frac{n}{N-n}\cdot\kappa.$$

Taking the infimum over $\sigma$, first in the left-hand side, and then in the right-hand side, we get the expected bound. ∎

As a corollary, letting $N$ go to infinity, we obtain $\underline{\mathsf{val}} = \overline{\mathsf{val}}$ under this assumption.
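Under this assumption, the scheme sketched in Section 4.1.3 becomes quantitative: to bring the gap $\overline{\mathsf{val}}_N - \underline{\mathsf{val}}_N$ below a precision $\delta$, it suffices to take $N \ge n\,(1 + \kappa/\delta)$, by the bound above. A one-line check (names ours):

```python
from math import ceil

def horizon_for_gap(n, kappa, delta):
    """Smallest integer N with n/(N-n)*kappa <= delta, i.e. N >= n*(1 + kappa/delta)."""
    return ceil(n * (1 + kappa / delta))

print(horizon_for_gap(n=10, kappa=4.0, delta=0.01))  # 4010
```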

Notice that the result does not hold without the assumption. Indeed, consider the 2w-MDP defined by the two deterministic edges $s_0 \to s_0$ and $s_0 \to \mathsf{Goal}$, with all weights equal to $0$ and thresholds $\nu_1, \nu_2 > 0$. Then, for every $N$, $\underline{\mathsf{val}}_N = 0$ (a strategy may loop on $s_0$ for more than $N$ steps before moving to $\mathsf{Goal}$, so that no path satisfies $\phi^-_N$), while $\overline{\mathsf{val}}_N = 1$ (no path ever satisfies $\phi^+_N$).

### 5.2 When all cycles have a positive w1-weight

We assume that each cycle of $\mathcal{M}$ has a positive $w_1$-weight. We first notice that there exists an integer $N_0$ such that, for every path from $s_0$ of length $N_0$ not visiting the goal state, the accumulated weight satisfies $\mathsf{Acc}^{N_0}_{w_1} \ge \nu_1$ (for instance, mirroring the counting argument of Section 5.1, any $N_0 \ge n\cdot(1 + (\nu_1 - n\cdot W_1)/c_1)$ works, where $W_1$ is the minimal $w_1$-weight of edges and $c_1 > 0$ is the minimal $w_1$-weight of cycles). In particular, if a path $\rho$ satisfies $\phi^-_N$ for some $N$, then $\rho \models \mathsf{F}^{\le N_0}\,\mathsf{Goal}$.

Using this remark, we can prove that $\overline{\mathsf{val}} = \underline{\mathsf{val}}$ also holds under this assumption:

###### Proof.

We fix the index $N_0$ as in Lemma 5.2. For any $N \ge N_0$ and any path $\rho$ of length larger than $N$, we have

$$\rho \models \phi^-_N \iff \rho \models \phi^-_{N_0} \qquad\text{and}\qquad \rho \models \phi^+_N \iff \rho \models \phi^+_{N_0} \lor \big(\mathsf{G}^{\le N_0}\,\lnot\mathsf{Goal} \land \mathsf{F}^{(N_0;N]}\,\mathsf{Goal}\big).$$

From the first equivalence, we infer that for every $N \ge N_0$, $\underline{\mathsf{val}}_N = \underline{\mathsf{val}}_{N_0}$.

Let $N \ge N_0$, and write:

$$\begin{aligned}
\overline{\mathsf{val}}_N &= \inf\big\{\mathbb{P}^\sigma\big(\phi^-_N \lor \psi_N\big) \mid \sigma \text{ s.t. } \mathbb{E}^\sigma\big(\mathsf{TS}^{\mathsf{Goal}}_{w_2}\big) < \nu_2\big\}\\
&= 1 - \sup\big\{\mathbb{P}^\sigma\big(\phi^+_N\big) \mid \sigma \text{ s.t. } \mathbb{E}^\sigma\big(\mathsf{TS}^{\mathsf{Goal}}_{w_2}\big) < \nu_2\big\}\\
&= 1 - \sup\big\{\mathbb{P}^\sigma\big(\phi^+_{N_0} \lor (\mathsf{G}^{\le N_0}\,\lnot\mathsf{Goal} \land \mathsf{F}^{(N_0;N]}\,\mathsf{Goal})\big) \mid \sigma \text{ s.t. } \mathbb{E}^\sigma\big(\mathsf{TS}^{\mathsf{Goal}}_{w_2}\big) < \nu_2\big\}
\end{aligned}$$

We claim that:

$$\lim_{N\to+\infty}\ \sup\big\{\mathbb{P}^\sigma\big(\phi^+_{N_0} \lor (\mathsf{G}^{\le N_0}\,\lnot\mathsf{Goal} \land \mathsf{F}^{(N_0;N]}\,\mathsf{Goal})\big) \mid \sigma \text{ s.t. } \mathbb{E}^\sigma\big(\mathsf{TS}^{\mathsf{Goal}}_{w_2}\big) < \nu_2\big\} = \sup\big\{\mathbb{P}^\sigma\big(\phi^+_{N_0} \lor (\mathsf{G}^{\le N_0}\,\lnot\mathsf{Goal} \land \mathsf{F}^{>N_0}\,\mathsf{Goal})\big) \mid \sigma \text{ s.t. } \mathbb{E}^\sigma\big(\mathsf{TS}^{\mathsf{Goal}}_{w_2}\big) < \nu_2\big\}$$

From this lemma, we get that:

$$\begin{aligned}
\lim_{N\to+\infty}\overline{\mathsf{val}}_N &= 1 - \sup\big\{\mathbb{P}^\sigma\big(\phi^+_{N_0} \lor (\mathsf{G}^{\le N_0}\,\lnot\mathsf{Goal} \land \mathsf{F}^{>N_0}\,\mathsf{Goal})\big) \mid \sigma \text{ s.t. } \mathbb{E}^\sigma\big(\mathsf{TS}^{\mathsf{Goal}}_{w_2}\big) < \nu_2\big\}\\
&= \inf\big\{\mathbb{P}^\sigma\big(\phi^-_{N_0}\big) \mid \sigma \text{ s.t. } \mathbb{E}^\sigma\big(\mathsf{TS}^{\mathsf{Goal}}_{w_2}\big) < \nu_2\big\}
\end{aligned}$$

where the last equality holds because a path not satisfying $\phi^+_{N_0} \lor (\mathsf{G}^{\le N_0}\,\lnot\mathsf{Goal} \land \mathsf{F}^{>N_0}\,\mathsf{Goal})$ satisfies $\phi^-_{N_0} \lor \mathsf{G}\,\lnot\mathsf{Goal}$, and $\mathbb{P}^\sigma(\mathsf{G}\,\lnot\mathsf{Goal}) = 0$ for every $\sigma$ in the domain of the optimization. The right-hand side is $\underline{\mathsf{val}}_{N_0} = \underline{\mathsf{val}}$, so that $\overline{\mathsf{val}} = \underline{\mathsf{val}}$, as claimed.