Markov Decision Processes (MDPs) provide a unifying framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. MDPs are useful for studying optimization problems solved via dynamic programming and reinforcement learning. They are used in several areas, including economics, control, robotics and autonomous systems. In its simplest form, an MDP comprises a finite set of states and a finite set of control actions, and specifies, for each state and action, the transition probabilities to successor states. In addition, taking an action in a state incurs an immediate cost. (In several alternative but essentially equivalent definitions of MDPs, transitions have associated rewards rather than costs, and the reward may also depend on the successor state.) The overall problem is to find a strategy that specifies the action to be taken in each state in order to optimize some objective (e.g. the expected cost of reaching a goal state).
For many applications, however, such as queuing systems, epidemic processes (e.g. COVID-19), and population processes, the restriction to a finite state space is inadequate. Rather, the underlying system has an infinite state space and the decision making process must take into account the continuous dynamics of the system. In this paper, we consider a particular class of infinite-state MDPs, namely Euclidean Markov Decision Processes, where the state space is given by a (measurable) subset of a Euclidean space of some fixed dimension.
As an example, consider the semi-random walk illustrated on the left of Fig. 1, whose state space is two-dimensional (one spatial dimension, and time). Here the goal is to cross the finishing line before a fixed deadline. The decision maker has two actions at her disposal: to move fast at a high cost, or to move slowly at a low cost. Both actions have uncertainty about distance traveled and time taken. This uncertainty is modeled by a uniform distribution over a square of successor states: given the current state and an action, the distribution over possible successor states is the uniform distribution over a square that is determined by an action-dependent displacement, representing the direction of the movement in space and time, and by a parameter that models the uncertainty. Now, the question is to find the strategy that will minimize the expected cost of reaching a goal state.
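To make the dynamics of the semi-random walk concrete, the following is a minimal simulation sketch. All numeric values (finishing line, deadline, displacements, costs, and the uncertainty parameter `EPS`) as well as all function names are illustrative assumptions and are not taken from the model used in the experiments. The sketch also assumes that a run simply stops without further cost when the deadline passes.

```python
import random

# All numeric values are hypothetical placeholders chosen for illustration.
X_GOAL = 1.0   # position of the finishing line
T_MAX = 1.0    # deadline
EPS = 0.04     # half-width of the uncertainty square
ACTIONS = {
    # action: (displacement in space, displacement in time, cost)
    "fast": (0.17, 0.05, 3.0),
    "slow": (0.10, 0.12, 1.0),
}

def step(x, t, action, rng):
    """Sample a successor state uniformly from the square of the action."""
    dx, dt, cost = ACTIONS[action]
    nx = x + rng.uniform(dx - EPS, dx + EPS)
    nt = t + rng.uniform(dt - EPS, dt + EPS)
    return nx, nt, cost

def run_cost(strategy, rng, x=0.0, t=0.0, max_steps=1000):
    """Accumulate costs until the line is crossed or the deadline passes.

    Returns the accumulated cost and a flag telling whether the goal was reached.
    """
    total = 0.0
    for _ in range(max_steps):
        if x >= X_GOAL:
            return total, True
        if t >= T_MAX:
            return total, False
        x, t, cost = step(x, t, strategy(x, t), rng)
        total += cost
    return total, False

# Example: the strategy that always moves fast.
total, reached = run_cost(lambda x, t: "fast", random.Random(0))
```

Averaging `run_cost` over many sampled runs gives a Monte Carlo estimate of the expected cost of a strategy from the chosen initial state, which is the quantity the bounds developed in this paper are meant to enclose.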
In [9], we proposed two reinforcement learning algorithms implemented in UPPAAL STRATEGO [5], using online partition refinement techniques. In that work we experimentally demonstrated their improved convergence tendencies on a range of models. For the semi-random walk example, the online learning algorithm returns the strategy illustrated on the right of Fig. 1.
However, despite its efficiency and experimentally demonstrated convergence properties, the learning approach of [9] provides no hard guarantees as to how far away the expected cost of the learned strategy is from the optimal one. In this paper we propose a step-wise partition refinement process, where each partitioning induces a finite-state imprecise MDP (IMDP). From the induced IMDP we can derive upper and lower bounds on the expected cost of the original infinite-state Euclidean MDP. As a crucial result, we prove the correctness of these bounds, i.e., that they are always guaranteed to contain the true expected cost. Also, we provide value iteration procedures for computing lower and upper expected costs of IMDPs. Figure 2 shows upper and lower bounds on the expected cost over the regions shown in Figure 1.
Applying the IMDP value iteration procedures to the partition learned by UPPAAL STRATEGO therefore allows us to compute guaranteed lower and upper bounds on the expected cost, and thereby validate the results of reinforcement learning. The main contributions of this paper can be summarized as follows:
We define IMDP abstractions of infinite state Euclidean MDPs, and establish as key theoretical properties: the correctness of value iteration to compute upper and lower expected cost functions, the correctness of the upper and lower cost functions as bounds on the cost function of the original Euclidean MDP, and, under a restriction to finite time horizons, the convergence of upper and lower bounds to the actual cost values.
We demonstrate the applicability of the general framework to analyze the accuracy of strategies learned by reinforcement learning.
Our work is closely related to various types of MDP models proposed in different areas.
Imprecise Markov Chains and Imprecise Markov Decision Processes have been considered in areas such as operations research and artificial intelligence [15, 4, 14]. The focus here typically is on approximating optimal policies for fixed, finite state spaces. In the same spirit, but from a verification point of view, [2] focuses on reachability probabilities.
Lumped Markov chains are obtained by aggregating sets of states of a Markov chain into a single state. Much work is devoted to the question of when and how the resulting process again is a Markov chain (it rarely is) [13, 6]. The interplay of lumping and imprecision is considered in [7]. Most work in this area is concerned with finite state spaces. Abstraction by state space partitioning (lumping) can be understood as a special form of partial observability (one only observes which partition element the current state belongs to). A combination of partial observability with imprecise probabilities is considered in [8].
Kwiatkowska et al. [10] introduce abstractions of finite state MDPs by partitioning the state space. Upper and lower bounds for reachability probabilities are obtained from the abstract MDP, which is formalized as a two player stochastic game. Lun et al. [11] are concerned with obtaining accurate specifications of an abstraction obtained by state space partitioning. The underlying state space is finite, and a fixed partition is given.
Thus, while there is a large amount of closely related work on abstracting MDPs by state space partitioning, and imprecise MDPs that can result from such an abstraction, to the best of our knowledge, our work is distinguished from previous work by: the consideration of infinite continuous state spaces for the underlying models of primary interest, and the focus on the properties of refinement sequences induced by partitions of increasing granularity.
2 Euclidean MDP and Expected Cost
Definition 1 (Euclidean Markov Decision Processes)
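A Euclidean Markov decision process (EMDP) is a tuple consisting of a state space, a set of goal states, a set of actions, a transition function, and a cost function, where:
the state space is a measurable subset of a Euclidean space of fixed finite dimension, equipped with the Borel σ-algebra,
the set of goal states is a measurable subset of the state space,
the set of actions is finite,
the transition function defines for every action a transition kernel on the state space, i.e.,
for every state it yields a probability distribution over the measurable sets of successor states, and for every measurable set of successor states it is measurable as a function of the current state. Furthermore, the set of goal states is absorbing, i.e. from every goal state and under every action, the successor state is a goal state with probability 1,
the cost function assigns a non-negative cost to state-action pairs, such that for every action it is measurable as a function of the state, and the cost is zero at all goal states.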
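A run of an MDP is an infinite sequence of alternating states and actions. We consider the set of all runs of an EMDP, and refer to the individual state-action pairs, the finite prefixes, and the tails of a run by their position in the sequence. The cost of a run is the sum of the costs of all its state-action pairs.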
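For readers who prefer symbols, one possible rendering of Definition 1 and of the run cost is sketched below. The symbol names (the tuple components, the dimension $K$, the name $\mathrm{cost}$, and the choice to index runs from 1) are assumptions made for this sketch and need not coincide with the notation used elsewhere in the paper.

```latex
\begin{align*}
\mathcal{M} &= (\mathcal{S}, \mathcal{G}, \mathit{Act}, T, \mathcal{C}),
  \qquad \mathcal{G} \subseteq \mathcal{S} \subseteq \mathbb{R}^K, \\
T(s, a, \mathcal{G}) &= 1 \quad\text{and}\quad \mathcal{C}(s, a) = 0
  \qquad \text{for all } s \in \mathcal{G},\ a \in \mathit{Act}, \\
\mathrm{cost}(s_1 a_1 s_2 a_2 \ldots) &= \sum_{i \geq 1} \mathcal{C}(s_i, a_i) \;\in\; [0, \infty].
\end{align*}
```

Since all costs are non-negative, the sum is always well-defined, although it may be infinite for runs that never reach a goal state.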
The set is equipped with the product -algebra generated by the cylinder sets (, , ). We denote with the Borel -algebra restricted to the non-negative reals, and with the standard extension to , i.e. the sets of the form and , where .
Lemma 1
Due to space constraints proofs are only included in the extended online version of this paper.
We next consider strategies for EMDPs. We limit ourselves to memoryless and stationary strategies, noting that on the rich Euclidean state space this is less of a limitation than on finite state spaces, since a non-stationary, time dependent strategy can here be turned into a stationary strategy by adding one real-valued dimension representing time.
Definition 2 (Strategy)
A (memoryless, stationary) strategy for an MDP is a function mapping states to probability distributions over the actions, such that for every action the probability assigned to it is measurable as a function of the state.
The following lemma is mostly a technicality that needs to be established in order to ensure that an MDP in conjunction with a strategy and an initial state distribution defines a Markov process on , and hence a probability distribution on .
Lemma 2
If is a strategy, then
is a transition kernel on .
Usually, the initial state distribution will be given by a fixed initial state. The resulting distribution over runs then is determined by the initial state and the strategy (it also depends on the underlying EMDP; to avoid notational clutter, we do not always make this dependence explicit in the notation).
Definition 3 (Expected Cost)
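The expected cost at an initial state under a strategy is the expectation of the run cost under the distribution over runs defined by the initial state and the strategy. The expected cost at an initial state then is defined as the infimum of these expectations over all strategies.
If the initial state is a goal state, then the expected cost is zero under any strategy, and hence the expected cost at that state is zero. However, an expected cost of zero can also occur at non-goal states, since a cost of zero is also allowed for state-action pairs at non-goal states.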
Note that, for any strategy , the functions and are -valued measurable functions on . This follows by measurability of and , for all , and [1, Theorem 13.4].
2.1 Value Iteration for EMDPs
We next show that expected costs in EMDPs can be computed by value iteration. Our results are closely related to Theorem 7.3.10 in [12]. However, our scenario differs from the one treated by Puterman [12] in that we deal with uncountable state spaces, and in that we want to permit infinite cost values. Adapting Puterman’s notation [12], we introduce two operators on measurable functions on the state space with values in the extended non-negative reals, one defined relative to a fixed strategy and one that optimizes over actions, defined as follows:
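In the style of Puterman, a plausible form of the two operators is sketched below. The symbols ($\mathcal{L}_\sigma$, $\mathcal{L}$, $\sigma$, $T$, $\mathcal{C}$, $\mathit{Act}$, $\mathcal{S}$) are assumptions made for this sketch; the first operator averages over the actions chosen by a fixed strategy $\sigma$, the second takes the minimum over the finitely many actions.

```latex
\begin{align*}
(\mathcal{L}_\sigma f)(s) &= \sum_{a \in \mathit{Act}} \sigma(s)(a)
   \Big( \mathcal{C}(s,a) + \int_{\mathcal{S}} f(t)\, T(s,a,dt) \Big), \\
(\mathcal{L} f)(s) &= \min_{a \in \mathit{Act}}
   \Big( \mathcal{C}(s,a) + \int_{\mathcal{S}} f(t)\, T(s,a,dt) \Big).
\end{align*}
```

Under this reading, both operators add the immediate cost to the expected value of $f$ at the successor state, so they map non-negative functions to non-negative functions and preserve the pointwise order.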
The operators above are well-defined:
Lemma 3
If is measurable, so are and .
The set of measurable functions on the state space with values in the extended non-negative reals forms a complete partial order under the pointwise order, in which one function is below another iff its value is at most that of the other at every state. The top and bottom elements are the constant functions with value infinity and zero, respectively. Meet and join are the pointwise infimum and pointwise supremum, respectively. From their definitions, it is easy to see that both operators are monotone.
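Since the set of actions is finite, for every such function we can define a deterministic strategy that selects in each state an action attaining the minimum, so that the optimizing operator and the operator relative to this strategy agree on the function. We can establish an even stronger relation: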
Lemma 4
As a first main step we can show that the expected cost under the strategy is a fixed point for the operator :
Proposition 1
For any strategy , .
Proposition 2
. Moreover, if , then .
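By Proposition 2 and the Tarski fixed point theorem, the expected cost function is the least fixed point of the optimizing operator. The following theorem provides a stronger result, namely that the expected cost function is the supremum of the pointwise increasing chain obtained by iterating the operator, starting from the constant zero function. In other words, value iteration converges to the expected cost function.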
Theorem 2.1
3 Imprecise MDP
The value iteration of Theorem 2.1 is a mathematical process, not an algorithmic one, as it is defined pointwise on the uncountable state space. Our goal, therefore, is to approximate the expected cost function of an EMDP by expected cost functions on finite state spaces consisting of partitions of the original state space. In order to retain sufficient information about the original EMDP to be able to derive provable upper and lower bounds for its expected cost, we approximate the EMDP by Imprecise Markov Decision Processes (IMDPs).
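On a finite state space, by contrast, the same iteration is directly executable. The following sketch shows value iteration for expected costs on a small, precise finite MDP. The dictionary-based encoding, the function name, and the three-state example are assumptions made purely for illustration.

```python
def value_iteration(states, goal, actions, T, C, sweeps=1000, tol=1e-12):
    """Iterate f <- min_a [ C[s][a] + sum_t T[s][a][t] * f[t] ], starting from f = 0."""
    f = {s: 0.0 for s in states}
    for _ in range(sweeps):
        g = {}
        for s in states:
            if s in goal:
                g[s] = 0.0  # goal states are absorbing and cost-free
            else:
                g[s] = min(
                    C[s][a] + sum(p * f[t] for t, p in T[s][a].items())
                    for a in actions
                )
        if max(abs(g[s] - f[s]) for s in states) < tol:
            return g
        f = g
    return f

# Hypothetical example: from state "s", the action "fast" reaches the goal
# surely at cost 3, while "slow" costs 1 and succeeds with probability 0.5.
states = ["s", "goal"]
T = {"s": {"fast": {"goal": 1.0}, "slow": {"goal": 0.5, "s": 0.5}}}
C = {"s": {"fast": 3.0, "slow": 1.0}}
values = value_iteration(states, {"goal"}, ["fast", "slow"], T, C)
```

In this example, repeating the slow action has expected cost 1 / 0.5 = 2, which is below the cost 3 of the fast action, so the iteration increases monotonically towards the value 2 at state `"s"`.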
Definition 4 (Imprecise Markov Decision Processes)
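A finite state, imprecise Markov decision process (IMDP) is a tuple consisting of a state space, a set of goal states, a set of actions, a transition function, and a cost function, where:
the state space is a finite set of states,
the set of goal states is a subset of the state space,
the set of actions is finite,
the transition function assigns to state-action pairs a closed set of probability distributions over the state space; the set of goal states is absorbing, i.e., for all goal states and all actions, every distribution in the assigned set gives probability 1 to the set of goal states,
the cost function assigns to state-action pairs a closed set of costs, such that for all goal states and all actions the assigned set contains only the cost zero.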
Memoryless, stationary strategies are defined as before. In order to turn an IMDP into a fully probabilistic model, one also needs to resolve the choice of a transition probability distribution and cost value.
Definition 5 (Adversary, Lower/Upper expected cost)
An adversary for an IMDP consists of two functions
A strategy , an adversary , and an initial state together define a probability distribution over runs with , and hence the expected cost . We then define the lower and upper expected cost as
Since the sets of transition probability distributions and costs are required to be closed, we can here write min and max rather than inf and sup. Furthermore, the closure conditions are needed to justify a restriction to stationary adversaries, as the following example shows (cf. also Example 7.3.2 in [12]).
Let , , We write for a transition probability distribution with . Then let , . , . Since there is only one action, there is only one strategy . For let such that . Then, if the adversary at the ’th step selects transition probabilities one obtains . For every stationary adversary the transition from to will be taken eventually with probability 1, so that here .
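The phenomenon behind this example can be illustrated numerically. The sketch below assumes, purely for illustration, that a non-stationary adversary chooses the probability 2^(-n) of leaving the initial state at step n, and compares this with a stationary adversary that always chooses the same positive probability. The function names and the concrete probabilities are hypothetical.

```python
def stay_probability_nonstationary(n_steps):
    """Probability of still being in the initial state after n_steps when the
    leaving probability at step n is 2**-n (a hypothetical adversary choice)."""
    prob = 1.0
    for n in range(1, n_steps + 1):
        prob *= 1.0 - 2.0 ** -n
    return prob

def stay_probability_stationary(eps, n_steps):
    """Probability of still being in the initial state after n_steps when the
    leaving probability is the same positive value eps at every step."""
    return (1.0 - eps) ** n_steps

nonstationary = stay_probability_nonstationary(60)
stationary = stay_probability_stationary(0.01, 5000)
```

Under the non-stationary choice the probability of never leaving the initial state stays bounded away from zero, whereas for any fixed positive leaving probability it tends to zero, so the transition is eventually taken with probability 1.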
We note that only in the case of the upper expected cost does the adversary act as an “adversary” to the strategy. In the case of the lower expected cost, the strategy and the adversary represent co-operative strategies. In other definitions of imprecise MDPs only the transition probabilities are set-valued. Here we also allow an imprecise cost function. Note, however, that for the definition of the lower and upper expected costs the adversary’s strategy will simply be to select the minimal (respectively maximal) possible costs, and that we can also obtain them as the expected lower/upper costs on IMDPs with point-valued cost functions given by the minimal (respectively maximal) costs,
where the adversary then has no choice regarding the costs.
3.1 Value Iteration for IMDPs
We now characterize the lower and upper expected costs as limits of value iteration, again following the strategy of the proof of Theorem 7.3.10 of [12]. In this case, the proof has to be adapted to accommodate the additional optimization of the adversary, and, as in Section 2.1, to allow for infinite costs. We again start by defining suitable operators on functions that are defined on the state space and take values in the extended non-negative reals:
where . The mapping
defines the of an adversary. Similarly
defines a strategy.
Let be the function that is constant 0 on . Denote
We can now state the applicability of value iteration for IMDPs as follows:
Theorem 3.1
Let . Then
We note that even though the operator, in contrast to the operator for EMDPs, now only needs to be computed over a finite state space, we do not obtain from Theorem 3.1 a fully specified algorithmic procedure for the computation of the lower and upper expected costs, because the optimization over the sets of transition probability distributions contained in (5) will require customized solutions that depend on the structure of these sets.
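As an illustration of such a customized solution, the following sketch treats the special case in which every set of transition distributions is given by probability intervals. This interval structure is an assumption made for this example only; the sets arising from induced IMDPs need not have this form. Under this assumption the inner optimization can be solved greedily by ordering the successor states by their current value. All names and the two-state example are hypothetical.

```python
def extreme_expectation(intervals, values, maximize):
    """Optimize sum_t p[t] * values[t] over all distributions p with
    lo <= p[t] <= hi for intervals[t] == (lo, hi); assumes feasibility."""
    p = {t: lo for t, (lo, hi) in intervals.items()}
    slack = 1.0 - sum(p.values())
    # Give the remaining mass to the most favorable successors first.
    for t in sorted(intervals, key=lambda t: values[t], reverse=maximize):
        add = min(intervals[t][1] - p[t], slack)
        p[t] += add
        slack -= add
    return sum(p[t] * values[t] for t in intervals)

def imdp_bounds(states, goal, actions, T, C, sweeps=500):
    """Value iteration for lower and upper expected costs, starting from 0.
    T[s][a] maps successors to intervals, C[s][a] is a (min, max) cost pair."""
    lower = {s: 0.0 for s in states}
    upper = {s: 0.0 for s in states}
    for _ in range(sweeps):
        new_lower, new_upper = {}, {}
        for s in states:
            if s in goal:
                new_lower[s] = new_upper[s] = 0.0
                continue
            # Lower bound: the adversary co-operates (minimal cost, minimizing p).
            new_lower[s] = min(
                C[s][a][0] + extreme_expectation(T[s][a], lower, False)
                for a in actions)
            # Upper bound: the adversary opposes (maximal cost, maximizing p).
            new_upper[s] = min(
                C[s][a][1] + extreme_expectation(T[s][a], upper, True)
                for a in actions)
        lower, upper = new_lower, new_upper
    return lower, upper

# Hypothetical example: one region "r", one action, goal reached with a
# probability somewhere in [0.4, 0.6], cost exactly 1 per step.
T = {"r": {"a": {"r": (0.4, 0.6), "goal": (0.4, 0.6)}}}
C = {"r": {"a": (1.0, 1.0)}}
lower, upper = imdp_bounds(["r", "goal"], {"goal"}, ["a"], T, C)
```

In the example the co-operative adversary chooses goal probability 0.6, giving a lower expected cost of 1 / 0.6, while the opposing adversary chooses 0.4, giving an upper expected cost of 1 / 0.4 = 2.5.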
4 Approximation by Partitioning
From now on we only consider EMDPs whose state space is a compact subset of Euclidean space. We approximate such a Euclidean MDP by IMDPs constructed from finite partitions of the state space. We call an element of a partition a region and shall assume that each region is Borel measurable. Every state belongs to a unique region of a given partition. The diameter of a region is the supremum of the distances between its points, and the granularity of a partition is defined as the maximal diameter of its regions. We say that a partition refines another partition if each of its regions is contained in some region of the other partition.
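The notions of region, diameter, granularity and refinement can be sketched for regular grid partitions of a rectangle as follows. The rectangular state space, the square cells, and all function names are assumptions made for this illustration; the theory applies to arbitrary finite measurable partitions.

```python
import math

def grid_partition(x_max, t_max, width):
    """Regular grid of cells ((x0, x1), (t0, t1)) covering [0, x_max] x [0, t_max]."""
    nx = math.ceil(x_max / width)
    nt = math.ceil(t_max / width)
    return [((i * width, (i + 1) * width), (j * width, (j + 1) * width))
            for i in range(nx) for j in range(nt)]

def diameter(region):
    """Largest distance between two points of a rectangular region."""
    (x0, x1), (t0, t1) = region
    return math.hypot(x1 - x0, t1 - t0)

def granularity(partition):
    """Maximal diameter over all regions of the partition."""
    return max(diameter(r) for r in partition)

def region_of(x, t, x_max, t_max, width):
    """Grid index (i, j) of the unique cell containing (x, t); points on the
    upper boundary of the state space are assigned to the last cell."""
    nx = math.ceil(x_max / width)
    nt = math.ceil(t_max / width)
    return min(int(x // width), nx - 1), min(int(t // width), nt - 1)

coarse = grid_partition(1.0, 1.0, 0.5)   # 4 regions
fine = grid_partition(1.0, 1.0, 0.25)    # 16 regions, refines the coarse grid
```

Halving the cell width yields a partition that refines the previous one and has half its granularity, which is the kind of nested sequence of grids considered in the experiments.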
Definition 6 (Induced IMDP)
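Let an EMDP be given, together with a finite partition of its state space that is consistent with the EMDP in the sense that every region is either contained in the set of goal states or disjoint from it. The IMDP defined by the EMDP and the partition then has the regions of the partition as its states, and the regions contained in the set of goal states as its goal states.
To each region and action it assigns the topological closure of the set of distributions over regions that are obtained as marginals of the transition distributions of the states in the region, and the topological closure of the set of costs incurred at the states in the region.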
The following theorem states how an induced IMDP approximates the underlying Euclidean MDP. In the following, we use sub-scripts on expectation operators to identify the (I)MDPs that define the expectations.
Theorem 4.1
Let and as in Definition 6. Then for all :
If , then improves the bounds in the sense that
Our goal now is to establish conditions under which the approximation (10) becomes arbitrarily tight for partitions of sufficiently high granularity. This will require certain continuity conditions for as spelled out in the following definition. In the following, stands for the total variation distance between distributions. Note that we will be using both for discrete distributions on partitions , and for continuous distributions on .
Definition 7 (Continuous Euclidean MDP)
A Euclidean MDP is continuous if
For each there exists , such that: for all partitions , if , then for all , , : .
is continuous on for all .
We observe that due to the assumed compactness of , the first condition of Definition 7 is satisfied if is defined as a function on that for each as a function of is continuous on , and such that is for all a density function relative to Lebesgue measure.
We next introduce some notation for -step expectations and distributions. In the following, we use to denote strategies for induced IMDPs defined on partitions , whereas is reserved for strategies defined on Euclidean state spaces . For a given partition and strategy for let denote two strategies for the adversary (to be interpreted as strategies that are close to achieving and , respectively, even though we will not explicitly require properties that derive from this interpretation). We then denote with the distributions defined by and on run prefixes of length , and with the corresponding expectations for the sum of the first costs . The and also depend on the initial state . To avoid notational clutter, we do not make this explicit in the notation. We then obtain the following approximation guarantee:
Theorem 4.2
Let be a continuous EMDP. For all , there exists , such that for all partitions with , and all strategies defined on :
Theorem 4.2 is a strengthening of Theorem 2 in [9]. The latter applied to processes that are guaranteed to terminate within a fixed number of steps. Our new theorem applies to the expected cost of a fixed number of initial steps in a process of unbounded length. When the process has a bounded time horizon of no more than that number of steps, and if we let the strategy and the adversaries be those that achieve the optima in (3) and (4), respectively, then (13) becomes
We conjecture that this actually also holds true for arbitrary EMDPs:
Let be a continuous Euclidean MDP. Let be a sequence of partitions consistent with such that . Then for all :
The approximation guarantees given by Theorems 4.1 and 4.2 have two important implications: first, they guarantee the correctness and asymptotic accuracy of upper/lower bounds computed by value iteration in IMDP abstractions of the underlying EMDP. Second, they show that the hypothesis space of strategies defined over finite partitions that underlies the reinforcement learning approach of [9] is adequate in the sense that it contains strategy representations that approximate the optimal strategy for the underlying continuous domain arbitrarily well.
5 Examples and Experiments
We now use our semi-random walker example to illustrate the theory presented in the preceding sections, and to demonstrate its applicability to the validation of machine learning models.
5.1 IMDP Value Iteration
We first illustrate experimentally the bounds and convergence properties expressed by Theorems 4.1 and 4.2. For this we consider a nested sequence of partitions of the continuous state space consisting of regular grid partitions defined by a width parameter for the regions. We run value iteration to compute and for the values . For illustration purposes, we plot expected cost functions along one-dimensional sections for the two fixed time points and .
Figure 3 shows the upper and lower expected costs that we obtain from the induced IMDPs. One can see how the intervals narrow with successive partition refinements. The bounds on the section are closer and converge more uniformly than on . This shows that in the upper left region of the state space () the adversary has a greater influence on the process than at the lower part of the state space (), and the difference between a cooperative and a non-cooperative adversary is more pronounced.
Ultimately, induced strategies are of greater interest than the concrete cost functions. Once upper and lower expectations define the same strategy, further refinement may not be necessary. Figure 4 illustrates for the whole state space the strategies obtained from the lower (Equation (3)) and upper (Equation (4)) approximations. On regions colored blue and yellow, both strategies agree to take the fast and slow actions, respectively. The regions colored light green are those where the lower bound strategy chooses the fast action, and the upper bound strategy the slow action. Conversely for the regions colored light red. One can observe how the blue and yellow areas increase in size with successive partition refinements. However, this growth is not entirely monotonic: for example, some regions in the upper left that for are yellow are sub-divided in successive refinements into regions that are partly yellow, partly light green.
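The color coding described above amounts to a simple comparison of the two strategies region by region. A minimal sketch, assuming that both strategies are available as dictionaries from region identifiers to the action names `"fast"` and `"slow"` (the encoding, the function name and the example data are hypothetical):

```python
# Color for each pair (action of lower-bound strategy, action of upper-bound strategy).
COLORS = {
    ("fast", "fast"): "blue",
    ("slow", "slow"): "yellow",
    ("fast", "slow"): "light green",
    ("slow", "fast"): "light red",
}

def classify_regions(lower_strategy, upper_strategy):
    """Map every region to the color encoding the agreement of the two strategies."""
    return {r: COLORS[(lower_strategy[r], upper_strategy[r])]
            for r in lower_strategy}

# Hypothetical strategies on three regions.
lower_strategy = {"r1": "fast", "r2": "slow", "r3": "fast"}
upper_strategy = {"r1": "fast", "r2": "slow", "r3": "slow"}
colors = classify_regions(lower_strategy, upper_strategy)
```

Regions classified as blue or yellow are those where further refinement is not needed for the purpose of fixing the action.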
5.2 Analysis of learned strategies
We now turn to partitions computed by the reinforcement learning method developed in [9], and a comparison of the learned cost functions and strategies with those obtained from the induced IMDPs. We have implemented the semi-random walker in UPPAAL STRATEGO and used reinforcement learning to learn partitions, cost functions and strategies. Our learning framework produces a sequence of refinements, based on sampling additional runs for each refinement. In the following we consider the models learned at two different stages of this refinement sequence.
Figure 5 illustrates expected cost functions for one of the learned partitions. One can observe a strong correlation between the bounds and the learned costs. Nevertheless, the learned cost function sometimes lies outside the given bounds. This is to be expected, since the random sampling process may produce data that is not sufficiently representative to estimate costs for some regions.
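Checking a learned cost function against the computed bounds is a simple per-region comparison. A minimal sketch, assuming that the bounds and the learned costs are given as dictionaries indexed by region identifiers (the encoding, the function name, and the numbers are hypothetical):

```python
def regions_outside_bounds(lower, upper, learned, tol=0.0):
    """Return the regions whose learned cost is not in [lower, upper],
    up to an optional numerical tolerance."""
    return [r for r in learned
            if not (lower[r] - tol <= learned[r] <= upper[r] + tol)]

# Hypothetical bounds and learned costs for three regions.
lower = {"r1": 2.0, "r2": 4.0, "r3": 0.0}
upper = {"r1": 3.0, "r2": 6.5, "r3": 1.0}
learned = {"r1": 2.4, "r2": 7.1, "r3": 0.5}
suspicious = regions_outside_bounds(lower, upper, learned)
```

Regions reported by such a check are candidates for additional sampling, since the guaranteed bounds show that their learned estimates cannot be exact.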
Turning again to the strategies obtained on the whole state space, we first note that the learned strategy at , which is shown in Figure 1 (right) exhibits an overall similarity with the strategies illustrated in Figure 4, with the fast action preferred along a diagonal region in the middle of the state space. To understand the differences between the learning and IMDP results, it is important to note that in the learning setting is taken to be the initial state of interest, and all sampling starts there. As a result, regions that are unlikely to be reached (under any choice of actions) from this initial state will obtain very little relevant data, and therefore unreliable cost estimates. This is not necessarily a disadvantage, if we want to learn an optimal control strategy for processes starting at . The value iteration process does not take into account the distinguished nature of .
Figure 6 provides a detailed picture of the consistency of the strategies learned at and with the strategies obtained from value iteration over the same partitions. Drawn in blue/yellow are those regions where the learned strategy picks the fast/slow action, and at least one of upper or lower bound strategies selects the same action. Light blue are those regions where the learned strategy chooses the fast action, but both IMDP strategies select slow. In a single region in the partition (drawn in light yellow) the learned strategy chooses the slow, while both IMDP strategies select fast. As Figure 6 shows, the areas of greatest discrepancies (light blue) are those in the top left and bottom right, which are unlikely to be reached from initial state .
In this paper we have developed theoretical foundations for the approximation of Euclidean MDPs by finite state space imprecise MDPs. We have shown that bounds on the cost function computed on the basis of the IMDP abstractions are correct, and that for bounded time horizons they converge to the exact costs when the IMDP abstractions are refined. We conjecture that this convergence also holds for the total cost of (potentially) infinite runs.
The results we obtained here provide theoretical underpinnings for the learning approach developed in [9]. Upper and lower bounds computed from induced IMDPs can be used to check the accuracy of learned value functions. As we have seen, data sparsity and sampling variance can make the learned cost functions fall outside computed bounds. One can also use value iteration on IMDP approximations directly as a tool for computing cost functions and strategies, which then would come with stronger guarantees than what we obtain through learning. However, compared to the learning approach, this has important limitations: first, we will usually only obtain a partial strategy that is uniquely defined only where upper and lower bounds lead to the same actions. Second, we will require a full model of the underlying EMDP, from which IMDP abstractions then can be derived, and the optimization problem over adversaries that is part of the value iteration process must be tractable. Reinforcement learning, on the other hand, can also be applied to black box systems, and its computational complexity is essentially independent of the complexities of the underlying dynamic system.
- [1] P. Billingsley. Probability and Measure. John Wiley, second edition, 1986.
- [2] T. Chen, T. Han, and M. Kwiatkowska. On the complexity of model checking interval-valued discrete time Markov chains. Information Processing Letters, 113(7):210–216, 2013.
- [3] D. L. Cohn. Measure Theory. Birkhäuser, 1980.
- [4] R. Crossman, P. Coolen-Schrijner, D. Škulj, and F. Coolen. Imprecise Markov chains with an absorbing state. In Proceedings of the Sixth International Symposium on Imprecise Probability: Theories and Applications (ISIPTA), pages 119–128. Citeseer, 2009.
- [5] A. David, P. G. Jensen, K. G. Larsen, M. Mikučionis, and J. H. Taankvist. Uppaal Stratego. In TACAS 2015, pages 206–211. Springer, 2015.
- [6] S. Derisavi, H. Hermanns, and W. H. Sanders. Optimal state-space lumping in Markov chains. Information Processing Letters, 87(6):309–315, 2003.
- [7] A. Erreygers and J. De Bock. Computing inferences for large-scale continuous-time Markov chains by combining lumping with imprecision. In International Conference Series on Soft Methods in Probability and Statistics, pages 78–86. Springer, 2018.
- [8] H. Itoh and K. Nakamura. Partially observable Markov decision processes with imprecise parameters. Artificial Intelligence, 171(8-9):453–490, 2007.
- [9] M. Jaeger, P. G. Jensen, K. G. Larsen, A. Legay, S. Sedwards, and J. H. Taankvist. Teaching Stratego to play ball: Optimal synthesis for continuous space MDPs. In International Symposium on Automated Technology for Verification and Analysis, pages 81–97. Springer, 2019.
- [10] M. Z. Kwiatkowska, G. Norman, and D. Parker. Game-based abstraction for Markov decision processes. In QEST 2006, pages 157–166. IEEE Computer Society, 2006. ISBN 0-7695-2665-9. doi: 10.1109/QEST.2006.19.
- [11] Y. Z. Lun, J. Wheatley, A. D’Innocenzo, and A. Abate. Approximate abstractions of Markov chains with interval decision processes. In A. Abate, A. Girard, and M. Heemels, editors, ADHS 2018, volume 51 of IFAC-PapersOnLine, pages 91–96. Elsevier, 2018. doi: 10.1016/j.ifacol.2018.08.016.
- [12] M. L. Puterman. Markov Decision Processes. Wiley, 2005.
- [13] G. Rubino and B. Sericola. A finite characterization of weak lumpable Markov processes. Part I: The discrete time case. Stochastic Processes and their Applications, 38(2):195–204, 1991.
- [14] M. Troffaes, J. Gledhill, D. Škulj, and S. Blake. Using imprecise continuous time Markov chains for assessing the reliability of power networks with common cause failure and non-immediate repair. SIPTA, 2015.
- [15] C. C. White III and H. K. Eldeib. Markov decision processes with imprecise transition probabilities. Operations Research, 42(4):739–749, 1994.
Appendix 0.A Total Variation Distance
The following lemma collects some basic facts about total variation distance:
Let be a finite set, and be distributions on with .
Let functions on with values in and for all . Then
where denote expectation under and , respectively.
For each let be distributions on a space (discrete or continuous), such that for all . Then
For A we write
then (16) follows. The proof for B is very similar:
Using the definition of total variation as the first term on the right can be bounded by , and the second by .
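The kind of bound collected in part A of the lemma can be checked numerically on small examples. The sketch below assumes the convention that total variation distance is half the L1 distance between the two distributions, and checks the bound obtained from the triangle inequality under that convention: if two functions with values in [0, R] differ pointwise by at most δ, their expectations under two distributions differ by at most δ plus R times the total variation distance. The constants in the lemma may differ if another convention is used. All names and numbers are hypothetical.

```python
def total_variation(p, q):
    """Total variation distance, here taken as half the L1 distance."""
    return 0.5 * sum(abs(p[s] - q[s]) for s in p)

def expectation(p, f):
    """Expectation of the function f under the discrete distribution p."""
    return sum(p[s] * f[s] for s in p)

# Hypothetical distributions and functions on a three-element set.
p = {"a": 0.5, "b": 0.3, "c": 0.2}
q = {"a": 0.4, "b": 0.3, "c": 0.3}
f = {"a": 0.0, "b": 2.0, "c": 4.0}
g = {"a": 0.1, "b": 1.9, "c": 4.0}

R = 4.0                                   # common upper bound on f and g
delta = max(abs(f[s] - g[s]) for s in f)  # pointwise distance of f and g
gap = abs(expectation(p, f) - expectation(q, g))
bound = delta + R * total_variation(p, q)
```

Here the first summand of the bound accounts for replacing one function by the other under a fixed distribution, and the second for replacing one distribution by the other.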
Appendix 0.B Proofs
For each , is measurable according to the measurability condition on . It follows that also is measurable for every . Since is the supremum of the , it is measurable [3, Proposition 2.1.4].
For fixed , is a probability measure on by construction. To show that for fixed the function is measurable, we only need to consider the case of singletons . By the measurability of we can express as the supremum of a monotone increasing sequence of simple measurable functions [1, Theorem 13.5] (recall that a simple function is a finite weighted sum of indicator functions of measurable sets). For each the integral then decomposes into a weighted sum of integrals of the form , which are measurable in according to Definition 1. Finally,
By [1, Theorem 13.5] we can express as the supremum of a monotone increasing sequence of simple measurable functions . For each the integral then decomposes into a weighted sum of integrals of the form , for some measurable set , which are measurable according to Definition 1. Since , by the monotone convergence theorem [1, Theorem 16.2],
From the above, measurability of follows from the measurability of , for all , and of minima of measurable functions [1, Theorem 13.4]. Measurability of follows similarly by additionally noticing that for any strategy , the -valued function is measurable, for all .
follows by noticing that , where ranges only over deterministic strategies. To establish the reverse equality, notice that, for all and , .
Thus, , for all strategies . From this we obtain . ∎
We have to show that the following holds for all states :
By monotone convergence theorem and linearity of the integral, we have