Markov decision processes (MDPs) are a widely used mathematical framework for control design in stochastic environments, e.g., density control of a swarm of agents (Açikmeşe and Bayard, 2012; Demir et al., 2015). The framework is also fundamental to reinforcement learning, robotic motion planning, and stochastic games (Filar and Vrieze, 2012; Li et al., 2019). An MDP can be solved for different objectives, including minimum average cost, minimum discounted cost, and reachability, among others (Puterman, 2014). Given an objective, solving an MDP is equivalent to computing the optimal policy and the optimal value function of a decision maker over the state space. Among the different algorithms for computing the optimal policy, most are based on the Bellman equation, which characterizes the optimal value function as its fixed point.
In applications, it is common to encounter MDPs with uncertainties. When modeling an environment as a stochastic process, sampling techniques are often used to determine process parameters such as costs or transition probabilities; such models are inherently uncertain. Another source of uncertainty arises in stochastic games, where the cost and probability parameters change with respect to another decision maker's strategy. While existing works focus on specific perturbations in MDPs (Bielecki and Filar, 1991; Altman and Gaitsgory, 1993; Abbad and Filar, 1992), these results do not generalize to the analysis of the overall behaviour of an MDP under all possible cost parameters in a compact set.
Additionally, how uncertainty in MDP cost parameters affects the outcome of value-iteration-type methods is not well studied. Dynamic programming on bounded MDPs is studied in Givan et al. (2000) specifically for interval sets; however, convergence over general compact sets is not considered. While computation of the fixed points of the Bellman operator is the topic of numerous studies (Delage and Mannor, 2010), most focus on the convergence analysis of value iteration and its stopping criteria (Ashok et al., 2017; Eisentraut et al., 2019); they do not consider the relationship between bounds on the optimal value function and the uncertainty in cost. Similarly motivated, Haddad and Monmege (2018) analyze entry-wise uncertain transition kernels by using graph-based MDP transformations. While we also derive bounds on an MDP due to uncertain parameters, we differ in our approach: our set-based framework allows for direct extraction of the value iteration trajectories with respect to the set of cost parameters. This differentiates our work from Haddad and Monmege (2018), whose graphical abstraction of the MDP allows for derivation of bounds but not extraction of value function trajectories.
Contributions: We characterize the solutions of a family of MDPs at once, represented as sets of MDPs. More specifically, we: (i) develop a characterization of MDPs with uncertain cost parameters; (ii) propose a set-based Bellman operator over non-empty compact sets; (iii) establish that this set-based Bellman operator is a contraction and admits a unique compact fixed point set.
2 Review of MDPs and Bellman Operator
Notation: Sets of elements are given by $\{x_1, \dots, x_n\}$. We denote the set of matrices of $m$ rows and $n$ columns with real (non-negative) valued entries as $\mathbb{R}^{m \times n}$ ($\mathbb{R}^{m \times n}_{+}$). Elements of sets and matrices are denoted by capital letters, $X$, while sets are denoted by cursive letters, $\mathcal{X}$. The ones column vector is denoted by $\mathbb{1}$.
We consider a discounted infinite-horizon MDP defined by the tuple $(\mathcal{S}, \mathcal{A}, P, C, \gamma)$ for a decision maker, where
$\mathcal{S}$ denotes the finite set of states, with $S = |\mathcal{S}|$.
$\mathcal{A}$ denotes the finite set of actions, with $A = |\mathcal{A}|$. Without loss of generality, we assume that every action $a \in \mathcal{A}$ is admissible from each state $s \in \mathcal{S}$.
$P \in \mathbb{R}^{S \times SA}$ denotes the transition kernel. Each component $P_{s', (s,a)}$ is the probability of arriving in state $s'$ by taking state-action pair $(s, a)$. Matrix $P$ is column stochastic and element-wise non-negative, i.e., $P_{s',(s,a)} \ge 0$ and $\sum_{s' \in \mathcal{S}} P_{s',(s,a)} = 1$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}$.
$C \in \mathbb{R}^{S \times A}$ denotes the cost $C_{sa}$ of each state-action pair $(s, a)$.
$\gamma \in (0, 1)$ denotes the discount factor.
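As a concrete illustration of this tuple, the data can be encoded with numpy arrays; the toy instance below uses illustrative values of our own, not numbers from the paper:

```python
import numpy as np

# Toy MDP with S = 2 states and A = 2 actions (illustrative values only).
S, A = 2, 2
gamma = 0.9  # discount factor in (0, 1)

# Transition kernel P of shape (S, S*A): column s*A + a holds the
# distribution over next states for state-action pair (s, a).
P = np.array([
    [0.8, 0.3, 0.2, 0.5],  # probability of landing in state 0
    [0.2, 0.7, 0.8, 0.5],  # probability of landing in state 1
])

# Cost of each state-action pair, shape (S, A).
C = np.array([
    [1.0, 2.0],
    [0.5, 3.0],
])

# P is column stochastic and element-wise non-negative.
assert np.allclose(P.sum(axis=0), 1.0) and (P >= 0).all()
```

The flattened column index `s*A + a` mirrors the paper's indexing of $P$ by state-action pairs.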
At each time step $t$, the decision maker chooses an action $a_t$ based on its current state $s_t$. The state-action pair $(s_t, a_t)$ induces a probability distribution $P_{\cdot, (s_t, a_t)}$ over next states, where $P_{s', (s_t, a_t)}$ is the probability that the decision maker arrives at state $s'$ at time step $t+1$. The state-action pair also induces a cost $C_{s_t a_t}$ that must be paid by the decision maker.
At each time step, the decision maker chooses its action according to a policy that dictates the action chosen at each state $s \in \mathcal{S}$. We denote a policy as a function $\pi: \mathcal{S} \times \mathcal{A} \to [0, 1]$, where $\pi(s, a)$ denotes the probability that action $a$ is chosen at state $s$. We also write $\pi_s$ in shorthand for the probability distribution vector over the action space at state $s$. We denote the set of all feasible policies of an MDP by $\Pi$. In our context, it suffices to consider only deterministic, stationary policies, i.e., $\pi$ is a time-invariant function that returns $\pi(s, a) = 1$ for exactly one action and $\pi(s, a) = 0$ for all other possible actions. The set of all such policies is denoted by $\Pi_D$.
We denote the policy matrix induced by a policy $\pi$ as $M^{\pi} \in \mathbb{R}^{SA \times S}$, where $M^{\pi}_{(s,a), s'} = \pi(s, a)$ if $s = s'$ and $0$ otherwise.
For an MDP $(\mathcal{S}, \mathcal{A}, P, C, \gamma)$, we are interested in minimizing the discounted infinite-horizon expected cost, defined with respect to a policy $\pi$ as
$$V_s(\pi, C) = \mathbb{E}^{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} C_{s_t a_t} \;\Big|\; s_0 = s\Big],$$
where $\gamma$ is the discount factor of future cost, $s_t$ and $a_t$ are the state and action taken at time step $t$, and $s$ is the state that the decision maker starts from at $t = 0$.
The minimum expected cost $V^{\star} = \min_{\pi \in \Pi} V(\pi, C)$ is called the optimal value function. The policy $\pi^{\star}$ that achieves the optimal value function is called an optimal policy. In general, $V^{\star}$ is unique while $\pi^{\star}$ is not. It is well known that the set of optimal policies always includes at least one deterministic stationary policy (Puterman, 2014, Thm 6.2.11), i.e., for each $s \in \mathcal{S}$, $\pi^{\star}(s, \cdot)$ returns $1$ for exactly one action and $0$ for all other possible actions.
2.2 Bellman Operator
Determining the optimal value function of a given MDP is equivalent to solving for the fixed point of the associated Bellman operator, for which a myriad of techniques exist (Puterman, 2014). We introduce the Bellman operator here and relate its fixed point to the corresponding MDP problem.
[Standard Bellman Operator] For a discounted infinite-horizon MDP $(\mathcal{S}, \mathcal{A}, P, C, \gamma)$, its associated Bellman operator $f_C: \mathbb{R}^{S} \to \mathbb{R}^{S}$ is given component-wise by
$$[f_C(V)]_s = \min_{a \in \mathcal{A}} \Big\{ C_{sa} + \gamma \sum_{s' \in \mathcal{S}} P_{s', (s,a)} V_{s'} \Big\}, \quad \forall\, s \in \mathcal{S}.$$
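A minimal numerical sketch of one application of this operator, using a flattened (state, action) column indexing for the transition kernel (variable names and the toy data are our own):

```python
import numpy as np

def bellman(V, C, P, gamma):
    """One application of the Bellman operator f_C.

    V: value function, shape (S,)
    C: cost matrix, shape (S, A)
    P: transition kernel, shape (S, S*A); column s*A + a is the
       next-state distribution for state-action pair (s, a).
    Returns the updated value function, shape (S,).
    """
    S, A = C.shape
    # Q[s, a] = C[s, a] + gamma * sum_{s'} P[s', (s, a)] * V[s']
    Q = C + gamma * (P.T @ V).reshape(S, A)
    return Q.min(axis=1)

# Toy instance (illustrative numbers, not from the paper).
P = np.array([[0.8, 0.3, 0.2, 0.5],
              [0.2, 0.7, 0.8, 0.5]])
C = np.array([[1.0, 2.0],
              [0.5, 3.0]])
V = np.zeros(2)
print(bellman(V, C, P, 0.9))  # from V = 0, one update returns the
                              # row-wise minimum cost: prints [1.  0.5]
```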
The fixed point of the Bellman operator is a value function that is invariant under the operator. [Fixed Point] Let $h$ be an operator on the metric space $(X, d)$; $x^{\star} \in X$ is a fixed point of $h$ if it satisfies $h(x^{\star}) = x^{\star}$.
In order to show that the Bellman operator has a unique fixed point, we consider the following operator property. [Contraction Operator] Let $(X, d)$ be a complete metric space. An operator $h: X \to X$ is a contracting operator if there exists $\alpha \in [0, 1)$ such that
$$d(h(x), h(y)) \le \alpha\, d(x, y), \quad \forall\, x, y \in X.$$
The Bellman operator is known to be a contraction operator on the complete metric space $(\mathbb{R}^{S}, \|\cdot\|_{\infty})$. From the Banach fixed point theorem (Puterman, 2014), it has a unique fixed point. Because the optimal value function $V^{\star}$ is given by the unique fixed point of the associated Bellman operator, we use the terms optimal value function and fixed point of $f_C$ interchangeably.
In addition to obtaining $V^{\star}$, MDPs are also solved to determine the optimal policy $\pi^{\star}$. We note that because every feasible policy induces a Markov chain, every feasible policy $\pi \in \Pi$ also induces a unique stationary value function $V^{\pi}$ which satisfies
$$V^{\pi} = (M^{\pi})^{\top} C + \gamma\, (P M^{\pi})^{\top} V^{\pi},$$
where $M^{\pi}$ is the policy matrix induced by the feasible policy $\pi$ and $C$ is viewed as a vector in $\mathbb{R}^{SA}$. Given a feasible policy $\pi$, we can equivalently solve for the stationary value function as $V^{\pi} = \big(I - \gamma (P M^{\pi})^{\top}\big)^{-1} (M^{\pi})^{\top} C$. From this perspective, the optimal value function $V^{\star}$ is the minimum vector among the finite set of stationary value functions generated by the set of all policies $\Pi_D$.
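For a fixed deterministic policy, the stationary value function can be obtained by solving this linear system directly rather than by iterating; a sketch under a toy encoding (data and names are our own):

```python
import numpy as np

def policy_value(policy, C, P, gamma):
    """Solve V = C_pi + gamma * P_pi @ V for a deterministic policy.

    policy: array of shape (S,), the chosen action in each state.
    """
    S, A = C.shape
    # Cost incurred under the policy: C_pi[s] = C[s, policy[s]].
    C_pi = C[np.arange(S), policy]
    # Induced Markov chain: P_pi[s, s'] = P[s', (s, policy[s])].
    P_pi = P[:, np.arange(S) * A + policy].T
    # Stationary value function: V = (I - gamma * P_pi)^{-1} C_pi.
    return np.linalg.solve(np.eye(S) - gamma * P_pi, C_pi)

P = np.array([[0.8, 0.3, 0.2, 0.5],
              [0.2, 0.7, 0.8, 0.5]])
C = np.array([[1.0, 2.0],
              [0.5, 3.0]])
V_pi = policy_value(np.array([0, 0]), C, P, 0.9)
print(V_pi)
```

Enumerating all $A^S$ deterministic policies and taking the entry-wise minimum of the resulting stationary value functions recovers the optimal value function, as the text notes.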
From the optimal value function $V^{\star}$, we can also derive a deterministic optimal policy from the Bellman operator as
$$\pi^{\star}(s) \in \operatorname*{argmin}_{a \in \mathcal{A}} \Big\{ C_{sa} + \gamma \sum_{s' \in \mathcal{S}} P_{s', (s,a)} V^{\star}_{s'} \Big\}, \quad \forall\, s \in \mathcal{S}. \qquad (4)$$
While the optimal policy does not need to be deterministic and stationary, the optimal policy derived from (4) will always be deterministic.
2.3 Termination Criteria for Value Iteration
Among the different algorithms to determine the fixed point of the Bellman operator, value iteration (VI) is a commonly used and simple technique in which the Bellman operator is iteratively applied until the optimal value function is reached, i.e., starting from any value function $V_0 \in \mathbb{R}^{S}$ and for $k \ge 0$, we apply
$$V_{k+1} = f_C(V_k). \qquad (5)$$
The iteration scheme given by (5) converges to the fixed point $V^{\star}$ of the corresponding discounted infinite-horizon MDP. The stopping criterion of VI can be derived from an over-approximation of the distance to the optimal value function. (Puterman, 2014, Thm. 6.3.1) For any initial value function $V_0$, let $\{V_k\}_{k \ge 0}$ be the value function trajectory from (5). Whenever there exists $k$ such that $\|V_{k+1} - V_k\|_{\infty} < \epsilon(1-\gamma)/(2\gamma)$, then $V_{k+1}$ is within $\epsilon/2$ of the fixed point $V^{\star}$, i.e., $\|V_{k+1} - V^{\star}\|_{\infty} < \epsilon/2$. Lemma 2.3 connects relative convergence of the sequence $\{V_k\}$ to absolute convergence towards $V^{\star}$ by showing that the former implies the latter.
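This lemma translates directly into a stopping rule for the VI loop: iterate until the successive difference drops below $\epsilon(1-\gamma)/(2\gamma)$. A sketch on toy data (names and numbers are our own):

```python
import numpy as np

def value_iteration(C, P, gamma, eps=1e-6):
    """Iterate V_{k+1} = f_C(V_k) until ||V_{k+1} - V_k|| falls below
    eps * (1 - gamma) / (2 * gamma), which guarantees the returned
    value function is within eps/2 of the fixed point V*."""
    S, A = C.shape
    V = np.zeros(S)
    threshold = eps * (1 - gamma) / (2 * gamma)
    while True:
        V_next = (C + gamma * (P.T @ V).reshape(S, A)).min(axis=1)
        if np.max(np.abs(V_next - V)) < threshold:
            return V_next
        V = V_next

P = np.array([[0.8, 0.3, 0.2, 0.5],
              [0.2, 0.7, 0.8, 0.5]])
C = np.array([[1.0, 2.0],
              [0.5, 3.0]])
V_star = value_iteration(C, P, 0.9)
# V_star is (nearly) invariant under one more Bellman update.
```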
In general, the stopping criteria differ for different MDP objectives (see Haddad and Monmege (2018) for recent results on stopping criteria for reachability).
3 Set-based Bellman Operator
The standard Bellman operator with respect to a fixed cost parameter $C$ is well studied. Motivated by a family of MDPs corresponding to a compact set of cost parameters, with all other problem data remaining identical, we lift the Bellman operator to operate on sets rather than individual vectors in $\mathbb{R}^{S}$. For the set-based operator, we analyze its set-based domain and prove that it is a contraction operator. We also prove the existence of a unique fixed point set of the set-based Bellman operator and relate its properties to the fixed points of the standard Bellman operator.
3.1 Set-based operator properties
We define a new metric space $(\mathcal{H}(\mathbb{R}^{S}), d_H)$ based on the Banach space $(\mathbb{R}^{S}, \|\cdot\|_{\infty})$ to serve as our set-based operator domain (Rudin and others, 1964), where $\mathcal{H}(\mathbb{R}^{S})$ is the collection of non-empty compact subsets of $\mathbb{R}^{S}$ equipped with the partial order: for $\mathcal{U}, \mathcal{V} \in \mathcal{H}(\mathbb{R}^{S})$, $\mathcal{U} \le \mathcal{V}$ if $\mathcal{U} \subseteq \mathcal{V}$. The metric $d_H$ is the Hausdorff distance (Henrikson, 1999) defined as
$$d_H(\mathcal{U}, \mathcal{V}) = \max\Big\{ \sup_{U \in \mathcal{U}} \inf_{V \in \mathcal{V}} \|U - V\|_{\infty},\; \sup_{V \in \mathcal{V}} \inf_{U \in \mathcal{U}} \|U - V\|_{\infty} \Big\}.$$
Since $(\mathbb{R}^{S}, \|\cdot\|_{\infty})$ is a complete metric space, $\mathcal{H}(\mathbb{R}^{S})$ is a complete metric space with respect to $d_H$.
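For finite sets of value functions, the suprema and infima in the Hausdorff distance reduce to max/min over pairwise distances; a small sketch using the sup norm (our own helper, not code from the paper):

```python
import numpy as np

def hausdorff(X, Y):
    """Hausdorff distance between two finite point sets X, Y,
    given as arrays of shape (n, d) and (m, d), in the sup norm."""
    # Pairwise sup-norm distances: D[i, j] = ||X[i] - Y[j]||_inf.
    D = np.abs(X[:, None, :] - Y[None, :, :]).max(axis=2)
    # Max over X of the distance to Y, and symmetrically.
    return max(D.min(axis=1).max(), D.min(axis=0).max())

X = np.array([[0.0, 0.0], [1.0, 0.0]])
Y = np.array([[0.0, 0.5]])
print(hausdorff(X, Y))  # prints 1.0
```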
(Henrikson, 1999, Thm 3.3) If $(X, d)$ is a complete metric space, then its induced Hausdorff metric space $(\mathcal{H}(X), d_H)$ is a complete metric space. On the metric space $(\mathcal{H}(\mathbb{R}^{S}), d_H)$, we define a set-based Bellman operator. [Set-based Bellman Operator] For a family of MDP problems $(\mathcal{S}, \mathcal{A}, P, \mathcal{C}, \gamma)$, where $\mathcal{C} \subset \mathbb{R}^{S \times A}$ is a compact set, its associated set-based Bellman operator $F_{\mathcal{C}}$ is given by
$$F_{\mathcal{C}}(\mathcal{V}) = \mathrm{cl}\Big( \bigcup_{C \in \mathcal{C},\, V \in \mathcal{V}} \{ f_C(V) \} \Big),$$
where $\mathrm{cl}(\cdot)$ is the closure operator. As we take the union of uncountably many bounded sets, the resulting set may not be bounded, and therefore it is not immediately obvious that $F_{\mathcal{C}}$ maps into the metric space $(\mathcal{H}(\mathbb{R}^{S}), d_H)$. We show this is true in Proposition 3.1. If $\mathcal{C}$ is compact, then $F_{\mathcal{C}}(\mathcal{V}) \in \mathcal{H}(\mathbb{R}^{S})$ for all $\mathcal{V} \in \mathcal{H}(\mathbb{R}^{S})$. For a non-empty subset $\mathcal{X}$ of some finite-dimensional real vector space, let us denote its diameter by $\mathrm{diam}(\mathcal{X}) = \sup_{X, Y \in \mathcal{X}} \|X - Y\|$. The diameter of any compact set in a metric space is bounded.
We take any non-empty compact set $\mathcal{V} \in \mathcal{H}(\mathbb{R}^{S})$. As $F_{\mathcal{C}}(\mathcal{V}) \subseteq \mathbb{R}^{S}$, it suffices to prove that $F_{\mathcal{C}}(\mathcal{V})$ is closed and bounded. Closedness is guaranteed by the closure operator. A subset of a metric space is bounded iff its closure is bounded. Hence, to prove boundedness, it suffices to prove that $\mathrm{diam}\big(\bigcup_{C \in \mathcal{C},\, V \in \mathcal{V}} \{f_C(V)\}\big) < \infty$. Consider any two cost-value function pairs $(C, V), (C', V') \in \mathcal{C} \times \mathcal{V}$; they must satisfy
$$\|f_C(V) - f_{C'}(V')\|_{\infty} \le \|f_C(V) - f_{C'}(V)\|_{\infty} + \|f_{C'}(V) - f_{C'}(V')\|_{\infty},$$
where the norm of the second term must be upper bounded by $\gamma \|V - V'\|_{\infty}$ due to the contraction property of $f_{C'}$. Let $\pi$ be the optimal strategy corresponding to $f_{C'}(V)$. For the first term, we then have, component-wise,
$$[f_C(V)]_s - [f_{C'}(V)]_s \le C_{s\pi(s)} - C'_{s\pi(s)} \le \|C - C'\|_{\infty},$$
and a symmetric argument with the optimal strategy of $f_C(V)$ yields $\|f_C(V) - f_{C'}(V)\|_{\infty} \le \|C - C'\|_{\infty}$.
Therefore, we have that $\|f_C(V) - f_{C'}(V')\|_{\infty} \le \|C - C'\|_{\infty} + \gamma \|V - V'\|_{\infty}$. Since this holds for all $(C, V), (C', V') \in \mathcal{C} \times \mathcal{V}$, $\mathrm{diam}\big(\bigcup_{C \in \mathcal{C},\, V \in \mathcal{V}} \{f_C(V)\}\big) \le \mathrm{diam}(\mathcal{C}) + \gamma\, \mathrm{diam}(\mathcal{V}) < \infty$, as $\mathcal{C}$ and $\mathcal{V}$ are bounded. ∎ Proposition 3.1 shows that $F_{\mathcal{C}}$ is an operator from $\mathcal{H}(\mathbb{R}^{S})$ to $\mathcal{H}(\mathbb{R}^{S})$. Having established the space on which it operates, we can draw many parallels between $F_{\mathcal{C}}$ and $f_C$. Similar to the existence of a unique fixed point $V^{\star}$ for the operator $f_C$, we consider whether a fixed point set $\mathcal{V}^{\star}$ of $F_{\mathcal{C}}$, which satisfies $F_{\mathcal{C}}(\mathcal{V}^{\star}) = \mathcal{V}^{\star}$, exists, and if it is unique. To take the comparison further, since $V^{\star}$ is the optimal value function for the MDP problem defined by $C$, how does $\mathcal{V}^{\star}$ relate to the family of optimal solutions that corresponds to the MDP family $(\mathcal{S}, \mathcal{A}, P, \mathcal{C}, \gamma)$?
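While the set-based operator acts on uncountable sets, its action can be approximated numerically by applying the standard Bellman operator to finitely many sampled cost-value pairs and collecting the images; the sketch below is our own finite approximation, not an exact implementation of the operator:

```python
import numpy as np

def bellman(V, C, P, gamma):
    """Standard Bellman operator f_C for one cost matrix C."""
    S, A = C.shape
    return (C + gamma * (P.T @ V).reshape(S, A)).min(axis=1)

def set_bellman_sampled(V_samples, C_samples, P, gamma):
    """Finite approximation of the set-based Bellman operator:
    apply f_C to every sampled pair (C, V) and stack the results."""
    return np.array([bellman(V, C, P, gamma)
                     for C in C_samples for V in V_samples])

P = np.array([[0.8, 0.3, 0.2, 0.5],
              [0.2, 0.7, 0.8, 0.5]])
# Two costs sampled from a compact cost set (illustrative values).
C_samples = [np.array([[1.0, 2.0], [0.5, 3.0]]),
             np.array([[1.2, 1.8], [0.7, 2.9]])]
V_samples = [np.zeros(2), np.ones(2)]
out = set_bellman_sampled(V_samples, C_samples, P, 0.9)
print(out.shape)  # prints (4, 2): one value function per (C, V) pair
```

Refining both sample grids gives progressively better finite approximations of the image set before closure.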
To prove the unique existence of $\mathcal{V}^{\star}$, we utilize the Banach fixed point theorem (Puterman, 2014), which states that a unique fixed point must exist for every contraction operator on a complete metric space. First, we show that $F_{\mathcal{C}}$ is a contraction, as defined in Definition 2.2, on the complete metric space $(\mathcal{H}(\mathbb{R}^{S}), d_H)$.
For any $\gamma \in (0, 1)$ and $\mathcal{C}$ closed and bounded, $F_{\mathcal{C}}$ is a contracting operator under the Hausdorff distance.
Consider $\mathcal{U}, \mathcal{V} \in \mathcal{H}(\mathbb{R}^{S})$; to see that $F_{\mathcal{C}}$ is a contraction, we need to show
$$d_H\big(F_{\mathcal{C}}(\mathcal{U}), F_{\mathcal{C}}(\mathcal{V})\big) \le \gamma\, d_H(\mathcal{U}, \mathcal{V}).$$
First we note that taking the supremum (infimum) of a continuous function over a set is equivalent to taking the supremum (infimum) over the closure of that set. Let $\mathcal{U}' = \bigcup_{C \in \mathcal{C},\, U \in \mathcal{U}} \{f_C(U)\}$ and $\mathcal{V}' = \bigcup_{C \in \mathcal{C},\, V \in \mathcal{V}} \{f_C(V)\}$; then due to continuity of norms (Rudin and others, 1964, Thm 4.16), $d_H\big(\mathrm{cl}(\mathcal{U}'), \mathrm{cl}(\mathcal{V}')\big) = d_H(\mathcal{U}', \mathcal{V}')$.
Therefore, it suffices to prove
$$d_H(\mathcal{U}', \mathcal{V}') \le \gamma\, d_H(\mathcal{U}, \mathcal{V}).$$
For any $C \in \mathcal{C}$ and $U \in \mathcal{U}$,
$$\begin{aligned}
\inf_{\hat{V} \in \mathcal{V}'} \|f_C(U) - \hat{V}\|_{\infty}
&= \inf_{C' \in \mathcal{C},\, V \in \mathcal{V}} \big\| f_C(U) - f_{C'}(V) \big\|_{\infty} &\text{(9a)}\\
&= \inf_{C' \in \mathcal{C},\, V \in \mathcal{V}} \big\| (M^{\pi})^{\top} C + \gamma (P M^{\pi})^{\top} U - (M^{\pi'})^{\top} C' - \gamma (P M^{\pi'})^{\top} V \big\|_{\infty} &\text{(9b)}\\
&\le \inf_{C' \in \mathcal{C},\, V \in \mathcal{V}} \big\| (M^{\pi})^{\top} (C - C') + \gamma (P M^{\pi})^{\top} (U - V) \big\|_{\infty} &\text{(9c)}\\
&\le \inf_{C' \in \mathcal{C},\, V \in \mathcal{V}} \big\| (M^{\pi})^{\top} (C - C') \big\|_{\infty} + \gamma \big\| (P M^{\pi})^{\top} (U - V) \big\|_{\infty} &\text{(9d)}\\
&\le \inf_{V \in \mathcal{V}} \gamma \big\| (P M^{\pi})^{\top} (U - V) \big\|_{\infty} &\text{(9e)}\\
&\le \gamma \inf_{V \in \mathcal{V}} \| U - V \|_{\infty}, &\text{(9f)}
\end{aligned}$$
where $\pi$ corresponds to the optimal policy for the MDP with cost $C$ at $U$ and $\pi'$ corresponds to the optimal policy for the MDP with cost $C'$ at $V$ in (9b). In (9c) we replaced $\pi'$ by $\pi$ by noting that $\pi'$ is optimal, therefore $\pi$ must result in a larger value function (similar to the proof of Prop. 3.1). In (9e) we note that the infimum of the expression over the set $\mathcal{C}$ must be upper bounded by its value at $C' = C$ when $C \in \mathcal{C}$. In (9f), we used the fact that $P M^{\pi}$ is column stochastic, so $\|(P M^{\pi})^{\top} (U - V)\|_{\infty} \le \|U - V\|_{\infty}$.
Taking the supremum over $C \in \mathcal{C}$ and $U \in \mathcal{U}$,
$$\sup_{\hat{U} \in \mathcal{U}'} \inf_{\hat{V} \in \mathcal{V}'} \|\hat{U} - \hat{V}\|_{\infty} \le \gamma \sup_{U \in \mathcal{U}} \inf_{V \in \mathcal{V}} \|U - V\|_{\infty} \le \gamma\, d_H(\mathcal{U}, \mathcal{V}),$$
and the symmetric bound follows by exchanging the roles of $\mathcal{U}$ and $\mathcal{V}$.
Therefore $d_H\big(F_{\mathcal{C}}(\mathcal{U}), F_{\mathcal{C}}(\mathcal{V})\big) \le \gamma\, d_H(\mathcal{U}, \mathcal{V})$. Since $\gamma \in (0, 1)$, $F_{\mathcal{C}}$ is a contracting operator on $(\mathcal{H}(\mathbb{R}^{S}), d_H)$. ∎
The contraction property of $F_{\mathcal{C}}$ implies that repeated application of the operator to any $\mathcal{V} \in \mathcal{H}(\mathbb{R}^{S})$ will produce sets that are closer and closer, in the Hausdorff distance, to a fixed point set. It is then natural to ask whether there is a unique set to which all such iterations converge.
There exists a unique fixed point $\mathcal{V}^{\star}$ of the set-based Bellman operator $F_{\mathcal{C}}$, as defined in Definition 2.2, such that $F_{\mathcal{C}}(\mathcal{V}^{\star}) = \mathcal{V}^{\star}$, and $\mathcal{V}^{\star}$ is a closed and bounded subset of $\mathbb{R}^{S}$. Furthermore, for any iteration starting from an arbitrary $\mathcal{V}_0 \in \mathcal{H}(\mathbb{R}^{S})$, $\mathcal{V}_{k+1} = F_{\mathcal{C}}(\mathcal{V}_k)$, the sequence $\{\mathcal{V}_k\}$ converges in the Hausdorff sense, i.e., $\lim_{k \to \infty} d_H(\mathcal{V}_k, \mathcal{V}^{\star}) = 0$. As shown above, $F_{\mathcal{C}}$ is a contracting operator. From the Banach fixed point theorem (Puterman, 2014, Thm 6.2.3), there exists a unique fixed point $\mathcal{V}^{\star}$, and any arbitrary $\mathcal{V}_0 \in \mathcal{H}(\mathbb{R}^{S})$ will generate a sequence that converges to the fixed point. ∎
The fixed point $V^{\star}$ of the Bellman operator $f_C$ on the metric space $(\mathbb{R}^{S}, \|\cdot\|_{\infty})$ corresponds to the optimal value function of the MDP associated with the cost parameter $C$. Because there is no direct association of an MDP problem to the set-based Bellman operator $F_{\mathcal{C}}$, we cannot claim the same for $\mathcal{V}^{\star}$. However, $F_{\mathcal{C}}$ does have many interesting properties on $\mathcal{H}(\mathbb{R}^{S})$, in parallel to the operator $f_C$ on $\mathbb{R}^{S}$, especially in terms of the value iteration method (5). Suppose that instead of a fixed cost parameter, at each iteration $k$ we have a cost $C_k$ that is randomly chosen from a compact set of costs $\mathcal{C}$; then it is interesting to ask if $\mathcal{V}^{\star}$ contains all the limit points of the sequence $V_{k+1} = f_{C_k}(V_k)$. Indeed, we can infer from Theorem 3.1 that the sequence of sets $\{\mathcal{V}_k\}$ converges to $\mathcal{V}^{\star}$ under the Hausdorff metric. Furthermore, even when the sequence $\{V_k\}$ itself does not converge, it must converge to the set $\mathcal{V}^{\star}$, i.e., $\lim_{k \to \infty} \inf_{V \in \mathcal{V}^{\star}} \|V_k - V\|_{\infty} = 0$.
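This behaviour of randomly drawn costs can be checked numerically. In the sketch below (illustrative bounds and seed, our own construction), monotonicity of the Bellman update sandwiches the random-cost iterates between the value functions obtained from the entry-wise smallest and largest costs:

```python
import numpy as np

def bellman(V, C, P, gamma):
    """Standard Bellman operator f_C for one cost matrix C."""
    S, A = C.shape
    return (C + gamma * (P.T @ V).reshape(S, A)).min(axis=1)

rng = np.random.default_rng(0)
P = np.array([[0.8, 0.3, 0.2, 0.5],
              [0.2, 0.7, 0.8, 0.5]])
C_lo = np.array([[1.0, 2.0], [0.5, 3.0]])  # illustrative lower bound on C
C_hi = C_lo + 0.2                          # illustrative upper bound on C
gamma = 0.9

# Value iteration with a cost resampled from [C_lo, C_hi] every step.
V = np.zeros(2)
for _ in range(200):
    C_k = rng.uniform(C_lo, C_hi)
    V = bellman(V, C_k, P, gamma)

# The random-cost iterate stays between the fixed points of f_{C_lo}
# and f_{C_hi}, up to a small numerical tolerance.
V_lo, V_hi = np.zeros(2), np.zeros(2)
for _ in range(500):
    V_lo = bellman(V_lo, C_lo, P, gamma)
    V_hi = bellman(V_hi, C_hi, P, gamma)
assert np.all(V_lo <= V + 1e-6) and np.all(V <= V_hi + 1e-6)
```

The sandwiching follows because the Bellman update is monotone in both the cost parameter and the value function, so the extreme costs bound the random-cost trajectory entry-wise.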
We summarize our results on the set-based Bellman operator: for a compact cost function set $\mathcal{C}$, iteration of $F_{\mathcal{C}}$ converges to a unique compact set $\mathcal{V}^{\star}$ which contains all the fixed points of $f_C$ for each fixed $C \in \mathcal{C}$. Furthermore, $\mathcal{V}^{\star}$ also contains the limit points of the iterates $V_{k+1} = f_{C_k}(V_k)$ for any sequence $C_k \in \mathcal{C}$, given that the sequence converges. Even if the limit does not exist, the iterates must asymptotically approach $\mathcal{V}^{\star}$ in the Hausdorff sense. Future work includes extending the uncertainty analysis to uncertainty in the transition kernel, to fully capture learning in a general stochastic game.
- Abbad and Filar (1992) Perturbation and stability theory for Markov control problems. IEEE Trans. Autom. Control.
- Açikmeşe and Bayard (2012) A Markov chain approach to probabilistic swarm guidance. In American Control Conference, pp. 6300–6307.
- Altman and Gaitsgory (1993) Stability and singular perturbations in constrained Markov decision problems. IEEE Trans. Autom. Control 38 (6), pp. 971–975.
- Ashok et al. (2017) Value iteration for long-run average reward in Markov decision processes. In International Conference on Computer Aided Verification, pp. 201–221.
- Bielecki and Filar (1991) Singularly perturbed Markov control problem: limiting average cost. Annals of Op. Res. 28 (1), pp. 153–168.
- Delage and Mannor (2010) Percentile optimization for Markov decision processes with parameter uncertainty. Op. Res. 58 (1), pp. 203–213.
- Demir et al. (2015) Decentralized probabilistic density control of autonomous swarms with safety. Autonomous Robots 39 (4), pp. 537–554.
- Eisentraut et al. (2019) Stopping criteria for value and strategy iteration on concurrent stochastic reachability games. arXiv preprint arXiv:1909.08348.
- Controlled Markov processes with safety state constraints. IEEE Trans. Autom. Control 64 (3), pp. 1003–1018.
- Filar and Vrieze (2012) Competitive Markov decision processes. Springer Science & Business Media.
- Givan et al. (2000) Bounded-parameter Markov decision processes. Artificial Intelligence 122 (1-2), pp. 71–109.
- Haddad and Monmege (2018) Interval iteration algorithm for MDPs and IMDPs. Theoretical Computer Science 735, pp. 111–131.
- Henrikson (1999) Completeness and total boundedness of the Hausdorff metric. MIT Undergraduate Journal of Mathematics.
- Li et al. (2019) Tolling for constraint satisfaction in Markov decision process congestion games. In American Control Conference, pp. 1238–1243.
- Puterman (2014) Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
- Rudin (1964) Principles of mathematical analysis. Vol. 3, McGraw-Hill, New York.