A Markov decision process (MDP) is a fundamental framework for control design in stochastic environments, reinforcement learning, and stochastic games [2, 3, 10, 14, 21]. Given cost and transition probabilities, solving an MDP is equivalent to minimizing an objective in expectation, and requires determining the optimal value function as well as deriving the corresponding optimal policy for each state. Relying on the fact that the optimal value function is the fixed point of the Bellman operator, dynamic programming methods iteratively apply variants of the Bellman operator to converge to the optimal value function and the optimal policy.
We are motivated to study MDPs where the parameters that define the environment are sets rather than single-valued. Such a set-based perspective arises naturally in the analysis of parameter uncertain MDPs and stochastic games. In this paper, we develop a framework for evaluating MDPs on compact sets. Specifically, we show that when the cost parameter of the MDP is in a compact set rather than single-valued, we can define a Bellman operator on the space of compact sets, such that it is contractive with respect to the Hausdorff distance. We prove the existence of a unique and compact fixed point set to which the operator must converge, and give interpretations of the fixed point set in the context of parameter uncertain MDPs and stochastic games.
When modeling a system as a stochastic process, sampling techniques are often used to determine cost and transition probability parameters. In such scenarios, the MDP can be interpreted either as a standard MDP with error bounds on its parameters, or as a set-based MDP whose parameters are sets rather than single-valued. In the former approach, an MDP can be solved with standard dynamic programming methods, and the stability of its solution with respect to parameter perturbations can be analyzed locally [1, 4, 5]. However, these sensitivity results are only local approximations in the context of compact parameter sets. The latter approach is not well explored: some research exists on bounded interval set MDPs, in which dynamic programming techniques such as value and policy iteration have been shown to converge. However, uncertain cost parameters may not always result in interval sets; another example of a bounded cost parameter set is a compact polytope. In this paper, we show that in general, an MDP with a compact set cost parameter has an associated Bellman operator whose unique compact fixed point set contains the optimal value functions of all MDPs whose single-valued cost parameters belong to the given cost parameter set.
As opposed to parameter uncertain MDPs where the underlying cost and probability parameters are constant albeit uncertain, stochastic games result in MDPs where the cost and probability parameters vary with opponents’ changing policies. An individual player can interpret a stochastic game as an MDP with a parameter-varying environment, and as a result, it does not commit to solving a fixed parameter MDP unless a Nash equilibrium is achieved. A Nash equilibrium defines an optimal joint policy for all players, at which no player has any incentive to change its policy. Since every player is performing optimally, each player’s MDP parameters remain constant. In learning theory for stochastic games, players iteratively update their individual policies to converge to a Nash equilibrium. Many of the learning algorithms are based on variants of the Bellman operator with costs and probabilities changing at each iteration [6, 24]. In this paper, we do not focus on demonstrating convergence towards a Nash equilibrium. Instead, we specialize the set-based MDP framework to a single controller stochastic game, and show that the set of Nash equilibria must be contained in the fixed point set of a set-based Bellman operator.
In earlier work, we initiated our analysis of set-based MDPs by proving the existence of a unique fixed point set to the set-based Bellman operator. In this paper, we demonstrate the significance of this fixed point set by relating it to the fixed points of parameter uncertain MDPs and the Nash equilibria set of stochastic games. We further explore the fixed point set in the context of iterative solutions to stochastic games, and show that the fixed point set of the set-based Bellman operator bounds the asymptotic behaviour of dynamic programming-based learning algorithms.
The paper is structured as follows: we provide references to existing research in Section 2; we recall the definition of an MDP and the Bellman operator in Section 3; Section 4 extends these definitions to set-based MDPs, providing theoretical results for the existence of a fixed point set of a set-based Bellman operator. Section 5 relates properties of the fixed point set to stochastic games. An interval set-based MDP is presented in Section 6 with a computation of exact bounds, while the application to stochastic games is illustrated in Section 7, where we model unknown policies of the opponent as cost intervals.
2 Related Research
, where the value functions are bounded with a given probability when cost and transition probability parameters are Gaussian distributed. In contrast to our MDP model, the cost parameters under the Gaussian distribution assumption do not come from a compact set. Bounding MDPs with reachability objectives and uncertain transition probabilities is studied in. However, the techniques utilized in  require abstraction of the MDP state space and therefore do not extend to value functions which are defined per state. Our work is perhaps closest to that of , in which value iteration and policy iteration are shown to converge for an MDP whose cost and probability parameters are interval sets rather than the more general compact sets that we study. Finally, we note that a generalization of  is given in , in which an algebraic abstraction to interval sets is used to analyze uncertain MDP parameters.
Introduced in , a stochastic game generalizes the notion of an optimal policy in an MDP to a Nash equilibrium. Learning algorithms for computing the Nash equilibria can be categorized by the assumption of perfect vs imperfect information . In this paper, our Nash equilibria analysis always assumes perfect information. The computational complexity of player general sum stochastic games is shown to be NP-hard in , while value iteration for such games is shown to diverge in . However, some Bellman operator-based algorithms will converge when constrained to two player stochastic games or zero sum stochastic games [11, 26, 30, 31].
3 MDP and Bellman Operator
We introduce our notation for existing results in MDPs, which are used throughout the paper. Contents from this section are discussed in further detail in .
Notation: Sets of elements are given by . We denote the set of matrices of rows and columns with real or non-negative valued entries as or , respectively. Matrices and some integers are denoted by capital letters, , while sets are denoted by cursive letters, . The set of all compact subsets of is denoted by . The column vector of ones is denoted by . The identity matrix of size is denoted by .
We consider a discounted infinite-horizon MDP defined by , where
denotes the finite set of states.
denotes the finite set of actions. Without loss of generality, assume that every action is admissible from each state .
defines the transition kernel. Each component is the probability of arriving in state by taking state-action . Matrix is column stochastic and element-wise non-negative — i.e.,
defines the cost matrix. Each component is the cost of state-action pair .
denotes the discount factor.
At each time step , the decision maker chooses an action at its current state . The state-action pair induces a probability distribution vector over states as . The state-action also induces a cost for the decision maker.
The decision maker chooses actions via a policy. We denote policy as a function , where denotes the probability that action is chosen at state . We also denote in shorthand, a probability vector over the action space at each state . The set of all policies of an MDP is denoted by . Within , we consider a subset of deterministic policies , where a policy is deterministic if at each state , returns for exactly one action, and for all other possible actions. A policy that is not deterministic is a mixed policy.
We denote the policy matrix induced by the policy as , where
For an MDP , we are interested in minimizing the discounted infinite horizon expected cost, defined with respect to a policy as
where is the discounted infinite horizon expected value of objective , and are the state and action taken at time step , and is the initial state of the decision maker at .
is the optimal value function for the initial state . The policy that achieves this optimal value is called an optimal policy. In general, the optimal value function is unique while the optimal policy is not. The set of optimal policies always includes at least one deterministic stationary policy if there are no additional constraints [27, Thm 6.2.11]. If there are additional constraints on the policy and state space, deterministic policies may become infeasible .
3.1 Bellman Operator
Determining the optimal value function of a given MDP is equivalent to solving for the fixed point of the associated Bellman operator, for which a myriad of techniques exists . We introduce the Bellman operator and its fixed point here for the corresponding MDP problem.
[Bellman Operator] For a discounted infinite horizon MDP , its Bellman operator is given component-wise as
The fixed point of the Bellman operator is a value function that is invariant with respect to the operator. [Fixed Point] is a fixed point of an operator iff
In order to show that the Bellman operator has a unique fixed point, we consider the following operator properties. [Order Preservation] Let be a partially ordered space with partial order . An operator is an order preserving operator iff
[Contraction] Let be a complete metric space with metric . An operator is a contracting operator iff
The Bellman operator is known to have both properties on the complete metric space . Therefore, the Banach fixed point theorem can be used to show that has a unique fixed point . Because the optimal value function is given by the unique fixed point of the associated Bellman operator , we use the terms optimal value function and fixed point of interchangeably.
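As a concrete illustration of the operator and the two properties above, the following Python sketch applies a cost-minimizing Bellman operator of the standard form $f_C(V)_s = \min_a \big( C_{sa} + \gamma \sum_{s'} P_{s',sa} V_{s'} \big)$ to a toy MDP and checks order preservation and contraction numerically on one pair of value functions. The symbol names, the nested-list layout `P[s][a][t]` (probability of moving from state `s` to state `t` under action `a`), and the two-state example are assumptions made for illustration only.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def bellman(V: Vector, C: Matrix, P: List[Matrix], gamma: float) -> Vector:
    """One application of the cost-minimizing Bellman operator.

    C[s][a] is the cost of action a in state s, and P[s][a][t] is the
    probability of moving from s to t under a (assumed layout).
    """
    n = len(V)
    return [
        min(
            C[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n))
            for a in range(len(C[s]))
        )
        for s in range(n)
    ]


def sup_norm(x: Vector, y: Vector) -> float:
    """Sup-norm distance between two value functions."""
    return max(abs(a - b) for a, b in zip(x, y))


# Hypothetical two-state, two-action MDP.
C = [[1.0, 1.5], [0.0, 3.0]]
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # state 0: action 0 stays, action 1 moves to state 1
    [[0.0, 1.0], [1.0, 0.0]],  # state 1: action 0 stays, action 1 moves to state 0
]
gamma = 0.5

V_low, V_high = [0.0, 0.0], [4.0, 2.0]  # V_low <= V_high element-wise
F_low = bellman(V_low, C, P, gamma)
F_high = bellman(V_high, C, P, gamma)

# Order preservation: V_low <= V_high should imply f(V_low) <= f(V_high).
order_preserved = all(a <= b for a, b in zip(F_low, F_high))
# Contraction: the images should be closer than the inputs by a factor gamma.
contracts = sup_norm(F_low, F_high) <= gamma * sup_norm(V_low, V_high)
```

A single pair of value functions does not prove either property; the check only makes the definitions tangible on a small instance.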
In addition to obtaining , MDPs are also solved to determine the optimal policy, . Every policy induces a unique stationary value function given by
where denotes the Kronecker product and . Given a policy , we can equivalently solve for the stationary value function as . From this perspective, the optimal value function is the minimum vector among the set of stationary value functions corresponding to the set of policies . Policy iteration algorithms utilize this fact to obtain the optimal value function by iterating over the feasible policy space .
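The stationary value function of a fixed policy can also be approximated without any linear algebra routines. The sketch below replaces the linear solve mentioned above with a plain fixed-point iteration of the policy evaluation map, which is a substitution made here for self-containment. The function name `evaluate_policy`, the layouts `pi[s][a]` and `P[s][a][t]`, and the toy MDP are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def evaluate_policy(
    pi: Matrix,
    C: Matrix,
    P: List[Matrix],
    gamma: float,
    tol: float = 1e-12,
    max_iter: int = 10_000,
) -> Vector:
    """Approximate the stationary value function of the policy pi.

    pi[s][a] is the probability of choosing action a in state s. The map
    V -> c_pi + gamma * P_pi V is a gamma-contraction, so iterating it
    converges to the unique stationary value function of pi.
    """
    n = len(C)
    V = [0.0] * n
    for _ in range(max_iter):
        V_next = [
            sum(
                pi[s][a]
                * (C[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n)))
                for a in range(len(C[s]))
            )
            for s in range(n)
        ]
        if max(abs(a - b) for a, b in zip(V, V_next)) < tol:
            return V_next
        V = V_next
    return V


# Hypothetical two-state, two-action MDP.
C = [[1.0, 1.5], [0.0, 3.0]]
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # state 0: action 0 stays, action 1 moves to state 1
    [[0.0, 1.0], [1.0, 0.0]],  # state 1: action 0 stays, action 1 moves to state 0
]
gamma = 0.5

V_stay = evaluate_policy([[1.0, 0.0], [1.0, 0.0]], C, P, gamma)  # always stay
V_move = evaluate_policy([[0.0, 1.0], [1.0, 0.0]], C, P, gamma)  # leave state 0
```

In this example the second policy yields the smaller value at state 0, which is the kind of comparison that policy iteration exploits when searching over the policy space.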
Given the optimal value function , we can also derive a deterministic optimal policy from the Bellman operator as
where returns the first optimal action if multiple actions minimize the expression at state . While the optimal policy does not need to be unique, deterministic or stationary, the optimal policy derived from (6) will always be unique, deterministic and stationary.
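The tie-breaking rule described above can be sketched as follows. The function name `greedy_policy` and the data layout are hypothetical; the example is chosen so that both actions at state 0 attain the minimum, and the first one is returned.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def greedy_policy(V: Vector, C: Matrix, P: List[Matrix], gamma: float) -> List[int]:
    """Return, for each state, the first action minimizing the Bellman expression.

    C[s][a] is the cost and P[s][a][t] the probability of moving from s to t
    under action a (assumed layout).
    """
    n = len(V)
    policy = []
    for s in range(n):
        q = [
            C[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n))
            for a in range(len(C[s]))
        ]
        # min() returns the first minimizer, which implements the tie-break.
        policy.append(min(range(len(q)), key=q.__getitem__))
    return policy


# Hypothetical MDP in which both actions at state 0 are optimal.
C = [[1.0, 2.0], [0.0, 3.0]]
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # state 0: action 0 stays, action 1 moves to state 1
    [[0.0, 1.0], [1.0, 0.0]],  # state 1: action 0 stays, action 1 moves to state 0
]
V_opt = [2.0, 0.0]  # fixed point of the Bellman operator for this toy MDP

policy = greedy_policy(V_opt, C, P, gamma=0.5)
```

Because the rule always selects the lowest-indexed minimizer, the derived policy is a single deterministic stationary policy even when several optimal policies exist.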
3.2 Termination Criteria for Value Iteration
Among different algorithms that solve for the fixed point of the Bellman operator, value iteration (VI) is a commonly used and simple technique in which the Bellman operator is iteratively applied until the optimal value function is reached — i.e. starting from any value function , we apply
The iteration scheme given by (7) converges to the optimal value function of the corresponding discounted infinite horizon MDP. The following result presents a stopping criterion for (7). [27, Thm. 6.3.1] For any initial value function , let satisfy the value iteration given by (7). For , if
then is within of the fixed point , i.e.
Lemma 3.2 connects relative convergence of the sequence to absolute convergence towards by showing that the former implies the latter. In general, the stopping criteria differ for different MDP objectives (see  for recent results on stopping criteria for MDPs with a reachability objective).
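A minimal sketch of value iteration with this kind of termination test is given below. It assumes the standard constants from [27, Thm. 6.3.1]: stop once successive iterates differ by less than $\epsilon(1-\gamma)/(2\gamma)$ in the sup norm, which places the last iterate within $\epsilon/2$ of the fixed point. Whether these constants match the lemma as stated here is an assumption, and the function names and toy MDP are hypothetical.

```python
from typing import List, Optional, Tuple

Vector = List[float]
Matrix = List[List[float]]


def bellman(V: Vector, C: Matrix, P: List[Matrix], gamma: float) -> Vector:
    """Cost-minimizing Bellman operator; P[s][a][t] is an assumed layout."""
    n = len(V)
    return [
        min(
            C[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n))
            for a in range(len(C[s]))
        )
        for s in range(n)
    ]


def sup_norm(x: Vector, y: Vector) -> float:
    return max(abs(a - b) for a, b in zip(x, y))


def value_iteration(
    C: Matrix,
    P: List[Matrix],
    gamma: float,
    eps: float,
    V0: Optional[Vector] = None,
) -> Tuple[Vector, int]:
    """Iterate the Bellman operator until the relative stopping test passes.

    Requires 0 < gamma < 1. Returns the last iterate and the number of
    operator applications.
    """
    V = list(V0) if V0 is not None else [0.0] * len(C)
    threshold = eps * (1.0 - gamma) / (2.0 * gamma)
    iterations = 0
    while True:
        V_next = bellman(V, C, P, gamma)
        iterations += 1
        if sup_norm(V, V_next) < threshold:
            return V_next, iterations
        V = V_next


# Hypothetical two-state, two-action MDP with fixed point [1.5, 0.0].
C = [[1.0, 1.5], [0.0, 3.0]]
P = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
eps = 1e-6
V_approx, n_iter = value_iteration(C, P, gamma=0.5, eps=eps)
```

The test only inspects consecutive iterates, so it can be evaluated without knowing the fixed point in advance.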
4 Set-based Bellman Operator
The classic Bellman operator with respect to a cost is well studied. Motivated by parameter uncertain MDPs and stochastic games, we extend the classic Bellman operator by lifting it to operate on sets rather than individual value functions in . For the set-based operator, we analyze its set-based domain and prove relevant operator properties such as order preservation and contraction. Finally, we show the existence of a unique fixed point set and relate its properties to the fixed point of the classic Bellman operator.
4.1 Set-based operator properties
For the domain of our set-based operator, we define a new metric space based on the Banach space , where denotes the collection of non-empty compact subsets of . We equip with a partial order, , where for , iff . The metric is the Hausdorff distance defined as
[18, Thm 3.3] If is a complete metric space, then its induced Hausdorff metric space is a complete metric space. From Lemma 4.1, since is a complete metric space, is a complete metric space with respect to . On the complete metric space , we define a set-based Bellman operator which acts on compact sets. [Set-based Bellman Operator] For a family of MDP problems, , where is a non-empty compact set, its associated set-based Bellman operator is given by
where is the closure operator. Since is the union of uncountably many bounded sets, the resulting set may not be bounded, and therefore it is not immediately obvious that maps into the metric space . If is non-empty and compact, then , . For a non-empty compact subset of a finite dimensional real vector space, we define its diameter as . The diameter of a set in a metric space is finite if and only if it is bounded .
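To make the set-based objects concrete, the sketch below works with finite sets of cost matrices and value functions, for which the closure is trivial. It assumes the usual Hausdorff distance, $d_H(\mathcal{A}, \mathcal{B}) = \max\{\sup_{a}\inf_{b}\|a-b\|_\infty, \sup_{b}\inf_{a}\|a-b\|_\infty\}$, and takes the set-based operator to be the union of $f_C(V)$ over all costs and value functions in the given sets. The function names, the finite sampling, and the toy data are assumptions for illustration; a genuinely compact cost set would have to be sampled or handled analytically.

```python
from itertools import product
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
Point = Tuple[float, ...]


def bellman(V: Sequence[float], C: Matrix, P: List[Matrix], gamma: float) -> List[float]:
    """Cost-minimizing Bellman operator; P[s][a][t] is an assumed layout."""
    n = len(V)
    return [
        min(
            C[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n))
            for a in range(len(C[s]))
        )
        for s in range(n)
    ]


def sup_norm(x: Sequence[float], y: Sequence[float]) -> float:
    return max(abs(a - b) for a, b in zip(x, y))


def hausdorff(A: Sequence[Point], B: Sequence[Point]) -> float:
    """Hausdorff distance between two finite, non-empty point sets."""

    def one_sided(X: Sequence[Point], Y: Sequence[Point]) -> float:
        return max(min(sup_norm(x, y) for y in Y) for x in X)

    return max(one_sided(A, B), one_sided(B, A))


def set_bellman(
    costs: Sequence[Matrix], values: Sequence[Point], P: List[Matrix], gamma: float
) -> List[Point]:
    """Union of f_C(V) over every C in costs and V in values (finite case)."""
    images = {tuple(bellman(V, C, P, gamma)) for C, V in product(costs, values)}
    return sorted(images)


# Hypothetical data: two candidate cost matrices and a shared kernel.
costs = [
    [[1.0, 1.5], [0.0, 3.0]],
    [[2.0, 1.5], [1.0, 3.0]],
]
P = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
gamma = 0.5

set_a = [(0.0, 0.0)]
set_b = [(4.0, 2.0)]
image_a = set_bellman(costs, set_a, P, gamma)
image_b = set_bellman(costs, set_b, P, gamma)

# The images should be closer than the inputs by a factor gamma.
contracts = hausdorff(image_a, image_b) <= gamma * hausdorff(set_a, set_b)
```

With finitely many costs the number of points can grow with each application of `set_bellman`, which is one reason a bound-based description of the fixed point set is attractive.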
Take any non-empty compact set . As , it suffices to prove that is closed and bounded. The closedness is guaranteed by the closure operator. A subset of a metric space is bounded iff its closure is bounded. Hence, to prove the boundedness, it suffices to prove that . For any two cost-value function pairs ,
We bound (10) by bounding the two terms on the right hand side separately. The second term satisfies
due to contraction properties of . To bound the first term, we note that for any two vectors ,
where the operator denotes the maximum element, and denotes the maximum component of vector . Evaluating with (11),
where is an optimal policy corresponding to .
Since and for any , . Finally it follows from (10) that
Since (12) holds for all then as both and are bounded. Proposition 4.1 shows that is an operator from to . Having established the space which operates on, we can draw many parallels between and . Similar to having a fixed point in the real vector space, we consider whether a unique fixed point set which satisfies exists. To take the comparison further, since is optimal for an MDP problem defined by , we consider if correlates to the family of optimal value functions that correspond to the MDP family . We explore these parallels in this paper, prove the existence of a unique fixed point for the set-based Bellman operator , and derive sufficient conditions for its existence.
We prove the existence and uniqueness of by utilizing the Banach fixed point theorem , which states that a unique fixed point must exist for all contraction operators on complete metric spaces. First, we show that has properties given in Definitions 3.1 and 3.1 on the complete metric space .
For any and closed and bounded, is an order preserving and a contracting operator in the Hausdorff distance.
Consider , which satisfy , then
We conclude that is order-preserving. To see that is contracting, we need to show
First we note that taking () of a continuous function over the closure of a set is equivalent to taking the () over itself. Furthermore, iff . Therefore taking the of over is equivalent to taking the of over .
Given and for arbitrary , ,
The contraction property of implies that any repeated application of the operator to a will result in closer consecutive sets in the Hausdorff distance. It is then natural to consider if there is a unique set which all converges to.
There exists a unique fixed point of the set-based Bellman operator as defined in Definition 3.1, such that , and is a closed and bounded set in .
Furthermore, for any set , the iteration
converges in the Hausdorff distance, i.e.,
As shown in Proposition 4.1, is a contracting operator. From the Banach fixed point theorem [27, Thm 6.2.3], there exists a unique fixed point , and any arbitrary will generate a sequence of sets that converges to .
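When the cost set is a box, that is, every entry of the cost matrix ranges over an interval, the limit set can be bracketed without tracking sets explicitly. The sketch below relies on the fact that the Bellman operator is non-decreasing in both the cost and the value function, so iterating with the element-wise smallest and largest cost matrices yields lower and upper bounds on the value functions obtained from any cost in the box. The box-shaped cost set, the helper `fixed_point`, and the numbers are assumptions made for this example.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def bellman(V: Vector, C: Matrix, P: List[Matrix], gamma: float) -> Vector:
    """Cost-minimizing Bellman operator; P[s][a][t] is an assumed layout."""
    n = len(V)
    return [
        min(
            C[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n))
            for a in range(len(C[s]))
        )
        for s in range(n)
    ]


def fixed_point(C: Matrix, P: List[Matrix], gamma: float, iters: int = 200) -> Vector:
    """Approximate the fixed point by repeated application of the operator."""
    V = [0.0] * len(C)
    for _ in range(iters):
        V = bellman(V, C, P, gamma)
    return V


# Hypothetical interval cost set: C_lo <= C <= C_hi element-wise.
C_lo = [[1.0, 1.5], [0.0, 3.0]]
C_hi = [[2.0, 2.5], [1.0, 4.0]]
P = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
gamma = 0.5

V_lo = fixed_point(C_lo, P, gamma)  # lower bound on the limit set
V_hi = fixed_point(C_hi, P, gamma)  # upper bound on the limit set

# Any cost inside the box gives an optimal value function between the bounds.
C_mid = [[(lo + hi) / 2 for lo, hi in zip(r_lo, r_hi)] for r_lo, r_hi in zip(C_lo, C_hi)]
V_mid = fixed_point(C_mid, P, gamma)
bracketed = all(lo - 1e-9 <= v <= hi + 1e-9 for lo, v, hi in zip(V_lo, V_mid, V_hi))
```

These bounds describe a box around the limit set, not the set itself; the actual fixed point set may be a strict subset of that box.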
4.2 Properties of fixed point set
In the case of the Bellman operator on metric space , the fixed point corresponds to the optimal value function of the MDP associated with cost . Because there is no direct association of an MDP problem to the set of cost parameters , we cannot claim the same for the set-based Bellman operator and . However, does have many interesting properties on , especially in terms of set-based value iteration (16).
We consider the following generalization of value iteration: suppose that instead of a fixed cost parameter, we have that at each iteration , a that is randomly chosen from the compact set of cost parameters . In general, may not exist. However, we can infer from Theorem 4.1 that the sequence converges to the set in the Hausdorff distance.
Let be a sequence of costs in , where is a compact set within . Let us define the iteration
for any . Then the sequence satisfies
At each iteration , . From Theorem 4.1, converges to in Hausdorff distance, . Therefore for every , there exists such that for all , . Since , must also be true for all . Therefore . Proposition 4.2 implies that regardless of whether or not the sequence converges, the sequence must asymptotically approach . This has important interpretations in the game setting that is further explored in Section 5. On the other hand, Proposition 4.2 also implies that if does converge, its limit point must be an element of . We define the set of fixed points of for each as
i.e., is the set of optimal value functions for the set of MDPs where . Furthermore, we consider all sequences such that for , the iteration approaches a limit point , and define the set of all such limit points as
then . For any and ,
is satisfied for all . Furthermore, by assumption, each has an associated iteration whose limit point is equal to , i.e. . Additionally,
follows from Proposition 4.2. Therefore,
From the fact that the infimum over a compact set is always achieved for an element of the set , . Therefore . To see that , take for all , then . We make the distinction between , , and to emphasize that is not simply the set of fixed points corresponding to for all possible , given by , or the limit points of for all possible sequences , given by . The fixed point set contains all possible limiting trajectories of without assuming a limit point exists. In Corollary 4.2, can be easily understood as the set of optimal value functions for the set of standard MDPs generated by . An interpretation for is perhaps less obvious. We show in Section 7 that corresponds to the set of limit points that result from the best response dynamics of a two player single controller stochastic game, as defined in Section 5.
We summarize our results on the set-based Bellman operator as follows: given a compact set of cost parameters , converges to a unique compact set . The set contains all the fixed points of for . Furthermore, also contains the limit points of for any , , given that converges. Even if the limit does not exist, must asymptotically converge to in the Hausdorff distance.
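The asymptotic statement in this summary can be observed on a small example. The sketch below draws a fresh cost matrix from a box-shaped cost set at every iteration, so the value function sequence need not converge, and checks that after a burn-in period every iterate stays between the fixed points of the element-wise smallest and largest costs. The box-shaped cost set, the uniform sampling, the seed, the burn-in length, and the tolerance are all assumptions of this example.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def bellman(V: Vector, C: Matrix, P: List[Matrix], gamma: float) -> Vector:
    """Cost-minimizing Bellman operator; P[s][a][t] is an assumed layout."""
    n = len(V)
    return [
        min(
            C[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n))
            for a in range(len(C[s]))
        )
        for s in range(n)
    ]


def fixed_point(C: Matrix, P: List[Matrix], gamma: float, iters: int = 200) -> Vector:
    V = [0.0] * len(C)
    for _ in range(iters):
        V = bellman(V, C, P, gamma)
    return V


# Hypothetical interval cost set and kernel.
C_lo = [[1.0, 1.5], [0.0, 3.0]]
C_hi = [[2.0, 2.5], [1.0, 4.0]]
P = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
gamma = 0.5
V_lo = fixed_point(C_lo, P, gamma)
V_hi = fixed_point(C_hi, P, gamma)

rng = random.Random(0)  # fixed seed so the run is reproducible
V = [10.0, -5.0]  # arbitrary starting value function
tol = 1e-9
inside_after_burn_in = True
for k in range(120):
    # Draw a new cost matrix from the box at every iteration.
    C_k = [
        [rng.uniform(lo, hi) for lo, hi in zip(row_lo, row_hi)]
        for row_lo, row_hi in zip(C_lo, C_hi)
    ]
    V = bellman(V, C_k, P, gamma)
    if k >= 60:
        inside_after_burn_in &= all(
            lo - tol <= v <= hi + tol for lo, v, hi in zip(V_lo, V, V_hi)
        )
```

The sequence keeps moving because the cost changes at each step, yet it remains inside the bracket, which matches the asymptotic behaviour described above.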
5 Stochastic Games
In this section, we further elaborate on the properties of the fixed point set in the context of stochastic games, and show that with an appropriate over-approximation of the Nash equilibria cost parameters, contains the optimal value functions for player one at Nash equilibria.
A stochastic game extends a standard MDP to a multi-agent competitive setting . We specifically focus on games with two players. As opposed to standard MDPs, the cost and transition probabilities and for each player depend on the joint policy, , where and are respectively player one and player two’s policies as defined for standard MDPs in Section 3. The set of joint policies is given by , and the sets of policies for player one and player two are given by and , respectively. We denote the actions of player one by and the actions of player two by . The transition kernel of the game is determined by the tensor , where satisfies
Each player’s cost also depends on the joint policy, where and denote player one and player two’s cost when the joint action is taken from state , respectively. With the same notation of Section 3, we denote the transition kernel for player one when player two applies policy as , where
Similarly, let the cost of player one be denoted as
when player two takes on policy . Player two’s cost and transition kernel can be similarly defined. Each player then solves a discounted MDP given by . Since each player only controls part of the joint action space, the generalization to joint action space introduces non-stationarity in the transition and cost, when viewed from the perspective of an individual player solving an MDP.
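One way to read the construction above is that player one faces an ordinary MDP whose cost and kernel are averages over player two's mixed action at each state. The sketch below implements that reading; the averaging formula, the function names, and the index layouts `C1[s][a1][a2]`, `P_game[s][a1][a2][t]`, and `pi2[s][a2]` are assumptions for illustration and are not taken from the elided expressions.

```python
from typing import List

Matrix = List[List[float]]


def induced_cost(C1: List[Matrix], pi2: Matrix) -> Matrix:
    """Player one's cost per (state, own action) under player two's policy.

    C1[s][a1][a2] is player one's cost for the joint action (a1, a2) at
    state s, and pi2[s][a2] is the probability player two plays a2 at s.
    """
    return [
        [sum(p * c for p, c in zip(pi2[s], row)) for row in C1[s]]
        for s in range(len(C1))
    ]


def induced_kernel(P_game: List[List[List[List[float]]]], pi2: Matrix) -> List[Matrix]:
    """Player one's transition kernel under player two's policy.

    P_game[s][a1][a2][t] is the probability of moving from s to t under
    the joint action (a1, a2).
    """
    n = len(P_game)
    return [
        [
            [
                sum(pi2[s][a2] * P_game[s][a1][a2][t] for a2 in range(len(pi2[s])))
                for t in range(n)
            ]
            for a1 in range(len(P_game[s]))
        ]
        for s in range(n)
    ]


# Hypothetical two-state game with two actions per player.
C1 = [
    [[1.0, 3.0], [2.0, 0.0]],
    [[0.0, 4.0], [1.0, 1.0]],
]
P_game = [
    [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]]],
    [[[0.0, 1.0], [0.0, 1.0]], [[1.0, 0.0], [1.0, 0.0]]],
]
pi2 = [[0.5, 0.5], [1.0, 0.0]]  # player two mixes at state 0 only

C_player_one = induced_cost(C1, pi2)
P_player_one = induced_kernel(P_game, pi2)
```

Each time player two changes its policy, both induced quantities change, which is the source of the non-stationarity seen by player one.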
We assume that both players have perfect information — i.e. both players can fully observe the joint policy as well as predict future states’ probability distributions.
Given a joint policy , each player attempts to minimize its value function. Each player’s optimal discounted infinite horizon expected cost is given by
As formulated by (20), we denote the value function of player one by and the value function of player two by . Given a joint policy , both players have unique stationary value functions given by
Since a stochastic game can be viewed as coupled MDPs, the MDP notion of optimality must be expanded to reflect dependency of a player’s individual optimal policy on the joint policy space. We define a Nash equilibrium in terms of each player’s value function [14, Sec.3.1]. [Two Player Nash Equilibrium] A joint policy is a Nash equilibrium if the corresponding value functions as given by (21) satisfy
We denote the Nash equilibrium value functions as , and the set of Nash equilibria for a stochastic game as . Definition 5 implies that a Nash equilibrium is achieved when the joint policy simultaneously generates value functions and which are the fixed points of the Bellman operator with respect to parameters and , respectively — i.e. , and .
Nash equilibria are not unique in the general sum case. Furthermore, , i.e. Nash equilibria policies are not composed of deterministic individual policies. Therefore while each player’s Nash equilibrium value function is always the fixed point of the associated Bellman operator, the Nash equilibrium policy for each player is not the optimal deterministic policy associated to the Nash equilibrium value function in general. The existence of at least one Nash equilibrium for any general sum stochastic game is given in . When the stochastic game is also zero sum, the Nash equilibrium is also unique.
In this paper, we focus on non-stationarity in the cost term only and leave non-stationarity in the probability kernel term to future analysis. Specifically, we constrain our analysis to a single controller game [14, 13], i.e. when the probability transition kernel is controlled by player one only. [Single controller game] A single controller game is a two player stochastic game where the probability transition kernel is independent of player two’s actions, i.e., for each