I Introduction
A variety of distributed planning and decisionmaking problems, including multiplayer games, search and rescue, and infrastructure monitoring, can be modeled as Multiagent Markov Decision Processes (MMDPs). In such processes, the state transitions and rewards are determined by the joint actions of all of the agents. While there is a substantial body of work on computing such optimal joint policies [27, 20, 17], a key challenge is that the the total number of states and actions grows exponentially in the number of agents. This increases the complexity of computing an optimal policy, as well as storing and implementing the policy on the agents.
One approach to mitigating this complexity is to identify additional problem structures. One such structure is transitionindependence (TI) [5]. In a TIMDP, the state transitions of an agent are independent of the states and actions of the other agents. Such MDPs may arise, for example, in multirobot scenarios where the motion of each robot is independent of the others. TIMDPs can be approximately solved by factoring into multiple MDPs, one for each agent, and then obtaining a local policy, in which each agent’s next action depends only on that agent’s current state. When the TI property holds and the MDP possesses additional structure, such as submodularity, this approach may yield scalable algorithms for computing nearoptimal policies [16].
The TI property, however, does not hold for general MMDPs when there is coupling between the agents. Coupling occurs when two agents must cooperate to reach a particular state, or when the actions of agents may interfere with each other. In this case, the existing results providing nearoptimality do not hold, and at present there are no scalable algorithms for local policy selection in nonTIMMDPs.
In this paper, we investigate the problem of computing approximately optimal local policies for nonTI MMDPs. We propose transition dependence, which captures the deviation of the MMDP from transition independence. We make the following contributions:

We define the transition dependence property, in which the parameter
characterizes the maximum change in the probability distribution of any agent due to changes in the states and actions of the other agents.

We propose a local search algorithm for computing local policies of the agents, in which each agent computes an optimal policy assuming that the remaining agents follow given, fixed policies.

We prove that, when the reward functions are monotone and submodular in the agent actions, the proposed algorithm achieves a provable optimality bound as a function of the dependence parameter and the ergodicity of the MMDP.

We evaluate our approach on two numerical case studies, namely, a patrolling example and a multirobot target tracking scenario. On average, our approach achieves optimality in the multirobot scenario and optimality in the multiagent patrolling example, while requiring 1020% of the runtime of an exact optimal algorithm.
The paper is organized as follows. Section II presents the related work. Section III contains preliminary results. Section IV presents our problem formulation and algorithm. Section V contains optimality and complexity analyses. Section VI presents simulation results. Section VII concludes the paper.
Ii Related Work
MDPs have been extensively studied as a framework for multiagent planning and decisionmaking [19, 22]. Most existing works focus on selecting an optimal joint strategy for the agents, which maps each global system state to an action for each agent [27, 20, 17]. These methods can be shown to converge to a locally optimal policy, in which no agent can improve the overall reward by unilaterally changing its policy. These joint decisionmaking problems can be viewed as special cases of multiagent games in which all agents have a shared reward [28]
. These approaches, however, suffer from a “curse of dimensionality,” in which the state space grows exponentially in the number of agents, and hence do not scale well to large numbers of agents.
Transitionindependent MDPs (TIMDPs) provide problem structure that can be exploited to speed up the computation [5]. In a TIMDP, each agent’s transitions probabilities are independent of the actions and states of the other agents, allowing the MDP to be factored and approximately solved [14, 6, 8, 15]. Extensions of the TIMDP approach to POMDPs were presented in [2, 3]. A greedy algorithm for TIMDPs with submodular rewards was proposed in [16]. The goal of the present paper is to extend these works to nonTI MDPs by relaxing transition independence, enabling optimality bounds for a broader class of MDPs. A local policy algorithm was proposed in [23] that leverages a fastdecaying property that is distinct from the approximate transition independence that we consider.
Our optimality bounds rely on submodularity of the reward functions. Submodularity is a diminishingreturns property of discrete functions that has been studied in a variety of contexts, including offline [11], online [9], adaptive [13], and streaming [4] submodularity. Submodular properties were leveraged to improve the optimality bounds of multiagent planning [16], sensor scheduling [24], and solving POMDPs [1]. Submodularity for transitiondependent MDPs, however, has not been explored.
Iii Background and Preliminaries
This section gives background on perturbations of Markov chains, as well as definition and relevant properties of submodularity.
Iiia Perturbations of Markov Chains
A finitestate, discretetime Markov chain is a stochastic process defined over a finite set , in which the next state is chosen according to a probability distribution , where is the current state. A Markov chain over is defined by its transition matrix , in which represents the probability of a transition from state to state . The following theorem describes the steadystate behavior of a class of Markov chains.
Theorem 1 (Ergodic Theorem [25])
Consider a Markov chain with transition matrix . Suppose there exists such that for all and . Then there is a probability distribution over such that, for any distribution over the initial state,
where is the number of times the Markov chain reaches state in the first time steps. Moreover,
is the unique left eigenvector of
with eigenvalue
.A Markov chain satisfying the conditions of Theorem 1 is ergodic. The probability distribution defined in Theorem 1 is the stationary distribution of the chain. Intuitively, a Markov chain is ergodic if the relative frequency of reaching each state is independent of the initial state. The ergodicity coefficient of a matrix is defined by
We next state preliminary results on perturbations of ergodic Markov chains. First, we define the total variation distance between two probability distributions as follows. For two probability distributions and over a finite space , the total variation distance is defined by
The total variation distance satisfies [18]
Let and denote the transition matrices of two ergodic Markov chains on the same state space with stationary distributions and , and define . The norm of the matrix is defined by
where is the th entry of . The group inverse of , denoted , is the unique square matrix satisfying
Let , where denotes the identity. It is known [21] that , where
denotes the vector with all
’s.The following result gives a bound on the distance between and as a function of the perturbation .
Lemma 1 ([26])
The total variation distance between the stationary distributions and of Markov chains with transition matrices and , respectively, satisfies
where is the group inverse of .
IiiB Background on Submodularity and Matroids
A function is submodular [12] if, for any sets and any element , we have
The function is monotone if for . A matroid is defined as follows.
Definition 1
Let denote a finite set and let be a collection of subsets of . Then is a matroid if (i) , (ii) and implies that , and (iii) for any with , there exists such that .
The rank of a matroid is equal to the cardinality of the maximal independent set in . A matroid basis is a maximal independent set in , i.e., a set such that and for all . A partition matroid is defined by a partition of the set into , where for . A set is independent in the partition matroid if, for all , .
The following result leads to optimality bounds on local search algorithms for submodular maximization.
Lemma 2 ([11])
Suppose that is a basis of matroid , is a monotone submodular function, and there exists such that, for any and with ,
Then we have for any , where is the rank of .
Iv Problem Formulation and Proposed Algorithm
In this section, we first present our problem formulation, followed by the proposed algorithm.
Iva System Model and Problem Formulation
We consider a Markov Decision Process (MDP) ^{1}^{1}1In this paper, we use MDP and MMDP interchangeably. defined by a tuple , where and denote the state and action spaces, respectively. The transition probability function denotes the probability of transitioning from state to state after taking action . The reward function defines the reward from taking action in state . The goal is to maximize the average reward per stage, denoted by , where and denote the state and action at time .
The state and action spaces of can be decomposed between a set of agents and an underlying environment. We write to denote the state space of the environment, the state space of agent , and to denote the action space of agent . We then have and . Throughout the paper, we use to denote a state in and to denote a tuple of state values for the agents excluding agent . Similarly, we denote an action in as and let denote a tuple of actions for the agents excluding agent .
We assume that the reward function is a monotone and submodular function of the agent actions for any fixed state value. Define and . We observe that the size of the state space may grow exponentially in the number of agents, increasing the complexity of computing the transition probabilities and optimal policy. A problem structure that is known to simplify these computations is transition independence, defined as follows.
Definition 2 ([5])
An MDP is transition independent (TI) if there exist transition functions and , , such that
Transition independence implies that the state transitions of each agent depend only on that agent’s states and actions, thus enabling factorization of the MDP and reducing the complexity of simulating and solving the MDP. We observe, however, that the TI property does not hold for general MDPs, and introduce the following relaxation.
Definition 3
Let
An MDP is transition dependent (or dependent) if
(1) 
Intuitively, the dependent property implies that the impact of the other agents on agent ’s transition probabilities is bounded by . When , our definition of dependent MDP reduces to TIMDP defined in [5].
The agents choose their actions at each time step by following a policy, which maps the current and previous state values to the action at time . We focus on stationary policies of the form , which only incorporate the current state value when choosing the next action. We assume that, for any stationary policy, the resulting induced Markov chain is ergodic. We let denote the maximum value of the ergodic number over all stationary policies. Furthermore, to reduce the complexity of storing the policy at the agents, each agent follows a local policy . Hence, each agent’s actions only depend on the environment and the agent’s internal state. Any policy with this structure can be expressed as , where denotes the policy of agent . We let denote the set of policies of the agents excluding .
The problem is formulated as follows. Define the value function for policies by
When it is not clear from the context, we let denote the average reward from policy on MDP . The goal is then to select that maximizes . As a preliminary, we say that a policy is locally optimal if, for all , for any agent policy . We say that is locally optimal if for all and all policies for agent .
IvB Proposed Algorithm
To motivate our approach, we first map the problem to a combinatorial optimization problem as in
[16]. Consider the finite set of agent policies, which we write as , where denotes the set of possible local policies for agent . The collection is formally defined as the set of functions of the form . The problem of selecting an optimal collection of local policies can therefore be mapped to the combinatorial optimization problem(2) 
In (2), the policy is interpreted as a set, in which each element represents the policy of a single agent. Since there is exactly one policy per agent, the constraint is a partition matroid constraint. The following proposition provides additional structure for a special case of (2).
Proposition 1 ([16])
If the MDP is transitionindependent and the rewards are monotone and submodular in for any fixed state , then the function is monotone and submodular in .
Proposition 1
implies that, when the MDP is TI and reward function is submodular, efficient heuristic algorithms will lead to provable optimality guarantees. One such algorithm is local search, which attempts to improve the current set of policies
by searching for policies satisfying . If no such policy can be found, then the policy is a local optimum of (2), and hence Lemma 2 can be used to obtain a optimality guarantee.The difficulty in the above approach arises from the fact that the number of possible policies for each agent grows exponentially in the number of states . Hence, instead of brute force search, the approach of [16] leverages the fact that, in a TIMDP in which all other agents adopt stationary policies, the optimal policy for agent can be obtained as the solution to an MDP. This MDP has reward function and transition matrix, respectively, given by
and , where denotes the stationary distribution of the joint states under the chosen policies. Using this property, an optimal policy for agent , conditioned on the policies of the other agents, can be obtained by solving this equivalent MDP.
We now present our proposed approach, which generalizes this idea from TI to nonTI MDPs. Our algorithm is initialized as follows. Choose a parameter . First, for each agent , choose a probability distribution over the states in and a policy . Next, define a local transition function for each agent as
(3) 
where the expectation is over from distribution . We then choose policies arbitrarily, and set as the stationary distribution on the state induced by the policy under transition function .
At the th iteration of the algorithm, each agent updates its policy while the other agent policies are held constant. The optimal policy of agent is approximated by the solution to a local MDP denoted , where
(4) 
A policy is then obtained as the optimal policy for . If , then set equal to , compute as the stationary distribution of under policy , and increment . The algorithm terminates when no agent modifies its policy in an iteration .
Pseudocode for this algorithm is given in Algorithm 1.
V Optimality Analysis
We analyze the optimality in three stages. First, we define a TIMDP, and prove that the policies returned by our algorithm are within a provable bound of a local optimum of the TIMDP. We then use submodularity of the reward function to prove that the local optimal policies provide a constantfactor approximation to the global optimum on the TIMDP. Finally, we prove that the approximate global optimum on the TIMDP is also an approximate global optimum for the original MDP.
We define to be the policy returned by our algorithm. Let be the joint stationary distribution of the agents in the MDP arising from these policies. We construct a TIMDP . The transition function is defined by
The reward function . We observe that, by construction, if is the local MDP obtained at the last iteration of Algorithm 1, then for all .
Lemma 3
The policy returned by Algorithm 1 is a local optimum for MDP .
Proof: By construction, the algorithm terminates if, for all , there is no policy such that
implying that is a local optimum of .
Based on the local optimality, we can derive the following optimality bound for .
Lemma 4
Let denote the optimal local policies for MDP . Then
Lemma 4 provides an optimality bound with respect to the TIMDP . Next, we leverage the dependent property to derive an optimality bound with respect to the given MDP . We start with the following preliminary results.
Lemma 5
For any state and policy , , where and are the transition matrices corresponding to and , respectively.
A proof can be found in the technical appendix. We next exploit the bound in Lemma 5 to approximate the gap between the and .
Lemma 6
Suppose that is dependent MDP. For any policy , .
Proof: Let and denote the stationary distributions induced by policy on MDPs and . We have
By Lemma 1,
giving the desired result.
Combining these derivations yields the following.
Theorem 2
Let be an dependent MDP and and denote the output of Algorithm 1 and the optimal policies, respectively. Then
Vi Simulation
In this section, we present our simulation results. We consider two scenarios, namely, multirobot control and a multiagent patrolling example. Both simulations are implemented using Python 3.8.5 on a workstation with Intel(R) Xeon(R) W2145 CPU @ 3.70GHz processor and GB memory. Given the transition and reward matrices, an MDP is solved using Python MDP Toolbox [10].
Via Multirobot Control
ViA1 Simulation Settings
We consider a set of robots whose goal is to cover maximum number of targets from a set of fixed targets positioned in a grid environment. Robots initially start from a fixed set of grid locations. At each time , each robot can move one grid position horizontally or vertically from the current grid position by taking some action . For each robot , let denotes the set of grid positions that can be reached from the current grid position under each action . is the grid position corresponds to . Note that , if is not a valid action (e.g., at the bottom left corner of the grid are not valid actions). Let be the transition probability function associated with robot and be the probability of robot transitions from a grid position to under action . Let be the number of robots at after taking actions . Then,
where . Uncertainty in the environment is modeled by the parameter and the transition dependencies between the robots are modeled by the parameter .
We model the multirobot control problem as an MDP . The state space , where for all . The action space . The transition probability matrix is denoted as . The probability of transitioning from a state to some target state by taking action is given as . The submodular reward of is given by , where is the number of robots visiting target following a joint policy at a state state . The parameter captures the effectiveness of having agents at target . Similar submodular reward has been used in [16].
ViA2 Simulation Results
We use Algorithm 1 to find a set of policies for the robots that maximizes their average reward. Parameters , , , and are set to , , and , respectively. The transition probability for each agent is calculated by evaluating (3) over samples of actions and states . We test Algorithm 1 under different sizes of grids , number of agents , number of targets , and initial locations of the agents. For each setting, we execute Algorithm 1 for trials, and take the average over the trials as the performance of Algorithm 1. We compare Algorithm 1 with the a global MDP approach, which calculates the optimal values using relative value iteration algorithm [7] provided by Python MDP Toolbox [10] on MDP . Note that the state space of MDP is exponential in grid size and number of agents . The action space is exponential in . The total size of all local MDPs for all agents constructed using (3) and (4) grows linearly with respect to the number of agents. As the number of agents and/or grid size of examples increase, the global MDP approach incurs a heavy memory computation overhead to the system. For an example, when robots trying to reach target in a grid, it requires around Gb of memory to compute the solution using global MDP approach, while our proposed approach only requires Mb memory to calculate the policy. Therefore, the global MDP approach is not computationally efficient for larger example sizes.
Table I shows the simulation results obtained using Algorithm 1 and the global MDP approach. We observe that our proposed approach provides more than optimality with respect to the average reward achieved by the agents for all settings, while incuring comparable run time when the example size is small and much less run time when the example sizes increase. Particularly, as the number of agents and the grid size increase, e.g., two agents, two targets, and grid, our proposed approach maintains more than optimality with only run time, compared with the global MDP approach. Hence, our proposed approach shows scalability to mutliagent scenarios with dependent property.
ViB MultiAgent Patrolling Example
ViB1 Simulation Settings
We implement our proposed approach on a patrolling example with multiple patrol units capturing multiple adversaries among a finite set of locations as an evaluation. At each time, each patrol unit can be deployed at some location .
The objective of the patrol units is to compute a policy to patrol the locations to capture the adversaries. Each adversary is assumed to follow a heuristic policy as follows. If there exists no patrol unit that is deployed at the adversary’s target location , then with probability the adversary transitions to location and with probability the adversary transitions to some other location . If the adversary’s target location is being patrolled by some unit, then with probability the adversary transitions to location , and with probability the adversary transitions to some other location . The adversaries’ policies are assumed to be known to the patrol units.
The patrolling example is modeled by an MDP , where is the set of joint locations of the patrol units and adversaries, with is the set of locations at which patrol unit is deployed and is the set of locations where adversary can be located. The action set of each patrol unit and adversary is . Thus the joint action space . We shall remark that the joint action space is defined as the Cartesian product of the action spaces of all patrol units and adversaries so that we can accurately capture the transition probabilities of all patrol units and adversaries. We solve the problem by optimizing over the joint action space of all the patrol units, since the adversaries’ policies are known to the patrol units. For each patrol unit and adversary , we let
Here parameters capture the transition uncertainties, parameter captures the transition dependency among the patrol units, and captures the adversaries’ reactions to the patrol units’ actions. Let and be two joint locations. Then , where . We define the reward function for each and as , where , where is the effectiveness parameter, and are the number of patrol units and adversaries that are in location corresponding to , respectively.
ViB2 Simulation Results
We use Algorithm 1 to compute the policies for the patrol units, given the adversaries’ policies. Parameters , , , , and are set as , , , , and , respectively. We calculate the transition probability of each patrol unit by evaluating (3) over all possible actions and all possible states of all adversaries and all the other patrol units except . We implement our proposed approach under various settings by varying the number of patrol units, adversaries, and locations. For each setting, we run Algorithm 1 for trials and take the average over the trials as its performance. We compare Algorithm 1 with the global MDP approach that implements relative value iteration algorithm on MDP .
Table II shows the simulation results obtained using Algorithm 1 and the global MDP approach. We observe that our proposed approach achieves more than of optimality with respect to the average reward, while incuring at most of run time over all settings. By comparing the first row, 4th row, 6th row, and 7th row in Table II, we have that the run time advantage provided by our proposed approach remains when we increase the number of locations. By comparing the first three rows in Table II, we observe that our proposed approach remains close to optimal average reward (more than ), but scales better when the number of agents including patrol units and adversaries increases.
Vii Conclusions
This paper presented an approach for selecting decentralized policies for transition dependent MMDPs. We proposed a property of transition dependence, which we defined based on the maximum total variation distances for each agent’s state transitions conditioned on the actions of the other agents. In the special case of , the MMDP is transitionindependent. We developed a local search algorithm that runs in polynomial time in the number of agents. We derived optimality bounds on the policies obtained from our algorithm as a function of . Our results were verified through numerical studies on a patrolling example and a multirobot control scenario.
References
 [1] (1979) Structural results for partially observable Markov decision processes. Operations Research 27 (5), pp. 1041–1053. Cited by: §II.
 [2] (2013) Decentralized control of partially observable Markov decision processes. In 52nd IEEE Conference on Decision and Control, pp. 2398–2405. Cited by: §II.

[3]
(2019)
Modeling and planning with macroactions in decentralized POMDPs.
Journal of Artificial Intelligence Research
64, pp. 817–859. Cited by: §II.  [4] (2014) Streaming submodular maximization: massive data summarization on the fly. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 671–680. Cited by: §II.
 [5] (2004) Solving transition independent decentralized Markov decision processes. Journal of Artificial Intelligence Research 22, pp. 423–455. Cited by: §I, §II, §IVA, Definition 2.
 [6] (2004) Decentralized Markov decision processes with eventdriven interactions. In Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent SystemsVolume 1, pp. 302–309. Cited by: §II.
 [7] (1995) Dynamic programming and optimal control. Vol. 1, Athena Scientific Belmont, MA. Cited by: §VIA2.
 [8] (2005) A polynomial algorithm for decentralized Markov decision processes with temporal constraints. In Proceedings of the fourth International Joint conference on Autonomous Agents and Multiagent Systems, pp. 963–969. Cited by: §II.
 [9] (2014) Online submodular maximization with preemption. In Proceedings of the TwentySixth Annual ACMSIAM Symposium on Discrete Algorithms, pp. 1202–1216. Cited by: §II.
 [10] (2015) Markov decision process (MDP) toolbox for Python. Note: https://pymdptoolbox.readthedocs.io/en/latest/ Cited by: §VIA2, TABLE I, TABLE II, §VI.
 [11] (1978) An analysis of approximations for maximizing submodular set functions—II. In Polyhedral Combinatorics, pp. 73–87. Cited by: §II, Lemma 2.
 [12] (2005) Submodular functions and optimization. Elsevier. Cited by: §IIIB.

[13]
(2011)
Adaptive submodularity: theory and applications in active learning and stochastic optimization
. Journal of Artificial Intelligence Research 42, pp. 427–486. Cited by: §II.  [14] (2003) Efficient solution algorithms for factored MDPs. Journal of Artificial Intelligence Research 19, pp. 399–468. Cited by: §II.
 [15] (2019) Successor features based multiagent RL for eventbased decentralized MDPs. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6054–6061. Cited by: §II.
 [16] (2017) Decentralized planning in stochastic environments with submodular rewards. Proceedings of the ThirtyFirst AAAI Conference on Artificial Intelligence. Cited by: §I, §II, §II, §IVB, §IVB, §VIA1, Proposition 1.

[17]
(2000)
An algorithm for distributed reinforcement learning in cooperative multiagent systems
. InProceedings of the Seventeenth International Conference on Machine Learning
, Cited by: §I, §II.  [18] (2017) Markov Chains and Mixing Times. Vol. 107, American Mathematical Soc.. Cited by: §IIIA.
 [19] (1994) Markov games as a framework for multiagent reinforcement learning. In Machine Learning Proceedings, pp. 157–163. Cited by: §II.
 [20] (2001) Valuefunction reinforcement learning in Markov games. Cognitive Systems Research 2 (1), pp. 55–66. Cited by: §I, §II.
 [21] (1975) The role of the group generalized inverse in the theory of finite Markov chains. SIAM Review 17 (3), pp. 443–464. Cited by: §IIIA.
 [22] (2002) Game theory and decision theory in multiagent systems. Autonomous Agents and MultiAgent Systems 5 (3), pp. 243–254. Cited by: §II.
 [23] (2019) Exploiting fast decaying and locality in multiagent MDP with tree dependence structure. In 2019 IEEE 58th Conference on Decision and Control (CDC), pp. 6479–6486. Cited by: §II.
 [24] (2015) Exploiting submodular value functions for faster dynamic sensor selection. In Proceedings of the TwentyNinth AAAI Conference on Artificial Intelligence, pp. 3356–3363. Cited by: §II.
 [25] (2008) Introduction to information retrieval. Vol. 39, Cambridge University Press Cambridge. Cited by: Theorem 1.
 [26] (1991) Sensitivity analysis, ergodicity coefficients, and rank—one updates. Numerical Solution of Markov chains 8, pp. 121. Cited by: Lemma 1.
 [27] (2003) Reinforcement learning to play an optimal Nash equilibrium in team Markov games. In Advances in Neural Information Processing Systems, pp. 1603–1610. Cited by: §I, §II.
 [28] (2019) Multiagent reinforcement learning: a selective overview of theories and algorithms. arXiv preprint arXiv:1911.10635. Cited by: §II.
Appendix
Proof of Lemma 5: The total variation distance between these distributions is given by
With slight abuse of notation, we let and denote the probability distributions of when the agents follow policy in MDPs and . We let (resp. ) denote the probability that when and in MDP (resp. ). For a state , we let .
For any , we define the sets for to denote the set of tuples of that can be completed to an element of . We define by
i.e., can be completed to an element of . We can then write the probability as
Comments
There are no comments yet.