# Scalable Planning in Multi-Agent MDPs

Multi-agent Markov Decision Processes (MMDPs) arise in a variety of applications including target tracking, control of multi-robot swarms, and multiplayer games. A key challenge in MMDPs is that the state and action spaces grow exponentially in the number of agents, making computation of an optimal policy intractable for medium- to large-scale problems. One property that has been exploited to mitigate this complexity is transition independence, in which each agent's transition probabilities are independent of the states and actions of other agents. Transition independence enables factorization of the MMDP and computation of local agent policies but does not hold for arbitrary MMDPs. In this paper, we propose an approximate transition dependence property, called δ-transition dependence, and develop a metric for quantifying how far an MMDP deviates from transition independence. Our definition of δ-transition dependence recovers transition independence as a special case when δ is zero. We develop an algorithm that runs in time polynomial in the number of agents and achieves a provable bound on the global optimum when the reward functions are monotone increasing and submodular in the agent actions. We evaluate our approach on two case studies, namely, multi-robot control and a multi-agent patrolling example.


## I Introduction

A variety of distributed planning and decision-making problems, including multiplayer games, search and rescue, and infrastructure monitoring, can be modeled as Multi-agent Markov Decision Processes (MMDPs). In such processes, the state transitions and rewards are determined by the joint actions of all of the agents. While there is a substantial body of work on computing optimal joint policies [27, 20, 17], a key challenge is that the total number of states and actions grows exponentially in the number of agents. This increases the complexity of computing an optimal policy, as well as of storing and implementing the policy on the agents.

One approach to mitigating this complexity is to identify additional problem structures. One such structure is transition-independence (TI) [5]. In a TI-MDP, the state transitions of an agent are independent of the states and actions of the other agents. Such MDPs may arise, for example, in multi-robot scenarios where the motion of each robot is independent of the others. TI-MDPs can be approximately solved by factoring into multiple MDPs, one for each agent, and then obtaining a local policy, in which each agent’s next action depends only on that agent’s current state. When the TI property holds and the MDP possesses additional structure, such as submodularity, this approach may yield scalable algorithms for computing near-optimal policies [16].

The TI property, however, does not hold for general MMDPs when there is coupling between the agents. Coupling occurs when two agents must cooperate to reach a particular state, or when the actions of agents may interfere with each other. In this case, the existing results providing near-optimality do not hold, and at present there are no scalable algorithms for local policy selection in non-TI-MMDPs.

In this paper, we investigate the problem of computing approximately optimal local policies for non-TI MMDPs. We propose δ-transition dependence, which captures the deviation of the MMDP from transition independence. We make the following contributions:

• We define the δ-transition dependence property, in which the parameter δ characterizes the maximum change in the probability distribution of any agent due to changes in the states and actions of the other agents.

• We propose a local search algorithm for computing local policies of the agents, in which each agent computes an optimal policy assuming that the remaining agents follow given, fixed policies.

• We prove that, when the reward functions are monotone and submodular in the agent actions, the proposed algorithm achieves a provable optimality bound as a function of the dependence parameter and the ergodicity of the MMDP.

• We evaluate our approach on two numerical case studies, namely, a patrolling example and a multi-robot target tracking scenario. On average, our approach achieves -optimality in the multi-robot scenario and -optimality in the multi-agent patrolling example, while requiring 10-20% of the runtime of an exact optimal algorithm.

The paper is organized as follows. Section II presents the related work. Section III contains preliminary results. Section IV presents our problem formulation and algorithm. Section V contains optimality and complexity analyses. Section VI presents simulation results. Section VII concludes the paper.

## II Related Work

MDPs have been extensively studied as a framework for multi-agent planning and decision-making [19, 22]. Most existing works focus on selecting an optimal joint strategy for the agents, which maps each global system state to an action for each agent [27, 20, 17]. These methods can be shown to converge to a locally optimal policy, in which no agent can improve the overall reward by unilaterally changing its policy. These joint decision-making problems can be viewed as special cases of multi-agent games in which all agents have a shared reward [28]. These approaches, however, suffer from a “curse of dimensionality,” in which the state space grows exponentially in the number of agents, and hence do not scale well to large numbers of agents.

Transition-independent MDPs (TI-MDPs) provide problem structure that can be exploited to speed up the computation [5]. In a TI-MDP, each agent’s transition probabilities are independent of the actions and states of the other agents, allowing the MDP to be factored and approximately solved [14, 6, 8, 15]. Extensions of the TI-MDP approach to POMDPs were presented in [2, 3]. A greedy algorithm for TI-MDPs with submodular rewards was proposed in [16]. The goal of the present paper is to extend these works to non-TI MDPs by relaxing transition independence, enabling optimality bounds for a broader class of MDPs. A local policy algorithm was proposed in [23] that leverages a fast-decaying property that is distinct from the approximate transition independence that we consider.

Our optimality bounds rely on submodularity of the reward functions. Submodularity is a diminishing-returns property of discrete functions that has been studied in a variety of contexts, including offline [11], online [9], adaptive [13], and streaming [4] submodularity. Submodular properties were leveraged to improve the optimality bounds of multi-agent planning [16], sensor scheduling [24], and solving POMDPs [1]. Submodularity for transition-dependent MDPs, however, has not been explored.

## III Background and Preliminaries

This section gives background on perturbations of Markov chains, as well as the definition and relevant properties of submodularity.

### III-A Perturbations of Markov Chains

A finite-state, discrete-time Markov chain is a stochastic process defined over a finite set $S$, in which the next state is chosen according to a probability distribution $P(s,\cdot)$, where $s$ is the current state. A Markov chain over $S$ is defined by its transition matrix $P$, in which $P(s,s')$ represents the probability of a transition from state $s$ to state $s'$. The following theorem describes the steady-state behavior of a class of Markov chains.

###### Theorem 1 (Ergodic Theorem [25])

Consider a Markov chain with transition matrix $P$. Suppose there exists $k$ such that $P^k(s,s') > 0$ for all $s$ and $s'$. Then there is a probability distribution $\pi$ over the state space such that, for any distribution over the initial state,

$$\lim_{t\to\infty}\frac{\eta(s,t)}{t} = \pi(s),$$

where $\eta(s,t)$ is the number of times the Markov chain reaches state $s$ in the first $t$ time steps. Moreover, $\pi$ is the unique left eigenvector of $P$ with eigenvalue $1$.
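The statement of Theorem 1 can be checked numerically on a small chain. The sketch below is illustrative only: the 3-state matrix `P`, the helper name `stationary_distribution`, and the simulation length are assumptions, not taken from the paper. It computes the stationary distribution as the normalized left eigenvector for eigenvalue 1 and compares it with simulated visit frequencies.

```python
import numpy as np

def stationary_distribution(P):
    """Return pi with pi P = pi and sum(pi) = 1 for an ergodic row-stochastic P."""
    n = P.shape[0]
    # Stack the balance equations (P^T - I) pi = 0 on top of the normalization row.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.concatenate([np.zeros(n), [1.0]])
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

# Hypothetical 3-state chain; all entries are positive, so the chain is ergodic.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.4, 0.5]])
pi = stationary_distribution(P)

# Simulate the chain and record the visit counts eta(s, t).
rng = np.random.default_rng(0)
T, state = 50_000, 0
visits = np.zeros(3)
for _ in range(T):
    visits[state] += 1
    state = rng.choice(3, p=P[state])
freq = visits / T  # empirical eta(s, T) / T, which should be close to pi
```

Starting the simulation from a different initial state should leave `freq` essentially unchanged, which is the independence from the initial distribution that the theorem asserts.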

A Markov chain satisfying the conditions of Theorem 1 is ergodic. The probability distribution $\pi$ defined in Theorem 1 is the stationary distribution of the chain. Intuitively, a Markov chain is ergodic if the relative frequency of reaching each state is independent of the initial state. The ergodicity coefficient of a matrix $P$ is defined by

$$\Lambda_1(P) = \frac{1}{2}\max_{i,j}\sum_k |P_{ik} - P_{jk}|.$$
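As a minimal sketch of this definition, the function below evaluates the ergodicity coefficient by comparing every pair of rows. The function name and the example matrices are illustrative assumptions.

```python
import numpy as np

def ergodicity_coefficient(P):
    """Lambda_1(P) = 0.5 * max over row pairs (i, j) of sum_k |P[i, k] - P[j, k]|."""
    P = np.asarray(P, dtype=float)
    # pairwise[i, j] holds the L1 distance between rows i and j.
    pairwise = np.abs(P[:, None, :] - P[None, :, :]).sum(axis=2)
    return 0.5 * pairwise.max()

identity = np.eye(2)                          # rows are as far apart as possible
rank_one = np.array([[0.3, 0.7], [0.3, 0.7]])  # identical rows
example = np.array([[0.5, 0.3, 0.2],
                    [0.2, 0.6, 0.2],
                    [0.1, 0.4, 0.5]])
```

For a row-stochastic matrix the coefficient lies in $[0,1]$: it is 0 when all rows coincide and 1 when two rows have disjoint support.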

We next state preliminary results on perturbations of ergodic Markov chains. First, we define the total variation distance between two probability distributions as follows. For two probability distributions $\mu$ and $\nu$ over a finite space $\Omega$, the total variation distance is defined by

$$\|\mu-\nu\|_{TV} \triangleq \max_{\Theta\subseteq\Omega}|\mu(\Theta)-\nu(\Theta)|.$$

The total variation distance satisfies [18]

$$\|\mu-\nu\|_{TV} = \frac{1}{2}\sum_{x\in\Omega}|\mu(x)-\nu(x)|.$$
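The equivalence of the two expressions can be verified by brute force on a small space. This is a sketch with made-up distributions `mu` and `nu`; the subset enumeration is exponential in $|\Omega|$ and is only meant as a check.

```python
from itertools import chain, combinations
import numpy as np

def tv_by_subsets(mu, nu):
    """Maximize |mu(Theta) - nu(Theta)| over all subsets Theta (exponential cost)."""
    idx = range(len(mu))
    subsets = chain.from_iterable(combinations(idx, r) for r in range(len(mu) + 1))
    return max(abs(sum(mu[i] - nu[i] for i in subset)) for subset in subsets)

def tv_half_l1(mu, nu):
    """Half of the L1 distance between the two probability vectors."""
    return 0.5 * np.abs(np.asarray(mu) - np.asarray(nu)).sum()

mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.3, 0.5]
```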

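Let $P$ and $P'$ denote the transition matrices of two ergodic Markov chains on the same state space with stationary distributions $\mu$ and $\nu$, and define $\Delta = P - P'$. The $1$-norm of the matrix $\Delta$ is defined by

$$\|\Delta\|_1 = \max_i\left\{\sum_j|\Delta_{ij}|\right\},$$

where $\Delta_{ij}$ is the $(i,j)$-th entry of $\Delta$. The group inverse of $P$, denoted $P^{\#}$, is the unique square matrix satisfying

$$PP^{\#}P = P,\quad P^{\#}PP^{\#} = P^{\#},\quad P^{\#}P = PP^{\#}.$$

Let $Z = I - P$, where $I$ denotes the identity. It is known [21] that $Z^{\#} = (Z + \mathbf{1}\mu^{\top})^{-1} - \mathbf{1}\mu^{\top}$, where $\mathbf{1}$ denotes the vector with all $1$’s.

The following result gives a bound on the distance between $\mu$ and $\nu$ as a function of the perturbation $\Delta$.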

###### Lemma 1 ([26])

The total variation distance between the stationary distributions $\mu$ and $\nu$ of Markov chains with transition matrices $P$ and $P'$, respectively, satisfies

$$\|\mu-\nu\|_{TV} \leq \frac{1}{2}\Lambda_1(Z^{\#})\,\|P-P'\|_1,$$

where $Z^{\#}$ is the group inverse of $Z = I - P$.
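The bound of Lemma 1 can be sanity-checked numerically. The sketch below makes several assumptions: the two 3-state matrices are invented, the helper names are hypothetical, and the group inverse of $I-P$ is computed through the standard identity $(I - P + \mathbf{1}\pi^{\top})^{-1} - \mathbf{1}\pi^{\top}$ for ergodic chains, with $\pi$ the stationary distribution of $P$.

```python
import numpy as np

def stationary_distribution(P):
    """Solve pi P = pi with sum(pi) = 1 for an ergodic row-stochastic matrix."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.concatenate([np.zeros(n), [1.0]])
    return np.linalg.lstsq(A, b, rcond=None)[0]

def ergodicity_coefficient(B):
    """0.5 * max over row pairs of the L1 distance between the rows of B."""
    return 0.5 * np.abs(B[:, None, :] - B[None, :, :]).sum(axis=2).max()

def group_inverse_of_I_minus_P(P):
    """Group inverse of Z = I - P via (Z + 1 pi^T)^{-1} - 1 pi^T."""
    n = P.shape[0]
    W = np.outer(np.ones(n), stationary_distribution(P))  # every row equals pi
    return np.linalg.inv(np.eye(n) - P + W) - W

# Hypothetical chain and a small perturbation of it (both row-stochastic).
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.4, 0.5]])
P_prime = np.array([[0.45, 0.35, 0.20],
                    [0.20, 0.55, 0.25],
                    [0.15, 0.40, 0.45]])

mu, nu = stationary_distribution(P), stationary_distribution(P_prime)
tv = 0.5 * np.abs(mu - nu).sum()
Z = np.eye(3) - P
Z_sharp = group_inverse_of_I_minus_P(P)
norm = np.abs(P - P_prime).sum(axis=1).max()   # max absolute row sum, as in the text
bound = 0.5 * ergodicity_coefficient(Z_sharp) * norm
```

On this example `tv` stays below `bound`, and `Z @ Z_sharp @ Z` reproduces `Z`, which is the first defining property of the group inverse.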

### III-B Background on Submodularity and Matroids

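A function $f: 2^V \rightarrow \mathbb{R}$ is submodular [12] if, for any sets $S \subseteq T \subseteq V$ and any element $v \in V \setminus T$, we have

$$f(S\cup\{v\})-f(S) \geq f(T\cup\{v\})-f(T).$$

The function $f$ is monotone if $f(S) \leq f(T)$ for $S \subseteq T$. A matroid is defined as follows.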
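For small ground sets, both properties can be checked exhaustively. The sketch below uses a hypothetical coverage function (a standard example of a monotone submodular function) and a squared-cardinality function as a counterexample; all names and data are illustrative.

```python
from itertools import combinations

def powerset(items):
    """Yield every subset of items as a frozenset."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def is_monotone_submodular(f, ground):
    """Exhaustively test monotonicity and diminishing returns (tiny sets only)."""
    ground = frozenset(ground)
    for T in powerset(ground):
        for S in powerset(T):            # S ranges over the subsets of T
            if f(S) > f(T):
                return False             # monotonicity violated
            for v in ground - T:
                if f(S | {v}) - f(S) < f(T | {v}) - f(T):
                    return False         # diminishing returns violated
    return True

# Hypothetical coverage instance: each element covers a set of targets.
coverage = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4, 5}}

def covered(S):
    """Number of distinct targets covered by the elements in S."""
    return len(set().union(*(coverage[v] for v in S)))

def squared_size(S):
    """Monotone but not submodular: marginal gains grow with |S|."""
    return len(S) ** 2
```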

###### Definition 1

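Let $V$ denote a finite set and let $\mathcal{I}$ be a collection of subsets of $V$. Then $(V,\mathcal{I})$ is a matroid if (i) $\emptyset \in \mathcal{I}$, (ii) $B \in \mathcal{I}$ and $A \subseteq B$ implies that $A \in \mathcal{I}$, and (iii) for any $A, B \in \mathcal{I}$ with $|A| < |B|$, there exists $v \in B \setminus A$ such that $A \cup \{v\} \in \mathcal{I}$.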

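The rank of a matroid is equal to the cardinality of a maximal independent set in $\mathcal{I}$. A matroid basis is a maximal independent set in $\mathcal{I}$, i.e., a set $S$ such that $S \in \mathcal{I}$ and $S \cup \{v\} \notin \mathcal{I}$ for all $v \in V \setminus S$. A partition matroid is defined by a partition of the set $V$ into $V_1,\ldots,V_m$, where $V_i \cap V_j = \emptyset$ for $i \neq j$. A set $S$ is independent in the partition matroid if, for all $i$, $|S \cap V_i| \leq 1$.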
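A partition matroid with one element allowed per block can be sketched in a few lines. The ground set below, made of hypothetical (agent, policy label) pairs, and the function names are assumptions for illustration; the capacity of one per block mirrors the one-policy-per-agent constraint used later in the paper.

```python
def is_independent(S, blocks):
    """S is independent if it contains at most one element of each block."""
    return all(len(S & block) <= 1 for block in blocks)

def is_basis(S, blocks):
    """With nonempty blocks, a basis contains exactly one element per block."""
    return all(len(S & block) == 1 for block in blocks)

# Hypothetical ground set: (agent, policy label) pairs, partitioned by agent.
blocks = [
    {("agent1", "p"), ("agent1", "q")},
    {("agent2", "p"), ("agent2", "q")},
]
one_policy = {("agent1", "p")}
two_for_same_agent = {("agent1", "p"), ("agent1", "q")}
one_per_agent = {("agent1", "q"), ("agent2", "p")}
```

Here the rank equals the number of blocks, since a basis picks exactly one element from each.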

The following result leads to optimality bounds on local search algorithms for submodular maximization.

###### Lemma 2 ([11])

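Suppose that $S$ is a basis of a matroid $(V,\mathcal{I})$, $f$ is a monotone submodular function, and there exists $\epsilon > 0$ such that, for any $u \in S$ and $v \notin S$ with $S\cup\{v\}\setminus\{u\} \in \mathcal{I}$,

$$f(S) \geq \frac{1}{1+\epsilon}f(S\cup\{v\}\setminus\{u\}).$$

Then we have $(2+\epsilon r)f(S) \geq f(T)$ for any $T \in \mathcal{I}$, where $r$ is the rank of the matroid.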

## IV Problem Formulation and Proposed Algorithm

In this section, we first present our problem formulation, followed by the proposed algorithm.

### IV-A System Model and Problem Formulation

We consider a Markov Decision Process (MDP; in this paper, we use MDP and MMDP interchangeably) defined by a tuple $\mathcal{M} = (S, A, P, R)$, where $S$ and $A$ denote the state and action spaces, respectively. The transition probability function $P(s,a,s')$ denotes the probability of transitioning from state $s$ to state $s'$ after taking action $a$. The reward function $R(s,a)$ defines the reward from taking action $a$ in state $s$. The goal is to maximize the average reward per stage, given by $\lim_{T\to\infty}\mathbf{E}\left\{\frac{1}{T}\sum_{t=0}^{T-1}R(s^t,a^t)\right\}$, where $s^t$ and $a^t$ denote the state and action at time $t$.

The state and action spaces of $\mathcal{M}$ can be decomposed between a set of $m$ agents and an underlying environment. We write $S_0$ to denote the state space of the environment, $S_i$ the state space of agent $i$, and $A_i$ to denote the action space of agent $i$. We then have $S = S_0 \times S_1 \times \cdots \times S_m$ and $A = A_1 \times \cdots \times A_m$. Throughout the paper, we use $s_i$ to denote a state in $S_i$ and $s_{-i}$ to denote a tuple of state values for the agents excluding agent $i$. Similarly, we denote an action in $A_i$ as $a_i$ and let $a_{-i}$ denote a tuple of actions for the agents excluding agent $i$.

We assume that the reward function $R(s,a)$ is a monotone and submodular function of the agent actions for any fixed state value. Define $R_{\max} = \max_{s,a}R(s,a)$ and $R_{\min} = \min_{s,a}R(s,a)$. We observe that the size of the state space may grow exponentially in the number of agents, increasing the complexity of computing the transition probabilities and optimal policy. A problem structure that is known to simplify these computations is transition independence, defined as follows.

###### Definition 2 ([5])

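An MDP $\mathcal{M}$ is transition independent (TI) if there exist transition functions $P_0: S_0 \times S_0 \rightarrow [0,1]$ and $P_i: S_i \times A_i \times S_i \rightarrow [0,1]$, $i=1,\ldots,m$, such that

$$P(\{s_0,\ldots,s_m\},\{a_1,\ldots,a_m\},\{s_0',\ldots,s_m'\}) = P_0(s_0,s_0')\prod_{i=1}^{m}P_i(s_i,a_i,s_i').$$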

Transition independence implies that the state transitions of each agent depend only on that agent’s states and actions, thus enabling factorization of the MDP and reducing the complexity of simulating and solving the MDP. We observe, however, that the TI property does not hold for general MDPs, and introduce the following relaxation.

###### Definition 3

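Let

$$\mu_i(s_i,a_i,s_{-i},s_{-i}',a_{-i}) = \Pr\left(s_i^{t+1} = \cdot \mid s^t = \{s_i,s_{-i}\},\ a^t = \{a_i,a_{-i}\},\ s_{-i}^{t+1} = s_{-i}'\right).$$

An MDP is $\delta$-transition dependent (or $\delta$-dependent) if

$$\max_{\substack{i,\,s_i,\,a_i \\ s_{-i},\,s_{-i}',\,s_{-i}'' \\ s_{-i}''',\,a_{-i},\,a_{-i}'}} \left\|\mu_i(s_i,a_i,s_{-i},s_{-i}',a_{-i}) - \mu_i(s_i,a_i,s_{-i}'',s_{-i}''',a_{-i}')\right\|_{TV} \leq \delta \tag{1}$$

Intuitively, the $\delta$-dependent property implies that the impact of the other agents on agent $i$’s transition probabilities is bounded by $\delta$. When $\delta = 0$, our definition of $\delta$-dependent MDP reduces to the TI-MDP defined in [5].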
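The dependence parameter can be computed directly from a joint transition tensor when the MDP is small. The sketch below handles only agent 1 of a two-agent MDP without an environment state (the definition takes a further maximum over agents); the tensor layout `P[s1, s2, a1, a2, s1', s2']`, the helper names, and the example kernels are assumptions made for illustration.

```python
import itertools
import numpy as np

def delta_for_agent1(P):
    """Largest TV distance between agent 1's conditional next-state distributions."""
    n1, n2, m1, m2 = P.shape[:4]
    delta = 0.0
    for s1, a1 in itertools.product(range(n1), range(m1)):
        dists = []
        # Enumerate every context (s2, a2, s2') of the other agent.
        for s2, a2, s2n in itertools.product(range(n2), range(m2), range(n2)):
            joint = P[s1, s2, a1, a2, :, s2n]
            if joint.sum() > 0:                 # conditioning event is possible
                dists.append(joint / joint.sum())
        for p, q in itertools.combinations(dists, 2):
            delta = max(delta, 0.5 * np.abs(p - q).sum())
    return delta

def move(c):
    """Kernel[a, s']: reach the target state a with probability c (two states)."""
    return np.array([[c, 1 - c], [1 - c, c]])

P2 = np.array([[[0.8, 0.2], [0.3, 0.7]],
               [[0.6, 0.4], [0.1, 0.9]]])        # P2[s2, a2, s2']
P1 = np.stack([move(0.9), move(0.9)])            # P1[s1, a1, s1'], ignores agent 2
P_ti = np.einsum('iak,jbl->ijabkl', P1, P2)      # transition-independent product

# Coupled variant: agent 1 succeeds less often when agent 2 shares its state.
P1_dep = np.array([[move(0.7 if s1 == s2 else 0.9) for s2 in range(2)]
                   for s1 in range(2)])          # P1_dep[s1, s2, a1, s1']
P_dep = np.einsum('ijak,jbl->ijabkl', P1_dep, P2)
```

For the product tensor `P_ti` the computed value is zero up to rounding, matching the TI special case, while the coupled tensor gives 0.2, the gap between success probabilities 0.9 and 0.7.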

The agents choose their actions at each time step by following a policy, which maps the current and previous state values to the action at time $t$. We focus on stationary policies of the form $\pi: S \rightarrow A$, which only incorporate the current state value when choosing the next action. We assume that, for any stationary policy, the resulting induced Markov chain is ergodic. We let $\bar{\lambda}$ denote the maximum value of the ergodicity coefficient $\Lambda_1(Z^{\#})$ of the induced chain over all stationary policies. Furthermore, to reduce the complexity of storing the policy at the agents, each agent follows a local policy $\pi_i: S_0 \times S_i \rightarrow A_i$. Hence, each agent’s actions only depend on the environment and the agent’s internal state. Any policy with this structure can be expressed as $\pi = \{\pi_1,\ldots,\pi_m\}$, where $\pi_i$ denotes the policy of agent $i$. We let $\pi_{-i}$ denote the set of policies of the agents excluding $i$.

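The problem is formulated as follows. Define the value function for policies $\pi$ by

$$J(\pi) = \lim_{T\to\infty}\mathbf{E}\left\{\frac{1}{T}\sum_{t=0}^{T-1}R(s^t,\pi(s^t))\right\}.$$

When it is not clear from the context, we let $J_{\mathcal{M}}(\pi)$ denote the average reward from policy $\pi$ on MDP $\mathcal{M}$. The goal is then to select $\pi$ that maximizes $J(\pi)$. As a preliminary, we say that a policy $\pi$ is locally optimal if, for all $i$, $J(\pi_i,\pi_{-i}) \geq J(\pi_i',\pi_{-i})$ for any agent policy $\pi_i'$. We say that $\pi$ is $\epsilon$-locally optimal if $(1+\epsilon)J(\pi_i,\pi_{-i}) \geq J(\pi_i',\pi_{-i})$ for all $i$ and all policies $\pi_i'$ for agent $i$.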
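Under the ergodicity assumption, the average reward of a stationary policy equals the stationary distribution of the induced chain weighted by the per-state reward. The sketch below uses that fact on a made-up two-state, two-action MDP; the array layout `P[s, a, s']`, `R[s, a]` and the function names are assumptions.

```python
import numpy as np

def stationary_distribution(P):
    """Solve q P = q with sum(q) = 1 for an ergodic row-stochastic matrix."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.concatenate([np.zeros(n), [1.0]])
    return np.linalg.lstsq(A, b, rcond=None)[0]

def average_reward(P, R, policy):
    """J(pi) = sum_s q(s) R(s, pi(s)), with q stationary for the induced chain."""
    states = np.arange(P.shape[0])
    P_pi = P[states, policy]        # induced chain, shape (n, n)
    r_pi = R[states, policy]        # reward collected in each state
    return float(stationary_distribution(P_pi) @ r_pi)

# Hypothetical MDP: action 0 drifts toward state 0, action 1 toward state 1.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.9, 0.1], [0.2, 0.8]]])   # P[s, a, s']
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])                  # R[s, a]
J_always_0 = average_reward(P, R, np.array([0, 0]))
J_always_1 = average_reward(P, R, np.array([1, 1]))
```

In this example the policy that always plays action 1 spends 80% of the time in state 1 and collects reward 2 there, giving an average reward of 1.6 against 0.9 for the other policy.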

### IV-B Proposed Algorithm

To motivate our approach, we first map the problem to a combinatorial optimization problem as in [16]. Consider the finite set of agent policies, which we write as $\Pi = \Pi_1 \cup \cdots \cup \Pi_m$, where $\Pi_i$ denotes the set of possible local policies for agent $i$. The collection $\Pi_i$ is formally defined as the set of functions of the form $\pi_i: S_0 \times S_i \rightarrow A_i$. The problem of selecting an optimal collection of local policies can therefore be mapped to the combinatorial optimization problem

$$\begin{array}{ll}\mbox{maximize} & J(\pi) \\ \mbox{s.t.} & \pi \subseteq \Pi,\ |\pi\cap\Pi_i| = 1\ \ \forall i=1,\ldots,m\end{array} \tag{2}$$

In (2), the policy $\pi$ is interpreted as a set, in which each element represents the policy of a single agent. Since there is exactly one policy per agent, the constraint is a partition matroid constraint. The following proposition provides additional structure for a special case of (2).

###### Proposition 1 ([16])

If the MDP is transition-independent and the rewards $R(s,a)$ are monotone and submodular in $a$ for any fixed state $s$, then the function $J(\pi)$ is monotone and submodular in $\pi$.

Proposition 1 implies that, when the MDP is TI and the reward function is submodular, efficient heuristic algorithms lead to provable optimality guarantees. One such algorithm is local search, which attempts to improve the current set of policies $\pi$ by searching for policies $\pi_i'$ satisfying $J(\pi_i',\pi_{-i}) > J(\pi)$. If no such policy can be found, then the policy is a local optimum of (2), and hence Lemma 2 can be used to obtain a $\frac{1}{2}$-optimality guarantee.

The difficulty in the above approach arises from the fact that the number of possible policies for each agent grows exponentially in the number of states. Hence, instead of brute force search, the approach of [16] leverages the fact that, in a TI-MDP in which all other agents adopt stationary policies, the optimal policy for agent $i$ can be obtained as the solution to an MDP. This MDP has reward function and transition matrix, respectively, given by

$$R_i(s_i,a_i) = \sum_{s_{-i}}q(s_{-i})\,R(\{s_i,s_{-i}\},\{a_i,\pi_{-i}(s_{-i})\})$$

and $P_i$, where $q$ denotes the stationary distribution of the joint states $s_{-i}$ under the chosen policies. Using this property, an optimal policy for agent $i$, conditioned on the policies of the other agents, can be obtained by solving this equivalent MDP.
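The averaging step that produces the single-agent reward can be sketched for two agents as follows. The reward tensor layout `R[s1, s2, a1, a2]`, the coverage-style reward, and the values of `q_other` and `pi_other` are illustrative assumptions rather than quantities from the paper.

```python
import numpy as np

def local_reward(R, q_other, pi_other):
    """R[s1, s2, a1, a2] -> R_1[s1, a1], averaging over agent 2's state."""
    n1, n2, m1, _ = R.shape
    R_local = np.zeros((n1, m1))
    for s2 in range(n2):
        # Weight the joint reward by q(s2), with agent 2 playing pi_other[s2].
        R_local += q_other[s2] * R[:, s2, :, pi_other[s2]]
    return R_local

# Hypothetical coverage reward: the number of distinct targets chosen by the agents.
R = np.zeros((2, 2, 2, 2))
for a1 in range(2):
    for a2 in range(2):
        R[:, :, a1, a2] = len({a1, a2})

q_other = np.array([0.25, 0.75])   # assumed stationary distribution of agent 2
pi_other = np.array([0, 1])        # agent 2 picks target 0 in state 0, target 1 in state 1
R_1 = local_reward(R, q_other, pi_other)
```

Because agent 2 spends most of its time choosing target 1, agent 1's averaged reward is higher for target 0 (1.75) than for target 1 (1.25), regardless of its own state.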

We now present our proposed approach, which generalizes this idea from TI to non-TI MDPs. Our algorithm is initialized as follows. Choose a parameter $\epsilon > 0$. First, for each agent $i$, choose a probability distribution $\mu_i$ over the states $s_{-i}$ of the other agents and a policy $\pi_{-i}$. Next, define a local transition function for each agent as

$$P_i(s_i,a_i,s_i') = \mathbf{E}_{\mu_i}\left(P(\{s_i,s_{-i}\},\{a_i,\pi_{-i}(s_{-i})\},\{s_i',s_{-i}'\})\right), \tag{3}$$

where the expectation is over the states of the other agents drawn from distribution $\mu_i$. We then choose policies $\pi_1,\ldots,\pi_m$ arbitrarily, and set $\hat{q}_i$ as the stationary distribution on the state $s_i$ induced by the policy $\pi_i$ under transition function $P_i$.

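At the $k$-th iteration of the algorithm, each agent updates its policy while the other agent policies are held constant. The optimal policy of agent $i$ is approximated by the solution to a local MDP denoted $\mathcal{M}_i^k$, with transition function $P_i$ and reward function

$$R_i^k(s_i,a_i) = \sum_{s_{-i}}\left[\left(\prod_{j\neq i}\hat{q}_j(s_j)\right)R(\{s_i,s_{-i}\},\{a_i,\pi_{-i}(s_{-i})\})\right]. \tag{4}$$

A policy $\pi_i'$ is then obtained as the optimal policy for $\mathcal{M}_i^k$. If $J_{\mathcal{M}_i^k}(\pi_i') \geq (1+\epsilon)J_{\mathcal{M}_i^k}(\pi_i)$, then set $\pi_i$ equal to $\pi_i'$, compute $\hat{q}_i$ as the stationary distribution of $s_i$ under policy $\pi_i'$, and increment $k$. The algorithm terminates when no agent modifies its policy in an iteration $k$.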

Pseudocode for this algorithm is given in Algorithm 1.
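Since the pseudocode is not reproduced here, the following is a minimal two-agent sketch of the local search described above, not the authors' Algorithm 1. It makes several assumptions that should be read as such: there is no environment state, the distributions $\mu_i$ are uniform, the other agent's next state is marginalized out in (3), the local transition functions are computed once at initialization, rewards are positive, and exhaustive policy enumeration stands in for an average-reward MDP solver such as relative value iteration. All function names and the example MDP are hypothetical.

```python
import itertools
import numpy as np

def stationary(P):
    """Stationary distribution of an ergodic row-stochastic matrix."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.concatenate([np.zeros(n), [1.0]])
    return np.linalg.lstsq(A, b, rcond=None)[0]

def avg_reward(P, R, pol):
    """Average reward of policy pol on a local MDP with P[s, a, s'] and R[s, a]."""
    idx = np.arange(P.shape[0])
    return float(stationary(P[idx, pol]) @ R[idx, pol])

def solve_local_mdp(P, R):
    """Brute-force stand-in for an average-reward MDP solver (tiny MDPs only)."""
    n, m = R.shape
    policies = (np.array(p) for p in itertools.product(range(m), repeat=n))
    return max(policies, key=lambda p: avg_reward(P, R, p))

def local_search(P, R, eps=0.05, max_iter=50):
    """P[s1,s2,a1,a2,s1',s2'], R[s1,s2,a1,a2]; returns one local policy per agent."""
    # Transposed views put agent i on the leading axes, so both agents share code.
    views = [(P, R), (P.transpose(1, 0, 3, 2, 5, 4), R.transpose(1, 0, 3, 2))]
    n = [P.shape[0], P.shape[1]]
    idx = [np.arange(k) for k in n]
    pols = [np.zeros(k, dtype=int) for k in n]
    P_hat = []
    for i, (Pi, _) in enumerate(views):
        o = 1 - i
        mu = np.full(n[o], 1.0 / n[o])  # uniform mu_i (assumption)
        # Eq. (3): average over the other agent's state, marginalize its next state.
        P_hat.append(sum(mu[s] * Pi[:, s, :, pols[o][s]].sum(axis=-1)
                         for s in range(n[o])))
    q = [stationary(P_hat[i][idx[i], pols[i]]) for i in range(2)]
    for _ in range(max_iter):
        changed = False
        for i, (_, Ri) in enumerate(views):
            o = 1 - i
            # Eq. (4): expected reward when the other agent's state follows q_hat.
            R_loc = sum(q[o][s] * Ri[:, s, :, pols[o][s]] for s in range(n[o]))
            cand = solve_local_mdp(P_hat[i], R_loc)
            current = avg_reward(P_hat[i], R_loc, pols[i])
            if avg_reward(P_hat[i], R_loc, cand) >= (1 + eps) * current:
                pols[i] = cand
                q[i] = stationary(P_hat[i][idx[i], cand])
                changed = True
        if not changed:
            break
    return pols

def make_example():
    """Two agents pick one of two targets; picking the same target lowers success."""
    P, R = np.zeros((2,) * 6), np.zeros((2,) * 4)
    for s1, s2, a1, a2 in itertools.product(range(2), repeat=4):
        c = 0.7 if a1 == a2 else 0.9
        p1 = np.where(np.arange(2) == a1, c, 1 - c)
        p2 = np.where(np.arange(2) == a2, c, 1 - c)
        P[s1, s2, a1, a2] = np.outer(p1, p2)
        R[s1, s2, a1, a2] = len({a1, a2})   # coverage reward: distinct targets
    return P, R

P, R = make_example()
policies = local_search(P, R)
```

On this toy instance the search ends with the two agents assigned to different targets in every state. The `max_iter` cap is a safeguard added for the sketch and is not part of the description above.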

## V Optimality Analysis

We analyze the optimality in three stages. First, we define a TI-MDP, and prove that the policies returned by our algorithm are within a provable bound of a local optimum of the TI-MDP. We then use submodularity of the reward function to prove that the local optimal policies provide a constant-factor approximation to the global optimum on the TI-MDP. Finally, we prove that the approximate global optimum on the TI-MDP is also an approximate global optimum for the original MDP.

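We define $\hat{\pi}$ to be the policy returned by our algorithm. Let $\hat{q}$ be the joint stationary distribution of the agents in the MDP arising from these policies. We construct a TI-MDP $\hat{\mathcal{M}}$. The transition function is defined by

$$\hat{P}(s,a,s') = \prod_{i=1}^{m}\hat{P}_i(s_i,a_i,s_i'),$$

where $\hat{P}_i$ denotes the local transition function of agent $i$ from (3). The reward function of $\hat{\mathcal{M}}$ is the reward function $R$ of the original MDP. We observe that, by construction, if $\mathcal{M}_i$ is the local MDP obtained at the last iteration of Algorithm 1, then $J_{\mathcal{M}_i}(\pi_i) = J_{\hat{\mathcal{M}}}(\pi_i,\hat{\pi}_{-i})$ for all $\pi_i$.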

###### Lemma 3

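The policy $\hat{\pi}$ returned by Algorithm 1 is an $\epsilon$-local optimum for MDP $\hat{\mathcal{M}}$.

Proof: By construction, the algorithm terminates if, for all $i$, there is no policy $\pi_i$ such that

$$J_{\hat{\mathcal{M}}}(\pi_i,\hat{\pi}_{-i}) = J_{\mathcal{M}_i}(\pi_i) \geq (1+\epsilon)J_{\mathcal{M}_i}(\hat{\pi}_i) = (1+\epsilon)J_{\hat{\mathcal{M}}}(\hat{\pi}),$$

implying that $\hat{\pi}$ is an $\epsilon$-local optimum of $\hat{\mathcal{M}}$.

Based on the local optimality, we can derive the following optimality bound for $\hat{\pi}$.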

###### Lemma 4

Let $\pi^*$ denote the optimal local policies for MDP $\hat{\mathcal{M}}$. Then

$$J_{\hat{\mathcal{M}}}(\hat{\pi}) \geq \frac{1}{2+\epsilon m}J_{\hat{\mathcal{M}}}(\pi^*).$$

Proof: The proof follows from the submodularity of $J_{\hat{\mathcal{M}}}$ (Proposition 1) and Lemma 2.

Lemma 4 provides an optimality bound with respect to the TI-MDP $\hat{\mathcal{M}}$. Next, we leverage the $\delta$-dependent property to derive an optimality bound with respect to the given MDP $\mathcal{M}$. We start with the following preliminary results.

###### Lemma 5

For any state and policy , , where and are the transition matrices corresponding to and , respectively.

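A proof can be found in the technical appendix. We next exploit the bound in Lemma 5 to approximate the gap between $J_{\mathcal{M}}(\pi)$ and $J_{\hat{\mathcal{M}}}(\pi)$.

###### Lemma 6

Suppose that $\mathcal{M}$ is a $\delta$-dependent MDP. For any policy $\pi$, $|J_{\hat{\mathcal{M}}}(\pi) - J_{\mathcal{M}}(\pi)| \leq 2(R_{\max}-R_{\min})\bar{\lambda}m\delta$.

Proof: Let $q_{\mathcal{M}}$ and $q_{\hat{\mathcal{M}}}$ denote the stationary distributions induced by policy $\pi$ on MDPs $\mathcal{M}$ and $\hat{\mathcal{M}}$. We have

$$|J_{\hat{\mathcal{M}}}(\pi) - J_{\mathcal{M}}(\pi)| \leq (R_{\max}-R_{\min})\,\|q_{\mathcal{M}} - q_{\hat{\mathcal{M}}}\|_1.$$

By Lemma 1,

$$\|q_{\mathcal{M}} - q_{\hat{\mathcal{M}}}\|_1 \leq 2\bar{\lambda}m\delta,$$

giving the desired result.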

Combining these derivations yields the following.

###### Theorem 2

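Let $\mathcal{M}$ be a $\delta$-dependent MDP, and let $\hat{\pi}$ and $\pi^*$ denote the output of Algorithm 1 and the optimal policies, respectively. Then

$$J_{\mathcal{M}}(\pi^*) \leq 4R_{\max}\bar{\lambda}m\delta + (1+m\epsilon)J_{\hat{\mathcal{M}}}(\hat{\pi}) + J_{\mathcal{M}}(\hat{\pi}).$$

Proof: We have

$$\begin{aligned} J_{\mathcal{M}}(\pi^*) - J_{\mathcal{M}}(\hat{\pi}) &\leq |J_{\mathcal{M}}(\hat{\pi}) - J_{\hat{\mathcal{M}}}(\hat{\pi}) + J_{\hat{\mathcal{M}}}(\hat{\pi}) - J_{\hat{\mathcal{M}}}(\pi^*)| + |J_{\hat{\mathcal{M}}}(\pi^*) - J_{\mathcal{M}}(\pi^*)| \\ &\leq 4R_{\max}\bar{\lambda}m\delta + (1+m\epsilon)J_{\hat{\mathcal{M}}}(\hat{\pi}) \end{aligned}$$

by Lemmas 4 and 6.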
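The two inequalities can be unpacked term by term. The following is a sketch of the intermediate steps rather than the authors' own derivation. It assumes nonnegative rewards, so that $R_{\max}-R_{\min} \leq R_{\max}$, and applies the bound $(R_{\max}-R_{\min})\cdot 2\bar{\lambda}m\delta$ from the proof of Lemma 6 to both $\hat{\pi}$ and $\pi^*$.

$$\begin{aligned}
J_{\mathcal{M}}(\pi^*) - J_{\mathcal{M}}(\hat{\pi})
&= \left[J_{\mathcal{M}}(\pi^*) - J_{\hat{\mathcal{M}}}(\pi^*)\right]
 + \left[J_{\hat{\mathcal{M}}}(\pi^*) - J_{\hat{\mathcal{M}}}(\hat{\pi})\right]
 + \left[J_{\hat{\mathcal{M}}}(\hat{\pi}) - J_{\mathcal{M}}(\hat{\pi})\right] \\
&\leq 2R_{\max}\bar{\lambda}m\delta
 + (1+m\epsilon)J_{\hat{\mathcal{M}}}(\hat{\pi})
 + 2R_{\max}\bar{\lambda}m\delta .
\end{aligned}$$

The middle term uses Lemma 4 in the form $J_{\hat{\mathcal{M}}}(\pi^*) \leq (2+\epsilon m)J_{\hat{\mathcal{M}}}(\hat{\pi})$, and moving $J_{\mathcal{M}}(\hat{\pi})$ to the right-hand side gives the statement of Theorem 2.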

## VI Simulation

In this section, we present our simulation results. We consider two scenarios, namely, multi-robot control and a multi-agent patrolling example. Both simulations are implemented using Python 3.8.5 on a workstation with Intel(R) Xeon(R) W-2145 CPU @ 3.70GHz processor and  GB memory. Given the transition and reward matrices, an MDP is solved using Python MDP Toolbox [10].

### VI-A Multi-robot Control

#### VI-A1 Simulation Settings

We consider a set of robots whose goal is to cover the maximum number of targets from a set of fixed targets positioned in a grid environment. Robots initially start from a fixed set of grid locations. At each time $t$, each robot can move one grid position horizontally or vertically from the current grid position by taking some action . For each robot $i$, let denote the set of grid positions that can be reached from the current grid position under each action . $d(a_i)$ is the grid position corresponding to $a_i$. Note that , if $a_i$ is not a valid action (e.g., at the bottom left corner of the grid are not valid actions). Let $P_i$ be the transition probability function associated with robot $i$ and $P_i(s_i,a_i,s_i')$ be the probability that robot $i$ transitions from a grid position $s_i$ to $s_i'$ under action $a_i$. Let $n(s_i')$ be the number of robots at $s_i'$ after taking actions . Then,

 Pi(si,ai,s′i)=⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩c, if s′i=d(ai) and n(s′i)

where . Uncertainty in the environment is modeled by the parameter and the transition dependencies between the robots are modeled by the parameter .

We model the multi-robot control problem as an MDP . The state space , where for all . The action space . The transition probability matrix is denoted as . The probability of transitioning from a state to some target state by taking action is given as . The submodular reward of is given by , where is the number of robots visiting target following a joint policy at a state . The parameter captures the effectiveness of having agents at target . A similar submodular reward was used in [16].

#### VI-A2 Simulation Results

We use Algorithm 1 to find a set of policies for the robots that maximizes their average reward. Parameters , , , and are set to , , and , respectively. The transition probability for each agent is calculated by evaluating (3) over samples of actions and states . We test Algorithm 1 under different sizes of grids , number of agents , number of targets , and initial locations of the agents. For each setting, we execute Algorithm 1 for trials, and take the average over the trials as the performance of Algorithm 1. We compare Algorithm 1 with a global MDP approach, which calculates the optimal values using the relative value iteration algorithm [7] provided by Python MDP Toolbox [10] on MDP . Note that the state space of MDP is exponential in grid size and number of agents . The action space is exponential in . The total size of all local MDPs for all agents constructed using (3) and (4) grows linearly with respect to the number of agents. As the number of agents and/or the grid size increases, the global MDP approach incurs a heavy memory and computation overhead. For example, when robots try to reach target in a grid, the global MDP approach requires around  GB of memory to compute the solution, while our proposed approach only requires  MB of memory to calculate the policy. Therefore, the global MDP approach is not computationally efficient for larger example sizes.

Table I shows the simulation results obtained using Algorithm 1 and the global MDP approach. We observe that our proposed approach provides more than optimality with respect to the average reward achieved by the agents for all settings, while incurring comparable run time when the example size is small and much less run time when the example sizes increase. Particularly, as the number of agents and the grid size increase, e.g., two agents, two targets, and grid, our proposed approach maintains more than optimality with only run time, compared with the global MDP approach. Hence, our proposed approach scales to multi-agent scenarios with the $\delta$-dependent property.

### VI-B Multi-Agent Patrolling Example

#### VI-B1 Simulation Settings

We also evaluate our proposed approach on a patrolling example in which multiple patrol units capture multiple adversaries among a finite set of locations. At each time, each patrol unit can be deployed at some location .

The objective of the patrol units is to compute a policy to patrol the locations to capture the adversaries. Each adversary is assumed to follow a heuristic policy as follows. If there exists no patrol unit that is deployed at the adversary’s target location , then with probability the adversary transitions to location and with probability the adversary transitions to some other location . If the adversary’s target location is being patrolled by some unit, then with probability the adversary transitions to location , and with probability the adversary transitions to some other location . The adversaries’ policies are assumed to be known to the patrol units.

The patrolling example is modeled by an MDP , where is the set of joint locations of the patrol units and adversaries, is the set of locations at which patrol unit is deployed, and is the set of locations where adversary can be located. The action set of each patrol unit and adversary is . Thus the joint action space . We remark that the joint action space is defined as the Cartesian product of the action spaces of all patrol units and adversaries so that we can accurately capture the transition probabilities of all patrol units and adversaries. We solve the problem by optimizing over the joint action space of all the patrol units, since the adversaries’ policies are known to the patrol units. For each patrol unit $i$ and adversary $j$, we let

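$$P_i(s_i,a_i,s_i') = \begin{cases} c & \mbox{if } a_i = s_i',\ \nexists\, i'\neq i \mbox{ s.t. } a_i = a_{i'} \\ \frac{1-c}{|L|} & \mbox{if } a_i \neq s_i',\ \nexists\, i'\neq i \mbox{ s.t. } a_i = a_{i'} \\ \delta c & \mbox{if } a_i = s_i',\ \exists\, i'\neq i \mbox{ s.t. } a_i = a_{i'} \\ \frac{1-\delta c}{|L|} & \mbox{if } a_i \neq s_i',\ \exists\, i'\neq i \mbox{ s.t. } a_i = a_{i'} \end{cases}$$

$$P_j(s_j,a_j,s_j') = \begin{cases} d & \mbox{if } a_j = s_j',\ \nexists\, i \mbox{ s.t. } a_i = a_{j'} \\ \frac{1-d}{|L|} & \mbox{if } a_j \neq s_j',\ \nexists\, i \mbox{ s.t. } a_i = a_{j'} \\ \beta d & \mbox{if } a_j = s_j',\ \exists\, i \mbox{ s.t. } a_i = a_{j'} \\ \frac{1-\beta d}{|L|} & \mbox{if } a_j \neq s_j',\ \exists\, i \mbox{ s.t. } a_i = a_{j'} \end{cases}$$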

Here parameters $c$ and $d$ capture the transition uncertainties, parameter $\delta$ captures the transition dependency among the patrol units, and $\beta$ captures the adversaries’ reactions to the patrol units’ actions. Let and be two joint locations. Then , where . We define the reward function for each and as , where , where is the effectiveness parameter, and are the number of patrol units and adversaries that are in location corresponding to , respectively.

#### VI-B2 Simulation Results

We use Algorithm 1 to compute the policies for the patrol units, given the adversaries’ policies. Parameters , , , , and are set as , , , , and , respectively. We calculate the transition probability of each patrol unit by evaluating (3) over all possible actions and all possible states of all adversaries and all the other patrol units except . We implement our proposed approach under various settings by varying the number of patrol units, adversaries, and locations. For each setting, we run Algorithm 1 for trials and take the average over the trials as its performance. We compare Algorithm 1 with the global MDP approach that implements the relative value iteration algorithm on MDP .

Table II shows the simulation results obtained using Algorithm 1 and the global MDP approach. We observe that our proposed approach achieves more than of optimality with respect to the average reward, while incurring at most of run time over all settings. By comparing the first, fourth, sixth, and seventh rows in Table II, we observe that the run time advantage provided by our proposed approach remains when we increase the number of locations. By comparing the first three rows in Table II, we observe that our proposed approach remains close to the optimal average reward (more than ), but scales better when the number of agents, including patrol units and adversaries, increases.

## VII Conclusions

This paper presented an approach for selecting decentralized policies for transition dependent MMDPs. We proposed a property of $\delta$-transition dependence, which we defined based on the maximum total variation distances for each agent’s state transitions conditioned on the actions of the other agents. In the special case of $\delta = 0$, the MMDP is transition-independent. We developed a local search algorithm that runs in polynomial time in the number of agents. We derived optimality bounds on the policies obtained from our algorithm as a function of $\delta$. Our results were verified through numerical studies on a patrolling example and a multi-robot control scenario.

## References

• [1] S. C. Albright (1979) Structural results for partially observable Markov decision processes. Operations Research 27 (5), pp. 1041–1053. Cited by: §II.
• [2] C. Amato, G. Chowdhary, A. Geramifard, N. K. Üre, and M. J. Kochenderfer (2013) Decentralized control of partially observable Markov decision processes. In 52nd IEEE Conference on Decision and Control, pp. 2398–2405. Cited by: §II.
• [3] C. Amato, G. Konidaris, L. P. Kaelbling, and J. P. How (2019) Modeling and planning with macro-actions in decentralized POMDPs. Journal of Artificial Intelligence Research 64, pp. 817–859. Cited by: §II.
• [4] A. Badanidiyuru, B. Mirzasoleiman, A. Karbasi, and A. Krause (2014) Streaming submodular maximization: massive data summarization on the fly. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 671–680. Cited by: §II.
• [5] R. Becker, S. Zilberstein, V. Lesser, and C. V. Goldman (2004) Solving transition independent decentralized Markov decision processes. Journal of Artificial Intelligence Research 22, pp. 423–455. Cited by: §I, §II, §IV-A, Definition 2.
• [6] R. Becker, S. Zilberstein, and V. Lesser (2004) Decentralized Markov decision processes with event-driven interactions. In Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems-Volume 1, pp. 302–309. Cited by: §II.
• [7] D. P. Bertsekas (1995) Dynamic programming and optimal control. Vol. 1, Athena Scientific, Belmont, MA. Cited by: §VI-A2.
• [8] A. Beynier and A. Mouaddib (2005) A polynomial algorithm for decentralized Markov decision processes with temporal constraints. In Proceedings of the fourth International Joint conference on Autonomous Agents and Multiagent Systems, pp. 963–969. Cited by: §II.
• [9] N. Buchbinder, M. Feldman, and R. Schwartz (2014) Online submodular maximization with preemption. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1202–1216. Cited by: §II.
• [10] S. A. W. Cordwell (2015) Markov decision process (MDP) toolbox for Python. Cited by: §VI-A2, TABLE I, TABLE II, §VI.
• [11] M. L. Fisher, G. L. Nemhauser, and L. A. Wolsey (1978) An analysis of approximations for maximizing submodular set functions—II. In Polyhedral Combinatorics, pp. 73–87. Cited by: §II, Lemma 2.
• [12] S. Fujishige (2005) Submodular functions and optimization. Elsevier. Cited by: §III-B.
• [13] D. Golovin and A. Krause (2011) Adaptive submodularity: theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research 42, pp. 427–486. Cited by: §II.
• [14] C. Guestrin, D. Koller, R. Parr, and S. Venkataraman (2003) Efficient solution algorithms for factored MDPs. Journal of Artificial Intelligence Research 19, pp. 399–468. Cited by: §II.
• [15] T. Gupta, A. Kumar, and P. Paruchuri (2019) Successor features based multi-agent RL for event-based decentralized MDPs. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6054–6061. Cited by: §II.
• [16] R. R. Kumar, P. Varakantham, and A. Kumar (2017) Decentralized planning in stochastic environments with submodular rewards. Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence. Cited by: §I, §II, §II, §IV-B, §IV-B, §VI-A1, Proposition 1.
• [17] M. Lauer and M. Riedmiller (2000) An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In Proceedings of the Seventeenth International Conference on Machine Learning. Cited by: §I, §II.
• [18] D. A. Levin and Y. Peres (2017) Markov Chains and Mixing Times. Vol. 107, American Mathematical Society. Cited by: §III-A.
• [19] M. L. Littman (1994) Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings, pp. 157–163. Cited by: §II.
• [20] M. L. Littman (2001) Value-function reinforcement learning in Markov games. Cognitive Systems Research 2 (1), pp. 55–66. Cited by: §I, §II.
• [21] C. D. Meyer (1975) The role of the group generalized inverse in the theory of finite Markov chains. SIAM Review 17 (3), pp. 443–464. Cited by: §III-A.
• [22] S. Parsons and M. Wooldridge (2002) Game theory and decision theory in multi-agent systems. Autonomous Agents and Multi-Agent Systems 5 (3), pp. 243–254. Cited by: §II.
• [23] G. Qu and N. Li (2019) Exploiting fast decaying and locality in multi-agent MDP with tree dependence structure. In 2019 IEEE 58th Conference on Decision and Control (CDC), pp. 6479–6486. Cited by: §II.
• [24] Y. Satsangi, S. Whiteson, F. A. Oliehoek, et al. (2015) Exploiting submodular value functions for faster dynamic sensor selection. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 3356–3363. Cited by: §II.
• [25] H. Schütze, C. D. Manning, and P. Raghavan (2008) Introduction to information retrieval. Vol. 39, Cambridge University Press Cambridge. Cited by: Theorem 1.
• [26] E. Seneta (1991) Sensitivity analysis, ergodicity coefficients, and rank-one updates. Numerical Solution of Markov Chains 8, pp. 121. Cited by: Lemma 1.
• [27] X. Wang and T. Sandholm (2003) Reinforcement learning to play an optimal Nash equilibrium in team Markov games. In Advances in Neural Information Processing Systems, pp. 1603–1610. Cited by: §I, §II.
• [28] K. Zhang, Z. Yang, and T. Başar (2019) Multi-agent reinforcement learning: a selective overview of theories and algorithms. arXiv preprint arXiv:1911.10635. Cited by: §II.

## Appendix

Proof of Lemma 5: The total variation distance between these distributions is given by

$$\max_{T}\left|\sum_{s'\in T}\Big(P(s,\pi(s),s')-\bar{P}(s,\pi(s),s')\Big)\right|.$$
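As an illustrative aside, the maximization over subsets in this expression can be checked numerically on a small example. The sketch below is not part of the proof; the function names `tv_by_subsets` and `tv_half_l1` and the distributions `p` and `q` are hypothetical stand-ins for the next-state distributions under the two transition functions at a fixed state and action. It compares a brute-force search over all subsets with the standard identity that the total variation distance equals half the L1 distance.

```python
from itertools import combinations


def tv_by_subsets(p, q):
    """Brute-force max over all subsets T of |sum_{s' in T} (p[s'] - q[s'])|.

    Exponential in len(p); only meant for tiny illustrative examples.
    """
    n = len(p)
    best = 0.0
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            gap = abs(sum(p[i] - q[i] for i in subset))
            best = max(best, gap)
    return best


def tv_half_l1(p, q):
    """Total variation distance computed as half the L1 distance."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical next-state distributions over three states (assumed values).
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

# Both computations should agree up to floating-point error.
print(tv_by_subsets(p, q), tv_half_l1(p, q))
```

The brute-force version mirrors the definition directly, while the half-L1 form is the one that would typically be used in practice because it is linear in the number of states.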

With slight abuse of notation, we let and denote the probability distributions of when the agents follow policy in MDPs and . We let (resp. ) denote the probability that when and in MDP (resp. ). For a state , we let .

For any , we define the sets for to denote the set of tuples of that can be completed to an element of . We define by

$$T_i(s_{1:(i-1)})=\{s'_i\in S_i:\{s'_{1:(i-1)},s'_i\}\subseteq Q\in T\},$$

i.e., can be completed to an element of . We can then write the probability as

 m∏i=1P(s′i∈Ti(s1:(i−1))|s1:(i−1)∈T1:(