Fueled by recent advances in deep neural networks, reinforcement learning (RL) has been in the limelight for many recent breakthroughs in artificial intelligence, including defeating humans in games (e.g., chess, Go, StarCraft), self-driving cars, smart home automation, and service robots, among many others. Despite these remarkable achievements, many basic tasks can still elude a single RL agent. Examples abound, from multi-player games, multi-robot systems, cellular antenna tilt control, and traffic control systems to smart power grids and network management.
In many of these applications, cooperation among multiple RL agents is critical: multiple agents must collaborate to complete a common goal, expedite learning, protect privacy, offer resiliency against failures and adversarial attacks, and overcome the physical limitations of a single RL agent acting alone. These tasks are studied under the umbrella of cooperative multi-agent RL (MARL), where agents seek to learn optimal policies that maximize a shared team reward while interacting with an unknown stochastic environment and with each other. Cooperative MARL is far more challenging than the single-agent case due to: i) the exponentially growing search space, ii) the non-stationary and unpredictable environment caused by the agents' concurrent yet heterogeneous behaviors, and iii) the lack of central coordinators in many applications. These difficulties can be alleviated by appropriate coordination among agents.
Cooperative MARL can be further categorized into subclasses depending on the information structure and the type of coordination, such as how much information (e.g., state, action, reward, etc.) is available to each agent, what kinds of information can be shared among the agents, and what kinds of protocols (e.g., communication networks, etc.) are used for coordination. When only local partial state observations are available to each agent, the corresponding multi-agent systems are often described through decentralized partially observable Markov decision processes (Dec-POMDPs), for which the decision problem is known to be extremely challenging. In fact, even the planning problem for Dec-POMDPs (with known models) is known to be NEXP-complete. Despite some recent empirical successes [2, 3, 4], finding an exact solution of Dec-POMDPs via RL with theoretical guarantees remains an open problem.
When full state information is available to each agent, we call the agents joint action learners (JALs) if they also know the joint actions of other agents, and independent learners (ILs) if agents only know their own actions. Learning tasks for ILs are still very challenging: each agent sees the other agents as part of the environment, so without observing their internal decisions, including the other agents' actions, the problem essentially becomes non-Markovian, i.e., a partially observable MDP (POMDP). It turns out that the optimal policy can be found under restricted assumptions such as deterministic MDPs, and for general stochastic MDPs, several attempts have demonstrated empirical successes [7, 8, 9]. For a more comprehensive treatment of independent learners in MARL, we refer the reader to the existing surveys.
The form of rewards, either centralized or decentralized, also makes a huge difference in multi-agent systems. If every agent receives a common reward, the situation is relatively easy to deal with. For instance, JALs can perfectly learn exact optimal policies of the underlying decision problem even without coordination among agents. The more interesting and practical scenario is when rewards are decentralized, i.e., each agent receives its own local reward while the global reward to be maximized is the sum of the local rewards. This decentralization is especially important when taking into account the privacy and resiliency of the system.
Clearly, learning without coordination among agents is impossible under decentralized rewards. This article focuses on this important subclass of cooperative MARL with decentralized rewards, assuming full state and action information is available to each agent. In particular, we consider decentralized coordination through network communications characterized by graphs, where each node represents an agent and each edge represents a communication link between two agents.
Distributed optimization rises to the challenge by achieving global consensus on the optimal policy through only local computation and communication with neighboring agents. Recently, several important advances have been made in this direction, such as distributed TD-learning , distributed Q-learning , distributed actor-critic algorithms , and other important results [14, 15, 16, 17]. These works largely benefit from the synergistic connection between RL and the core idea of averaging-consensus-based distributed optimization , which leverages averaging consensus protocols for information propagation over networks and the rich theory established in this field during the last decade.
In this survey, we provide an overview of this emerging field with an emphasis on optimization within the decentralized setting (decentralized rewards and decentralized communication protocols). For this purpose, we highlight the evolution of RL algorithms from single-agent to multi-agent systems from a distributed optimization perspective, in the hope of catalyzing the growing synergy among the distributed optimization, signal processing, and RL communities.
In the sequel, we first revisit the basics of single-agent RL in Section II and extend them to multi-agent RL in Section III. In Section IV, we provide preliminaries on distributed optimization as well as consensus algorithms. In Section V, we discuss several important consensus-based MARL algorithms with decentralized network communication protocols. Finally, in Section VI, we conclude with future directions and open issues. Note that our review is not exhaustive given the magazine limits; we refer the interested reader to [19, 6, 20] for further reading.
II Single-agent RL basics
To understand MARL, it is imperative that we briefly review the basics of the single-agent RL setting, where only a single agent interacts with an unknown stochastic environment. Such environments are classically represented by a Markov decision process, $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, with state-space $\mathcal{S}$ and action-space $\mathcal{A}$: upon selecting an action $a \in \mathcal{A}$ in the current state $s \in \mathcal{S}$, the state transits to $s'$
according to the state transition probability $P(s'|s,a)$, and the transition incurs a random reward $r(s,a,s')$. For simplicity, we consider the infinite-horizon (discounted) Markov decision problem (MDP), where the agent sequentially takes actions to maximize cumulative discounted rewards. The goal is to find a deterministic optimal policy, $\pi^*: \mathcal{S} \to \mathcal{A}$, such that
$$\pi^* \in \arg\max_{\pi \in \Theta} \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r(s_k, a_k, s_{k+1})\right],$$
where $\gamma \in (0,1)$ is the discount factor, $\Theta$ is the set of all admissible deterministic policies, and $(s_0, a_0, s_1, a_1, \ldots)$ is a state-action trajectory generated by the Markov chain under policy $\pi$. Solving MDPs involves two key concepts associated with the expected return:
$V^{\pi}(s) := \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r(s_k, a_k, s_{k+1}) \,\middle|\, s_0 = s\right]$ is called the (state) value function for a given policy $\pi$, which encodes the expected cumulative reward when starting in the state $s$, and then, following the policy $\pi$ thereafter.
$Q^{\pi}(s,a) := \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r(s_k, a_k, s_{k+1}) \,\middle|\, s_0 = s, a_0 = a\right]$ is called the state-action value function or Q-function for a given policy $\pi$, which measures the expected cumulative reward when starting from state $s$, taking the action $a$, and then, following the policy $\pi$.
Their optima over all possible policies are defined by $V^*(s) := \max_{\pi} V^{\pi}(s)$ and $Q^*(s,a) := \max_{\pi} Q^{\pi}(s,a)$, respectively. Given the optimal value function $V^*$ or $Q^*$, the optimal policy can be obtained by picking an action that is greedy with respect to $V^*$ or $Q^*$, i.e., $\pi^*(s) \in \arg\max_{a} \mathbb{E}\left[r(s,a,s') + \gamma V^*(s')\right]$ or $\pi^*(s) \in \arg\max_{a} Q^*(s,a)$, respectively. When the MDP instance, $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, is known, it can be solved efficiently via dynamic programming (DP) algorithms. Based on the Markov property, the value function for a given policy $\pi$ satisfies the Bellman equation $V^{\pi}(s) = \mathbb{E}\left[r(s, \pi(s), s') + \gamma V^{\pi}(s')\right]$. A similar property holds for $Q^{\pi}$ as well. Moreover, the optimal Q-function, $Q^*$, satisfies the Bellman optimality equation, $Q^*(s,a) = \mathbb{E}\left[r(s,a,s') + \gamma \max_{a'} Q^*(s', a')\right]$. Various DP algorithms, such as the policy and value iterations, are obtained by turning the Bellman equations into update rules.
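To make "turning the Bellman optimality equation into an update rule" concrete, the following sketch runs value iteration on a toy two-state, two-action MDP; the transition probabilities and rewards are hypothetical, chosen purely for illustration.

```python
import numpy as np

# Toy MDP (hypothetical numbers): P[a, s, s'] is the probability of moving to
# s' when taking action a in state s; R[s, a] is the expected reward.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.3, 0.7]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

# Value iteration: iterate the Bellman optimality update
# Q(s, a) <- R(s, a) + gamma * sum_s' P(s'|s, a) * max_a' Q(s', a').
Q = np.zeros((2, 2))
for _ in range(1000):
    V = Q.max(axis=1)                              # greedy value per state
    Q = R + gamma * np.einsum("ast,t->sa", P, V)   # Bellman backup

pi_star = Q.argmax(axis=1)   # greedy policy with respect to Q*
```

Policy iteration follows the same pattern, alternating a policy-evaluation sweep with a greedy policy-improvement step.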
II-A Classical RL Algorithms
Many classical RL algorithms can be viewed as stochastic variants of DP. This insight will be key for scaling MARL in the sequel. Temporal-difference (TD) learning is a fundamental RL algorithm for estimating the value function of a given policy (a task called policy evaluation):
$$V(s_k) \leftarrow V(s_k) + \alpha_k \left[r_k + \gamma V(s_{k+1}) - V(s_k)\right],$$
where $s_k \sim d^{\pi}$, $d^{\pi}$ denotes the stationary state distribution under policy $\pi$, and $\alpha_k$ is the learning rate (or step-size). For any fixed policy $\pi$, the TD update converges to $V^{\pi}$ almost surely (i.e., with probability one) if the step-size satisfies the so-called Robbins-Monro rule, $\sum_k \alpha_k = \infty$, $\sum_k \alpha_k^2 < \infty$. Although theoretically sound, naive TD learning is only applicable to small-scale problems as it needs to store and enumerate values of all states. However, most practical problems we face in the real world have large state-spaces. In such cases, enumerating all values in a table is numerically inefficient or even intractable.
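As a sanity-check sketch of the tabular TD update, the snippet below evaluates a fixed policy on a hypothetical three-state Markov chain and compares against the closed-form solution of the Bellman equation, $V^{\pi} = (I - \gamma P^{\pi})^{-1} r^{\pi}$; all numbers are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
# Hypothetical 3-state Markov chain induced by a fixed policy.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
r = np.array([1.0, 0.0, 2.0])           # expected reward on leaving each state

V = np.zeros(3)
s = 0
for k in range(200_000):
    s_next = rng.choice(3, p=P[s])
    alpha = 100.0 / (k + 1000)          # Robbins-Monro step-size schedule
    # TD(0): move V(s) toward the bootstrapped target r + gamma * V(s').
    V[s] += alpha * (r[s] + gamma * V[s_next] - V[s])
    s = s_next

# Closed-form check: V^pi solves (I - gamma * P) V = r.
V_star = np.linalg.solve(np.eye(3) - gamma * P, r)
```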
Using function approximation resolves this problem by encoding the value function with a parameterized function class, $V_{\theta}$. The simplest example is the linear function approximation, $V_{\theta} = \Phi \theta$, where $\Phi$ is a feature matrix whose rows are $\phi(s)^{\top}$, and $\phi$ is a pre-selected feature mapping. The TD learning update with linear function approximation is written as follows:
$$\theta_{k+1} = \theta_k + \alpha_k \left[r_k + \gamma \phi(s_{k+1})^{\top}\theta_k - \phi(s_k)^{\top}\theta_k\right]\phi(s_k).$$
The above update is known to converge to $\theta^*$ almost surely, where $\theta^*$ is the solution to the projected Bellman equation, provided that the Markov chain with transition matrix $P^{\pi}$ (the state transition probability matrix under policy $\pi$) is ergodic and the step-size satisfies the Robbins-Monro rule. Finite-sample analyses of the TD learning algorithm have only recently been established in [23, 24, 25]. Besides the standard TD, there also exists a wide spectrum of TD variants in the literature [26, 27, 28, 29]. Note that when a nonlinear function approximator, such as a neural network, is used, these algorithms are not guaranteed to converge.
Policy optimization methods aim to find the optimal policy and broadly fall into two camps, one focusing on value-based updates and the other on direct policy-based updates. There is also a class of algorithms that belongs to both camps, called actor-critic algorithms. Q-learning is one of the most representative value-based algorithms, which obeys the update rule
$$Q(s_k, a_k) \leftarrow Q(s_k, a_k) + \alpha_k \left[r_k + \gamma \max_{a'} Q(s_{k+1}, a') - Q(s_k, a_k)\right],$$
where $s_k \sim d^{\beta}$, $a_k \sim \beta(\cdot|s_k)$, and $\beta$ is called the behavior policy, which refers to the policy used to collect observations for learning. The algorithm converges to $Q^*$ almost surely  provided that the step-size satisfies the Robbins-Monro rule and every state-action pair is visited infinitely often. Unlike value-based methods, direct policy search methods optimize a parameterized policy $\pi_{\theta}$ from trajectories of states, actions, and rewards, without any value function evaluation steps, using the following (stochastic) gradient steps:
$$\theta_{k+1} = \theta_k + \alpha_k \widehat{\nabla_{\theta} J(\theta_k)},$$
where $\widehat{\nabla_{\theta} J(\theta_k)}$ is a stochastic estimate of the gradient of the expected return $J(\theta)$ evaluated at $\theta_k$. The gradient of the value function has the simple analytical form $\nabla_{\theta} J(\theta) = \mathbb{E}\left[\nabla_{\theta} \log \pi_{\theta}(a|s)\, Q^{\pi_{\theta}}(s,a)\right]$, which, however, needs an estimate of the Q-function, $Q^{\pi_{\theta}}$. The simple policy gradient method that replaces $Q^{\pi_{\theta}}$ with a Monte Carlo estimate is called REINFORCE . However, the high variance of the stochastic gradient estimates due to the Monte Carlo procedure often leads to slow and sometimes unstable convergence. The actor-critic methods combine the advantages of value-based and direct policy search methods
to reduce the variance. These algorithms parameterize both the policy and the value function, and simultaneously update both during training:
$$w_{k+1} = w_k + \alpha_k \delta_k \nabla_w V_{w_k}(s_k), \qquad \theta_{k+1} = \theta_k + \beta_k \delta_k \nabla_{\theta} \log \pi_{\theta_k}(a_k|s_k),$$
where $\delta_k = r_k + \gamma V_{w_k}(s_{k+1}) - V_{w_k}(s_k)$ is the TD error, and $w$ and $\theta$ are the parameters of the value function and policy, respectively. They often exhibit better empirical performance than value-based or direct policy-based methods alone. Nonetheless, when (nonlinear) function approximation is used, the convergence guarantees of all these algorithms remain rather elusive.
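The REINFORCE idea above can be sketched in a few lines on the simplest possible decision problem, a two-armed bandit (a one-step MDP) with a softmax policy; the reward means are hypothetical. For a softmax policy, the score function is simply $\nabla_{\theta} \log \pi_{\theta}(a) = e_a - \pi_{\theta}$.

```python
import numpy as np

rng = np.random.default_rng(2)
true_means = np.array([0.2, 0.8])       # hypothetical expected arm rewards
theta = np.zeros(2)                     # softmax policy parameters
alpha = 0.05

for _ in range(5000):
    p = np.exp(theta - theta.max()); p /= p.sum()   # softmax policy
    a = rng.choice(2, p=p)                          # sample an action
    reward = true_means[a] + 0.1 * rng.standard_normal()
    grad_log = -p                                   # score: one_hot(a) - p
    grad_log[a] += 1.0
    theta += alpha * reward * grad_log  # REINFORCE: ascend reward * score

p = np.exp(theta - theta.max()); p /= p.sum()       # final policy
```

After training, the policy should concentrate on the better arm; adding a baseline (subtracting the average reward) is the standard variance-reduction refinement that leads toward actor-critic methods.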
II-B Modern Optimization-based RL Algorithms
Modern optimization-based RL algorithms generate new principles for solving RL problems as we transition from linear toward nonlinear function approximation, and establish theoretical guarantees based on the rich theory in the mathematical optimization literature.
To build up an understanding, we first recall the linear programming (LP) formulation of the planning problem:
$$\min_{V} \; \mu^{\top} V \quad \text{subject to} \quad V \ge r_a + \gamma P_a V, \;\; \forall a \in \mathcal{A}, \tag{6}$$
where $\mu$ is the initial state distribution, $r_a$ is the expected reward vector, and $P_a$ is the state transition probability matrix given action $a$. The constraints in this LP naturally arise from the Bellman equations. It is known that the solution to (6) is the optimal state-value function $V^*$, and that the solution to the dual of (6) yields the optimal policy. By exploiting Lagrangian duality, the optimal value function and optimal policy can be found through solving the min-max problem:
$$\min_{V \in \mathcal{V}} \max_{\lambda \in \Lambda} \; \mu^{\top} V + \sum_{a \in \mathcal{A}} \lambda_a^{\top} \left(r_a + \gamma P_a V - V\right), \tag{7}$$
where the sets $\mathcal{V}$ and $\Lambda$ are properly chosen domains that restrict the optimal value function and policy.
Building on this min-max formulation, several recent works introduce efficient RL algorithms for finding the optimal policy. For instance, the stochastic primal-dual RL (SPD-RL) in  solves the min-max problem (7) with the stochastic primal-dual algorithm
where the primal and dual updates use unbiased stochastic gradient estimates of the Lagrangian in (7), obtained from samples of the state, action, and reward, together with projection operators onto the sets $\mathcal{V}$ and $\Lambda$. Since these gradients are obtained from samples, the updates can be executed without knowledge of the model. The SPD Q-learning in  extends this to the Q-learning framework with off-policy learning, where the sample observations are collected from some time-varying behavior policies. The dual actor-critic in  generalizes the setup to continuous state-action MDPs and exploits nonlinear function approximation for both the value function and the dual policy. These primal-dual type algorithms resemble the classical actor-critic methods by simultaneously updating the value function and policy, yet in a more efficient and principled manner.
Apart from the LP formulation, alternative nonlinear optimization frameworks based on the fixed-point interpretation of the Bellman equations have also been explored, both for policy evaluation and policy optimization. To name a few, Baird's residual gradient algorithm , designed for policy evaluation, aims to minimize the mean-squared Bellman error (MSBE), i.e.,
$$\min_{\theta} \; \left\| r^{\pi} + \gamma P^{\pi} \Phi \theta - \Phi \theta \right\|_{D}^{2}, \tag{8}$$
where $r^{\pi}$ and $P^{\pi}$ are the expected reward vector and state transition probability matrix under policy $\pi$, respectively, $\Phi$ is the feature matrix, and $D$ is a diagonal matrix with diagonal entries being the stationary state distribution. The gradient TD (GTD) algorithm  solves the projected Bellman equation, $\Phi\theta = \Pi\left(r^{\pi} + \gamma P^{\pi} \Phi\theta\right)$, by minimizing the mean-squared projected Bellman error (MSPBE),
$$\min_{\theta} \; \left\| \Pi\left(r^{\pi} + \gamma P^{\pi} \Phi \theta\right) - \Phi \theta \right\|_{D}^{2}, \tag{9}$$
where $\Pi$ is the projection onto the range of the feature matrix $\Phi$. This is largely driven by the fact that most temporal-difference learning algorithms converge to the minimum of the MSPBE. However, directly minimizing the objectives (8) and (9) can be challenging due to the double sampling issue and the computational burden of the projections. Here, the double sampling issue refers to the requirement of two independent samples of the next state from the current state in order to obtain an unbiased stochastic estimate of the gradient of the objective, mainly due to its quadratic nonlinearity. Alternatively, [39, 28] get around this difficulty by resorting to min-max reformulations of the MSBE and MSPBE and introduce primal-dual type methods for policy evaluation with finite-sample analysis. Similar ideas have also been employed for policy optimization based on the (softmax) Bellman optimality equation; see, e.g., the Smoothed Bellman Error Embedding (SBEED) algorithm .
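The double sampling issue admits a one-line numerical illustration: squaring a single sampled TD error estimates $\mathbb{E}[\delta^2]$ rather than $(\mathbb{E}[\delta])^2$, and the gap is exactly the variance of $\delta$. The sketch below uses synthetic Gaussian "TD errors" (hypothetical numbers, not tied to any particular MDP) to show the bias, and how a second independent sample removes it.

```python
import numpy as np

rng = np.random.default_rng(4)
# Synthetic TD errors with E[delta] = 0.5 and Var(delta) = 1 (hypothetical).
# The squared Bellman error we actually want is (E[delta])^2 = 0.25.
delta = rng.normal(loc=0.5, scale=1.0, size=1_000_000)

biased = np.mean(delta ** 2)        # single-sample squaring estimates E[delta^2]
d1, d2 = delta[::2], delta[1::2]    # two independent samples per "state"
unbiased = np.mean(d1 * d2)         # E[d1 * d2] = (E[delta])^2, unbiased
true_value = 0.25
```

The biased estimate overshoots by roughly Var(delta) = 1, while the product of two independent samples concentrates around the true value 0.25.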
Compared to the classical RL approaches, optimization-based RL exhibits several key advantages. First, in many applications such as robot control, the agents' behaviors must mediate among multiple different objectives. Sometimes, those objectives can be formulated as constraints, e.g., safety constraints. In this respect, optimization-based approaches are more extensible than the traditional dynamic-programming-based approaches when dealing with policy constraints. Second, existing optimization theory provides ample opportunities for developing convergence analyses of RL algorithms with and without function approximation; see, e.g., [33, 34]. More importantly, these methods generalize readily to the multi-agent RL setup with decentralized rewards when integrated with recent fruitful advances in distributed optimization. This last aspect is our main focus in this survey.
III From single-agent to multi-agent RLs
Cooperative MARL extends single-agent RL to $N$ agents, where the system's behavior is influenced by the whole team of simultaneously and independently acting agents in a common environment. This setting can be further classified into MARL with centralized rewards and MARL with decentralized rewards.
III-A MARL with Centralized Rewards
We start with MARL with centralized rewards, where all agents have access to a central reward. In this setting, a multi-agent MDP can be characterized by the tuple $(\mathcal{S}, \{\mathcal{A}_i\}_{i=1}^{N}, P, r, \gamma)$. Each agent $i$ observes the common state $s$ and executes an action $a_i$ inside its own action set $\mathcal{A}_i$ according to its local policy $\pi_i$. The joint action $a = (a_1, \ldots, a_N)$ causes the state to transit to $s'$ with probability $P(s'|s,a)$, and each agent receives the common reward $r$. The goal of each agent is to learn a local policy $\pi_i$ such that the joint policy $\pi = (\pi_1, \ldots, \pi_N)$ is an optimal central policy.
Suppose each agent receives the central reward and knows the joint state and action pair (i.e., the agents are JALs). Cooperative MARL in this case is straightforward because all agents have full information to find an optimal solution. As an example, a naive application of Q-learning  to the multi-agent setting is
$$Q_i(s_k, a_k) \leftarrow Q_i(s_k, a_k) + \alpha_k \left[r_k + \gamma \max_{a'} Q_i(s_{k+1}, a') - Q_i(s_k, a_k)\right],$$
where each agent $i$ keeps its local Q-function $Q_i$ over the joint state-action pair. In particular, this is equivalent to single-agent Q-learning executed by each agent in parallel, and $Q_i \to Q^*$ almost surely for all $i$; thereby each agent recovers an optimal central policy. Similarly, the policy search methods and actor-critic methods can be easily generalized to MARL with JALs . In such a case, coordination among agents is unnecessary to learn the optimal policy. However, in practice, each agent may not have access to the global reward due to communication limitations or privacy issues; as a result, coordination protocols are essential for achieving the optimal policy corresponding to the global reward.
III-B Networked MARL with Decentralized Rewards
The main focus of this survey is MARL with decentralized rewards, where each agent receives only a local reward, and the central reward is characterized as the average of all local rewards. The goal of each agent is to cooperatively find an optimal policy corresponding to the central reward by sharing local learning parameters over a communication network.
More formally, a coordinated multi-agent MDP with a communication network (i.e., a networked MA-MDP) is given as the tuple $(\mathcal{S}, \{\mathcal{A}_i\}_{i=1}^{N}, P, \{r_i\}_{i=1}^{N}, \gamma, \mathcal{G})$, where $r_i$ is the random reward of agent $i$ given the action and the current state, and $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ is an undirected graph (possibly time-varying or stochastic) characterizing the communication network. Each agent $i$ observes the common state $s$, executes an action $a_i$ according to its local policy $\pi_i$, and receives the local reward $r_i$, while the joint action $a$ causes the state to transit to $s'$ with probability $P(s'|s,a)$. The central reward is defined as $r = \frac{1}{N}\sum_{i=1}^{N} r_i$. In the course of learning, each agent receives learning parameters from its neighbors over the communication network. The overall model is illustrated in Figure 1.
For an illustrative example, we consider a wireless sensor network (WSN) , where data packets are routed to the destination node through multi-hop communications. The WSN is represented by a graph with nodes (routers) and edges connecting any two nodes within communication range of each other. The route's quality-of-service (QoS) performance depends on the decisions of all nodes. Below we formulate the WSN as a networked MA-MDP.
Example 1 (WSN as a networked MA-MDP).
The WSN is a multi-agent system in which the sensor nodes are the agents. Each agent $i$ takes an action $a_i$, which consists of forwarding a packet to one of its neighboring nodes $j \in \mathcal{N}_i$, sending an acknowledgment message (ACK) to the predecessor, or dropping the data packet, where $\mathcal{N}_i$ is the set of neighbors of node $i$. The global state is a tuple of local states, $s = (s_1, \ldots, s_N)$, where the local state of agent $i$ consists of the set of its neighboring nodes and the set of packets encapsulated with QoS requirements. A simple example of the reward assigns each agent a positive value when its local routing decision meets the QoS requirement and a penalty otherwise.
The reward measures the quality of local routing decisions in terms of meeting the QoS requirements.
Each agent only has access to its own reward, which measures the quality of its own routing decisions based on the QoS requirements, while the efficiency of the overall task depends on the sum of local rewards. If each node knows the global state $s$ and joint action $a$, then the overall system is a networked MA-MDP.
Finding the optimal policy for networked MA-MDPs naturally relates to one of the most fundamental problems in decentralized coordination and control, called the consensus problem. In the sequel, we first review the recent advances in distributed optimization and consensus algorithms, and then march forward to the discussions of recent developments for cooperative MARL based on consensus algorithms.
IV Distributed optimization and consensus algorithms
In this section, we briefly introduce several fundamental concepts in distributed optimization, which are the backbone of distributed MARL algorithms to be discussed.
Consider a set of $N$ agents, each with some initial value $x_i(0)$. The agents are interconnected over an underlying communication network characterized by a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{E}$ is a set of undirected edges, and each agent has a local view of the network, i.e., each agent $i$ is aware of its immediate neighbors, $\mathcal{N}_i$, in the network, and communicates with them only.
The goal of the consensus problem is to design a distributed algorithm that the agents can execute locally to agree on a common value as they refine their estimates. The algorithm must be local in the sense that each agent performs its own computations and communicates with its immediate neighbors only. Formally speaking, the agents are said to reach a consensus if
$$\lim_{k \to \infty} x_i(k) = c \quad \text{for all } i,$$
for some common value $c$ and for every set of initial values $x_1(0), \ldots, x_N(0)$. For ease of notation, we consider the scalar case, $x_i(k) \in \mathbb{R}$, from now on.
A popular approach to the consensus problem is the distributed averaging consensus algorithm
$$x_i(k+1) = \sum_{j \in \mathcal{N}_i \cup \{i\}} W_{ij}\, x_j(k). \tag{11}$$
The averaging update is executed locally by agent $i$, as it only receives the values of its neighbors, $\{x_j(k)\}_{j \in \mathcal{N}_i}$, and is known to ensure consensus provided that the graph is connected. Note that an undirected graph is connected if there is a path connecting every pair of distinct nodes. Using matrix notation, we can compactly represent (11) as
$$x(k+1) = W x(k),$$
where $x(k)$ is a column vector with entries $x_i(k)$, and $W$ is the weight matrix associated with (11) such that $W_{ij} > 0$ if $j \in \mathcal{N}_i \cup \{i\}$ and $W_{ij} = 0$ otherwise. Here, $W_{ij}$ denotes the element in the $i$-th row and $j$-th column of the matrix $W$.
$W$ is a stochastic matrix, i.e., it is nonnegative and its row sums are one. Hence, $W^k$ converges to a rank-one stochastic matrix, i.e., $\lim_{k\to\infty} W^k = \mathbf{1} v^{\top}$, where $v$ is the unique (normalized) left-eigenvector of $W$ for the eigenvalue $1$, with $v^{\top}\mathbf{1} = 1$, and $\mathbf{1}$ is an $N$-dimensional vector with all entries equal to one. Since $x(k) = W^k x(0)$, we have $\lim_{k\to\infty} x(k) = \mathbf{1}\left(v^{\top} x(0)\right)$, implying consensus.
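A minimal sketch of the averaging consensus iteration on a hypothetical four-node ring graph; since this particular $W$ is doubly stochastic, the left eigenvector is uniform and the agents agree on the exact average of their initial values.

```python
import numpy as np

# Hypothetical 4-node ring graph; W[i, j] > 0 only if j is i or a neighbor.
# Rows and columns sum to one (doubly stochastic).
W = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])
x = np.array([1.0, 5.0, 3.0, 7.0])      # initial local values x_i(0)

for _ in range(200):
    x = W @ x                           # each node averages over its neighbors

# All entries of x converge to the average of the initial values, 4.0.
```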
IV-B Distributed optimization with averaging consensus
Consider a multi-agent system connected over a network, where each agent $i$ has its own (convex) cost function, $f_i$. Let $f(x) = \sum_{i=1}^{N} f_i(x)$ be the system objective that the agents want to minimize collectively. The distributed optimization problem is to solve
$$\min_{x \in \mathcal{X}} \; \sum_{i=1}^{N} f_i(x), \tag{14}$$
where $\mathcal{X}$ represents additional constraints on the variable $x$. By introducing local copies $x_i$, it is equivalently expressed as
$$\min_{x_1, \ldots, x_N \in \mathcal{X}} \; \sum_{i=1}^{N} f_i(x_i) \quad \text{subject to} \quad x_i = x_j, \;\; \forall (i,j) \in \mathcal{E}.$$
The distributed averaging consensus algorithm can be generalized to solve the distributed optimization problem. An example is the consensus-based distributed subgradient method , where each agent updates its local variable according to
$$x_i(k+1) = \Pi_{\mathcal{X}}\left[\sum_{j \in \mathcal{N}_i \cup \{i\}} W_{ij}\, x_j(k) - \alpha_k g_i(k)\right],$$
where $g_i(k)$ is any subgradient of $f_i$ at $x_i(k)$, $\alpha_k$ is a step-size, and $\Pi_{\mathcal{X}}$ is the Euclidean projection onto the constraint set $\mathcal{X}$.
The algorithm is a simple combination of the averaging consensus and the classical subgradient method. As in the averaging consensus, the update is executed locally by agent $i$, which only receives the values of its neighbors, $\{x_j(k)\}_{j \in \mathcal{N}_i}$. When all cost functions are convex, it is known that the local variables, $x_i(k)$, reach a consensus and converge to a solution of (14) under properly chosen step-sizes.
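The consensus-based distributed subgradient method can be sketched on a toy quadratic problem, $f_i(x) = (x - b_i)^2$, whose global minimizer is the mean of the local data $b_i$ (hypothetical numbers); note that no agent ever sees another agent's $b_j$ directly, only neighbors' iterates.

```python
import numpy as np

# Four agents on a ring jointly minimize f(x) = sum_i (x - b_i)^2;
# the global minimizer is mean(b) = 4.0. The b_i stay private to each agent.
W = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])
b = np.array([1.0, 5.0, 3.0, 7.0])      # hypothetical local data
x = np.zeros(4)                         # x[i]: agent i's local copy

for k in range(20_000):
    step = 1.0 / (k + 10)               # diminishing step-size
    grad = 2.0 * (x - b)                # local gradient of f_i at x_i
    x = W @ x - step * grad             # mix with neighbors, then descend
```

All local copies drift to a common value near the global minimizer, even though each gradient uses only private data.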
Other distributed optimization algorithms include EXTRA  (an exact first-order algorithm for decentralized consensus optimization), the push-sum algorithm  for directed graph models, and gossip-based algorithms , among others. A comprehensive and detailed summary of distributed optimization can be found in the monograph .
IV-C Distributed min-max optimization with averaging consensus
To go one step further, the distributed averaging consensus algorithm can also be generalized to solve min-max problems in a distributed fashion. The distributed min-max optimization problem deals with the zero-sum game
$$\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} \; \sum_{i=1}^{N} \phi_i(x, y),$$
where each $\phi_i$ is a convex-concave function, so the objective is separable. By introducing local copies $x_i$ and $y_i$, the min-max problem is equivalently expressed as
$$\min_{\{x_i\} \subset \mathcal{X}} \max_{\{y_i\} \subset \mathcal{Y}} \; \sum_{i=1}^{N} \phi_i(x_i, y_i) \quad \text{subject to} \quad x_i = x_j, \; y_i = y_j, \;\; \forall (i,j) \in \mathcal{E}. \tag{16}$$
Similar to the distributed subgradient method, the distributed primal-dual algorithm works by performing averaging consensus and (sub)gradient steps on the local variables $x_i$ and $y_i$ of each agent:
$$x_i(k+1) = \Pi_{\mathcal{X}}\left[\sum_{j} W_{ij}\, x_j(k) - \alpha_k g_i^{x}(k)\right], \qquad y_i(k+1) = \Pi_{\mathcal{Y}}\left[\sum_{j} W_{ij}\, y_j(k) + \beta_k g_i^{y}(k)\right],$$
where $\alpha_k$ and $\beta_k$ are step-sizes, $g_i^{x}(k)$ and $g_i^{y}(k)$ are any subgradients of $\phi_i$ with respect to $x$ and $y$, respectively, and $\Pi_{\mathcal{X}}$ and $\Pi_{\mathcal{Y}}$ are the Euclidean projections onto the constraint sets $\mathcal{X}$ and $\mathcal{Y}$, respectively. The distributed primal-dual algorithm and other variants have been well studied in [48, 49, 50].
V Networked MARL with decentralized rewards
In this section, we focus on networked MARL with decentralized rewards, where the corresponding networked MA-MDP is described by the tuple $(\mathcal{S}, \{\mathcal{A}_i\}_{i=1}^{N}, P, \{r_i\}_{i=1}^{N}, \gamma, \mathcal{G})$. The goal of each agent is to cooperatively find an optimal policy corresponding to the central reward, $r = \frac{1}{N}\sum_{i=1}^{N} r_i$, by sharing local learning parameters over a communication network characterized by the graph $\mathcal{G}$.
Decentralized rewards are common in practice when multiple agents cooperate to learn under sensing and physical limitations. Consider multiple robots navigating and executing multiple tasks in geometrically separated regions. The robots receive different rewards based on the space they reside in. Decentralized rewards are also particularly useful when MARL agents cooperate to learn an optimal policy securely due to privacy considerations. For instance, if we do not want to reveal full information about the policy design criterion to any single RL agent, a plausible approach is to operate multiple RL agents and provide each agent with only partial information about the reward function. In this case, no single agent alone can learn the optimal policy corresponding to the whole environment without information exchange with the other agents. Most recent algorithms to be discussed in this section, including [11, 16, 17, 51, 52, 12, 13, 14, 15], apply the distributed averaging consensus algorithm introduced in Section IV in one way or another. We now discuss these algorithms in detail below, with a brief summary provided in Table I.
| | Papers | Availability of actions | Reward | Function Approx. | Convergence |
| Policy Evaluation | Doan et al. | N/A | Decentralized | LFA | Yes |
| | Wai et al. | N/A | Decentralized | LFA | Yes |
| | Macua et al. | N/A | Centralized | LFA | Yes |
| | Stanković et al. | N/A | Centralized | LFA | Yes |
| Policy Optimization | Kar et al. | JAL | Decentralized | Tabular | Yes |
| | Zhang et al. | JAL | Decentralized | LFA, NFA | Yes |
| | Zhang et al. | JAL | Decentralized | LFA, NFA | Local |
| | Qu et al. | JAL | Decentralized | NFA | Local |
V-A Distributed Policy Evaluation
The goal of distributed policy evaluation is to evaluate the central value function, i.e., $V^{\pi}$ corresponding to the central reward, in a distributed manner. The information available to each agent $i$ consists of the state, its local reward, and the set of learning parameters agent $i$ receives from its neighbors over the communication network, where $\mathcal{N}_i$ is the set of all neighbors of node $i$ over the graph $\mathcal{G}$. Note that for policy evaluation with the state value function $V^{\pi}$, the action information is not necessary, and thereby it is not included in the information set.
The distributed TD-learning  executes the following local update at agent $i$:
$$\theta_i(k+1) = \sum_{j \in \mathcal{N}_i \cup \{i\}} W_{ij}\, \theta_j(k) + \alpha_k \left[r_{i,k} + \gamma \phi(s_{k+1})^{\top}\theta_i(k) - \phi(s_k)^{\top}\theta_i(k)\right]\phi(s_k),$$
where each agent keeps its local parameter $\theta_i$. The algorithm resembles the consensus-based distributed subgradient method in Section IV-B. The first term, dubbed the mixing term, is an average of the local copies of the learning parameters of the neighbors, $\{\theta_j\}_{j \in \mathcal{N}_i}$, received through communication over the network, and drives the local parameters to reach a consensus. The second term, referred to as the TD update, follows the standard TD update using the local reward. Under suitable conditions such as graph connectivity, each local copy, $\theta_i(k)$, converges to $\theta^*$ in expectation and almost surely , where $\theta^*$ is the optimal solution found by single-agent TD learning acting on the central reward.
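As a sanity-check sketch of this scheme in the tabular special case (identity features, hypothetical numbers), two agents with different local rewards run consensus-plus-TD updates, and both local copies approach $V^{\pi}$ for the central (average) reward:

```python
import numpy as np

rng = np.random.default_rng(3)
gamma = 0.9
# Shared 3-state Markov chain under a fixed policy (hypothetical numbers).
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
# Two agents observe different local rewards; the central reward is their average.
r_local = np.array([[2.0, 0.0, 2.0],    # agent 1's reward per state
                    [0.0, 0.0, 2.0]])   # agent 2's reward per state
W = np.array([[0.5, 0.5],
              [0.5, 0.5]])              # doubly stochastic consensus weights

V = np.zeros((2, 3))                    # V[i] is agent i's local copy
s = 0
for k in range(200_000):
    s_next = rng.choice(3, p=P[s])
    alpha = 100.0 / (k + 1000)
    V_new = W @ V                       # mixing step: average neighbors' copies
    for i in range(2):                  # local TD update with the LOCAL reward
        V_new[i, s] += alpha * (r_local[i, s] + gamma * V[i, s_next] - V[i, s])
    V = V_new
    s = s_next

# Target: V^pi for the central (average) reward, in closed form.
r_bar = r_local.mean(axis=0)
V_star = np.linalg.solve(np.eye(3) - gamma * P, r_bar)
```

Neither agent could reach this value function alone: agent 2, for instance, never observes the reward in state 0.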
V-B Distributed Policy Optimization
The goal of distributed policy optimization is to cooperatively find an optimal central policy corresponding to the central reward, . Note that the distributed TD-learning in the previous section only finds the state value function under a given policy. The averaging consensus idea can also be extended to Q-learning and actor-critic algorithms for finding the optimal policy for networked MARL.
The distributed Q-learning in  locally updates the Q-function according to
$$Q_i(s,a) \leftarrow Q_i(s,a) - \beta_{s,a}\sum_{j \in \mathcal{N}_i}\left(Q_i(s,a) - Q_j(s,a)\right) + \alpha_{s,a}\left[r_i + \gamma \max_{a'} Q_i(s', a') - Q_i(s,a)\right],$$
where $i$ is the agent index, and $\alpha_{s,a}$ and $\beta_{s,a}$ are learning rates (or step-sizes) depending on the number of instances when the pair $(s,a)$ has been encountered. The information available to each agent $i$ consists of the joint state-action pair, its local reward, and the Q-functions of its neighbors. The overall diagram of the distributed Q-learning algorithm is given in Figure 3. Each agent keeps the local Q-function, $Q_i$, and the mixing term consists of the Q-functions of neighbors received over the communication network. It has been shown that each local $Q_i$ reaches a consensus and converges to $Q^*$ almost surely  with suitable step-size rules and under assumptions such as the connectivity of the graph and an infinite number of state-action visits.
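A simplified sketch in the same consensus-plus-innovation spirit (not the exact algorithm of the cited work: we use a fixed doubly stochastic mixing matrix and a single step-size schedule, with hypothetical rewards); both agents' local Q-functions approach $Q^*$ of the central average reward:

```python
import numpy as np

rng = np.random.default_rng(5)
gamma = 0.9
# Toy 2-state, 2-action MDP shared by two agents (hypothetical numbers).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.3, 0.7]]])
# Local rewards R_local[i, s, a]; the central reward is their average.
R_local = np.array([[[2.0, 0.0], [0.0, 2.0]],
                    [[0.0, 0.0], [0.0, 2.0]]])
W = np.array([[0.5, 0.5],
              [0.5, 0.5]])                       # consensus weights

Q = np.zeros((2, 2, 2))                          # Q[i, s, a]
s = 0
for k in range(200_000):
    a = int(rng.integers(2))                     # uniform behavior policy
    s_next = rng.choice(2, p=P[a, s])
    alpha = 50.0 / (k + 500)
    Q_mix = np.einsum("ij,jsa->isa", W, Q)       # mix neighbors' Q-functions
    for i in range(2):                           # local Q-learning innovation
        Q_mix[i, s, a] += alpha * (R_local[i, s, a]
                                   + gamma * Q[i, s_next].max() - Q[i, s, a])
    Q = Q_mix
    s = s_next

# Benchmark: Q* for the central (average) reward, via value iteration.
R_bar = R_local.mean(axis=0)
Q_star = np.zeros((2, 2))
for _ in range(2000):
    Q_star = R_bar + gamma * np.einsum("ast,t->sa", P, Q_star.max(axis=1))
```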
The distributed actor-critic algorithm in  generalizes the single-agent actor-critic to the networked MA-MDP setting, where the averaging consensus steps are taken on the value function parameters, and $\theta_i$ and $w_i$ are the parameters of nonlinear function approximations for the local actor and local critic, respectively, with the advantage function evaluated at the current state-action pair. The overall diagram of the distributed actor-critic is given in Figure 4. Each agent keeps its local parameters $(\theta_i, w_i)$, and in the mixing step, it only receives the local critic parameters from its neighbors. The actor and critic updates are similar to those of typical actor-critic algorithms with local parameters. The information available to each agent consists of the joint state-action pair, its local reward, and the neighbors' critic parameters. The results in  study a MARL generalization of fitted Q-learning with the same kind of information structure. Compared to the tabular distributed Q-learning in , the distributed actor-critic and fitted Q-learning may not converge to an exact optimal solution, mainly due to the use of function approximation.
V-C Optimization Frameworks for Networked MA-MDP
Recall that in Section II-B we discussed optimization frameworks for the single-agent RL problem. By integrating them with consensus-based distributed optimization, they can be naturally adapted to solve networked MA-MDPs. In this subsection, we introduce some recent works in this direction, such as value propagation , the primal-dual distributed incremental aggregated gradient method , and the distributed GTD . The main idea of these algorithms is essentially rooted in formulating the overall MDP as a min-max optimization problem with a separable objective, and solving the resulting distributed min-max optimization problem (16). For MARL tasks, the distributed min-max problem can be solved using stochastic variants of the distributed saddle-point algorithms in Section IV-C.
The multi-agent policy evaluation algorithms in  and  are multi-agent variants of the GTD  based on the consensus-based distributed saddle-point framework for solving the mean-squared projected Bellman error in (9), which can be equivalently converted into an optimization problem with separable objectives. To alleviate the double sampling issue in GTD, the approach in  applies Fenchel duality with an additional proximal term to each local objective, arriving at a reformulation in which the local objectives are expressed in max-form.
The resulting problem can be solved by using stochastic variants of the consensus-based distributed subgradient method akin to . In particular, the algorithm introduces gradient surrogates of the objective function with respect to the local primal and dual variables, and the mixing steps for consensus are applied to both the local parameters and local gradient surrogates. The main idea of the primal-dual algorithm used in  is briefly (with some simplifications) written by
where the primal and dual updates use separate step-sizes, and the surrogates of the primal and dual gradients are obtained through basic gradient-tracking steps.
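To make the gradient-tracking idea concrete, here is a minimal sketch of a consensus-based primal-dual algorithm on a toy separable saddle-point problem: each agent holds a quadratic local objective, mixes both its parameters and its gradient surrogates with neighbors, and the surrogates track the network-average gradients. The local objectives, ring mixing matrix, and step-size are invented for illustration and are not the cited formulation.

```python
import numpy as np

N = 4
a = np.array([1.0, 2.0, 3.0, 4.0])      # local data defining each agent's objective
W = np.zeros((N, N))                    # doubly stochastic mixing matrix (ring graph)
for i in range(N):
    W[i, i] = 0.5
    W[i, (i + 1) % N] = 0.25
    W[i, (i - 1) % N] = 0.25

# Local saddle functions L_i(x, y) = 0.5*(x - a_i)**2 + x*y - 0.5*y**2.
# The average saddle point is x* = y* = mean(a)/2 = 1.25.
def grad_x(x, y, i):
    return (x - a[i]) + y               # ∂L_i/∂x

def grad_y(x, y, i):
    return x - y                        # ∂L_i/∂y

x = np.zeros(N); y = np.zeros(N)
gx = np.array([grad_x(x[i], y[i], i) for i in range(N)])
gy = np.array([grad_y(x[i], y[i], i) for i in range(N)])
sx, sy = gx.copy(), gy.copy()           # gradient surrogates, initialized at local gradients

alpha = 0.1
for t in range(2000):
    x_new = W @ x - alpha * sx          # primal descent with consensus mixing
    y_new = W @ y + alpha * sy          # dual ascent with consensus mixing
    gx_new = np.array([grad_x(x_new[i], y_new[i], i) for i in range(N)])
    gy_new = np.array([grad_y(x_new[i], y_new[i], i) for i in range(N)])
    sx = W @ sx + gx_new - gx           # gradient tracking: surrogates are also mixed
    sy = W @ sy + gy_new - gy
    x, y, gx, gy = x_new, y_new, gx_new, gy_new
```

All agents' primal and dual variables converge to the saddle point of the averaged objective, even though each agent only ever sees its own gradient and its neighbors' iterates.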
where the stacked vector enumerates the local parameters and the consensus constraint is expressed through the graph Laplacian matrix. Note that if the underlying graph is connected, then the Laplacian constraint holds if and only if all local parameters agree. By constructing the Lagrangian dual of the above constrained optimization, we obtain the corresponding single min-max problem. Thanks to the Laplacian matrix, the corresponding stochastic primal-dual algorithm is automatically decentralized. Compared to , it only needs to share local parameters with neighbors rather than the gradient surrogates.
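The null-space property of the Laplacian that underlies this reformulation can be verified directly: for a connected graph, a vector is annihilated by the Laplacian exactly when all of its entries agree. The small path graph below is an arbitrary example.

```python
import numpy as np

# Path graph over 4 agents: 1-2-3-4 (connected).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A          # graph Laplacian: degree matrix minus adjacency

consensus = np.full(4, 3.7)             # all agents hold the same parameter
disagree = np.array([1.0, 2.0, 3.0, 4.0])

print(np.allclose(L @ consensus, 0))    # True: consensus vectors lie in the null space
print(np.allclose(L @ disagree, 0))     # False: disagreement violates the constraint
```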
The MARL algorithm in  combines averaging consensus with SBEED  (Smoothed Bellman Error Embedding), and is called distributed SBEED here. In particular, distributed SBEED aims to solve the so-called smoothed Bellman equation
by minimizing the corresponding mean-squared smoothed Bellman error:
where a positive real number captures the smoothness level, and the value function and policy are parameterized by deep neural networks. Directly applying stochastic gradients to the above objective using samples leads to biased estimates due to the nonlinearity of the objective (the double-sampling issue). To alleviate this difficulty, distributed SBEED introduces a primal-dual form as in , which results in a distributed saddle-point problem similar to (16) that is solved with a stochastic variant of the distributed proximal primal-dual algorithm in .
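The double-sampling issue can be seen numerically: squaring a single-sample estimate of an expectation overestimates the squared expectation by the variance of the sample, so a plain stochastic gradient of a squared Bellman error is biased. The Gaussian stand-in for the random Bellman residual below is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
# For a fixed state, the residual delta = r + gamma*V(s') - V(s) is still random
# because the next state s' is random; model it here as Gaussian for illustration.
delta = rng.normal(loc=0.5, scale=1.0, size=1_000_000)

true_sq_error = np.mean(delta) ** 2      # (E[delta])^2: the quantity we want, ≈ 0.25
naive_estimate = np.mean(delta ** 2)     # E[delta^2]: what single samples estimate, ≈ 1.25

print(true_sq_error, naive_estimate)     # the naive estimate is inflated by Var(delta) ≈ 1.0
```

Obtaining an unbiased gradient would require two independent samples of the next state from the same state, which is impossible along a single trajectory; this is precisely what the primal-dual (Fenchel) reformulation circumvents.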
V-D Special Case: Networked MARL with Centralized Rewards
Lastly, we remark that the algorithms in this section can be directly applied to MA-MDPs with central rewards. As in Section III, we consider an MDP with an additional network communication model, where each agent receives the common reward instead of a local reward. One may imagine reinforcement learning algorithms running in identical and independent simulated environments. Under this assumption, a distributed policy evaluation scheme was studied in . It combines GTD  with the distributed averaging consensus algorithm as follows:
where the local TD-error is computed from each agent's own samples. Each agent has access only to its local information, and the action is not used in the updates. The first update is equivalent to the GTD in  with local parameters, and the second term is equivalent to the distributed averaging consensus update in (11). Since the GTD update rule is equivalent to a stochastic primal-dual algorithm, the above update rule amounts to a distributed algorithm for solving the distributed saddle-point problem in (16). Note that  only proves weak convergence of the algorithm. In the same vein, the multi-agent policy evaluation in  generalizes GQ-learning to distributed settings; it is more general than GTD in that it incorporates an importance weight for each agent, which measures the dissimilarity between the target and behavior policies for off-policy learning.
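The "local update plus averaging consensus" pattern can be sketched as follows, with tabular TD(0) standing in for GTD on a toy two-state chain with a common reward; the chain, step-size schedule, and ring mixing matrix are all invented for illustration and the sketch is not the cited algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 4                                    # agents in identical, independent environments
P = np.array([[0.5, 0.5], [0.5, 0.5]])   # common 2-state Markov chain
r = np.array([1.0, 0.0])                 # common (centralized) reward
gamma = 0.9                              # true values work out to V = (5.5, 4.5)

W = np.zeros((N, N))                     # doubly stochastic mixing matrix (ring graph)
for i in range(N):
    W[i, i] = 0.5
    W[i, (i + 1) % N] = 0.25
    W[i, (i - 1) % N] = 0.25

V = np.zeros((N, 2))                     # each agent's local value estimates
s = np.zeros(N, dtype=int)               # each agent's current state
for t in range(20000):
    alpha = 50.0 / (100.0 + t)           # decaying step-size
    for i in range(N):
        s2 = rng.choice(2, p=P[s[i]])                       # independent local sample
        delta = r[s[i]] + gamma * V[i, s2] - V[i, s[i]]     # local TD-error
        V[i, s[i]] += alpha * delta                         # local TD-style update
        s[i] = s2
    V = W @ V                                               # averaging consensus step
```

All agents converge to near-identical estimates close to the true values, illustrating how the consensus step keeps independently-sampling agents in agreement.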
VI Future Directions
Until now, we have mainly focused on networked MARL and recent advances that combine tools from consensus-based distributed optimization with MARL under decentralized rewards. Many more challenging research agendas remain to be studied. By bridging the two domains in a synergistic way, these research topics are expected to generate new results and enrich both fields.
Robustness of networked MARL
Real-world communication networks often suffer from communication delays, noise, link failures, or packet drops. Moreover, network topologies may vary over time, and information exchange over the networks may not be bidirectional in general. Distributed optimization algorithms over time-varying, directed graphs, with or without communication delays, have been studied extensively in the distributed optimization community, yet mostly in deterministic and convex settings. The study of networked MARL under aforem