Value Propagation for Decentralized Networked Deep Multi-agent Reinforcement Learning

01/27/2019 · by Chao Qu et al.

We consider the networked multi-agent reinforcement learning (MARL) problem in a fully decentralized setting, where agents learn to coordinate to achieve joint success. This problem is widely encountered in many areas including traffic control, distributed control, and smart grids. We assume that the reward function for each agent can be different and is observed only locally by the agent itself. Furthermore, each agent is located at a node of a communication network and can exchange information only with its neighbors. Using softmax temporal consistency and a decentralized optimization method, we obtain a principled and data-efficient iterative algorithm. In the first step of each iteration, an agent computes its local policy and value gradients and then updates only its policy parameters. In the second step, the agent propagates to its neighbors the messages based on its value function and then updates its own value function. Hence we name the algorithm value propagation. We prove a non-asymptotic convergence rate of O(1/T) with nonlinear function approximation. To the best of our knowledge, it is the first MARL algorithm with a convergence guarantee in the control, off-policy, and non-linear function approximation setting. We empirically demonstrate the effectiveness of our approach in experiments.


1 Introduction

Multi-agent systems have applications in a wide range of areas such as robotics, traffic control, distributed control, telecommunications, and economics. For these areas, it is often difficult or simply impossible to predefine agents' behaviour to achieve satisfactory results, and multi-agent reinforcement learning (MARL) naturally arises Bu et al. (2008); Tan (1993). For example, El-Tantawy et al. (2013) represent a traffic signal control problem as a multi-player stochastic game and solve it with MARL. MARL generalizes reinforcement learning by considering a set of autonomous agents (i.e., decision makers) sharing a common environment. However, multi-agent reinforcement learning is a challenging problem, since agents interact with both the environment and each other. For instance, independent Q-learning, which treats other agents as part of the environment, often fails because the multi-agent setting breaks the theoretical convergence guarantee of Q-learning and makes learning unstable Tan (1993). Rashid et al. (2018) alleviate this problem using a centralized mixing network (i.e., centralized training with decentralized execution). Its communication pattern is illustrated in the left panel of Figure 1.

Despite the great success of (partially) centralized MARL approaches, there are various scenarios, such as sensor networks Rabbat and Nowak (2004) and intelligent transportation systems Adler and Blue (2002), where a central agent does not exist or is too expensive to use. Also, a centralized approach may suffer from malicious attacks on the center. These scenarios necessitate fully decentralized approaches, which are useful for many applications including unmanned vehicles Fax and Murray (2002), power grids Callaway and Hiskens (2011), and sensor networks Cortes et al. (2004). In these approaches, we can use a network to represent the interactions between agents (see the right panel of Figure 1): each agent makes its own decision based on its local reward and the messages received from its neighbors. In particular, we consider collaborative MARL, in which all agents share the common goal of maximizing the averaged cumulative reward over all agents (see equation (4)).

In this paper, we propose a new fully decentralized networked multi-agent deep reinforcement learning algorithm. Using softmax temporal consistency Nachum et al. (2017) to connect value and policy updates and a decentralized optimization method Hong et al. (2017), we obtain a two-step iterative update in this new algorithm. In the first step of each iteration, each agent computes its local policy and value gradients and then updates only its policy parameters. In the second step, each agent propagates to its neighbors the messages based on its value function and then updates its own value function. Hence we name the algorithm value propagation. In value propagation, local rewards are passed to neighboring agents via value-function messages. We approximate the local policy and value functions of each agent by deep neural networks, which enables automatic feature generation and end-to-end learning.

Value propagation is both principled and data efficient. We prove that the value propagation algorithm converges with rate O(1/T) even with non-linear deep neural network function approximation. To the best of our knowledge, it is the first deep MARL algorithm with a non-asymptotic convergence guarantee. At the same time, value propagation can use off-policy updates, making it data efficient. Also, by integrating the decentralized optimization method of Hong et al. (2017) into MARL, we essentially extend this method to a stochastic version, which may be of independent interest to the optimization community.

Experimental results show that value propagation clearly outperforms independent RL on each agent (i.e., no communication) and achieves favorable results over an alternative networked MARL method Zhang et al. (2018).

Figure 1: Centralized network vs. decentralized network. Each blue node in the figure corresponds to an agent. In the centralized network (left), the red central node collects information from all agents, while in the decentralized network (right), agents exchange information with their neighbors.

2 Preliminary

2.1 MDP

A Markov Decision Process (MDP) can be described by a 5-tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$: $\mathcal{S}$ is the finite state space, $\mathcal{A}$ is the finite action space, $P(s'|s,a)$ are the transition probabilities, $R(s,a)$ are the real-valued immediate rewards, and $\gamma \in (0,1)$ is the discount factor. A policy is used to select actions in the MDP. In general, the policy is stochastic and denoted by $\pi(a|s)$, where $\pi(a|s)$ is the conditional probability density at $a$ associated with the policy. Define $V^*(s)$ to be the optimal value function. It is known that $V^*$ is the unique fixed point of the Bellman optimality operator

$\mathcal{T}V(s) = \max_{a}\big[ R(s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a)} V(s') \big].$

The optimal policy $\pi^*$ is related to $V^*$ by the following equation: $\pi^*(s) \in \arg\max_{a}\big[ R(s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a)} V^*(s') \big].$

2.2 Softmax Temporal Consistency

Nachum et al. (2017) establish a connection between value- and policy-based reinforcement learning based on a relationship between softmax temporal value consistency and policy optimality under entropy regularization. Particularly, the soft Bellman optimality is as follows:

$V^*(s) = \max_{\pi(\cdot|s)}\ \mathbb{E}_{a\sim\pi(\cdot|s)}\big[ R(s,a) + \gamma\,\mathbb{E}_{s'}V^*(s') \big] + \tau\,\mathbb{H}\big(\pi(\cdot|s)\big),$  (1)

where $\mathbb{H}$ denotes the entropy and $\tau \ge 0$ controls the degree of regularization. When $\tau = 0$, the above equation reduces to the standard Bellman optimality condition. It is not hard to derive that

$\pi^*(a|s) = \frac{\exp\big( (R(s,a) + \gamma\,\mathbb{E}_{s'}V^*(s'))/\tau \big)}{\sum_{a'}\exp\big( (R(s,a') + \gamma\,\mathbb{E}_{s'}V^*(s'))/\tau \big)},$  (2)

which is a softmax function over $R(s,a) + \gamma\,\mathbb{E}_{s'}V^*(s')$.
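To make the fixed point in (1) and the softmax policy in (2) concrete, here is a minimal tabular sketch. The array shapes, the function name soft_bellman_backup, and the fixed-point iteration are illustrative assumptions rather than the paper's implementation; the point is only that the backup uses a log-sum-exp in place of the hard maximum, and that a zero temperature recovers the standard Bellman backup.

```python
# Minimal tabular sketch of the soft Bellman backup (1) and softmax policy (2).
# P, R, tau, and the iteration below are illustrative assumptions, not the paper's code.
import numpy as np

def soft_bellman_backup(V, P, R, gamma, tau):
    """One application of the soft Bellman operator.

    P: transition tensor of shape (S, A, S); R: reward matrix of shape (S, A).
    Returns the updated value vector and the corresponding policy.
    """
    Q = R + gamma * P @ V                       # Q[s, a] = R(s, a) + gamma * E_{s'}[V(s')]
    if tau == 0.0:                              # standard (hard) Bellman optimality
        return Q.max(axis=1), (Q == Q.max(axis=1, keepdims=True)).astype(float)
    V_new = tau * np.log(np.exp(Q / tau).sum(axis=1))   # log-sum-exp: soft maximum
    pi = np.exp((Q - V_new[:, None]) / tau)             # softmax over Q, i.e., equation (2)
    return V_new, pi

# Usage: iterate the backup to (approximately) reach the fixed point V*.
S, A, gamma, tau = 5, 3, 0.9, 0.1
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))      # random transition probabilities
R = rng.uniform(0, 1, size=(S, A))
V = np.zeros(S)
for _ in range(500):
    V, pi = soft_bellman_backup(V, P, R, gamma, tau)
```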

An important property of this soft Bellman operator is the so-called temporal consistency, which leads to Path Consistency Learning.

Proposition 1.

Nachum et al. (2017) Assume $\tau > 0$. Let $V^*$ be the fixed point of (1) and $\pi^*$ be the corresponding policy that attains the maximum on the RHS of (1). Then, $(V^*, \pi^*)$ is the unique pair that satisfies the following equation for all $(s, a)$:

$V^*(s) = R(s,a) + \gamma\,\mathbb{E}_{s'}V^*(s') - \tau\log\pi^*(a|s).$

A straightforward way to apply temporal consistency is to optimize the following problem:

$\min_{V,\pi}\ \mathbb{E}_{s,a}\Big[\big( R(s,a) + \gamma\,\mathbb{E}_{s'}V(s') - \tau\log\pi(a|s) - V(s) \big)^2\Big].$

Dai et al. (2018) get around the double sampling problem of the above formulation by introducing a primal-dual form

(3)

where $\rho(s,a)$ is a dual function and $\eta$ controls the trade-off between bias and variance.
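The following minimal sketch illustrates the conjugate trick behind this primal-dual form: since $x^2 = \max_{\rho}(2\rho x - \rho^2)$, squaring the consistency residual can be replaced by a maximization over a dual function evaluated on the same batch, so no second independent next-state sample is needed. The function name and the exact placement of the trade-off parameter eta are assumptions, not the paper's equation (3).

```python
# Sketch of the double-sampling workaround: a saddle-point surrogate in place of
# the squared consistency error. Names and the eta placement are assumptions.
import numpy as np

def saddle_point_surrogate(delta, rho, eta=1.0):
    """Estimate the primal-dual surrogate on a batch.

    delta: consistency residuals r + gamma*V(s') - tau*log pi(a|s) - V(s), shape (B,)
    rho:   dual-function values rho(s, a) on the same batch, shape (B,)
    eta:   bias/variance trade-off (eta = 1 gives the pure conjugate form).
    """
    return np.mean(2.0 * rho * delta - eta * rho ** 2)

# The inner maximum over rho is attained at rho(s, a) = E[delta | s, a] / eta, so
# maximizing over rho and minimizing over (V, pi) recovers the squared loss
# (up to the 1/eta scaling) without squaring a conditional expectation estimate.
```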

Notation. We use $\|\cdot\|$ to denote the Euclidean norm of a vector. $A^\top$ stands for the transpose of $A$. $\odot$ denotes the entry-wise product between two vectors.

3 Value Propagation

In this section, we present our multi-agent reinforcement learning algorithm, i.e., value propagation. To begin with, we extend the MDP to the networked multi-agent MDP following the definition in Zhang et al. (2018). Let $\mathcal{G} = (\mathcal{N}, \mathcal{E})$ be an undirected graph with $N$ agents (nodes), where $\mathcal{E}$ represents the set of edges. $(i,j) \in \mathcal{E}$ means agents $i$ and $j$ can communicate with each other through this edge. A networked multi-agent MDP is characterized by a tuple $(\mathcal{S}, \{\mathcal{A}^i\}_{i\in\mathcal{N}}, P, \{R^i\}_{i\in\mathcal{N}}, \mathcal{G}, \gamma)$: $\mathcal{S}$ is a global state space shared by all agents, $\mathcal{A}^i$ is the action space of agent $i$, $\mathcal{A} = \prod_{i=1}^{N}\mathcal{A}^i$ is the joint action space, $P$ is the transition probability, and $R^i$ denotes the local reward function of agent $i$. We assume that the state and the joint action are globally observable, while rewards are observed only locally. At each time step, agents observe $s_t$ and make the decision $a_t$. Then each agent receives only its own reward $R^i(s_t, a_t)$, and the environment switches to the new state according to the transition probability. Furthermore, since each agent makes its decision independently, it is reasonable to assume that the policy can be factorized, i.e., $\pi(a|s) = \prod_{i=1}^{N}\pi^i(a^i|s)$. This assumption is also made by Zhang et al. (2018). We call our method fully decentralized, since each reward is received locally, each action is executed locally by the corresponding agent, and each critic (value function) is trained locally.
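The factorized-policy assumption can be made concrete with a small sketch: the joint log-probability is the sum of the agents' local log-probabilities, so each agent only ever evaluates and updates its own factor. The tabular LocalPolicy class below is a hypothetical stand-in for the per-agent policy networks used in the paper.

```python
# Sketch of the factorization pi(a|s) = prod_i pi_i(a_i|s); names are illustrative.
import numpy as np

class LocalPolicy:
    def __init__(self, n_states, n_actions, rng):
        self.logits = rng.normal(size=(n_states, n_actions))  # tabular stand-in for a network

    def log_prob(self, s, a):
        z = self.logits[s]
        return z[a] - z.max() - np.log(np.exp(z - z.max()).sum())  # stable log-softmax

def joint_log_prob(policies, s, joint_action):
    # log pi(a|s) = sum_i log pi_i(a_i|s); each agent only evaluates its own factor.
    return sum(p.log_prob(s, a_i) for p, a_i in zip(policies, joint_action))

rng = np.random.default_rng(0)
agents = [LocalPolicy(n_states=4, n_actions=2, rng=rng) for _ in range(3)]
print(joint_log_prob(agents, s=1, joint_action=(0, 1, 1)))
```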

3.1 Multi-Agent Softmax Temporal Consistency

The goal of value propagation is to learn a policy that maximizes the long-term reward averaged over the agents, i.e.,

(4)

In the following, we adapt the temporal consistency to the multi-agent setting. Let $V^*$ be the optimal value function and $\pi^*$ be the corresponding policy. Applying the soft temporal consistency, we obtain that for all $(s, a)$, $(V^*, \pi^*)$ is the unique pair that satisfies

(5)

An optimization problem inspired by (5) would be

(6)

There are two potential issues with the above formulation:

  • Due to the inner conditional expectation, it would require two independent samples to obtain an unbiased estimate of the gradient of (6) Dann et al. (2014).

  • The value function is a global variable over the network and thus cannot be updated in a decentralized way.

For the first issue, we introduce the primal-dual form of (6) as in Dai et al. (2018). Using the fact that $x^2 = \max_{\rho}(2\rho x - \rho^2)$ and the interchangeability principle Shapiro et al. (2009), we have

(7)

Changing the variable, the objective function becomes

(8)

where

3.2 Decentralized Parametrization

In the following, we assume that the policy, value function, and dual variable all lie in parametric function classes. Particularly, each agent's policy has its own parameters, and the value function and the dual function are characterized by their respective parameters. As in Dai et al. (2018), we optimize a slightly different version of (3.1).

(9)

where $\eta$ controls the bias and variance trade-off.

Recall the second issue regarding the global value function. To address this problem, we introduce a local copy of the value function, i.e., $V_i$ for each agent $i$. In the algorithm, we have a consensus update step that forces these local copies to be the same, i.e., $V_1 = V_2 = \cdots = V_N$, or equivalently equality of their parameters, where each $V_i$ has its own parameter. Notice that in (3.2) there is a global dual variable in the primal-dual form. Therefore, we also introduce a local copy of the dual variable for each agent to formulate the problem as a decentralized optimization problem.

Now the final objective function we need to optimize is

(10)

where

Now we are ready to present the value propagation algorithm. In the following, for notational simplicity, we assume the parameter of each agent is a scalar. We pack the parameters of all agents into vectors and slightly abuse the notation accordingly; similarly, we pack the stochastic gradients.

3.3 The Value Propagation Algorithm

To solve (10), we first optimize the inner dual problem and find its solution. Then we perform (decentralized) stochastic gradient descent on the primal problem.

(11)

Notice that in practice we cannot get the exact solution of the dual problem. Thus we run (decentralized) stochastic gradient steps on the dual problem for a fixed number of iterations and obtain an approximated solution in Algorithm 1. In our analysis, we take the error generated by this inexact solution into consideration and analyze its effect on the convergence.

In the dual update, we perform a consensus update using the stochastic gradient of each agent, where an auxiliary variable is introduced to incorporate the communication constraint. This update rule is adapted from the decentralized optimization method of Hong et al. (2017). Notice that Hong et al. (2017) only consider the batch gradient case, while our algorithm and analysis include the stochastic gradient case. Some remarks on Algorithm 1 are in order.

  • The update of each agent needs only the information of the agent itself and its neighbors. To see this, we need to explain the matrices in the algorithm, which are closely related to the topology of the graph. Particularly, the degree matrix $D$ is diagonal, with $D_{ii}$ denoting the degree of node $i$. $A$ is the node-edge incidence matrix: if an edge $e \in \mathcal{E}$ connects vertices $i$ and $j$ with $i > j$, then the row of $A$ corresponding to $e$ has entry $1$ in column $i$, entry $-1$ in column $j$, and $0$ otherwise. The signless incidence matrix is $B = |A|$, where the absolute value is taken componentwise, and the signless graph Laplacian is $L_+ = B^\top B$. By definition, $(L_+)_{ij} = 0$ if $i \neq j$ and $(i,j) \notin \mathcal{E}$. Since the non-zero elements of these matrices correspond only to each agent and its neighbors, the update depends only on each agent itself and its neighbors (see the sketch after this list). The derivation of this update is deferred to the appendix due to space limits.

  • The consensus update forces the agents to satisfy the constraint $V_1 = V_2 = \cdots = V_N$ (similarly for the dual variables) so as to reach a consensual estimate of the value function. Particularly, the auxiliary variable records the violation of this constraint; it is combined with the gradient information to form a new direction for updating the parameters.

  • The topology of the graph affects the convergence speed. In particular, the rate depends on quantities related to the spectral gap of the network. We refer the reader to Hong et al. (2017), since this is not the main contribution of this paper.
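As referenced in the first remark above, the following sketch builds the graph quantities for a small example and checks that the signless Laplacian only couples neighboring agents. The edge-list representation and helper name are illustrative assumptions.

```python
# Degree matrix D, node-edge incidence matrix A, signless incidence matrix B = |A|,
# and signless graph Laplacian L_plus = B^T B for a toy graph. Only neighbors give
# non-zero entries, so each agent's update is local. Names here are assumptions.
import numpy as np

def graph_matrices(n_nodes, edges):
    A = np.zeros((len(edges), n_nodes))          # one row per edge
    for e, (i, j) in enumerate(edges):
        A[e, i], A[e, j] = 1.0, -1.0             # arbitrary orientation of edge (i, j)
    B = np.abs(A)                                # signless incidence matrix
    adj = np.zeros((n_nodes, n_nodes))
    for i, j in edges:
        adj[i, j] = adj[j, i] = 1.0
    D = np.diag(adj.sum(axis=1))                 # degree matrix
    L_plus = B.T @ B                             # signless Laplacian, equals D + adjacency
    assert np.allclose(L_plus, D + adj)
    return D, A, B, L_plus

# Example: a 4-agent ring; (L_plus)_{ij} is non-zero only for i = j or (i, j) an edge.
D, A, B, L_plus = graph_matrices(4, edges=[(0, 1), (1, 2), (2, 3), (3, 0)])
```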

After the update of the dual parameters, we optimize the primal parameters. Similarly, we use a mini-batch of data from the replay buffer and then perform a consensus update on the value parameters. The same remarks also hold for the primal parameters. Notice that we do not need a consensus update on the policy parameters, since each agent's policy is different from the others.

   Input: Environment ENV, learning rates, discount factor, number of steps to train the dual parameter, replay buffer capacity, node-edge incidence matrix, degree matrix, signless graph Laplacian.
   Initialization of the policy, value, and dual parameters.
   for each iteration do
      Sample a trajectory and add it to the replay buffer.
      // 1. Update the dual parameters: repeat the following dual update the prescribed number of times:
      Randomly sample a mini-batch of transitions from the replay buffer.
      for agent i = 1 to N do
         Calculate the stochastic gradient of the dual objective w.r.t. agent i's dual parameter.
      end for
      // Do consensus update on the dual parameters.
      // 2. Update the primal parameters:
      Randomly sample a mini-batch of transitions from the replay buffer.
      for agent i = 1 to N do
         Calculate the stochastic gradients of the objective w.r.t. agent i's policy and value parameters.
      end for
      // Do gradient descent on the policy parameters for each agent.
      // Do consensus update on the value parameters.
   end for
Algorithm 1 Value Propagation
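The sketch below mirrors the two-step structure of Algorithm 1 in simplified form: a few stochastic dual updates with a consensus step, followed by a local policy-gradient step and a consensus step on the value parameters. The gradient oracles, the replay-buffer interface, and the use of a simple mixing-matrix average for the consensus step are assumptions; the paper's actual consensus update is the Prox-PDA-style rule built from the incidence and signless-Laplacian matrices, whose derivation is deferred to the appendix.

```python
# Highly simplified sketch of one outer iteration of the value propagation loop.
import numpy as np

def consensus(params, W):
    # params has shape (N, d); agent i mixes its parameters with its neighbors'
    # through row i of the doubly stochastic matrix W.
    return W @ params

def value_propagation_step(theta, w, v, W, sample_batch, dual_grad, primal_grads, lr, K):
    """theta, w, v: per-agent policy / value / dual parameters, shape (N, d) each.
    sample_batch: callable returning a mini-batch from the replay buffer.
    dual_grad / primal_grads: stochastic gradient oracles supplied by the caller."""
    for _ in range(K):                                        # 1. inexact inner dual solve
        v = consensus(v - lr * dual_grad(v, sample_batch()), W)
    g_theta, g_w = primal_grads(theta, w, v, sample_batch())  # 2. primal update
    theta = theta - lr * g_theta                              # local policy step (no consensus)
    w = consensus(w - lr * g_w, W)                            # consensus step on value parameters
    return theta, w, v

# Dummy usage with placeholder gradients, only to show the data flow.
N, d, rng = 4, 3, np.random.default_rng(0)
W = np.full((N, N), 1.0 / N)                                  # trivially doubly stochastic
theta, w, v = (rng.normal(size=(N, d)) for _ in range(3))
theta, w, v = value_propagation_step(
    theta, w, v, W,
    sample_batch=lambda: None,
    dual_grad=lambda v, b: v,
    primal_grads=lambda t, w, v, b: (t, w),
    lr=0.1, K=5)
```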

3.4 Acceleration

Algorithm 1 trains the agents with the vanilla gradient descent method plus an extra consensus update. In practice, adaptive momentum gradient methods, including Adagrad Duchi et al. (2011), RMSProp Tieleman and Hinton, and Adam Kingma and Ba (2014), have much better performance in training deep neural networks. We adapt Adam to our setting and propose Algorithm 2, which performs better than Algorithm 1 in practice. We defer the details to the appendix due to space limits.

3.5 Multi-step Extension of Value Propagation

The temporal consistency can be extended to the multi-step case Nachum et al. (2017), where the following equation holds:

$V^*(s_t) = \gamma^{k}\,\mathbb{E}\big[V^*(s_{t+k})\big] + \mathbb{E}\Big[\sum_{j=0}^{k-1}\gamma^{j}\big( R(s_{t+j}, a_{t+j}) - \tau\log\pi^*(a_{t+j}|s_{t+j}) \big)\Big].$

Thus, in the objective function (10), we can replace the one-step consistency by its $k$-step counterpart and change the estimation of the stochastic gradient correspondingly in Algorithm 1 and Algorithm 2 to obtain the multi-step version of value propagation (see appendix). In practice, setting $k > 1$ performs better than $k = 1$, which is also observed in the single-agent case Nachum et al. (2017); Dai et al. (2018). We can tune $k$ for each application to get the best performance.
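A minimal sketch of the k-step residual that would replace the one-step consistency term: the value at $s_t$ is compared against the discounted value at $s_{t+k}$ plus the discounted sum of entropy-regularized rewards along the path. Treating the rewards and log-probabilities as plain arrays, and using agent-averaged rewards with summed local log-policies in the networked case, are assumptions about how the paper instantiates this.

```python
# k-step path-consistency residual (zero at the optimal pair), following the
# multi-step consistency of Nachum et al. (2017). Names are illustrative.
import numpy as np

def k_step_residual(V_t, V_tk, rewards, log_pis, gamma, tau):
    """rewards, log_pis: arrays of length k for steps t, ..., t+k-1."""
    k = len(rewards)
    discounts = gamma ** np.arange(k)
    path_term = np.sum(discounts * (np.asarray(rewards) - tau * np.asarray(log_pis)))
    return -V_t + gamma ** k * V_tk + path_term

# Example with k = 3.
print(k_step_residual(V_t=1.0, V_tk=0.8, rewards=[0.1, 0.0, 0.2],
                      log_pis=[-0.7, -0.7, -0.7], gamma=0.99, tau=0.1))
```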

4 Theoretical Result

In this section, we give the convergence result for Algorithm 1 using the vanilla stochastic gradient and leave the analysis of its accelerated version (Algorithm 2) as future work. To the best of our knowledge, the convergence of Adam to a stationary point in the decentralized setting is still an open problem. We first make two mild assumptions on the function approximators of the value, policy, and dual functions.

Assumption 1.

1. The function approximators are differentiable and have Lipschitz continuous gradients, i.e., $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$ for some constant $L$. This is commonly assumed in non-convex optimization.

2. The function approximators are lower bounded. This can be easily satisfied when the parameters are bounded, i.e., $\|\theta\| \le C$ for some positive constant $C$.

In the following, we give the theoretical analysis of Algorithm 1 in the same setting as Antos et al. (2008); Dai et al. (2018), where samples are prefixed and come from a single $\beta$-mixing off-policy sample path. We denote

Theorem 1.

Let the function approximators of the value, policy, and dual functions satisfy Assumption 1, and let the total number of training steps be $T$. We solve the inner dual problem to an approximated solution whose distance to the exact dual solution is sufficiently small. Assume the variances of the stochastic gradients (estimated by a single sample) are bounded, and the mini-batch size, the step size, and the bias-variance trade-off term are chosen appropriately. Then value propagation in Algorithm 1 converges to a stationary solution of (10) with rate $O(1/T)$.

Some remarks are in order:

  • The convergence criterion and its dependence on the network structure are involved. We defer their definitions to the proof in the appendix (equation (44)).

  • We require that the approximated dual solution is close enough to the exact one, so that the estimates of the primal gradients are close to the true ones. When the inner dual problem is concave, we can obtain such an approximated solution easily using the vanilla decentralized stochastic gradient method in a bounded number of steps, or obtain a better solution with an advanced momentum method in practice. If the dual problem is non-convex, our proof still shows that the dual problem converges to some stationary solution.

  • In the theoretical analysis, estimating the stochastic gradient from a mini-batch (rather than from a single sample) is common in non-convex analysis; see Ghadimi and Lan (2016). In practice, mini-batches are widely used in training deep neural networks.

Algorithm | Decentralized network | Nonlinear convergence proof | Off-policy training
Value propagation | Yes | Yes, O(1/T) | Yes
MA-AC Zhang et al. (2018) | Yes | No | No
COMA Foerster et al. (2018) | No | No | No
Qmix Rashid et al. (2018) | No | No | Yes
Table 1: Comparison of different algorithms in terms of network structure, convergence guarantee, and off-policy training.

5 Related work

Among related work on MARL, the setting of Zhang et al. (2018) is closest to ours: the authors propose a fully decentralized multi-agent actor-critic algorithm to maximize the expected time-averaged reward. They provide an asymptotic convergence analysis in the on-policy, linear function approximation setting. In our work, we consider the discounted reward setup, i.e., equation (4). Our algorithm covers both the on-policy and off-policy settings and thus can exploit data more efficiently. Furthermore, we provide a convergence rate in the non-linear function approximation setting, which is much stronger than the result in Zhang et al. (2018). Littman (1994) proposed the framework of Markov games, which can be applied to collaborative and competitive settings Lauer and Riedmiller (2000); Hu and Wellman (2003). These early works consider the tabular case and thus cannot be applied to real problems with large state spaces. Recent works Foerster et al. (2016, 2018); Rashid et al. (2018); Raileanu et al. (2018); Jiang et al. (2018) have exploited powerful deep learning techniques and obtained promising empirical results. However, most of them lack theoretical guarantees, while our work provides a convergence analysis.

Arslan and Yüksel (2017) consider the case where the reward function of each agent is the same, which is different from our setting. In multi-task reinforcement learning, the reward functions of different tasks can be different Teh et al. (2017), and each agent may learn to share or transfer useful information from one task to another to accelerate learning. However, in these previous works, there are no interactions between agents as in our setting. Notice that most MARL works follow the paradigm of centralized training with decentralized execution: during training they impose no constraint on communication, while our work has a network structure, which, as mentioned above, is more realistic for training a large system. In Table 1, we compare several MARL algorithms (although their settings are not exactly the same) from the perspectives of network structure, convergence guarantee with non-linear approximation, and off-policy training. There is a plethora of work on deep reinforcement learning, so we only mention the works closely related to ours. Nachum et al. (2017) proposed the Path Consistency Learning algorithm, which builds a connection between value- and policy-based reinforcement learning using a relationship between softmax temporal value consistency and policy optimality under entropy regularization. Dai et al. (2018) alleviated the double sampling problem in Nachum et al. (2017) by introducing a dual variable that tunes the bias-variance trade-off. Our work can be thought of as a multi-agent version of the above works. The challenge arises from how to diffuse the information of each local agent efficiently through the whole network while retaining the convergence property of the original algorithm.

6 Experimental result

In this section, we test our accelerated value propagation, i.e., Algorithm 2, through numerical simulations. The settings of the experiments are similar to those in Zhang et al. (2018).

6.1 Randomly Sampled MDP

Figure 2: Results on the randomly sampled MDP. Left: value functions of different agents in value propagation. The value functions of the three agents are similar, which means the agents reach consensus on their value functions. Middle: cumulative reward of value propagation and of centralized PCL with 10 agents. Right: cumulative reward of value propagation and of centralized PCL with 20 agents. In both the middle and right panels, the X-axis is the number of episodes and the Y-axis is the cumulative reward averaged over agents.

The aim of this toy example is to test whether value propagation can reach a consensus estimate of the value functions through communication, which then leads to collaboration. To this end, we compare value propagation with centralized PCL. Centralized PCL means that there is a central node collecting the rewards of all agents, so it can optimize the objective function (6) using the single-agent PCL algorithm Nachum et al. (2017); Dai et al. (2018). Ideally, value propagation should converge to the same long-term reward as the one achieved by centralized PCL. In the experiment, we consider a multi-agent RL problem with 10 and 20 agents, where each agent has two actions. A discrete MDP is randomly generated, and the transition probabilities are distributed uniformly with a small additive constant to ensure ergodicity of the MDP. For each agent and each state-action pair, the reward is sampled uniformly. The value function and dual variable are approximated by two-hidden-layer neural networks with ReLU as the activation function. The policy of each agent is approximated by a one-hidden-layer neural network with ReLU as the activation function, and the output is a softmax function to approximate the policy. The mixing matrix in Algorithm 2 is selected as the Metropolis weights in (12). The graph is generated by randomly placing communication links among agents with a prescribed connectivity ratio. The learning rate is set to 5e-4, and the remaining parameters of Adam are left at their default values.

Figure 3: Results on the Cooperative Navigation task. Left: value functions of three randomly picked agents (16 agents in total) in value propagation; they reach consensus. Middle: cumulative reward of value propagation (eta=0.01 and eta=0.1), MA-AC, and PCL without communication with N=8 agents. Right: cumulative reward of value propagation (eta=0.01 and eta=0.1), MA-AC, and PCL without communication with N=16 agents. Our algorithm outperforms MA-AC and PCL without communication. Compared with the middle panel, the number of agents is larger in the right panel, so the problem becomes harder (more collisions); agents achieve a lower cumulative reward (averaged over agents) and need more time to find a good policy.

In the left panel of Figure 2, we verify that the value functions in value propagation reach consensus. Particularly, we randomly choose three agents and plot their value functions over 20 randomly picked states. It is easy to see that the value functions of the three agents over these states are almost the same. This is accomplished by the consensus update in Algorithm 2. In the middle and right panels of Figure 2, we compare value propagation with centralized PCL. They converge to almost the same value. Centralized PCL converges faster than value propagation, since it does not need time to diffuse the reward information over the network.

6.2 Cooperative Navigation task

The aim of this section is to demonstrate that value propagation outperforms decentralized multi-agent Actor-Critic (MA-AC) Zhang et al. (2018) and multi-agent PCL without communication. Here PCL without communication means that each agent maintains its own estimate of the policy and value function, but there is no communication graph. Notice this is different from the centralized PCL in Section 6.1, where a central node collects all reward information and thus no further communication is needed. Remark that the original MA-AC is designed for the averaged-reward setting, so we adapt it to the discounted case to fit our setting. We test value propagation in the Cooperative Navigation environment Lowe et al. (2017), where agents need to reach a set of landmarks through physical movement. We modify this environment to fit our setting. Each agent can observe the positions of the landmarks and of the other agents. A reward is given when an agent reaches its own landmark, and a penalty is received if an agent collides with any other agent. Since the positions of the landmarks are different, the reward function of each agent is different. Here we assume the state is globally observable, which means the positions of the landmarks and the other agents are observable to each agent. In particular, we assume the environment is a rectangular region. There are 8 or 16 agents. Each agent has a single target landmark, which is randomly located in the region. Each agent has five actions, corresponding to moving up, down, left, or right by 0.1 units, or staying at its position. The agent moves in the direction given by its action with high probability (0.95) and moves in a random direction otherwise. The length of each epoch is set to 500. When an agent is close enough to its landmark, e.g., the distance is less than 0.1, we consider it to have reached the target and it receives a reward. When two agents are closer to each other than 0.1, we treat this as a collision and a penalty is given to each agent. The state includes the positions of the landmarks and the agents. The communication graph and mixing matrix are generated as in Section 6.1 with connectivity ratio 4/N. The value function is approximated by a two-hidden-layer neural network with ReLU as the activation function, whose inputs are the state information. The dual function is also approximated by a two-hidden-layer neural network; the only difference is that its inputs are state-action pairs (s,a). The policy is approximated by a one-hidden-layer neural network with ReLU as the activation function, and the output is a softmax function to approximate the policy. In all experiments, we use the multi-step version of value propagation. The learning rate of Adam is chosen as 5e-4, and the remaining Adam parameters are the default values. The setting of PCL without communication is exactly the same as value propagation except for the absence of the communication network.

In the left panel of Figure 3, we see the value functions reach consensus in value propagation. In the middle and right panels of Figure 3, we compare value propagation with PCL without communication and MA-AC. In PCL without communication, each agent maintains its own policy, value function, and dual variable, which are trained by the SBEED algorithm Dai et al. (2018). Since there is no communication between agents, intuitively agents may have more collisions during learning than in value propagation. Indeed, in the middle and right panels, we see that value propagation learns the policy much faster than PCL without communication. We also observe that value propagation outperforms MA-AC. One possible reason is that value propagation is an off-policy method, so we can apply experience replay, which exploits data more efficiently than the on-policy method MA-AC.

In the following, we consider a more realistic setting inspired by traffic. For each agent, we count the number of other agents within a radius of that agent as the penalty, to mimic the congestion level in real traffic. This setting is more challenging than the previous experiment, where the penalty is just 0 or -1. Particularly, there are 10 agents in total, and their initial positions are uniformly randomly picked in the region. The destination of each agent is set to be a random point at a fixed distance from its starting point. The collision radius is set to be 0.15. For instance, if there are 3 other agents within the radius of agent i, the penalty of agent i is -3. Other settings such as the neural networks, transitions, and termination condition are the same as in the previous experiment. Figure 4 presents the result.

Figure 4: Results on the experiment considering the congestion level. Our algorithm outperforms MA-AC and PCL without communication.

In the early stage of training, the penalties of the agents are large (around -4), since we consider the congestion level and agents are initialized relatively close to each other. After about 500 episodes of training, value propagation starts to learn the coordination and outperforms PCL without communication. At the end of training, the reward of PCL without communication is around 2, while value propagation reaches around 7.2 and 9. MA-AC learns the policy slowly due to its on-policy nature.

7 Conclusions

We have presented value propagation, a new principled and data-efficient algorithm for fully decentralized deep multi-agent reinforcement learning. First, it can be trained and executed via message passing between networked agents, which makes it scalable to large real-world applications. Second, we have provided a theoretical convergence guarantee with deep neural network approximation. Third, we have demonstrated its effectiveness empirically.

References

  • Adler and Blue (2002) Jeffrey L Adler and Victor J Blue. A cooperative multi-agent transportation management and route guidance system. Transportation Research Part C: Emerging Technologies, 10(5-6):433–454, 2002.
  • Antos et al. (2008) András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.
  • Arslan and Yüksel (2017) Gürdal Arslan and Serdar Yüksel. Decentralized q-learning for stochastic teams and games. IEEE Transactions on Automatic Control, 62(4):1545–1558, 2017.
  • Boyd et al. (2006) Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE transactions on information theory, 52(6):2508–2530, 2006.
  • Bu et al. (2008) Lucian Bu, Robert Babu, Bart De Schutter, et al. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(2):156–172, 2008.
  • Callaway and Hiskens (2011) Duncan S Callaway and Ian A Hiskens. Achieving controllability of electric loads. Proceedings of the IEEE, 99(1):184–199, 2011.
  • Cattivelli et al. (2008) Federico S Cattivelli, Cassio G Lopes, and Ali H Sayed. Diffusion recursive least-squares for distributed estimation over adaptive networks. IEEE Transactions on Signal Processing, 56(5):1865–1877, 2008.
  • Cortes et al. (2004) Jorge Cortes, Sonia Martinez, Timur Karatas, and Francesco Bullo. Coverage control for mobile sensing networks. IEEE Transactions on robotics and Automation, 20(2):243–255, 2004.
  • Dai et al. (2018) Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, and Le Song. Sbeed: Convergent reinforcement learning with nonlinear function approximation. In International Conference on Machine Learning, pages 1133–1142, 2018.
  • Dann et al. (2014) Christoph Dann, Gerhard Neumann, and Jan Peters. Policy evaluation with temporal differences: A survey and comparison. The Journal of Machine Learning Research, 15(1):809–883, 2014.
  • Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
  • El-Tantawy et al. (2013) Samah El-Tantawy, Baher Abdulhai, and Hossam Abdelgawad. Multiagent reinforcement learning for integrated network of adaptive traffic signal controllers (MARLIN-ATSC): methodology and large-scale application on downtown Toronto. IEEE Transactions on Intelligent Transportation Systems, 14(3):1140–1150, 2013.
  • Fax and Murray (2002) J Alexander Fax and Richard M Murray. Information flow and cooperative control of vehicle formations. In IFAC World Congress, volume 22, 2002.
  • Foerster et al. (2016) Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pages 2137–2145, 2016.
  • Foerster et al. (2018) Jakob N Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Ghadimi and Lan (2016) Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1-2):59–99, 2016.
  • Hong (2016) Mingyi Hong. Decomposing linearly constrained nonconvex problems by a proximal primal dual approach: Algorithms, convergence, and applications. arXiv preprint arXiv:1604.00543, 2016.
  • Hong et al. (2017) Mingyi Hong, Davood Hajinezhad, and Ming-Min Zhao. Prox-pda: The proximal primal-dual algorithm for fast distributed nonconvex optimization and learning over networks. In International Conference on Machine Learning, pages 1529–1538, 2017.
  • Hu and Wellman (2003) Junling Hu and Michael P Wellman. Nash q-learning for general-sum stochastic games. Journal of machine learning research, 4(Nov):1039–1069, 2003.
  • Jiang et al. (2018) Jiechuan Jiang, Chen Dun, and Zongqing Lu. Graph convolutional reinforcement learning for multi-agent cooperation. arXiv preprint arXiv:1810.09202, 2018.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Lauer and Riedmiller (2000) Martin Lauer and Martin Riedmiller. An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In Proceedings of the Seventeenth International Conference on Machine Learning. Citeseer, 2000.
  • Littman (1994) Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994, pages 157–163. Elsevier, 1994.
  • Lowe et al. (2017) Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pages 6379–6390, 2017.
  • Nachum et al. (2017) Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems, pages 2775–2785, 2017.
  • Rabbat and Nowak (2004) Michael Rabbat and Robert Nowak. Distributed optimization in sensor networks. In Proceedings of the 3rd international symposium on Information processing in sensor networks, pages 20–27. ACM, 2004.
  • Raileanu et al. (2018) Roberta Raileanu, Emily Denton, Arthur Szlam, and Rob Fergus. Modeling others using oneself in multi-agent reinforcement learning. arXiv preprint arXiv:1802.09640, 2018.
  • Rashid et al. (2018) Tabish Rashid, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. Qmix: Monotonic value function factorisation for deep multi-agent reinforcement learning. arXiv preprint arXiv:1803.11485, 2018.
  • Shapiro et al. (2009) Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczyński. Lectures on stochastic programming: modeling and theory. SIAM, 2009.
  • Shi et al. (2015) Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. Extra: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
  • Tan (1993) Ming Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the tenth international conference on machine learning, pages 330–337, 1993.
  • Teh et al. (2017) Yee Teh, Victor Bapst, Wojciech M Czarnecki, John Quan, James Kirkpatrick, Raia Hadsell, Nicolas Heess, and Razvan Pascanu. Distral: Robust multitask reinforcement learning. In Advances in Neural Information Processing Systems, pages 4496–4506, 2017.
  • Tieleman and Hinton T Tieleman and G Hinton. Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, Technical Report.
  • Xiao et al. (2005) Lin Xiao, Stephen Boyd, and Sanjay Lall. A scheme for robust distributed sensor fusion based on average consensus. In Information Processing in Sensor Networks, 2005. IPSN 2005. Fourth International Symposium on, pages 63–70. IEEE, 2005.
  • Zhang et al. (2018) Kaiqing Zhang, Zhuoran Yang, Han Liu, Tong Zhang, and Tamer Başar. Fully decentralized multi-agent reinforcement learning with networked agents. International Conference on Machine Learning, 2018.

Appendix A Acceleration of value propagation

Algorithm 1 trains the agents with the vanilla gradient descent method plus an extra consensus update. In practice, adaptive momentum gradient methods, including Adagrad Duchi et al. [2011], RMSProp Tieleman and Hinton, and Adam Kingma and Ba [2014], have much better performance in training deep neural networks. In Algorithm 2, we adapt Adam to our setting, which performs better than Algorithm 1 in practice. Unfortunately, we cannot give a convergence analysis for Algorithm 2. To the best of our knowledge, the convergence of Adam to a stationary point in the decentralized setting is still an open problem.

   Input: Environment ENV, learning rates, discount factor, a mixing matrix W, number of steps to train the dual parameter, replay buffer capacity.
   Initialization of the policy, value, and dual parameters and of the corresponding moment vectors.
   for each iteration do
      Sample a trajectory and add it to the replay buffer.
      // 1. Update the dual parameters: repeat the following dual update the prescribed number of times:
      Randomly sample a mini-batch of transitions from the replay buffer.
      for agent i = 1 to N do
         Calculate the stochastic gradient of the dual objective w.r.t. agent i's dual parameter.
         // Update the momentum (moment) estimates of the dual parameter.
      end for
      // Do consensus update on the dual parameters for each agent using W.
      // End the update of the dual problem.
      // 2. Update the primal parameters.
      Randomly sample a mini-batch of transitions from the replay buffer.
      for agent i = 1 to N do
         Calculate the stochastic gradients of the objective w.r.t. agent i's policy and value parameters.
         // Update the momentum (moment) estimates.
         // Use Adam to update the policy parameters of each agent.
         // Do consensus update on the value parameters for each agent using W.
      end for
   end for
Algorithm 2 Accelerated value propagation

Mixing Matrix: In Algorithm 2, there is a mixing matrix $W$ in the consensus update. As its name suggests, it mixes the information of each agent and its neighbors. This nonnegative matrix needs to satisfy the following conditions.

  • $W$ needs to be doubly stochastic, i.e., each row and each column of $W$ sums to one.

  • $W$ respects the communication graph $\mathcal{G}$, i.e., $W_{ij} = 0$ if $(i,j) \notin \mathcal{E}$ and $i \neq j$.

  • The spectral norm of $W - \frac{1}{N}\mathbf{1}\mathbf{1}^\top$ is strictly smaller than one.

Here is one particular choice of the mixing matrix used in our work which satisfies the above requirements, called the Metropolis weights Xiao et al. [2005]:

$W_{ij} = \frac{1}{1 + \max(d_i, d_j)}$ for $(i,j) \in \mathcal{E}$, $\quad W_{ii} = 1 - \sum_{j \in \mathcal{N}(i)} W_{ij}$, $\quad W_{ij} = 0$ otherwise,  (12)

where $\mathcal{N}(i)$ is the set of neighbors of agent $i$ and $d_i = |\mathcal{N}(i)|$ is its degree.
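A small sketch of the Metropolis weights in (12) and of one consensus (mixing) step; the edge-list input format is an illustrative assumption. The resulting matrix is symmetric and doubly stochastic, and its off-diagonal entries are non-zero only on edges of the communication graph.

```python
# Metropolis weights for a small graph plus one mixing step; names are illustrative.
import numpy as np

def metropolis_weights(n_nodes, edges):
    W = np.zeros((n_nodes, n_nodes))
    deg = np.zeros(n_nodes)
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    for i, j in edges:
        W[i, j] = W[j, i] = 1.0 / (1.0 + max(deg[i], deg[j]))
    np.fill_diagonal(W, 1.0 - W.sum(axis=1))     # W_ii = 1 - sum_{j in N(i)} W_ij
    return W

W = metropolis_weights(4, edges=[(0, 1), (1, 2), (2, 3), (3, 0)])
assert np.allclose(W.sum(axis=0), 1.0) and np.allclose(W.sum(axis=1), 1.0)
x = np.array([1.0, 2.0, 3.0, 4.0])               # local estimates held by the agents
x_next = W @ x                                   # one consensus (mixing) step
```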