## 1 Introduction

Multi-Agent Reinforcement Learning (MARL) considers a system of agents (robots, machines, cars, etc.) interacting within a common environment. Each agent makes a decision in each time step and works along with the other agent(s) to achieve a given goal. Basically, the agents are learning units that want to learn a policy on the fly in order to maximize the long-term reward through interaction with the environment. Due to the complexities of the environments and the combinatorial nature of the problem, training the agents is challenging, and most MARL problems are categorized as NP-hard [bernstein2002complexity].

Motivated by the recent successes of deep reinforcement learning (RL), including super-human-level control on Atari games [mnih2015human], mastering the game of Go [silver2016mastering] and chess [silver2017mastering], robotics [kober2013reinforcement], health care planning [liu2017deep], power grids [glavic2017reinforcement], routing [nazari2018reinforcement], and inventory optimization [oroojlooyjadid2017deep], several lines of research have focused on deep MARL. One naive approach to training a MARL problem is to train all agents with a centralized controller, which in fact turns it into a single-agent RL problem whose action is the joint action of all agents in each time step. However, in this approach the number of actions increases exponentially with the number of agents, which makes the problem intractable. Besides, each agent needs to send its local information to the central controller, and as the number of agents increases this approach becomes very expensive or impossible. In addition to the communication cost, this approach depends on the availability of the central unit, and any incident to it results in the loss of the whole network. Moreover, in MARL problems each agent usually accesses only some local information, and due to privacy issues the agents may not be allowed to share their information with the other agents.

There are several properties of the system that are important in modeling a MARL system, such as (i) centralized or decentralized control, (ii) fully or partially observable environment, and (iii) cooperative or competitive environment. With a centralized controller, a central unit takes the decision for each agent in each time step. On the other hand, in a decentralized system, each agent takes a decision for itself. Also, the agents might cooperate to achieve a common goal, e.g., a group of robots that want to identify a source, or they might compete with each other to maximize their own rewards, e.g., the players on different teams of a game. In each of these cases, an agent might be able to access the whole information and the sensory observations (if any) of the other agents, or each agent might be able to observe only its local information. In this paper, we focus on decentralized models with a cooperative goal, and most of the relevant papers with either full or partial observability are reviewed. Note that [matignon2012independent, bucsoniu2010multi, bu2008comprehensive] provide reviews of cooperative games and general MARL papers published until 2012. Also, [da2019survey] provides a survey of the utilization of transfer learning in MARL. We mostly focus on recent works on cooperative deep MARL. Furthermore, MARL problems usually involve large state/action spaces. Therefore, the classical tabular RL algorithms are not efficient for solving them, and we mostly focus on approximated RL algorithms.

The rest of the paper is organized as follows: in Section 1.1 we briefly explain the single-agent RL problem and some of its components. Then the multi-agent formulation is presented, and some of the main challenges of MARL environments are described in Section 1.2. Section 2 explains IQL-type algorithms, Section 3 reviews the papers with a fully observable critic model, Section 4 includes the value decomposition papers, Section 5 explains the consensus approach, Section 6 reviews the learn-to-communicate approach, Section 7 explains some of the emerging research directions, Section 8 provides some applications of MARL models in the real world, Section 9 very briefly mentions some of the available multi-agent environments, and finally Section 10 concludes the paper.

### 1.1 Single-Agent RL Formulation

RL considers a sequential decision making problem in which an agent interacts with an environment. The agent at time period $t$ observes state $s_t \in \mathcal{S}$, in which $\mathcal{S}$ is the state space, takes action $a_t \in \mathcal{A}(s_t)$, where $\mathcal{A}(s_t)$ is the valid action space for state $s_t$, and executes it in the environment to receive reward $r_t$ and transfer to the new state $s_{t+1}$. The goal of the agent is to determine a policy $\pi$ that maximizes the long-term reward. The value function starting from state $s$ and following policy $\pi$, denoted by $V^\pi(s)$, is given by

$$V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, s_0 = s\right], \qquad (1)$$

where $\gamma \in [0, 1)$ is the discounting factor. This problem is known as the Markov Decision Process (MDP) and provides a framework to study problems in a given environment. Within a given state transition probability distribution $P(s' \mid s, a)$ and reward matrix $R$, Bellman [bellman1957markovian] showed that the following equation holds for all states $s$ in any time step $t$:

$$V^\pi(s) = \sum_{a} \pi(a \mid s) \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^\pi(s') \right], \qquad (2)$$

where here and in the rest of the paper we use $s'$ to denote the successor state of $s$.

Instead of averaging, by maximizing over the actions, the optimal state-value and the optimal policy can be obtained by:

$$V^*(s) = \max_{a} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \right], \qquad \pi^*(s) = \operatorname*{argmax}_{a} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \right]. \qquad (3)$$

Similarly, the optimal Q-value for each state-action pair can be obtained by:

$$Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q^*(s', a'). \qquad (4)$$

One can obtain an optimal policy utilizing the above equations. The relevant methods are called value-based methods. However, in the real world $P$ usually is not available, and one cannot obtain the optimal policy using (3) or (4). In order to address this issue, approximating the state-value, the Q-value, or the policy has been a common practice. Therefore, a function is used to approximate the state-value (called the value-function) or the policy (called the policy-function) for any possible state. Similarly, the Q-value is approximated by a Q-function. The function approximator with parameters $\theta$ can be a simple linear regression model or a deep neural network. The goal is to maximize the utility function

$$J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \right]. \qquad (5)$$

There are two main approaches to solve this problem. In the first approach, a value function or Q-function is learned to approximate the expected sum of the discounted long-term reward. Given these approximations, the optimal policy can be obtained. In the second approach, the policy is directly learned, which determines the probability of choosing an action in a given state. We give a brief overview of the first and second approaches in Sections 1.1.1 and 1.1.2, respectively. In either of these approaches, the distributions of $P$ and $R$ are unknown, so the expectation is approximated by averaging over samples from some trajectories. Note that there is a large body of literature on single-agent RL algorithms, and we only review those algorithms which are actively used in the multi-agent literature. So, we do not explain advanced algorithms like TRPO [schulman2015trust] and PPO [schulman2017proximal], which are not so common in the MARL literature. For more details about other single-agent RL algorithms see [li2017deep, sutton2018reinforcement].
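As a concrete illustration of the value-based approach, the following minimal sketch runs tabular Q-learning, the sampled counterpart of equation (4), on a hypothetical two-state, two-action MDP; the dynamics `P` and rewards `R` are illustrative assumptions, not from the text:

```python
import numpy as np

# Minimal tabular Q-learning sketch on a toy 2-state, 2-action MDP.
# The dynamics P and rewards R below are hypothetical illustrative values.
rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha = 2, 2, 0.9, 0.1
P = np.array([[0, 1], [0, 1]])            # P[s, a] = deterministic next state
R = np.array([[0.0, 1.0], [0.0, 2.0]])    # R[s, a] = reward

Q = np.zeros((n_states, n_actions))
s = 0
for step in range(5000):
    # epsilon-greedy exploration
    a = rng.integers(n_actions) if rng.random() < 0.1 else int(np.argmax(Q[s]))
    s_next, r = P[s, a], R[s, a]
    # sampled version of equation (4): move Q(s,a) toward r + gamma * max_a' Q(s',a')
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    s = s_next

greedy_policy = np.argmax(Q, axis=1)
```

With the chosen rewards, action 1 is optimal in both states, and the greedy policy extracted from the learned Q-table recovers it.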

#### 1.1.1 Value Approximation

Within the value approximation, a function to estimate $V(s)$ or $Q(s, a)$ is learned. It has been shown that, with a large enough number of observations, a linear approximation converges to a local optimum [bertsekas96, sutton2000policy]. Nevertheless, linear approximators are not powerful enough to capture all the complexities of a complex environment. To address this issue, there has been a tremendous amount of research on non-linear approximators, and especially neural networks, in recent years [mnih2013playing, bhatnagar2009convergent, gao2019reduced]. Among the recent works on value-based methods, the deep Q-network (DQN) algorithm [mnih2015human] received a lot of attention due to its super-human-level performance on the Atari-2600 games [bellemare2013arcade], in which it only observes the video of the game. DQN utilizes a replay memory buffer [lin1992self] and a moving target network to stabilize the training. The replay memory holds the previous observation tuples $(s_t, a_t, r_t, s_{t+1}, d_t)$, in which $d_t$ determines if the episode ended with this observation. The approximator is then trained on random mini-batches taken from the replay memory. The DQN algorithm applies a convolutional neural network (CNN) to the state $s$ as the input and then passes the flattened output of the CNN to a fully connected neural network with $|\mathcal{A}|$ outputs, approximating the Q-value of each possible action. This chain of neural networks with weights $\theta$ is then trained by taking mini-batches of size $b$ to minimize the loss function:

$$\mathcal{L}(\theta) = \frac{1}{b} \sum_{j=1}^{b} \left( y_j - Q(s_j, a_j; \theta) \right)^2, \qquad (6)$$

$$y_j = r_j + \gamma\, (1 - d_j) \max_{a'} Q(s'_j, a'; \theta^-), \qquad (7)$$

where $\theta^-$ denotes the weights of the *target network*, which is updated with $\theta$ every $C$ iterations.
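The mini-batch target and loss of equations (6) and (7) can be sketched as follows; plain numpy arrays stand in for the outputs of the online and target networks, and all shapes and numbers are illustrative assumptions:

```python
import numpy as np

# Sketch of the DQN loss in equations (6)-(7) on random data. The arrays
# q_online and q_target stand in for the network outputs Q(s, .; theta)
# and Q(s', .; theta^-) on a mini-batch; no real network is trained here.
rng = np.random.default_rng(1)
batch, n_actions, gamma = 32, 4, 0.99

q_online = rng.normal(size=(batch, n_actions))   # Q(s_j, .; theta)
q_target = rng.normal(size=(batch, n_actions))   # Q(s'_j, .; theta^-)
actions = rng.integers(n_actions, size=batch)    # a_j from the replay memory
rewards = rng.normal(size=batch)                 # r_j
done = rng.random(batch) < 0.1                   # d_j: episode-end flags

# Equation (7): y_j = r_j + gamma * (1 - d_j) * max_a' Q(s'_j, a'; theta^-)
y = rewards + gamma * (1.0 - done) * q_target.max(axis=1)
# Equation (6): mean squared error between y_j and Q(s_j, a_j; theta)
q_sa = q_online[np.arange(batch), actions]
loss = np.mean((y - q_sa) ** 2)
```

Note that for terminal transitions ($d_j = 1$) the target reduces to the reward alone, since no future return follows the end of an episode.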
Later, new DQN-based approaches were proposed for solving RL problems. For example, inspired by [hasselt2010double], which proposed double Q-learning, the Double-DQN algorithm was proposed in [van2016deep]. Similarly, another heuristic formula to obtain the Q-values is proposed as Dueling DQN [wang2015dueling]. In addition, another extension of the DQN algorithm is obtained by combining recurrent neural networks with DQN: DRQN [hausknecht2015deep] is a DQN algorithm which uses an LSTM instead of a feed-forward network and is applied to partially observable environments. In all these algorithms, the $\epsilon$-greedy algorithm is usually used to ensure exploration. That is, in each time-step, with probability $\epsilon$ the action is chosen randomly, and otherwise it is selected greedily by taking an argmax over the Q-values of the current state. Typically, the value of $\epsilon$ is annealed during the training. To see other extensions of value-based algorithms and other exploration algorithms, see [li2017deep].

#### 1.1.2 Policy Approximation

In the value-based methods, the key idea is learning the optimal value-function or Q-function, from which a greedy policy can be obtained. In another direction, one can parametrize the policy, define a merit function, and optimize this function over the policy parameters through a training process. This class of RL algorithms is called policy-based methods. For example, in the policy gradient method, a stochastic policy with parameters $\theta$ is learned. First, define $h(s, a; \theta) = \theta^\top \phi(s, a)$, where $\phi(s, a)$ is called the feature vector of the state-action pair $(s, a)$. Then, the stochastic policy can be obtained by

$$\pi_\theta(a \mid s) = \frac{e^{h(s, a; \theta)}}{\sum_{a'} e^{h(s, a'; \theta)}},$$

which is the softmax function. The goal is to directly learn the parameters $\theta$ using a gradient-based algorithm. Let $J(\theta)$ measure the value function of the policy $\pi_\theta$ starting from state $s_0$. Then the policy gradient theorem provides an analytical expression for $\nabla_\theta J(\theta)$ as the following:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) \right]. \qquad (8)$$

Then the policy-based methods update the parameters as below:

$$\theta \leftarrow \theta + \alpha\, \widehat{\nabla_\theta J(\theta)}, \qquad (9)$$

where $\widehat{\nabla_\theta J(\theta)}$ is an approximation of the true gradient, and $\alpha$ is the learning rate.

There are two main benefits of policy-gradient methods over value-based methods [sutton2018reinforcement]: (i) by following the policy gradient algorithm, there is a chance of converging to a deterministic policy, while with a value-based algorithm there is always a probability $\epsilon$ of taking a random action; (ii) with value-based algorithms there is no way of learning a stochastic policy, which is important when the optimal policy is stochastic, as may be the case in competitive games. Considering those benefits, numerous policy gradient methods have been proposed in the literature, and depending on the gradient approximation strategy, they have different applications and effectiveness. We briefly review some of the most important ones, which we recall frequently in the next sections. For a review of other policy-gradient-based algorithms see [sutton2000policy, li2017deep].

The REINFORCE algorithm is one of the first successful policy-gradient algorithms, proposed in 2000 [sutton2000policy]. Particularly, REINFORCE applies the Monte Carlo method and uses the actual return $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ (where $T$ is the end of the episode) as an approximation of $Q^{\pi_\theta}(s_t, a_t)$ in equation (8).
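A minimal sketch of REINFORCE on a hypothetical one-state, two-action problem follows; each episode is a single step, so the Monte Carlo return is just the immediate reward, and the reward structure is an illustrative assumption:

```python
import numpy as np

# REINFORCE sketch with a softmax policy over two actions in a single state,
# so h(s, a; theta) reduces to theta[a]. The rewards are illustrative:
# action 1 pays 1, action 0 pays 0, so the policy should learn action 1.
rng = np.random.default_rng(2)
n_actions, alpha = 2, 0.5
theta = np.zeros(n_actions)

def softmax(h):
    z = np.exp(h - h.max())
    return z / z.sum()

for episode in range(2000):
    pi = softmax(theta)
    a = rng.choice(n_actions, p=pi)
    G = 1.0 if a == 1 else 0.0        # Monte Carlo return of the 1-step episode
    # grad of log pi(a) for a softmax policy: one_hot(a) - pi
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta += alpha * G * grad_log_pi  # update (9), with G in place of Q in (8)

learned_pi = softmax(theta)
```

The learned policy concentrates nearly all of its probability on the rewarding action, illustrating how the gradient only reinforces actions in proportion to their sampled returns.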

The Actor-Critic (AC) model extends the REINFORCE algorithm by adding another approximator, called the critic, to learn the Q-values. The critic receives the state and returns an approximation $Q_w(s, a)$ of $Q^{\pi_\theta}(s, a)$. The critic is trained by calculating the TD-error $\delta_t = r_t + \gamma Q_w(s_{t+1}, a_{t+1}) - Q_w(s_t, a_t)$ and updating its weights by $w \leftarrow w + \alpha_c\, \delta_t \nabla_w Q_w(s_t, a_t)$, in which $\alpha_c$ is the critic's learning rate. A widely useful extension of the REINFORCE and actor-critic algorithms is to subtract a baseline from the return (or the approximated return) to reduce the approximation variance. To this end, the advantage function $A(s, a) = Q(s, a) - V(s)$ can be used instead of the Q-value, in which the value function serves as the baseline. Thus, the critic learns the value function, and the advantage is obtained by $A(s_t, a_t) \approx r_t + \gamma V(s_{t+1}) - V(s_t)$. Following this technique, several AC-based methods have been proposed. A3C, proposed in [mnih2016asynchronous], runs several instances of the actor-critic model with the advantage function, asynchronously gathers the gradients, and updates the weights of a master node. Afterward, the master node broadcasts the new weights to the worker nodes, and in this way all nodes are updated asynchronously. A2C [a2c] uses the same framework, but updates the weights in a synchronous manner.

### 1.2 Multi-Agent RL Notations and Formulation

We denote a multi-agent setting with the tuple $\langle N, \mathcal{S}, \mathcal{A}, P, R, \Omega \rangle$, in which $N$ is the number of agents, $\mathcal{S}$ is the state space, $\mathcal{A} = \mathcal{A}_1 \times \dots \times \mathcal{A}_N$ is the set of actions of all agents, $P$ is the transition probability among the states, $R$ is the reward function, and $\Omega$ is the set of observations of all agents. Within any type of environment, we use $a = (a_1, \dots, a_N)$ to denote the vector of actions of all agents, $a_{-i}$ is the vector of actions of all agents except agent $i$, $\tau_i$ represents the observation-action history of agent $i$, and $\tau$ is the observation-action history of all agents. Also, the corresponding observation-action space, state space, and action space are denoted accordingly. Then, in a cooperative problem with $N$ agents and full observability of the environment, each agent $i$ at time-step $t$ observes the global state $s_t$ and uses the local stochastic policy $\pi_i$ to take action $a^i_t$, and then receives reward $r^i_t$. If the environment is fully cooperative, at each time step all agents observe a joint reward value $r_t$, i.e., $r^1_t = \dots = r^N_t = r_t$. If the agents are not able to fully observe the state of the system, each agent only accesses its own local observation $o^i_t$.

Similar to the single-agent case, each agent can learn the optimal Q-value or the optimal stochastic policy. However, since the policy of each agent changes as the training progresses, the environment becomes non-stationary from the perspective of any individual agent. Basically, $P(s' \mid s, a, \pi_1, \dots, \pi_N) \neq P(s' \mid s, a, \pi'_1, \dots, \pi'_N)$ when any $\pi_i \neq \pi'_i$, so that we lose the underlying assumption of the MDP. This means that the experience of each agent involves different co-player policies, which cannot be fixed while training a given agent, so any attempt to train such models results in fluctuations of the training. This makes the model training quite challenging. Therefore, the adapted Bellman equation for MARL [foerster2017stabilising] (assuming full observability) also does not hold for the multi-agent system:

$$Q_i^*(s, a_i \mid \pi_{-i}) = \sum_{a_{-i}} \pi_{-i}(a_{-i} \mid s) \left[ r(s, a_i, a_{-i}) + \gamma \sum_{s'} P(s' \mid s, a_i, a_{-i}) \max_{a'_i} Q_i^*(s', a'_i \mid \pi_{-i}) \right], \qquad (10)$$

where $\pi_{-i}$ denotes the joint policy of all agents except agent $i$. Due to the fact that $\pi_{-i}$ changes over time as the policies of the other agents change, in MARL one cannot obtain the optimal Q-value using the classic Bellman equation.

On the other hand, since the policy of each agent changes during training, one cannot use experience replay without dealing with the non-stationarity. Without experience replay, neither the DQN algorithm [mnih2015human] and its extensions nor, basically, any of the approximated value-based algorithms can be directly used. The same issue exists within AC-based algorithms which use a DQN-like algorithm for the critic. Besides, in most MARL problems the agents are not able to observe the full state of the system; such problems are categorized as decentralized POMDPs (Dec-POMDPs). Due to the partial observability and the non-stationarity of the local observations, Dec-POMDPs are even harder problems to solve, and it can be shown that they are in the class of NEXP-complete problems [bernstein2002complexity]. An equation similar to (10) can be obtained for the partially observable environment too.

In multi-agent RL, the noise and variance of the rewards increase, which results in instability of the training. The reason is that the reward of one agent depends on the actions of the other agents, and the reward conditioned on the action of a single agent exhibits much more noise and variability than a single agent's reward. Therefore, training a policy gradient algorithm would also not be effective in general.

Finally, we define the Nash equilibrium notation, which is used in a couple of the reviewed papers. A joint policy $\pi^* = (\pi^*_1, \dots, \pi^*_N)$ defines a Nash equilibrium if and only if:

$$V_i^{(\pi^*_i, \pi^*_{-i})}(s) \geq V_i^{(\pi_i, \pi^*_{-i})}(s), \qquad \forall \pi_i \in \Pi_i,\; \forall s \in \mathcal{S},\; i = 1, \dots, N, \qquad (11)$$

in which $V_i$ is the expected long-term return of agent $i$ in state $s$ and $\Pi_i$ is the set of all possible policies for agent $i$. Basically, it means that each agent prefers not to change its policy if it wants to attain the maximum long-term reward. Further, if the following holds for a joint policy $\hat{\pi}$:

$$V_i^{\hat{\pi}}(s) \geq V_i^{\pi}(s), \qquad \forall \pi,\; \forall s \in \mathcal{S},\; i = 1, \dots, N,$$

policy $\hat{\pi}$ is called Pareto-optimal.

## 2 IQL and Extensions

The most naive approach to solving a multi-agent RL problem is to treat each agent independently, such that each agent considers the rest of the agents as part of the environment. This idea is formalized in the independent Q-learning (IQL) algorithm [tan1993multi], in which each agent accesses its local observation and the agents collectively try to maximize a joint reward. Each agent runs a separate Q-learning algorithm [watkins1992q] (or one of the newer extensions like DQN [mnih2015human], DRQN [hausknecht2015deep], etc.). IQL is an appealing algorithm since (i) it does not have a scalability problem as the number of agents increases, and (ii) each agent only needs its local history of observations during training and at inference time. Tabular IQL usually works well in practice for small-size problems [matignon2012independent]; however, with function approximation, especially deep neural networks (DNNs), it fails. One of the main reasons for this failure is the need for a replay memory to stabilize the training with DNNs [foerster2017stabilising]. In an extension of IQL, Distributed Q-learning [lauer2000algorithm] considers a decentralized fully cooperative multi-agent problem such that all agents observe the full state of the system and do not know the actions of the other agents, although at training time it assumes the joint action is available to all agents. The joint action is executed in the environment, which returns the joint reward that each agent receives. This algorithm updates the Q-values only when there is a guaranteed improvement, assuming that low returns are the result of bad exploration by the teammates. In other words, it maximizes over the possible actions for agent $i$, assuming the other agents selected their optimal actions; i.e., for a given joint action $a$, it updates the Q-values of agent $i$ by:

$$Q^i_{t+1}(s, a_i) = \max\left\{ Q^i_t(s, a_i),\; r_t + \gamma \max_{a'_i} Q^i_t(s', a'_i) \right\}, \qquad (12)$$

in which $s' = f(s, a)$ and $f$ is the environment transition function which returns the next state. Therefore, Distributed Q-learning completely ignores the low rewards, and thus it results in overestimated Q-values. This issue and the curse of dimensionality result in poor performance on high-dimensional problems.
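The optimistic update of equation (12) can be sketched as follows, with illustrative numbers; note how the second, low-return experience leaves the Q-value untouched:

```python
import numpy as np

# Sketch of the Distributed Q-learning update in equation (12): the local
# Q-value only moves up, implicitly assuming the teammates acted optimally.
# The table size, rewards, and transitions are illustrative.
gamma = 0.9
Q_i = np.zeros((3, 2))               # local Q-table of agent i: 3 states, 2 actions

def distributed_q_update(Q, s, a, r, s_next):
    """Optimistic update: keep the larger of the old value and the new target."""
    target = r + gamma * Q[s_next].max()
    Q[s, a] = max(Q[s, a], target)   # low returns are ignored entirely

distributed_q_update(Q_i, s=0, a=1, r=1.0, s_next=1)   # good outcome: stored
distributed_q_update(Q_i, s=0, a=1, r=-5.0, s_next=1)  # bad outcome: ignored
```

The first call raises $Q^i(0, 1)$ to 1.0; the second call, despite its reward of -5, leaves it at 1.0, which is exactly the overestimation behavior described above.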

Hysteretic Q-learning [matignon2007hysteretic] considers the same problem and tries to obtain a good policy under the assumption that a low return might be the result of stochasticity in the environment, so it does not ignore low returns as Distributed Q-learning does. In particular, when the TD-error is positive, it updates the Q-values with learning rate $\alpha$, and otherwise it updates the Q-values with a smaller learning rate $\beta < \alpha$. Thus, the model is also robust against negative learning due to teammate exploration. [bowling2002multiagent] also proposes using a variable learning rate to improve the performance of tabular IQL. In another extension of IQL, [fuji2018deep] proposes to train one of the agents at a time while fixing the policies of the other agents in a periodic manner in order to stabilize the environment. Thus, during training, the other agents do not change their policies, and the environment is stationary from the viewpoint of the single learning agent.
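The hysteretic update can be sketched as follows; the two learning rates and the rewards are illustrative values:

```python
import numpy as np

# Sketch of the Hysteretic Q-learning update: a larger learning rate `alpha`
# for positive TD-errors and a smaller `beta` for negative ones, so low
# returns caused by exploring teammates are discounted but not ignored.
gamma, alpha, beta = 0.9, 0.5, 0.05   # beta < alpha (illustrative values)
Q_i = np.zeros((3, 2))                # local Q-table of agent i

def hysteretic_update(Q, s, a, r, s_next):
    delta = r + gamma * Q[s_next].max() - Q[s, a]   # TD-error
    lr = alpha if delta > 0 else beta               # hysteresis
    Q[s, a] += lr * delta

hysteretic_update(Q_i, s=0, a=0, r=1.0, s_next=1)    # positive delta: big step
hysteretic_update(Q_i, s=0, a=0, r=-1.0, s_next=1)   # negative delta: small step
```

After the two updates, $Q^i(0, 0)$ rises to 0.5 and then drops only slightly to 0.425: the negative experience is registered, unlike in Distributed Q-learning, but with a much smaller weight.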

The DQN algorithm [mnih2015human] utilized *replay memory* and a *target network* and was able to attain super-human-level control on most of the Atari games. Classical IQL uses the tabular version, so one naive idea is to use the DQN algorithm in place of each single Q-learner. [tampuu2017multiagent] implemented this idea and was one of the first papers to take advantage of neural networks as general, powerful approximators in an IQL-like setting. Specifically, this paper analyzes the performance of DQN in a decentralized two-agent game for both competitive and cooperative settings. They assume that each agent observes the full state (the video of the game) and takes an action by its own policy, and the reward values are known to both agents. The paper is mainly built on the *Pong* game (from the Atari-2600 environment [bellemare2013arcade]), in which competitive and cooperative behaviors are obtained by changing the reward function. In the competitive version, each agent that drops the ball loses a reward point and the opponent wins the reward point, so that it is a zero-sum game. In the cooperative setting, once either agent drops the ball, both agents lose a reward point. The numerical results show that in both cases the agents are able to learn to play the game very efficiently; that is, in the cooperative setting they learn to keep the ball for long periods, and in the competitive setting the agents learn to quickly beat the competitor.

Replay memory is one of the core elements of the DQN algorithm. It helps to stabilize the training of the neural network and improves the sample efficiency of the history of observations. However, due to the non-stationarity of the environment, using experience replay in a multi-agent environment is problematic: the policy that generated the data in the replay memory is different from the current policy, so the learned policy of each agent can be misleading. In order to address this issue, [foerster2016learning] disables the replay memory part of the algorithm, and in [leibo2017multi] the old transitions are discarded so that the replay memory uses only recent experiences. Even though these approaches help to reduce the non-stationarity of the environment, both limit the sample efficiency. To resolve this problem, [foerster2017stabilising] proposes two algorithms to stabilize the replay memory in IQL-type algorithms. They consider a fully cooperative MARL setting with local observation-actions. In the first approach, each transition is augmented with the probability of choosing the joint action. Then, during the loss calculation, an importance sampling correction is calculated using the current policy. Thus, the loss function is changed to:

$$\mathcal{L}(\theta) = \sum_{k=1}^{b} \frac{\pi^{-a}_{t_i}(a_{-a} \mid s)}{\pi^{-a}_{t_c}(a_{-a} \mid s)} \left( y_k - Q(s, a; \theta) \right)^2, \qquad (13)$$

in which $\pi^{-a}$ denotes the joint policy of the agents other than agent $a$, $t_i$ is the current time-step, and $t_c$ is the time-step at which the sample was collected. In this way, the effect on the gradients of transitions generated from dissimilar policies is regularized. In the second algorithm, named FingerPrint, they propose augmenting the replay memory with some parts of the policies of the other agents. However, the number of parameters of a DNN is usually large, making this intractable in practice. Thus, they propose to augment each instance in the replay memory with the training iteration number and the $\epsilon$ of the $\epsilon$-greedy algorithm. In the numerical experiments, they share the weights among the agents, while the id of each agent is also available as input. They provide the results of the two proposed algorithms, plus their combination, on the StarCraft game [samvelyan19smac] and compare the results with a classic experience replay and a no-experience-replay algorithm. They conclude that the second algorithm obtains better results than the other algorithms.
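The importance-sampling correction of equation (13) can be sketched as follows; the policy probabilities and squared TD-errors are illustrative placeholders:

```python
import numpy as np

# Sketch of the importance-sampling correction of equation (13): each sampled
# transition is re-weighted by the ratio of the other agents' joint action
# probability under the current policy to its probability at collection time.
# All numbers are illustrative placeholders.
rng = np.random.default_rng(3)
batch = 4
td_error_sq = rng.random(batch)             # (y_k - Q(s, a; theta))^2 per sample
p_now = np.array([0.30, 0.10, 0.25, 0.40])  # current policy: pi_{t_i}^{-a}(a_{-a} | s)
p_then = np.array([0.10, 0.40, 0.25, 0.20]) # at collection: pi_{t_c}^{-a}(a_{-a} | s)

weights = p_now / p_then                    # importance weights
loss = np.sum(weights * td_error_sq)        # stale off-policy samples are down-weighted
```

Samples whose joint action has become unlikely under the current policies (second entry) contribute little to the loss, while samples still representative of current behavior (third entry, ratio 1) are left unchanged.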

[omidshafiei2017deep] proposes another extension of the replay memory for MARL. They consider multi-task cooperative games with independent, partially observable learners, such that each agent only knows its own action, with a joint reward. An algorithm called HDRQN is proposed, which is based on the DRQN algorithm [hausknecht2015deep] and Hysteretic Q-learning [matignon2007hysteretic]. Also, to alleviate the non-stationarity of MARL, the idea of Concurrent Experience Replay Trajectories (CERTs) is proposed, in which the experience replay memory gathers the experiences of all agents for each period of an episode, and when sampling a mini-batch it obtains the experiences of one period of all agents together. Since they use an LSTM [hochreiter1997long], the experiences in the replay memory are zero-padded (zeros are appended to the end of the shorter experiences to make all experiences the same length). Moreover, in the multi-task version of HDRQN, there are different tasks, each with its own transition probability, observation, and reward function. During training, each agent observes the task ID, while it is not accessible at inference time. To evaluate the model, a two-player game is utilized, in which agents are rewarded only when all agents simultaneously capture the moving target. In order to make the game partially observable, a flickering screen is used, such that with some probability the screen flickers. The actions of the agents are moving north, south, west, east, or waiting. Additionally, actions are noisy, i.e., with some probability the agent might act differently than it intended.

## 3 Fully Observable Critic

Non-stationarity of the environment is the main issue in MARL problems. One of the common approaches to addressing this issue is using a fully observable critic. The fully observable critic involves the observations and actions of all agents, and as a result the environment is stationary from its perspective even though the policies of the other agents change. In other words, even if $\pi_i \neq \pi'_i$ for some agents, $P(s' \mid s, a_1, \dots, a_N, \pi_1, \dots, \pi_N) = P(s' \mid s, a_1, \dots, a_N, \pi'_1, \dots, \pi'_N)$, since the environment returns the same next-state distribution given the joint action, regardless of the changes in the policies of the other agents. So, once the critic is fully observable, the non-stationarity is resolved, and it can be used as a good guide for the local actors. Using this idea, [lowe2017multi] proposes a model-free multi-agent reinforcement learning algorithm for the problem in which agent $i$ at time step $t$ of execution accesses its own local observation $o^i_t$, local action $a^i_t$, and local reward $r^i_t$. They consider cooperative, competitive, and mixed competitive-cooperative games, and propose the Multi-Agent DDPG (MADDPG) algorithm, in which each agent trains a DDPG algorithm such that the actor with policy weights $\theta_i$ observes the local observations, while the critic is allowed to access the observations, actions, and target policies of all agents at training time. Then, each critic concatenates all state-action pairs together as the input and, using the local reward, obtains the corresponding Q-value. The critic is trained by minimizing a DQN-like loss function:

$$\mathcal{L}(\theta_i) = \mathbb{E}\left[ \left( Q_i(o, a_1, \dots, a_N; \theta_i) - y \right)^2 \right], \qquad y = r_i + \gamma\, Q'_i(o', a'_1, \dots, a'_N) \big|_{a'_j = \mu'_j(o'_j)},$$

in which $\mu'_j$ is the target policy of agent $j$ and $Q'_i$ is the target critic. As a result, the *critic of each agent deals with a stationary environment*, and at inference time it only needs to access the local information. MADDPG is compared with the decentralized trained versions of DDPG [lillicrap2015continuous], DQN [mnih2015human], REINFORCE [sutton2000policy], and the TRPO [schulman2015trust] algorithm in a set of grounded communication environments from the particle environment [haarnoja2018soft], e.g., predator-prey, the arrival task, etc. The continuous state-action predator-prey environment from this suite is usually considered a benchmark for MARL problems with local observations and cooperative rewards. In the most basic version of predator-prey, there are two predators which are randomly placed in a grid, along with one prey which is also randomly located in the grid. Each predator observes its direct neighboring cells, and the goal is to catch the prey together to receive a reward; in all other situations, each predator obtains a negative reward.
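The construction of the centralized critic target can be sketched as follows; random linear maps stand in for the target policies and the target critic, and all shapes and numbers are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

# Sketch of the MADDPG critic target: the critic of agent i sees the
# concatenated observations and actions of all agents, and the target
# actions come from the target policies mu'_j applied to local observations.
# Random linear maps stand in for the learned networks; shapes are illustrative.
rng = np.random.default_rng(4)
n_agents, obs_dim, act_dim, gamma = 3, 4, 2, 0.95

obs_next = rng.normal(size=(n_agents, obs_dim))
W_mu = rng.normal(size=(n_agents, act_dim, obs_dim))    # stand-in target policies mu'_j
W_q = rng.normal(size=n_agents * (obs_dim + act_dim))   # stand-in target critic of agent i
r_i = 1.0

# a'_j = mu'_j(o'_j): each target policy acts on its own local observation only
act_next = np.einsum("jao,jo->ja", W_mu, obs_next)
# the centralized critic input concatenates every agent's observation and action
critic_in = np.concatenate([obs_next.ravel(), act_next.ravel()])
q_next = float(W_q @ critic_in)
y = r_i + gamma * q_next    # target for the DQN-like critic loss
```

The key point is the input shape: the critic conditions on the full joint observation-action, which is what makes its learning problem stationary, and also what makes its input grow linearly with the number of agents.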

Several extensions of the MADDPG algorithm have been proposed in the literature, and we review some of them in the rest of this section. [ryu2018multi] proposes an actor-critic model with local actors and critics for a Dec-POMDP problem, in which each agent observes a local observation $o_i$, observes its own reward $r_i$, and learns a deterministic policy $\mu_i$. The goal of agent $i$ is to maximize its own discounted return. An extension of MADDPG with a generative cooperative policy network, called MADDPG-GCPN, is proposed, in which there is an extra actor network to generate action samples of the other agents. Then, the critic uses a replay buffer filled with sampled actions from the GCPN, not from the actors of the other agents. So, there is no need to share the target policies of the other agents during training. Further, the algorithm is modified so that the critic can use either the immediate individual or the joint reward during training. They present a new version of the predator-prey game in which each agent receives an individual reward plus a shared one if they catch the prey. The experimental analysis on the predator-prey game and an energy storage control problem shows that the standard deviation of the obtained Q-values is lower in MADDPG-GCPN than in MADDPG.

As another extension of MADDPG, [chu2017parameter] considers multi-agent cooperative problems with $N$ agents and proposes three actor-critic algorithms based on MADDPG. The first one assumes all agents know the global reward and shares the weights between agents, so that it actually includes one actor and one critic network. The second algorithm assumes the global reward is not shared, and each agent updates its own critic using the local reward, so that there are $N$ critic networks. However, the agents share their actor weights, so that there is only one actor network and in sum $N+1$ networks are trained. The third algorithm also assumes a non-shared global reward, though it uses only two networks, one actor network and one critic network, such that the critic has $N$ heads, in which head $i$ provides the Q-value of agent $i$. They compare the results of their algorithms on three new games with MADDPG, PPO [schulman2017proximal], and PS-TRPO algorithms (PS-TRPO is the TRPO algorithm [schulman2015trust] with parameter sharing; see [sukthankar2017autonomous] for more details).

[mao2018modelling] presents another algorithm based on MADDPG for cooperative games, called ATT-MADDPG, which considers the same setting as [lowe2017multi]. It enhances the MADDPG algorithm by adding an attention layer to the critic network. In the model, each agent trains a critic which accesses the actions and observations of all agents, and an attention layer on top of the critic model determines the corresponding Q-value. In this way, at agent $i$, instead of just using $o_t$ and $a_t$ for time step $t$, ATT-MADDPG considers $K$ combinations of the possible action-vectors $a_{-i}$ and obtains the corresponding $K$ Q-values. Also, using an attention model, it obtains the weights of all $K$ action-sets, such that the hidden vector of the attention model is generated from the actions of the other agents ($a_{-i}$). Then, the attention weights are used to obtain the final Q-value as a weighted sum of the $K$ possible Q-values. Indeed, their model combines the MADDPG algorithm with the k-head Q-value model [van2017hybrid]. They provide numerical experiments on cooperative navigation, predator-prey, and a packet-routing problem, and compare the performance with MADDPG and a few other algorithms. Moreover, the effect of a small or large $K$ is analyzed in each environment.

In [wang2019r], again the problem setting of MADDPG is considered, though they assume a limit on the communication bandwidth. Due to this limitation, the agents are not able to share all of their information; e.g., they cannot share their locations in the "arrival task" (from the Multi-Agent Particle Environment [mordatch2017emergence]), which limits the ability of the MADDPG algorithm to solve the problem. To address this issue, they propose R-MADDPG, in which a recurrent neural network is used to remember the last communication in both the actor and the critic. Accordingly, they modify the replay memory such that each tuple also includes the hidden state $h_t$ of the actor network. The results are compared with MADDPG on the "arrival task" with a communication limit, where each agent can select either to send a message or not, and the message is simply the position of the agent. With the fully observable state, their algorithm works as well as MADDPG. On the other hand, in the partially observable environment, a recurrent actor (with a fully connected critic) does not provide any better results than MADDPG; though, by applying both a recurrent actor and a recurrent critic, R-MADDPG obtains higher rewards than MADDPG.

Since MADDPG concatenates all the local observations in the critic, it faces the curse of dimensionality as the number of agents increases. To address this issue,

[iqbal2018actor] proposed the Multiple Actor Attention-Critic (MAAC) algorithm, which scales efficiently with the number of agents. The main idea in this work is to use an attention mechanism [choi2017multi, jiang2018learning] to select the relevant information for each agent during training. In particular, agent $i$ receives the observations $o = (o_1, \dots, o_N)$ and actions $a = (a_1, \dots, a_N)$ of all agents. The value function of agent $i$, parameterized by $\psi$, is defined as a function of agent $i$'s observation-action pair as well as the information received from the other agents:

$$Q_i^{\psi}(o, a) = f_i\big(g_i(o_i, a_i), x_i\big),$$

where $f_i$ is a two-layer perceptron, $g_i$ is a one-layer embedding function, and $x_i$ is the contribution of the other agents. In order to fix the size of $x_i$, it is set equal to the weighted sum of the other agents' embedded observation-actions:

$$x_i = \sum_{j \neq i} \alpha_j v_j, \qquad v_j = h\big(V g_j(o_j, a_j)\big),$$

where $v_j$ is a function of the embedding of agent $j$, encoded with the embedding function $g_j$ and then linearly transformed by a shared matrix $V$, and $h$ is the activation function. Denoting $e_j = g_j(o_j, a_j)$ and using the query-key system [vaswani2017attention], the attention weight $\alpha_j$ is proportional to

$$\exp\big( e_j^{\top} W_k^{\top} W_q e_i \big),$$

where $W_q$ transforms $e_i$ into a "query" and $W_k$ transforms $e_j$ into a "key". The critic step updates $\psi$ by minimizing the loss function

$$\mathcal{L}_Q(\psi) = \sum_{i=1}^{N} \mathbb{E}\Big[ \big( Q_i^{\psi}(o, a) - y_i \big)^2 \Big], \quad \text{where } y_i = r_i + \gamma\, \mathbb{E}_{a' \sim \pi_{\bar{\theta}}}\Big[ Q_i^{\bar{\psi}}(o', a') - \alpha \log \pi_{\bar{\theta}}(a'_i \mid o'_i) \Big],$$

and $\bar{\psi}$ and $\bar{\theta}$ are the parameters of the target critics and target policies, respectively. In order to encourage exploration, they also use the idea of soft actor-critic (SAC) [haarnoja2018soft]. MAAC is compared with COMA [foerster2018counterfactual] (discussed shortly), MADDPG, their updated versions with SAC, and an independent learner with DDPG [lillicrap2015continuous], over two environments: treasure collection and rover-tower. MAAC obtains better results than the other algorithms, and the performance gap becomes smaller as the number of agents increases.
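The query-key attention step described above can be sketched in a few lines of NumPy; the dimensions, random weight matrices, and embeddings below are hypothetical stand-ins for the learned networks $g_i$, $W_q$, $W_k$, and $V$:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Hypothetical dimensions: N agents, embedding size d.
N, d = 4, 8
W_q = rng.normal(size=(d, d))   # "query" transform
W_k = rng.normal(size=(d, d))   # "key" transform
V = rng.normal(size=(d, d))     # shared value matrix

# e[j] stands in for the embedding g_j(o_j, a_j) of agent j.
e = rng.normal(size=(N, d))

def other_agents_contribution(i):
    """Attention-weighted sum x_i over the other agents' embeddings."""
    query = W_q @ e[i]
    others = [j for j in range(N) if j != i]
    scores = np.array([(W_k @ e[j]) @ query for j in others])
    alpha = softmax(scores)                           # attention weights
    values = np.array([np.tanh(V @ e[j]) for j in others])
    return alpha @ values                             # x_i, fixed size d

x_0 = other_agents_contribution(0)
```

Because the attention weights sum to one, $x_i$ keeps a fixed size $d$ no matter how many agents are present, which is what lets MAAC scale with the number of agents.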

[jiang2018graph] assumes a graph connection among the agents such that each node is an agent. A partially observable environment is assumed in which each agent has a local observation, takes a local action, and receives a local reward, while agents can share their observations with their neighbors and the network weights are shared among all agents. A graph convolutional reinforcement learning algorithm for cooperative multi-agent problems is proposed. The multi-agent system is modeled as a graph in which the agents are the nodes, and each node has features which are the encoded local observations. A multi-head attention model is used as the convolution kernel to obtain the connection weights to the neighboring nodes. To learn the Q-function, an end-to-end algorithm named DGN is proposed, which uses the centralized-training, distributed-execution approach. The goal is to maximize the sum of the rewards of all agents. During training, DGN lets the gradients of one agent flow to its neighboring agents, enlarging its receptive field to stimulate cooperation. In particular, DGN consists of three phases: (i) an observation encoder, (ii) convolutional layers, and (iii) a Q-network. The encoder (a simple MLP, or a convolutional layer if it deals with images) receives the observation of agent $i$ at time $t$ and encodes it into a feature vector. In phase (ii), a convolutional layer integrates the feature vectors of the local neighborhood to generate a latent feature vector for each agent. In order to obtain the latent vector, an attention model is used to make the input independent of the number of input features. The attention model receives the feature vectors of the neighbors, generates the attention weights, and then calculates the weighted sum of the feature vectors to obtain the latent vector. Another convolutional layer may be added to the model to increase the receptive field of each agent, taking the latent vectors of the first layer as inputs and producing further-integrated latent vectors as outputs. Finally, in phase (iii), the Q-network provides the Q-value of each possible action.
Based on the idea of DenseNet [huang2017densely], the Q-network gathers the observations and all the latent features and concatenates them as its input. (Note that this procedure is followed for each agent, and the weights of the network are shared among all agents.) Besides the Q-value loss, a penalized KL-divergence term is added to the loss function. This term measures the change between the current attention weights and the next-state attention weights, discouraging drastic changes in the attention weights. Using the trained network, at execution time each agent observes its own observation plus the observations of its neighbors to select an action. The results of the algorithm are compared with DQN, CommNet [sukhbaatar2016learning], and Mean Field Q-learning [yang2018mean] on the Jungle, Battle, and Routing environments.
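The attention "convolution" over a neighborhood can be sketched as follows; the graph, feature size, number of heads, and random weight matrices are illustrative assumptions, not the DGN architecture itself:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Hypothetical setup: 5 agents on a fixed graph, feature size d, M attention heads.
N, d, M = 5, 6, 2
adj = {0: [1, 2], 1: [0, 3], 2: [0, 4], 3: [1], 4: [2]}
feats = rng.normal(size=(N, d))      # encoded local observations (phase (i) output)
Wq = rng.normal(size=(M, d, d))      # per-head query transforms
Wk = rng.normal(size=(M, d, d))      # per-head key transforms
Wv = rng.normal(size=(M, d, d))      # per-head value transforms

def conv_layer(i):
    """One attention convolution: integrate agent i's neighborhood features."""
    nbrs = [i] + adj[i]              # the agent attends to itself as well
    heads = []
    for m in range(M):
        q = Wq[m] @ feats[i]
        alpha = softmax(np.array([(Wk[m] @ feats[j]) @ q for j in nbrs]))
        vals = np.array([Wv[m] @ feats[j] for j in nbrs])
        heads.append(alpha @ vals)   # weighted sum over the neighborhood
    return np.concatenate(heads)     # latent feature vector for agent i

latent = np.stack([conv_layer(i) for i in range(N)])
```

Stacking a second such layer on `latent` would widen each agent's receptive field to two-hop neighbors, as the paragraph above describes.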

[yang2018mean] considers multi-agent RL problems in which a huge number of agents collaborate or compete with each other to optimize some specific long-term reward. They propose the Mean Field Reinforcement Learning framework, in which every single agent only considers an average effect of its neighborhood, instead of exhaustive communication with all other agents in the population. Two algorithms, namely Mean Field Q-learning (MF-Q) and Mean Field Actor-Critic (MF-AC), are developed following the mean-field idea. There exists a single state visible to all agents, and the local rewards and local actions of all agents are also visible to the others during training. Applying Taylor's theorem, it is proved that the joint Q-function $Q_j(s, \mathbf{a})$ can be approximated by $Q_j(s, a_j, \bar{a}_j)$, where $\mathbf{a}$ concatenates the actions of all agents, $a_j$ is the action of agent $j$, and $\bar{a}_j$ denotes the average action of its neighbors. Furthermore, utilizing the contraction-mapping technique, it is shown that the mean-field Q-values converge to the Nash Q-values under some particular assumptions. The proposed algorithms are tested on three different problems: Gaussian Squeeze and the Ising model (a framework in statistical mechanics to mathematically model ferromagnetism), which are cooperative problems, and the battle game, which is a mixed cooperative-competitive game. The numerical results show the effectiveness of the proposed method in the case of many-agent RL problems.
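The core of the mean-field approximation, replacing the neighbors' joint action with their average one-hot-encoded action $\bar{a}_j$, can be sketched as follows (the action assignments and neighborhood are hypothetical):

```python
import numpy as np

# Mean-field idea: Q_j(s, a) is approximated by Q_j(s, a_j, a_bar_j), where
# a_bar_j is the average one-hot action of agent j's neighbors.
n_actions = 3
actions = {0: 1, 1: 2, 2: 2, 3: 0}     # hypothetical actions of agents 0..3
neighbors = {0: [1, 2, 3]}             # agent 0's neighborhood

def one_hot(a):
    v = np.zeros(n_actions)
    v[a] = 1.0
    return v

def mean_action(j):
    """Average one-hot action over agent j's neighbors."""
    nbrs = neighbors[j]
    return sum(one_hot(actions[k]) for k in nbrs) / len(nbrs)

a_bar = mean_action(0)   # a distribution-like summary of the neighborhood
```

The Q-function then takes the fixed-size pair $(a_j, \bar{a}_j)$ as input instead of the full joint action, so its input size no longer grows with the population.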

[foerster2018counterfactual] proposes COMA, a model with a single centralized critic which uses the global state, the vector of all actions, and a joint reward. This critic is shared among all agents, while each actor is trained locally with its local observation-action history. The joint reward is used to train the centralized critic $Q(s, a)$. Then, for agent $i$, keeping the other agents' actions $a^{-i}$ fixed, the actor uses the counterfactual baseline

$$b^i(s, a^{-i}) = \sum_{a'^i} \pi^i(a'^i \mid \tau^i)\, Q\big(s, (a^{-i}, a'^i)\big)$$

to obtain the contribution of action $a^i$ via the advantage function $A^i(s, a) = Q(s, a) - b^i(s, a^{-i})$. Also, the actors share their weights with each other, and use a gated recurrent unit (GRU) [cho2014learning] to utilize the history of the observations. They present the results of their algorithm on the StarCraft game [samvelyan19smac] and compare COMA with central-V, central-QV, and two implementations of independent actor-critic (IAC).

[yang2018cm3] considers a multi-agent cooperative problem in which, in addition to the cooperative goal, each agent needs to attain some personal goals. Each agent observes a local observation, a local reward corresponding to its goal, and its own history of actions, while the actions are executed jointly in the environment. This is a common situation in problems like autonomous driving; for example, each car has to reach a given destination, and all cars need to avoid accidents and cooperate at junctions. The authors propose a two-phase algorithm (centralized training, decentralized execution) called CM3. In the first phase, one single network is trained for all agents to learn the personal goals. The output of this network, a hidden layer, is passed to a given layer of the second network to initialize it, and the goal of the second phase is to attain the global goal. Also, since the collectively optimal solution of all agents is not necessarily optimal for every individual agent, a credit-assignment approach is proposed to obtain the global solution for all agents. This credit assignment, motivated by [foerster2018counterfactual], is embedded in the design of the second phase. In the first phase of the algorithm, each agent is trained with an actor-critic algorithm as a single-agent problem, learning to achieve the goal given to that agent. In this way, each agent learns a local policy that maximizes its expected return for any randomly assigned goal. All agents share the weights of the policy, so the model reduces to maximizing the expected return over agents and goals, and the policy update is performed using an advantage approximation with respect to the shared parameters, conditioned on the goal $g_i$ of agent $i$.
The second phase starts from the pre-trained agents, and trains a new global network in the multi-agent setting to achieve a cooperative goal through comprehensive exploration. The cooperative goal is the sum of the local rewards. In the first phase, each agent only observes the part of the state containing the observation required to complete the personal task. In the second phase, additional observations are given to each agent to achieve the cooperative goal. This new information is used to train the centralized critic and is also used in the advantage function to update the actor's policy. The advantage function uses the counterfactual baseline [foerster2018counterfactual], and the global objective is updated by a policy gradient with this counterfactual advantage. Finally, a combined version of the local and global models is used in this phase to train the model with the centralized critic. They present experiments on an autonomous-vehicle negotiation problem, and compare the results with COMA [foerster2018counterfactual] and an independent actor-critic learners model.
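The counterfactual baseline used by COMA, and reused by CM3 for credit assignment, can be sketched as follows; the Q-values and policy below are hypothetical numbers, not learned quantities:

```python
import numpy as np

def counterfactual_advantage(Q_i, pi_i, a_i):
    """COMA-style advantage for agent i.

    Q_i[u]  stands in for Q(s, (a^{-i}, u)): the centralized Q-value when
            agent i plays u and the other agents' actions are held fixed.
    pi_i[u] is agent i's policy pi^i(u | tau^i).
    a_i     is the action agent i actually took.
    """
    baseline = np.dot(pi_i, Q_i)   # expected Q under agent i's own policy
    return Q_i[a_i] - baseline

# Hypothetical values over agent i's three actions.
Q_i = np.array([1.0, 2.0, 4.0])
pi_i = np.array([0.5, 0.25, 0.25])
adv = counterfactual_advantage(Q_i, pi_i, a_i=2)   # 4.0 - 2.0 = 2.0
```

Because the baseline marginalizes only agent $i$'s action, the advantage isolates that agent's contribution without changing the expected policy gradient.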

[sartoretti2019distributed] extends the A3C algorithm [mnih2016asynchronous] to a centrally learned, shared actor and critic. The proposed algorithm is based on centralized training and decentralized execution, in a fully observable environment. They consider a construction problem, TERMES [petersen2012termes], in which each agent is responsible for gathering, carrying, and placing blocks to build a certain structure. Each agent observes the global state plus its own location, takes its own local action, and executes it locally (no joint action selection/execution). Each agent receives a local sparse reward: +1 if it puts down a block in a correct position and -1 if it picks up a block from a correct position; any other action receives a reward of 0. During training, all agents share the weights of both the actor and critic models (as is done in A3C) and train a central agent asynchronously. At execution time, each agent uses one copy of the learned policy without communicating with the other agents. The goal of all agents is to achieve the maximum common reward. The learned policy can be executed by an arbitrary number of agents, and each agent sees the other agents as moving elements, i.e., as part of the environment.

Finally, in [kim2019learning] a MARL problem is considered under two assumptions: 1) the communication bandwidth among the agents is limited; 2) there exists a shared communication medium such that at each time step only a subset of the agents is able to use it to broadcast their messages to the other agents. Therefore, communication scheduling is required to determine which agents may broadcast their messages. Utilizing the proposed framework, called SchedNet, the agents are able to schedule themselves, learn how to encode the received messages, and learn how to pick actions based on these encoded messages. SchedNet follows centralized training and distributed execution: in training, the global state is available to the critic, while the actor is local to each agent and the agents are able to communicate through the limited channel. To control the communication, a medium access control (MAC) protocol is used, which relies on a weight-based scheduling algorithm (WSA) to determine which nodes can access the shared medium. The local actor of agent $i$ contains three networks: 1) a message encoder, which takes the local observation and outputs a message; 2) a weight generator, which takes the local observation and outputs a scalar weight that determines the importance of the observation of node $i$; and 3) an action selector, which receives the local observation and the encoded messages, plus the information from the scheduling module that selects the agents allowed to broadcast their messages, and maps this information to the action of agent $i$. A centralized critic is used during training to criticize the actor. In particular, the critic receives as input the global state of the environment and the weight vector generated by the weight-generator networks, and produces two outputs.
The first is used to update the weight generator, while the second is used to adjust the weights of the two other networks, namely the action selector and the message encoder. Experimental results on predator-prey, cooperative-communication, and navigation tasks demonstrate that intelligent communication scheduling can be helpful in MARL.
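A weight-based scheduler of this kind can be sketched very simply: each agent emits a scalar importance weight and only the $k$ largest-weight agents broadcast. The weights below are hypothetical; in SchedNet they are produced by the learned weight generators.

```python
def schedule(weights, k):
    """Return a 0/1 schedule allowing the k largest-weight agents to broadcast.

    This is a sketch of a top-k weight-based scheduling rule under a
    bandwidth limit of k simultaneous broadcasts.
    """
    top = sorted(range(len(weights)), key=lambda i: -weights[i])[:k]
    return [1 if i in top else 0 for i in range(len(weights))]

# Hypothetical importance weights for four agents, bandwidth limit k = 2.
sched = schedule([0.9, 0.1, 0.4, 0.7], k=2)   # agents 0 and 3 broadcast
```

Only the scheduled agents' encoded messages are then visible to the action selectors at the next step.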

## 4 Value Function Factorization

Consider a cooperative MARL problem in which we are allowed to share all information among the agents and there is no limitation on communication. Further, assume that we are able to handle the resulting huge action space. In this scenario, a centralized RL approach can be used to solve the problem: all state observations are merged together and the problem reduces to a single-agent problem with a combinatorial action space. However, [sunehag2018value] shows that naive centralized RL methods fail to find the global optimum, even when we are able to solve problems with such huge state and action spaces. The issue comes from the fact that some of the agents may become "lazy" and not learn to cooperate as they are supposed to, which may lead to the failure of the whole system. One possible approach to address this issue is to determine the role of each agent in the joint reward and then isolate its share of it. This category of algorithms is called value function factorization.

In POMDP settings, if the optimal reward shaping is available, the problem reduces to training several independent learners, which simplifies the learning. Therefore, having a reward-shaping model would be appealing for any cooperative MARL problem. However, it is not easy to divide the received reward among the agents based on their contributions. Following this idea, the rest of this section discusses the corresponding algorithms.

In the literature of tabular RL, there are two common approaches for reward shaping: (i) *difference rewards* [agogino2004unifying], which try to isolate the reward of each agent from the joint reward, i.e., $D_i = G(z) - G(z_{-i})$, where $G(z)$ is the global reward and $G(z_{-i})$ denotes the global reward without the contribution of agent $i$;
(ii) *potential-based reward shaping* [ng1999policy]. In this class of reward-shaping methods, the term $r + \gamma \Phi(s') - \Phi(s)$ is used instead of the mere reward $r$, in which the potential function $\Phi(s)$ defines the desirability of the agent being at state $s$. This approach has also been extended to online POMDP settings [eck2016potential] and to the multi-agent setting [devlin2011theoretical], though defining the potential function is challenging and usually needs specific domain knowledge. In order to address this issue, [devlin2014potential] combines these two approaches and proposes two tabular algorithms.
Following this idea, several value function factorization algorithms have been proposed to automate reward shaping and avoid the need for a field expert; they are summarized in the following.
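Potential-based shaping is small enough to show in full; the potential below (negative distance to a hypothetical goal state) is only an illustration:

```python
def shaped_reward(r, s, s_next, phi, gamma=0.99):
    """Potential-based reward shaping: r + gamma * phi(s') - phi(s).

    Shaping of this form is known to preserve the optimal policy [ng1999policy].
    """
    return r + gamma * phi(s_next) - phi(s)

# Hypothetical potential: negative distance to a goal located at state 10.
phi = lambda s: -abs(10 - s)

# Moving from state 4 to state 5 (closer to the goal) with zero raw reward
# yields a positive shaped reward.
bonus = shaped_reward(0.0, s=4, s_next=5, phi=phi, gamma=1.0)
```

With $\gamma = 1$ the shaped reward telescopes over a trajectory, so the agent's return changes only by a constant depending on the start and end states.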

[sunehag2018value] considers a fully cooperative MARL problem (so a single shared reward exists) in which each agent observes its own state and action history. An algorithm, called VDN, is proposed to decompose the value function across the agents. Intuitively, VDN measures the impact of each agent on the observed joint reward. It is assumed that the joint action-value function can be additively decomposed into $N$ Q-functions, one per agent, in which each Q-function relies only on the local state-action history:

$$Q_{tot}\big((h^1, \dots, h^N), (a^1, \dots, a^N)\big) = \sum_{i=1}^{N} Q_i(h^i, a^i), \tag{14}$$

where $h^i$ and $a^i$ are the observation-action history and the action of agent $i$. In other words, it assumes the validity of the individual-global-max (IGM) condition: individual action-value functions $[Q_i]_{i=1}^{N}$ satisfy the IGM condition for the joint action-value function $Q_{tot}$ if

$$\arg\max_{a} Q_{tot}(\tau, a) = \Big( \arg\max_{a^1} Q_1(\tau^1, a^1), \;\dots,\; \arg\max_{a^N} Q_N(\tau^N, a^N) \Big), \tag{15}$$

in which $\tau = (\tau^1, \dots, \tau^N)$ is the vector of local observation histories of all agents, and $a = (a^1, \dots, a^N)$ is the vector of actions of all agents. Therefore, each agent observes its local state, obtains the Q-values for its actions, and selects an action; the sum of the Q-values for the selected actions of all agents then provides the total Q-value of the problem. Using the shared reward and the total Q-value, the loss is calculated and the gradients are backpropagated into the networks of all agents. In the numerical experiments, a recurrent neural network with the dueling architecture [wang2015dueling] is used to train the model. Also, two extensions of the model are analyzed: (i) sharing the policy among the agents, by adding the one-hot code of the agent id to the state input, and (ii) adding information channels to share some information among the agents. Finally, VDN is compared with independent learners and centralized training in three versions of a two-player 2D grid.

QMIX [rashid2018qmix] considers the same problem as VDN and proposes an algorithm which is basically an improvement over VDN [sunehag2018value]. As mentioned, VDN adds restrictions to ensure additivity of the Q-value and shares the action-value function during training. QMIX also shares the action-value function during training (a centralized-training, decentralized-execution algorithm); however, it replaces the additivity assumption with the following constraint:

$$\frac{\partial Q_{tot}}{\partial Q_i} \geq 0, \qquad \forall i \in \{1, \dots, N\}, \tag{16}$$

which is enforced through positive weights in the mixing network and, as a result, guarantees the monotonic relationship between $Q_{tot}$ and each $Q_i$. Particularly, in this model each agent $i$ has a network producing $Q_i$, and these networks feed into a general mixing network that provides $Q_{tot}$, the Q-value of the whole game. Each agent network has the same structure as DRQN [hausknecht2015deep] and is trained using the same loss function as DQN. Besides the monotonicity constraint over the relationship between $Q_{tot}$ and each $Q_i$, QMIX feeds extra information from the global state, plus a non-linearity, into the mixing network to improve the solution quality. They provide numerical results on StarCraft II and compare the solution with VDN.
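The difference between the two mixing schemes can be sketched as follows; the mixer weights are random stand-ins for QMIX's state-conditioned hypernetwork outputs, and a ReLU replaces the ELU used in the paper to keep the sketch short:

```python
import numpy as np

rng = np.random.default_rng(2)

# Per-agent Q-values for the chosen actions (3 hypothetical agents).
q_i = np.array([1.0, -0.5, 2.0])

# VDN: additive decomposition, as in eq. (14).
q_tot_vdn = q_i.sum()

# QMIX-style mixer: weights forced positive (via absolute value), so that
# dQ_tot/dQ_i >= 0, as required by the monotonicity constraint (16).
W1 = np.abs(rng.normal(size=(4, 3)))
W2 = np.abs(rng.normal(size=(1, 4)))

def qmix(q):
    hidden = np.maximum(W1 @ q, 0.0)   # monotone non-decreasing activation
    return float(W2 @ hidden)

q_tot_qmix = qmix(q_i)
```

Positive weights composed with monotone activations keep $Q_{tot}$ non-decreasing in every $Q_i$, so per-agent greedy action selection remains consistent with the joint maximization.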

Even though VDN and QMIX cover a large domain of MARL problems, their structural restrictions do not hold for all problems. To relax these restrictions, [son2019qtran] proposes the QTRAN algorithm. The general setting is the same as in VDN and QMIX (i.e., general DEC-POMDP problems in which each agent has its own partial observation and action history, and all agents share a joint reward). The key idea is that the actual joint action-value function $Q_{jt}$ may not be additively decomposable; instead, they consider an alternative joint action-value function $Q'_{jt} = \sum_{i=1}^{N} Q_i$, which is factorizable by construction. Then, to fill the possible gap between $Q_{jt}$ and $Q'_{jt}$, they introduce

$$V_{jt}(\tau) = \max_{u} Q_{jt}(\tau, u) - \sum_{i=1}^{N} Q_i(\tau_i, \bar{u}_i), \tag{17}$$

in which $\bar{u}_i$ is $\arg\max_{u_i} Q_i(\tau_i, u_i)$. Given (17), they prove that the factorization satisfies the IGM condition whenever

$$\sum_{i=1}^{N} Q_i(\tau_i, u_i) - Q_{jt}(\tau, u) + V_{jt}(\tau) \;\begin{cases} = 0, & u = \bar{u}, \\ \geq 0, & u \neq \bar{u}. \end{cases} \tag{18}$$

Based on this theory, three networks are built: the individual action-value networks $Q_i$, the joint action-value network $Q_{jt}$, and the state-value network $V_{jt}$, and three loss functions are introduced to train them. The local network at each agent is just a regular value-based network with the local observation, which provides the Q-values of all possible actions and runs locally at execution time. Both the joint and state-value networks use hidden features from the individual value-based networks to improve sample efficiency. In the experimental analysis, QTRAN is compared with VDN and QMIX on Multi-domain Gaussian Squeeze [holmesparker2014exploiting] and a modified predator-prey task [stone2000multiagent].

Within the cooperative setting with no joint reward, the reward-shaping idea can still be applied. Specifically, assume that at time step $t$, agent $i$ observes its own local reward $r_i^t$. In this setting, [mguni2018inducing] considers a MARL problem in which each agent observes the full state and takes its local action based on a stochastic policy. A general reward-shaping algorithm for the multi-agent problem is discussed, and a proof of convergence to a Nash equilibrium is provided. In particular, a meta-agent (MA) is introduced which modifies the agents' reward functions to obtain convergence to an efficient Nash equilibrium solution. The MA initially does not know the parametric reward-modifier function and learns its parameters through training. Basically, the MA wants to find the optimal parameters to reshape the reward function of each agent, though it only observes the reward corresponding to the chosen parameters. With given modifier parameters, the MARL problem can converge while the agents know nothing about the MA function; the agents only observe the reward assigned by the MA and use it to optimize their own policies. Once all agents have executed their actions and received their rewards, the MA receives the feedback and updates the modifier parameters. Training the MA with a gradient-based algorithm is quite expensive, so in the numerical experiments a Bayesian optimization with an expected-improvement acquisition function is used instead. To train the agents, an actor-critic algorithm is used with a two-layer neural network as the value network. The value network shares its parameters with the actor network, and an A2C algorithm [a2c] is used to train the actor. It is proved that under given conditions a reward-modifier function exists that maximizes the expectation of the modified reward; in other words, a Markov-Nash equilibrium (M-NE) exists in which each agent follows a policy that provides the highest possible value for that agent. Then, convergence to the optimal solution is proved under certain conditions.
To demonstrate the performance of the algorithm, a problem with 2000 agents is considered in which the desired locations of the agents change over time.

## 5 Consensus

The idea of the centralized critic, discussed in Section 3, works well when there is a small number of agents in the communication network. However, as the number of agents increases, the volume of information might overwhelm the capacity of a single unit. Moreover, in sensor-network applications, in which the information is observed across a large number of scattered centers, collecting all of this local information at a centralized unit is often a formidable task, given limitations such as energy constraints, privacy constraints, geographical limitations, and hardware failures. One idea to deal with this problem is to remove the central unit and allow the agents to communicate through a sparse network, sharing information with only a subset of the agents (called neighbors) in order to reach a consensus over a shared variable. Besides the numerous real-world applications of this setting, this is quite a fair assumption in MARL applications [zhang2018fully, jiang2018graph]. By limiting the number of neighbors to communicate with, the amount of communication remains linear in the number of neighbors. In this way, each agent uses only its local observations, together with some information shared by its neighbors, to stay aligned with the network. Further, applying the consensus idea, several works prove the convergence of the proposed algorithms when linear approximators are utilized. In the following, we review some of the leading and most recent papers in this area.

[varshavskaya2009efficient] studies a problem in which each agent has a local observation, executes its own local policy, and receives a local reward. A tabular policy-optimization agreement algorithm is proposed, which uses Boltzmann's law (similar to the soft-max function) to solve this problem. The agreement (consensus) algorithm assumes that an agent can send its local reward, a counter on the observations, and the taken action per observation to its neighbors. The goal of the algorithm is to maximize the weighted average of the local rewards. In this way, it is guaranteed that each agent learns as much as a central learner could, and therefore converges to a local optimum.

In [kar2013distributed, kar2013cal] the authors propose QD-learning, a decentralized multi-agent version of the tabular Q-learning algorithm. In this work, the global reward is expressed as the sum of the local rewards, though each agent is aware only of its own local reward. In the problem setup, the authors assume that the agents communicate through a time-invariant, undirected, weakly connected network, to share their information with their direct neighbors. All agents observe the global state and the global action, and the goal is to optimize the network-averaged infinite-horizon discounted reward. QD-learning works as follows. Assume that there are $N$ agents in the network and that agent $n$ can communicate with its neighbors $\Omega_n$. Each agent stores Q-values for all possible state-action pairs. Each update for agent $n$ at time $t$ includes the regular Q-learning innovation plus the deviation of its Q-values from those of its neighbors:

$$Q^n_{t+1}(s, a) = Q^n_t(s, a) + \alpha_t \Big( r_n(s, a) + \gamma \max_{a' \in \mathcal{A}} Q^n_t(s', a') - Q^n_t(s, a) \Big) - \beta_t \sum_{l \in \Omega_n} \big( Q^n_t(s, a) - Q^l_t(s, a) \big), \tag{19}$$

where $\mathcal{A}$ is the set of all possible actions, and $\alpha_t$ and $\beta_t$ are the step-sizes of the QD-learning algorithm. It is proved that this method converges asymptotically to the optimal Q-values under some specific conditions on the step-sizes. In other words, under the given conditions, the algorithm obtains the result that each agent could achieve if the problem were solved centrally.
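A single tabular update of this consensus-plus-innovation form, eq. (19), can be sketched as follows; the tables, transition, and step-sizes are illustrative:

```python
import numpy as np

def qd_update(Q_n, neighbor_Qs, s, a, r, s_next, alpha, beta, gamma=0.95):
    """One QD-learning update of agent n's table Q_n (shape [S, A]).

    innovation: the usual Q-learning temporal-difference term.
    consensus:  pulls agent n's entry toward its neighbors' entries.
    """
    innovation = r + gamma * Q_n[s_next].max() - Q_n[s, a]
    consensus = sum(Q_n[s, a] - Q_l[s, a] for Q_l in neighbor_Qs)
    return Q_n[s, a] + alpha * innovation - beta * consensus

# Two states, two actions; agent 1 starts at zero, its single neighbor at one.
Q1 = np.zeros((2, 2))
Q2 = np.ones((2, 2))
new_val = qd_update(Q1, [Q2], s=0, a=0, r=1.0, s_next=1, alpha=0.5, beta=0.1)
# innovation = 1.0, consensus = -1.0, so the new entry is 0.5 + 0.1 = 0.6
```

The consensus term is what propagates the locally observed rewards through the network, since no agent sees the global reward directly.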

In [DAC_Pennesi_2010] a distributed actor-critic (D-AC) algorithm is proposed under the assumption that the states, actions, and rewards are local to each agent; however, each agent's actions do not change the other agents' transition models. The critic step is performed locally, meaning that each agent evaluates its own policy using the local reward it receives from the environment. In particular, the state-action value function is parameterized by a linear function, and the parameters are updated locally in each agent using the temporal-difference algorithm together with eligibility traces. The actor step, on the other hand, is conducted using information exchange among the agents: first, the gradient of the average reward is calculated; then a gradient step is performed to improve the local copy of the policy parameters, along with a consensus step. A convergence analysis is provided, under diminishing step-sizes, showing that the gradient of the average reward function tends to zero for every agent as the number of iterations goes to infinity. In this paper, a sensor-network problem with multiple mobile nodes is considered for testing the proposed algorithm. In particular, there are several target points and mobile sensor nodes, and whenever a node visits a target point a reward is collected. The ultimate goal is to train the moving nodes in the grid such that the long-term reward is maximized. The authors consider a grid with three target points and sixteen agents. The numerical results show that the reward improves over time while the policy parameters reach consensus.

[macua2017diff] proposes a new algorithm, called Diffusion-based Distributed Actor-Critic (Diff-DAC), for single-task and multi-task multi-agent RL. In this setting, there are agents in a connected network (there is a path between any two agents), each assigned either the same task as the others or a different one, and the goal is to maximize the weighted sum of the value functions over all tasks. Each agent runs its own instance of the environment with a specific task, without interfering with the other agents. For example, each agent runs a given cart-pole problem where the pole length and mass differ across agents. Basically, one agent does not need any information, such as the state, action, or reward, of the other agents. Each agent learns its local policy parameters while trying to reach consensus with its neighbors using a diffusion strategy. In particular, Diff-DAC trains multiple agents in parallel with different and/or similar tasks to reach a single policy that performs well on average for all tasks, meaning that the single policy might obtain a high reward for some tasks but perform poorly for others. The problem formulation is based on the average reward, average value function, and average transition probability. Based on that, the authors provide a linear programming formulation of the tabular problem together with its Lagrangian relaxation and the duality condition for a saddle point. A dual-ascent approach is used to find a saddle point, in which (i) a primal solution is found for a given dual variable by solving an LP, and then (ii) a gradient ascent step is performed in the direction of the dual variables. These steps are performed iteratively to obtain the optimal solution. Next, the authors propose a practical algorithm utilizing a DNN function approximator. During the training of this algorithm, each agent

first performs an update over the weights of the critic as well as the actor network using local information. Then, a weighted average is taken over the weights of both networks, which ensures that these networks reach consensus. The algorithm is compared with centralized training on the cart-pole game.

In [zhang2018fully] a MARL problem is considered with the following setup. There exists a common environment for all agents and the global state is available to all of them; each agent $i$ takes its own local action $a^i$, and the global action is available to all agents; each agent $i$ receives a reward $r^i$ after taking an action, and this reward is visible only to agent $i$. In this setup, the agents communicate with their direct neighbors over a time-varying, possibly random network to share some information. Two AC-based algorithms are proposed to solve this problem. In the first algorithm, each agent has its own local approximation of the Q-function with weights $\omega^i$, though a fair approximation needs the global reward (not only the local reward $r^i$). To address this issue, it is assumed that each agent shares its parameter $\omega^i$ with its neighbors, and in this way a consensual estimate of the Q-function can be achieved. To update the critic, the temporal difference is estimated by $\delta^i_t = r^i_{t+1} - \mu^i_t + Q(s_{t+1}, a_{t+1}; \omega^i_t) - Q(s_t, a_t; \omega^i_t)$, in which

$$\mu^i_{t+1} = (1 - \beta_{v,t})\, \mu^i_t + \beta_{v,t}\, r^i_{t+1}, \tag{20}$$

i.e., $\mu^i_t$ is the moving average of the agent's rewards with parameter $\beta_{v,t}$, and the new weights $\tilde{\omega}^i_t$ are obtained locally. To achieve consensus, a weighted sum (with weights coming from the consensus matrix) of the parameters of the neighbors' critics is calculated as

$$\omega^i_{t+1} = \sum_{j \in \mathcal{N}^i_t} c_t(i, j)\, \tilde{\omega}^j_t. \tag{21}$$

This weighted sum provides the new weights of the critic for the next time step. To update the actor, each agent observes the global state and its local action to update its policy; though during training the advantage function requires the actions of all agents, as mentioned earlier. In the critic update, the agents share neither reward information nor actor policies; so, in some sense, the agents keep their data and policies private. However, they share their actions with all agents, so the setting is quite similar to the MADDPG algorithm, although MADDPG assumes local observations in the actor. In the second algorithm, besides sharing the critic weights, each critic also observes the moving-average reward estimates of its neighboring agents and uses them to obtain a consensual estimate of the global reward; this consensus-based reward update is performed in place of (20). Note that in the second algorithm the agents share more information among their neighbors. From the theoretical perspective, the authors provide a convergence proof for both algorithms in the case of linear function approximation. In the numerical results, they consider two examples: (i) a problem with 20 agents and |S| = 20, and (ii) a modified version of cooperative navigation [lowe2017multi] with 10 agents, in which each agent observes the full state and is assigned a given target landmark to cover, so each agent tries to get closer to its own landmark. They compare the results of the two algorithms with a single centralized actor-critic model that observes the rewards of all agents. In the first problem, their algorithms converge to the same return that the centralized algorithm achieves. In the second problem, a neural network is used, and with that non-linear approximation their algorithms show a small gap compared to the solutions of the centralized version.
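The consensus step shared by this family of algorithms, a weighted average of the neighbors' parameters under a doubly stochastic consensus matrix, can be sketched as follows; the matrix and parameter vectors are illustrative:

```python
import numpy as np

# A doubly stochastic consensus matrix for 3 fully connected agents
# (rows and columns both sum to 1).
C = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])

# Locally updated critic parameters of each agent (one row per agent).
local_w = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [2.0, 2.0]])

# One consensus step: each agent replaces its parameters with a convex
# combination of its neighbors' parameters.
w = C @ local_w

# Repeated averaging (with no further local updates) drives every agent
# toward the network-wide mean of the initial parameters.
for _ in range(50):
    w = C @ w
```

In the actual algorithms, a local TD update and a consensus step alternate, so the agents track a common critic while only ever exchanging parameters with their neighbors.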

[zhang2018networked] considers the MARL problem with continuous state and action spaces. The rest of the setting is similar to that of [zhang2018fully] (i.e., global state, global action, and local rewards), and again an AC-based algorithm is proposed. In general, for continuous spaces, stochastic policies lead to high variance in the gradient estimates. The deterministic policy gradient (DPG) algorithm [silver2014deterministic] was proposed to deal with this issue, but it requires off-policy exploration. However, in the setting of [zhang2018networked] the off-policy information of each agent is not known to the other agents, so the approach used in DPG [silver2014deterministic, lillicrap2015continuous] cannot be applied here. Instead, a gradient update based on the expected policy gradient (EPG) [ciosek2018expected] is proposed, which uses a global estimate of the Q-value, approximated by a consensus update. Thus, each agent shares the parameters of its Q-value estimator with its neighbors. Under the mentioned assumptions, convergence guarantees with a linear approximator are provided, and the performance is compared with a centrally trained algorithm for the same problem.

Following a similar setting as [zhang2018fully], [suttle2019multi] proposes a distributed off-policy actor-critic algorithm in which there exists a global state visible to all agents, each agent takes an action that is visible to the whole network, and each agent receives a reward that is available only locally. The main difference from [zhang2018fully] is that the critic step is conducted in an off-policy setting using the emphatic temporal-difference (ETD) policy evaluation method [sutton2016emphatic]. In particular, ETD uses a state-dependent discount factor and a state-dependent bootstrapping parameter, and involves an interest function that encodes the user's interest in specific states. The algorithm steps are as follows. First, each agent performs a consensus step over the critic parameters. Since each agent's behavior policy differs from the target policy, importance sampling [kroese2012monte] is applied to re-weight the samples from the behavior policy so that they correspond to the target policy; an inner loop then performs another consensus step over the importance sampling ratios. Next, a critic update using the emphatic temporal-difference algorithm is performed locally and the updated weights are broadcast over the network. Finally, each agent performs the actor update using local gradient information for the actor parameters. Following the analysis in [yu2015convergence], the authors prove the convergence of the proposed distributed actor-critic method when linear function approximation is utilized.
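The importance-sampling re-weighting step can be illustrated with a toy example. The policies and reward below are made up for illustration; this is not the exact estimator of [suttle2019multi]:

```python
import numpy as np

# Samples drawn from a behavior policy b are re-weighted by the ratio
# rho = pi(a|s) / b(a|s), so that sample averages match expectations
# under the target policy pi.

rng = np.random.default_rng(1)
actions = np.array([0, 1])
pi = np.array([0.8, 0.2])    # target policy
b = np.array([0.5, 0.5])     # behavior policy actually used to sample

samples = rng.choice(actions, size=100_000, p=b)
rewards = np.where(samples == 0, 1.0, 0.0)   # toy reward: 1 for action 0
rho = pi[samples] / b[samples]               # per-sample importance ratios

est = np.mean(rho * rewards)   # off-policy estimate of E_pi[reward]
print(round(est, 2))           # close to 0.8, the on-policy value under pi
```

The re-weighted estimate recovers the target-policy expectation (0.8 here) even though all samples came from the uniform behavior policy.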

[zhang2017datadriven] proposes a consensus RL algorithm in which each agent uses its local observations as well as those of its neighbors within a given directed graph. The multi-agent problem is modeled as a control problem, and a consensus error is introduced. The control policy is supposed to minimize the consensus error while stabilizing the system and keeping the local cost finite. A theoretical bound on the consensus error is provided, and the theoretical solution for obtaining the optimal policy is discussed, which indeed requires the environment dynamics. A practical actor-critic algorithm is then proposed, which uses neural network approximators with linear activation functions: the critic measures the local cost of each agent, and the actor network approximates the control policy. The results of the algorithm on a leader-tracking communication problem are presented and compared with the known optimal solution.

In [distributed_macua_2015], an off-policy distributed policy evaluation algorithm is proposed. A linear function is used to approximate the long-term reward of a given target policy, which is assumed to be the same for all agents, while different agents follow different behavior policies along the way. In particular, a distributed variant of the Gradient Temporal Difference (GTD) algorithm [GTD2_Sutton_2009] (GTD stabilizes the TD algorithm with linear function approximation in an off-policy setting) is developed utilizing a primal-dual optimization scenario. In order to deal with the off-policy setting, the importance sampling technique is applied. The state space, action space, and transition probabilities are the same for every node, but the agents' actions do not influence each other; this assumption makes the problem stationary, so the agents do not need to know the states and actions of the other agents. Regarding the reward, it is assumed that there exists only one global reward in the problem. First, the authors show that the GTD algorithm is a stochastic Arrow-Hurwicz algorithm (a primal-dual optimization method that iteratively performs gradient steps on the Lagrangian over the primal and dual variables [arrow_hurwicz]) applied to the dual of the original optimization problem. Then, inspired by [chen2012diffusion], they propose a diffusion-based distributed GTD algorithm. Under sufficiently small but constant step-sizes, they provide a mean-square-error performance analysis which proves that the proposed algorithms converge to a unique solution. In order to evaluate the performance of the proposed method, a 2-D grid-world problem with 15 agents is considered, and two different policies are evaluated using the distributed GTD algorithm. It is shown that the diffusion strategy helps the agents benefit from the other agents' experiences.
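The diffusion idea, a local learning step followed by a neighborhood combination step, can be sketched on a toy policy-evaluation problem. Tabular TD(0) stands in for the paper's GTD update, and all sizes and step-sizes are assumptions:

```python
import numpy as np

# Each agent takes a local tabular TD(0) step on its own samples from a
# shared 2-state chain, then combines (averages) its estimate with the
# other agents'. Illustrative only; not the algorithm of [distributed_macua_2015].

rng = np.random.default_rng(2)
n_agents = 3
P = np.array([[0.9, 0.1], [0.2, 0.8]])   # shared transition matrix
r = np.array([1.0, 0.0])                 # shared reward per state
gamma = 0.5
V_true = np.linalg.solve(np.eye(2) - gamma * P, r)   # exact value function

V = np.zeros((n_agents, 2))
A = np.full((n_agents, n_agents), 1.0 / n_agents)    # combine: plain averaging
for _ in range(5000):
    for i in range(n_agents):
        s = rng.integers(2)
        s2 = rng.choice(2, p=P[s])
        V[i, s] += 0.05 * (r[s] + gamma * V[i, s2] - V[i, s])   # local TD(0)
    V = A @ V                                        # diffusion combine step

print(np.max(np.abs(V - V_true)) < 0.3)   # every agent is near the true values
```

The combine step lets each agent benefit from the samples the others drew, which is the effect reported in the paper's grid-world experiment.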

Considering a similar setup as [distributed_macua_2015], two multi-agent policy evaluation algorithms over a time-varying communication network are proposed in [multiagent_stankovic_2016]. A given target policy is evaluated using samples derived from different policies in different agents (i.e., off-policy). As in [distributed_macua_2015], it is assumed that the actions of the agents do not interfere with each other. Weak convergence guarantees are provided for both algorithms.

Another variant of the distributed GTD algorithm was proposed in [PrimalDual_Lee_2018]. Each agent in the network follows a local policy, and the goal is to evaluate the global long-term reward, which is the sum of the local rewards. In this work, it is assumed that each agent can observe the global joint state. A linear function over state features is used as the estimator of the value function. The problem is modeled as a constrained optimization problem (with a consensus constraint), and then, following the same process as [distributed_macua_2015], a primal-dual algorithm is proposed to solve it. A rigorous analysis based on the ordinary differential equation (ODE) method [borkar_ode_2000] is provided for the proposed algorithm. In particular, the authors show that the set of KKT points of the Lagrangian dual problem can be compactly characterized by an ODE. To keep the algorithm stable, box constraints are added over the variables. Finally, under diminishing step-sizes, they prove that the distributed GTD (DGTD) algorithm converges with probability one. One of the numerical examples considers a stock market problem in which different agents have different policies for trading stocks, and DGTD is utilized to estimate the average long-term discounted profit over all agents. The results are compared with a single GTD algorithm in the case that the sum of the rewards is available, and the comparison shows that each agent can successfully approximate the global value function.

[cassano2018multi] considers two different scenarios for the policy evaluation task: (i) each agent follows a behavior policy different from the others, and the goal is to evaluate a target policy (i.e., off-policy); in this case, each agent has knowledge only of its own state and reward, which are independent of the other agents' states and rewards. (ii) The state is global and visible to all agents, the reward is local for each agent, and the goal is to evaluate the target team policy. The authors propose Fast Diffusion for Policy Evaluation (FDPE) for the case of a finite data set, which combines off-policy learning, eligibility traces, and linear function approximation, and can be applied to both scenarios. The main idea is to apply the variance-reduced algorithm AVRG [ying2018convergence] over a finite data set to obtain a linear convergence rate. Further, the cost function is modified to control the bias term: a multi-stage Bellman equation is used to derive a truncated, weighted Mean Square Projected Bellman Error, in contrast to the usual choice of the plain MSPBE (e.g., in [distributed_macua_2015]), and it is shown that the bias term can be controlled through the truncation horizon. Also, a regularization term is added to the cost function, which can be useful in some cases.

A distributed off-policy actor-critic algorithm is proposed in [zhang_distributed_2019]. In contrast to [zhang2018fully], where the actor step is performed locally and a consensus update is used for the critic, in [zhang_distributed_2019] the critic update is performed locally and the agents asymptotically reach consensus on the actor parameters. The state and action spaces are continuous, and each agent has a local state and action; however, the global state and the global action are visible to all agents. Both the policy function and the value function are linearly parameterized. A convergence analysis is provided under diminishing step-sizes for both the actor and critic steps. The effectiveness of the proposed method is studied on a distributed resource allocation problem.

## 6 Learn to Communicate

As mentioned earlier in Section 5, some environments allow communication among agents. The consensus algorithms use the communication bandwidth to pass raw observations, policy weights/gradients, critic weights/gradients, or some combination of them. A different way to use the communication bandwidth is to learn a communication action (like a message), allowing agents to send the information they want; in this way, an agent can learn when to send a message, what type of message to send, and which agent to send it to. Usually, the communication actions do not interfere with the environment, i.e., the messages do not affect the next state or reward. [kasai2008learning] proposed one of the first learning-to-communicate algorithms, in which tabular Q-learning agents learn messages to communicate with other agents in the predator-prey environment. The same approach with tabular RL is followed in [varshavskaya2009efficient]. Besides these early works, there are several recent papers in this area that utilize function approximators. In this section, we discuss some of the more relevant ones.

In one of the most recent works, [foerster2016learning] considers the problem of learning how to communicate in a fully cooperative multi-agent setting (recall that in a fully cooperative environment, agents share a global reward) in which each agent accesses a local observation and has a limited bandwidth to communicate with other agents. In each time-step, each agent takes an environment action, which affects the environment, and a communication action (a message), which does not affect the environment and is only observed by the other agents.
The proposed algorithm follows the centralized-learning decentralized-execution paradigm, under which it is assumed that at training time the agents have no restriction on the communication bandwidth.
They propose two main approaches to solve this problem. Both use DRQN [hausknecht2015deep] to address partial observability, and disable experience replay to deal with non-stationarity.
The input of the Q-network for each agent at a given time includes its current observation, the hidden state of the RNN, its previous actions, and the messages received from the other agents. When parameter sharing is used, the agent index is also added to the input, which helps learn a specialized network for each agent within the shared parameters. All of the input values are converted into vectors of the same size, either by a look-up table or an embedding (a separate embedding for each input element), and the sum of these *same-size vectors* is the final input to the network. The network consists of two GRU layers followed by two MLP layers, with a final layer representing Q-values for selecting the environment action and the communication action. The selection of the communication action differs between the two algorithms and is explained in the following. First, they propose the *reinforced inter-agent learning* (RIAL) algorithm, in which the network includes additional Q-values for selecting the discrete communication action. They also propose a practical version of RIAL in which the agents share the policy parameters, so that RIAL only needs to learn one network.
The second algorithm is *differentiable inter-agent learning* (DIAL), in which the message is continuous and the message receiver provides feedback (in the form of a gradient) to the message sender, in order to minimize the DQN loss. In other words, the receiver obtains the gradient of its Q-value w.r.t. the received message and sends it back to the sender, so that the sender knows how to change the message to optimize the Q-value of the receiver.
Intuitively, a communication action is rewarded if the receiver agent correctly interprets the message and acts upon it.
The network produces a continuous vector for the communication action, so there is no action selector for it; instead, a regularizing unit discretizes the message, if necessary.
They provide numerical results on the switch riddle prisoner game and on three communication games based on the MNIST dataset. The results are compared with no-communication and parameter-sharing versions of the RIAL and DIAL methods.
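The gradient-through-the-channel idea behind DIAL can be illustrated with a minimal linear sketch. The models, shapes, and learning rate below are assumptions for illustration, not DIAL's actual architecture:

```python
import numpy as np

# The receiver computes the gradient of its loss w.r.t. the incoming
# message and sends it back, so the sender can adjust the message it
# emits. Both models are plain linear maps here.

rng = np.random.default_rng(3)
obs = np.array([1.0, -1.0])      # sender's local observation
w_send = rng.normal(size=2)      # sender: message m = w_send . obs
w_recv = 2.0                     # receiver: q = w_recv * m
target = 4.0                     # receiver wants q == target

lr = 0.05
for _ in range(200):
    m = w_send @ obs             # forward pass through the channel
    q = w_recv * m
    dq = 2 * (q - target)        # d loss / d q for squared loss
    dm = dq * w_recv             # receiver's feedback: d loss / d message
    w_send -= lr * dm * obs      # sender applies the returned gradient

print(abs(w_recv * (w_send @ obs) - target) < 1e-3)
```

After training, the sender emits exactly the message the receiver needs, even though the sender never sees the receiver's loss directly, only the gradient flowing back through the channel.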

[jorge2016learning] extends DIAL in three directions: (i) allowing communication of arbitrary size, (ii) gradually increasing noise on the communication channel to make sure that the agents learn a symbolic language, and (iii) not sharing parameters across agents. They provide results on a version of the "Guess Who?" game with two agents, an "asking" agent and an "answering" agent. The game is about guessing the true image that the answering agent knows: the asking agent holds a set of images and should identify the correct one by asking questions. The answering agent returns only "yes/no" answers, and after a fixed number of questions the asking agent guesses the target image. The results of their algorithm with different parameters are presented. Following a similar line, [lazaridou2016multi] considers a problem with two agents and one round of communication to learn an interpretable language between a sender and a receiver agent. The sender receives two images and knows which is the target, and sends a message to the receiver along with the images. If the receiver guesses the correct image, both win a reward; thus, they need to learn to communicate through the message. Each agent converts the images to vectors using the VGG ConvNet [simonyan2014very]. The sender builds a neural network on top of the input vectors to select one of the available symbols as the message. The receiver embeds the message into a vector of the same size as the images' vectors, and then combines them through a neural network to obtain its guess. Both agents use the REINFORCE algorithm [williams1992simple] to train their models, and do not share their policies with each other. There is no pre-designed meaning associated with the symbols. Their results demonstrate a high success rate and show that the learned communications are interpretable.
In another work in this direction, [das2017learning] considers a fully cooperative two-agent game for the task of image guessing. In particular, two bots, a questioner bot (Q-BOT) and an answerer bot (A-BOT), communicate in natural language, and the task of Q-BOT is to guess an unseen image from a set of images. In every round of the game, Q-BOT asks a question and A-BOT provides an answer; then Q-BOT updates its information and makes a prediction about the image. The action space, consisting of all possible output sequences under a token vocabulary, is common to both agents, though the state is local to each agent: for A-BOT the state includes the sequence of questions and answers, the caption provided to Q-BOT, and the image itself, while the state of Q-BOT does not include the image information. There exists a single reward for both agents in this game. Similar to [lazaridou2016multi], the REINFORCE algorithm [williams1992simple] is used to train both agents. Note that [jorge2016learning] allows "yes/no" answers within multiple rounds of communication, [lazaridou2016multi] consists of a single round with continuous messages, and [das2017learning] combines them such that multiple rounds of continuous communication are allowed.
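The REINFORCE training used in these referential games can be sketched with a tiny two-symbol version. The tabular logits, sizes, and learning rate are simplifying assumptions, not the architecture of the cited works:

```python
import numpy as np

# A sender sees one of two targets and emits one of two symbols; a
# receiver maps the symbol to a guess; both are rewarded on a correct
# guess and update their own policies with REINFORCE.

rng = np.random.default_rng(4)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

S = np.zeros((2, 2))   # sender logits: row = target, col = symbol
R = np.zeros((2, 2))   # receiver logits: row = symbol, col = guess
lr, wins = 0.3, []
for _ in range(5000):
    target = rng.integers(2)
    ps = softmax(S[target]); sym = rng.choice(2, p=ps)
    pr = softmax(R[sym]); guess = rng.choice(2, p=pr)
    reward = 1.0 if guess == target else 0.0
    wins.append(reward)
    gs = -ps; gs[sym] += 1.0        # REINFORCE: grad log pi (no baseline)
    gr = -pr; gr[guess] += 1.0
    S[target] += lr * reward * gs
    R[sym] += lr * reward * gr

print(np.mean(wins[-500:]))   # success rate climbs as a shared code emerges
```

No symbol has a predefined meaning; the two agents only coordinate through the reward signal, which is the mechanism the referential-game papers rely on.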

Similarly, [mordatch2018emergence] studies a joint-reward problem in which each agent observes the locations and communication messages of all agents. Each agent has a goal vector that only it can access (such as moving to or gazing at a given location), and the goal may involve interacting with other agents. In each time-step, each agent chooses one physical action (e.g., moving or gazing to a new location) and one symbol from a given vocabulary list. The symbols are treated as abstract categorical variables without any predefined meaning, and the agents learn to use each symbol for a given purpose. All agents have the same action space and share their policies. Unlike [lazaridou2016multi], there is an arbitrary number of agents, the agents do not have predefined roles such as speaker and listener, and the goals are not specified in terms of a correct utterance. The goal of the model is to maximize the reward while creating a language that is interpretable and understandable by humans. To this end, a soft penalty is added to encourage small vocabulary sizes, which results in having multiple words combined to create a meaning. The proposed model embeds the state variables of all agents through a fully connected network, and similarly embeds all communication messages. It then combines the goal of the agent with these two embeddings through a fully connected network, from which the physical action is obtained directly and the communication symbol is sampled through a Gumbel-softmax estimator [jang2016categorical]. The results of the algorithm are compared with a no-communication approach on the mentioned game.

[sukhbaatar2016learning] considers a fully cooperative MARL task in which each agent observes a local state and is able to send a continuous communication message to the other agents. They propose a model, called CommNet, in which a central controller takes the state observations and the communication messages of all agents and runs multiple communication steps to provide the actions of all agents in the output. CommNet assumes that each agent receives the messages of all other agents. In the first round, the state observation of each agent is encoded into a hidden vector, and the communication messages are zero. Then, in each round, the controller passes each agent's hidden vector together with the average of the other agents' hidden vectors through a shared linear layer followed by a non-linearity, obtaining the new hidden vector and message for each agent. To obtain the actions, the final hidden vector of each agent is decoded to provide a distribution over the action space. Furthermore, they provide a version of the algorithm in which each agent only observes the messages of its neighbors.
Note that, compared to [foerster2016learning], CommNet allows multiple rounds of communication between agents, and the number of agents can differ across episodes. The performance of CommNet is compared with independent learners, a fully connected baseline, and discrete communication on a traffic junction environment and the Combat game from [sukhbaatar2015mazebase].
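A minimal sketch of CommNet's communication rounds is below; the shapes are assumed, and a single tanh linear layer stands in for the paper's per-round module:

```python
import numpy as np

# In each round, every agent's hidden state is combined with the mean of
# the other agents' hidden states through a layer shared by all agents.

rng = np.random.default_rng(5)
n_agents, dim, rounds = 4, 3, 2
H = rng.normal(size=(n_agents, dim))          # h_i^0: encoded observations
W_h = rng.normal(size=(dim, dim)) * 0.5       # shared weights for hidden input
W_c = rng.normal(size=(dim, dim)) * 0.5       # shared weights for communication

for _ in range(rounds):
    # c_i = mean of the other agents' hidden states
    C = (H.sum(axis=0, keepdims=True) - H) / (n_agents - 1)
    H = np.tanh(H @ W_h + C @ W_c)            # f(h_i, c_i), shared by all agents

print(H.shape)   # (4, 3): one hidden state per agent, later decoded into actions
```

Because the averaging is symmetric and the weights are shared, the model is indifferent to agent ordering and to the number of agents, which is what lets the agent count vary across episodes.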

To extend CommNet, [hoshen2017vain] proposes the Vertex Attention Interaction Network (VAIN), which adds an attention vector to learn the importance weight of each message. Then, instead of concatenating the messages, their weighted sum is used to take the action. VAIN works well when the interactions are sparse, i.e., when only a few agents affect each agent. They compare their solution with CommNet over several environments.

In [peng2017multiagent], the authors introduce a bi-directional communication network (BiCNet) using recurrent neural networks, such that heterogeneous agents can communicate with different sets of parameters. A multi-agent vectorized version of the AC algorithm is then proposed for a combat game. In particular, there exist two vectorized networks, an actor and a critic, which are shared among all agents, and each component of the vector represents an agent. The policy network takes the shared observation together with the local information and returns the actions of all agents in the network. The bi-directional recurrent network is designed to also serve as a local memory; therefore, each individual agent is capable of maintaining its own internal state while sharing the information with its neighbors. In each iteration of the algorithm, the gradients of both networks are calculated and the weights are updated accordingly using the Adam algorithm. In order to reduce the variance, they apply the deterministic off-policy AC algorithm [silver2014deterministic]. The proposed algorithm is applied to the multi-agent StarCraft combat game [samvelyan19smac], and it is shown that BiCNet is able to discover several effective ways to collaborate during the game.

[singh2018learning] considers the MARL problem in which each agent has a local reward and a local observation. An algorithm called Individualized Controlled Continuous Communication Model (IC3Net) is proposed to learn *what and when to communicate*, which can be applied to cooperative, competitive, and semi-cooperative environments (semi-cooperative environments are those in which each agent pursues its own goal while all agents also want to maximize a common goal).
IC3Net allows multiple continuous communication cycles, and in each cycle uses a gating mechanism to decide whether to communicate or not. The local observations are encoded and passed to an LSTM model whose weights are shared among the agents, and the hidden state of the LSTM for each agent in each time step is used to obtain the policy. A Softmax function over a linear transformation of the hidden state returns a binary action that decides whether to communicate or not. Denoting by o_t^j the observation, c_t^j the communication vector, g_t^j the binary gating action, and h_t^j the hidden state of agent j at time t, the updates are:

h_{t+1}^j, s_{t+1}^j = LSTM(e(o_t^j) + c_t^j, h_t^j, s_t^j), (22)

a_t^j = π(h_t^j), (23)

g_{t+1}^j = Softmax(f^g(h_t^j)), (24)

c_{t+1}^j = (1/(J-1)) C Σ_{j'≠j} h_{t+1}^{j'} ⊙ g_{t+1}^{j'}, (25)

in which s_t^j is the cell state of the LSTM cell, C is a linear transformation, e(·) is an embedding function, J is the number of agents, and ⊙ denotes element-wise multiplication. The policy π and the gating function f^g are trained using the REINFORCE algorithm [williams1992simple]. In order to analyze the performance of IC3Net, predator-prey, traffic junction [sukhbaatar2016learning], and StarCraft explore and combat tasks [samvelyan19smac] are considered. The results are compared with CommNet [sukhbaatar2016learning], a no-communication model, and a no-communication model with only a global reward.
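The gating step can be sketched as follows; the shapes, the hard binary gates, and the identity transform are illustrative assumptions:

```python
import numpy as np

# Each agent's binary gate decides whether its hidden state enters the
# averaged message; a gate of 0 silences that agent for this round.

rng = np.random.default_rng(6)
n_agents, dim = 3, 4
H = rng.normal(size=(n_agents, dim))     # LSTM hidden states h_j
g = np.array([1.0, 0.0, 1.0])            # sampled binary gating actions g_j
C_mat = np.eye(dim)                      # linear transform C (identity here)

gated = H * g[:, None]                   # silenced agents contribute zeros
# message to agent j: average of the *other* agents' gated hidden states
msg = (gated.sum(axis=0, keepdims=True) - gated) / (n_agents - 1)
msg = msg @ C_mat

print(np.allclose(msg[1], (H[0] + H[2]) / 2))   # agent 1 hears agents 0 and 2
```

Agent 1, whose own gate is closed, still receives the average of the open agents' states; closing a gate only removes that agent's outgoing contribution.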

In [jaques2018intrinsic], the authors aim to avoid centralized learning in multi-agent RL problems where each agent observes a local state, takes a local action, and receives a local reward from the environment. The key idea is to define an intrinsic reward for influencing the other agents' actions. In particular, each agent simulates the potential actions that it could take and measures their effect on the behavior of the other agents; actions with a higher effect on the other agents' actions are rewarded more. Following this idea, the reward of each agent is a weighted combination of its environment reward and a causal influence reward, where the influence term is computed by measuring the KL divergence between another agent's policy conditioned on the agent's action and that agent's marginal policy with the action marginalized out:

c_t^k = Σ_{j≠k} D_KL[ p(a_t^j | a_t^k, s_t^j) ‖ Σ_{ã_t^k} p(a_t^j | ã_t^k, s_t^j) p(ã_t^k | s_t^j) ], (26)

In order to measure the influence reward, two different scenarios are considered: (i) centralized training, in which each agent observes the probability of another agent's action for a given counterfactual, and (ii) modeling the other agents' behavior. The first case can be handled directly by equation (26). In the second case, each agent learns a model of the other agents through a separate neural network, trained on the history of observed actions with a cross-entropy loss. The proposed algorithm is analyzed on the harvest and clean-up environments and is compared with an A3C baseline and a baseline which allows the agents to communicate with each other. This work is partly related to the *Theory of Mind*, which tries to explain the effect of agents on each other in multi-agent settings; see [rabinowitz2018machine] for more details.
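One term of this influence reward can be computed as follows; the distributions below are made up for illustration:

```python
import numpy as np

# KL between agent j's policy given agent k's actual action and j's
# marginal policy with k's action marginalized out (one term of (26)).

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

# p_j_given_ak[a_k] = p(a_j | a_k, s_j), over 2 possible actions of agent j
p_j_given_ak = np.array([[0.9, 0.1],    # if k takes action 0
                         [0.2, 0.8]])   # if k takes action 1
p_ak = np.array([0.5, 0.5])             # counterfactual distribution p(a_k | s_j)

marginal = p_ak @ p_j_given_ak          # sum_ak p(a_j | a_k, s_j) p(a_k | s_j)
a_k_taken = 0
influence = kl(p_j_given_ak[a_k_taken], marginal)
print(round(influence, 3))              # 0.293: k's action shifted j's policy
```

A near-zero value would mean agent k's action barely changed agent j's policy, so only genuinely influential actions earn the intrinsic bonus.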

[das2018tarmac] proposes an algorithm, called TarMAC, to learn to communicate in a multi-agent setting, where the agents learn both *what to send* and *to which agent to communicate*.
They show that the learned policy is interpretable and can be extended to competitive and mixed environments. To make sure that a message gets enough attention from the intended agents, each agent also encodes into the continuous message some information defining the type of agent the message is intended for; this way, the receiving agent can measure the relevance of the message to itself.
The proposed algorithm follows the centralized-training decentralized-execution paradigm. Each agent accesses a local observation and observes the messages of all agents, and the goal is to maximize the team reward, while the discrete actions are executed jointly in the environment.
Each agent sends a message including two parts, a signature and a value: the signature encodes information about the intended recipient, and the value carries the content. Each recipient receives all messages and learns a query vector; multiplying its query with the signatures of all senders yields its attention weights over the incoming messages. Finally, the aggregated message is the weighted sum of the message values, with the obtained attention values as weights. This aggregated message and the local observation are the input of the local actor. Then, a regular actor-critic model is trained using a centralized critic: the actor is a single GRU layer, and the critic uses the joint actions and the hidden state to obtain the Q-value. Also, the actors share the policy parameters to speed up the training, and multi-round communication is used to increase efficiency. The proposed method (along with no-attention and no-communication versions of the algorithm) is evaluated on SHAPES [andreas2016neural], a traffic junction environment in which they control the cars, and House3D [wu2018building], and is compared with CommNet [sukhbaatar2016learning] where possible.
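The signature/query aggregation amounts to dot-product attention over senders; all shapes and names below are illustrative assumptions:

```python
import numpy as np

# Each sender emits a signature k_i and a value v_i; each receiver j
# learns a query q_j, and its received message is the
# softmax(q_j . k_i)-weighted sum of the values.

rng = np.random.default_rng(7)
n_agents, d_k, d_v = 3, 4, 5
K = rng.normal(size=(n_agents, d_k))   # signatures
V = rng.normal(size=(n_agents, d_v))   # message values
Q = rng.normal(size=(n_agents, d_k))   # receiver queries

scores = Q @ K.T                       # (receiver j, sender i) relevance
attn = np.exp(scores)
attn /= attn.sum(axis=1, keepdims=True)
aggregated = attn @ V                  # one aggregated message per receiver

print(aggregated.shape)                      # (3, 5)
print(np.allclose(attn.sum(axis=1), 1.0))    # weights normalized per receiver
```

Because the receiver's weights depend on how well each signature matches its query, senders can effectively target specific recipients by shaping their signatures.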

All the papers discussed so far in this section assume the existence of a communication message and basically allow each agent to learn what to send. In a different approach, [jiang2018learning] fixes the message type and only allows each agent to decide whether to start a communication with the agents in its receptive field. They consider a problem in which each agent observes a local observation, takes a local action, and receives a local reward. The key idea is that when there is a large number of agents, sharing the information of all agents might not be helpful, since it is hard for an agent to differentiate the valuable information from the rest of the shared information; in this case, communication might even impair learning. To address this issue, an algorithm called ATOC is proposed, in which an attention unit learns when to integrate the shared information from the other agents. In ATOC, each agent encodes its local observation through an MLP. In every time step, the agent runs an attention unit on the encoded observation to determine whether to communicate with the agents in its receptive field. If it decides to communicate, a communication group with a bounded number of collaborators is created, and the group does not change for a fixed number of time-steps. Each agent in this group sends its encoded information to the communication channel, in which the information is combined and an integrated message is returned to each agent in the group. Then, each agent merges the integrated message with its own encoded observation and passes the result through an MLP to obtain its action. Note that one agent can be included in two communication groups, and as a result information can be transferred among a larger number of agents. The actor and critic models are trained in the same way as in the DDPG model, and the gradients of the actor are also passed through the communication channel, when relevant. Also, the difference between the Q-values with and without communication is used to train the attention unit.
Some numerical experiments on the particle environment are presented, and ATOC is compared with CommNet, BiCNet, and DDPG (ATOC without any communication). Their experiments involve at least 50 agents, and MADDPG was not included as a benchmark.

## 7 Emerging Areas

In this section, we discuss a few recent papers which either combine the approaches in section
