1 Introduction
With the success of RL in the single-agent domain Mnih et al. (2015); Lillicrap et al. (2015), MARL is being actively studied and applied to real-world problems such as traffic control systems and connected self-driving cars, which can be modeled as multi-agent systems requiring coordinated control Li et al. (2019); Andriotis and Papakonstantinou (2019). The simplest approach to MARL is independent learning, which trains each agent independently while treating the other agents as a part of the environment. One such example is independent Q-learning (IQL) Tan (1993), an extension of Q-learning to the multi-agent setting. However, this approach suffers from the non-stationarity of the environment. A common solution to this problem is to use a fully-centralized critic in the framework of centralized training with decentralized execution (CTDE) OroojlooyJadid and Hajinezhad (2019); Rashid et al. (2018). For example, MADDPG Lowe et al. (2017) uses a centralized critic to train a decentralized policy for each agent, and COMA Foerster et al. (2018) uses a common centralized critic to train all decentralized policies. However, these approaches assume that the decentralized policies are independent and hence that the joint policy is the product of each agent's policy. Such non-correlated factorization of the joint policy limits the agents' ability to learn coordinated behavior because it neglects the influence of other agents Wen et al. (2019); de Witt et al. (2019). Yet learning coordinated behavior is one of the fundamental problems in MARL Wen et al. (2019); Liu et al. (2020).
In this paper, we introduce a new framework for MARL to learn coordinated behavior under CTDE without the previously-used explicit dependency or communication in the execution phase. Our framework is based on regularizing the expected cumulative reward with the mutual information among agents' actions, induced by injecting a latent variable. The intuition behind the proposed framework is that agents can coordinate with other agents if they know what the other agents will do with high probability, and this dependence between action policies can be captured by mutual information. High mutual information among actions means low uncertainty about other agents' actions. Hence, by regularizing the objective of the expected cumulative reward with the mutual information among agents' actions, we can coordinate the behaviors of agents implicitly, without explicit dependence enforcement. However, the optimization problem with the proposed objective function has several difficulties, since we consider decentralized policies without explicit dependence or communication in the execution phase. In addition, optimizing mutual information is difficult because of the intractable conditional distribution. We circumvent these difficulties by exploiting the property of the latent variable injected to induce mutual information and by applying a variational lower bound on the mutual information. With the proposed framework, we apply policy iteration with redefined value functions to obtain the VM3-AC algorithm for MARL with coordinated behavior under CTDE.
Due to space limitation, related works are provided in Appendix A.
2 Background
We consider a Markov game Littman (1994), which is an extension of the Markov Decision Process (MDP) to the multi-agent setting. An $N$-agent Markov game is defined by an environment state space $\mathcal{S}$, action spaces $\mathcal{A}_1, \ldots, \mathcal{A}_N$ for the $N$ agents, a state transition probability $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0,1]$, where $\mathcal{A} = \mathcal{A}_1 \times \cdots \times \mathcal{A}_N$ is the joint action space, and a reward function. At each time step $t$, agent $i$ executes action $a_t^i$ based on state $s_t$. The actions of all agents yield the next state $s_{t+1}$ according to $P$ and a shared common reward $r_t$ under the assumption of fully-cooperative MARL. The discounted return is defined as $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$, where $\gamma \in [0,1)$ is the discount factor. We assume CTDE, which incorporates the resource asymmetry between the training and execution phases and is widely considered in MARL Lowe et al. (2017); Iqbal and Sha (2018); Foerster et al. (2018). Under CTDE, each agent can access all information, including the environment state and the observations and actions of the other agents, in the training phase, whereas the policy of each agent can be conditioned only on its own action-observation history or observation in the execution phase. For a given joint policy $\pi$, the goal of fully cooperative MARL is to find the optimal joint policy that maximizes the objective $J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$.
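As a small illustration (ours, not part of the paper's method), the discounted return defined above can be computed for a finite reward sequence with a single backward pass:

```python
def discounted_return(rewards, gamma=0.99):
    """Return [G_0, G_1, ...] where G_t = sum_k gamma^k * r_{t+k}."""
    g = 0.0
    returns = []
    for r in reversed(rewards):
        g = r + gamma * g          # G_t = r_t + gamma * G_{t+1}
        returns.append(g)
    return list(reversed(returns))
```

For example, `discounted_return([1.0, 1.0, 1.0], gamma=0.5)` returns `[1.75, 1.5, 1.0]`.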
Maximum Entropy RL The goal of maximum entropy RL is to find an optimal policy that maximizes the entropyregularized objective function, given by
$J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t \left( r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t)) \right)\right]$  (1)
It is known that this objective encourages the policy to explore widely in the state and action spaces and helps the policy avoid converging to a poor local optimum. Soft actor-critic (SAC), which is based on the maximum entropy RL principle, approximates soft policy iteration within the actor-critic framework. SAC outperforms other deep RL algorithms in many continuous-action tasks Haarnoja et al. (2018).
We can simply extend SAC to the multi-agent setting in the manner of independent learning. Each agent trains a decentralized policy using a decentralized critic to maximize the weighted sum of the cumulative return and the entropy of its policy. We refer to this method as independent SAC (ISAC). Adopting the framework of CTDE, we can replace the decentralized critic with a centralized critic that incorporates the observations and actions of all agents. We refer to this method as multi-agent soft actor-critic (MASAC). Both ISAC and MASAC are used as baselines in the experiment section.
3 The Proposed Maximum Mutual Information Framework
For the theoretical development in this section, we assume that the environment is fully observable, i.e., each agent can observe the environment state; we consider partially observable environments for the practical algorithm construction under CTDE in the next section.
Under the proposed MMI framework, we aim to find the policy that maximizes the mutual information between actions in addition to the cumulative return. Thus, the MMI-regularized objective function for the joint policy $\pi$ is given by
$J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t \left( r(s_t, a_t) + \alpha \sum_{i \neq j} I(A_t^i; A_t^j \mid s_t) \right)\right]$  (2)
where $I(\cdot\,;\cdot)$ denotes mutual information and $\alpha$ is the temperature parameter that controls the relative importance of the mutual information against the reward.
As aforementioned, we assume decentralized policies and want the decentralized policies to exhibit coordinated behavior. Furthermore, we want coordinated behavior without the explicit dependency previously used to enforce it. Here, explicit dependency Jaques et al. (2018) means that for two agents $i$ and $j$, the action of agent $i$ is drawn first and then the action of agent $j$ is drawn conditioned on it, i.e., the input to the policy function of agent $j$ explicitly requires the information about the action of agent $i$ for coordinated behavior. By regularization with mutual information in the proposed objective function (2), the policy of each agent is implicitly encouraged to coordinate with other agents' policies without explicit dependency, by reducing the uncertainty about other agents' policies. This can be seen as follows: mutual information is expressed in terms of the entropy and the conditional entropy as
$I(A^i; A^j) = \mathcal{H}(A^j) - \mathcal{H}(A^j \mid A^i)$  (3)
If knowledge of $A^i$ does not provide any information about $A^j$, the conditional entropy reduces to the unconditional entropy, i.e., $\mathcal{H}(A^j \mid A^i) = \mathcal{H}(A^j)$, and the mutual information becomes zero. Maximizing mutual information is thus equivalent to minimizing the uncertainty about other agents' policies conditioned on the agent's own policy, which can lead the agent to learn coordinated behavior based on the reduced uncertainty about other agents' policies.
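The identity in (3) can be checked numerically for small discrete distributions; the following sketch (illustrative, not part of the algorithm) computes $I(X;Y) = \mathcal{H}(X) - \mathcal{H}(X \mid Y)$ for an independent and a perfectly coordinated joint action distribution:

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete probability vector."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mutual_information(joint):
    """I(X;Y) = H(X) - H(X|Y) for a discrete joint distribution p(x, y)."""
    px = joint.sum(axis=1)                    # marginal p(x)
    py = joint.sum(axis=0)                    # marginal p(y)
    h_x_given_y = sum(py[j] * entropy(joint[:, j] / py[j])
                      for j in range(len(py)) if py[j] > 0)
    return entropy(px) - h_x_given_y

# Independent actions: observing one action says nothing about the other.
indep = np.outer([0.5, 0.5], [0.5, 0.5])
# Perfectly coordinated actions: one action fully determines the other.
coord = np.array([[0.5, 0.0], [0.0, 0.5]])
```

Here `mutual_information(indep)` is 0, while `mutual_information(coord)` equals $\log 2$, the full entropy of one action: maximal coordination leaves zero residual uncertainty.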
However, direct optimization of the objective function (2) is not easy. Fig. 1(a) shows the causal diagram of the system model described in Section 2 in the case of two agents with decentralized policies. Since we consider the case of no explicit dependency, the two policy distributions can be expressed as $\pi^1(a^1 \mid s)$ and $\pi^2(a^2 \mid s)$. Then, for a given environment state $s$ observed by both agents, $A^1$ and $A^2$ are conditionally independent and the mutual information $I(A^1; A^2 \mid s) = 0$. Thus, the MMI objective (2) reduces to the standard MARL objective of only the accumulated return. In the following subsections, we present our approach to circumvent this difficulty and implement the MMI framework and its operation under CTDE.
3.1 Inducing Mutual Information Using Latent Variable
First, in order to induce mutual information among agents' policies under the system causal diagram shown in Fig. 1(a), we introduce a latent variable $z$. For illustration, consider the new diagram with the latent variable in Fig. 1(b). Suppose that the latent variable has a prior distribution $p(z)$, and assume that both actions $a^1$ and $a^2$ are generated from the observed random variable $s$ and the unobserved random variable $z$. Then, the policy of agent $i$ is given by the marginal distribution $\pi^i(a^i \mid s) = \int \pi^i(a^i \mid s, z)\, p(z)\, dz$. With the unobserved latent random variable $z$, the conditional independence does not hold for $A^1$ and $A^2$, and the mutual information can be positive, i.e., $I(A^1; A^2 \mid s) > 0$. Hence, we can induce mutual information between actions without explicit dependence by introducing the latent variable. The same construction carries over to the general case of $N$ agents. Note that in this case we inject a common latent variable into all agents' policies.
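A tiny numerical experiment (ours, not from the paper) makes the mechanism concrete: two decentralized policies that each add independent noise to a function of the state $s$ and a shared latent $z$ produce actions that are strongly correlated marginally, yet independent once $z$ is given:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
s = 1.0                                   # a fixed common state, for illustration

z = rng.normal(size=n)                    # shared latent z ~ N(0, 1)
a1 = s + z + 0.1 * rng.normal(size=n)     # agent 1: a^1 = f1(s, z) + own noise
a2 = -s + z + 0.1 * rng.normal(size=n)    # agent 2: a^2 = f2(s, z) + own noise

# Marginalized over z, the two actions are strongly correlated ...
rho_marginal = np.corrcoef(a1, a2)[0, 1]
# ... but conditioned on z, only the independent noise remains.
rho_given_z = np.corrcoef(a1 - z, a2 - z)[0, 1]
```

With these settings `rho_marginal` is close to 1 while `rho_given_z` is close to 0, mirroring the claim that the latent variable, not any explicit dependency, is what couples the decentralized policies.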
3.2 Variational Bound of Mutual Information
Even with the latent variable inducing nontrivial mutual information, it is difficult to compute the mutual information directly. Note that we need the conditional distribution of $A^j$ given $A^i$ to compute the mutual information, as seen in (4), but this conditional distribution is difficult to obtain. To circumvent this difficulty, we use a variational distribution $q(a^j \mid a^i, s)$ to approximate the true conditional distribution and derive a lower bound on the mutual information as
$I(A^i; A^j) = \mathcal{H}(A^j) - \mathcal{H}(A^j \mid A^i) \geq \mathcal{H}(A^j) + \mathbb{E}_{a^i, a^j}\left[\log q(a^j \mid a^i, s)\right]$  (4)
where the inequality holds because the KL divergence is always nonnegative. The lower bound becomes tight when $q$ approximates the true conditional distribution well. Using the symmetry of mutual information, we can rewrite the lower bound as
$I(A^i; A^j) \geq \frac{1}{2}\left( \mathcal{H}(A^i) + \mathcal{H}(A^j) + \mathbb{E}\left[\log q(a^j \mid a^i, s)\right] + \mathbb{E}\left[\log q(a^i \mid a^j, s)\right] \right)$  (5)
Then, we can maximize this lower bound of the mutual information by using the tractable approximation $q$.
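To illustrate the bound (our toy example, not the paper's experiment), take two jointly Gaussian actions with correlation $\rho$, for which the mutual information is known in closed form, $-\tfrac{1}{2}\log(1-\rho^2)$; a Monte Carlo estimate of the bound with a Gaussian variational distribution $q$ is tight when $q$ matches the true conditional and looser otherwise:

```python
import numpy as np

rng = np.random.default_rng(1)
rho, n = 0.8, 500_000

a_i = rng.normal(size=n)                                    # agent i's action
a_j = rho * a_i + np.sqrt(1 - rho**2) * rng.normal(size=n)  # correlated action

true_mi = -0.5 * np.log(1 - rho**2)   # closed form for a bivariate Gaussian

def gauss_logpdf(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def variational_bound(q_var):
    """I(A_i; A_j) >= H(A_j) + E[log q(a_j | a_i)], with q = N(rho * a_i, q_var)."""
    h_aj = 0.5 * np.log(2 * np.pi * np.e)   # differential entropy of N(0, 1)
    return h_aj + gauss_logpdf(a_j, rho * a_i, q_var).mean()

tight = variational_bound(1 - rho**2)   # q equals the true conditional
loose = variational_bound(0.9)          # mismatched variance -> looser bound
```

Here `tight` matches `true_mi` up to Monte Carlo error, while `loose` stays strictly below it, as the KL-divergence argument behind (4) predicts.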
3.3 Modified Policy Iteration
In this subsection, we develop policy iteration for the MMI framework. First, we replace the original MMI objective function (2) with the following tractable objective function based on the variational lower bound (5):
(6) 
where $q$ is the variational distribution that approximates the conditional distribution. Then, we determine the individual objective function for agent $i$ as the sum of the terms in (6) associated with agent $i$'s policy or action, given by
(7) 
where $\alpha$ is the temperature parameter. Note that maximizing the term (a) in (7) implies that each agent maximizes the weighted sum of the policy entropy and the return, which can be interpreted as an extension of maximum entropy RL to the multi-agent setting. On the other hand, maximizing the term (b) means that we update the policies so that agent $i$ well predicts agent $j$'s action by the first term in (b) and agent $j$ well predicts agent $i$'s action by the second term in (b). Thus, the objective function (7) can be interpreted as the maximum entropy MARL objective combined with predictability enhancement for other agents' actions. Note that predictability is reduced when actions are uncorrelated. Since the policy entropy term enhances individual exploration by the maximum entropy principle Haarnoja et al. (2018) and the term (b) in (7) enhances predictability, or correlation, among agents' actions, the proposed objective function (7) can be considered one implementation of the concept of correlated exploration in MARL Mahajan et al. (2019).
Now, in order to learn a policy that maximizes the objective function (7), we modify the policy iteration of standard RL. For this, we redefine the state and state-action value functions for each agent as follows:
(8)  
(9) 
Then, the modified Bellman operator corresponding to the value functions in (8) and (9) is given by
(10) 
where
(11) 
In the policy evaluation step, we compute the value functions defined in (8) and (9) by repeatedly applying the modified Bellman operator to an arbitrary initial function.
Lemma 1 (Variational Policy Evaluation). Repeated application of the modified Bellman operator to an arbitrary initial function converges to the value functions defined in (8) and (9).
Proof. See Appendix B.
In the policy improvement step, we update the policy and the variational distribution by using the value function evaluated in the policy evaluation step. Here, each agent updates its policy and variational distribution while keeping other agents’ policies fixed as follows:
(12) 
Then, we have the following lemma regarding the improvement step.
Lemma 2.
(Variational Policy Improvement). Let the updated policy and variational distribution be obtained from (12). Then, the resulting value function is no smaller than the old one for all states and actions.
Proof. See Appendix B.
The modified policy iteration is defined as applying the variational policy evaluation and variational policy improvement steps in an alternating manner. Each agent trains its policy, critics, and variational distribution to maximize its objective function (7).
4 Algorithm Construction
Summarizing the development above, we now propose the variational maximum mutual information multi-agent actor-critic (VM3-AC) algorithm, which can be applied to continuous and partially observable multi-agent environments under CTDE. The overall operation of VM3-AC is shown in Fig. 2. Under CTDE, each agent's policy is conditioned only on its local observation, and the centralized critics are conditioned on either the environment state or the observations of all agents, depending on the situation Lowe et al. (2017). Let $x$ denote either the environment state or the observations of all agents, whichever is used. In order to deal with large continuous state-action spaces, we adopt deep neural networks to approximate the required functions. For each agent, we parameterize the variational distribution, the state-value function, two action-value functions, and the policy with separate sets of network parameters. We assume a standard normal distribution for the latent variable $z$, which plays a key role in inducing coordination among agents' policies, i.e., $z \sim \mathcal{N}(0, I)$, and we further assume that the variational distribution is Gaussian with constant variance, i.e., $q(a^j \mid x, a^i) = \mathcal{N}(\mu_q, \sigma^2 I)$, where $\mu_q$ is the mean of the distribution.

4.1 Centralized Training
As aforementioned, the policy is the distribution marginalized over the latent variable $z$, where the policies of all agents take the same $z$ generated from $\mathcal{N}(0, I)$ as an input variable. We perform the required marginalization based on Monte Carlo numerical expectation as follows:
(13)
The value functions are updated based on the modified Bellman operator defined in (10) and (11). The state-value function is trained to minimize the following loss function:
(14)
where $\mathcal{D}$ is the replay buffer that stores the transitions, and the minimum of the two action-value functions is used to prevent the overestimation problem Fujimoto et al. (2018). The two action-value functions are updated by minimizing the loss
(15)
where
(16)
and the target value network is updated by the exponential moving average method. We implement the reparameterization trick to estimate the stochastic gradient of the policy loss, so the action of agent $i$ is given as a deterministic function of its observation, the sampled noise, and the latent variable. The policy and the variational distribution of agent $i$ are trained to minimize the following policy improvement loss:
(17)
where
(18)
Since the approximation by the variational distribution is not accurate in the early stage of training and learning via the term (a) in (18) is more susceptible to this approximation error, we propagate the gradient only through the term (b) in (18) to stabilize learning. Note that minimizing the negative log-likelihood of the variational distribution is equivalent to minimizing the mean-squared error between its mean and the actual action, due to our constant-variance Gaussian assumption.
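The MSE equivalence noted above is easy to verify numerically; in this sketch (ours, with an arbitrary constant variance and toy action samples), the Gaussian negative log-likelihood and the mean-squared error are minimized by the same mean:

```python
import numpy as np

sigma2 = 0.25                                  # assumed constant variance
actions = np.array([0.1, 0.4, -0.2, 0.3])      # toy "other-agent" action samples

def neg_log_likelihood(mu):
    """- sum_t log N(a_t; mu, sigma2), with sigma2 held fixed."""
    return 0.5 * np.sum(np.log(2 * np.pi * sigma2)
                        + (actions - mu) ** 2 / sigma2)

def mse(mu):
    return np.mean((actions - mu) ** 2)

grid = np.linspace(-1.0, 1.0, 2001)
mu_nll = grid[np.argmin([neg_log_likelihood(m) for m in grid])]
mu_mse = grid[np.argmin([mse(m) for m in grid])]
# Both criteria differ only by an additive constant and a positive scale
# factor, so they share the same minimizer: the sample mean of the actions.
```

Because the variance is held constant, the log-likelihood is an affine function of the squared error, which is exactly why the two losses have identical minimizers.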
4.2 Decentralized Execution
In the centralized training phase, we pick the actions by using a Monte Carlo expectation based on the common latent variable generated from a zero-mean Gaussian distribution, as seen in (13). We can achieve the same operation in the decentralized execution phase. This can be done by equipping all agents with the same Gaussian random sequence generator and distributing the same seed to this generator only once at the beginning of the execution phase. This eliminates the need for communication to share the latent variable. In fact, this way of sharing can be applied to the centralized training phase too. The proposed VM3-AC algorithm is summarized in Appendix C.
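The seed-sharing scheme can be sketched as follows (illustrative; the class, seed value, and dimensions are ours). Each agent holds its own copy of a Gaussian generator initialized with the shared seed, so all agents draw identical latent samples at every step without exchanging messages:

```python
import numpy as np

SHARED_SEED = 1234   # distributed to every agent once, before execution starts
M = 5                # Monte Carlo latent samples per time step
LATENT_DIM = 4       # dimension of z (a hyperparameter)

class Agent:
    """Each agent owns an identical, independently seeded latent generator."""
    def __init__(self, seed):
        self.latent_rng = np.random.default_rng(seed)

    def sample_latents(self):
        # z ~ N(0, I), drawn M times for the Monte Carlo marginalization.
        return self.latent_rng.normal(size=(M, LATENT_DIM))

agents = [Agent(SHARED_SEED) for _ in range(3)]
# At every step, all agents generate the same latents with no communication.
draws = [agent.sample_latents() for agent in agents]
```

Since each generator advances in lockstep (one call per time step), the agents stay synchronized for the whole episode after the single initial seed distribution.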
5 Experiment
In this section, we provide numerical results to evaluate VM3-AC. Since we focus on the continuous action-space case in this paper, we considered four baselines relevant to this case: 1) MADDPG Lowe et al. (2017), an extension of DDPG that uses a centralized critic to train a decentralized policy for each agent; 2) ISAC, an example of independent learning where each agent learns a policy based on SAC while treating the other agents as a part of the environment; 3) MASAC, an extension of ISAC with a centralized critic instead of a decentralized critic; and 4) multi-agent actor-critic (MAAC), a variant of MASAC, i.e., the same algorithm as MASAC but without the entropy term. All algorithms used neural networks to approximate the required functions. In all algorithms except ISAC, we used the neural network architecture proposed in Kim et al. (2019), which emphasizes the agent's own observation and action, for the centralized critics. For agent $i$, we used a shared neural network for the variational distributions, and the network takes a one-hot vector indicating the agent index as input. Experimental details are given in Appendix E. We evaluated the proposed algorithm and the baselines in three multi-agent environments with varying numbers of agents: multi-walker Gupta et al. (2017), predator-prey Lowe et al. (2017), and cooperative navigation Lowe et al. (2017). The detailed setting of each environment is provided in Appendix D.
[Figure 3 panels: (a) MW (N=3), (b) MW (N=4), (c) PP (N=2), (d) PP (N=3), (e) PP (N=4), (f) CN (N=3)]
5.1 Result
Fig. 3 shows the learning curves for the three considered environments with different numbers of agents. The y-axis denotes the average of all agents' rewards averaged over 7 random seeds, and the x-axis denotes the time step. The hyperparameters, including the temperature parameter $\alpha$ and the dimension of the latent variable, are provided in Appendix E. As shown in Fig. 3, VM3-AC outperforms the baselines in the considered environments. Especially in the multi-walker environment, the proposed VM3-AC algorithm shows a large performance gain. This is because the agents in the multi-walker environment especially need to learn coordinated behavior to obtain high rewards. Hence, we can see that the proposed MMI framework improves performance in complex multi-agent tasks requiring high-quality coordination. The performance gap between VM3-AC and MASAC indicates the effect of regularization with the variational term (b) of the objective function (7): recall that VM3-AC without the term (b) reduces to MASAC. Recall also that MASAC without entropy regularization reduces to MAAC, and MASAC with decentralized critics instead of centralized critics reduces to ISAC. Hence, the fact that MASAC outperforms ISAC and MAAC shows that entropy regularization and centralized critics are also important in multi-agent tasks. Note that VM3-AC also maximizes the entropy through the term (a) of the objective function (7). Indeed, it is seen that regularization with the variational term in addition to the policy entropy enhances coordinated behavior in MARL.
[Figure 4 panels: (a) MW (N=3), (b) MW (N=4), (c) MW (N=3), (d) MW (N=4)]
Due to the space limitation, further results comparing with the recent algorithm MAVEN Mahajan et al. (2019) are provided in Appendix F. It is seen there that VM3-AC significantly outperforms MAVEN.
5.2 Ablation Study
In this section, we provide an ablation study on the major technique and hyperparameter of VM3-AC: 1) the latent variable, and 2) the temperature parameter $\alpha$.
Latent variable: The role of the latent variable is to induce mutual information among actions and promote coordinated behavior. We compared VM3-AC and VM3-AC without the latent variable (implemented by fixing the latent variable to zero) in the multi-walker environment with $N=3$ and $N=4$. In both cases, VM3-AC yields better performance than VM3-AC without the latent variable, as shown in Fig. 4(a) and 4(b).
Temperature parameter $\alpha$: The role of the temperature parameter is to control the relative importance of the mutual information against the reward. We evaluated VM3-AC by varying $\alpha$ in the multi-walker environment with $N=3$ and $N=4$. Fig. 4(c) and 4(d) show that VM3-AC with an appropriately chosen intermediate temperature value yields good performance.
6 Conclusion
In this paper, we have proposed the MMI framework for MARL, which enhances coordinated learning under CTDE by regularizing the cumulative return with the mutual information among actions. The MMI framework is implemented practically by using a latent variable, a variational technique, and policy iteration. Numerical results show that the derived algorithm, named VM3-AC, outperforms the baselines, especially in multi-agent tasks requiring high coordination among agents. Furthermore, the MMI framework can be combined with other techniques for cooperative MARL, such as value decomposition Rashid et al. (2018), to yield better performance.
Broader Impact
The research topic of this paper is multi-agent reinforcement learning (MARL). MARL is an important branch of the field of reinforcement learning. MARL models many practical control problems in the real world, such as smart factories, coordinated robots, and connected self-driving cars. With the advance of knowledge and technologies in MARL, solutions to such real-world problems can become better and more robust. For example, if the control of self-driving cars is coordinated among several nearby cars, the safety of self-driving will be much improved. So, we believe that research advances in this field can benefit our safety and future society.
References
Managing engineering systems with large state and action spaces through deep reinforcement learning. Reliability Engineering & System Safety 191, pp. 106483.
Multi-agent common knowledge reinforcement learning. In Advances in Neural Information Processing Systems, pp. 9924–9935.
Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2137–2145.
Counterfactual multi-agent policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence.
Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477.
Cooperative multi-agent control using deep reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems, pp. 66–83.
Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
Actor-attention-critic for multi-agent reinforcement learning. arXiv preprint arXiv:1810.02912.
Social influence as intrinsic motivation for multi-agent deep reinforcement learning. arXiv preprint arXiv:1810.08647.
Message-dropout: an efficient training method for multi-agent deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6079–6086.
Crafting papers on machine learning. In Proceedings of the 17th International Conference on Machine Learning (ICML 2000), P. Langley (Ed.), Stanford, CA, pp. 1207–1216.
Efficient ridesharing order dispatching with mean field multi-agent reinforcement learning. In The World Wide Web Conference, pp. 983–994.
Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994, pp. 157–163.
Multi-agent interactions modeling with correlated policies. arXiv preprint arXiv:2001.03415.
Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pp. 6379–6390.
MAVEN: multi-agent variational exploration. In Advances in Neural Information Processing Systems, pp. 7611–7622.
Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533.
A review of cooperative multi-agent deep reinforcement learning. arXiv preprint arXiv:1908.03963.
Improving coordination in small-scale multi-agent deep reinforcement learning through memory-driven communication. arXiv preprint arXiv:1901.03887.
QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. arXiv preprint arXiv:1803.11485.
QTRAN: learning to factorize with transformation for cooperative multi-agent reinforcement learning. arXiv preprint arXiv:1905.05408.
Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296.
Multi-agent reinforcement learning: independent vs. cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning, pp. 330–337.
Probabilistic recursive reasoning for multi-agent reinforcement learning. arXiv preprint arXiv:1901.09207.
Coordinating multi-agent reinforcement learning with limited communication. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, pp. 1101–1108.
Structured exploration via hierarchical variational policy networks.
Appendix A: Related Work
For cooperative MARL, several approaches have been studied. One approach is value decomposition Sunehag et al. (2017); Rashid et al. (2018); Son et al. (2019). For example, QMIX Rashid et al. (2018) factorizes the joint action-value function into a combination of local action-value functions while imposing a monotonicity constraint. QMIX achieves state-of-the-art performance in complex discrete-action MARL tasks and has been widely used as a baseline in discrete-action environments. Since the focus of VM3-AC is on continuous-action environments, a direct comparison of VM3-AC to QMIX is not applicable. However, the basic concept of QMIX can also be applied to the MMI framework, and this remains future work.
Learning coordinated behavior in multi-agent systems has been studied extensively in the MARL community. To promote coordination, some previous works use communication among agents Zhang and Lesser (2013); Foerster et al. (2016); Pesce and Montana (2019). For example, Foerster et al. (2016) proposed the DIAL algorithm to learn a communication protocol that enables the agents to coordinate their behaviors. Jaques et al. (2018) proposed the social influence intrinsic reward, which is related to the mutual information between actions, to achieve coordination. Although the social influence algorithm increases the performance in challenging social dilemma environments, its limitation is that explicit dependency across actions is required and imposed in order to compute the intrinsic reward. As already mentioned, the MMI framework can be viewed as an indirect enhancement of correlated exploration. Correlated policies have been considered in several other works too: Liu et al. (2020) proposed explicit modeling of correlated policies for multi-agent imitation learning, and Wen et al. (2019) proposed a probabilistic recursive reasoning framework. By introducing a latent variable and a variational lower bound on mutual information, the proposed VM3-AC increases the correlation among policies without communication in the execution phase and without explicit dependency across agents' actions. As mentioned in the main paper, the proposed MMI framework can be interpreted as enhancing correlated exploration by increasing the entropy of an agent's own policy while decreasing the uncertainty about other agents' actions. Some previous works proposed other techniques to enhance correlated exploration Mahajan et al. (2019); Zheng and Yue (2018). For example, MAVEN addressed the poor exploration of QMIX by maximizing the mutual information between a latent variable and the observed trajectories Mahajan et al. (2019). However, MAVEN does not consider the correlation among policies. We compare the proposed VM3-AC with MAVEN, and the comparison results are given in Appendix F.
Appendix B: Variational Policy Evaluation and Policy Improvement
In the main paper, we defined the state and state-action value functions for each agent as follows:
(19)  
(20) 
Lemma 3.
(21) 
where
(22) 
Proof.
Define the mutual information augmented reward as
(23)  
(24)  
(25)  
(26) 
Then, we can apply the standard convergence results for policy evaluation. Define
(27) 
for . Then, the operator is a contraction.
(28)  
(29)  
(30)  
(31) 
Note that the operator has a unique fixed point by the contraction mapping theorem, and we define the fixed point as . Since
(32) 
we have
(33) 
and this implies
(34) 
∎
Lemma 4.
(Variational Policy Improvement). Let the updated policy and variational distribution be obtained from (35). Then, the resulting value function is no smaller than the old one for all states and actions.
(35)  
(36) 
Proof.
Let be determined as
(37)  
(38) 
Then, the following inequality holds:
(39)  
(40)  
(41) 
From the definition of the Bellman operator,
(42)  
(43)  
(44)  