With the success of RL in the single-agent domain Mnih et al. (2015); Lillicrap et al. (2015), MARL is being actively studied and applied to real-world problems such as traffic control systems and connected self-driving cars, which can be modeled as multi-agent systems requiring coordinated control Li et al. (2019); Andriotis and Papakonstantinou (2019). The simplest approach to MARL is independent learning, which trains each agent independently while treating the other agents as part of the environment. One example is independent Q-learning (IQL) Tan (1993), an extension of Q-learning to the multi-agent setting. However, this approach suffers from non-stationarity of the environment, since the other agents keep changing their policies during training. A common solution to this problem is to use a fully centralized critic in the framework of centralized training with decentralized execution (CTDE) OroojlooyJadid and Hajinezhad (2019); Rashid et al. (2018). For example, MADDPG Lowe et al. (2017) uses a centralized critic to train a decentralized policy for each agent, and COMA Foerster et al. (2018) uses a common centralized critic to train all decentralized policies. However, these approaches assume that the decentralized policies are independent, so that the joint policy is the product of the agents' individual policies. Such an uncorrelated factorization of the joint policy limits the agents' ability to learn coordinated behavior because it neglects the influence of other agents Wen et al. (2019); de Witt et al. (2019), even though learning coordinated behavior is one of the fundamental problems in MARL Wen et al. (2019); Liu et al. (2020).
In this paper, we introduce a new framework for MARL that learns coordinated behavior under CTDE without the explicit dependency or execution-phase communication used in previous work. Our framework regularizes the expected cumulative reward with the mutual information among agents' actions, which is induced by injecting a latent variable. The intuition is that agents can coordinate with one another if they know with high probability what the other agents will do, and that the dependence between action policies can be captured by mutual information: high mutual information among actions means low uncertainty about other agents' actions. Hence, by regularizing the expected cumulative reward with the mutual information among agents' actions, we can coordinate the agents' behavior implicitly, without enforcing explicit dependence. However, optimizing the proposed objective is difficult for two reasons. First, we consider decentralized policies without explicit dependence or communication in the execution phase. Second, mutual information is hard to optimize because the conditional distribution involved is intractable. We circumvent these difficulties by exploiting the property of the latent variable injected to induce mutual information and by applying a variational lower bound on the mutual information. With the proposed framework, we apply policy iteration with redefined value functions to obtain the VM3-AC algorithm for MARL with coordinated behavior under CTDE.
Due to space limitations, related work is discussed in Appendix A.
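We consider a Markov game Littman (1994), which is an extension of the Markov decision process (MDP) to the multi-agent setting. An $N$-agent Markov game is defined by an environment state space $\mathcal{S}$, action spaces for the $N$ agents $\mathcal{A}_1,\dots,\mathcal{A}_N$, a state transition probability $\mathcal{T}:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\to[0,1]$, where $\mathcal{A}=\prod_{i=1}^{N}\mathcal{A}_i$ is the joint action space, and a reward function $R:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$. At each time step $t$, agent $i$ executes action $a_t^i\in\mathcal{A}_i$ based on state $s_t\in\mathcal{S}$. The actions of all agents $\mathbf{a}_t=(a_t^1,\dots,a_t^N)$ yield the next state $s_{t+1}$ according to $\mathcal{T}$ and a shared common reward $r_t$ according to $R$, under the assumption of fully cooperative MARL. The discounted return is defined as $R_t=\sum_{\tau=t}^{\infty}\gamma^{\tau-t}r_\tau$, where $\gamma\in[0,1)$ is the discount factor.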
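As a small illustration of the discounted return of a shared reward sequence, the following sketch computes $\sum_t \gamma^t r_t$ for a finite episode. The function name and the example rewards are hypothetical and chosen only for illustration.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * r_t for a finite list of shared rewards."""
    total = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical shared rewards received by all agents over three steps.
shared_rewards = [1.0, 0.0, 2.0]
print(discounted_return(shared_rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

Because the game is fully cooperative, every agent would evaluate the same return from the same reward sequence.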
We assume CTDE, which incorporates the resource asymmetry between the training and execution phases and is widely considered in MARL Lowe et al. (2017); Iqbal and Sha (2018); Foerster et al. (2018). Under CTDE, each agent can access all information in the system, including the environment state and the observations and actions of other agents, in the training phase, whereas the policy of each agent can be conditioned only on its own action-observation history or observation in the execution phase. For a given joint policy $\boldsymbol{\pi}=(\pi^1,\dots,\pi^N)$, the goal of fully cooperative MARL is to find the optimal joint policy $\boldsymbol{\pi}^*$ that maximizes the objective $J(\boldsymbol{\pi})=\mathbb{E}_{\boldsymbol{\pi}}\big[\sum_{t}\gamma^{t}r_t\big]$.
Maximum Entropy RL: The goal of maximum entropy RL is to find an optimal policy that maximizes the entropy-regularized objective function, given by

$$J(\pi)=\mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}\big(r_t(s_t,a_t)+\alpha H(\pi(\cdot|s_t))\big)\Big], \tag{1}$$

where $H(\pi(\cdot|s_t))$ is the entropy of the policy at state $s_t$ and $\alpha$ weights the entropy against the reward.
It is known that this objective encourages the policy to explore widely in the state and action spaces and helps the policy avoid converging to a poor local optimum. Soft actor-critic (SAC), which is based on the maximum entropy RL principle, approximates soft policy iteration in an actor-critic form. SAC outperforms other deep RL algorithms in many continuous-action tasks Haarnoja et al. (2018).
We can simply extend SAC to the multi-agent setting in the manner of independent learning. Each agent trains a decentralized policy using a decentralized critic to maximize the weighted sum of the cumulative return and the entropy of its policy. We refer to this method as independent SAC (I-SAC). Adopting the CTDE framework, we can replace the decentralized critic with a centralized critic that incorporates the observations and actions of all agents. We refer to this method as multi-agent soft actor-critic (MA-SAC). Both I-SAC and MA-SAC are considered as baselines in the experiment section.
3 The Proposed Maximum Mutual Information Framework
For the theoretical development in this section, we assume that the environment is fully observable, i.e., each agent can observe the environment state $s_t$. We will consider the partially observable environment for practical algorithm construction under CTDE in the next section.
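Under the proposed MMI framework, we aim to find the policy that maximizes the mutual information between actions in addition to the cumulative return. Thus, the MMI-regularized objective function for joint policy $\boldsymbol{\pi}$ is given by

$$J(\boldsymbol{\pi})=\mathbb{E}_{\boldsymbol{\pi}}\Big[\sum_{t=0}^{\infty}\gamma^{t}\Big(r_t(s_t,\mathbf{a}_t)+\alpha\sum_{(i,j)}I\big(\pi^i(\cdot|s_t);\pi^j(\cdot|s_t)\big)\Big)\Big], \tag{2}$$

where the sum runs over pairs of agents $(i,j)$ and $\alpha$ is the temperature parameter that controls the relative importance of the mutual information against the reward.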
As mentioned above, we assume decentralized policies and want them to exhibit coordinated behavior. Furthermore, we want this coordinated behavior without the explicit dependency previously used to enforce it. Here, explicit dependency Jaques et al. (2018) means that for two agents $i$ and $j$, the action of agent $i$ follows $a_t^i\sim\pi^i(\cdot|s_t)$ and then the action of agent $j$ follows $a_t^j\sim\pi^j(\cdot|s_t,a_t^i)$, i.e., the input to the policy function of agent $j$ explicitly requires information about the action of agent $i$ for coordinated behavior. By regularization with mutual information in the proposed objective function (2), the policy of each agent is implicitly encouraged to coordinate with other agents' policies without explicit dependency, by reducing the uncertainty about other agents' policies. This can be seen as follows. Mutual information is expressed in terms of the entropy and the conditional entropy as

$$I\big(\pi^i(\cdot|s_t);\pi^j(\cdot|s_t)\big)=H\big(\pi^j(\cdot|s_t)\big)-H\big(\pi^j(\cdot|s_t)\,\big|\,\pi^i(\cdot|s_t)\big).$$

If the knowledge of $\pi^i$ does not provide any information about $\pi^j$, the conditional entropy reduces to the unconditional entropy, i.e., $H(\pi^j|\pi^i)=H(\pi^j)$, and the mutual information becomes zero. Maximizing mutual information is equivalent to minimizing the uncertainty about other agents' policies conditioned on the agent's own policy, which can lead the agent to learn coordinated behavior based on the reduced uncertainty about other agents' policies.
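The decomposition of mutual information into entropy minus conditional entropy can be checked numerically on a toy example. The sketch below assumes two agents with binary actions and a hand-picked joint action distribution at a fixed state; the tables and function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mutual_information(joint):
    """I(A1; A2) = H(A2) - H(A2 | A1) for a table joint[a1][a2]."""
    p1 = [sum(row) for row in joint]         # marginal of agent 1
    p2 = [sum(col) for col in zip(*joint)]   # marginal of agent 2
    h2_given_1 = sum(
        p1[a1] * entropy([p / p1[a1] for p in joint[a1]])
        for a1 in range(len(joint))
        if p1[a1] > 0
    )
    return entropy(p2) - h2_given_1


# Independent actions: knowing a1 says nothing about a2, so I = 0.
independent = [[0.25, 0.25], [0.25, 0.25]]
# Correlated actions: the agents mostly pick the same action, so I > 0.
correlated = [[0.45, 0.05], [0.05, 0.45]]

print(mutual_information(independent))
print(mutual_information(correlated))
```

In the independent case the conditional entropy equals the unconditional entropy, which is exactly the situation in which the mutual information term contributes nothing to the objective.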
However, direct optimization of the objective function (2) is not easy. Fig. 1(a) shows the causal diagram of the system model described in Section 2 for the case of two agents with decentralized policies. Since we consider the case of no explicit dependency, the two policy distributions can be expressed as $\pi^1(a_t^1|s_t)$ and $\pi^2(a_t^2|s_t)$. Then, for a given environment state $s_t$ observed by both agents, $a_t^1$ and $a_t^2$ are conditionally independent and the mutual information is zero, i.e., $I(\pi^1(\cdot|s_t);\pi^2(\cdot|s_t))=0$. Thus, the MMI objective (2) reduces to the standard MARL objective of only the accumulated return. In the following subsections, we present our approach to circumvent this difficulty and implement the MMI framework and its operation under CTDE.
3.1 Inducing Mutual Information Using Latent Variable
First, in order to induce mutual information among agents' policies under the system causal diagram shown in Fig. 1(a), we introduce a latent variable $z_t$. For illustration, consider the new diagram with the latent variable in Fig. 1(b). Suppose that the latent variable $z_t$ has a prior distribution $p_Z(z_t)$, and assume that both actions $a_t^1$ and $a_t^2$ are generated from the observed random variable $s_t$ and the unobserved random variable $z_t$. Then, the policy of agent $i$ is given by the marginal distribution $\pi^i(a_t^i|s_t)=\int\pi^i(a_t^i|s_t,z_t)\,p_Z(z_t)\,dz_t$, marginalized over $z_t$. With the unobserved latent random variable $z_t$, conditional independence given $s_t$ no longer holds for $a_t^1$ and $a_t^2$, and the mutual information can be positive, i.e., $I(\pi^1(\cdot|s_t);\pi^2(\cdot|s_t))>0$. Hence, we can induce mutual information between actions without explicit dependence by introducing the latent variable. In the general case of $N$ agents, we have $\boldsymbol{\pi}(a_t^1,\dots,a_t^N|s_t)=\mathbb{E}_{z_t\sim p_Z}\big[\prod_{i=1}^{N}\pi^i(a_t^i|s_t,z_t)\big]$. Note that in this case we inject a common latent variable into all agents' policies.
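The effect of a common latent variable can be illustrated with a toy discrete example. The sketch below assumes a binary latent variable with a uniform prior and two agents with binary actions at a fixed state; the probability tables are hypothetical and serve only to show that marginalizing over a shared latent variable makes the actions dependent.

```python
import math


def joint_with_latent(prior, pi1, pi2):
    """joint[a1][a2] = sum_z prior[z] * pi1[z][a1] * pi2[z][a2]."""
    n1, n2 = len(pi1[0]), len(pi2[0])
    return [
        [sum(prior[z] * pi1[z][a1] * pi2[z][a2] for z in range(len(prior)))
         for a2 in range(n2)]
        for a1 in range(n1)
    ]


def mutual_information(joint):
    """I(A1; A2) = sum p(a1, a2) * log(p(a1, a2) / (p(a1) * p(a2)))."""
    p1 = [sum(row) for row in joint]
    p2 = [sum(col) for col in zip(*joint)]
    return sum(
        p * math.log(p / (p1[a1] * p2[a2]))
        for a1, row in enumerate(joint)
        for a2, p in enumerate(row)
        if p > 0
    )


prior = [0.5, 0.5]                      # p_Z(z) for z in {0, 1}
uses_z = [[0.9, 0.1], [0.1, 0.9]]       # pi(a | z): action follows z
ignores_z = [[0.5, 0.5], [0.5, 0.5]]    # pi(a | z): action ignores z

print(mutual_information(joint_with_latent(prior, uses_z, uses_z)))        # positive
print(mutual_information(joint_with_latent(prior, ignores_z, ignores_z)))  # zero
```

Neither policy takes the other agent's action as input, yet the actions become correlated because both depend on the same draw of the latent variable.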
3.2 Variational Bound of Mutual Information
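Even with non-trivial mutual information $I(\pi^i(\cdot|s_t);\pi^j(\cdot|s_t))>0$, it is difficult to compute the mutual information directly. Note that we need the conditional distribution of $a_t^j$ given $(a_t^i,s_t)$ to compute the mutual information as seen in (4), but it is difficult to know this conditional distribution directly. To circumvent this difficulty, we use a variational distribution $q(a_t^j|a_t^i,s_t)$ to approximate $p(a_t^j|a_t^i,s_t)$ and derive a lower bound on the mutual information as

$$\begin{aligned}
I\big(\pi^i(\cdot|s_t);\pi^j(\cdot|s_t)\big)
&=H\big(\pi^j(\cdot|s_t)\big)+\mathbb{E}_{p(a_t^i,a_t^j|s_t)}\big[\log p(a_t^j|a_t^i,s_t)\big]\\
&=H\big(\pi^j(\cdot|s_t)\big)+\mathbb{E}_{p(a_t^i,a_t^j|s_t)}\big[\log q(a_t^j|a_t^i,s_t)\big]+\mathbb{E}_{p(a_t^i|s_t)}\Big[D_{\mathrm{KL}}\big(p(\cdot|a_t^i,s_t)\,\big\|\,q(\cdot|a_t^i,s_t)\big)\Big]\\
&\ge H\big(\pi^j(\cdot|s_t)\big)+\mathbb{E}_{p(a_t^i,a_t^j|s_t)}\big[\log q(a_t^j|a_t^i,s_t)\big],
\end{aligned}$$

where the inequality holds because the KL divergence is always non-negative. The lower bound becomes tight when $q(a_t^j|a_t^i,s_t)$ approximates $p(a_t^j|a_t^i,s_t)$ well. Using the symmetry of mutual information, we can rewrite the lower bound as

$$I\big(\pi^i(\cdot|s_t);\pi^j(\cdot|s_t)\big)\ge H\big(\pi^i(\cdot|s_t)\big)+\mathbb{E}_{p(a_t^i,a_t^j|s_t)}\big[\log q(a_t^i|a_t^j,s_t)\big].$$

Then, we can maximize the lower bound of the mutual information by using the tractable approximation $q(a_t^i|a_t^j,s_t)$.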
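The variational bound can be verified numerically in a small discrete setting. The following sketch assumes two agents with binary actions and a hypothetical joint action distribution at a fixed state; it compares the exact mutual information with the bound $H(A_i)+\mathbb{E}[\log q(a_i|a_j)]$ for an exact and an inexact variational distribution.

```python
import math


def mutual_information(joint):
    """Exact I(A_i; A_j) for a table joint[a_i][a_j]."""
    p_i = [sum(row) for row in joint]
    p_j = [sum(col) for col in zip(*joint)]
    return sum(
        p * math.log(p / (p_i[ai] * p_j[aj]))
        for ai, row in enumerate(joint)
        for aj, p in enumerate(row)
        if p > 0
    )


def lower_bound(joint, q):
    """H(A_i) + E[log q(a_i | a_j)], with q indexed as q[a_j][a_i]."""
    p_i = [sum(row) for row in joint]
    h_i = -sum(p * math.log(p) for p in p_i if p > 0)
    expected_log_q = sum(
        p * math.log(q[aj][ai])
        for ai, row in enumerate(joint)
        for aj, p in enumerate(row)
        if p > 0
    )
    return h_i + expected_log_q


joint = [[0.41, 0.09], [0.09, 0.41]]
p_j = [sum(col) for col in zip(*joint)]
# Exact conditional p(a_i | a_j): the bound is tight for this choice of q.
exact_q = [[joint[ai][aj] / p_j[aj] for ai in range(2)] for aj in range(2)]
# A cruder guess for q: the bound is strictly below the true value.
rough_q = [[0.7, 0.3], [0.3, 0.7]]

print(mutual_information(joint))
print(lower_bound(joint, exact_q))
print(lower_bound(joint, rough_q))
```

The gap between the exact value and the bound is the expected KL divergence between the true conditional and the variational distribution, which is why improving $q$ tightens the bound.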
3.3 Modified Policy Iteration
In this subsection, we develop policy iteration for the MMI framework. First, we replace the original MMI objective function (2) with the following tractable objective function based on the variational lower bound (5):
where $q(a_t^i|a_t^j,s_t)$ is the variational distribution that approximates the conditional distribution $p(a_t^i|a_t^j,s_t)$. Then, we determine the individual objective function for agent $i$ as the sum of the terms in (6) associated with agent $i$'s policy $\pi^i$ or action $a_t^i$, given by
where $\beta$ is the temperature parameter. Note that maximizing the term (a) in (7) implies that each agent maximizes the weighted sum of the policy entropy and the return, which can be interpreted as an extension of maximum entropy RL to the multi-agent setting. On the other hand, maximizing the term (b) with respect to $\pi^i$ means that we update the policy $\pi^i$ so that agent $j$ predicts agent $i$'s action well by the first term in (b) and agent $i$ predicts agent $j$'s action well by the second term in (b). Thus, the objective function (7) can be interpreted as the maximum entropy MARL objective combined with predictability enhancement for other agents' actions. Note that predictability is reduced when actions are uncorrelated. Since the policy entropy term enhances individual exploration due to the maximum entropy principle Haarnoja et al. (2018) and the term (b) in (7) enhances predictability or correlation among agents' actions, the proposed objective function (7) can be considered as one implementation of the concept of correlated exploration in MARL Mahajan et al. (2019).
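As a sketch of how the two groups of terms described above fit together, one form consistent with the prose is given below. It is an illustration under stated assumptions, not the exact expression of (7): the notation $q^{(i,j)}(a_t^i|a_t^j,s_t)$ for the variational distribution that predicts agent $i$'s action from agent $j$'s action, and the separate weights $\beta$ and $\lambda$, are placeholders introduced here.

```latex
J(\pi^i, q) = \mathbb{E}\Bigg[\sum_{t=0}^{\infty}\gamma^{t}\Bigg(
  \underbrace{r_t + \beta\, H\big(\pi^i(\cdot|s_t)\big)}_{(a)}
  + \underbrace{\lambda \sum_{j \neq i}\Big[
      \log q^{(i,j)}\big(a_t^i \,\big|\, a_t^j, s_t\big)
    + \log q^{(j,i)}\big(a_t^j \,\big|\, a_t^i, s_t\big)\Big]}_{(b)}
\Bigg)\Bigg]
```

In this form, term (a) is the usual maximum entropy objective for agent $i$, while term (b) collects the log-likelihood terms of the variational lower bound that involve agent $i$'s action, either as the predicted quantity or as the conditioning input.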
Now, in order to learn policy to maximize the objective function (7), we modify the policy iteration in standard RL. For this, we redefine the state and state-action value functions for each agent as follows:
where . Then, the Bellman operator corresponding to and is given by
Proof. See Appendix B.
In the policy improvement step, we update the policy and the variational distribution by using the value function evaluated in the policy evaluation step. Here, each agent updates its policy and variational distribution while keeping other agents’ policies fixed as follows:
where . Then, we have the following lemma regarding the improvement step.
(Variational Policy Improvement). Let and be the updated policy and the variational distribution from (35). Then, for all .
Proof. See Appendix B.
The modified policy iteration is defined as applying the variational policy evaluation and variational improvement steps in an alternating manner. Each agent trains its policy, critics and the variational distribution to maximize its objective function (7).
4 Algorithm Construction
Summarizing the development above, we now propose the variational maximum mutual information multi-agent actor-critic (VM3-AC) algorithm, which can be applied to continuous and partially observable multi-agent environments under CTDE. The overall operation of VM3-AC is shown in Fig. 2. Under CTDE, each agent's policy is conditioned only on its local observation, and centralized critics are conditioned on either the environment state or the observations of all agents, depending on the situation Lowe et al. (2017). Let $\mathbf{x}$ denote either the environment state or the observations of all agents, whichever is used. In order to deal with large continuous state-action spaces, we adopt deep neural networks to approximate the required functions. For agent $i$, we parameterize the variational distribution, the state-value function, two action-value functions, and the policy, each with its own set of network parameters. We assume a zero-mean normal distribution for the latent variable, which plays a key role in inducing coordination among agents' policies.
4.1 Centralized Training
As mentioned above, the policy of each agent is the distribution marginalized over the latent variable $z_t$, where the policies of all agents take the same $z_t$ generated from the prior as an input variable. We perform the required marginalization based on Monte Carlo numerical expectation as follows:
The state-value function is trained to minimize the following loss function:
where $\mathcal{D}$ is the replay buffer that stores the transitions, and the minimum of the two action-value functions is used to prevent the overestimation problem Fujimoto et al. (2018). The two action-value functions are updated by minimizing the loss
where the target value network is updated by the exponential moving average method. We implement the reparameterization trick to estimate the stochastic gradient of the policy loss, so that the action of agent $i$ is written as a deterministic function of the policy input and independently sampled noise. The policy for agent $i$ and the variational distribution are trained to minimize the following policy improvement loss,
Since the variational distribution is not an accurate approximation in the early stage of training, and learning via the term (a) in (18) is more susceptible to this approximation error, we propagate the gradient only through the term (b) in (18) to make learning stable. Note that, due to our Gaussian assumption on the variational distribution, minimizing its negative log-likelihood is equivalent to minimizing the mean-squared error between its predicted mean and the observed action.
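The link between a Gaussian negative log-likelihood and the squared error can be checked with a short computation. The sketch below assumes an isotropic Gaussian with a fixed standard deviation, which is an assumption made here for illustration; the function names and numbers are hypothetical.

```python
import math


def gaussian_nll(action, mean, sigma=1.0):
    """Negative log-density of N(mean, sigma^2 * I) evaluated at action."""
    dim = len(action)
    squared = sum((a - m) ** 2 for a, m in zip(action, mean))
    return 0.5 * squared / sigma ** 2 + 0.5 * dim * math.log(2 * math.pi * sigma ** 2)


def squared_error(action, mean):
    """Sum of squared differences between action and predicted mean."""
    return sum((a - m) ** 2 for a, m in zip(action, mean))


observed = [0.2, -0.4]       # action actually taken by the other agent
prediction_a = [0.0, 0.0]    # two candidate predicted means
prediction_b = [0.5, -0.5]

# With sigma fixed, the NLL difference equals half the squared-error difference,
# so ranking predictions by NLL is the same as ranking them by squared error.
nll_gap = gaussian_nll(observed, prediction_a) - gaussian_nll(observed, prediction_b)
se_gap = squared_error(observed, prediction_a) - squared_error(observed, prediction_b)
print(nll_gap, 0.5 * se_gap)
```

The normalization constant does not depend on the predicted mean, so it drops out of the gradient with respect to the network that outputs the mean.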
4.2 Decentralized Execution
In the centralized training phase, we pick the actions by using the Monte Carlo expectation based on the common latent variable generated from a zero-mean Gaussian distribution, as seen in (13). We can achieve the same operation in the decentralized execution phase by giving all agents the same Gaussian random sequence generator and distributing the same seed to this generator only once, at the beginning of the execution phase. This eliminates the need for communication to share the latent variable. In fact, this way of sharing can be applied to the centralized training phase too. The proposed VM3-AC algorithm is summarized in Appendix C.
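The shared-seed idea can be sketched in a few lines. The example below assumes each agent holds its own pseudo-random generator from the Python standard library, seeded once with a common value; the class name, seed, and latent dimension are hypothetical.

```python
import random


class Agent:
    """Toy agent that draws the latent variable from its own local generator."""

    def __init__(self, seed):
        # Every agent receives the same seed once, before execution starts.
        self._rng = random.Random(seed)

    def next_latent(self, dim):
        """Draw one zero-mean, unit-variance Gaussian latent vector."""
        return [self._rng.gauss(0.0, 1.0) for _ in range(dim)]


agents = [Agent(seed=1234) for _ in range(3)]

for step in range(5):
    latents = [agent.next_latent(dim=2) for agent in agents]
    # All agents obtain the identical latent vector without exchanging messages.
    assert all(z == latents[0] for z in latents)

print("all agents drew identical latent sequences")
```

This only works if every agent draws from its generator in the same order and the same number of times per step, so the generators stay synchronized throughout execution.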
In this section, we provide numerical results to evaluate VM3-AC. Since we focus on the continuous action-space case in this paper, we considered four baselines relevant to this case: 1) MADDPG Lowe et al. (2017), an extension of DDPG with a centralized critic to train a decentralized policy for each agent; 2) I-SAC, an example of independent learning where each agent learns its policy based on SAC while treating the other agents as part of the environment; 3) MA-SAC, an extension of I-SAC with a centralized critic instead of a decentralized critic; and 4) multi-agent actor-critic (MA-AC), a variant of MA-SAC that is identical to MA-SAC except that it omits the entropy term. All algorithms used neural networks to approximate the required functions. In all algorithms except I-SAC, we used the neural network architecture proposed in Kim et al. (2019) to emphasize the agent's own observation and action for centralized critics. For agent $i$, we used a shared neural network for the variational distributions for all other agents $j\neq i$, and the network takes as input a one-hot vector that indicates $j$. Experimental details are given in Appendix E.
We evaluated the proposed algorithm and the baselines in three multi-agent environments with a varying number of agents: multi-walker Gupta et al. (2017), predator-prey Lowe et al. (2017), and cooperative navigation Lowe et al. (2017). The detailed setting of each environment is provided in Appendix D.
[Figure 3: learning curves. Panels: (a) MW (N=3), (b) MW (N=4), (c) PP (N=2), (d) PP (N=3), (e) PP (N=4), (f) CN (N=3).]
Fig. 3 shows the learning curves for the three considered environments (MW: multi-walker, PP: predator-prey, CN: cooperative navigation) with different numbers of agents. The y-axis denotes the average of all agents' rewards averaged over 7 random seeds, and the x-axis denotes the time step. The hyperparameters, including the temperature parameter and the dimension of the latent variable, are provided in Appendix E.
As shown in Fig. 3, VM3-AC outperforms the baselines in the considered environments. The performance gain is especially large in the multi-walker environment, because the agents in this environment must learn coordinated behavior to obtain high rewards. Hence, we can see that the proposed MMI framework improves performance in complex multi-agent tasks requiring high-quality coordination. The performance gap between VM3-AC and MA-SAC indicates the effect of regularization with the variational term (b) of the objective function (7). Recall that VM3-AC without the variational term (b) of the objective function (7) reduces to MA-SAC. Recall also that MA-SAC without entropy regularization reduces to MA-AC, and MA-SAC with decentralized critics instead of centralized critics reduces to I-SAC. Hence, the fact that MA-SAC outperforms I-SAC and MA-AC shows that entropy regularization and the use of centralized critics are also important in multi-agent tasks. Note that VM3-AC also maximizes the entropy through the term (a) of the objective function (7). Indeed, it is seen that regularization with the variational term in addition to the policy entropy enhances coordinated behavior in MARL.
[Figure 4: ablation results in the multi-walker environment. Panels: (a) MW (N=3), (b) MW (N=4), (c) MW (N=3), (d) MW (N=4).]
Due to space limitations, further results comparing VM3-AC with the recent algorithm MAVEN Mahajan et al. (2019) are provided in Appendix F. It is seen there that VM3-AC significantly outperforms MAVEN.
5.2 Ablation Study
In this section, we provide an ablation study on the major technique and hyperparameter of VM3-AC: 1) the latent variable and 2) the temperature parameter.
Latent variable: The role of the latent variable is to induce mutual information among actions and promote coordinated behavior. We compared VM3-AC and VM3-AC without the latent variable (implemented by setting ) in the multi-walker environment with $N=3$ and $N=4$. In both cases, VM3-AC yields better performance than VM3-AC without the latent variable, as shown in Fig. 4(a) and 4(b).
Temperature parameter: The role of the temperature parameter is to control the relative importance between the reward and the mutual information. We evaluated VM3-AC by varying the temperature in the multi-walker environment with $N=3$ and $N=4$. Fig. 4(c) and 4(d) show the range of temperature values in which VM3-AC yields good performance.
In this paper, we have proposed the MMI framework for MARL to enhance multi-agent coordinated learning under CTDE by regularizing the cumulative return with mutual information among actions. The MMI framework is implemented practically by using a latent variable and a variational technique and by applying policy iteration. Numerical results show that the derived algorithm, named VM3-AC, outperforms the baselines, especially in multi-agent tasks requiring high coordination among agents. Furthermore, the MMI framework can be combined with other techniques for cooperative MARL, such as value decomposition Rashid et al. (2018), to yield better performance.
Broader Impact
The research topic of this paper is multi-agent reinforcement learning (MARL), an important branch of reinforcement learning. MARL models many practical control problems in the real world, such as smart factories, coordinated robots, and connected self-driving cars. With advances in MARL knowledge and technology, solutions to such real-world problems can be improved and made more robust. For example, if the control of self-driving cars is coordinated among several nearby cars, the safety of self-driving cars will be much improved. We therefore believe that research advances in this field can benefit our safety and future society.
References
- Andriotis and Papakonstantinou (2019). Managing engineering systems with large state and action spaces through deep reinforcement learning. Reliability Engineering & System Safety 191, pp. 106483.
- de Witt et al. (2019). Multi-agent common knowledge reinforcement learning. In Advances in Neural Information Processing Systems, pp. 9924–9935.
- Foerster et al. (2016). Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2137–2145.
- Foerster et al. (2018). Counterfactual multi-agent policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Fujimoto et al. (2018). Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477.
- Gupta et al. (2017). Cooperative multi-agent control using deep reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems, pp. 66–83.
- Haarnoja et al. (2018). Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
- Iqbal and Sha (2018). Actor-attention-critic for multi-agent reinforcement learning. arXiv preprint arXiv:1810.02912.
- Jaques et al. (2018). Social influence as intrinsic motivation for multi-agent deep reinforcement learning. arXiv preprint arXiv:1810.08647.
- Kim et al. (2019). Message-dropout: an efficient training method for multi-agent deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6079–6086.
- Langley (2000). Crafting papers on machine learning. In Proceedings of the 17th International Conference on Machine Learning (ICML 2000), P. Langley (Ed.), Stanford, CA, pp. 1207–1216.
- Li et al. (2019). Efficient ridesharing order dispatching with mean field multi-agent reinforcement learning. In The World Wide Web Conference, pp. 983–994.
- Lillicrap et al. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
- Littman (1994). Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994, pp. 157–163.
- Liu et al. (2020). Multi-agent interactions modeling with correlated policies. arXiv preprint arXiv:2001.03415.
- Lowe et al. (2017). Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, pp. 6379–6390.
- Mahajan et al. (2019). MAVEN: multi-agent variational exploration. In Advances in Neural Information Processing Systems, pp. 7611–7622.
- Mnih et al. (2015). Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533.
- OroojlooyJadid and Hajinezhad (2019). A review of cooperative multi-agent deep reinforcement learning. arXiv preprint arXiv:1908.03963.
- Pesce and Montana (2019). Improving coordination in small-scale multi-agent deep reinforcement learning through memory-driven communication. arXiv preprint arXiv:1901.03887.
- Rashid et al. (2018). QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. arXiv preprint arXiv:1803.11485.
- Son et al. (2019). QTRAN: learning to factorize with transformation for cooperative multi-agent reinforcement learning. arXiv preprint arXiv:1905.05408.
- Sunehag et al. (2017). Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296.
- Tan (1993). Multi-agent reinforcement learning: independent vs. cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning, pp. 330–337.
- Wen et al. (2019). Probabilistic recursive reasoning for multi-agent reinforcement learning. arXiv preprint arXiv:1901.09207.
- Zhang and Lesser (2013). Coordinating multi-agent reinforcement learning with limited communication. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-agent Systems, pp. 1101–1108.
- Zheng and Yue (2018). Structured exploration via hierarchical variational policy networks.
Appendix A: Related Work
For cooperative MARL, several approaches have been studied. One approach is value decomposition Sunehag et al. (2017); Rashid et al. (2018); Son et al. (2019). For example, QMIX Rashid et al. (2018) factorizes the joint action-value function into a combination of local action-value functions while imposing a monotonicity constraint. QMIX achieves state-of-the-art performance in complex discrete-action MARL tasks and has been widely used as a baseline in discrete-action environments. Since the focus of VM3-AC is on continuous-action environments, a direct comparison of VM3-AC with QMIX is not applicable. However, the basic concept of QMIX can also be applied to the MMI framework, and this remains as future work.
Learning coordinated behavior in multi-agent systems has been studied extensively in the MARL community. To promote coordination, some previous works used communication among agents Zhang and Lesser (2013); Foerster et al. (2016); Pesce and Montana (2019). For example, Foerster et al. (2016) proposed the DIAL algorithm to learn a communication protocol that enables the agents to coordinate their behaviors. Jaques et al. (2018) proposed the social influence intrinsic reward, which is related to the mutual information between actions, to achieve coordination. Although the social influence algorithm increases performance in challenging social dilemma environments, its limitation is that explicit dependency across actions is required and imposed for the algorithm to compute the intrinsic reward. As already mentioned, the MMI framework can be viewed as an indirect enhancement of correlated exploration. Correlated policies are considered in several other works too. Liu et al. (2020) proposed explicit modeling of correlated policies for multi-agent imitation learning, and Wen et al. (2019) proposed a probabilistic recursive reasoning framework. By introducing a latent variable and a variational lower bound on mutual information, the proposed VM3-AC increases the correlation among policies without communication in the execution phase and without explicit dependency across agents' actions.
As mentioned in the main paper, the proposed MMI framework can be interpreted as enhancing correlated exploration by increasing the entropy of each agent's own policy while decreasing the uncertainty about other agents' actions. Some previous works also proposed other techniques to enhance correlated exploration Mahajan et al. (2019); Zheng and Yue (2018). For example, MAVEN addressed the poor exploration of QMIX by maximizing the mutual information between the latent variable and the observed trajectories Mahajan et al. (2019). However, MAVEN does not consider the correlation among policies. We compare the proposed VM3-AC with MAVEN, and the comparison result is given in Appendix F.
Appendix B: Variational policy evaluation and policy improvement
In the main paper, we defined the state and state-action value functions for each agent as follows:
Define the mutual information augmented reward as
Then, we can apply the standard convergence results for policy evaluation. Define
for . Then, the operator is a $\gamma$-contraction.
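The contraction property follows the standard argument for Bellman operators. As a sketch, assume the operator has the generic form $(\mathcal{T}^{\pi}Q)(s,\mathbf{a})=\tilde{r}(s,\mathbf{a})+\gamma\,\mathbb{E}_{s',\mathbf{a}'}\big[Q(s',\mathbf{a}')\big]+c(s,\mathbf{a})$, where $\tilde{r}$ is a bounded augmented reward and $c$ collects terms that do not depend on $Q$; this form and the symbols $\tilde{r}$ and $c$ are assumptions introduced here for illustration. Then for any two bounded functions $Q_1$ and $Q_2$:

```latex
\begin{align*}
\big|(\mathcal{T}^{\pi}Q_1)(s,\mathbf{a}) - (\mathcal{T}^{\pi}Q_2)(s,\mathbf{a})\big|
  &= \gamma\,\Big|\mathbb{E}_{s',\mathbf{a}'}\big[Q_1(s',\mathbf{a}') - Q_2(s',\mathbf{a}')\big]\Big| \\
  &\le \gamma\,\mathbb{E}_{s',\mathbf{a}'}\Big[\big|Q_1(s',\mathbf{a}') - Q_2(s',\mathbf{a}')\big|\Big] \\
  &\le \gamma\,\|Q_1 - Q_2\|_{\infty}.
\end{align*}
```

Taking the supremum over $(s,\mathbf{a})$ on the left-hand side gives $\|\mathcal{T}^{\pi}Q_1-\mathcal{T}^{\pi}Q_2\|_{\infty}\le\gamma\|Q_1-Q_2\|_{\infty}$, so repeated application of the operator converges to a unique fixed point when $\gamma<1$.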
Note that the operator has a unique fixed point by the contraction mapping theorem, and we define the fixed point as . Since
and this implies
(Variational Policy Improvement). Let and be the updated policy and the variational distribution from (35). Then, for all .
Let be determined as
Then, the following inequality holds:
From the definition of the Bellman operator,