Multi-Agent Deep Reinforcement Learning (MADRL) is gaining increasing attention from the research community with the recent success of deep learning, because many practical decision-making problems, such as connected self-driving cars and collaborative drone navigation, are modeled as multi-agent systems requiring action control. There are mainly two approaches in MADRL: centralized control and decentralized control. The centralized control approach assumes that there exists a central controller which determines the actions of all agents based on the observations of all agents. That is, the central controller has a policy which maps the joint observation to a joint action. Since the action is based on the joint observation, this approach eases the problem of lack of full observability of the global state in partially observable environments [Goldman and Zilberstein2004]. However, this approach suffers from the curse of dimensionality because the state-action space grows exponentially as the number of agents increases [Buşoniu, Babuška, and De Schutter2010]. Moreover, exploration, which is essential in RL, becomes more difficult than in the single-agent RL case due to the huge state-action space. Hence, to simplify the problem, the decentralized control approach has been considered. In fully decentralized multi-agent control, each agent decides its action based only on its own observation, while treating other agents as a part of the environment, to reduce the curse of dimensionality. However, this approach eliminates the cooperative benefit from the presence of other agents and suffers from performance degradation.
In order to improve the performance of decentralized control, several methods have been studied. First, multi-agent systems with decentralized control with communication (DCC) were studied [Goldman and Zilberstein2004]. In the framework of MADRL with DCC, the agents communicate with each other by sending messages in both the training and execution phases, and the policy of each agent, parameterized by a deep neural network, determines the action of the agent based on its own observation and the messages received from other agents. To incorporate the messages from other agents, the size of the deep neural network of each agent must increase as the number of message-passing agents increases. However, if the network size becomes too large, training becomes difficult and may fail. Another approach is centralized learning with decentralized execution, which allows each agent to use the information of other agents only in the training phase. In particular, the recently introduced MADDPG [Lowe and Mordatch2017], which uses a centralized critic to train a decentralized policy for each agent, belongs to this second category. In MADDPG, the centralized critic takes all other agents' observations and actions as input, and hence the input space of each critic grows exponentially with the number of agents. In both approaches, as the number of agents in the system increases, the input dimension increases, learning becomes difficult, and much data is required for training. Hence, it is an important problem to properly handle the increased input dimension and devise an efficient learning algorithm for such MADRL with information exchange.
In this paper, motivated by dropout [Srivastava et al.2014], we propose a new training method, named message-dropout, which yields efficient learning for MADRL with information exchange under large input dimensions. The proposed method improves learning performance when it is applied to MADRL with information exchange. Furthermore, when it is applied to the scenario of DCC, the proposed method makes learning robust against communication errors in the execution phase.
2.1 A Partially Observable Stochastic Game
In MADRL, multiple agents learn how to act to maximize their future rewards while sequentially interacting with the environment. The procedure can be described by a Partially Observable Stochastic Game (POSG) defined by the tuple $\langle \mathcal{N}, \mathcal{S}, \{\mathcal{A}_i\}, \{\mathcal{O}_i\}, T, O, \{R_i\} \rangle$, where $\mathcal{N} = \{1, \ldots, N\}$ is the set of agents, $\mathcal{S}$ is the global state space, $\mathcal{A}_i$ is the action space of agent $i$, and $\mathcal{O}_i$ is the observation space of agent $i$. At each time step $t$, the environment has a global state $s_t \in \mathcal{S}$ and agent $i$ observes its local observation $o_t^i \in \mathcal{O}_i$, which is determined by the observation probability $O$ defined over $\mathcal{S} \times \mathcal{A} \times \mathcal{O}$, where $\mathcal{A} = \prod_i \mathcal{A}_i$ and $\mathcal{O} = \prod_i \mathcal{O}_i$ are the joint action space and the joint observation space, respectively. Agent $i$ executes action $a_t^i \in \mathcal{A}_i$, which yields the next global state $s_{t+1}$ with the state transition probability $T(s_{t+1} \mid s_t, \mathbf{a}_t)$, receives the reward $r_t^i$ according to the reward function $R_i(s_t, \mathbf{a}_t)$, and obtains the next observation $o_{t+1}^i$. The discounted return for agent $i$ is defined as $R^i = \sum_{t=0}^{\infty} \gamma^t r_t^i$, where $\gamma \in [0, 1)$ is the discounting factor. In POSG, the Q-function of agent $i$ can be approximated by $Q^i(\tau_t^i, a_t^i)$, where $\tau_t^i$ is the joint action-observation history of agent $i$. However, learning the action-value function based on the action-observation history is difficult. In this paper, we simply approximate $Q^i(\tau_t^i, a_t^i)$ with $Q^i(o_t^i, a_t^i)$. Note that a recurrent neural network can be used to reduce the approximation error between $Q^i(\tau_t^i, a_t^i)$ and $Q^i(o_t^i, a_t^i)$ [Hausknecht and Stone2015]. The goal of each agent is to maximize its expected return, which is equivalent to maximizing its Q-function. (Note that in this paragraph, we explained the fully decentralized case.)
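To make the return definition above concrete, the following sketch computes the discounted return of a single agent from a finite trajectory of per-step rewards (the reward values and discount factor here are illustrative, not from the paper):

```python
# Discounted return R^i = sum_t gamma^t * r_t^i for one agent,
# computed from a finite trajectory of rewards by backward accumulation.
def discounted_return(rewards, gamma=0.95):
    ret = 0.0
    for r in reversed(rewards):   # ret <- r_t + gamma * ret
        ret = r + gamma * ret
    return ret

rewards = [1.0, 0.0, 2.0]         # r_0, r_1, r_2 (illustrative values)
# R = 1 + 0.95 * 0 + 0.95^2 * 2 = 2.805
print(discounted_return(rewards))
```

In practice the sum is truncated at the episode horizon, as here; with $\gamma < 1$ the tail contribution vanishes for long trajectories.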
2.2 Independent Q-Learning
Independent Q-Learning (IQL), which is one of the popular decentralized multi-agent RL algorithms in the fully observable case [Tan1993], is a simple extension of Q-learning to the multi-agent setting. Each agent estimates its own optimal Q-function $Q_i^*(s, a^i)$, which satisfies the Bellman optimality equation
$$Q_i^*(s, a^i) = \mathbb{E}\left[ r^i + \gamma \max_{a'^i} Q_i^*(s', a'^i) \right].$$
Under the assumption of full observability at each agent and fully decentralized control, Tampuu et al. combined IQL with deep Q-network (DQN), and proposed that each agent trains its Q-function parameterized by a neural network $\theta^i$ by minimizing the loss function [Tampuu et al.2017]
$$L(\theta^i) = \mathbb{E}_{(s, a^i, r^i, s') \sim D^i}\left[ \left( y^i - Q(s, a^i; \theta^i) \right)^2 \right],$$
where $D^i$ is the replay memory and $y^i = r^i + \gamma \max_{a'^i} Q(s', a'^i; \theta^{i-})$ is the target Q-value for agent $i$. Here, $\theta^{i-}$ is the parameter of the target network for agent $i$.
In the case of POSG with fully decentralized control, the above loss function can be modified to
$$L(\theta^i) = \mathbb{E}_{(o^i, a^i, r^i, o'^i) \sim D^i}\left[ \left( y^i - Q(o^i, a^i; \theta^i) \right)^2 \right],$$
where $y^i = r^i + \gamma \max_{a'^i} Q(o'^i, a'^i; \theta^{i-})$. Here, $Q(s, a^i)$ in the fully observable case is approximated with $Q(o^i, a^i)$ with the local partial observation $o^i$, as described in Section 2.1.
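The target computation in the partially observable loss above can be sketched as follows; for clarity, the Q-network is replaced here by a lookup table over discretized observations, and all values are illustrative, not from the paper:

```python
import numpy as np

# Decentralized IQL target: y^i = r^i + gamma * max_a' Q(o'^i, a'; theta^i-),
# with the target Q-network stood in for by a table mapping observation -> Q(o, .).
def iql_target(reward, next_obs, target_q, gamma=0.99, done=False):
    if done:                                   # no bootstrapping at episode end
        return reward
    return reward + gamma * np.max(target_q[next_obs])

# illustrative target-network values for two discretized observations
target_q = {0: np.array([0.5, 1.0]), 1: np.array([2.0, 0.0])}
y = iql_target(reward=1.0, next_obs=1, target_q=target_q, gamma=0.9)
print(y)  # y = 1.0 + 0.9 * max(2.0, 0.0)
```

The squared difference between `y` and the online network's prediction then forms one sample of the loss $L(\theta^i)$.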
2.3 Multi-Agent DDPG
As an extension of DDPG to the multi-agent setting, MADDPG was proposed to use a decentralized policy with a centralized critic for each agent [Lowe and Mordatch2017]. The centralized critic uses additional information about the policies of other agents, and this helps learn the policy effectively in the training phase. The centralized critic for agent $i$ is represented by $Q_i(\mathbf{o}, a^1, \ldots, a^N; \phi^i)$ parameterized by a neural network $\phi^i$, where $\boldsymbol{\mu} = \{\mu_{\theta^1}, \ldots, \mu_{\theta^N}\}$ is the collection of all agents' deterministic policies, $\mathbf{o} = (o^1, \ldots, o^N)$, and $a^j = \mu_{\theta^j}(o^j)$. Each agent trains its Q-function by minimizing the loss function
$$L(\phi^i) = \mathbb{E}_{(\mathbf{o}, a^1, \ldots, a^N, r^i, \mathbf{o}') \sim D}\left[ \left( y^i - Q_i(\mathbf{o}, a^1, \ldots, a^N; \phi^i) \right)^2 \right],$$
where $y^i = r^i + \gamma \, Q_i(\mathbf{o}', a'^1, \ldots, a'^N; \phi^{i-}) \big|_{a'^j = \mu'_j(o'^j)}$ and $D$ is the replay memory. Here, $\boldsymbol{\mu}' = \{\mu'_1, \ldots, \mu'_N\}$ is the set of target policies and $\phi^{i-}$ is the parameter of the target Q-network for agent $i$.
Then, the policy for agent $i$, which is parameterized by $\theta^i$, is trained by deterministic policy gradient to maximize the objective $J(\theta^i) = \mathbb{E}[R^i]$, and the gradient of the objective is given by
$$\nabla_{\theta^i} J(\theta^i) = \mathbb{E}_{\mathbf{o} \sim D}\left[ \nabla_{\theta^i} \mu_{\theta^i}(o^i) \, \nabla_{a^i} Q_i(\mathbf{o}, a^1, \ldots, a^N; \phi^i) \big|_{a^i = \mu_{\theta^i}(o^i)} \right].$$
Dropout is a successful neural network technique. For a given neural network, constituent units (or nodes) in the neural network are randomly dropped out with probability $p$ independently in the training phase to avoid co-adaptation among the units in the neural network [Srivastava et al.2014]. In the test phase, on the other hand, all the units are included but the outgoing weights of those units affected by dropout are multiplied by $1-p$. There is a good interpretation of dropout: efficient model averaging over randomly generated neural networks in training. For a neural network with $n$ units, dropout samples the network to be trained from $2^n$ differently-thinned networks which share the parameters in the training phase, due to the independent dropping out of the units. Scaling the weights of the units at the test phase can be regarded as averaging over the ensemble of subnetworks. It is shown that in the particular case of a single logistic function, dropout corresponds to geometric averaging [Baldi and Sadowski2013]. Thus, dropout efficiently combines the exponentially many different neural networks as one network and can also be regarded as a kind of ensemble learning [Hara, Saitoh, and Shouno2016].
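The train-time masking and test-time weight scaling described above can be sketched in a few lines; this is the classical (non-inverted) variant of [Srivastava et al.2014], and the array sizes are illustrative:

```python
import numpy as np

# Standard element-wise dropout: in training, each unit is zeroed independently
# with probability p; at test time all units are kept and the activations
# (equivalently, the outgoing weights) are scaled by (1 - p).
def dropout(x, p, train, rng):
    if train:
        mask = (rng.random(x.shape) >= p).astype(x.dtype)  # keep with prob 1-p
        return x * mask
    return x * (1.0 - p)                                   # test-time scaling

rng = np.random.default_rng(0)
x = np.ones(100000)
p = 0.3
train_mean = dropout(x, p, train=True, rng=rng).mean()   # ~ 1 - p on average
test_out = dropout(x, p, train=False, rng=rng)
print(train_mean, test_out.mean())                       # both close to 1 - p
```

Modern deep learning libraries usually implement the equivalent "inverted" form, scaling the kept units by $1/(1-p)$ at training time so that no scaling is needed at test time; either convention keeps the expected activation consistent between phases.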
3.1 Decentralized Control with Communication
In this subsection, we consider the direct communication framework in [Goldman and Zilberstein2004] for DCC. This framework can be modeled by adding messages to POSG. Each agent exchanges messages before taking an action, and chooses its action based on its own observation and the received messages. Hence, the policy of agent $i$ is given by $\pi_i : \mathcal{O}_i \times \mathcal{M}_{-i} \to \mathcal{A}_i$, where $\mathcal{M}_{-i} = \prod_{j \neq i} \mathcal{M}_{j \to i}$ and $\mathcal{M}_{j \to i}$ is the space of messages sent from agent $j$ to agent $i$. We denote agent $i$'s received message from agent $j$ as $m^{j \to i}$, and $\mathbf{m}^i = (m^{j \to i})_{j \neq i}$ is the collection of the received messages at agent $i$. For the learning engine at each agent, we use the Double Deep Q-Network (DDQN), which alleviates the overestimation problem of Q-learning [Van Hasselt, Guez, and Silver2016]. In this case, the Q-function of agent $i$ parameterized by the neural network $\theta^i$ is given by $Q_i(o^i, \mathbf{m}^i, a^i; \theta^i)$. Then, the Q-function of agent $i$ is updated by minimizing the loss function
$$L(\theta^i) = \mathbb{E}\left[ \left( y^i - Q_i(o^i, \mathbf{m}^i, a^i; \theta^i) \right)^2 \right],$$
where $y^i = r^i + \gamma \, Q_i\!\left(o'^i, \mathbf{m}'^i, \arg\max_{a'} Q_i(o'^i, \mathbf{m}'^i, a'; \theta^i); \theta^{i-}\right)$. We will refer to this scheme as simple DCC.
The problem of simple DCC is that, as the number of agents increases, the input dimension of the neural network at each agent increases linearly and the size of the input state space increases exponentially. Thus, the number of samples required for learning increases, and this decreases the speed of learning. Another issue of simple DCC is that the portion of each agent's own observation space in the input space of the Q-function decreases as the number of agents increases. Typically, each agent's own observation is the most important. Hence, the importance of each agent's own observation is not properly weighted in simple DCC.
To address the aforementioned issues of simple DCC, we propose a new neural network technique, named message-dropout, which can be applied to decentralized control with communication of messages.
Message-dropout drops out the received other agents' messages at the input of the Q-network of each agent independently in the training phase. That is, all units corresponding to the message from one dropped other agent are dropped out simultaneously in a blockwise fashion with probability $p$, and this blockwise dropout is performed independently across all input unit blocks corresponding to all other agents' messages. On the other hand, the outgoing weights of those input units on which dropout is applied are multiplied by $1-p$ when the policy generates the actual action. Note that dropout is not applied to the input units corresponding to each agent's own observation. To illustrate, let us consider agent 1 in an environment with three agents in total, as shown in Fig. 1. The Q-network of agent 1 has three input blocks: one for its own observation and two for the messages from the two other agents. By applying the proposed message-dropout, as shown in Fig. 1, we have the four possible configurations for the input of the Q-network of agent 1:
$$(o^1, m^{2 \to 1}, m^{3 \to 1}), \quad (o^1, \mathbf{0}, m^{3 \to 1}), \quad (o^1, m^{2 \to 1}, \mathbf{0}), \quad (o^1, \mathbf{0}, \mathbf{0}),$$
where $\mathbf{0}$ with all zero elements represents the dropped-out input units.
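The blockwise masking described above can be sketched as follows, for an agent with two message-sending neighbors; the observation and message dimensions and values are illustrative placeholders:

```python
import numpy as np

# Message-dropout at the input of one agent's Q-network: each other agent's
# message block is zeroed as a whole with probability p, while the agent's
# own observation block is always kept.
def message_dropout(own_obs, messages, p, rng):
    kept = []
    for m in messages:                      # one Bernoulli draw per agent,
        eps = float(rng.random() >= p)      # not per unit: 1 w.p. 1-p, else 0
        kept.append(eps * m)                # the whole block is dropped together
    return np.concatenate([own_obs] + kept)

rng = np.random.default_rng(1)
o1 = np.array([0.3, 0.7])                   # own observation, never dropped
m21, m31 = np.array([1.0, 1.0]), np.array([2.0, 2.0])
x = message_dropout(o1, [m21, m31], p=0.5, rng=rng)
print(x[:2])                                # own observation always survives
```

Repeated calls produce exactly the four input configurations listed above, each message block being either fully present or fully zeroed.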
Now, we explain the Q-learning process for DCC with message-dropout (DCC-MD). Consider the learning at agent $i$. Agent $i$ stores the transition $(o^i, \mathbf{m}^i, a^i, r^i, o'^i, \mathbf{m}'^i)$ into its replay memory $D^i$. To train the Q-function, agent $i$ samples a random mini-batch of transitions from $D^i$, denoted by $B^i$. Message-dropout is performed independently for each transition, and the message-dropout-performed observation and its transition are given by
$$\tilde{\mathbf{m}}^i = (\epsilon_j \, m^{j \to i})_{j \neq i}, \qquad \tilde{\mathbf{m}}'^i = (\epsilon_j \, m'^{j \to i})_{j \neq i},$$
where the scalar value $\epsilon_j \sim \text{Bernoulli}(1-p)$. Note that the same binary mask $(\epsilon_j)_{j \neq i}$ is used to define $\tilde{\mathbf{m}}^i$ and $\tilde{\mathbf{m}}'^i$. Then, the Q-function is updated by minimizing the loss
$$L(\theta^i) = \frac{1}{|B^i|} \sum_{B^i} \left( y^i - Q_i(o^i, \tilde{\mathbf{m}}^i, a^i; \theta^i) \right)^2,$$
where $y^i = r^i + \gamma \, \tilde{Q}_i\!\left(o'^i, \mathbf{m}'^i, \arg\max_{a'} Q_i(o'^i, \tilde{\mathbf{m}}'^i, a'; \theta^i); \theta^{i-}\right)$. Here, $\tilde{Q}_i$ is the Q-function parameterized by the neural network whose outgoing weights of the message input units are multiplied by $1-p$. Note that we use $\tilde{\mathbf{m}}'^i$ to predict the next action when evaluating the target Q-value $y^i$. Finally, the policy is given by
$$\pi_i(o^i, \mathbf{m}^i) = \arg\max_{a^i} \tilde{Q}_i(o^i, \mathbf{m}^i, a^i; \theta^i).$$
In the training phase with message-dropout, agent $i$ drops out some of the other agents' messages in $\mathbf{m}^i$, while keeping its own observation. As a result, the input space of the Q-network of agent $i$ is projected onto a different subspace (of the original full input space) that always includes its own observation space at each training time, since the dropout masks change at each training time. The input spaces of the four Q-networks in the example of Fig. 1 are $\mathcal{O}_1 \times \mathcal{M}_{2 \to 1} \times \mathcal{M}_{3 \to 1}$, $\mathcal{O}_1 \times \{\mathbf{0}\} \times \mathcal{M}_{3 \to 1}$, $\mathcal{O}_1 \times \mathcal{M}_{2 \to 1} \times \{\mathbf{0}\}$, and $\mathcal{O}_1 \times \{\mathbf{0}\} \times \{\mathbf{0}\}$ (all include the agent's own observation space $\mathcal{O}_1$). In the general case of $N$ agents, message-dropout samples the network to be trained from $2^{N-1}$ differently-thinned networks which always include the agent's own observation. Note that the Q-network in which all received messages are retained is the Q-network of simple DCC, and the Q-network in which all observations from other agents are dropped is the Q-network of fully decentralized DDQN without communication. Thus, the $2^{N-1}$ differently-thinned networks include the Q-networks of simple DCC and fully decentralized DDQN. Dropping out messages in the training phase and multiplying the outgoing weights of the message input units by $1-p$ in the test phase yields an approximate averaging over the ensemble of those networks. Note that simple DCC and fully decentralized DDQN are the special cases of the dropout rates $p = 0$ and $p = 1$, respectively. Thus, for $0 < p < 1$, the proposed scheme realizes some network between these two extremes.
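The ensemble-averaging interpretation can be checked numerically in the linear case: scaling the message inputs by $1-p$ reproduces the exact expectation of a linear Q-network's output over the random blockwise masks. For nonlinear networks the averaging is only approximate, as with standard dropout; the weights and inputs below are illustrative:

```python
import numpy as np

# For a linear layer, the expectation over blockwise masks of
# w . (o, eps2*m2, eps3*m3), with eps_j ~ Bernoulli(1-p), equals
# w . (o, (1-p)*m2, (1-p)*m3), i.e. the test-time scaling rule.
rng = np.random.default_rng(0)
w = rng.normal(size=6)                                   # illustrative weights
o, m2, m3 = (rng.normal(size=2) for _ in range(3))       # own obs + 2 messages
p = 0.5

# Exact expectation over the 4 mask configurations (each block kept w.p. 1-p).
exp_out = 0.0
for e2 in (0.0, 1.0):
    for e3 in (0.0, 1.0):
        prob = ((1 - p) if e2 else p) * ((1 - p) if e3 else p)
        exp_out += prob * (w @ np.concatenate([o, e2 * m2, e3 * m3]))

scaled_out = w @ np.concatenate([o, (1 - p) * m2, (1 - p) * m3])
print(abs(exp_out - scaled_out) < 1e-12)  # scaling equals the exact average here
```

With $p = 0.5$, all four thinned networks are weighted equally in the average, which is the uniform-ensemble case discussed in the ablation study.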
Message-dropout can also be applied to the framework of centralized training with decentralized execution, in particular to the setting in which each agent uses additional information from other agents during training as the input of its network. For example, in MADDPG, the centralized critic takes all agents' observations and actions as input, and hence the input space of the centralized critic for each agent increases exponentially as the number of agents increases. The proposed technique, message-dropout, is applied to the training phase of the centralized critic to address the aforementioned problem. The centralized critic with message-dropout applied is trained to minimize the loss function
$$L(\phi^i) = \mathbb{E}_{(\mathbf{o}, a^1, \ldots, a^N, r^i, \mathbf{o}') \sim D}\left[ \left( y^i - Q_i(o^i, \tilde{\mathbf{o}}^{-i}, a^1, \ldots, a^N; \phi^i) \right)^2 \right],$$
where $y^i = r^i + \gamma \, \tilde{Q}_i(\mathbf{o}', a'^1, \ldots, a'^N; \phi^{i-}) \big|_{a'^j = \mu'_j(o'^j)}$ and $\tilde{\mathbf{o}}^{-i} = (\epsilon_j \, o^j)_{j \neq i}$. Note that $\epsilon_j \sim \text{Bernoulli}(1-p)$, and the same binary mask is used to obtain $\tilde{\mathbf{o}}^{-i}$ and $\tilde{\mathbf{o}}'^{-i}$, as in DCC-MD. Then, the policy for agent $i$ is trained by maximizing the objective $J(\theta^i)$, and the gradient of the objective can be written as
$$\nabla_{\theta^i} J(\theta^i) = \mathbb{E}_{\mathbf{o} \sim D}\left[ \nabla_{\theta^i} \mu_{\theta^i}(o^i) \, \nabla_{a^i} \tilde{Q}_i(o^i, \tilde{\mathbf{o}}^{-i}, a^1, \ldots, a^N; \phi^i) \big|_{a^i = \mu_{\theta^i}(o^i)} \right].$$
We refer to MADDPG with message-dropout applied as MADDPG-MD.
In this section, we provide numerical results to evaluate the proposed algorithm in the aforementioned two scenarios for MADRL with information exchange. First, we compare DCC-MD with simple DCC and Fully Decentralized Control (FDC) in two environments: pursuit and cooperative navigation. Second, we compare MADDPG-MD with MADDPG and independent DDPG (simply DDPG) in the waterworld environment. Then, we provide in-depth ablation studies to understand the behavior of message-dropout depending on various parameters. Finally, we investigate DCC-MD in unstable environments in which some communication links between agents can be broken in the execution phase.
Although some compression may be applied, for simplicity we here assume that each agent's message is its observation itself, which is shown to be optimal when the communication cost is ignored in the framework of DCC [Goldman and Zilberstein2004]. Hence, $m^{j \to i} = o^j$ for all agents $i$ and $j$, and the policy function becomes $\pi_i(o^i, \mathbf{o}^{-i})$, where $\mathbf{o}^{-i} = (o^j)_{j \neq i}$. (A brief study on message-dropout with message generation based on an auto-encoder applied to the raw observation is given in the supplementary file of this paper. It is seen that a similar performance improvement is achieved by message-dropout in the case of compressed messages. Hence, message-dropout can be applied on top of message compression for MADRL with message communication.)
The pursuit game is a standard task for multi-agent systems [Vidal et al.2002]. The environment is made up of a two-dimensional grid and consists of pursuers and evaders. The goal of the game is to capture all evaders as fast as possible by training the agents (i.e., the pursuers). Initially, all the evaders are at the center of the two-dimensional grid, and each evader randomly and independently chooses one of five actions at each time step: move North, East, West, South, or Stay. (An evader stays if there exists a pursuer or a map boundary at the position to which it is about to move.) Each pursuer is initially located at a random position on the map and has five possible actions: move North, East, West, South, or Stay. When the four sides of an evader are surrounded by pursuers or map boundaries, the evader is removed and the pursuers who capture the evader receive a reward. All pursuers receive a reward for each time step and a negative reward if they hit the map boundary (the latter negative reward is to promote exploration). An episode ends when all the evaders are captured or a maximum number of time steps elapses. As in [Gupta, Egorov, and Kochenderfer2017], each pursuer observes its surroundings, which consist of map boundary, evader(s), or other pursuer(s). We assume that each pursuer can observe up to a certain distance in the four directions. Then, the observed information of each pursuer can be represented by an observation window (which is the observation of each agent): a window detecting other pursuer(s), a window detecting evader(s), and a window detecting the map boundary. For the game of pursuit, we set the game parameters and simulate two cases with different numbers of pursuers, with the map size scaled accordingly.
Cooperative navigation, which was introduced in [Mordatch and Abbeel2017], consists of agents and landmarks. The goal of this environment is to occupy all of the landmarks while avoiding collisions among agents. The observation of each agent is the concatenation of its position and velocity, the locations of the landmarks, and the locations of other agents. Since we consider a partially observable setting, we assume that each agent observes the locations of other agents only if they are within a certain distance. Each agent receives a negative reward given by the negative of the minimum of the distances from the agent to the landmarks, and receives an additional negative reward if a collision among agents occurs. In this environment, we simulate two cases with different numbers of agents.
Waterworld is an extension of the pursuit environment to a continuous domain [Gupta, Egorov, and Kochenderfer2017]. The environment is made up of a two-dimensional space and consists of pursuers, food targets, poison targets, and one obstacle. The goal of the environment is to capture as many food targets as possible within a given episode of time steps while avoiding poison targets. In order to make the game more cooperative, a minimum number of agents need to cooperate to catch a food target. Each agent applies two-dimensional physical actions to the environment and has an observation which consists of its position and information from 25 range-limited sensors. The sensors of each agent provide the distances and velocities of other agents, food targets, and poison targets. The agents receive a reward when they capture a food target and are penalized by a negative reward when they encounter a poison target. To promote exploration, a reward is given to an agent if the agent touches a food target. The agents also receive an action-penalty reward defined as the square norm of the action. In this environment, we set the game parameters and simulate two cases with different numbers of pursuers.
The three environments are briefly illustrated in Fig. 2.
4.2 Model Architecture
Instead of using the concatenation of the agent's own observation and the received messages as the input of the Q-function, we use an architecture for the Q-function that emphasizes the agent's own observation. The proposed neural network architecture for the Q-function of agent $i$ is represented by
$$Q_i(o^i, \mathbf{m}^i, a^i) = f_Q\big( f_o(o^i), f_m(\mathbf{m}^i) \big),$$
where $f_o$ and $f_m$ are the neural networks that extract features of the agent's own observation and the received messages, respectively, and $f_Q$ is the neural network that produces the expected return by using the outputs of $f_o$ and $f_m$. In MADDPG, we replace $\mathbf{m}^i$ with the concatenation of $\mathbf{o}^{-i}$ and $\mathbf{a}^{-i}$, where $\mathbf{o}^{-i} = (o^j)_{j \neq i}$ and $\mathbf{a}^{-i} = (a^j)_{j \neq i}$. Note that $f_o$, $f_m$, and $f_Q$ depend on the task, since the input and action dimensions of each task are different. The detailed structures of $f_o$, $f_m$, and $f_Q$ are explained in the supplementary material in a separate file.
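A minimal forward-pass sketch of this two-stream architecture is given below. The layer sizes, random weights, and ReLU choice are illustrative placeholders, not the paper's actual configuration, which is in the supplementary material:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# Two-stream Q-network: separate feature extractors for the agent's own
# observation and for the received messages, merged by a final network that
# outputs one Q-value per discrete action.
class TwoStreamQNet:
    def __init__(self, obs_dim, msg_dim, n_actions, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.Wo = rng.normal(size=(hidden, obs_dim)) * 0.1        # own-observation stream
        self.Wm = rng.normal(size=(hidden, msg_dim)) * 0.1        # message stream
        self.Wq = rng.normal(size=(n_actions, 2 * hidden)) * 0.1  # merge -> Q-values

    def q_values(self, obs, msgs):
        h = np.concatenate([relu(self.Wo @ obs), relu(self.Wm @ msgs)])
        return self.Wq @ h

net = TwoStreamQNet(obs_dim=4, msg_dim=8, n_actions=5)
q = net.q_values(np.ones(4), np.zeros(8))  # zero messages, e.g. all dropped
print(q.shape)                             # (5,)
```

Keeping the own-observation stream separate is what lets the architecture weight the agent's own observation more heavily than the message block, regardless of how many agents contribute messages.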
For the pursuit game, we compared DCC-MD with simple DCC and FDC by varying the dropout rate $p$. Note that DCC-MD with $p = 0$ corresponds to simple DCC, whereas DCC-MD with $p = 1$ corresponds to FDC. The performance of each algorithm was measured by the number of evader catches in 500 time steps after training. Figs. 2(a) and 2(b) show the number of evader catches (in 500 time steps after training) averaged over episodes and 7 random seeds, with respect to the drop rate. It is seen that the performance improves as the dropout rate increases initially, and then deteriorates as the dropout rate further increases beyond a certain point. In the considered tasks, the best dropout rate lies at an intermediate value. It is seen that DCC-MD with a proper drop rate significantly outperforms both simple DCC and FDC. Note that for the larger number of agents, simple DCC has even worse performance than FDC. This is because simple DCC does not learn properly due to the large state space when the number of agents is large.
In this environment, we compared DCC-MD with two dropout rates against simple DCC and FDC. Figs. 2(c) and 2(d) show the learning curves for the two considered numbers of agents, respectively. The y-axis is the sum of all agents' rewards averaged over random seeds, and the x-axis is the time step. It is seen that DCC-MD with both considered dropout rates outperforms simple DCC and FDC.
In the waterworld environment, we considered (independent) DDPG, vanilla MADDPG, MADDPG, and MADDPG-MD with two drop rates, and compared their performances. Here, MADDPG is the modified version of vanilla MADDPG to which the proposed network architecture described in Section 4.2 is applied. Figs. 2(e) and 2(f) show the learning curves of the algorithms for the two considered numbers of agents, respectively. The y-axis is the number of food-target catches averaged over random seeds, and the x-axis is the time step. It is seen that MADDPG-MD outperforms both (independent) DDPG and MADDPG. Note that fully decentralized (independent) DDPG has the fastest learning speed at the initial stage due to its small input dimension, but its performance degrades as the time step goes on because of the lack of cooperation. It is noteworthy that MADDPG-MD almost achieves the initial learning speed of (independent) DDPG while yielding far better performance at the steady state.
4.4 Ablation Studies
With the verification of the performance gain of the message-dropout technique, we performed in-depth ablation studies regarding the technique with respect to the four key aspects of the technique: 1) the dropout rate, 2) block-wise dropout versus element-wise dropout, 3) retaining agent’s own observation without dropout, and 4) the model architecture.
As mentioned in Section 2, we can view message-dropout as generating an ensemble of differently-thinned networks and averaging these thinned networks. From this perspective, the dropout rate determines the distribution over the thinned networks. For example, all the thinned networks are used uniformly to train the ensemble Q-network if the dropout rate is $0.5$. As the dropout rate increases, the overall input space shrinks in effect and the portion of the own observation space in the overall input space becomes large, since we apply message-dropout only to the message inputs from other agents. Hence, it is expected that the learning speed increases, especially for a large number of agents, as the dropout rate increases. Figs. 3(a) and 3(b) show the learning performance of the algorithms in the training phase. The x-axis is the current time step, and the y-axis is the number of evader catches within a fixed number of time steps. As expected, it is seen that the learning speed increases as the dropout rate increases. This behavior is clearly seen in Fig. 3(b), where the number of agents is larger. Note that message-dropout with a proper drop rate achieves gains in both the learning speed and the steady-state performance. Figs. 2(a) and 2(b) show the corresponding performance in the execution phase after training. It is seen that drop rates of 0.2 to 0.5 yield similar performance, and the performance is not very sensitive to the drop rate within this range.
Block-wise dropout versus element-wise dropout
We compared message-dropout with standard dropout, which drops out the messages of other agents in an element-wise manner while retaining each agent's own observation. Fig. 4(a) shows that message-dropout yields better performance than the standard element-wise dropout. The difference between message-dropout and standard dropout lies in the subspaces onto which the input space is projected. Message-dropout projects the input space onto subspaces which always include the agent's own observation space, whereas standard dropout projects the input space onto a larger family of subspaces which contains the subspaces produced by message-dropout.
Retaining agent’s own observation without dropout
We compared message-dropout with full message-dropout, which applies message-dropout to each agent's own observation as well as to the messages of other agents. Fig. 4(a) shows that full message-dropout increases the training time, similarly to the known fact that dropout in general increases the training time [Srivastava et al.2014]. Whereas full message-dropout yields slower training than simple DCC, message-dropout makes training faster than simple DCC. Note that standard element-wise dropout without dropping the agent's own observation also yields faster training than simple DCC, but full standard element-wise dropout fails to train. Hence, it is important to retain each agent's own observation without dropping it out when dropout is applied to MADRL with information exchange.
We used the neural network architecture of the Q-function described in Section 4.2 for all environments. In pursuit with the larger number of agents and in waterworld, learning failed with the simple model architecture that uses the concatenation of each agent's own observation and the received messages as input. Hence, the proposed model architecture is advantageous when the input space of the Q-function is large. Note that the proposed model architecture has more layers for the agent's own observation than for the received messages from other agents, as shown in Fig. 4(b), and hence features of the more important own observation are well extracted. The interested reader is referred to the supplementary file.
4.5 Test in The Unstable Environment
Up to now, we have assumed that communication among agents is stable without errors in both the training and execution phases. Now, we consider the situation in which communication is stable in the training phase but unstable in the actual execution phase. Such situations occur when training is done in a controlled environment but execution is performed in an uncontrolled environment after real deployment. We considered two communication-unstable cases: in case 1, a randomly chosen half of the connections among agents are broken, and in case 2, all connections among agents are broken. When the communication between two agents is broken, we use a zero vector instead of the message received from the other agent. Note that the performance of FDC does not change since it requires no communication.
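The zero-vector substitution at broken links can be sketched as follows; the observation and message dimensions and values are illustrative:

```python
import numpy as np

# Execution with unstable communication: messages arriving over broken links
# are replaced by zero vectors before being fed to the agent's Q-network,
# matching the zeroed message blocks seen during message-dropout training.
def assemble_input(own_obs, messages, link_up):
    blocks = [own_obs]
    for m, up in zip(messages, link_up):
        blocks.append(m if up else np.zeros_like(m))  # broken link -> zero block
    return np.concatenate(blocks)

o = np.array([0.1, 0.2])
msgs = [np.array([1.0, 1.0]), np.array([2.0, 2.0])]
x = assemble_input(o, msgs, link_up=[True, False])    # second link is broken
print(x)                                              # own obs kept, broken link zeroed
```

Because training with message-dropout already exposes the Q-network to inputs with zeroed message blocks, such execution-time inputs stay within the distribution the network was trained on.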
Fig. 6 shows the average number of evader catches in the two considered cases in pursuit. It is seen that DCC-MD outperforms both simple DCC and FDC when some communication links are broken but not all links are broken. This means that message-dropout in the training phase makes learning robust against communication errors, and DCC-MD can still outperform FDC even with other agents' messages arriving less frequently. On the other hand, when communication is too unstable, DCC-MD cannot recover the communication loss (although it is still better than simple DCC), and FDC is better in this case. Hence, message-dropout in the training phase can be useful in situations in which communication among agents is erroneous with a certain probability in the real execution phase.
5 Related Work
Recent works in MADRL focus on how to improve the performance compared to FDC composed of independently learning agents. To harness the benefit from other agents, [Foerster et al.2016] proposed DIAL, which learns the communication protocol between agents by passing gradients from agent to agent. [Foerster et al.2018] proposed the multi-agent actor-critic method called COMA, which uses a centralized critic to train decentralized actors and a counterfactual baseline to address the multi-agent credit assignment problem.
In most MADRL algorithms such as those mentioned above, the input space of the network (policy, critic, etc.) grows exponentially with the number of agents. Thus, we expect that message-dropout can be combined with those algorithms to yield better performance. To handle the increased dimension in MADRL, [Yang et al.2018] proposed mean field reinforcement learning in which the Q-function is factorized by using the local action interaction and approximated by using the mean field theory. Whereas mean field reinforcement learning handles the action space within the input space consisting of action and observation, message-dropout can handle not only the action space but also the observation space.
In this paper, we have proposed the message-dropout technique for MADRL. The proposed message-dropout technique effectively handles the increased input dimension in MADRL with information exchange, where each agent uses the information of other agents to train its policy. We have provided ablation studies on the performance of message-dropout with respect to various aspects of the technique. The studies show that message-dropout with proper dropout rates significantly improves performance in terms of the training speed and the steady-state performance. Furthermore, in the scenario of decentralized control with communication, message-dropout makes learning robust against communication errors in the execution phase. Although we have assumed that communication among all agents is allowed, message-dropout can also be applied to scenarios in which communication is available only between limited pairs of agents.
This work was supported in part by the ICT R&D program of MSIP/IITP (2016-0-00563, Research on Adaptive Machine Learning Technology Development for Intelligent Autonomous Digital Companion) and in part by the National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science and ICT) (NRF-2017R1E1A1A03070788).
- [Baldi and Sadowski2013] Baldi, P., and Sadowski, P. J. 2013. Understanding dropout. In Advances in neural information processing systems, 2814–2822.
- [Buşoniu, Babuška, and De Schutter2010] Buşoniu, L.; Babuška, R.; and De Schutter, B. 2010. Multi-agent reinforcement learning: An overview. In Innovations in multi-agent systems and applications-1. Springer. 183–221.
- [Foerster et al.2016] Foerster, J.; Assael, I. A.; de Freitas, N.; and Whiteson, S. 2016. Learning to communicate with deep multi-agent reinforcement learning. In Lee, D. D.; Sugiyama, M.; Luxburg, U. V.; Guyon, I.; and Garnett, R., eds., Advances in Neural Information Processing Systems 29. Curran Associates, Inc. 2137–2145.
- [Foerster et al.2017] Foerster, J.; Nardelli, N.; Farquhar, G.; Afouras, T.; Torr, P. H. S.; Kohli, P.; and Whiteson, S. 2017. Stabilising experience replay for deep multi-agent reinforcement learning. In Precup, D., and Teh, Y. W., eds., Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, 1146–1155. International Convention Centre, Sydney, Australia: PMLR.
- [Foerster et al.2018] Foerster, J.; Farquhar, G.; Afouras, T.; Nardelli, N.; and Whiteson, S. 2018. Counterfactual multi-agent policy gradients. In Thirty-Second AAAI Conference on Artificial Intelligence.
- [Goldman and Zilberstein2004] Goldman, C. V., and Zilberstein, S. 2004. Decentralized control of cooperative systems: Categorization and complexity analysis. Journal of Artificial Intelligence Research 22:143–174.
- [Gupta, Egorov, and Kochenderfer2017] Gupta, J. K.; Egorov, M.; and Kochenderfer, M. 2017. Cooperative multi-agent control using deep reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems, 66–83. Springer.
- [Hara, Saitoh, and Shouno2016] Hara, K.; Saitoh, D.; and Shouno, H. 2016. Analysis of dropout learning regarded as ensemble learning. In International Conference on Artificial Neural Networks, 72–79. Springer.
- [Hausknecht and Stone2015] Hausknecht, M., and Stone, P. 2015. Deep recurrent q-learning for partially observable mdps. In AAAI Fall Symposium on Sequential Decision Making for Intelligent Agents (AAAI-SDMIA15).
- [Hinton and Salakhutdinov2006] Hinton, G. E., and Salakhutdinov, R. R. 2006. Reducing the dimensionality of data with neural networks. science 313(5786):504–507.
- [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [Lowe and Mordatch2017] Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; and Mordatch, I. 2017. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems.
- [Mordatch and Abbeel2017] Mordatch, I., and Abbeel, P. 2017. Emergence of grounded compositional language in multi-agent populations. arXiv preprint arXiv:1703.04908.
- [Srivastava et al.2014] Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1):1929–1958.
- [Tampuu et al.2017] Tampuu, A.; Matiisen, T.; Kodelja, D.; Kuzovkin, I.; Korjus, K.; Aru, J.; Aru, J.; and Vicente, R. 2017. Multiagent cooperation and competition with deep reinforcement learning. PloS one 12(4):e0172395.
- [Tan1993] Tan, M. 1993. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the tenth international conference on machine learning, 330–337.
- [Van Hasselt, Guez, and Silver2016] Van Hasselt, H.; Guez, A.; and Silver, D. 2016. Deep reinforcement learning with double q-learning. In AAAI, volume 16, 2094–2100.
- [Vidal et al.2002] Vidal, R.; Shakernia, O.; Kim, H. J.; Shim, D. H.; and Sastry, S. 2002. Probabilistic pursuit-evasion games: theory, implementation, and experimental evaluation. IEEE Transactions on Robotics and Automation 18(5):662–669.
- [Yang et al.2018] Yang, Y.; Luo, R.; Li, M.; Zhou, M.; Zhang, W.; and Wang, J. 2018. Mean field multi-agent reinforcement learning. In Dy, J., and Krause, A., eds., Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, 5567–5576. Stockholmsmässan, Stockholm Sweden: PMLR.
8 Supplementary Material
8.1 Model Architecture and Training Details
In this supplementary material, we describe the detailed structure of the Q-function for DCC and MADDPG and the training procedure for each task. The Q-function neural network structure for DCC-MD is the same as that of DCC, except that block-dropout is applied in the case of DCC-MD. Similarly, the Q-function neural network structure for MADDPG-MD is the same as that of MADDPG, except that block-dropout is applied in the case of MADDPG-MD. As mentioned in the main paper, our neural network architecture of the Q-function can be expressed in the form
Q(o, m) = g(f_o(o), f_m(m)),
where f_o, f_m, and g are sub-neural-networks designed appropriately for each task: f_o processes the agent's own observation o, f_m processes the messages m received from the other agents, and g combines the two outputs.
Fig. 8 shows the Q-function neural network architecture for DCC (or DCC-MD) and FDC used in the pursuit game with N agents.
In pursuit, the observation, which consists of three 2D windows, is flattened into 147 input units.
f_o is a two-layer multi-layer perceptron (MLP) with 64 hidden units and produces a 48-dimensional output. f_m is a single-layer perceptron and produces a 96-dimensional output. Since the action is discrete in pursuit, the action is not an input of g but rather indexes the output of g, and g is a two-layer MLP with 32 hidden units. The activation functions of f_o, f_m, and g are ReLU except for the final linear layer of g. In FDC, i.e., the fully decentralized DDQN, each agent has a 4-layer MLP whose activation functions are ReLU except for the final linear layer. The four layers have 64, 48, 32, and 5 units, respectively. In the case of the larger number of agents, only f_m is changed, to a single-layer perceptron producing a 128-dimensional output.
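The compositional Q-network described above can be sketched as follows. This is a minimal NumPy illustration of the forward pass only (no training); the symbols f_o, f_m, and g follow the description above, and the assumption of N = 8 agents (so each agent receives 7 messages, each a 147-dimensional observation) is illustrative, taken from the autoencoder experiment in Section 8.2.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_layer(x, w, b, relu=True):
    """Single fully connected layer with optional ReLU activation."""
    y = x @ w + b
    return np.maximum(y, 0.0) if relu else y

def init(fan_in, fan_out):
    """Random weights and zero biases (initialization scheme is illustrative)."""
    return rng.normal(0, 0.1, (fan_in, fan_out)), np.zeros(fan_out)

# f_o: two-layer MLP, 147 -> 64 -> 48 (own observation)
W1, b1 = init(147, 64); W2, b2 = init(64, 48)
# f_m: single-layer perceptron, messages from 7 other agents -> 96
Wm, bm = init(147 * 7, 96)
# g: two-layer MLP, (48 + 96) -> 32 -> 5 Q-values (one per discrete action)
Wg1, bg1 = init(48 + 96, 32); Wg2, bg2 = init(32, 5)

def q_values(obs, messages):
    h_o = mlp_layer(mlp_layer(obs, W1, b1), W2, b2)       # f_o(o)
    h_m = mlp_layer(messages, Wm, bm)                     # f_m(m)
    h = mlp_layer(np.concatenate([h_o, h_m]), Wg1, bg1)   # g combines both
    return mlp_layer(h, Wg2, bg2, relu=False)             # final linear layer

q = q_values(rng.normal(size=147), rng.normal(size=147 * 7))
print(q.shape)  # (5,)
```

Because the final layer is linear and outputs one value per discrete action, the greedy action is simply `np.argmax(q)`.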
All algorithms, FDC, DCC, and DCC-MD, used the epsilon-greedy policy with epsilon annealed from 1 to 0.02 over the initial time steps and fixed at 0.02 thereafter. Although it is known that experience replay generally harms performance in MADRL [Foerster et al.2017], we used experience replay because we observed a performance improvement with it on our tasks. Each agent used an experience replay memory, and we used the Adam optimizer [Kingma and Ba2014]. For all tasks, we updated the Q-function every 4 time steps.
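The annealed epsilon-greedy schedule described above can be sketched as a simple piecewise-linear function; the anneal horizon of 100,000 steps below is a placeholder, not the value used in the experiments.

```python
def epsilon(t, anneal_steps=100_000, eps_start=1.0, eps_end=0.02):
    """Linearly anneal epsilon from eps_start to eps_end over anneal_steps
    initial time steps, then hold it fixed at eps_end."""
    if t >= anneal_steps:
        return eps_end
    frac = t / anneal_steps
    return eps_start + frac * (eps_end - eps_start)

print(epsilon(0))        # 1.0
print(epsilon(200_000))  # 0.02
```

At each step, the agent then takes a uniformly random action with probability `epsilon(t)` and the greedy action otherwise.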
The basic architecture of the Q-function neural network used for cooperative navigation is the same as that shown in Fig. 8, except for the number of nodes in each layer.
f_o is a two-layer MLP with 64 hidden units and produces a 64-dimensional output. f_m is a single-layer perceptron and produces a 64-dimensional output. g is a two-layer MLP with 32 hidden units. The activation functions of f_o, f_m, and g are ReLU except for the final linear layer of g. In FDC, i.e., the fully decentralized DDQN, each agent has a 4-layer MLP whose activation functions are ReLU except for the final linear layer. The four layers have 64, 64, 32, and 5 units, respectively.
All algorithms (FDC, DCC, and DCC-MD) used the epsilon-greedy policy with epsilon annealed from 1 to 0.02 over the initial time steps and fixed at 0.02 thereafter. Each agent used an experience replay memory, and we used the Adam optimizer. For all tasks, we updated the Q-function every 4 time steps.
Fig. 9 shows the neural network architecture used for the waterworld game. MADDPG and MADDPG-MD have the same neural network architecture, while block-dropout is applied to the message input units from other agents in the case of MADDPG-MD.
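Block-dropout on the message input units can be sketched as follows. This is a minimal NumPy illustration (the helper name, dropout rate p, and message dimensions are illustrative); one Bernoulli draw is made per sending agent so that an entire message block is kept or dropped together, and the execution-time rescaling by (1 - p) follows the classical dropout convention of [Srivastava et al.2014], which is an assumption here.

```python
import numpy as np

def block_dropout(messages, p=0.2, training=True, rng=None):
    """messages: array of shape (num_senders, msg_dim).
    During training, zero out each sender's entire message block with
    probability p. At execution, keep all messages and rescale by (1 - p)
    so the expected input magnitude matches training."""
    if training:
        rng = rng or np.random.default_rng()
        keep = rng.random(messages.shape[0]) >= p  # one draw per block
        return messages * keep[:, None]
    return messages * (1.0 - p)

msgs = np.ones((4, 3))  # 4 senders, 3-dimensional messages
dropped = block_dropout(msgs, p=0.5, training=True,
                        rng=np.random.default_rng(0))
scaled = block_dropout(msgs, p=0.5, training=False)
print(scaled[0])  # [0.5 0.5 0.5]
```

Note that dropping a whole block, rather than individual units as in standard dropout, is what lets the network learn to produce reasonable Q-values even when some agents' messages are entirely absent.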
The decentralized actor in MADDPG is a two-layer MLP with 64 hidden units. The centralized critic is expressed by f_o, f_m, and g as mentioned previously. f_o is a two-layer MLP with 200 hidden units and produces a 100-dimensional output. Here, the action is included twice, in the input layer and in the hidden layer of f_o. f_m is a single-layer perceptron and produces a 100-dimensional output. g is a two-layer MLP with 64 hidden units. The activation functions of f_o, f_m, and g are ReLU except for the final linear layer of g.
In the waterworld environment, all agents share the parameters of the critic and the actor to facilitate learning. Each agent used an experience replay memory and a Gaussian noise process for efficient exploration. We used the Adam optimizer. We updated the centralized critic every 5 time steps and the decentralized actors every 10 time steps.
8.2 Compressing Messages Using an Autoencoder
In the main paper, we assumed for the purpose of the experiments that the messages are the observations of other agents. Message-dropout can also be applied to DCC under the more practical assumption that the communication load is limited. We compress each agent's observation using the standard autoencoder introduced in [Hinton and Salakhutdinov2006] and then use the compressed representation as the message. In the pursuit environment (N=8), we consider simple DCC and DCC-MD with the autoencoder applied. Fig. 7 shows that the learning speed increases for both simple DCC and DCC-MD with dropout rate 0.2. Note that DCC-MD still performs better than simple DCC when the messages are compressed versions of the observations. However, compressing the messages with the autoencoder degrades the steady-state performance.
Fig. 10 shows the autoencoder architecture used in the pursuit game with N = 8 agents. As shown in Fig. 10, the autoencoder consists of a 2-layer MLP encoder and a 2-layer MLP decoder, and the 147-dimensional observation is compressed into a 32-dimensional message. We pretrained the autoencoder on collected samples and then used the encoder to compress the observations during training.
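The encoder/decoder pair described above can be sketched as follows. This is a minimal NumPy forward-pass illustration: only the 147-to-32 bottleneck and the 2-layer encoder/decoder structure come from the text, while the hidden width of 64 and the initialization are placeholders, and the reconstruction-loss pretraining is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, w, b, relu=True):
    """Single fully connected layer with optional ReLU activation."""
    y = x @ w + b
    return np.maximum(y, 0.0) if relu else y

def init(n_in, n_out):
    return rng.normal(0, 0.1, (n_in, n_out)), np.zeros(n_out)

# 2-layer MLP encoder: 147 -> 64 -> 32 (hidden width 64 is illustrative)
We1, be1 = init(147, 64); We2, be2 = init(64, 32)
# 2-layer MLP decoder: 32 -> 64 -> 147
Wd1, bd1 = init(32, 64);  Wd2, bd2 = init(64, 147)

def encode(obs):
    """Compress a 147-dim observation into a 32-dim message."""
    return layer(layer(obs, We1, be1), We2, be2, relu=False)

def decode(msg):
    """Reconstruct the observation; used only during pretraining."""
    return layer(layer(msg, Wd1, bd1), Wd2, bd2, relu=False)

obs = rng.normal(size=147)
msg = encode(obs)    # 32-dimensional message sent to the other agents
recon = decode(msg)  # reconstruction target during pretraining
print(msg.shape, recon.shape)  # (32,) (147,)
```

After pretraining, only the encoder is kept in the loop: each agent broadcasts `encode(obs)` instead of its raw observation, cutting the per-message communication load from 147 to 32 values.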