Multi-agent reinforcement learning (MARL) has long been a go-to tool in complex robotic and strategic domains [RoboCup2019, OpenAI2019]. A key difficulty faced by a group of learning agents in such domains is the need to exploit the available communication resources efficiently and in a timely manner. Communicated information enables action and belief correlation that benefits the group’s activity. While initially only pre-constructed communication protocols [Melo, Spaan, and Witwicki2011, Maravall, de Lope, and Dominguez2013, Zhang and Lesser2013] were used, the advent of deep learning makes it possible to treat communication as part of the learning process [Foerster et al.2016, Sukhbaatar, Fergus, and others2016, Jiang and Lu2018, Singh, Jain, and Sukhbaatar2018].
Nonetheless, limited bandwidth remains a common obstacle to applying MARL algorithms to real-world multi-agent systems, where agents need to carefully choose what, when, and with whom to communicate. In fact, in existing works [Zhang and Lesser2013, Mao et al.2019, Kim et al.2019], the limited-bandwidth constraint in MARL is not even well defined; it is simply represented either by the topology of the agents’ relationship [Zhang and Lesser2013, Mao et al.2019] or by the number of bits used in the messages [Kim et al.2019]. This simplification is a clear detriment in dynamic or limited-bandwidth scenarios such as autonomous driving [Shalev-Shwartz, Shammah, and Shashua2016], search and rescue [Nagatani et al.2013, Yuan, Liu, and Zhang2017], and space/deep sea exploration [Gao and Chien2017, Cressey2015]. In all of the above, the contextual content and the impact of the transmitted information matter.
To address this quality-of-information requirement, inspired by the variational information bottleneck method [Tishby, Pereira, and Bialek2000, Alemi et al.2016], we propose a general regularization method for learning informative communication protocols under the limited-bandwidth constraint, named Informative Multi-Agent Communication (IMAC). First, because the limited-bandwidth constraint is vaguely defined in existing works, we clarify it by showing that limited bandwidth translates into a constraint on the entropy of the communicated messages. In more detail, from the source coding theorem [Shannon1948] and the Nyquist criterion [Freeman2004], we state that in a noiseless channel, when a $K$-ary communication system with bandwidth $B$ transmits $n$ symbols/messages per second, the entropy of the messages $H(M)$ is limited by the bandwidth according to $H(M) \le \frac{2B\log_2 K}{n}$. Thus, agents should generate low-entropy messages to satisfy the limited-bandwidth constraint.
Additionally, because real-life source coding methods and communication protocols vary, a bitstream can carry different amounts of information in different situations. Hence we use entropy as a general measure and clip the messages’ variance to simulate the limited-bandwidth constraint.
IMAC utilizes the information bottleneck method to control the entropy of the messages. Specifically, IMAC applies the variational information bottleneck to the communication channel by viewing the messages as latent variables and approximating their posterior distribution. By regularizing the mutual information between the channel’s inputs (the internal features extracted from agents) and the channel’s outputs (the messages), we constrain the content of the messages and learn informative communication protocols, which convey low-entropy and useful messages.
We conduct extensive experiments in chosen environments: cooperative navigation and predator-prey. Results show that IMAC can convey low-entropy messages, enable effective communication among agents under the limited-bandwidth constraint, and lead to faster convergence as compared with MADDPG, MADDPG-M and AMP.
There are two lines of related works: communication in deep multi-agent reinforcement learning and the information bottleneck method in reinforcement learning.
Methods for large-scale or resource-limited communication have also been explored, for instance, learning communication protocols by using specific networks as communication channels, or scheduling multi-agent communication via various mechanisms. However, these methods are susceptible to failure under the limited-bandwidth constraint. The former, such as DIAL [Foerster et al.2016], CommNet [Sukhbaatar, Fergus, and others2016], BiCNet [Peng et al.2017], and AMP [Peng, Zhang, and Luo2018], fail to “extract valuable information for cooperation”, as the analysis in [Jiang and Lu2018] shows. As the number of agents and the size of a single message grow, communication participants are overwhelmed by the message flow. The latter, scheduling methods such as ATOC [Jiang and Lu2018], IC3Net [Singh, Jain, and Sukhbaatar2018], SchedNet [Kim et al.2019], MADDPG-M [Kilinc and Montana2018], and GACML [Mao et al.2019], cannot constrain the message content either. Rather, these methods limit the number of agents who can communicate in order to comply with the limited-bandwidth constraint. Concretely, ATOC proposes attention gating units, while IC3Net extends CommNet with an MLP-based gating mechanism. However, these two methods do not transfer well to real-world multi-agent systems because they apply only to homogeneous agents. MADDPG-M generates a score for each pair of agents and uses it to decide whether the two agents in a pair may communicate. However, MADDPG-M constrains the number of agents with whom each agent can communicate, and thus restrains potential cooperation. SchedNet allows a group of agents to send messages at every step, chosen by a top-$k$ algorithm or by sampling. GACML introduces a dynamic threshold for gating. However, these two methods are also inflexible because they do not show a direct relationship between the bandwidth and their controlled parameters, such as $k$ in SchedNet and the threshold in GACML. These parameters need to be handcrafted or learned in different environments to satisfy specific bandwidth conditions.
The combination of the information bottleneck method and reinforcement learning has produced a few applications in the last few years, especially in imitation learning [Peng et al.2018], inverse reinforcement learning [Peng et al.2018], and exploration [Goyal et al.2019]. Among them, Goyal et al. [Goyal et al.2019] mention multi-agent communication in their appendix, showing a method to minimize communication by penalizing the effect of one agent’s messages on another agent’s policy. However, it does not consider the limited-bandwidth constraint.
Multi-agent Communication with Limited Bandwidth
In this section, we introduce the multi-agent communicative MDP, clarify the limited bandwidth constraint as well as prove that limited bandwidth restricts message entropy, and finally discuss its implementation.
Multi-agent Communicative MDP
Multi-agent reinforcement learning can be formalized in the framework of DEC-POMDP. An $n$-agent multi-agent MDP is an $n$-agent DEC-POMDP, which is described by a tuple $\langle n, \mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{T}, r, \Omega, \gamma \rangle$, where $n$ represents the number of agents. $\mathcal{S}$ represents the set of all agents’ states. $\mathcal{A} = \{\mathcal{A}_1, \dots, \mathcal{A}_n\}$ denotes the sets of actions available to the agents. $\mathcal{O} = \{\mathcal{O}_1, \dots, \mathcal{O}_n\}$ denotes the sets of observations for each agent. $\mathcal{T}$ denotes the state transition probability function. All agents share the same reward $r$ as a function of the states and agents’ actions. Each agent $i$ receives a private observation $o_i \in \mathcal{O}_i$ according to the observation function $\Omega$. $\gamma \in [0, 1)$ denotes the discount factor. Generally, the objective of MARL for agent $i$ is to learn an optimal policy $\pi_i$ which maximizes the expected discounted return $R_i = \sum_{t} \gamma^{t} r_i^{t}$, where $r_i^{t}$ is the reward collected by the $i$-th agent at time $t$. For each agent $i$, the state-value function $V_i^{\pi_i}(s)$ measures the expected return of state $s$, and the action-value function $Q_i^{\pi_i}(s, a)$ assesses the expected return of a state-action pair $(s, a)$. So the objective of agent $i$ can be written as:

$$J(\pi_i) = \mathbb{E}_{s_0}\left[ V_i^{\pi_i}(s_0) \right],$$

where $s_0$ is the initial state.

We extend the multi-agent MDP to a communicative one. Compared with the multi-agent MDP, the multi-agent communicative MDP introduces a new component, i.e., messages. Unlike actions, which directly change the world state, messages only affect the receiving agents’ individual strategies. Consequently, we define the communicative multi-agent MDP as a tuple $\langle n, \mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{T}, r, \Omega, \gamma, \mathcal{M} \rangle$, with a new component $\mathcal{M} = \{\mathcal{M}_1, \dots, \mathcal{M}_n\}$, which represents the sets of messages received by agents. The objective is slightly different from that of MARL: agents must not only find an optimal policy but also learn optimal communication protocols that generate messages helpful for finding a policy that maximizes the expected return.
The process of communication (Figure 1) consists of three stages: coding, transmission, and decoding. In the coding phase, the communication system maps the messages of an agent to a bitstream. Transmission means that the bitstream is sent through a channel, where it may become a different bitstream due to distortion in the channel. Decoding is the inverse operation of coding. We focus on the first two stages by assuming no loss of information in the decoding stage.
Remark 1 (Source Coding Theorem [Shannon1948]).
Assume a set of symbols is to be transmitted through the communication channel. These symbols can be treated as independent samples of a random variable $X$ with entropy $H(X)$. Let $L$ be the average number of bits to encode the symbols. The minimum $L$ satisfies $H(X) \le L < H(X) + 1$.
Remark 2 (The Maximum Data Rate [Freeman2004]).
The maximum data rate $R_{max}$ (bits per second) over a noiseless channel satisfies $R_{max} = 2B\log_2 K$, where $B$ is the bandwidth (Hz) and $K$ is the number of signal levels.
Remark 1 shows that when we map samples of a random variable $X$ to binary bits, the code rate $R_{code}$ (bits per symbol) is at least $H(X)$, i.e., $R_{code} \ge H(X)$ (the proof can be found in the supplementary materials). Remark 2 is derived from the Nyquist criterion [Freeman2004] and specifies that, for reliable transmission in the noiseless and limited-bandwidth condition, a communication system can only transmit data at a rate of at most $R_{max}$ (the proof can be found in the supplementary materials). Based on these two remarks, we show how the limited-bandwidth constraint affects multi-agent communication.
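Remark 1 can be checked numerically on a small example. The sketch below is only an illustration under assumptions that are not in the text: the symbol probabilities are hypothetical, and Huffman coding is used as one concrete source code. It compares the Shannon entropy of a discrete source with the average Huffman codeword length.

```python
import heapq
import math


def entropy_bits(probs):
    """Shannon entropy (bits per symbol) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def huffman_code_lengths(probs):
    """Return the Huffman codeword length (bits) of each symbol."""
    # Heap entries: (probability, unique tie-breaker, symbols in this subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1  # each merge adds one bit to every symbol below it
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths


# Hypothetical source distribution, chosen only for illustration.
probs = [0.4, 0.3, 0.2, 0.1]
lengths = huffman_code_lengths(probs)
avg_len = sum(p * l for p, l in zip(probs, lengths))
print(f"H(X) = {entropy_bits(probs):.3f} bits, average code length = {avg_len:.3f} bits")
```

For any distribution, the average length stays at or above the entropy and within one bit of it, which is the property Remark 1 relies on.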
Proposition 1. In a noiseless channel, the bandwidth $B$ of the channel limits the entropy of the messages.
Proof of Proposition 1.
Given the message $M$ as an i.i.d. continuous random variable with differential entropy $h(M)$ and entropy $H(M)$ with quantization, its time series $\{M_1, \dots, M_t\}$, the communication system’s bandwidth $B$, as well as the signal level $K$, the communication system transmits $n$ symbols per second. So the transmission rate is $R_{trans}\,(\text{bit/s}) = n\,(\text{symbol/s}) \cdot R_{code}\,(\text{bit/symbol}) \ge n \cdot H(M)$. (Items in parentheses are units of measure for clarity.) According to Remark 2, $R_{trans} \le R_{max} = 2B\log_2 K$. Consequently, we derive $H(M) \le \frac{2B\log_2 K}{n}$. ∎
Proposition 1 means that if $H(M)$ satisfies $H(M) \le \frac{2B\log_2 K}{n}$, then when sending a series of messages we can achieve reliable transmission under the limited bandwidth. A limited-bandwidth constraint is therefore equivalent to enforcing an upper bound on the message entropy $H(M)$.
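As a rough numerical illustration of Proposition 1, the following sketch evaluates the Nyquist rate of Remark 2 and the resulting per-message entropy budget. The function names and the numbers in the usage lines are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def max_data_rate(bandwidth_hz, signal_levels):
    """Nyquist maximum data rate (bits/s) of a noiseless channel."""
    return 2.0 * bandwidth_hz * math.log2(signal_levels)


def max_message_entropy(bandwidth_hz, signal_levels, messages_per_second):
    """Upper bound on the entropy (bits) of each message for reliable transmission."""
    return max_data_rate(bandwidth_hz, signal_levels) / messages_per_second


# Hypothetical system: 100 Hz bandwidth, 4 signal levels, 10 messages per second.
rate = max_data_rate(100.0, 4)
budget = max_message_entropy(100.0, 4, 10.0)
print(f"R_max = {rate} bit/s, entropy budget per message = {budget} bits")
```

Doubling the message rate halves the entropy each message may carry, which is why the constraint is naturally expressed on the message entropy rather than on a bit count.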
Implementation of the Limited-bandwidth Constraint
We focus on the measurement of $H(M)$ as well as on how to implement the limited-bandwidth constraint. Given the messages $M$ as an i.i.d. variable with a certain distribution, we need a quantity that measures the information of the messages, because the exact distribution of $M$ is unknown to us.
Proposition 2. When we have a historical record of the messages to estimate the messages’ mean $\mu$ and variance $\sigma^2$, the information of the messages can be measured and represented by $\frac{1}{2}\log(2\pi e \sigma^2)$.
Proof of Proposition 2.
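$M$ follows a certain distribution, and we are only certain about its mean and variance. According to the principle of maximum entropy [Jaynes1957], the Gaussian distribution has maximum entropy among all probability distributions covering the entire real line but having a finite mean and finite variance (for the proof see [Cover and Thomas2012]). We notice that if a random variable $X \sim \mathcal{N}(\mu, \sigma^2)$, then its entropy is $H(X) = \frac{1}{2}\log(2\pi e \sigma^2)$. In short, with the message’s mean $\mu$ and variance $\sigma^2$, $H(M) \le H(X) = \frac{1}{2}\log(2\pi e \sigma^2)$, where $X \sim \mathcal{N}(\mu, \sigma^2)$. ∎
We conclude that $\frac{1}{2}\log(2\pi e \sigma^2)$ offers an upper bound to approximate $H(M)$. So the Gaussian distribution can be viewed as a good approximation to the messages’ distribution.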
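The maximum-entropy argument can be illustrated with closed-form differential entropies. The sketch below is a sanity check rather than a proof: it compares the Gaussian with two other distributions of the same variance (uniform and Laplace, both picked here as arbitrary examples), using natural logarithms.

```python
import math


def gaussian_entropy(variance):
    """Differential entropy (nats) of a Gaussian with the given variance."""
    return 0.5 * math.log(2.0 * math.pi * math.e * variance)


def uniform_entropy(variance):
    """Differential entropy (nats) of a uniform distribution with the given variance."""
    width = math.sqrt(12.0 * variance)  # a uniform of width w has variance w^2 / 12
    return math.log(width)


def laplace_entropy(variance):
    """Differential entropy (nats) of a Laplace distribution with the given variance."""
    scale = math.sqrt(variance / 2.0)  # a Laplace with scale b has variance 2 * b^2
    return 1.0 + math.log(2.0 * scale)


for var in (0.5, 1.0, 5.0):
    print(var, gaussian_entropy(var), laplace_entropy(var), uniform_entropy(var))
```

For every variance the Gaussian value is the largest of the three, so using it to measure message information errs on the side of overestimating the entropy.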
Because real-life source coding methods (such as Huffman coding) and communication protocols (such as TCP/UDP) vary, a bitstream can carry different amounts of information in different situations. As a result, we use entropy as a general measure and clip the messages’ variance to simulate the limited-bandwidth constraint. More specifically, we use a batch-normalization-like layer which records the messages’ mean and variance during training and normalizes the messages during inference.
The purpose of our normalization layer is to simulate the external limited-bandwidth constraint in inference only (implemented by a clip on the variance of the messages). It is customized and differs from standard batch normalization [Ioffe and Szegedy2015]. Specifically, our normalization layer records the mean and variance of the messages in training, and shifts and scales the messages with predefined parameters in inference. Following VAE [Kingma and Welling2013], messages described by a mean and a variance have greater representational power. The mean of the messages is trained to approach zero, while the variance is important because, under the Gaussian distribution assumption, the variance reflects the messages’ entropy, which is one of the regularization objectives.
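A minimal sketch of such a layer for scalar messages is given below. It is an assumption-laden illustration, not the implementation used in the experiments: the class name `MessageNormalizer`, the use of Welford's running statistics, and the choice to keep the recorded mean while shrinking deviations are all hypothetical.

```python
import math


class MessageNormalizer:
    """Records message statistics in training; clips the variance in inference."""

    def __init__(self, max_variance):
        self.max_variance = max_variance  # variance allowed by the bandwidth
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations (Welford's algorithm)

    def record(self, message):
        """Training: update the running mean and variance with one message."""
        self.count += 1
        delta = message - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (message - self.mean)

    @property
    def variance(self):
        return self._m2 / self.count if self.count > 0 else 0.0

    def constrain(self, message):
        """Inference: shrink deviations so the variance does not exceed the cap."""
        var = self.variance
        if var == 0.0 or var <= self.max_variance:
            return message
        scale = math.sqrt(self.max_variance / var)
        return self.mean + scale * (message - self.mean)


layer = MessageNormalizer(max_variance=1.0)
for m in (-2.0, 2.0):
    layer.record(m)
print(layer.variance, layer.constrain(2.0))
```

Here the recorded variance is four times the cap, so every deviation from the mean is halved at inference time.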
For example, consider a 4-ary communication system with a given maximum bandwidth, over which we want to achieve reliable transmission at a given rate of messages per second. We can then determine the equivalent variance from the resulting bound on the message entropy. During the training stage, we record the agent’s message variance. However, at the inference stage, the bandwidth requires the message entropy not to exceed this bound. We therefore decrease the variance from the recorded value to the equivalent variance by using the specific normalization layer.
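The conversion from an entropy budget to an equivalent variance can be sketched by inverting the Gaussian entropy expression. The code assumes that the entropy is measured in bits and that the differential entropy of the Gaussian is used as the measure; the entropy budget and recorded variance in the usage lines are hypothetical values, not the ones used in the experiments.

```python
import math


def variance_for_entropy_bits(entropy_bits):
    """Variance of the Gaussian whose differential entropy equals `entropy_bits`."""
    # 0.5 * log2(2 * pi * e * var) = H   =>   var = 2^(2H) / (2 * pi * e)
    return 2.0 ** (2.0 * entropy_bits) / (2.0 * math.pi * math.e)


def rescale_factor(recorded_variance, target_variance):
    """Factor applied to message deviations so that the variance meets the target."""
    return min(1.0, math.sqrt(target_variance / recorded_variance))


# Hypothetical values: a 2-bit entropy budget and a recorded variance of 5.0.
target = variance_for_entropy_bits(2.0)
factor = rescale_factor(5.0, target)
print(f"equivalent variance = {target:.4f}, rescale factor = {factor:.4f}")
```

The factor is capped at 1 so that messages already within the budget are left unchanged.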
Informative Multi-agent Communication
As stated in the previous section, agents should generate low-entropy messages to satisfy the limited-bandwidth constraint. The main idea behind our proposed method is to compress the messages of agents as much as possible to satisfy the limited-bandwidth constraint while maintaining the agents’ performance. There is a trade-off between the degree of compression and the performance: when agents receive all of the other agents’ information uncompressed, they can perform well by taking the others into consideration; when the messages undergo lossy compression, agents perform poorly in cooperative tasks because the shared information is incomplete. We can alleviate the effect of this incompleteness by using the information bottleneck method, which encourages the model to focus only on the most informative component of the shared information.
The Model of Informative Multi-agent Communication
Before introducing our model, we clarify there are two schemes in multi-agent communication: intra-step communication and inter-step communication. Illustration figures can be found in the supplementary material. In the intra-step communication, agents generate, send and receive messages in every single step, while in the inter-step communication, agents send messages along with agents’ actions as next-step agents’ input. The view of intra-step communication is widely discussed in most previous methods such as the CommNet, ATOC, AMP, SchedNet, GACML, etc. Hence we follow them and focus on the intra-step communication.
Our proposed model is shown in Figure 2. We divide each agent’s decision epoch into three phases: pre-communication, communication, and post-communication. Formally, considering $n$ agents, we model these three phases for each agent $i$ with a pre-communication function, a channel function, and a post-communication function. (We omit the time step because all the operations happen in every single time step.) The pre-communication function takes as input the observation $o_i$ of agent $i$ and outputs the internal features $h_i$ of agent $i$. The channel function takes as input the internal features $h = \{h_1, \dots, h_n\}$ of all agents and outputs the messages $m_i$ for agent $i$. The post-communication function takes as input the messages $m_i$ of agent $i$ and outputs the action $a_i$ of agent $i$.
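The data flow of the three phases within one time step can be sketched as follows. The three callables are placeholders standing in for learned networks, and, as a simplification, all agents share the same pre- and post-communication function; `double`, `mean_of_others`, and `negate` are hypothetical stand-ins chosen so that the result is easy to trace by hand.

```python
from typing import Callable, List

Vector = List[float]


def decision_step(
    observations: List[Vector],
    pre_comm: Callable[[Vector], Vector],
    channel: Callable[[List[Vector]], List[Vector]],
    post_comm: Callable[[Vector], Vector],
) -> List[Vector]:
    """One intra-step decision epoch: observations -> features -> messages -> actions."""
    features = [pre_comm(obs) for obs in observations]  # pre-communication phase
    messages = channel(features)                        # communication phase
    return [post_comm(msg) for msg in messages]         # post-communication phase


def double(obs: Vector) -> Vector:
    """Placeholder feature extractor."""
    return [2.0 * x for x in obs]


def mean_of_others(features: List[Vector]) -> List[Vector]:
    """Placeholder channel: each agent receives the mean feature of the other agents."""
    messages = []
    for i in range(len(features)):
        others = [f for j, f in enumerate(features) if j != i]
        messages.append([sum(vals) / len(others) for vals in zip(*others)])
    return messages


def negate(msg: Vector) -> Vector:
    """Placeholder policy head."""
    return [-x for x in msg]


actions = decision_step([[1.0], [3.0]], double, mean_of_others, negate)
print(actions)
```

The channel is the only component that sees every agent's features, which matches its role as a coordinator.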
The channel function plays a core role in multi-agent communication. However, the channel function is different from the channel in the communication model (Figure 1). The channel function acts like a coordinator that aggregates all agents’ internal features and extracts the messages for each agent $i$. In real-world applications, the channel function can be deployed on an agent with greater computational capability.
Variational Information Bottleneck for Learning Protocols
We propose our variational information bottleneck solution in multi-agent communication for learning informative protocols in bandwidth-limited scenarios. Consider a scenario with $n$ agents with policies parameterized by $\theta = \{\theta_1, \dots, \theta_n\}$, and let $\pi = \{\pi_1, \dots, \pi_n\}$ be the set of all agents’ policies. The optimal policy of agent $i$ can be determined according to $\pi_i^{*} = \arg\max_{\pi_i} J(\theta_i)$. We utilize centralized training as in MADDPG [Lowe et al.2017]. The objective of agent $i$ is:

$$J(\theta_i) = \mathbb{E}_{x \sim \rho^{\pi},\, a \sim \pi}\left[ Q_i^{\pi}(x, a_1, \dots, a_n) \right],$$

where $\rho^{\pi}$ is the state distribution of the overall policy and $Q_i^{\pi}(x, a_1, \dots, a_n)$ is a centralized action-value function that takes as input the actions of all agents, in addition to the observations of all agents, $x$, and outputs the Q-value for agent $i$.
The information bottleneck method is used to encourage the model to focus only on the most informative features [Alemi et al.2016]. Based on the same principle, we incorporate a variational information bottleneck by viewing the channel function as an encoder that maps the internal features $h$ to stochastic messages $m_i$. From the perspective of agent $i$, the bottleneck can be incorporated by enforcing an upper bound $I_c$ on the mutual information $I(H; M_i)$ between the messages $M_i$ and the internal features $H$. In the information bottleneck framework, $I(H; M_i)$ can be viewed as a penalty term that restricts the complexity of the channel function. Then, the objective with the regularization of the information bottleneck can be written as:

$$\max_{\theta_i} J(\theta_i) \quad \text{s.t.} \quad I(H; M_i) \le I_c.$$

Practically, we propose to maximize the following objective using the information bottleneck Lagrangian:

$$J'(\theta_i) = J(\theta_i) - \beta\, I(H; M_i),$$
where $\beta$ is the Lagrange multiplier. The mutual information is defined according to:

$$I(H; M_i) = \iint p(h, m_i) \log \frac{p(h, m_i)}{p(h)\, p(m_i)}\, dh\, dm_i = \iint p(h)\, p(m_i \mid h) \log \frac{p(m_i \mid h)}{p(m_i)}\, dh\, dm_i,$$

where $p(m_i)$ is the probability of the message $m_i$, $p(h)$ is the probability of the internal features of all agents $h$, $p(h, m_i)$ is the joint probability of $h$ and $m_i$, and $p(m_i \mid h)$ is the conditional probability of $m_i$ given $h$. However, computing the marginal distribution $p(m_i) = \int p(m_i \mid h)\, p(h)\, dh$ can be challenging since we do not know the prior distribution of the hidden states $p(h)$ or the conditional probability $p(m_i \mid h)$. So, we turn to a variational lower bound. We view the channel function as multiple variational encoders $p(m_i \mid h)$, where $m_i$ becomes a random variable with learned means and variances. So, $p(m_i \mid h)$ maps the features $h$ to a latent message distribution over $m_i$, and we then use $q(m_i)$ as an approximation of the prior distribution of messages $p(m_i)$.
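The effect of replacing the true message marginal with an approximate prior can be checked on a small discrete example. The sketch below uses a hypothetical 2×2 joint table and natural logarithms: it computes the mutual information directly and then the expected KL term obtained with an arbitrary approximate prior, which can only be larger.

```python
import math


def mutual_information(joint):
    """I(H; M) in nats for a joint probability table joint[h][m]."""
    p_h = [sum(row) for row in joint]
    p_m = [sum(col) for col in zip(*joint)]
    return sum(
        p * math.log(p / (p_h[i] * p_m[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )


def expected_kl_to_prior(joint, prior):
    """E_h[ KL( p(m|h) || prior(m) ) ] in nats, an upper bound on I(H; M)."""
    total = 0.0
    for row in joint:
        p_h = sum(row)
        for p, q in zip(row, prior):
            if p > 0:
                total += p * math.log((p / p_h) / q)
    return total


# Hypothetical joint distribution over two feature values and two message values.
joint = [[0.4, 0.1], [0.1, 0.4]]
mi = mutual_information(joint)
bound = expected_kl_to_prior(joint, [0.7, 0.3])
tight = expected_kl_to_prior(joint, [0.5, 0.5])  # the true marginal of the messages
print(f"I = {mi:.4f}, bound with approximate prior = {bound:.4f}, with true marginal = {tight:.4f}")
```

The gap between the bound and the mutual information is exactly the KL divergence between the true marginal and the approximate prior, so it vanishes when the two coincide.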
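Since $D_{KL}[p(m_i)\,\|\,q(m_i)] \ge 0$, we have $\int p(m_i) \log p(m_i)\, dm_i \ge \int p(m_i) \log q(m_i)\, dm_i$, so an upper bound on the mutual information can be obtained via the KL divergence:

$$I(H; M_i) \le \iint p(h)\, p(m_i \mid h) \log \frac{p(m_i \mid h)}{q(m_i)}\, dh\, dm_i = \mathbb{E}_{h \sim p(h)}\left[ D_{KL}\left[ p(m_i \mid h)\,\|\,q(m_i) \right] \right].$$

This provides a lower bound $\tilde{J}(\theta_i)$ on the regularized objective that we maximize:

$$J'(\theta_i) \ge \tilde{J}(\theta_i) = J(\theta_i) - \beta\, \mathbb{E}_{h \sim p(h)}\left[ D_{KL}\left[ p(m_i \mid h)\,\|\,q(m_i) \right] \right].$$

Consequently the objective’s derivative is:

$$\nabla_{\theta_i} \tilde{J}(\theta_i) = \nabla_{\theta_i} J(\theta_i) - \beta\, \nabla_{\theta_i} \mathbb{E}_{h \sim p(h)}\left[ D_{KL}\left[ p(m_i \mid h)\,\|\,q(m_i) \right] \right].$$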
Note that the prior distribution of the messages regularizes the learned message distribution toward it. We can therefore control the messages to satisfy different bandwidth conditions by choosing different prior distributions in the training stage. That is, our method provides a direct relationship between the bandwidth and the parameters. Our approach is summarized in Algorithm 1.
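When both the encoder output and the prior are Gaussian, the KL regularizer has a closed form. The sketch below assumes scalar (per-dimension) Gaussians and a zero-mean prior; `regularized_objective` and its arguments are hypothetical names, and the Q-value is passed in as a plain number rather than computed by a critic.

```python
import math


def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, var_p) || N(mu_q, var_q) ) in nats for scalar Gaussians."""
    return 0.5 * (math.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)


def regularized_objective(q_value, encoder_params, prior_var, beta):
    """Q-value minus beta times the summed KL of each message dimension to the prior."""
    kl = sum(gaussian_kl(mu, var, 0.0, prior_var) for mu, var in encoder_params)
    return q_value - beta * kl


# One message dimension with mean 1 and variance 1, against a unit-variance prior.
value = regularized_objective(1.0, [(1.0, 1.0)], prior_var=1.0, beta=0.1)
print(value)
```

A smaller prior variance penalizes high-variance messages more strongly, which is how the choice of prior steers the entropy of the learned messages.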
Experiments are performed in the multi-agent particle environment [Lowe et al.2017], which is a two-dimensional world with continuous state and action spaces and discrete time. Specifically, we slightly modify and conduct our experiments on (1) Cooperative Navigation and (2) Predator Prey. These two environments are good illustrations of multi-agent cooperative/competitive tasks in search and rescue, military operations, and space exploration. Our baselines are (1) MADDPG, (2) MADDPG with communication (channel function), (3) MADDPG-M, which utilizes a gating mechanism, and (4) AMP. We compare with MADDPG because it gives the performance without communication; ideally, algorithms with communication should outperform it. We also consider MADDPG-M and AMP because they represent scheduling mechanisms and the use of specific networks as communication channels, respectively. ATOC is not considered as a baseline because it operates in the shared-parameter setting for homogeneous agents. In view of ATOC’s attention mechanism, we choose AMP as the baseline that represents attention-based methods. Then, we evaluate IMAC across two dimensions, (1) the number of agents and (2) different limited bandwidths, and also conduct an ablation study.
As for training details, we set the number of agents to 3, 5, and 10, respectively. We use an MLP with a hidden layer size of 64 as the basic module for the pre-communication and post-communication models. We use Adam as the optimizer with a learning rate of 0.01. Since the cooperative navigation environment does not send “terminal/done” to agents, we limit each episode to a maximum of 25 steps. The reward of each agent is identical and equals the sum of distances between agents and their nearest landmark. This means that agents are required not only to approach their nearest landmark, but also to share information with each other for a common goal. We use the same hyper-parameters as OpenAI’s version of MADDPG.
In this scenario, agents cooperatively reach landmarks while avoiding collisions. Agents observe the relative positions of other agents and landmarks, and are rewarded with a shared credit based on the sum of distances between agents and their nearest landmark, while they are penalized when colliding with other agents. Agents learn to infer and occupy the landmarks without colliding with other agents, based on their own observations and the information received from other agents.
Comparison to the baselines. We compare IMAC with the baselines in the 3-agent scenario. Figure 3(a) shows the learning curve over 100,000 episodes in terms of the mean episode reward over a sliding window of 1000 episodes. We can see that, at the end of training, agents trained with communication have a higher mean episode reward. According to [Lowe et al.2019], an “increase in reward when adding a communication channel” is sufficient to indicate effective communication. Additionally, IMAC outperforms the other baselines throughout training, i.e., IMAC reaches the upper bound of performance early. By using the information bottleneck method, agents achieve better sample efficiency and thus converge faster. (More analysis can be found in the supplementary materials.)
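The smoothing used for such learning curves can be sketched as a trailing mean. The handling of the first episodes, where fewer than a full window of rewards is available, is an assumption of this sketch, and the reward values in the usage line are made up.

```python
def sliding_mean(rewards, window):
    """Mean of the most recent `window` rewards at every episode."""
    means = []
    running = 0.0
    for t, reward in enumerate(rewards):
        running += reward
        if t >= window:
            running -= rewards[t - window]  # drop the reward that left the window
        means.append(running / min(t + 1, window))
    return means


# Hypothetical episode rewards with a window of 2 episodes.
curve = sliding_mean([1.0, 2.0, 3.0, 4.0], window=2)
print(curve)
```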
Increasing the number of agents. We investigate the agents’ performance as the number of agents increases. We slightly modify the agents’ observations in the environment: following [Jiang and Lu2018], each agent can only observe the relative positions and velocities of the nearest three agents and landmarks. Figure 3(b) and (c) show the learning curves over 100,000 episodes in terms of the mean episode reward. From these two figures, we can still see the leading performance of IMAC in the 5- and 10-agent scenarios.
| Predator \ Prey | AMP | IMAC | MADDPG | MADDPG w/ com |
|---|---|---|---|---|
| AMP | 1.06 \ -20.42 | 25.13 \ 6.39 | 44.62 \ -60.14 | 29.97 \ -23.43 |
| IMAC | 27.29 \ -22.27 | 32.32 \ -4.26 | 20.76 \ -56.14 | 34.33 \ -22.62 |
| MADDPG | 10.09 \ -24.93 | 29.52 \ -19.39 | 5.98 \ -26.82 | 28.47 \ -27.75 |
| MADDPG w/ com | 21.02 \ -21.52 | 28.63 \ -15.60 | 20.85 \ -37.48 | 16.87 \ -13.09 |
Limited bandwidth. We evaluate the algorithms by checking the agents’ performance under different limited-bandwidth constraints during the inference stage. Figure 4 shows density plots of the episode reward per agent during the inference stage. We first train IMAC with three different prior distributions to satisfy different default limited-bandwidth constraints, so that the entropy of the agents’ messages satisfies the corresponding bandwidth constraint. We also train MADDPG with communication for comparison. Then, in the inference stage, we constrain these algorithms to different bandwidths. As depicted in Figure 4(a), IMAC with different prior distributions can reach the same outcome as MADDPG with communication. Figure 4(b) shows that MADDPG with communication fails in the limited-bandwidth environment. From Figure 4(c) and (d), we can see that the same bandwidth constraint affects IMAC less than it affects MADDPG with communication. These results demonstrate that IMAC discards useless information without impairing performance. We also evaluate how much information is used in these algorithms. (The detailed results can be found in the supplementary materials.)
In this scenario, slower predators chase faster prey around an environment with landmarks impeding the way. As in cooperative navigation, each agent observes the relative positions of other agents and landmarks. Predators share common rewards, which are assigned based on collisions between predators and prey, as well as on the minimal distance between the two groups. Prey are penalized for running out of the boundary of the screen. In this way, predators learn to approach and surround prey, while prey learn to feint to save their teammates.
We set the number of predators to 4, the number of prey to 2, and the number of landmarks to 2. We use the same architecture and hyper-parameters as in cooperative navigation. We train our agents by self-play for 100,000 episodes and then evaluate performance by cross-comparing IMAC and the baselines. We average the episode rewards across 1000 rounds (episodes) as scores.
Comparison to baselines. Table 1 presents the cross-comparison between IMAC and the baselines. Each cell consists of two numbers, which denote the mean episode rewards of the predators and the prey, respectively. The larger the score, the better the algorithm. We first focus on the mean episode rewards of the predators, row by row. Facing the same prey, IMAC predators have higher scores than the predators of all the baselines and hence are stronger than the other predators. Then, the mean episode rewards of the prey, column by column, show the ability of the prey to escape. We can see that IMAC prey have higher scores than the prey of most baselines and hence are stronger than the other prey. We argue that IMAC leads to better cooperation than the baselines even in competitive environments, and that the learned policies of IMAC predators and prey can generalize to opponents with different policies.
Limited bandwidth. Similar to cooperative navigation, we evaluate the algorithms by showing their performance under different limited-bandwidth constraints during inference. The detailed results can be found in the supplementary materials. We can see that, under the limited-bandwidth constraint, both MADDPG with communication and IMAC suffer a degradation of performance. However, IMAC outperforms MADDPG with communication in terms of resistance to the effect of limited bandwidth.
We investigate (1) the effect of limited bandwidth and (2) the effect of $\beta$ on multi-agent communication and on the performance of agents.
The effect of limited bandwidth. Figure 5(a) shows the learning curves of IMAC with different prior distributions. One intermediate prior variance achieves the best performance; when the variance is smaller or larger, the performance suffers some degradation. This is reasonable because a smaller variance means a more lossy compression, leading to less information sharing, while a variance larger than the variance without regularization must bring about redundant information, thus leading to slow convergence.
The effect of $\beta$. $\beta$ controls the degree of compression between $H$ and $M_i$ for each agent $i$: the larger $\beta$, the more lossy the compression. Figure 5(b) shows a result similar to the ablation on the limited-bandwidth constraint. The reason is the same: a larger $\beta$ means a stricter compression, while a smaller $\beta$ means a less strict one.
The ablation shows that, as a compression algorithm, the information bottleneck method extracts the most informative elements from the source. A proper compression rate is good for multi-agent communication, because it both avoids losing too much information through higher compression and resists the noise admitted by lower compression.
In this paper, we have proposed an informative multi-agent communication method for the limited-bandwidth environment, where agents utilize the information bottleneck to learn an informative protocol. We have given a well-defined explanation of the limited-bandwidth constraint: limited bandwidth restrains the entropy of the messages. We introduce a customized batch-norm layer, which controls the messages’ entropy to simulate the bandwidth constraint. Inspired by the information bottleneck method, our proposed IMAC algorithm learns informative protocols, which convey low-entropy and useful messages. Empirical results and an accompanying ablation study show that IMAC improves the agents’ performance under the limited-bandwidth constraint and leads to faster convergence.
- [Alemi et al.2016] Alemi, A. A.; Fischer, I.; Dillon, J. V.; and Murphy, K. 2016. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410.
- [Cover and Thomas2012] Cover, T. M., and Thomas, J. A. 2012. Elements of Information Theory. John Wiley & Sons.
- [Cressey2015] Cressey, D. 2015. Ocean-diving robot nereus will not be replaced. Nature News 528(7581):176.
- [Foerster et al.2016] Foerster, J.; Assael, I. A.; de Freitas, N.; and Whiteson, S. 2016. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, 2137–2145.
- [Freeman2004] Freeman, R. 2004. Telecommunication System Engineering. Wiley Series in Telecommunications and Signal Processing. Wiley. 398–399.
- [Gao and Chien2017] Gao, Y., and Chien, S. 2017. Review on space robotics: Toward top-level science through space exploration. Science Robotics 2(7).
- [Goyal et al.2019] Goyal, A.; Islam, R.; Strouse, D.; Ahmed, Z.; Botvinick, M.; Larochelle, H.; Levine, S.; and Bengio, Y. 2019. Infobot: Transfer and exploration via the information bottleneck. arXiv preprint arXiv:1901.10902.
- [Ioffe and Szegedy2015] Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
- [Jaynes1957] Jaynes, E. T. 1957. Information theory and statistical mechanics. Physical Review 106(4):620.
- [Jiang and Lu2018] Jiang, J., and Lu, Z. 2018. Learning attentional communication for multi-agent cooperation. In Advances in Neural Information Processing Systems, 7265–7275.
- [Kilinc and Montana2018] Kilinc, O., and Montana, G. 2018. Multi-agent deep reinforcement learning with extremely noisy observations. arXiv preprint arXiv:1812.00922.
- [Kim et al.2019] Kim, D.; Moon, S.; Hostallero, D.; Kang, W. J.; Lee, T.; Son, K.; and Yi, Y. 2019. Learning to schedule communication in multi-agent reinforcement learning. arXiv preprint arXiv:1902.01554.
- [Kingma and Welling2013] Kingma, D. P., and Welling, M. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
- [Lowe et al.2017] Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, O. P.; and Mordatch, I. 2017. Multi-agent actor-critic for mixed cooperative-competitive environments. In Advances in Neural Information Processing Systems, 6379–6390.
- [Lowe et al.2019] Lowe, R.; Foerster, J.; Boureau, Y.-L.; Pineau, J.; and Dauphin, Y. 2019. On the pitfalls of measuring emergent communication. arXiv preprint arXiv:1903.05168.
- [Mao et al.2019] Mao, H.; Gong, Z.; Zhang, Z.; Xiao, Z.; and Ni, Y. 2019. Learning multi-agent communication under limited-bandwidth restriction for internet packet routing. arXiv preprint arXiv:1903.05561.
- [Maravall, de Lope, and Dominguez2013] Maravall, D.; de Lope, J.; and Dominguez, R. 2013. Coordination of communication in robot teams by reinforcement learning. Robotics and Autonomous Systems 61(7):661–666.
- [Melo, Spaan, and Witwicki2011] Melo, F. S.; Spaan, M. T.; and Witwicki, S. J. 2011. QueryPOMDP: POMDP-based communication in multiagent systems. In European Workshop on Multi-Agent Systems, 189–204.
- [Nagatani et al.2013] Nagatani, K.; Kiribayashi, S.; Okada, Y.; Otake, K.; Yoshida, K.; Tadokoro, S.; Nishimura, T.; Yoshida, T.; Koyanagi, E.; Fukushima, M.; et al. 2013. Emergency response to the nuclear accident at the Fukushima Daiichi nuclear power plants using mobile rescue robots. Journal of Field Robotics 30(1):44–63.
- [OpenAI2019] OpenAI. 2019. OpenAI Five. https://openai.com/blog/openai-five/. Accessed March 4, 2019.
- [Peng et al.2017] Peng, P.; Yuan, Q.; Wen, Y.; Yang, Y.; Tang, Z.; Long, H.; and Wang, J. 2017. Multiagent bidirectionally-coordinated nets for learning to play StarCraft combat games. arXiv preprint arXiv:1703.10069.
- [Peng et al.2018] Peng, X. B.; Kanazawa, A.; Toyer, S.; Abbeel, P.; and Levine, S. 2018. Variational discriminator bottleneck: Improving imitation learning, inverse RL, and GANs by constraining information flow. arXiv preprint arXiv:1810.00821.
- [Peng, Zhang, and Luo2018] Peng, Z.; Zhang, L.; and Luo, T. 2018. Learning to communicate via supervised attentional message processing. In Proceedings of the 31st International Conference on Computer Animation and Social Agents, 11–16.
- [RoboCup2019] RoboCup. 2019. RoboCup Federation Official Website. https://www.robocup.org/. Accessed April 10, 2019.
- [Shalev-Shwartz, Shammah, and Shashua2016] Shalev-Shwartz, S.; Shammah, S.; and Shashua, A. 2016. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295.
- [Shamir, Sabato, and Tishby2010] Shamir, O.; Sabato, S.; and Tishby, N. 2010. Learning and generalization with the information bottleneck. Theoretical Computer Science 411(29-30):2696–2711.
- [Shannon1948] Shannon, C. E. 1948. A mathematical theory of communication. Bell System Technical Journal 27(3):379–423.
- [Singh, Jain, and Sukhbaatar2018] Singh, A.; Jain, T.; and Sukhbaatar, S. 2018. Learning when to communicate at scale in multiagent cooperative and competitive tasks. arXiv preprint arXiv:1812.09755.
- [Sukhbaatar, Fergus, and others2016] Sukhbaatar, S.; Fergus, R.; et al. 2016. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, 2244–2252.
- [Tishby and Zaslavsky2015] Tishby, N., and Zaslavsky, N. 2015. Deep learning and the information bottleneck principle. In IEEE Information Theory Workshop (ITW), 1–5.
- [Tishby, Pereira, and Bialek2000] Tishby, N.; Pereira, F. C.; and Bialek, W. 2000. The information bottleneck method. arXiv preprint physics/0004057.
- [Yuan, Liu, and Zhang2017] Yuan, C.; Liu, Z.; and Zhang, Y. 2017. Aerial images-based forest fire detection for firefighting using optical remote sensing techniques and unmanned aerial vehicles. Journal of Intelligent & Robotic Systems 88(2-4):635–654.
- [Zhang and Lesser2013] Zhang, C., and Lesser, V. 2013. Coordinating multi-agent reinforcement learning with limited communication. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-agent Systems, 1101–1108.
Appendix A Source Coding
Source Code: A source code $C$ is a mapping from the range $\mathcal{X}$ of a random variable (or a set of random variables) $X$ to finite-length strings of symbols from a $D$-ary alphabet.
The expected length of a source code, denoted by $L(C)$, is given by $L(C) = \sum_{x \in \mathcal{X}} p(x)\, l(x)$, where $l(x)$ is the length of the codeword for a symbol $x$, and $p(x)$ is the probability of the symbol.
Intuitively, a good code should preserve the information content of an outcome. Since information content depends on the probability of the outcome (it is higher if the probability is lower, or equivalently if the outcome is very uncertain), a good code will use fewer bits to encode a certain or high-probability outcome and more bits to encode a low-probability outcome. Thus, we expect that the smallest expected code length should be related to the average uncertainty of the random variable, i.e., the entropy.
The source coding theorem states that entropy is the fundamental limit of data compression; i.e., $L(C) \geq H_D(X)$ for any uniquely decodable code $C$, where $H_D(X) = -\sum_{x} p(x) \log_D p(x)$ is the entropy measured with base-$D$ logarithms. Instead of encoding individual symbols, we can also encode blocks of symbols together. A length-$n$ block code encodes length-$n$ strings of symbols together and is denoted by $C(x_1, \dots, x_n)$.
Consider the optimization problem:
$$\min_{l(x)} \sum_{x} p(x)\, l(x) \quad \text{s.t.} \quad \sum_{x} D^{-l(x)} \leq 1.$$
The above finds the shortest possible expected code length subject to satisfying the Kraft inequality. If we relax the code lengths to be non-integer, then we can obtain a lower bound. To do this, the Lagrangian is:
$$\mathcal{L} = \sum_{x} p(x)\, l(x) + \lambda \left( \sum_{x} D^{-l(x)} - 1 \right).$$
Taking derivatives with respect to $l(x)$ and $\lambda$ and setting them to 0 leads to:
$$p(x) - \lambda\, D^{-l(x)} \ln D = 0 \quad \forall x, \qquad \sum_{x} D^{-l(x)} - 1 = 0.$$
Solving this for $l(x)$ gives $\lambda = 1/\ln D$ and $l(x) = \log_D \frac{1}{p(x)}$, which can be verified by direct substitution. The resulting expected length is $\sum_{x} p(x) \log_D \frac{1}{p(x)} = H_D(X)$. This proves the lower bound. ∎
A full theoretical analysis can be found in [Shannon1948].
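The lower bound above can be checked numerically. The following is a minimal sketch, not part of the original analysis: it assumes the standard Shannon code construction with integer lengths $\lceil \log_D (1/p(x)) \rceil$, and the helper names (`entropy`, `shannon_code_lengths`, `kraft_sum`, `expected_length`) are hypothetical. It verifies the Kraft inequality and that the expected length lies between the entropy and the entropy plus one.

```python
import math


def entropy(probs, base=2):
    """Entropy of a discrete distribution, measured with base-`base` logarithms."""
    return -sum(p * math.log(p) / math.log(base) for p in probs if p > 0)


def shannon_code_lengths(probs, base=2):
    """Integer codeword lengths ceil(log_D(1/p(x))) (assumed Shannon construction).

    A small tolerance guards against floating-point results such as 3.0000000000000004.
    """
    return [math.ceil(-math.log(p) / math.log(base) - 1e-9) for p in probs]


def kraft_sum(lengths, base=2):
    """Left-hand side of the Kraft inequality: sum over x of D^(-l(x))."""
    return sum(base ** (-l) for l in lengths)


def expected_length(probs, lengths):
    """Expected code length: sum over x of p(x) * l(x)."""
    return sum(p * l for p, l in zip(probs, lengths))


# Illustrative distribution (an assumption, not taken from the paper).
probs = [0.4, 0.3, 0.2, 0.1]
lengths = shannon_code_lengths(probs)

print("code lengths:   ", lengths)
print("Kraft sum:      ", kraft_sum(lengths))
print("entropy (bits): ", entropy(probs))
print("expected length:", expected_length(probs, lengths))
```

For a dyadic distribution such as `[0.5, 0.25, 0.125, 0.125]`, the relaxed solution $l(x) = \log_D \frac{1}{p(x)}$ is already integer-valued, so the expected length meets the entropy bound with equality.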
Appendix B Maximum Data Rate
We first introduce the Nyquist ISI criterion:
Proposition 3 (Nyquist ISI criterion).
If we denote the channel impulse response as $h(t)$, then the condition for an ISI-free response can be expressed as:
$$h(nT_s) = \begin{cases} 1, & n = 0 \\ 0, & n \neq 0 \end{cases}$$
for all integers $n$, where $T_s$ is the symbol period. The Nyquist ISI criterion says that this is equivalent to:
$$\frac{1}{T_s} \sum_{k=-\infty}^{+\infty} H\!\left(f - \frac{k}{T_s}\right) = 1 \quad \forall f, \qquad (3)$$
where $H(f)$ is the Fourier transform of $h(t)$.
We may now state the Nyquist ISI criterion for distortionless baseband transmission in the absence of noise: The frequency function $H(f)$ eliminates intersymbol interference for samples taken at intervals of $T_s$ provided that it satisfies Equation 3.
The simplest way of satisfying Equation 3 is to specify the frequency function $H(f)$ to be a rectangular function, as shown by
$$H(f) = \begin{cases} \frac{1}{2B}, & -B < f < B \\ 0, & |f| > B \end{cases} \;=\; \frac{1}{2B}\, \mathrm{rect}\!\left(\frac{f}{2B}\right),$$
where $\mathrm{rect}(f)$ stands for a rectangular function of unit amplitude and unit support centered on $f = 0$, and the overall system bandwidth $B$ is defined by
$$B = \frac{R_b}{2} = \frac{1}{2T_s}.$$
The special value of the bit rate $R_b = 2B$ is called the Nyquist rate, and $B$ is itself called the Nyquist bandwidth.
The key here is that we have restricted ourselves to binary transmission and are limited to $2B$ bits/s no matter how much we increase the signal-to-noise ratio. The way to attain a higher value is to replace the binary transmission system with a multilevel system, often termed an $M$-ary transmission system, with $M > 2$. An $M$-ary channel can pass $2B \log_2 M$ bits/s with an acceptable error rate.
Thus, we conclude that the maximum bit rate is $R_{\max} = 2B \log_2 M$.
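As a quick numerical illustration of this appendix, the sketch below assumes the standard noiseless Nyquist limit of $2B \log_2 M$ bits/s for an $M$-ary system with bandwidth $B$ Hz. The function name `nyquist_max_bit_rate` and the 3 kHz example bandwidth are hypothetical choices for illustration only.

```python
import math


def nyquist_max_bit_rate(bandwidth_hz, levels):
    """Maximum bit rate (bits/s) of a noiseless M-ary channel: 2 * B * log2(M).

    Assumes the standard Nyquist limit; `levels` is the number M of signal levels.
    """
    if bandwidth_hz <= 0:
        raise ValueError("bandwidth must be positive")
    if levels < 2:
        raise ValueError("an M-ary system needs at least 2 levels")
    return 2.0 * bandwidth_hz * math.log2(levels)


# Binary transmission is capped at 2B bits/s; more levels raise the cap.
for m in (2, 4, 8):
    print(f"M = {m}: {nyquist_max_bit_rate(3000, m):.0f} bits/s")
```

Doubling the number of levels adds only $2B$ bits/s, so the achievable rate grows logarithmically in $M$ and linearly in $B$.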
Appendix C The Information Bottleneck Method
The information bottleneck method provides a principled way to extract the information present in one variable that is relevant for predicting another variable. Consider $X$ and $Y$ respectively as the input source and target, and let $Z$ be an internal representation, i.e., a stochastic encoding, of any hidden layer of the network, defined by a parametric encoder $p(z|x;\theta)$. The goal is to learn an encoding that is maximally informative about the target $Y$, which is measured by the mutual information between $Z$ and $Y$, where
$$I(Z, Y; \theta) = \int p(z, y|\theta) \log \frac{p(z, y|\theta)}{p(z|\theta)\, p(y|\theta)} \, dz \, dy.$$
Notice that, under this objective alone, taking the identity encoding $Z = X$ always ensures a maximally informative representation, but it is obviously not a useful one. To obtain the best representation, it is natural to constrain the encoding's complexity, i.e., the mutual information $I(X, Z)$ between the encoding and the input. [Tishby and Zaslavsky2015] proposed the information bottleneck, which expresses the trade-off between the mutual information measures $I(X, Z)$ and $I(Z, Y)$. This suggests the objective:
$$\max_{\theta} I(Z, Y; \theta) \quad \text{s.t.} \quad I(X, Z; \theta) \leq I_c,$$
where $I_c$ is the information constraint. Equivalently, with the introduction of a Lagrange multiplier $\beta$ we can maximize the objective function:
$$R_{IB}(\theta) = I(Z, Y; \theta) - \beta\, I(Z, X; \theta),$$
where $\beta \geq 0$ controls the trade-off. Intuitively, the first term encourages $Z$ to be predictive of $Y$; the second term encourages $Z$ to “forget” $X$. Essentially, it forces $Z$ to act like a minimal sufficient statistic of $X$ for predicting $Y$.
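To make the trade-off concrete, here is a minimal sketch that evaluates the Lagrangian objective (predictive term minus the weighted compression term) for small discrete distributions. It is an illustration under stated assumptions rather than the method used in the paper: the variables are discrete instead of continuous, the function names `mutual_information` and `ib_objective` are hypothetical, and the toy task (four equally likely inputs, with the target equal to the input index divided by two) is made up for the example.

```python
import math


def mutual_information(joint):
    """Mutual information I(A;B) in nats from a joint table joint[a][b]."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(col) for col in zip(*joint)]
    return sum(
        p * math.log(p / (p_a[a] * p_b[b]))
        for a, row in enumerate(joint)
        for b, p in enumerate(row)
        if p > 0
    )


def ib_objective(p_xy, encoder, beta):
    """I(Z;Y) - beta * I(Z;X) for joint p_xy[x][y] and encoder[x][z] = p(z|x)."""
    n_x, n_y, n_z = len(p_xy), len(p_xy[0]), len(encoder[0])
    p_x = [sum(row) for row in p_xy]
    # p(z, x) = p(x) * p(z|x)
    p_zx = [[p_x[x] * encoder[x][z] for x in range(n_x)] for z in range(n_z)]
    # p(z, y) = sum_x p(x, y) * p(z|x), since Z depends on Y only through X
    p_zy = [
        [sum(p_xy[x][y] * encoder[x][z] for x in range(n_x)) for y in range(n_y)]
        for z in range(n_z)
    ]
    return mutual_information(p_zy) - beta * mutual_information(p_zx)


# Toy task (assumption): X uniform on {0, 1, 2, 3}, Y = X // 2.
p_xy = [[0.25 if y == x // 2 else 0.0 for y in range(2)] for x in range(4)]

identity = [[1.0 if z == x else 0.0 for z in range(4)] for x in range(4)]  # Z = X
coarse = [[1.0 if z == x // 2 else 0.0 for z in range(2)] for x in range(4)]  # Z = X // 2
constant = [[1.0, 0.0] for _ in range(4)]  # Z ignores X

for name, enc in [("identity", identity), ("coarse", coarse), ("constant", constant)]:
    print(name, ib_objective(p_xy, enc, beta=0.5))
```

In this toy setting the coarse encoder keeps everything needed to predict the target while discarding the rest of the input, so it scores higher than both the identity encoding (fully predictive but uncompressed) and the constant encoding (fully compressed but uninformative).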
Why does IMAC work?
We discuss why the information bottleneck works in multi-agent communication.
We first introduce minimal sufficient statistics: a transformation $T(X)$ of the data $X$ is a minimal sufficient statistic for a parameter $\theta$ if $T(X) \in \arg\min_{S} I(X, S(X))$ subject to $I(\theta, S(X)) = \max_{T'} I(\theta, T'(X))$.
The information bottleneck principle generalizes the notion of minimal sufficient statistics and suggests using a summary $T(X)$ of the data that has the least mutual information with the data $X$ while preserving some amount of information about an auxiliary variable $Y$.
Following [Shamir, Sabato, and Tishby2010], we discuss, from a learning perspective, the role of $I(X, Z)$, the compression or minimality term in the information bottleneck, as a regularizer when maximizing $I(Z, Y)$.
Reinforcement learning learns an optimal action given a state. If we knew the optimal action in advance, as in imitation learning, we would maximize the mutual information between the state and its corresponding optimal action, which is a straightforward application of supervised learning. Here, $X$ represents the states, $Z$ represents the messages, and $Y$ represents the actions. Note that without regularization, $I(Z, Y)$ can be maximized by setting $Z = X$. However, $p(x, y)$ cannot be estimated efficiently from a sample of a reasonable size, which means that more samples are needed in reinforcement learning. In other words, methods with regularization on $I(X, Z)$, e.g., IMAC, can accelerate convergence.
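The small-sample difficulty mentioned above can be illustrated with a rough experiment. The sketch below is an assumption-laden illustration, not the analysis of [Shamir, Sabato, and Tishby2010]: it uses a plug-in (empirical-frequency) estimator of mutual information on two independent discrete variables, whose true mutual information is zero, and the names `plugin_mutual_information` and `sample_independent`, the alphabet size, and the sample sizes are all hypothetical.

```python
import math
import random
from collections import Counter


def plugin_mutual_information(pairs):
    """Plug-in estimate of I(X;Y) in nats from a list of (x, y) samples."""
    n = len(pairs)
    c_xy = Counter(pairs)
    c_x = Counter(x for x, _ in pairs)
    c_y = Counter(y for _, y in pairs)
    # (c/n) * log( (c/n) / ((c_x/n) * (c_y/n)) ) simplifies to the form below.
    return sum(
        (c / n) * math.log(c * n / (c_x[x] * c_y[y]))
        for (x, y), c in c_xy.items()
    )


def sample_independent(n, k, rng):
    """Draw n pairs of independent uniform variables over k values (true MI = 0)."""
    return [(rng.randrange(k), rng.randrange(k)) for _ in range(n)]


rng = random.Random(0)  # fixed seed so the run is repeatable
small = plugin_mutual_information(sample_independent(50, 20, rng))
large = plugin_mutual_information(sample_independent(20000, 20, rng))

print("estimate from 50 samples:   ", small)
print("estimate from 20000 samples:", large)
```

With few samples relative to the number of (state, action) combinations, the unregularized estimate reports substantial dependence where none exists; this is one way to read the claim that a compressed representation needs fewer samples.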
Appendix D Experimental Details and Results
| | | | | |
|---|---|---|---|---|
| MADDPG w/ com | 3.480 ± 0.042 | 1.530 ± 0.038 | 3.147 ± 0.088 | 3.891 ± 0.059 |
| IMAC train w/ bw=1 | 0.244 ± 0.028 | - | - | - |
| IMAC train w/ bw=5 | 2.227 ± 0.002 | 1.383 ± 0.003 | - | - |
| IMAC train w/ bw=10 | 2.763 ± 0.215 | 1.695 ± 0.044 | 3.017 ± 0.026 | - |

[Caption fragment: "..., which is calculated based on running variance"]
| | | | | | | |
|---|---|---|---|---|---|---|
| MADDPG_c1 | 18.01 \-14.22 | 24.15 \-29.88 | 22.38 \-16.91 | 47.59 \-45.64 | 34.25 \-27.68 | 50.81 \-43.62 |
| MADDPG_c5 | 26.32 \-20.48 | 15.67 \-11.59 | 29.06 \-22.16 | 27.07 \-22.89 | 23.44 \-20.41 | 32.24 \-26.46 |
| IMAC | 51.24 \-42.56 | 37.37 \-45.521 | 44.64 \-36.49 | 49.12 \-42.65 | 36.63 \-30.03 | 35.42 \-28.82 |
| IMAC_s5_c1 | 38.86 \-32.06 | 34.54 \-35.03 | 9.97 \-3.11 | 26.25 \-21.06 | 11.80 \-7.558 | 38.32 \-32.28 |
| IMAC_s10_c1 | 26.67 \-21.418 | 34.99 \-35.02 | 9.71 \-4.11 | 9.82 \-6.92 | 9.82 \-6.92 | 37.50 \-31.30 |
| IMAC_s10_c5 | 45.88 \-38.27 | 26.39 \-35.42 | 11.51 \-9.12 | 30.02 \-27.41 | 29.08 \-25.661 | 22.25 \-16.51 |
Cooperative navigation
In this environment, agents must cooperate through physical actions to reach a set of landmarks. Agents observe the relative positions of other agents and landmarks, and are collectively rewarded based on the proximity of any agent to each landmark. In other words, the agents have to ‘cover’ all of the landmarks. Further, the agents occupy significant physical space and are penalized when colliding with each other. Our agents learn to infer the landmark they must cover, and move there while avoiding other agents.
Table 2 shows that MADDPG with communication tends to use high-entropy messages, while IMAC conveys low-entropy messages. Combined with the performance in Figure 4, we can see that IMAC learns informative communication protocols under the limited-bandwidth constraint.
Predator and prey
In this variant of the classic predator-prey game, some slower cooperating agents must chase some faster adversaries around a randomly generated environment with some large landmarks impeding the way. Each time a cooperative agent collides with an adversary, the agents are rewarded while the adversary is penalized. Agents observe the relative positions and velocities of the agents, and the positions of the landmarks.
Table 3 shows the performance under different limited-bandwidth constraints during inference in the predator and prey environment. Under a limited-bandwidth constraint, both MADDPG with communication and IMAC suffer a degradation in performance. However, IMAC is more robust to limited bandwidth than MADDPG with communication.