Over the past several years, both single-agent reinforcement learning (RL) and multi-agent RL (MARL) have attracted a lot of interest. In (MA)RL, the agents often have to deal with partial observability: they can only observe part of the global state, which makes it hard to choose appropriate actions. For example, an agent navigating through an environment cannot see certain parts of that environment. In MARL, this partial observability can often be alleviated by allowing the agents to share information with each other. By combining this information with their own observation, agents get a more complete view of the environment and can choose better actions (tan1993multi; melo2012). For example, multiple agents navigating through the same environment can share what they observe with each other, resulting in a more complete view.
One of the subfields within MARL is research into learned communication between agents. The most commonly used approach thus far is to let gradients flow between agents as a form of feedback on the received messages. However, in the case of discrete communication messages this raises a problem, since gradients cannot flow through a discrete communication channel. Several approaches have been proposed in the state of the art (foerster2016learning; lowe2020multiagent; mordatch2018; lin2021learning) to discretize messages while still allowing gradients to flow through the discretization unit. Each of these methods was tested using a different communication learning approach and applied to different environments, making a fair comparison very difficult.
Our contributions consist of two parts. First, we present an in-depth comparison of several discretization methods used in the state of the art. In our comparison, we focus on using these discretization methods to allow discrete communication when the communication is learned using the gradients of the receiving agents. We compare each of the approaches on several environments with increasing complexity and also analyze their performance when the environment introduces errors into the communication messages. Second, we present two discretization methods (ST-DRU and ST-GS) that have not been used in communication learning before.
The remainder of this paper is structured as follows. Section 2 gives an overview of work related to our research. Section 3 contains additional background information. Section 4 provides a detailed explanation of the discretization methods we compare in this paper. Section 5 describes the different experiments along with their results. We discuss these results further in Section 6. In Section 7, we draw conclusions from our experimental results.
2. Related Work
In this section, we review state-of-the-art work relevant to our research. We give an overview of several communication learning methods, focusing on methods that learn discrete communication. These include alternative approaches for learning discrete communication besides using a differentiable communication channel, as well as the different discretization techniques used in the state of the art.
foerster2016learning and sukhbaatar2016learning proposed the first successful methods for learning inter-agent communication. foerster2016learning proposed two novel approaches, Reinforced Inter-Agent Learning (RIAL) and Differentiable Inter-Agent Learning (DIAL). Both RIAL and DIAL use discrete communication messages, but learn the communication policy in a different way. RIAL learns the communication policy the same way as learning the action policy, by using the team reward. However, the results clearly show that this is not sufficient in most environments. DIAL proved more successful by using gradients originating from the agents receiving the messages which provide feedback on the communication policy. sukhbaatar2016learning proposed a different approach called CommNet. Messages consist of the hidden state of the agents, resulting in continuous messages. Similar to DIAL, CommNet uses gradients that flow through the communication channel to train the communication.
A lot of the research that followed these works uses continuous communication, like sukhbaatar2016learning, avoiding the problem of discretizing the communication messages (simoes2020a3c3). Other works learn communication without sending gradients through the communication channel, which also avoids the challenge of discretizing the messages. jaques2019socialinfluence train the communication policy using the team reward augmented with a social influence reward. This additional reward is based on how much the message changes the action policy of the receiving agents. vanneste2021learning use counterfactual reasoning to directly learn a communication protocol without the need for a differentiable communication channel. Freed_Sartoretti_Hu_Choset_2020 use a randomized encoder at the sender to encode the continuous messages into discrete messages. At the receiver, a randomized decoder is used to approximate the original continuous message. They show that with this technique the communication channel can be considered equivalent to a continuous channel with additive noise, allowing gradients to flow between the sender and receiver agent.
lowe2020multiagent and mordatch2018 propose Multi-Agent Deep Deterministic Policy Gradients (MADDPG). In their work, they evaluate MADDPG on multiple scenarios, including communication tasks. They do not use a differentiable communication channel to learn communication, but they do have to make sure the messages are differentiable for MADDPG to work properly, since policies are learned using gradients that originate from the critic. They allow discrete communication by using a Gumbel Softmax. lin2021learning use an autoencoder at the sender to compose a representation of the observation that is used as the communication message. To discretize these messages, they use a straight through estimator in the autoencoder. Both lowe2020multiagent; mordatch2018 and lin2021learning have to use differentiable discretization techniques in their methods to allow discrete communication. However, they do not use the techniques in the same way we do in our work. lowe2020multiagent and mordatch2018 use the discretization method in a similar way as our work, but in MADDPG the gradients that correct the communication originate from the critic instead of from other agents. lin2021learning use the discretization method in a very different way, since they train the communication policy entirely using the reconstruction loss of the autoencoder instead of the gradients from the other agents.
In summary, multiple discretization methods have been proposed in the state of the art related to our research. However, differences in communication learning approaches and the fact that each of these methods is tested on different environments make comparing these discretization methods very hard.
3. Background
3.1. Deep Q-Networks (DQN)
In single agent RL (sutton1998introduction), the agent chooses an action $a$ based on the state $s$ of the environment. As a result of this action, the environment transitions to a new state and provides the agent with a reward $r$. This reward is used to train the agent. Q-learning uses this reward to calculate a Q-value $Q(s, a)$ for each state-action pair $(s, a)$, where a higher Q-value indicates a better action. Therefore, the policy of our agent can be defined by Equation 1.

$$\pi(s) = \operatorname*{arg\,max}_{a} Q(s, a) \quad (1)$$
Deep Q-learning (mnih2015human) uses a neural network with parameters $\theta$ to represent the Q-function. The deep Q-network is optimized at iteration $i$ by minimizing the loss in Equation 2.

$$L_i(\theta_i) = \mathbb{E}_{s, a, r, s'}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i)\right)^2\right] \quad (2)$$
where $\gamma$ is the discount factor and $\theta_i^-$ are the parameters of the target network. This target network is updated after each training iteration according to Equation 3.

$$\theta_i^- = \tau \theta_i + (1 - \tau)\,\theta_{i-1}^- \quad (3)$$
where $\tau$ is a weight that indicates how fast the target network should follow the parameters $\theta$. In our work, the agent does not receive the full state $s$ but only a limited observation $o$ of this state. This increases the complexity, since the observation might lack important information.
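As an illustration, the following sketch shows how the loss of Equation 2 and the target update of Equation 3 could be implemented in PyTorch. The QNetwork module, the batch layout and the hyperparameter values are illustrative assumptions, not the exact setup used in our experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QNetwork(nn.Module):
    """A small fully connected Q-network (illustrative architecture)."""

    def __init__(self, obs_size: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden), nn.ReLU(), nn.Linear(hidden, n_actions)
        )

    def forward(self, obs):
        return self.net(obs)  # one Q-value per action


def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Temporal-difference loss of Equation 2 for a batch of (o, a, r, o', done)."""
    obs, actions, rewards, next_obs, done = batch  # actions: LongTensor of shape (B,)
    q_taken = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # the target network is not updated by this loss
        target = rewards + gamma * (1 - done) * target_net(next_obs).max(dim=1).values
    return F.mse_loss(q_taken, target)


def soft_update(q_net, target_net, tau=0.01):
    """Target update of Equation 3: the target network slowly follows the online network."""
    for p, p_target in zip(q_net.parameters(), target_net.parameters()):
        p_target.data.copy_(tau * p.data + (1 - tau) * p_target.data)
```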
Table 2. Overview of the discretization methods, where $x$ is the input of the discretization unit, $n$ is Gaussian noise, $\Theta$ is the heaviside step function and $\sigma$ is the sigmoid function.

|Method|Training Output (forward pass)|Function used for backward pass|Evaluation Output (forward pass)|
|DRU|$\sigma(x + n)$|$\sigma(x + n)$|$\Theta(x)$|
|STE|$\Theta(x)$|$x$ (identity)|$\Theta(x)$|
|GS|Equation 7|Equation 7|Equation 6|
|ST-DRU|$\Theta(x + n)$|$\sigma(x + n)$|$\Theta(x)$|
|ST-GS|Equation 6|Equation 7|Equation 6|
3.2. Differentiable Inter-Agent Learning (DIAL)
To allow for a fair comparison, we use the same communication learning approach for each of the discretization methods. We use DIAL, as proposed by foerster2016learning, in our experiments, since it is the most general and well-known architecture for learning discrete communication using gradients from the other agents. The architecture of DIAL can be seen in Figure 1. We adapted the original DIAL architecture by separating the action and communication network. This allows us to keep the communication network small, making communication learning easier. In our experiments, we examine environments where the agents only need to share and encode part of the observation, which allows us to make this adaptation. When the agents are expected to communicate about a strategy, splitting the action and communication network may no longer be possible.
Each agent consists of two networks, the A-Net and the C-Net. The A-Net produces Q-values to determine the action based on the observation and the incoming messages. The C-Net is responsible for calculating the messages based on the observation. It does not receive the incoming messages in our experiments, because in these environments the communication policy does not need the incoming messages to determine the output message. Before the messages are broadcast to the other agents, the discretization unit applies one of the discretization techniques that we compare in this paper. To train the agents, we apply the team reward provided by the environment to the A-Net according to deep Q-learning. The gradients from the A-Net are propagated to the C-Net of all the agents that sent a message to that agent. This allows us to train the C-Net using the feedback of the agents receiving the messages.
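The sketch below illustrates this split between the A-Net and the C-Net, assuming a single message bit, feed-forward networks and illustrative layer sizes; it is not the exact architecture from Figure 1. Because the incoming messages are not detached, the gradients of the A-Net loss flow back through the communication channel into the C-Net of the sending agents.

```python
import torch
import torch.nn as nn


class CNet(nn.Module):
    """Computes the outgoing (real-valued) message from the agent's observation."""

    def __init__(self, obs_size: int, msg_bits: int = 1, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size, hidden), nn.ReLU(), nn.Linear(hidden, msg_bits)
        )

    def forward(self, obs):
        # the output is passed through the discretization unit before broadcasting
        return self.net(obs)


class ANet(nn.Module):
    """Computes Q-values from the agent's observation and the incoming messages."""

    def __init__(self, obs_size: int, in_msg_bits: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_size + in_msg_bits, hidden), nn.ReLU(), nn.Linear(hidden, n_actions)
        )

    def forward(self, obs, incoming_messages):
        # incoming_messages keep their computation graph, so the DQN loss on these
        # Q-values also produces gradients for the senders' C-Nets
        return self.net(torch.cat([obs, incoming_messages], dim=-1))
```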
4. Discretization Methods
In this section, we describe the discretization modules that we compare in more detail. Table 2 provides an overview of all the discretization methods and their differences. We show the difference between the function used to calculate the output of the discretization unit during training and during evaluation, as well as the function that is used for the backward pass.
4.1. Discretize Regularize Unit (DRU)
In the DIAL method, foerster2016learning propose a module called the Discretize Regularize Unit (DRU) to allow gradients to be used for training while learning discrete communication messages. The DRU has two modes, discretization and regularization. The discretization mode is used at execution time and discretizes the input into a single bit using Equation 4.
$$m = \Theta(x) \quad (4)$$

where $\Theta$ is the heaviside step function and $x$ is the input of the discretization unit (output of the C-Net). This calculation cannot be used during training, because the derivative of the heaviside function is the Dirac delta function, which is zero everywhere except at $x = 0$, where it is infinite. Therefore, the regularization mode is used during training. When using the regularization mode, the agents are allowed to communicate using continuous messages. However, the DRU tries to encourage the communication policy to generate messages that can easily be discretized at execution time. This is achieved by applying Equation 5.

$$m = \sigma(x + n), \quad n \sim \mathcal{N}(0, \sigma_n^2) \quad (5)$$
where $x$ is the input of the discretization unit (output of the C-Net), $n$ is Gaussian noise and $\sigma$ is the sigmoid function. The noise affects the output of the DRU the most when the input is around zero, since the sigmoid is steepest there. Its influence is much smaller for inputs with a high absolute value; in those cases, the output also goes towards zero or one, making it very similar to a discrete, binary message. This can be seen in Figure 3.
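A minimal PyTorch sketch of the DRU could look as follows; the function name and the default noise standard deviation are illustrative assumptions.

```python
import torch


def dru(x, sigma_n=2.0, training=True):
    """Regularization mode (Equation 5) during training, discretization mode (Equation 4) otherwise."""
    if training:
        # continuous output; the noise pushes the C-Net towards outputs with a large |x|
        return torch.sigmoid(x + sigma_n * torch.randn_like(x))
    # heaviside step of Equation 4
    return (x > 0).float()
```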
4.2. Straight Through Estimator (STE)
A straight through estimator (STE) (bengio2013estimating; yin2019understanding) performs a normal discretization, as in Equation 4, when calculating the output. However, when performing backpropagation, it uses the gradients of an identity function instead of the gradients of the discretization. The advantage of this technique is that the agent receiving the message immediately receives binary values and can learn how to react to these messages, while the sender can still use the gradients from the receiving agents to train the communication network. For the STE, the output always looks like the DRU in evaluation mode, shown in Figure 3.
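A minimal sketch of the STE is shown below, using the common trick of adding a detached residual so that the forward pass outputs the discretized bit while the backward pass sees an identity function; the function name is an illustrative assumption.

```python
import torch


def ste(x):
    """Forward: heaviside step of Equation 4. Backward: gradient of the identity."""
    hard = (x > 0).float()
    # the detached term contributes the hard value but no gradient
    return x + (hard - x).detach()
```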
4.3. Gumbel Softmax (GS)
The Gumbel Softmax (GS) (jang2017categorical; maddison2017concrete) is a method to approximate a sample from a categorical distribution in a differentiable way. Normal sampling techniques are not differentiable and therefore not directly applicable in this context. The GS achieves this desirable property by using Gumbel noise and the gumbel-max trick (gumbel1954statistical). Using the gumbel-max trick, we can sample from a categorical distribution with class probabilities $\pi_i$ as described in Equation 6.

$$m = \operatorname{one\_hot}\left(\operatorname*{arg\,max}_{i} \left(\log \pi_i + g_i\right)\right) \quad (6)$$

where $g_i$ are i.i.d. samples drawn from Gumbel(0, 1), $\pi_1 = \sigma(x)$, $\pi_2 = 1 - \sigma(x)$, $x$ is the input of the discretization unit (output of the C-Net) and $\sigma$ is the sigmoid function. To make this differentiable, we have to approximate the $\arg\max$ and one-hot functions with a softmax function. Since we need two probabilities to obtain a categorical distribution for both states of a bit, we obtain the output message by using Equation 7.

$$y_i = \frac{\exp\left((\log \pi_i + g_i)/\tau\right)}{\sum_{j=1}^{2} \exp\left((\log \pi_j + g_j)/\tau\right)} \quad (7)$$
where $x$ is the input of the discretization unit (output of the C-Net), $\tau$ is the softmax temperature and $\sigma$ is the sigmoid function. A low temperature results in an output that closely matches the output of Equation 6. The higher the temperature, the more the output approaches a uniform distribution. Figure 3 shows the behaviour of the GS for different inputs in training and evaluation mode, using the temperature from our experiments.
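The sketch below shows one possible implementation of the GS for a single message bit, assuming that the transmitted message is the component of the two-class sample corresponding to the bit being one; the function name and the temperature value are illustrative.

```python
import torch
import torch.nn.functional as F


def gumbel_softmax_bit(x, tau=1.0, training=True):
    # class log-probabilities for the two states of the bit: pi_1 = sigmoid(x), pi_2 = 1 - sigmoid(x)
    log_probs = torch.stack([F.logsigmoid(x), F.logsigmoid(-x)], dim=-1)
    uniform = torch.rand_like(log_probs).clamp_min(1e-20)
    gumbels = -torch.log(-torch.log(uniform))  # i.i.d. Gumbel(0, 1) samples
    if training:
        # Equation 7: softmax relaxation with temperature tau
        y = F.softmax((log_probs + gumbels) / tau, dim=-1)
    else:
        # Equation 6: gumbel-max trick, a hard sample from the categorical distribution
        y = F.one_hot(torch.argmax(log_probs + gumbels, dim=-1), num_classes=2).float()
    return y[..., 0]  # (relaxed) probability of the bit being one
```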
4.4. Straight Through Discretize Regularize Unit (ST-DRU)
We propose a novel discretization method, the ST-DRU, which combines the DRU and STE methods. During execution, we discretize the input messages in the same way as the DRU and STE methods, shown in Equation 4. During training, we use a different function for the forward and the backward pass. In the forward pass, we add Gaussian noise and apply the same discretization, resulting in Equation 8.

$$m = \Theta(x + n) \quad (8)$$
where $x$ is the input of the discretization unit (output of the C-Net) and $n$ is noise sampled from a Gaussian distribution with standard deviation $\sigma_n$. However, during backpropagation we use the gradients of Equation 5 instead. The advantage of this approach over the original DRU can be seen in Figure 5. The agents receiving the messages get binary messages from the start of training, whereas the DRU uses continuous messages during training. Even though the DRU encourages the agents to produce outputs with a high absolute value, it takes a while before the output messages resemble binary messages. The ST-DRU also encourages the agent to produce outputs that can easily be discretized, but the receiving agents immediately receive discrete messages, allowing them to learn to interpret them more quickly.
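A minimal sketch of the ST-DRU, again with illustrative names and noise settings: the forward pass sends the discretized bit of Equation 8, while the detached residual makes the backward pass use the gradient of Equation 5.

```python
import torch


def st_dru(x, sigma_n=2.0, training=True):
    if not training:
        return (x > 0).float()           # Equation 4
    noisy = x + sigma_n * torch.randn_like(x)
    soft = torch.sigmoid(noisy)          # Equation 5, used only for its gradient
    hard = (noisy > 0).float()           # Equation 8, the value that is actually sent
    return soft + (hard - soft).detach()
```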
4.5. Straight Through Gumbel Softmax (ST-GS)
Similarly to the ST-DRU, we also test a straight through version of the GS, as proposed by jang2017categorical. Here, we use the sampling technique described in Equation 6 to calculate the output, while using the gradients of the softmax approximation in Equation 7 to train the communication network. Similarly to the DRU, the GS produces continuous messages during training and only discretizes the messages when evaluating the agents. If the sending agent does not produce outputs with a high enough absolute value, the difference between the messages at training time and at evaluation time will be very large. This prevents the receiving agents from correctly interpreting the messages and choosing the appropriate actions. The ST-GS, on the other hand, produces discretized messages at evaluation time and at training time, as can be seen in Figure 5. This way, we make sure that the receiving agent knows how to correctly interpret discrete messages.
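The ST-GS can be sketched in the same way and closely corresponds to the hard variant of the Gumbel Softmax (for example `hard=True` in `torch.nn.functional.gumbel_softmax`); the function name and temperature are illustrative.

```python
import torch
import torch.nn.functional as F


def st_gumbel_softmax_bit(x, tau=1.0):
    log_probs = torch.stack([F.logsigmoid(x), F.logsigmoid(-x)], dim=-1)
    uniform = torch.rand_like(log_probs).clamp_min(1e-20)
    gumbels = -torch.log(-torch.log(uniform))
    soft = F.softmax((log_probs + gumbels) / tau, dim=-1)                 # Equation 7
    hard = F.one_hot(torch.argmax(soft, dim=-1), num_classes=2).float()   # Equation 6
    # forward pass: hard sample; backward pass: gradient of the softmax relaxation
    return (soft + (hard - soft).detach())[..., 0]
```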
5. Experiments
In this section, we explain each of our experiments and analyze the results. For each experiment, we show the average performance over five different runs for each of the discretization methods. The hyperparameters and network architectures are identical for each of the discretization methods, since our hyperparameter search showed that the best hyperparameters and network architecture were not influenced by the choice of discretization method. All of our experiments are run using the RLlib framework (liang2018rllib).
5.1. Matrix Environment
The Matrix environment is inspired by the Matrix Communication Games presented by lowe2019measuring. In the Matrix environment, each of the $n$ agents receives a natural number out of $m$ possible numbers. The values for $n$ and $m$ can be chosen independently of each other. The agents are allowed to broadcast one message to the other agents before they have to indicate whether all agents received the same number or not. The odds of the agents receiving the same number are 50%, regardless of the values of $n$ and $m$. The minimum number of bits required to represent each possible input number is $\lceil \log_2 m \rceil$. The team reward in this environment is equal to the number of agents that correctly determined whether all agents got the same number or not. Therefore, the maximum reward is equal to $n$. Table 3 shows the reward matrices that correspond with a Matrix environment with a fixed number of agents $n$ and any value for $m$.
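To make the task concrete, the following sketch simulates one episode of the Matrix environment under our reading of the rules above. The helper functions `message_fn` and `decision_fn` stand in for the agents' (learned) communication and action policies and are hypothetical.

```python
import numpy as np


def matrix_episode(n_agents, n_numbers, message_fn, decision_fn, rng=None):
    """One episode: draw numbers, broadcast one message each, guess 'same' or 'not same'."""
    rng = rng or np.random.default_rng()
    if rng.random() < 0.5:
        # all agents receive the same number (50% of the episodes)
        numbers = np.full(n_agents, rng.integers(n_numbers))
    else:
        numbers = rng.integers(n_numbers, size=n_agents)
        while len(set(numbers.tolist())) == 1:  # re-draw if the numbers happen to be equal
            numbers = rng.integers(n_numbers, size=n_agents)
    same = len(set(numbers.tolist())) == 1
    messages = [message_fn(i, numbers[i]) for i in range(n_agents)]            # one broadcast per agent
    guesses = [decision_fn(i, numbers[i], messages) for i in range(n_agents)]  # True means 'all the same'
    return sum(int(g == same) for g in guesses)  # team reward: number of correct agents
```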
We examine the results for two different configurations of this environment, which can be seen in Table 4. In each of these experiments, we show the evaluation reward of our agents, measured by performing 100 evaluation episodes after every 100 training iterations. During the evaluation episodes, the agents do not explore and the discretization methods are applied in evaluation mode. In this environment, all of the agents are identical. Therefore, we can use parameter sharing between the agents, which improves their performance significantly, as shown in the results of foerster2016learning.
Table 4. The two configurations of the Matrix environment used in our experiments.

|Configuration|Number of agents $n$|Possible input numbers $m$|
|Simple Matrix Environment|3|4|
|Complex Matrix Environment|5|256|
5.1.1. Simple Matrix Environment
In Figure 7 and Table 5, the results for the different discretization methods are shown. The maximum reward the agents can achieve in this scenario is 3. We can see that most of the methods achieve a reward very close to this maximum, except for the GS. The ST-GS does not share this issue and does reach the maximum reward. We also see that the STE method is faster at the beginning of training, but this difference disappears rather quickly. Due to the limited complexity of this environment, the differences between the methods are still small.
5.1.2. Complex Matrix Environment
The Complex Matrix environment has more agents ($n = 5$) as well as more possible input numbers ($m = 256$). The agents need a message consisting of a full byte to be able to encode each of the possible input numbers. Figure 7 and Table 5 show the results of this experiment. The maximum reward in this configuration is 5. We see that the difference between the methods is larger than in the Simple Matrix environment due to the added complexity. The STE method is the only one that is able to reach the maximum reward within this training period. It reaches a reward close to the maximum after only 5k training iterations. The other methods only start improving after 15k training iterations and take over 60k training iterations to reach their maximal performance. We can also see that the adapted versions of the DRU and GS, which include the STE technique, perform better than the versions without it. The ST-DRU has an average reward that is 0.072 higher than the DRU and the ST-GS has an average reward that is 0.176 higher than the GS during the final 10% of training iterations.

The communication amplitude in Figure 9 provides an explanation for the training speed of the STE method. The communication amplitude is the mean absolute value of the input of the discretization unit. We can see a clear difference between the STE and the other methods. The communication amplitude of the STE stays below 0.5, while the communication amplitude of the DRU and GS approaches 1.8 and the communication amplitude of the ST-DRU and the ST-GS exceeds 2.0. This is caused by the noise that is included in all of the discretization methods except for the STE. For a low communication amplitude, the output of each of these discretization methods is still very random: during training, the output is determined by the noise instead of by the sign of the input, as is done during evaluation. This encourages the agent to produce outputs with a higher communication amplitude. However, it also delays the speed at which the agents can discover communication protocols. In Figure 9, we see that the communication amplitude starts rising more quickly at around 15k training iterations. Once the communication amplitude starts rising, we can see in Figure 7 that the reward the agents receive also starts rising, indicating that they are starting to learn how to communicate with each other.
5.2. Speaker Listener Environment
As a more complex environment, we use the speaker listener scenario from the particle environment by OpenAI (lowe2020multiagent; mordatch2018). This is one of the environments that was used to evaluate MADDPG (lowe2020multiagent; mordatch2018). In this environment, there are two agents and three landmarks. One of the agents, the speaker, observes which landmark is the target during this episode. The speaker then has to communicate this information to the other agent, the listener. Next, the listener has to navigate to the target landmark. Both agents are rewarded using a team reward based on the distance of the listener to the target landmark. Contrary to the Matrix environment, the agents are not identical in this environment: the speaker only has a communication policy, while the listener only has an action policy. In this experiment, we show the evaluation reward of our agents, measured by performing 10 evaluation episodes after every 50 training iterations. During the evaluation episodes, the agents do not explore and the discretization methods are applied in evaluation mode.
Figure 9 and Table 5 show the results in the speaker listener environment. We see that the STE no longer performs as well as in our earlier experiments: it achieves the worst result, while the GS and DRU provide the best results. The delay caused by the noise in the DRU, GS, ST-DRU and ST-GS is no longer the determining factor in the training speed. The exploration of the communication policy that this noise provides in the DRU, GS, ST-DRU and ST-GS appears to have a beneficial effect.
5.3. Error Correction
[Figure 11. The communication protocol for the Matrix environment before and after the introduction of errors.] This figure shows the communication protocol for each of the discretization methods before and after errors are introduced. For the Straight Through Estimator, the same message is generated for both possible input numbers. The Gumbel Softmax and the Straight Through Gumbel Softmax show a communication protocol where, in almost all cases, the same input is mapped onto the same message. For the Gumbel Softmax, 1% of the inputs are mapped onto different output messages, which causes an overlap between the messages for both input numbers after introducing errors in 0.5% of the cases. The remaining cases result in an output message that is completely distinguishable from the other input number before and after introducing errors. For the Straight Through Gumbel Softmax, 0.5% of the input numbers are mapped onto different messages, which causes an overlap between the messages for both input numbers after introducing errors in 0.3% of the cases. The remaining cases result in an output message that is completely distinguishable from the other input number before and after introducing errors. The Discretize Regularize Unit and the Discretize Regularize Unit adapted with a Straight Through Estimator produce a communication protocol that always maps the same input number onto the same message. After introducing errors, the output messages do not show any overlap between the two possible input numbers.
In addition to comparing the different discretization methods under ideal circumstances, we also want to make this comparison in a situation with more uncertainty. Therefore, we perform additional experiments on the Matrix environment discussed in Section 5.1. However, instead of the perfect, error-free communication used before, we now flip a number of random bits with a certain probability. This causes the receiver to receive different information than intended by the sender. Depending on the maximum number of bits that can be flipped, the agents need more message bits to be able to counteract the errors that are introduced. We use a simple Matrix environment in which the agents receive one of two possible numbers. In this experiment, we show the evaluation reward of our agents, measured by performing 100 evaluation episodes after every 100 training iterations. During the evaluation episodes, the agents do not explore and the discretization methods are applied in evaluation mode. In this environment, all of the agents are identical. Therefore, we can use parameter sharing between the agents, which improves their performance significantly, as shown in the results of foerster2016learning.
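The sketch below shows one plausible reading of this error model: with a given probability, a number of randomly chosen message bits is flipped before the message reaches the receivers. The function name, its parameters and the exact sampling of the errors are assumptions for illustration.

```python
import numpy as np


def apply_channel_errors(message_bits, flip_prob=0.5, max_flips=1, rng=None):
    """With probability flip_prob, flip up to max_flips randomly chosen bits of the message."""
    rng = rng or np.random.default_rng()
    corrupted = np.array(message_bits, dtype=np.int8)
    if rng.random() < flip_prob:
        n_flips = rng.integers(1, max_flips + 1)
        idx = rng.choice(len(corrupted), size=n_flips, replace=False)
        corrupted[idx] ^= 1  # flip the selected bits
    return corrupted
```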
We perform a test where there is a 50% chance that an error will occur. Normally, the agents would be able to represent both possible incoming numbers using a single bit. However, if they have to be able to correct the errors that are introduced, they require three bits. The results of this experiment can be seen in Figure 10 and Table 5. We see that the STE is not able to correct the errors that occur and therefore does not achieve good results. The other discretization methods, however, are able to detect and correct the errors. Our hypothesis is that these methods are more robust to errors due to the noise that is used within these discretization methods.
To see how the agents are able to correct the introduced errors, we examine which message the agents choose for each incoming number. Figure 11 shows the communication policies for the different discretization methods: the output message for each input number, before and after the errors are introduced. We see that the agents with the DRU, GS, ST-DRU or ST-GS have chosen messages such that the possible messages after the introduction of errors do not overlap between the possible input numbers. This way, the agents make sure that the messages remain comprehensible, even if errors occur. For the GS, we see that there are 9 messages that overlap with the output messages for a different input number after error introduction. The same can be observed for the ST-GS in 6 cases. When we use the STE, the agents are not able to find such a communication protocol. Even before the errors are introduced, the messages for both possible inputs are the same, which indicates that the agents did not find a useful communication protocol.
6. Discussion
In our experiments, we compared different discretization techniques in different environments where the agents need to learn a communication protocol to achieve the goal. In this section, we discuss some general trends that we saw across the experiments. Table 5 shows how each of the methods performed in each experiment, listing the average return and standard deviation during the last 10% of training iterations. In our results, we saw only small differences in a simple environment. However, the differences become a lot more apparent in a more complex environment. The STE method performs very well in both the Simple and Complex Matrix environment, while performing worst in the speaker listener environment and failing to achieve the goal in the error correction task. This makes the STE not recommended as a default method, especially in environments where perfect communication cannot be guaranteed. Similarly, the GS either performs the best among the tested methods or the worst. The ST-GS shows a more consistent performance than the regular GS. The DRU and the ST-DRU perform very similarly, except in the speaker listener environment, where the DRU clearly outperforms the ST-DRU.
Overall, we can state that in most cases, either the DRU, ST-DRU or ST-GS should be used to discretize communication. These methods provide consistent results across the experiments while the STE and GS might achieve a higher return or be faster in some cases but fail dramatically in others.
7. Conclusion
In this paper, we compared several discretization methods in different environments with different complexities and challenges. We focused on the situation where these discretization methods are used to discretize communication messages between agents that are learning to communicate with each other while acting in an environment.
The results showed that the choice of discretization method can have a big impact on performance. Across all of the experiments, the DRU, ST-DRU and ST-GS performed best. The STE performs a lot better in the Matrix environment in terms of speed and return. However, in the speaker listener environment, the STE performs the worst, and in the error correction task, it fails to learn a communication protocol. The GS performs best among all of the methods in the environments where the STE fails, but performs the worst in the Matrix environment. The DRU, ST-DRU and ST-GS show more consistent results, close to the best result, making them a better standard choice. However, it may sometimes prove useful to perform additional experiments to establish the best discretization method.
Acknowledgments
This work was supported by the Research Foundation Flanders (FWO) under Grant Number 1S12121N and Grant Number 1S94120N. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.