Biases for Emergent Communication in Multi-agent Reinforcement Learning

12/11/2019 ∙ by Tom Eccles, et al. ∙ Google

We study the problem of emergent communication, in which language arises because speakers and listeners must communicate information in order to solve tasks. In temporally extended reinforcement learning domains, it has proved hard to learn such communication without centralized training of agents, due in part to a difficult joint exploration problem. We introduce inductive biases for positive signalling and positive listening, which ease this problem. In a simple one-step environment, we demonstrate how these biases ease the learning problem. We also apply our methods to a more extended environment, showing that agents with these inductive biases achieve better performance, and analyse the resulting communication protocols.


1 Introduction

Environments where multiple learning agents interact can model important real-world problems, ranging from multi-robot or autonomous vehicle control to societal social dilemmas (CaoOverview, ; MatignonMultiRobot, ; LeiboDilemmas, ). Further, such systems leverage implicit natural curricula, and can serve as building blocks on the route to constructing artificial general intelligence (LeiboManifesto, ; BansalEmergent, ). Multi-agent games provide longstanding grand challenges for AI (KitanoRoboCup, ), with important recent successes such as learning a cooperative and competitive multi-player first-person video game to human level (CTF, ). An important unsolved problem in multi-agent reinforcement learning (MARL) is communication between independent agents. In many domains, agents can benefit from sharing information about their beliefs, preferences and intents with their peers, allowing them to coordinate joint plans or jointly optimize objectives.

A natural question that arises when agents inhabiting the same environment are given a communication channel without an agreed protocol of communication is that of emergent communication (steels2003evolving, ; wagner2003progress, ; foerster2016learning, ): how would the agents learn a “language” over the joint channel, allowing them to maximize their utility? The most naturalistic model for emergent communication in MARL is that used in Reinforced Inter-Agent Learning (RIAL) (foerster2016learning, ), where agents optimize a message policy via reinforcement from the environment’s reward signal. Unfortunately, straightforward implementations perform poorly (foerster2016learning, ), driving recent research to focus on differentiable communication models (foerster2016learning, ; Mordatch:Abbeel:2018, ; SukhbaatarComms, ; Hausknecht16, ), even though these models are less generally applicable or realistic.

RIAL offers the advantage of having decentralized training and execution; similarly to human communication, each agent treats others as a part of its environment, without the need to have access to other agents’ internal parameters or to back-propagate gradients “through” parameters of others. Further, agents communicate with discrete symbols, providing symbolic scaffolding for extending to natural language. We build on these advantages, while facilitating joint exploration and learning via communication-specific inductive biases.

We tackle emergent communication through the lens of Paul Grice (grice1968utterer, ; neale1992paul, ), and capitalize on the dual view of communication in which interaction takes place between a speaker, whose goal is to be informative and relevant (adhering to the equivalent Gricean maxims), and a listener, who receives a piece of information and assumes that their speaker is cooperative (providing informative and relevant information). Our methodology is inspired by the recent work of Lowe et al. (lowe2019pitfalls, ), who proposed a set of comprehensive measures of emergent communication along the two axes of positive signalling and positive listening, aiming to distinguish genuine communication from pathological cases.

Our contribution: we formulate losses which encourage positive signalling and positive listening, which are used as auxiliary speaker and listener losses, respectively, and are appended to the RIAL communication framework. We design measures in the spirit of Lowe et al. (lowe2019pitfalls, ), but rather than using these as an introspection tool, we use them as an optimization objective for emergent communication. We design two sets of experiments that help us clearly isolate the real contribution of communication to task success. In a one-step environment based on summing MNIST digits, we show that the biases we use facilitate the emergence of communication, and analyze how they change the learning problem. In a gridworld environment based on searching for treasure, we show that the biases we use make communication appear more consistently, and we interpret the resulting protocol.

1.1 Related Work

Differentiable communication was considered for discrete messages (foerster2016learning, ; Mordatch:Abbeel:2018, ) and continuous messages (SukhbaatarComms, ; Hausknecht16, ), by allowing gradients to flow through the communication channel. This improves performance, but effectively models multiple agents as a single entity. In contrast, we assume agents are independent learners, making the communication channel non-differentiable. Earlier work on emergent communication focused on cooperative “embodied” agents, showing how communication helps accomplish a common goal (foerster2016learning, ; Mordatch:Abbeel:2018, ; Das:etal:2018, ), or investigating communication in mixed cooperative-competitive environments (lowe2017multi, ; Cao:etal:2018, ; Jaques:etal:2019, ), studying properties of the emergent protocols (Lazaridou:etal:2018, ; Kottur:etal:2017, ; lowe2019pitfalls, ).

Previous research has investigated independent reinforcement learners in cooperative settings (TanIndependentvCooperative, ), with more recent work focusing on canonical RL algorithms. One version of decentralized Q-learning converges to optimal policies in deterministic tabular environments without additional communication (LauerDistributed, ), but does not trivially extend to stochastic environments or function approximation. Centralized critics (lowe2017multi, ; FoersterCOMA, ) improve stability by allowing agents to use information from other agents during training, but these violate our assumptions of independence, and may not scale well.

2 Setting

We apply multi-agent reinforcement learning (MARL) in partially-observable Markov games (i.e. partially-observable stochastic games) (ShapleyStochasticGames, ; LittmanMarkovGames, ; Hansen04POSG, ), in environments where agents have a joint communication channel. In every state, agents take actions given partial observations of the true world state, including messages sent on a shared channel, and each agent obtains an individual reward. Through their individual experiences of interacting with one another and the environment, agents learn to broadcast appropriate messages, interpret messages received from peers, and act accordingly.

Formally, we consider an N-player partially observable Markov game (ShapleyStochasticGames, ; LittmanMarkovGames, ) defined on a finite state set S, with action sets A_1, …, A_N and message sets M_1, …, M_N. An observation function O: S × {1, …, N} → R^d defines each agent’s d-dimensional restricted view of the true state space. On each timestep t, each agent i receives as an observation o_t^i = O(s_t, i), and the messages m_{t−1}^j sent in the previous state for all agents j. Each agent then selects an environment action a_t^i ∈ A_i and a message action m_t^i ∈ M_i. Given the joint action a_t, the state changes based on a transition function T: S × A_1 × … × A_N → Δ(S); this is a stochastic transition, and we denote the set of discrete probability distributions over S as Δ(S). Every agent gets an individual reward r_t^i for player i. We use the notation a_t = (a_t^1, …, a_t^N), m_t = (m_t^1, …, m_t^N), and τ_t^i for agent i’s trajectory of experience. We write a_t^{−i} for the joint action, excluding agent i, and m_t^{−i} for the joint message, excluding agent i.

In our fully cooperative setting, each agent receives the same reward at each timestep, r_t^1 = … = r_t^N, which we denote by r_t. Each agent maintains an action and a message policy from which actions and messages are sampled, π_i(a | τ_t^i) and π_i^m(m | τ_t^i), which can in general be functions of their entire trajectory of experience τ_t^i. These policies are optimized to maximize the discounted cumulative joint reward E[Σ_t γ^t r_t], where the discount γ < 1 ensures convergence. Although the objective is a joint objective, our model is that of decentralized learning and execution, where every agent has its own experience in the environment, and independently optimizes the objective with respect to its own action and message policies π_i and π_i^m; there is no communication between agents other than through the actions and message channel in the environment. Applying independent reinforcement learning to cooperative Markov games results in a problem for each agent which is non-stationary and non-Markov, and presents difficult joint exploration and coordination problems (BernsteinDecPomdp, ; ClausBoutillierDynamics, ; LaurentIndepenedentLearners, ; MatignonIndependentLearners, ).
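The shared objective above is a standard discounted return; a minimal sketch in plain Python (the discount value here is illustrative):

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted cumulative reward: G = sum_t gamma^t * r_t.

    In the fully cooperative setting, every agent receives the same
    reward r_t on each timestep, so this single return is the objective
    each agent independently optimizes. gamma < 1 ensures convergence.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: a shared reward of 1 on each of 3 timesteps with gamma = 0.5.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```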

3 Shaping Losses for Facilitating Communication

One difficulty in emergent communication is getting the communication channel to help with the task at all. There is an equilibrium where the speaker produces random symbols, and the listener’s policy is independent of the communication. This might seem like an unstable equilibrium: if one agent uses the communication channel, however weakly, the other will have some learning signal. However, this is not the case in some tasks. If the task without communication has a single, deterministic optimal policy, then messages from policies sufficiently close to the uniform message policy should be ignored by the listener. Furthermore, any entropy costs imposed on the communication channel, which are often crucial for exploration, exacerbate the problem, as they produce a positive pressure for the speaker’s policy to be close to random. Empirically, we often see agents fail to use the communication channel at all; but when agents start to use the channel meaningfully, they are then able to find at least a locally optimal solution to the communication problem.

We propose two shaping losses for communication to alleviate these problems. The first is for positive signalling (lowe2019pitfalls, ): encouraging the speaker to produce diverse messages in different situations. The second is for positive listening (lowe2019pitfalls, ): encouraging the listener to act differently for different messages. In each case, the goal is for one agent to learn to ascribe some meaning to the communication, even while the other does not, which eases the exploration problem for the other agent.

We note that most policies which score highly on these objectives do not lead to high reward. Much information about an agent’s state is unhelpful to the task at hand, so with a limited communication channel, positive signalling is not sufficient for useful communication. For positive listening, the situation is even worse – most ways of conditioning actions on messages are actively unhelpful to the task, particularly when the speaker has not developed a good protocol. These losses should therefore not be expected to lead directly to good communication. Rather, they are intended to ensure that the agents begin to use the communication channel at all – after this, MARL can find a useful protocol.

3.1 Bias for positive signalling

The first inductive bias we use promotes positive signalling, incentivizing the speaker to produce different messages in different situations. We add a loss term which is minimized by message policies that have high mutual information with the speaker’s trajectory. This encourages the speaker to produce messages uniformly at random overall, but non-randomly when conditioned on the speaker’s trajectory.

We denote by π̄^m the average message policy for agent i over all trajectories, weighted by how often they are visited under the current action policies for all agents. The mutual information of agent i’s message with their trajectory is:

I(m_t; τ_t) = H(m_t) − H(m_t | τ_t)   (1)
            = −Σ_m π̄^m(m) log π̄^m(m) + E_{τ_t} [ Σ_m π^m(m | τ_t) log π^m(m | τ_t) ]   (2)

We estimate this mutual information from a batch of rollouts of the agent policy. We calculate H(m_t | τ_t) exactly for each timestep from the agent’s policy. To estimate H(m_t), we estimate π̄^m as the average message policy in the batch of experience. Intuitively, we would like to maximize I(m_t; τ_t), so that the speaker’s message depends maximally on their current trajectory. However, adding this objective as a loss for gradient descent leads to poor solutions. We hypothesize that this is due to properties of the loss landscape. Policies which maximize mutual information are deterministic for any particular trajectory τ_t, but uniformly random unconditional on τ_t. At such policies, the gradient of the conditional entropy term is infinite. Further, for any ε > 0, the space of policies which have conditional entropy at most ε is disconnected, in that there is no continuous path in policy space between some policies in this set.

To overcome these problems, we instead use a loss which is minimized for a high value of H(m_t) and a target value for H(m_t | τ_t). The loss we use is:

L_ps = −H(m̄) + λ · E_t [ (H(m_t | τ_t) − H_target)² ]   (3)

for some target entropy H_target, which is a hyperparameter. This loss has finite gradients around its minima, and for suitable choices of H_target the space of policies which minimizes this loss is connected. In practice, we found low sensitivity to H_target, and typically use a value of around log(|M|)/2, which is half the maximum possible entropy.

1: Input: batch of B rollouts, each of length T.
2: Input: target conditional entropy H_target; weight λ for the conditional entropy term.
3: for b = 1; b ≤ B; b++ do  # Batch of rollouts.
4:     Observations o_t, actions a_t, other-agent messages m_{t−1}, for t = 1…T; initial hidden state h_0.
5:     for t = 1; t ≤ T; t++ do
6:         h_t ← U(h_{t−1}, o_t, a_{t−1}, m_{t−1}).  # Hidden state update rule.
7:         p_{b,t} ← π^m(· | h_t).  # Message policy at this timestep.
8:     end for
9: end for
10: p̄ ← (1/BT) Σ_{b,t} p_{b,t}.  # Estimated average message policy.
11: L_ps ← −H(p̄) + (λ/BT) Σ_{b,t} (H(p_{b,t}) − H_target)².
Algorithm 1 Calculation of positive signalling loss

3.2 Bias for positive listening

The second bias promotes positive listening: encouraging the listener to condition their actions on the communication channel. This gives the speaker some signal to learn to communicate, as its messages have an effect on the listener’s policy and thus on the speaker’s reward. The way we encourage positive listening is akin to the causal influence of communication, or CIC (Jaques:etal:2019, ; lowe2019pitfalls, ). In (Jaques:etal:2019, ), this was used as a bias for the speaker, to produce influential messages, and in (lowe2019pitfalls, ) as a measure of whether communication is taking place. We use a similar measure as a loss for the listener to be influenced by messages. In (Jaques:etal:2019, ; lowe2019pitfalls, ), CIC was defined over one timestep as the mutual information between the speaker’s message and the listener’s action. We extend this to multiple timesteps, using the mutual information between all of the speaker’s previous messages and a single listener action – using this as an objective encourages the listener to pay attention to all the speaker’s messages, rather than just the most recent. For a listener trajectory τ_t, we define τ̂_t to be the trajectory τ_t with the messages removed. We define the multiple timestep CIC as:

CIC_t = H(a_t | τ̂_t) − H(a_t | τ_t)   (4)
      = E_{τ_t} [ D_KL( π(· | τ_t) ‖ π(· | τ̂_t) ) ]   (5)

We estimate this multiple timestep CIC by learning the distribution π̄(a_t | τ̂_t). We do this by performing a rollout of the agent’s policy network, with the actual observations and actions in the trajectory, and zero inputs in the place of the messages. We fit the resulting function π̄ to predict π(a_t | τ_t), using a cross-entropy loss between these distributions:

L_ce = −E_{τ_t} [ Σ_a π(a | τ_t) log π̄(a | τ̂_t) ]   (6)

where we backpropagate only through the log π̄(a | τ̂_t) term. For a given policy π, this loss is minimized in expectation when π̄(a | τ̂_t) = E[π(a | τ_t) | τ̂_t]. Thus π̄ is trained to be an approximation of the listener’s policy unconditioned on the messages it has received. The multi-timestep CIC can then be estimated by the KL divergence between the message-conditioned policy and the unconditioned policy:

CIC_t ≈ D_KL( π(· | τ_t) ‖ π̄(· | τ̂_t) )   (7)

For training positive listening, we use a different divergence between these two distributions, which we empirically find achieves more stable training. We use the ℓ1 norm between the two distributions:

L_pl = −Σ_a | π(a | τ_t) − π̄(a | τ̂_t) |   (8)
1: Input: observations o_t, actions a_t, other-agent messages m_{t−1}, for t = 1…T; initial hidden states h_0 = ĥ_0.
2: for t = 1; t ≤ T; t++ do
3:     h_t ← U(h_{t−1}, o_t, a_{t−1}, m_{t−1}).  # Rollout with the real messages.
4:     ĥ_t ← U(ĥ_{t−1}, o_t, a_{t−1}, 0).  # Rollout with messages zeroed.
5:     p_t ← π(· | h_t); p̂_t ← π̄(· | ĥ_t).
6:     L_ce ← L_ce − Σ_a p_t(a) log p̂_t(a).  # Fit the unconditioned policy (gradients through p̂_t only).
7:     L_pl ← L_pl − Σ_a | p_t(a) − p̂_t(a) |.  # Positive listening loss (gradients through p_t only).
8: end for
Algorithm 2 Calculation of positive listening losses
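A numpy sketch of the two listener-side losses computed in Algorithm 2, assuming the action distributions from the two rollouts (with real messages, and with messages zeroed) are already available; the names are illustrative:

```python
import numpy as np

def positive_listening_losses(pi_cond, pi_uncond, eps=1e-12):
    """Listener-side losses from Algorithm 2.

    pi_cond:   [T, |A|] action distributions from the rollout with real messages.
    pi_uncond: [T, |A|] distributions from the rollout with messages zeroed.
    Returns (cross-entropy loss for fitting the unconditioned policy,
             positive listening loss: the negative l1 distance, so that
             minimizing it pushes the two policies apart).
    """
    p = np.asarray(pi_cond, dtype=float)
    q = np.asarray(pi_uncond, dtype=float)
    ce_loss = -np.mean(np.sum(p * np.log(q + eps), axis=-1))
    pl_loss = -np.mean(np.sum(np.abs(p - q), axis=-1))
    return ce_loss, pl_loss

# A listener that ignores messages gets zero positive listening signal.
p = [[0.5, 0.5], [0.5, 0.5]]
ce, pl = positive_listening_losses(p, p)
print(abs(pl))  # 0.0
```

In an actual agent, gradients would flow only through `pi_uncond` for the cross-entropy term and only through `pi_cond` for the listening term, as noted in the algorithm.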

4 Empirical Analysis

We consider two environments. The first is a simple one-step environment, where agents must sum MNIST digits by communicating their value. This environment has the advantage of being very amenable to analysis, as we can readily quantify how valuable the communication channel currently is to each agent. In this environment, we provide evidence for our hypotheses about how the biases we introduce in Section 3 ease the learning of communication protocols. The second environment is a new multi-step MARL environment which we name Treasure Hunt. It is designed to have a clear performance ceiling for agents which do not utilise a communication channel. In this environment, we show that the biases enable agents to learn to communicate in a multi-step reinforcement learning environment. We also analyze the resulting protocols, finding interpretable protocols that allow us to intervene in the environment and observe the effect on listener behaviour. The full details of the Treasure Hunt environment, together with the hyperparameters used in our agents, can be found in the supplementary material.

4.1 Summing MNIST digits

In this task, depicted in Figure 1, the speaker and listener agents each observe a different MNIST digit (as an image), and must determine the sum of the digits. The speaker observes an MNIST digit and selects one of a fixed set of possible messages. The listener receives this message, observes an independent MNIST digit, and must produce one of the possible actions. If this action matches the sum of the digits, both agents get a fixed positive reward; otherwise, both receive no reward. The agents used in this environment consist of a convolutional neural net, followed by a multi-layer perceptron and a linear layer to produce policy logits. For the listener, the message is concatenated, as a one-hot vector, to the output of the convnet. The agents are trained independently with REINFORCE.
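To illustrate the setup, here is a toy stand-in: two independent tabular REINFORCE learners on a miniature version of the task, with binary “digits” in place of MNIST images (all sizes, step counts and learning rates here are illustrative, not the paper’s):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Tabular logits: the speaker maps its digit (0 or 1) to one of 2 messages;
# the listener maps (message, own digit) to an action in {0, 1, 2}.
speaker_logits = np.zeros((2, 2))
listener_logits = np.zeros((2, 2, 3))
lr = 0.1

for step in range(2000):
    ds, dl = rng.integers(0, 2, size=2)      # the two "digits"
    pm = softmax(speaker_logits[ds])
    m = rng.choice(2, p=pm)                  # sample a message
    pa = softmax(listener_logits[m, dl])
    a = rng.choice(3, p=pa)                  # sample an action
    r = 1.0 if a == ds + dl else 0.0         # reward iff the sum is correct
    # Independent REINFORCE updates: reward * grad log pi for each agent.
    gs = -pm; gs[m] += 1.0
    speaker_logits[ds] += lr * r * gs
    gl = -pa; gl[a] += 1.0
    listener_logits[m, dl] += lr * r * gl

# Greedy evaluation over all digit pairs.
correct = 0
for ds in range(2):
    for dl in range(2):
        m = int(np.argmax(speaker_logits[ds]))
        a = int(np.argmax(listener_logits[m, dl]))
        correct += (a == ds + dl)
print(correct, "of 4 digit pairs solved greedily")
```

Note that whether a useful protocol emerges here is run-dependent, which is exactly the joint exploration problem the biases in Section 3 target.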

Figure 1: Summing MNIST environment. In this example, both agents would get reward if the listener’s action equals the sum of the two digits.

The purpose of this environment is to test whether and how the biases we propose ease the learning task. To do this, we quantify how useful the communication channel is to the speaker and to the listener. We periodically calculate the rewards for the following policies:

  1. The optimal listener policy, given the current speaker and the labels of the listener’s MNIST digits.

  2. The optimal listener policy, given the labels of the listener’s MNIST digits and no communication channel.

  3. The optimal speaker policy, given the current listener and the labels of the speaker’s MNIST digits.

  4. The uniform speaker policy, given the current listener.

We calculate these quantities by running over many batches of MNIST digits, and calculating the optimal policies explicitly; denote their rewards R1, …, R4 respectively. The reward the listener can gain from using the communication channel is R1 − R2, so this is a proxy for the strength of the learning signal for the listener to use the channel. Similarly, R3 − R4 is how much reward the speaker can gain from using the communication channel, and so is a proxy for the strength of the learning signal for the speaker.
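The optimal listener used in this analysis can be computed exactly from the speaker’s current policy; a sketch follows, assuming a `speaker` array of per-digit message distributions and a uniform prior over digits (the function name and interface are illustrative):

```python
import numpy as np

def optimal_listener(speaker, n_digits=10):
    """Best-response listener: for each (message, own digit), pick the
    action maximizing the probability of matching the digit sum.

    speaker: [n_digits, n_messages] message distribution per speaker digit.
    Returns: [n_messages, n_digits] array of optimal actions, and the
    expected reward of this listener against the given speaker.
    """
    speaker = np.asarray(speaker, dtype=float)
    n_messages = speaker.shape[1]
    joint = speaker / n_digits                   # P(digit, message), uniform prior
    post = joint / (joint.sum(axis=0, keepdims=True) + 1e-12)  # P(digit | message)
    actions = np.zeros((n_messages, n_digits), dtype=int)
    expected_reward = 0.0
    for m in range(n_messages):
        p_message = joint[:, m].sum()
        for d_l in range(n_digits):
            # P(sum = d_s + d_l | message) for each candidate action.
            scores = np.zeros(2 * n_digits - 1)
            for d_s in range(n_digits):
                scores[d_s + d_l] += post[d_s, m]
            actions[m, d_l] = int(np.argmax(scores))
            expected_reward += p_message * (1 / n_digits) * scores[actions[m, d_l]]
    return actions, expected_reward

# A perfectly informative speaker over 2 "digits" yields reward 1.
acts, r = optimal_listener(np.eye(2), n_digits=2)
print(round(r, 6))  # 1.0
```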

Figure 2: (a) Both biases lead to more reward. (b, c, d) Listener and speaker power in various settings. Listener power increases first with positive signalling, and speaker power increases first with positive listening.

The results (Figure 2) support the hypothesis that the bias for positive signalling eases the learning problem for the listener, and the bias for positive listening eases the learning problem for the speaker. When neither agent has any inductive bias, both the listener’s and the speaker’s potential gains from the channel stay low throughout training, and the final reward is exactly what can be achieved in this environment with no communication. When we add a bias for positive signalling or positive listening, we see the communication channel used in most runs (Table 1), leading to greater reward, and both potential gains increase. Importantly, when we add our inductive bias for positive listening, we see the speaker’s gain increase initially, followed by the listener’s. This is consistent with the hypothesis that the positive listening bias produces a stronger learning signal for the speaker; once the speaker has begun to communicate meaningfully, the listener also has a strong learning signal. When we add the bias for positive signalling, the reverse is true – the listener’s gain increases before the speaker’s. This again fits the hypothesis that the speaker’s bias produces a stronger learning signal for the listener.

We also ran experiments with the speaker getting an extra reward for influencing the listener, as in (Jaques:etal:2019, ). However, we did not see any gain from this over the no-bias baseline; in our setup, it seems the speaker agent was unable to gain any influence over the listener. We think there is a natural reason this bias would not help in this environment: for a fixed listener, the speaker policy which maximizes influence need have no relation to the speaker’s input. Thus this bias does not force the speaker to produce different messages for different inputs, and so does not increase the learning signal for the listener.

Biases Proportion of good runs CI Final reward of good runs
No bias - N/A
Social influence - N/A
Positive listening -
Positive signalling -
Both -
Table 1: Both biases lead to consistent discovery of useful communication. We define a good run to be one with final average reward greater than . Averages are over runs for each setting.

4.2 Treasure Hunt

We propose a new cooperative RL environment called Treasure Hunt, where agents explore several tunnels to find treasure (videos of this environment can be found at https://youtu.be/eueK8WPkBYs and https://youtu.be/HJbVwh10jYk). When successful, both agents receive a reward. The agents have a limited field of view; one agent is able to efficiently find the treasure, but can never reach it, while the other can reach the treasure but must perform costly exploration to find it. In the optimal solution, the agent which can see the treasure finds it and communicates the position to the agent which can reach it. Agents communicate by sending one of five discrete symbols on each timestep. The precise generation rules for the environment can be found in the supplementary material.

Figure 3: Treasure hunt environment.
Figure 4: Positive signalling and listening biases lead to more reward.

The agents used in this environment are advantage actor-critic agents (mnih2016asynchronous, ) with the V-trace correction (espeholt2018, ). The agent architecture employs a single convolutional layer, followed by a multi-layer perceptron. The message from the other agent is concatenated to the output of the MLP, and fed into an LSTM. The network’s action policy, message policy and value function heads are linear layers. Our training follows the independent multi-agent reinforcement learning paradigm: each agent is trained independently using its own experience of states and actions. We use RMSProp (rmsprop, ) to adjust the weights of the agent’s neural network. We co-train two agents, each in a consistent role (finder or collector) across episodes.

The results are shown in Table 2. We find that biases for positive signalling and positive listening both lead to increased reward, and adding either bias leads to more consistent discovery of useful communication protocols; we define these as runs which achieve reward greater than the maximum final reward in runs with no communication. With or without biases, the agents still frequently discover only local optima – for example, protocols where the agent which can find treasure reports on the status of only one tunnel, leaving the other agent to search the remaining tunnels. This demonstrates a limitation of these methods: positive signalling and listening biases are useful for finding some helpful communication protocol, but they do not completely solve the joint exploration problem in emergent communication. However, among runs which achieve some communication, we see greater reward on average with both biases, corresponding to reaching better local optima for the communication protocol.

We also ran experiments with the speaker getting an extra reward for influencing the listener, as in (Jaques:etal:2019, ). Here, we used the centralized model in (Jaques:etal:2019, ), where the listener calculates the social influence of the speaker’s messages, and the speaker gets an intrinsic reward for increasing this influence. We did not see a significant improvement in task reward, as compared to communication with no additional bias.

Biases Proportion good CI Final reward (good runs) Final reward
No bias -
Positive signalling -
Positive listening -
Both -
Table 2: Proportion and average reward of good runs. Values are means over runs with confidence intervals, calculated using Wilson approximation in the case of Bernoulli variables.
Run Mean visit time (unmodified) Mean visit time (modified)
Median
Best
Table 3: Visit time to tunnel, with and without modified messages. Values are means over episodes with confidence intervals.

We analyze the communication protocols for two runs, corresponding to the two videos linked above. One is a typical solution among runs where communication emerges; we picked this run by taking the median final reward out of all runs with both positive signalling and positive listening biases enabled. Qualitatively, the behaviour is simple – the finder finds the rightmost tunnel, and then reports whether there is treasure in that tunnel for the remainder of the episode. The other run we analyze is the one with the greatest final reward; this has more complicated communication behaviour. To analyze these runs, we rolled out episodes using the final policies from each.

First, we relate the finder’s communication protocol to the actual location of the treasure on each frame; in both runs, we see that these are well correlated. In the median run, one symbol relates strongly to the presence of treasure; when this symbol is sent, the treasure is in the rightmost tunnel with high probability. In the best run, where multiple tunnels appear to be reported on by the finder, the protocol is more complicated, with various symbols correlating with one or more tunnels. Details of the correlations between tunnels and symbols can be found in the supplementary material.

Next, we intervene in the environment to demonstrate that these communication protocols have the expected effect on the collector. For each of these pairs of agents, we produce a version of the environment where the message channel is overridden after a fixed number of frames. We override the channel with a constant message, using the symbol which most strongly indicates a particular tunnel. We then measure how long the collector takes to reach a square near the bottom of that tunnel, where the agent is just out of view of the treasure. In Table 3, we compare this to the baseline where we do not override the communication channel. In both cases, the collector reaches the tunnel significantly faster than in the baseline, indicating that the finder’s consistent communication is being acted on as expected.

5 Conclusion

We introduced two new shaping losses to encourage the emergence of communication in decentralized learning; one on the speaker’s side for positive signalling, and one on the listener’s side for positive listening. In a simple environment, we showed that these losses have the intended effect of easing the learning problem for the other agent, and so increase the consistency with which agents learn useful communication protocols. In a temporally extended environment, we again showed that these losses increase the consistency with which agents learn to communicate.

Several questions remain open for future research. Firstly, we investigate only fully cooperative environments; can this approach help in environments which are neither fully cooperative nor fully competitive? In such settings, both positive signalling and positive listening can be harmful to an agent, as it becomes more easily exploited via the communication channel. However, since the losses we use mainly serve to ensure that the communication channel starts to be used at all, this may not be as large a problem as it initially seems. Secondly, the environments investigated here pose difficult communication problems, but are otherwise simple; can these methods be extended to improve the performance of decentralized agents in large-scale multi-agent domains? There are a few dimensions along which these experiments could be scaled: to more complex observation and action spaces, but also to environments with more than two players, and to larger communication channels.

References

Appendix A Environment details

To generate a map for the treasure hunt environment, we:

  1. Create a rectangle of grey pixels of fixed height and width.

  2. Draw a black tunnel on the second row up, including all but the leftmost and rightmost pixels.

  3. Draw a black tunnel on the second row down, including all but the leftmost and rightmost pixels.

  4. Pick starting positions on the top horizontal tunnel for the vertical tunnels. These are randomly selected among sets of positions which are pairwise far enough apart that no tunnel is visible from another. Draw black tunnels downward from these positions, stopping just above the bottom tunnel.

  5. Place the yellow treasure at the bottom of a random tunnel.

  6. Place one agent uniformly at random in the top tunnel, and one uniformly at random in the bottom tunnel.

The episode length is fixed. The agents’ actions correspond to movement in the cardinal directions plus a no-op action. They can move in the black tunnels and onto the treasure, but not onto the grey walls. Each agent observes a square window centered on itself. When an agent moves onto the treasure, both agents receive a reward, and the treasure respawns at the bottom of a random tunnel.
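The steps above can be sketched as follows; the grid dimensions, tunnel count, and spacing are illustrative placeholders, since the actual constants are not reproduced here:

```python
import numpy as np

WALL, TUNNEL, TREASURE = 0, 1, 2

def generate_map(height=11, width=24, n_tunnels=3, min_gap=5, rng=None):
    """Build a Treasure Hunt map following the steps in Appendix A.

    Illustrative sizes only: a wall rectangle, a top and a bottom
    horizontal tunnel, n_tunnels vertical tunnels spaced at least
    min_gap apart, and treasure at the foot of a random vertical tunnel.
    Agent placement (step 6) is left out for brevity.
    """
    rng = rng or np.random.default_rng()
    grid = np.full((height, width), WALL)
    grid[1, 1:width - 1] = TUNNEL            # top horizontal tunnel
    grid[height - 2, 1:width - 1] = TUNNEL   # bottom horizontal tunnel
    # Vertical tunnel columns, pairwise at least min_gap apart.
    while True:
        cols = np.sort(rng.integers(1, width - 1, size=n_tunnels))
        if n_tunnels == 1 or np.all(np.diff(cols) >= min_gap):
            break
    for c in cols:
        grid[1:height - 3, c] = TUNNEL       # stop short of the bottom tunnel
    treasure_col = rng.choice(cols)
    grid[height - 4, treasure_col] = TREASURE  # bottom of that vertical tunnel
    return grid, cols

grid, cols = generate_map(rng=np.random.default_rng(0))
```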

The RGB values of the colors of the pixels in the observations are:

  • Blue self: .

  • Red partner agent: .

  • Grey walls: .

  • Black tunnels: .

  • Yellow treasure: .

Appendix B Treasure hunt communication protocols

In this section, we give more details on the communication protocols in Treasure Hunt. First, we give full details of the correlations between messages and treasure location in the two runs discussed in the main text. These are the median and the best runs in terms of final reward, in the setting where we use both positive signalling and positive listening biases. We generated episodes using the final policies for the agents, and recorded the treasure locations and symbols on each timestep. Each cell shows the probability of each treasure location, given that the speaker transmits a particular symbol.

In Table 4, we see that one symbol is particularly meaningful, fairly reliably indicating the presence of treasure in the final tunnel. In Table 5, we see that three symbols appear to be used; some correlate with right-hand tunnels, and others with left-hand ones.

S
0
1
2
3
4
Table 4: Message tunnel correlations for median Treasure Hunt run.
S
0
1
2
3
4
Table 5: Message tunnel correlations for best Treasure Hunt run.

We also show the correlations between messages and listener actions, in Tables 6 and 7. In both cases, these are as expected from the correlations between tunnels and messages; we see the listener move more in the direction which the messages correlate with.

Table 6: Message action correlations for median Treasure Hunt run (rows: symbols S = 0-4).
Table 7: Message action correlations for best Treasure Hunt run (rows: symbols S = 0-4).

Appendix C Network details and hyperparameters

C.1 MNIST sums experiments

In the MNIST sums environment, the agent architecture was taken from an existing MNIST classifier; we did not optimize it, as the goal was to investigate the effect of communication biases rather than to achieve optimal performance. This architecture is:

We used the Adam optimizer [16], with a learning rate of and parameters , , . These layers were implemented using the Sonnet library [32]. The agents were trained using REINFORCE; the total loss for each agent consists of the REINFORCE loss, entropy regularization, and the biases for positive signalling () and positive listening ( and ).
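The combination of loss terms described above can be sketched as follows; the argument names, loss symbols, and all numeric values here are illustrative placeholders rather than the paper's hyperparameters.

```python
def total_loss(reinforce_loss, action_entropy, message_entropy,
               ps_loss, pl_loss, ce_loss,
               w_ent_a, w_ent_m, w_ps, w_pl, w_ce):
    """Illustrative weighted sum of the loss terms described in the text.
    Entropy terms are subtracted (bonuses); bias losses are added."""
    return (reinforce_loss
            - w_ent_a * action_entropy    # entropy regularization (actions)
            - w_ent_m * message_entropy   # entropy regularization (messages)
            + w_ps * ps_loss              # positive-signalling bias
            + w_pl * pl_loss              # positive-listening bias
            + w_ce * ce_loss)             # counterfactual-policy fitting loss

# Dummy term values and weights, chosen only to exercise the function.
loss = total_loss(1.0, 0.5, 0.5, 0.2, 0.1, 0.3, 0.01, 0.01, 1.0, 1.0, 1.0)
```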

For the listener agent, the message is concatenated to the flattened output of the convolutional net before the hidden linear layer.

The final hyperparameters used for the settings were:

Hyperparameter No bias PS PL Both SI
Batch size
Action policy entropy bonus
Message policy entropy bonus
Target message entropy N/A N/A N/A
Weight of N/A N/A N/A
Weight of N/A N/A N/A
Weight of N/A N/A
for N/A N/A N/A
Table 8: Hyperparameters for final MNIST experiments.

To select the hyperparameters for the no bias setting, we performed a joint sweep over action and message entropy bonuses, considering a range of values from to for each. No values were found which improved over the no-communication policy; the final values reported here are those which worked best in the other settings.

In the positive listening setting, we performed sweeps:

  • Over the weight of , using values in .

  • Over the entropy costs for messages and actions, using values .

  • Over the weight of , using values of ; aside from , which unsurprisingly produced worse results, there was no significant difference in the results of these runs.

In the positive signalling setting, we performed sweeps:

  • Jointly over the weight of and , using values in for , and for the product .

In the setting with both biases, we ran no additional sweeps, simply combining the hyperparameters from the best runs with positive signalling and positive listening.

In the social influence setting, we performed sweeps:

  • Over the weight of the social influence reward, using values .

  • Over the entropy costs for messages and actions, using values .

For all hyperparameter sweeps, we ran runs, and picked the setting with the highest average final reward. For the final sets of hyperparameters, we then ran runs.
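A sweep of this kind amounts to a grid search followed by selecting the setting with the highest average final reward; the value grids below are placeholders, not the swept values from the paper.

```python
from itertools import product

# Placeholder value grids standing in for the swept hyperparameter ranges.
action_entropy = [0.001, 0.003, 0.01]
message_entropy = [0.001, 0.003, 0.01]

def best_setting(results):
    """results maps a hyperparameter setting to its mean final reward over
    runs; pick the setting with the highest average final reward."""
    return max(results, key=results.get)

grid = list(product(action_entropy, message_entropy))
# Dummy rewards standing in for the mean final reward of each setting.
results = {setting: i for i, setting in enumerate(grid)}
```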

C.2 Treasure Hunt experiments

In our experiments, we use parallel environment copies for the Asynchronous Advantage Actor-Critic algorithm [29] with the V-trace correction [7]. The total loss for each agent consists of the A3C loss, including entropy regularization, and the biases for positive signalling () and positive listening ( and ). The two agents have the same architecture, which consists of:

  • A single convolutional layer, using channels, kernel size of and stride of .

  • A multi-layer perceptron with hidden layers of size .

  • An LSTM, with hidden size .

  • Linear layers mapping to policy logits for the action and message policies, and to the baseline value function.

The message from the other agent is concatenated to the flattened output of the convolutional net before the hidden linear layer.

We used the RMSProp optimizer [13] for gradient descent, with an initial learning rate of , exponentially annealed by a factor of every million steps. The other parameters are , and .
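The annealing schedule can be written as a simple exponential decay; the initial rate, decay factor, and interval below are hypothetical stand-ins for the elided values.

```python
def annealed_lr(step, initial_lr=1e-4, decay=0.5, interval=1_000_000):
    """Exponential decay: multiply the rate by `decay` every `interval` steps.
    The constants here are illustrative, not the paper's values."""
    return initial_lr * decay ** (step / interval)

assert annealed_lr(0) == 1e-4  # no decay at the start of training
```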

The final hyperparameters used for the settings were:

Parameter No comms No bias PS PL Both SI
Batch size
Action entropy regularization
Message entropy regularization N/A
Target message entropy N/A N/A N/A N/A
Weight of N/A N/A N/A N/A
Weight of N/A N/A N/A
Weight of N/A N/A N/A N/A
for N/A N/A N/A N/A
Weight of N/A N/A N/A N/A N/A
Table 9: Hyperparameters for final Treasure Hunt experiments.

In the no communication setting, we performed sweeps:

  • Over the entropy costs for actions, using values .

  • Over the sizes of the MLP layers, using values .

  • Over the sizes of the LSTM hidden size, using values .

We then fixed these parameters for the other settings.

In the positive listening setting, we performed sweeps:

  • Over the weight of , using values in .

  • Over the weight of , using values of .

In the positive signalling setting, we performed sweeps:

  • Jointly over the weight of and , using values in for , and for the product .

  • Over the target entropy , using values in .

In the setting with both biases, we ran no additional sweeps, simply combining the hyperparameters from the best runs with positive signalling and positive listening.

In the social influence setting, we performed a sweep over the weighting of the reward for social influence to the speaker, using values in .

For all hyperparameter sweeps, we ran runs, and picked the setting which exceeded the no-communication baseline most frequently, terminating runs early if the result was clear. For the final sets of hyperparameters, we then ran runs.

Appendix D Multi-step CIC ablation

In the positive listening bias for the listener, we use the multiple timestep CIC. Recall that for a listener trajectory $x_t = (o_{\le t}, m_{\le t}, a_{<t})$, we define $x_t^{-m} = (o_{\le t}, a_{<t})$ (this is the trajectory $x_t$ with the messages removed). We define the multiple timestep CIC as:

$$\mathrm{CIC}(x_t) = \big\lVert \pi(\cdot \mid x_t) - \bar\pi(\cdot \mid x_t^{-m}) \big\rVert_1 \tag{9}$$

where $\bar\pi$ is a learned counterfactual policy, trained to match $\pi(\cdot \mid x_t)$ without access to the messages.

We could also choose to use the single timestep CIC, defined in [15, 25] as the mutual information between the speaker's message and the listener's actions. Defining $x_t^{-m_t} = (o_{\le t}, m_{<t}, a_{<t})$ – which is the trajectory with only the final message removed – this would be:

$$\mathrm{CIC}_1(x_t) = \mathcal{I}(m_t; a_t \mid x_t^{-m_t}) \tag{10}$$

As with the multiple timestep CIC, we estimate the single timestep CIC by learning the distribution $\bar\pi(a_t \mid x_t^{-m_t})$. We do this by performing a rollout of the agent's policy network, with the actual observations and actions in the trajectory, including all messages except the final one, which is zeroed out. We fit the resulting function to predict $\pi(a_t \mid x_t)$, using a cross-entropy loss between these distributions.

The single-timestep CIC can then be estimated by:

$$\mathrm{CIC}_1(x_t) \approx \big\lVert \pi(\cdot \mid x_t) - \bar\pi(\cdot \mid x_t^{-m_t}) \big\rVert_1 \tag{11}$$

As with the multi-timestep CIC, we used the cross-entropy loss for training $\bar\pi$:

$$\mathcal{L}_{ce} = -\sum_a \pi(a \mid x_t) \log \bar\pi(a \mid x_t^{-m_t}) \tag{12}$$

Using the single timestep CIC, with or without positive signalling, we did not find improvements over a baseline with no speaker-side bias – see Figure 5.
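The counterfactual comparison underlying the CIC can be sketched as follows; measuring the discrepancy between the two action distributions with an L1 distance is one illustrative choice, and the function names are not from the paper.

```python
def l1_cic(pi_with_msgs, pi_without_msgs):
    """Per-timestep causal influence of communication: the L1 distance between
    the listener's action distribution conditioned on the messages and a
    counterfactual distribution with messages removed."""
    assert len(pi_with_msgs) == len(pi_without_msgs)
    return sum(abs(p - q) for p, q in zip(pi_with_msgs, pi_without_msgs))

def mean_cic(trajectory):
    """Average the per-timestep CIC over a trajectory of (pi, pi_bar) pairs."""
    return sum(l1_cic(p, q) for p, q in trajectory) / len(trajectory)

# Identical distributions give zero influence; disjoint ones give the maximum.
assert l1_cic([0.5, 0.5], [0.5, 0.5]) == 0.0
assert l1_cic([1.0, 0.0], [0.0, 1.0]) == 2.0
```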

Figure 5: One step CIC does not lead to better results than no speaker-side bias.

Appendix E Statistical methodology

All confidence intervals shown are confidence intervals. For confidence intervals of Bernoulli variables – the proportion of runs with reward above a certain threshold – we use the Wilson approximation. For graphs depicting average performance over multiple runs, we first take the mean reward per run in time windows over training. The interval shown is the confidence interval for this mean.
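The Wilson score interval for a Bernoulli proportion can be computed as below; the choice of z = 1.96 (a 95% level) is an assumption, since the exact confidence level is not shown here.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 ~ 95% level)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - half, center + half)

# e.g. 8 of 10 runs above the threshold
lo, hi = wilson_interval(8, 10)
```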