Incentivizing the Emergence of Grounded Discrete Communication Between General Agents

by Thomas A. Unger, et al.

We converted the recently developed BabyAI grid world platform to a sender/receiver setup in order to test the hypothesis that established deep reinforcement learning techniques are sufficient to incentivize the emergence of a grounded discrete communication protocol between general agents. This is in contrast to previous experiments that employed straight-through estimation or tailored inductive biases. Our results show that these can indeed be avoided, by instead providing proper environmental incentives. Moreover, they show that a longer interval between communications incentivized more abstract semantics. In some cases, the communicating agents adapted to new environments more quickly than monolithic agents, showcasing the potential of emergent discrete communication for transfer learning.



1 Introduction

The traditional approach to language modeling suffers from a fundamental flaw. While models such as GPT-2 (Radford et al., 2019) show that capturing the statistics of natural language corpora can lead to impressive results, they are ultimately divorced from the reality they are supposed to describe; they capture the statistics of descriptions of things, rather than the statistics of the things themselves. This approach limits the grounding of semantics.

Some recent works have attempted to improve upon this by instead employing communication games, in which agents are incentivized to use communication as a tool in order to optimize for some objective. This frames communication as a means toward an end, instead of a thing to merely mimic. The corresponding learning process, sometimes called language emergence, can then be framed as a negotiation process through which agents establish a common semantics.

Besides the fact that such a treatment of communication can be conducive to cooperation, there are a number of other benefits to this approach. Since the agents must transmit relevant information to complete a given task, it naturally coaxes agents into explaining themselves, so that we can listen in and perhaps understand what they are thinking. Another benefit is its intrinsic interactivity, meaning that we can position ourselves as one of the agents, to give commands, to ask for clarification, or vice versa.

Lastly, framing communication as a tool is in parallel with the process by which natural language emerged and is maintained. Any natural language which is actively in use is a living language; it is ever-changing and in effect ever-emerging. For example, many humans adopt and invent jargon in their professions; it is arguably desirable that artificial agents that have been primed to use natural language will nonetheless similarly and independently develop artificial jargon to aid in the pursuit of their goals.

1.1 Related Work

Typical methods to train communicating agents are those under the umbrella of multi-agent reinforcement learning (MARL). With the advent of deep learning, these further tend to fall under multi-agent deep reinforcement learning (MDRL). DBLP:journals/corr/abs-1810-05587 gave a brief survey of recent works using MDRL. They subdivided these into four categories, namely the analysis of emergent behaviors, learning communication, learning cooperation and agents modeling agents. They further identified non-stationarity, the increase of dimensionality and the credit-assignment problem as specific difficulties in the use of MDRL methods.

One difficulty specific to the use of discrete symbols, which has made it unpopular in the past, is that their non-continuous nature makes them non-differentiable. This means that communication cannot be optimized in a straightforward way using backpropagation. In order to deal with non-differentiability, DBLP:journals/corr/HavrylovT17 employed a straight-through Gumbel-softmax estimator (Bengio et al., 2013; Maddison et al., 2016; Jang et al., 2016) in order to study the emergence of discrete communication in simple referential games. mordatch2018emergence similarly proposed the use of a straight-through Gumbel-softmax estimator in a more complex environment. However, the use of straight-through estimation means that the agents have access to each other’s internal states, from which they can extract counterfactual information, i.e., “how would the other agent have reacted if I had said something slightly different?”. This does not generalize to training communication with opaque agents, whose internal states are inaccessible, e.g., humans or hard-coded agents.

foerster2016learning and DBLP:journals/corr/abs-1810-11187 employed centralized learning, in which agents share parameters that are optimized in a centralized manner. This essentially improves coordination by letting agents talk to copies of themselves, which does not generalize to genuinely separate agents. The latter work also employed continuous communication, which is unlike the discrete nature of natural language.

DBLP:journals/corr/abs-1810-08647 criticized the use of centralized learning by foerster2016learning as being “unrealistic” and instead resorted to an explicit model of other agents (MOA) in order to incentivize the emergence of discrete communication between separate agents in grid worlds. This MOA makes explicit predictions of actions of other agents, which incentivizes the maximization of the mutual information between the messages and the resulting actions. However, hard coding such a mechanism into the agent model makes it inflexible, which renders communicating agents more transparent in the sense that they can make assumptions about each other’s inner workings. While this may hasten learning in some cases, it may impede or even prevent learning communication with agents in general, for which these assumptions do not hold.

choi2018compositional studied the emergence of discrete communication in simple referential games by employing an inductive bias in the agent model in the form of an obverter method. This method optimizes the sender’s own understanding of messages as a proxy for optimizing the receiver’s understanding, relying on the assumption that these correlate. Again, such an assumption may be unwarranted for agents in general. DBLP:journals/corr/abs-1804-03984 did very similar work without employing this mechanism, which however still remained confined to simple referential games.

2 Method

In attempting to improve on the work discussed above, we hypothesized that established deep reinforcement learning methods are sufficient to incentivize the emergence of a grounded discrete communication protocol, without relying on unrealistic straight-through estimation methods or tailored inductive biases and instead providing proper environmental incentives. Moreover, in contrast to the simple referential games typically used previously, we tested this hypothesis on the more complex BabyAI platform.

2.1 BabyAI

BabyAI (Chevalier-Boisvert et al., 2018) is a research platform originally intended to support investigations towards including humans in the loop for grounded language learning. A task in a BabyAI environment consists of a single agent navigating a rectangular grid world to achieve a given objective. Besides being occupied by the agent, a cell may be occupied by an object, which can either serve a purpose toward the objective, or exist purely as a distraction from this objective. The objective of the agent is provided in the form of an instruction in a synthetic language which is a proper subset of English. Following such an instruction may involve learning multiple tasks such as moving, changing orientation and picking up objects. By default, the agent has a partial, egocentric view of the environment, as opposed to a global view.

The entirety of the BabyAI platform comprises an extensible suite of nineteen environments of varying difficulty. This framework promotes curriculum learning, in which agents are trained to perform simpler tasks before being trained to perform more complex tasks. Increasingly complex tasks are composed of increasing numbers of subtasks (e.g., picking up a key needed to open the corresponding door).

In their paper, the authors included various baselines using both imitation learning (IL) and RL training procedures. They found that agents benefited only modestly from pretraining on relevant subtasks.

2.2 Sender/Receiver Setup

The BabyAI platform does not natively support a multi-agent setup. Therefore, in order to test our hypothesis, we converted it to a sender/receiver setup. A sender/receiver setup is one in which one agent can transmit information to another agent to improve its performance on a given task. This is the typical setup used in simple referential games, but adopting the BabyAI platform complicates the task by introducing dependencies across successive actions.

There are five incentives that characterize our basic setup, subdivided into two primary and three secondary incentives. Figure 1 summarizes these incentives.

Figure 1: a cartoon depicting our sender/receiver setup in BabyAI. The triangles indicate the two agents; the receiver is inside the environment with its limited, egocentric view indicated by the highlighted cells, while the sender is outside the environment with a global view. The sender must use the available messages and their constituent symbols to transmit any information to the receiver, prompting it to give feedback indicating the utility of the messages. Only the receiver knows the explicit instruction, as indicated by its thought bubble.

2.2.1 Primary Incentives

Each of the two primary incentives provides one of the agents with the basic incentive to actually use the communication channel.

Receiver’s Incentive

There is a differential between the observations of the sender and the receiver. The receiver takes on the role of the single agent in the original BabyAI setup, but in contrast to the original agent has its observations limited to only two cells, namely the cell it is occupying and the cell it is facing. The sender on the other hand views the environment from an “Archimedean point”, i.e., it has a global view of the environment and can affect the environment only indirectly, namely through any messages it sends to the receiver. This informational differential provides the receiver with an incentive to listen to the sender.

Sender’s Incentive

While the receiver obtains its reward signal directly from the environment, the receiver in turn provides the reward signal to the sender. Section 2.4.2 provides further details. This reward provides the sender with an incentive to actually transmit information relevant to the receiver.

2.2.2 Secondary Incentives

While the primary incentives provide the agents with the basic incentives to use the communication channel, the secondary incentives incentivize more precisely how this communication channel is to be used.

Inter-Message Economic Incentive

The sender sends a single message every $n$ time steps, where $n$ is dubbed the communication interval. For example, if $n = 1$, then the sender could get away with dictating one action per message. However, as this value increases, so does the incentive to compile information about multiple time steps into each message, i.e., to use messages economically.

Intra-Message Economic Incentive

Each message consists of a series of discrete symbols. These messages have a fixed length of $l$ symbols and a fixed symbol vocabulary $V$. We indicate both the length of the message and the size of the vocabulary using the notation $|M| = |V|^l$, where $|M|$ denotes the size of the set $M$, which is the set of possible messages. Lower values of $l$ and $|V|$ increase the incentive to compress information within each message, i.e., to use symbols economically.
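To make the intra-message incentive concrete: the number of distinct messages equals the vocabulary size raised to the message length. The following is a minimal sketch. The function names are hypothetical, and the example setting of eight symbols per message over an eight-symbol vocabulary is an assumption suggested by the messages listed in Table 1, not a setting the text confirms.

```python
import math
from itertools import product


def message_space_size(vocab_size: int, length: int) -> int:
    """Number of distinct fixed-length messages over a fixed vocabulary."""
    return vocab_size ** length


def bits_per_message(vocab_size: int, length: int) -> float:
    """Upper bound on the information (in bits) that one message can carry."""
    return length * math.log2(vocab_size)


def enumerate_messages(vocab: str, length: int) -> list[str]:
    """List every possible message; only feasible for small settings."""
    return ["".join(symbols) for symbols in product(vocab, repeat=length)]


# Assumed example setting: 8 symbols in the vocabulary, 8 symbols per message.
print(message_space_size(8, 8))  # 16777216
print(bits_per_message(8, 8))    # 24.0

# A much tighter channel: 2 symbols in the vocabulary, 3 symbols per message.
print(enumerate_messages("ab", 3))
```

Shrinking either the vocabulary or the message length shrinks this space exponentially, which is what pushes the sender to use symbols economically.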

Encyclopedic Incentive

The instruction is shown only to the receiver. When there is only one object, the goal is implicit in the sender’s observations. However, when it is ambiguous which object is the goal, this information differential provides the sender with an incentive to offer a description of the environment that is more comprehensive than simply the location of the goal, or a list of actions necessary to complete the task.

2.3 Agent Model

The authors of BabyAI implemented two Advantage Actor–Critic (A2C) agent models to produce baseline RL performances. They referred to these two models as the Large model and the Small model. The latter formed the basis for our own implementation. The Small baseline model uses a unidirectional GRU (Cho et al., 2014) to encode the instruction and a convolutional neural network (CNN) with two batch-normalized (Ioffe and Szegedy, 2015) FiLM (Perez et al., 2018) layers to process both the observation and the encoded instruction. This serves as input for an LSTM (Hochreiter and Schmidhuber, 1997), which integrates it with inputs from previous time steps.

To facilitate communication, we enhanced the agent model with a decoder module and an encoder module. The decoder module consists of an LSTM module whose input is the hidden state of the agent’s memory and whose output is a series of vectors, each of which is fed through a linear layer that produces the logits of a probability distribution over the possible symbols. These logits are fed through a softmax layer to produce the actual probability distribution, similarly to how the existing policy module produces a probability distribution over actions. From the resulting sequence of probability distributions, one-hot vectors are sampled, constituting a sequence of discrete symbols.

The encoder module likewise consists of an LSTM, whose input is such a sequence of one-hot vectors and whose output is its final hidden state, which is concatenated with the instruction embedding.
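The last step of the decoder, turning per-position logits into sampled one-hot symbols, can be sketched in plain Python as follows. This is an illustration only: the LSTM and linear layers that would produce the logits are omitted, the logits shown are made up, and all function names are hypothetical rather than taken from the authors' implementation.

```python
import math
import random


def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_one_hot(probs, rng):
    """Sample a symbol index from probs and return it as a one-hot vector."""
    index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return [1 if i == index else 0 for i in range(len(probs))]


def sample_message(logit_sequence, rng):
    """Sample one one-hot symbol per position of the message."""
    return [sample_one_hot(softmax(logits), rng) for logits in logit_sequence]


# Hypothetical logits for a 3-symbol message over a 4-symbol vocabulary.
rng = random.Random(0)
logit_sequence = [
    [2.0, 0.1, 0.1, 0.1],
    [0.0, 0.0, 3.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
]
message = sample_message(logit_sequence, rng)
print(message)
```

Because the symbols are sampled rather than relaxed, no gradient flows from the receiver back into the sender through the message, which is consistent with the paper's aim of avoiding straight-through estimation.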

2.4 Training Procedure

In order to train the agents, we adapted the implementation of the proximal policy optimization (PPO) algorithm (Schulman et al., 2017) that was used by the BabyAI authors to produce their RL baselines. This is a highly parallelized implementation written in Python using the machine learning library PyTorch. It can also utilize any graphics processing unit (GPU) supporting CUDA, which we made extensive use of to maximize performance.

The implementation of PPO has a plethora of hyperparameters, such as learning rate, epochs per batch and batch size. For most hyperparameters, changing their default value typically did not increase performance and in fact usually decreased performance. For this reason, we assumed the default values to have been sufficiently tuned for our purposes and thus we refrained from deviating significantly from these values.

2.4.1 Frame Cadence

At each time step in an environment, the receiver is assigned a frame, during which it makes an observation, updates its memory and then performs one action. Our implementation of the sender/receiver setup inserts an extra frame, assigned to the sender, before the receiver’s frame at every time step at which the sender emits a message. During these frames, the sender makes an observation, updates its memory and then emits a message. This message is placed into a buffer, which is read by the receiver at every time step.
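The interleaving of frames described above can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation. The names `run_episode`, `sender_act` and `receiver_act` are hypothetical, and the emission rule `t % interval == 0` is an assumption chosen so that the first message is emitted at the first time step.

```python
def run_episode(num_steps, interval, sender_act, receiver_act):
    """Interleave sender and receiver frames for one episode.

    The sender gets an extra frame before the receiver's frame at every
    time step where a message is due (assumed rule: t % interval == 0).
    """
    buffer = None  # holds the most recently emitted message
    frames = []    # log of (time step, role) pairs, in execution order
    for t in range(num_steps):
        if t % interval == 0:
            buffer = sender_act(t)  # sender frame: observe, then emit a message
            frames.append((t, "sender"))
        receiver_act(t, buffer)     # receiver frame: read the buffer, then act
        frames.append((t, "receiver"))
    return frames


# Hypothetical stand-ins for the two agents.
seen = []
frames = run_episode(
    num_steps=4,
    interval=2,
    sender_act=lambda t: f"msg@{t}",
    receiver_act=lambda t, message: seen.append(message),
)
print(frames)
print(seen)
```

In this sketch the receiver reads the buffer at every step, so between emissions it keeps seeing the same message.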

2.4.2 Sender’s Fee

A2C methods use the policy gradient to optimize the agent’s “actor” $\pi$, such that actions which lead to a state with a high value are reinforced. These state values in turn are estimated by the agent’s “critic” $V$. By exploiting the Bellman equation, $V$ can be optimized using temporal-difference (TD) learning, i.e., by bootstrapping the state-value estimate $V(s_t)$ with the state-value estimate $V(s_{t+1})$, as follows.

$$V(s_t) \leftarrow r_t + \gamma V(s_{t+1}) \tag{1}$$

While the reward for the receiver is straightforward (the reward as given by the environment), the same is not true for the reward of the sender. We essentially bootstrap the sender’s state-value estimate $V_S(s^S_t)$ with the receiver’s state-value estimate $V_R(s^R_t, m_t)$, where $m_t$ is the message that the sender sent to the receiver at time step $t$. Using equation 1, we can write $V_S$ and $V_R$ for the sender and the receiver, respectively, as follows.

$$\begin{aligned} V_S(s^S_t) &\leftarrow V_R(s^R_t, m_t) \\ V_R(s^R_t, m_t) &\leftarrow r_t + \gamma V_R(s^R_{t+1}, m_{t+1}) \end{aligned} \tag{2}$$
As visible in equation 2, the sender does not bootstrap using its own value estimates at all, but rather treats every message emission as a separate task. That is, from the sender’s perspective, every “episode” lasts a single time step and the sender experiences a number of such “episodes” equal to the number of messages emitted during what is a single episode from the receiver’s perspective. (The sender is nonetheless able to remember previous time steps from the same episode.)

Because the receiver is heavily dependent on the messages for the accuracy of its state-value estimates, this mechanism gives the sender very direct feedback about the usefulness of its messages. Under this condition, the sender mainly models the receiver and only models the environment insofar as it affects the receiver’s state-value estimate. Hence the sender can be said to have an implicit model of the other agent.
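The difference between the two learning targets can be illustrated with a small sketch. It assumes one-step TD targets for the receiver and, for the sender, a target equal to the receiver's state-value estimate at the step where a message was emitted. The function names and numbers are hypothetical, and the indexing convention is an assumption rather than a detail confirmed by the text.

```python
def receiver_td_targets(rewards, receiver_values, gamma):
    """One-step TD targets for the receiver: r_t + gamma * V_R(next state).

    receiver_values[t] is the receiver's estimate at step t, given the
    message in its buffer; the value after the final step is taken as 0.
    """
    targets = []
    for t, reward in enumerate(rewards):
        has_next = t + 1 < len(receiver_values)
        next_value = receiver_values[t + 1] if has_next else 0.0
        targets.append(reward + gamma * next_value)
    return targets


def sender_targets(receiver_values, message_steps):
    """Targets for the sender: the receiver's value estimate at each step
    where a message was emitted. The sender's own critic is never used for
    bootstrapping, so each emission behaves like a one-step episode."""
    return [receiver_values[t] for t in message_steps]


# Hypothetical numbers for a 4-step episode with a message every 2 steps.
rewards = [0.0, 0.0, 0.0, 1.0]
receiver_values = [0.5, 0.6, 0.8, 0.9]
print(receiver_td_targets(rewards, receiver_values, gamma=0.5))
print(sender_targets(receiver_values, message_steps=[0, 2]))
```

The sender's targets depend only on the receiver's critic, which is why an inaccurate receiver critic translates directly into a noisy reward signal for the sender.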

2.5 Experiments

Agents were trained with the GoToObj environment from the original BabyAI implementation, as well as with custom environments that were derived from it. Figure 2 shows instances of these environments.

The environments contain objects that can have one of three types and one of six colors. Unless otherwise noted, object types and colors are sampled randomly.

In each of the environments, the task is for the receiver to move toward an object having the type and color specified in the instruction. The type and color specified correspond to a randomly selected object in the environment. An episode ends once either the receiver faces an object with the specified characteristics or 64 time steps have passed.

We varied the communication interval $n$ logarithmically according to the powers of 2, starting at $n = 1$ and ending at $n = 64$. Since episodes have a maximum length of 64 time steps, $n = 64$ equates to one message per episode. The first message in an episode is always emitted at the first time step of the episode.
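As a rough illustration of how the communication interval trades off against the number of messages, the sketch below lists the emission time steps for each power of 2 up to the 64-step episode limit. The helper name `emission_steps` is hypothetical, and starting the range at an interval of 1 is an assumption.

```python
MAX_EPISODE_LENGTH = 64  # time limit stated in the text


def emission_steps(interval, episode_length=MAX_EPISODE_LENGTH):
    """Time steps at which a message is emitted, starting at step 0."""
    return list(range(0, episode_length, interval))


# Powers of 2 up to the episode limit (assumed to start at 1).
intervals = [2 ** k for k in range(7)]
for interval in intervals:
    # Maximum number of messages in an episode that runs to the time limit.
    print(interval, len(emission_steps(interval)))
```

Episodes that end early contain fewer messages than this maximum; with an interval of 64 the sender speaks exactly once, at the first time step.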

We performed pretraining experiments in order to gauge the agents’ transfer learning ability. In these experiments, the agents either had to adapt to additional obstacles, or to increased goal ambiguity.


GoToObj

This is an environment from the original BabyAI platform. It is an $8 \times 8$ grid, where all cells along the perimeter are impassable walls, effectively reducing it to a $6 \times 6$ grid. The agent and the object are spawned in different random locations on this grid.


GoToObjDoor

This environment was custom-designed to complicate the task as seen in the GoToObj environment. Specifically, the agent effectively has to move to two locations in succession instead of one in order to reach the goal. Moreover, in addition to turning and moving forward, the agent must learn to toggle the door to open it. The door connects two rooms in a larger grid. The door is positioned randomly in one of the three cells connecting these rooms. The agent is always spawned in the upper-left room, while the object is always spawned in the upper-right room and is never of the key type.

Figure 3: log-log plots of learning curves of agents in various environments.

GoToObjLocked

This is another custom-designed environment that further complicates the task. Not only must the agent learn to turn, move forward and toggle a door, but it must also learn to pick up a key to unlock the door.

There are two variants of this environment. In the unambiguous variant there is—in addition to the key object—one unambiguous goal object. In the ambiguous variant there is an additional, distracting object. Due to the encyclopedic incentive, the latter variant leaves it ambiguous to the sender which object is the goal. Wherever the variant is not specified, it can be assumed that we are referring to the unambiguous variant.

To ensure that the task can actually be completed, both the agent and the key to the door are spawned in the upper-left room. The goal object and any distracting object are spawned in the upper-right room and are never of the key type.

3 Results

The original BabyAI paper used task success as the principal metric to evaluate performance. However, this is not an appropriate metric to assess the benefit of communication in this case; even without communication, the task success rate can reach 100 %. Rather than in task success, the benefit of communication is primarily in the reduction of the time required for task success. Therefore, we instead measured performance using the number of time steps per episode, i.e., episode length. More precisely, we measured the average over the lengths of all episodes completed in 75 successive batches.

Using this metric, we expected results to be bounded between two extremes. Firstly, disallowing any communication should result in an approximate upper bound on the number of time steps per episode required to complete the task, as the receiver is dependent entirely on its own, very limited observations. Secondly, we might find an approximate lower bound when the receiver has access to all information that the sender has. To obtain a reasonable approximation of the lower bound, we introduced the so-called “Archimedean receiver”, which does not read messages, but is given the sender’s observations as a substitute for hypothetical messages that contain all relevant information. (The moniker “Archimedean” was derived from the term “Archimedean point” and is meant to be understood as an antonym of “myopic”.)
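The episode-length metric can be sketched as a pooled average over a window of batches. This is a minimal illustration: the class name is hypothetical, and treating the 75 batches as a sliding window (rather than disjoint blocks) is an assumption.

```python
from collections import deque


class EpisodeLengthTracker:
    """Mean length of all episodes completed in the most recent batches."""

    def __init__(self, window_batches=75):
        # Each entry holds the lengths of the episodes completed in one batch.
        self.batches = deque(maxlen=window_batches)

    def add_batch(self, completed_episode_lengths):
        self.batches.append(list(completed_episode_lengths))

    def mean_length(self):
        lengths = [n for batch in self.batches for n in batch]
        return sum(lengths) / len(lengths) if lengths else float("nan")


tracker = EpisodeLengthTracker(window_batches=2)  # small window for the demo
tracker.add_batch([64, 40])
tracker.add_batch([20])
tracker.add_batch([10, 30])   # the first batch now falls out of the window
print(tracker.mean_length())  # 20.0
```

Lower values are better under this metric, since they mean the receiver reaches the goal in fewer time steps.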

Figure 4: log-log plots of learning curves of various pretrained agents.

With regards to the communication interval, we skipped the values and from the powers of 2; while we did perform some experiments with these values, we found these curves to usually fall predictably between those of and , which already tended to be very similar to each other.

Figure 3 shows the learning curves for agents trained with varying communication intervals in all four environments, while figure 4 shows the learning curves for the pretrained agents.

4 Discussion

Figure 3 shows a clear separation between the upper and lower bounds in all environments, with the learning curves for agents trained with shorter communication intervals hugging the lower bound and the curves approaching the upper bound for agents trained with longer intervals. This indicates that while communication was of substantial benefit for shorter communication intervals, this benefit decreased with longer intervals. Because the sender relies on the receiver’s feedback in the form of its state-value estimates, we surmise that this decrease is at least in part due to the difficulty for the receiver in learning to make these estimates accurately over such long intervals.

Figure 2: instances of the GoToObj, GoToObjDoor and GoToObjLocked environments, from left to right. The triangle indicates the receiver’s location and orientation, while the highlighted cells indicate its partial, egocentric view. In the custom environments, the two connecting rooms are always the two upper rooms. The two bottom rooms are never used and merely serve as padding so that the observations are square, which our implementation of the agent model incidentally requires.

Figure 4 shows that not only were agents pretrained with communication more sample efficient than agents pretrained without communication, but they were more sample efficient even than the Archimedean receiver. This indicates that emergent communication can be leveraged to improve over established transfer learning techniques.

4.1 Semantic Analysis

We analyzed the semantics of the communication protocols that had emerged between agent pairs trained in the GoToObj environment at the end of their respective learning curves in figure 3. For the purpose of this analysis the messages were not sampled stochastically, but instead determined using the $\arg\max$ function.
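Deterministic decoding of this kind, as opposed to sampling, can be sketched as picking the highest-scoring symbol at each position of the message. The function name, the logits and the four-letter vocabulary below are hypothetical and serve only as an illustration.

```python
def greedy_message(logit_sequence, vocab):
    """Pick the highest-scoring symbol at each position (no sampling)."""
    symbols = []
    for logits in logit_sequence:
        best_index = max(range(len(logits)), key=lambda i: logits[i])
        symbols.append(vocab[best_index])
    return "".join(symbols)


# Hypothetical logits for a 3-symbol message over the vocabulary "abcd".
logit_sequence = [
    [2.0, 0.1, 0.1, 0.1],
    [0.0, 0.0, 3.0, 0.0],
    [0.2, 1.5, 1.0, 0.3],
]
print(greedy_message(logit_sequence, "abcd"))  # acb
```

Decoding greedily makes the sender's message a deterministic function of its state, which simplifies counting message frequencies and relating messages to receiver behavior.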

Table 1 shows the actions taken by the receiver following some of the sender’s messages, with . One striking anomaly that in fact holds for all 495 emitted messages is that none of them has a nonzero probability associated with turning left, i.e., the receiver never turned left. We surmise that this suboptimality is idiosyncratic to PPO in general and not to the use of communication specifically. Under the assumption that the receiver would not turn left, there is a clear correlation between table 1 and figure 5; is virtually optimal.

Figure 5: the probabilities that the object was in a specified location relative to the receiver’s location and orientation at a time step in which the sender emitted the message with in the GoToObj environment. The triangle indicates the receiver’s relative location and orientation.

For longer communication intervals, it is not as straightforward to ascertain semantics, as there is a combinatorial explosion of possible trajectories following a message. Table 2 shows the actions taken by the receiver averaged over integral trajectories following some of the sender’s messages, with . The strategy associated with the most frequent message “bbbbbbbb” can be identified as repeatedly moving straight into a wall and then turning left, thereby scouring the perimeter. Consulting figure 6, this appears to be sensible, although it is not optimal behavior. Similarly for the third most frequent message, “eeeeeeee”.

5 Conclusion

We have demonstrated that established deep reinforcement learning techniques are sufficient to incentivize the emergence of grounded discrete communication between agents in grid world environments, without employing straight-through estimation or tailored inductive biases. In addition, we have shown that such communication can be leveraged to improve over established transfer learning techniques.

One limitation of the incentives that we have provided is the assumption that the receiver supplies a reward signal to the sender. However, this assumption is satisfied naturally by the state-value estimate of any actor–critic agent model and is arguably more straightforward to implement than inductive biases used in previous work. Another limitation is that the effectiveness of communication decreased as the communication interval increased, ostensibly as a consequence of a decrease in the accuracy of the reward signal provided by the receiver to the sender.

anything 0.22 0.78 21.96 % 0.24 0.14 0.62 17.05 %
nothing 0.00 1.00 16.66 % 0.12 0.16 0.72 13.43 %
wall 1.00 0.00 5.30 % 1.00 0.00 0.00 3.62 %
Table 2: the probability distributions of an action of the receiver following the sender’s emission of the most frequent message or the third most frequent message with in the GoToObj environment. The distributions are further conditioned on , which here indicates what was in the cell facing the receiver at the time step that the action was taken, namely “anything”, “nothing” or a “wall”. Actions and are to turn left and right, respectively, while action is to move forward.

5.1 Future Work

kottur2017natural and lowe2019pitfalls previously stressed that, despite the relative simplicity of emergent communication in contemporary works, the associated semantics are difficult to interpret. Our work ties in with this assessment, as we had difficulty analyzing the semantics in all but the simplest of environments. One sinister future scenario could involve inadvertently allowing agents to use steganography. We therefore consider the development of artificial language processing (ALP) to be of crucial importance to understanding artificial intelligence and maintaining value alignment (Yudkowsky, 2004; Leike et al., 2018).

DBLP:journals/corr/abs-1203-2990 proposed a theory that relates difficulty of learning in deep architectures to culture and language. Our setup can be used to test this theory, by iteratively training random pairs of agents from a larger pool and observing whether a more effective communication protocol emerges between one of the pairs and manages to spread through subsequent pairings.

message    P(turn right)  P(move forward)  frequency
ffffhhha   1.00           0.00             11.67 %
ffffffbb   0.00           1.00              9.10 %
fffffbbb   0.00           1.00              8.50 %
total:                                     29.27 %
Table 1: the probability distributions of the first action of the receiver following the sender’s emission of one of the three most frequent messages with in the GoToObj environment. Action is to turn right, while action is to move forward.
Figure 6: the probabilities that the object was in a specified location relative to the receiver’s orientation at a time step in which the sender emitted the message with in the GoToObj environment.


We thank Tim Baumgärtner, Gautier Dagan, Wilker Fereirra Aziz, Dieuwke Hupkes, Bence Keresztury, Mathijs Mul, Diana Rodríguez Luna and Sainbayar Sukhbaatar for offering their help in producing this work.


References

  • Y. Bengio, N. Léonard, and A. C. Courville (2013) Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. CoRR abs/1308.3432.
  • M. Chevalier-Boisvert, D. Bahdanau, S. Lahlou, L. Willems, C. Saharia, T. H. Nguyen, and Y. Bengio (2018) BabyAI: First Steps Towards Grounded Language Learning with a Human in the Loop. CoRR abs/1810.08272.
  • K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation. arXiv preprint arXiv:1406.1078.
  • S. Hochreiter and J. Schmidhuber (1997) Long Short-Term Memory. Neural Computation 9 (8), pp. 1735–1780.
  • S. Ioffe and C. Szegedy (2015) Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In International Conference on Machine Learning, pp. 448–456.
  • E. Jang, S. Gu, and B. Poole (2016) Categorical Reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144.
  • J. Leike, D. Krueger, T. Everitt, M. Martic, V. Maini, and S. Legg (2018) Scalable Agent Alignment Via Reward Modeling: A Research Direction. arXiv preprint arXiv:1811.07871.
  • C. J. Maddison, A. Mnih, and Y. W. Teh (2016) The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. CoRR abs/1611.00712.
  • E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville (2018) FiLM: Visual Reasoning with a General Conditioning Layer. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language Models Are Unsupervised Multitask Learners.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
  • E. Yudkowsky (2004) Coherent Extrapolated Volition. Singularity Institute for Artificial Intelligence.