Emergence of Communication in an Interactive World with Consistent Speakers

09/03/2018 ∙ by Ben Bogin, et al. ∙ Tel Aviv University

Training agents to communicate with one another given task-based supervision only has attracted considerable attention recently, due to the growing interest in developing models for human-agent interaction. Prior work on the topic focused on simple environments, where training using policy gradient was feasible despite the non-stationarity of the agents during training. In this paper, we present a more challenging environment for testing the emergence of communication from raw pixels, where training using policy gradient fails. We propose a new model and training algorithm, that utilizes the structure of a learned representation space to produce more consistent speakers at the initial phases of training, which stabilizes learning. We empirically show that our algorithm substantially improves performance compared to policy gradient. We also propose a new alignment-based metric for measuring context-independence in emerged communication and find our method increases context-independence compared to policy gradient and other competitive baselines.







Natural language is learned not by passively processing text, but through active and interactive communication that is grounded in the real world [Winograd1972, Bruner1985, Harnad1990]. Since grounding is fundamental to human-agent communication, substantial effort has been put into developing grounded language understanding systems, in which an agent is trained to complete a task in an interactive environment given some linguistic input [Siskind1994, Roy and Pentland2002, Chen and Mooney2008, Wang, Liang, and Manning2016, Gauthier and Mordatch2016, Hermann et al.2017, Misra, Langford, and Artzi2017].

Recently there has been growing interest in developing models for grounded multi-agent communication, where communication arises between agents solely based on the necessity to cooperate or compete in order to complete an end task [Havrylov and Titov2017, Mordatch and Abbeel2017, Lazaridou et al.2018, Cao et al.2018, Lewis et al.2017]. Such computational accounts shed light on the properties of communication that emerge between agents, as a function of various constraints on the agents and environment, and allow us to examine central properties such as compositionality. This is an important step on the road towards understanding how to construct agents that develop robust and generalizable communication protocols, which is essential for human-agent communication.

However, most prior work in this field has focused on simple referential games, where a speaker agent communicates a message and a listener agent chooses an answer from a small number of options. This setup suffers from several simplistic assumptions. First, the listener performs a single action and observes immediate feedback, while in the real world agents must perform long sequences of actions and observe delayed reward. Second, the agents coordinate on a single task, while in the real world agents must perform multiple tasks that partially overlap. This results in a relatively simple optimization problem, and thus most prior work employed standard policy-gradient methods, such as REINFORCE [Williams1992] and actor-critic [Mnih et al.2016] to solve the task and learn to communicate.

In this work, we propose a more challenging interactive environment for testing communication emergence, where a speaker and a listener interact in a 2D world and learn to communicate from raw pixel data only. The environment accommodates multiple tasks, such as collecting, using, and moving objects, where the speaker emits a multi-symbol utterance, and the listener needs to perform a long sequence of actions in order to complete the task. For example, in Figure 1a the task is to navigate in the environment, collect exactly two yellow blocks (and no blue blocks) and bring them to a drop-zone.

Training the agents using policy gradient fails in this interactive environment. This is due to the non-stationarity of both agents that constantly update their model, the stochasticity of their actions at the initial phase of learning, combined with the long sequence of actions required to solve the task. In this work, we propose a more stable training algorithm, inspired by the obverter technique [Batali1998, Choi, Lazaridou, and de Freitas2018], where we impose structure on the learned representation space of utterances and worlds, such that the speaker produces more consistent utterances in similar contexts. This aids the listener in learning to map utterances to action sequences that will solve the task. We show that the agent’s ability to communicate and solve the task substantially improves compared to policy gradient methods, especially as the tasks become increasingly complex.

Once the agents solve a task, we can analyze the properties of the communication protocol that has arisen. We focus on the property of context-independence, namely, whether symbols retain their semantics in various contexts, which implies compositionality. To this end, we develop a new alignment-based evaluation metric that aligns concepts with symbols, and find that our algorithm also produces a communication protocol that is more context-independent compared to policy gradient and other competitive baselines.

Figure 1: Initial worlds as observed by the speaker for the three mission types. The goal inventory is at the bottom and the drop-zone is the top row. In Bring, blocks are marked by black, and in Paint, blocks are marked by pink. The listener is marked by a green square.

To summarize, the contribution of this work is three-fold:

  1. A rich and extensible environment for investigating communication emergence.

  2. A new method for training communicating agents that substantially outperforms policy gradient.

  3. A novel metric for evaluating context-independence of a communication protocol.

Our entire code-base can be downloaded from https://github.com/benbogin/emergence-communication-cco/.

Related work

Prior work on emergent communication focused on referential games or variants of them. In such games, a speaker describes an image or object, and a listener chooses between what the speaker is referring to and one or more distractors [Lazaridou, Peysakhovich, and Baroni2016, Lazaridou, Pham, and Baroni2016, Havrylov and Titov2017, Evtimova et al.2018, Lazaridou et al.2018]. In our setup, the agent has to perform a long sequence of actions in a fully interactive environment. Because the space of action sequences is huge and positive feedback is only sparsely received, optimization is a major issue: the probability of randomly performing a correct action sequence is vanishingly small. Some work on referential games also performed a differentiable relaxation to facilitate end-to-end learning [Choi, Lazaridou, and de Freitas2018, Mordatch and Abbeel2017, Havrylov and Titov2017], while in this work we use reinforcement learning.

Mordatch and Abbeel (2017) have shown emergence of communication in an interactive 2D world. However, they assume a model-based setup, where the world dynamics are known and differentiable, which limits the applicability of their method. Moreover, the input to the agents is a structured representation that describes the world and goal rather than raw pixels. Last, they also used a supervised signal as an auxiliary task, while we train from the environment’s reward only.

Evaluating the properties of emergent communication is a difficult problem. Prior work has focused on qualitative analyses [Havrylov and Titov2017, Evtimova et al.2018, Mordatch and Abbeel2017] as well as generalization tests [Kottur et al.2017, Choi, Lazaridou, and de Freitas2018]. Lazaridou, Pham, and Baroni (2016) have evaluated communication by aligning concepts and symbols, in a single-symbol utterance. In this work we propose an evaluation metric that is based on multi-symbol alignment algorithms, such as IBM model 1 [Brown et al.1993].

A Multi-task Grid World

In this section we present the environment and tasks designed for the speaker and listener. The environment design was guided by several principles. First, completing the tasks should require a sequence of actions with delayed reward. Second, agents should obtain raw visual input rather than a structured representation. Third, multiple related tasks should be executed in the environment, to examine information transfer between tasks.

Our environment is designed for a speaker $S$ and a listener $L$. The listener observes an initial 2D 5x5 grid world $w_0$, where objects of different colors are randomly placed in various cells of the grid, and an inventory of objects collected so far. The speaker observes $w_0$, but also an image of a goal inventory of objects $g$, which specifies the final goal: the type of mission that needs to be performed as well as the number and color of objects that the listener must interact with (Figure 1). Given the input pair $(w_0, g)$, the speaker produces an utterance $u$, a sequence of discrete symbols from a vocabulary $V$, which can be viewed as the speaker’s actions. The listener generates a sequence of actions $a_1, a_2, \dots$. At each step $t$, it observes the current world state $w_{t-1}$ and the utterance $u$, and selects an action $a_t$ from a set of actions $\mathcal{A}$, which causes the environment to deterministically modify the world state, i.e., $w_t = T(w_{t-1}, a_t)$. The game ends when the listener either accomplishes or fails the task, or when a maximum number of steps has been reached.

Our environment currently supports three missions, which can be viewed as corresponding to different “verbs” in natural language:

  1. Collect (Figure 1b): The listener needs to collect a set of objects. When the listener arrives at the position of an object, it automatically collects it. Thus, it is disqualified if it passes through positions with objects that should not be collected. After collecting all objects the listener should declare Stop.

  2. Paint (Figure 1c): The agent needs to collect a set of objects, paint them, and then declare Stop. The action Paint performs a painting action on the object that was collected last.

  3. Bring (Figure 1a): The agent needs to collect a set of objects and then bring them to a “drop-zone”. The listener is disqualified if it passes through the drop-zone before collecting all necessary objects.

A task is defined by a mission, a number of objects and a color. Note that the 3 missions overlap in that they all require understanding of the notion Collect. The set of actions for the listener includes six actions: move in one of the four cardinal directions (Up, Down, Left, Right), Paint, and Stop. The actions of the speaker are defined by the symbols in the vocabulary $V$, in addition to an END token that terminates the utterance.
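To make the task space concrete, the following is a minimal sketch of how tasks and listener actions could be enumerated. The class names, the `enumerate_tasks` helper and the color strings are illustrative assumptions and are not taken from the released code-base.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Mission(Enum):
    COLLECT = "collect"
    PAINT = "paint"
    BRING = "bring"


class Action(Enum):
    # The six listener actions described in the text.
    UP = "up"
    DOWN = "down"
    LEFT = "left"
    RIGHT = "right"
    PAINT = "paint"
    STOP = "stop"


@dataclass(frozen=True)
class Task:
    # A task is a (mission, number of objects, color) triple.
    mission: Mission
    count: int
    color: str


def enumerate_tasks(missions, counts, colors):
    """Hypothetical helper: list every task in a given setup."""
    return [Task(m, n, c) for m, n, c in product(missions, counts, colors)]


# Three colors, three numbers and three missions; color names are assumed.
tasks = enumerate_tasks(list(Mission), [1, 2, 3], ["blue", "red", "yellow"])
print(len(tasks), len(Action))
```

Under these assumptions a setup with three colors, three numbers and three missions contains 27 distinct tasks, each of which the speaker has to convey in a single utterance.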

The environment provides a per-action return $r_t$ after each action performed in the environment. For the speaker, the return is positive iff the listener successfully completes the task, and is zero otherwise and in all other timesteps. That is, the speaker gets a return only after emitting the entire utterance. For the listener, the return is positive iff it successfully completed the task. At all other timesteps the listener receives a negative return, to encourage short sequences, and a positive return at any timestep where an object is collected, to encourage navigation towards objects.

While the environment is designed for two players it can also be played by a single player. In that setting, an agent observes both the goal and the current world, and acts based on the observation. This is useful for pre-training agents to act in the world even without communication.

The speaker outputs an utterance only in the first state, and thus must convey the entire meaning of the task in one shot, including the mission, color and number of objects. The listener must understand the utterance and perform a sequence of actions to complete the task. Thus, the probability of randomly completing a task successfully is low, in contrast to referential games where the actor has a high probability of succeeding even when it does not necessarily understand the utterance.


We now describe our method for training the speaker and listener. We first present a typical policy gradient approach for training the agents (A2C, Mnih et al. (2016)) and outline its shortcomings. Then, we describe our method, which results in more stable and successful learning.

Policy gradient training

As evident from the last section, it is natural to view the dynamics of the speaker and listener as a Markov decision process (MDP), where we treat each agent as acting independently, while observing the other agent as part of the environment. Under this assumption, both agents can learn a policy that maximizes expected reward. Thus, policy gradient methods such as REINFORCE [Williams1992] or A2C [Mnih et al.2016] can be applied.

For the speaker $S$, the states of the MDP are defined by the input world-goal pair $(w_0, g)$, the transition function is the identity function, because uttered tokens do not modify the world state, and the actions and returns are as described in the previous section. The goal of the speaker is to learn a policy that maximizes the expected reward, where the reward function encapsulates the environment, including the policy of the listener.

For the listener $L$, the states of the MDP at each timestep are defined by its input world-utterance pair, the actions are chosen from the listener's action set, and the transition function is the deterministic world transition defined in the last section. For a sequence of actions, the total reward accumulates the per-action returns, which are as previously defined. Thus, the goal of the listener is to learn a policy that maximizes the expected total reward.

Optimizing the expected reward of the speaker and listener is performed using stochastic gradient descent. The gradient is approximated by sampling from multiple games, and computing the gradient from these samples. Because this is a high-variance estimate of the gradient, it is common to reduce variance by subtracting from the reward a baseline that does not introduce bias [Sutton and Barto1998]. In this work, we use “critic” value functions as baselines, one for the speaker and one for the listener. These value functions predict the expected reward and are trained from the same samples to minimize the L2 loss between the observed and predicted reward.
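The baseline idea can be illustrated with a small, framework-free sketch. The function names below are hypothetical, and plain lists of floats stand in for the tensors an actual implementation would use; in a real system the advantages would be treated as constants when differentiating the policy loss.

```python
def advantages(rewards, baselines):
    """Advantage = observed reward minus the critic's predicted reward."""
    return [r - b for r, b in zip(rewards, baselines)]


def policy_gradient_loss(log_probs, rewards, baselines):
    """Surrogate loss whose gradient is the baseline-corrected policy gradient.

    log_probs holds the log-probability of each sampled action under the policy.
    """
    adv = advantages(rewards, baselines)
    return -sum(a * lp for a, lp in zip(adv, log_probs)) / len(log_probs)


def critic_loss(rewards, baselines):
    """L2 loss between observed and predicted reward, used to train the critic."""
    return sum((r - b) ** 2 for r, b in zip(rewards, baselines)) / len(rewards)


# Two sampled games: one succeeded (reward 1), one failed (reward 0).
rewards = [1.0, 0.0]
baselines = [0.5, 0.5]    # critic predictions for the same two games
log_probs = [-1.0, -2.0]  # log-probabilities of the sampled actions

print(advantages(rewards, baselines))
print(policy_gradient_loss(log_probs, rewards, baselines))
print(critic_loss(rewards, baselines))
```

Minimizing the surrogate loss raises the probability of actions with positive advantage and lowers it for actions with negative advantage, while the critic is fit by regression on the same samples.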

A main reason why training with policy gradient fails in our interactive environment is that the agents are stochastic, and at the initial phase of training the distribution over utterances and action sequences is high-entropy. Moreover, the agents operate in a non-stationary environment, that is, the reward of the speaker depends on the policy of the listener and vice versa. Consequently, given similar worlds observed by the speaker, different utterances will be observed by the listener with high probability at the beginning of training, which will make it hard for the listener to learn a mapping from symbols to actions. In referential games this problem is less severe: the sequences are short, the action space is small, and reward is not sparse. Consequently, agents succeed in converging to a common communication protocol. In our environment, this is more challenging and optimization fails. We now present an alternative, more stable training algorithm.

Context-Consistent Obverter

We now describe a new model and training algorithm for the speaker (the listener stays identical to the policy gradient method), which we term Context-consistent Obverter (CCO), to overcome the aforementioned shortcoming of policy gradient. Specifically, given a task, such as “bring 2 blue”, we would like the speaker to be more consistent, i.e., output the same utterance with high probability even at the initial phase of training. Nevertheless, the speaker still needs the flexibility to change the meaning of utterances based on the training signal.

Recently, Choi, Lazaridou, and de Freitas (2018) proposed a method for training communicating agents. Their work, inspired by the obverter technique [Batali1998] which has its roots in the theory of mind [Premack and Woodruff1978], is based on the assumption that a good way to communicate with a listener is to speak in a way that maximizes the speaker’s own understanding.

We observe that under this training paradigm, we can impose structure on the continuous latent representation space that will lead to a more consistent speaker. We will define a model that estimates whether an utterance is good according to the speaker, which will break the symmetry between utterances and lead to a lower-entropy distribution over utterances. This will let the listener observe similar utterances in similar contexts, and slowly learn to act correctly.

Formally, in this setup the speaker does not learn a mapping from world-goal pairs to utterances, but instead learns a scalar-valued scoring function $\mathrm{score}(u, w_0, g)$ that evaluates how good the speaker thinks an utterance is, given the initial world and the goal. Specifically, we define:

$$\mathrm{score}(u, w_0, g) = -\lVert h_z - h_u \rVert_2$$

That is, we encode the world-goal pair $(w_0, g)$ and the utterance $u$ as vectors $h_z$ and $h_u$ with two learned encoding functions, and score them by the negative Euclidean distance between them. (Footnote 1: While we use Euclidean distance as the scoring function, it is not guaranteed that a closer state is better with respect to the actual goal. We empirically found this works well, and leave options such as learning this function for future work.) The best possible encoding of an utterance in this setup is to have the same encoding for the utterance and the world-goal pair: when $h_u = h_z$, the score is 0.
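As a minimal sketch of this scoring rule, assuming both encodings are already available as plain vectors (the names `h_u` and `h_z` are illustrative, and the encoders themselves are not modeled here):

```python
import math


def score(h_u, h_z):
    """Negative Euclidean distance between an utterance encoding and a
    world-goal encoding; higher (closer to zero) is better."""
    return -math.dist(h_u, h_z)


h_z = [0.0, 0.0]       # assumed encoding of the world-goal pair
far = [3.0, 4.0]       # an utterance encoding at distance 5
near = [0.3, 0.4]      # an utterance encoding at distance 0.5

print(score(far, h_z), score(near, h_z), score(h_z, h_z))
```

The score is never positive and reaches its maximum of zero exactly when the two encodings coincide, which matches the best-case encoding described above.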

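1: Input: world-goal pair (w_0, g)
2: u ← empty utterance
3: for t = 1, ..., maximal length do
4:       D ← {}      // scoring dictionary
5:       for v in V do D[u + v] ← score(u + v, w_0, g)
6:       if u is not empty then D[u] ← score(u, w_0, g)
7:       u* ← highest-scoring entry of D
8:       if u* = u then break
9:       append the chosen token to u
   return u
Algorithm 1 CCO speaker decoder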

We can now score utterances based on the geometric structure of the learned representation space using the scoring function, and decode the highest-scoring utterance. A naïve decoding algorithm would be to exhaustively score all possible utterances, but this is inefficient. Instead, we use a greedy procedure that decodes token-by-token (see Algorithm 1). At each step, given the decoded utterance prefix, we score all possible continuations (line 5) including the option of no continuation (line 6). Then, we take the one with the highest score (line 7), and stop either when no token is appended or when the maximal number of steps is reached. (Footnote 2: One could replace the argmax with sampling by forming a probability distribution from the scored utterances with a softmax; however, empirically we found that this does not improve results.)
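The greedy procedure can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: `score_fn` is a hypothetical callable that scores an utterance prefix for a fixed world-goal pair, and the toy scorer below stands in for the learned encoders. Offering the "no continuation" option only once the prefix is non-empty is an assumption.

```python
def greedy_decode(score_fn, vocab, max_len):
    """Greedy token-by-token decoding of the highest-scoring utterance."""
    utterance = ()
    for _ in range(max_len):
        # Score every one-token continuation of the current prefix.
        candidates = {utterance + (v,): score_fn(utterance + (v,)) for v in vocab}
        # Also allow stopping, i.e. keeping the prefix as it is (assumed to be
        # available only once at least one token has been emitted).
        if utterance:
            candidates[utterance] = score_fn(utterance)
        best = max(candidates, key=candidates.get)
        if best == utterance:
            break  # no token was appended
        utterance = best
    return utterance


# Toy scorer: utterances whose symbols sum to 7 sit closest to the "goal".
def toy_score(utterance):
    return -abs(sum(utterance) - 7)


print(greedy_decode(toy_score, vocab=(1, 2, 3), max_len=10))
print(greedy_decode(toy_score, vocab=(1, 2, 3), max_len=2))
```

Because the scorer is deterministic, the same world-goal pair always yields the same utterance, which is the consistency property the method relies on. Each step costs one scoring call per vocabulary symbol, rather than one per possible utterance as in exhaustive search.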

Figure 2: An illustration of the learned state space during training. Blue circles are hidden states of input goals, and red crosses are hidden states of input utterances. The purple dashed arrows indicate that an utterance representation will become closer to the goal representation if the advantage is positive, and more distant if it is negative.

As we explain below, the world-goal representation can be pre-trained and fixed. Thus, the speaker in CCO is trained to shift utterance representations closer to the world-goal representation when a positive reward is observed, and farther away when a negative reward is observed. Specifically, given a decoded utterance, we calculate the speaker’s advantage value. The objective is then to maximize the advantage-weighted score, summed over the set of all prefixes of the decoded utterance, with each term scaled by a coefficient that depends on the sign of the advantage.

Our objective pushes utterances that lead to high reward closer to the world-goal representation they were uttered in, and utterances that lead to low reward farther away. Since the speaker decodes utterances token-by-token from left to right, we consider all prefixes in the objective, pushing the model to create good representations even after decoding few tokens. Last, the coefficient puts higher weight on positive reward samples.

Figure 2 illustrates our model and training algorithm. Given the geometric structure of the latent representation space, an utterance whose representation is close to a certain world-goal hidden state will be consistently chosen by the speaker. If this leads to high reward, the model will update parameters such that the utterance representation becomes closer to that hidden state and more likely to be chosen for it or for similar states. Analogously, if this leads to low reward, the utterance representation will become more distant from the hidden state and eventually will not be chosen. The geometric structure breaks the symmetry between utterances at the beginning of training and stabilizes learning.

Comparison to Choi, Lazaridou, and de Freitas (2018)

Our work is inspired by Choi, Lazaridou, and de Freitas (2018) in that the speaker does not directly generate the utterance but instead receives it as input and has a decoding procedure for outputting it. However, we differ from their setup in important aspects: First, we focus on exploiting the structure of the latent space to break the symmetry between utterances and get a consistent speaker, as mentioned above. Second, their model only works in a setting where the agent receives immediate reward (a referential game with a binary outcome), and can thus be trained with maximum likelihood. Our model, conversely, works in an environment with delayed reward. Third, their model only updates the parameters of the listener, and the signal from the environment is not propagated to the speaker. We train the speaker using feedback from the listener. Fourth, their formulation requires that the speaker and listener have an identical architecture, while this work does not impose this constraint. Last, we will show empirically in the next section that our algorithm improves performance and interpretability.

Neural network details

Figure 3: A high-level overview of the network architectures of the listener (top) and the speaker (bottom).

Figure 3 provides a high-level overview of the listener and speaker network architectures.

The listener receives as input the world-utterance pair. The world is given as raw RGB pixel values and goes through a convolutional layer. Each token of the utterance is first embedded, and the tokens are then summarized with a BOW representation or a GRU [Cho et al.2014] with the last hidden state taken as output. The output vector goes through a single linear layer to obtain a representation for the utterance. The next layer adds multiplicative interaction between the world and utterance [Oh et al.2015], which then goes through another convolutional layer, followed by a single linear layer, to get a final hidden representation of the inputs. Finally, the value and action are obtained using a linear layer that outputs a scalar and a softmax layer whose size is the number of actions, respectively.

The speaker receives as input the world-goal pair in raw RGB format, both of which go through the same convolutional layer. The two inputs are then multiplied to obtain a hidden representation for the input task. When training with policy gradient (marked as dashed red box), the utterance is generated with a GRU that receives this hidden representation as its initial hidden state. When training with CCO, the utterance is given as a third input to the network (marked as a dotted purple box), and a representation of the utterance is computed through a multiplicative interaction, before calculating the distance with the task representation.


In this paper, we are interested in the emergence of communication and thus allow both the listener and the speaker to pre-train and learn to act in the world and solve all tasks in a single-agent setting. The speaker and the listener are pre-trained separately with the policy-gradient speaker network that observes the world and goal, except that the output layer is taken from the listener, since we predict an action and value (Figure 3). After pre-training, the listener will learn to represent the utterance from scratch, since it only observed the goal at pre-training time. Similarly, the speaker will only update the parameters that represent the utterance (dashed purple box in Figure 3).

Experimental Evaluation

Training setup      Ref-8C/5S  3C/1N/1M  8C/1N/1M  8C/3N/1M  3C/3N/2M  3C/3N/3M
V. size / max len.  15/20      15/20     15/20     15/20     15/20     15/20
PG                  0.87       0.94      0.79      0.2       0.24      0.05
PGLowEnt            0.76       0.92      0.89      0.42      0.38      0.31
PGFixed             0.94       0.82      0.8       0.29      0.28      0.07
NoTrainCCO          0.89       0.9       0.82      0.31      0.15      0.11
FixedRand           1.0        1.0       1.0       0.96      0.51      0.31
Obverter-GRU        0.99       -         -         -         -         -
Obverter-BOW        0.98       -         -         -         -         -
CCO-GRU             1.0        1.0       1.0       0.98      0.97      0.91
CCO-BOW             0.99       1.0       1.0       0.98      0.91      0.93

Table 1: Test results of task completion performance for tasks with increasing number of colors, numbers and missions (in the referential game we also use shapes). We also provide the vocabulary size and maximum sentence length for each setup.

In this section we aim to answer two main questions: (a) Does training with CCO improve the ability of agents to communicate and solve tasks in our environment? (b) Does the learned communication protocol exhibit context-independence?

Implementation details:

We train the agents simultaneously in 10 different environments for up to a total of 10 million steps. Parameters are updated in a batch every 5 steps using RMSProp [Tieleman and Hinton2012]. We use a hidden state size and GRU cell size of 50.

Evaluation of Task Completion

First, we evaluate the ability of the agents to solve tasks given different training algorithms. We evaluate agents by the proportion of tasks they solve on a test set of worlds that were not seen at training time. We evaluate the following baselines:

  • CCO-GRU: Our main algorithm.

  • CCO-BOW: Identical to CCO-GRU except we replace the GRU that processes the utterance with a bag-of-words for both the speaker and the listener.

  • Obverter: A reimplementation of the obverter as described in Choi, Lazaridou, and de Freitas (2018), where the speaker and listener are first pre-trained separately to classify the color and shape of given objects. The parameters of the convolution layers are then frozen, similar to CCO.

  • PG: A2C policy gradient baseline [Mnih et al.2016] (as previously described).

  • PGLowEnt: Identical to PG, except that logits are divided by a temperature to create a low-entropy distribution. The goal is to create a more consistent agent with policy gradient and compare to CCO.

  • PGFixed: Identical to PG, except that for the speaker only the parameters of the RNN that generates the utterance are trained, similar to CCO.

  • NoTrainCCO: Identical to CCO but without training the speaker, which will result in a consistent random language being output by the decoder. The goal is to investigate the importance of utterance representation learning.

  • FixedRand: An oracle speaker that given the task (“paint 1 blue” rather than pixels) assigns a fixed, unambiguous, but random sequence of symbols. This results in a perfectly consistent communication protocol that does not emerge and is not compositional, and the listener has to learn its meaning.
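The effect of the temperature used by PGLowEnt can be illustrated with a short sketch. The logits and temperature values below are arbitrary illustrative choices, not the settings used in the experiments.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [1.0, 0.5, 0.0]  # illustrative logits over three symbols
for temperature in (1.0, 0.5, 0.1):
    probs = softmax(logits, temperature)
    print(temperature, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

A temperature below 1 sharpens the distribution, so the most likely symbol is sampled more often and the speaker behaves more consistently, at the cost of less exploration.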

We evaluate the baselines on a sequence of tasks with increasing complexity:

Referential: A referential game reimplemented exactly as in Choi, Lazaridou, and de Freitas (2018). The speaker sees an image of a rendered 3D object, defined by one of 8 colors and 5 shapes, and describes it to the listener. The listener sees a different image and has to determine if it has the same color and shape as the object seen by the speaker.

Figure 4: Success rate in the 8C/1N/1M task as a function of games played, for vocabulary size 15 and maximum sentence length 20, for different algorithms.

Interactive: Our environment with varying numbers of colors, numbers and missions. In Table 1, 3C/3N/3M corresponds to a world with three colors, three numbers and all three missions. If a single number is used, it is ; If a single mission is used, it is Collect; if two missions are used, they are Collect and Paint.

Table 1 shows the results of our experiments. As the difficulty of the tasks increases, the performance of PG substantially decreases. PGLowEnt, which is more consistent, and PGFixed which only trains the speaker’s utterance generation parameters, outperform PG, but have low performance for complex tasks.

We also observe that NoTrainCCO does not perform well, showing that random utterance representations are not sufficient, and good representations must be learned. FixedRand, an oracle which assigns a consistent and unambiguous sequence of symbols for every task, outperforms other baselines. This shows that the non-stationarity of communication is a core challenge for PG, and a random but consistent language is more learnable. However, when we introduce multiple missions and the task becomes more complex, the listener is no longer able to decipher the random language, and performance drops. (Footnote 3: We compared networks with similar capacity and training time. Naturally, given a larger model and more training time the listener is likely to memorize a random language.)

The Obverter algorithm solves the referential task almost perfectly, but cannot be used in interactive settings. Finally, our algorithm, CCO, performs well on all tasks.

Top:
Training setup        3C/3N/1M  5C/3N/1M  8C/3N/1M
V. size / max length  8/10      10/10     13/10
random speaker        0.03      0.02      0.02
FixedRand             0.18      0.13      0.13
CCO-GRU               0.21      0.31      0.22
CCO-BOW               0.6       0.57      0.38

Middle:
Training setup        Ref-8C/5S  3C/1N/1M  8C/1N/1M
V. size / max length  15/10      5/10      10/10
PG                    0.05       0.04      0.01
Obverter-GRU          0.21       -         -
Obverter-BOW          0.26       -         -
CCO-GRU               0.3        0.43      0.27
CCO-BOW               0.49       0.33      0.5

Bottom:
Training setup        3C/3N/1M  3C/3N/2M  3C/3N/3M
V. size / max length  11/10     11/10     11/10
CCO-GRU               0.27      0.37      0.21
CCO-BOW               0.32      0.25      0.28

Table 2: Context-independence evaluation on the test set.

Figure 4 provides a learning curve for the success rate of different algorithms. We observe that the two consistent methods CCO and FixedRand solve the task much faster than PG methods, with a slight advantage for CCO.

Evaluation of Language

We now turn to analyzing the properties of the emerged language. A hallmark property of natural language is compositionality, the fact that the meaning of the whole is a function of the meaning of its parts [Frege1892]. However, there is no agreed upon metric for evaluating compositionality in emerged communication. We therefore propose to measure context-independence: whether atomic symbols retain their meaning regardless of the context. If communication is perfectly context-independent, then the meaning of the whole is compositional. Natural language is of course not context-independent, but words often retain their semantics in many contexts (for example the word ‘pizza’).

We now propose a metric that measures to what extent there is a one-to-one alignment between task concepts and utterance symbols. We would like to have a measure that provides a high score iff each concept (e.g., Red, or Paint) is mapped to a single symbol (e.g., “7”), and that symbol is not mapped to other concepts. We base our measure on probabilities of vocabulary symbols given concepts, $p(v \mid c)$, and concepts given vocabulary symbols, $p(c \mid v)$, that will be estimated by IBM model 1 [Brown et al.1993]. We use an IBM model, since it estimates alignment probabilities assuming a hard alignment between concepts and symbols, and specifically IBM model 1, because the order of concepts is not meaningful.

To run IBM model 1, we generate 1,000 episodes for each evaluated model, from which we produce pairs of utterances and task concepts. We then run IBM model 1 in both directions, which provides the probabilities $p(v \mid c)$ and $p(c \mid v)$.
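For readers unfamiliar with IBM model 1, the following is a compact sketch of its expectation-maximization training loop in plain Python. It is a simplified textbook variant without a NULL token, and the function name, the data and the number of iterations are illustrative assumptions rather than the evaluation code used for the reported results.

```python
from collections import defaultdict


def ibm_model1(pairs, iterations=20):
    """Estimate t[(target, source)] ~ p(target | source) with EM.

    pairs is a list of (source_tokens, target_tokens) tuples.
    """
    sources = {s for src, _ in pairs for s in src}
    targets = {t for _, tgt in pairs for t in tgt}
    # Start from a uniform translation table.
    table = {(t, s): 1.0 / len(targets) for t in targets for s in sources}
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: distribute each target token over the sources in its pair.
        for src, tgt in pairs:
            for t in tgt:
                norm = sum(table[(t, s)] for s in src)
                for s in src:
                    delta = table[(t, s)] / norm
                    count[(t, s)] += delta
                    total[s] += delta
        # M-step: renormalize the expected counts per source token.
        table = {(t, s): (count[(t, s)] / total[s] if total[s] > 0 else 0.0)
                 for (t, s) in table}
    return table


# Toy episodes: (task concepts, utterance symbols).
episodes = [
    (["blue", "one"], ["6", "2"]),
    (["blue", "two"], ["6", "1"]),
    (["red", "one"], ["5", "2"]),
    (["red", "two"], ["5", "1"]),
]
p_v_given_c = ibm_model1(episodes)
p_c_given_v = ibm_model1([(symbols, concepts) for concepts, symbols in episodes])
print(round(p_v_given_c[("6", "blue")], 3), round(p_c_given_v[("blue", "6")], 3))
```

Running the model in both directions only requires swapping the roles of concepts and symbols, as in the last two calls. On this toy data the symbol “6” co-occurs with Blue in every episode, so both probabilities approach 1.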

We now define a context-independence score:

$$v^{c} = \arg\max_{v} \, p(c \mid v), \qquad \mathrm{CI} = \frac{1}{|C|} \sum_{c \in C} p(v^{c} \mid c) \cdot p(c \mid v^{c})$$

where $C$ is the set of all possible task concepts. The score is the average alignment score for each concept, where the score for each concept is a product of the probabilities $p(v \mid c)$ and $p(c \mid v)$ for the symbol $v$ that maximizes $p(c \mid v)$. This captures the intuition that each concept should be aligned to one vocabulary symbol, and that symbol should not be aligned to any other concept. The measure is one-sided because usually $|V| > |C|$ and some vocabulary symbols do not align to any concept. Also note that $\mathrm{CI}$ will be affected by $|V|$ and $|C|$, and thus a fair comparison is between setups where the size of the vocabulary and the number of concepts are identical. We note that a perfectly context-independent language would yield a score of 1.
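The score described above can be computed directly from the two probability tables. The sketch below assumes they are stored as dictionaries keyed by `(symbol, concept)` and `(concept, symbol)`; this layout, the function name and the toy numbers are assumptions made for illustration.

```python
def context_independence(p_v_given_c, p_c_given_v, concepts, vocab):
    """Average over concepts of p(v*|c) * p(c|v*), with v* = argmax_v p(c|v)."""
    total = 0.0
    for c in concepts:
        # The symbol that most strongly indicates concept c.
        best = max(vocab, key=lambda v: p_c_given_v.get((c, v), 0.0))
        total += p_v_given_c.get((best, c), 0.0) * p_c_given_v.get((c, best), 0.0)
    return total / len(concepts)


concepts = ["blue", "red"]
vocab = ["6", "5", "9"]

# A perfectly context-independent toy language: blue <-> "6", red <-> "5".
perfect_vc = {("6", "blue"): 1.0, ("5", "red"): 1.0}
perfect_cv = {("blue", "6"): 1.0, ("red", "5"): 1.0}
print(context_independence(perfect_vc, perfect_cv, concepts, vocab))

# Red is described by "5" only half of the time, which lowers the score.
mixed_vc = {("6", "blue"): 1.0, ("5", "red"): 0.5, ("9", "red"): 0.5}
mixed_cv = {("blue", "6"): 1.0, ("red", "5"): 1.0, ("red", "9"): 1.0}
print(context_independence(mixed_vc, mixed_cv, concepts, vocab))
```

In the second toy language the symbol “5” always refers to Red, but Red is not always expressed by “5”, so the product for that concept drops to 0.5 and the average to 0.75.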

Number  Blue  Red  Yellow
1       6,2   2    3,2
2       6,1   1,5  3,1
3       6,7   7,5  3,7

Table 3: An example mapping of tasks to utterances showing the most commonly used utterances for the CCO - BOW model, in a setting of 3 colors, 3 numbers and 1 mission with a vocabulary size of 8 and a maximum sentence length of 10. The above language results in a score of 0.6.

Table 2 shows the results of our context-independence evaluation. Table 2, top, compares CCO to the following variants on the mission Collect. A random speaker that samples an utterance randomly sets a lower bound on the score for reference, and indeed its values are close to 0. FixedRand is a consistent speaker that is unambiguous, but is not compositional. The unambiguity increases the score slightly, but it remains low. The score of CCO is higher, and when we replace the GRU that processes the utterance with a bag-of-words, the score improves substantially.

Table 2, middle, compares CCO to Obverter and PG. In the referential game, we see that the language that emerged with CCO is substantially more context-independent than with Obverter and PG, especially when using the BOW variant. We see that PG exhibits very low context-independence both in the referential game and when tested on Collect with varying colors. We notice that, in contrast to CCO, PG encodes the task in much longer messages, which might explain the low score.

Table 2, bottom, investigates a multi-task setup, to examine whether training on multiple tasks improves context-independence. To this end, we run the agents with 3 colors and 3 numbers, while increasing the number of missions. For a fair comparison, we calculated the score using only the colors and numbers concepts. Despite our prior expectation, we did not observe any improvement in context-independence as the number of missions increases.

Table 3 shows an example mapping from tasks to utterances, denoting the most frequently used utterance for a set of Collect tasks. The table illustrates what a relatively context-independent language looks like (a score of 0.6). We see how many alignments are perfect (symbol ‘6’ for the concept Blue, symbol ‘1’ for the concept 2), while some symbols are aligned well to a concept in one direction, but not in the other (the symbol ‘5’ always refers to the concept Red, but this concept isn’t always described by ‘5’).


This paper presents a new environment for testing the emergence of communication in an interactive world. We find that policy gradient methods fail to train, and propose the CCO model, which utilizes the structure of the learned representation space to develop a more consistent speaker, which stabilizes learning. We also propose a novel alignment-based evaluation metric for measuring the degree of context-independence in the emerged communication. We find that CCO substantially improves performance compared to policy gradient, and also produces more context-independent communication compared to competitive baselines.