Emergence of Compositional Language with Deep Generational Transmission

04/19/2019 ∙ by Michael Cogswell, et al. ∙ 12

Consider a collaborative task that requires communication. Two agents are placed in an environment and must create a language from scratch in order to coordinate. Recent work has been interested in what kinds of languages emerge when deep reinforcement learning agents are put in such a situation, and in particular in the factors that cause language to be compositional, i.e., meaning is expressed by combining words which themselves have meaning. Evolutionary linguists have also studied the emergence of compositional language for decades, and they find that in addition to structural priors like those already studied in deep learning, the dynamics of transmitting language from generation to generation contribute significantly to the emergence of compositionality. In this paper, we introduce these cultural evolutionary dynamics into language emergence by periodically replacing agents in a population to create a knowledge gap, implicitly inducing cultural transmission of language. We show that this implicit cultural transmission encourages the resulting languages to exhibit better compositional generalization and suggest how elements of cultural dynamics can be further integrated into populations of deep agents.




1 Introduction

Cultural transmission of language occurs when one group of agents passes their language to a new group of agents, e.g., parents who teach their children to speak as they do. Of the many design features which make human language unique, cultural transmission is important partially because it allows language itself to change over time via cultural evolution (Tomasello, 1999; Christiansen & Kirby, 2003a). This helps explain how a modern language like English emerged from some proto-language, an “almost language” precursor lacking the functionality of modern languages.

Figure 1: We introduce cultural transmission into language emergence between neural agents. The starting point of our study is the goal-oriented dialogue task of Kottur et al. (2017), summarized in Figure 2. During learning we periodically replace some agents with new ones (gray agents). These new agents do not know any language, but instead of creating one they learn it from older agents. This creates generations of language that become more compositional over time.

Compositionality is an important structural property of language, interesting to both linguists and machine learning researchers, which evolutionary linguists explain via cultural transmission (Kirby et al., 2014). In this work, a compositional language is one that expresses concepts by combining simpler elements which each have their own meaning. This kind of structure helps give human language the ability to express infinitely many concepts using finitely many elements, and to generalize in obviously correct ways despite a dearth of training examples (Lake & Baroni, 2018). For example, an agent who understands blue square and purple triangle should also understand purple square without directly experiencing it; we use this sort of generalization to measure compositionality. Existing work has investigated conditions under which compositional languages emerge between neural agents in simple environments (Mordatch & Abbeel, 2018; Kottur et al., 2017), but it only investigates how language changes within a generation.

The iterated learning model (Kirby et al., 2008; Kirby, 2001; Kirby et al., 2014), which simulates cultural transmission, has found that generational dynamics cause compositional language to emerge, both in simulation (Kirby, 2001) and in experiments with human subjects (Kirby et al., 2008). In this model, language is directly but incompletely transmitted (taught) to one generation of agents from the previous generation. Because learning is incomplete and biased, the student language may differ from the teacher language. With the right learning and transmission mechanisms, a non-compositional language becomes compositional after many generations. This is cast as a trade-off between expressivity and compressibility, where a language must be expressive enough to differentiate between all possible meanings (e.g., objects) and compressible enough to be learned (Kirby et al., 2015). The explanation is so prominent that it was somewhat surprising when it was recently found that other factors can cause enough compressibility pressure to get compositional language emergence without generational transmission (Raviv et al., 2018).

In AI, emergence work aims to influence efforts to build intelligent agents that communicate with each other and especially with humans using language. On the fully supervised end of the spectrum, agents have been successfully trained to mimic human utterances for applications like machine translation (Bahdanau et al., 2014), image captioning (Xu et al., 2015), visual question answering (Antol et al., 2015), and visual dialog (Das et al., 2017a). In addition to the large amounts of data required, these systems still do not generalize nearly as well as we would like, and are rather hard to evaluate because natural language is open-ended and has strong priors that obscure true understanding (Liu et al., 2016; Agrawal et al., 2018). Other approaches use less supervision, placing multiple agents in carefully designed environments and giving them goals which require communication (Mikolov et al., 2016; Gauthier & Mordatch, 2016; Kiela et al., 2016). If some of the agents in the environment already know a language like English then the other agents can indirectly learn that language. But even this is expensive because it requires some agent that already knows a language.

On the other end of the spectrum, and most relevant to us, some work has found that languages will emerge to enable communication-centric tasks to be solved without direct or even indirect language supervision (Foerster et al., 2016; Sukhbaatar et al., 2016; Lazaridou et al., 2017; Das et al., 2017b). Unlike regimes where agents are trained to learn an existing language, languages that emerge in this sort of setting are not necessarily easy to understand for a human. Even attempts to translate these emerged languages for other agents are not completely successful (Andreas et al., 2017), possibly because the target languages can’t express the same concepts as the source languages. This desire for structure motivates the previously mentioned work on compositional language emergence in neural agents (Kottur et al., 2017; Mordatch & Abbeel, 2018; Choi et al., 2018).

In this paper, we study the following question – what are the conditions that lead to the emergence of a compositional language? Our key finding is evidence that cultural transmission leads to more compositional language in deep reinforcement learning agents, as it does in evolutionary linguistics. The starting point for our investigation is the recent work of Kottur et al. (2017), which investigates compositionality using a cooperative reference game between two agents. Instead of using the same set of agents throughout training, we replace (re-initialize) some subset of them periodically. The resulting knowledge gap makes it easier for the new agent to learn from the older agents than to create a new language. In this way our approach introduces cultural transmission and thus a compressibility pressure, causing compositionality to emerge.

One difference between our approach and evolutionary methods applied elsewhere in deep learning (Stanley & Miikkulainen, 2002; Stanley et al., 2009; Real et al., 2017) is that we emulate cultural evolution instead of biological evolution. In biological evolution agents change from generation to generation while in cultural evolution the language itself evolves, so the same agent can have different languages at different times. Agents can directly benefit from evolutionary innovations throughout their lifetime instead of only at the beginning.

Our approach is also different from iterated learning because our version of cultural transmission is implicit instead of explicit. Instead of teachers telling students exactly how to refer to the world, language is shared only to the extent doing so helps accomplish the goal.

Through our experiments and analysis, we show that periodic agent replacement is an effective way to induce cultural transmission and more compositionally generalizable language. To summarize, our contributions are:

  1. We propose a method for inducing implicit cultural transmission to neural language models.

  2. We introduce new metrics to measure the similarity between agent languages and verify cultural transmission has occurred.

  3. We show our cultural transmission procedure induces compositionality in neural language models, going from 16% accuracy on a compositionally novel test set to 59% in the best configuration. Furthermore, we show that this is complementary with previously used priors which encourage compositionality.

2 Approach

We start by introducing goal-driven neural dialog models similar to Kottur et al. (2017), then describe how we incorporate cultural transmission.

2.1 Goal-Driven Neural Dialog

In this work, we consider a setting where two agents, $Q$ (questioner) and $A$ (answerer), must communicate by exchanging single tokens over multiple rounds to solve a mutual task. Specifically, $Q$ must report some attributes of an object seen only by $A$. To accomplish this task, $Q$ must query $A$ for the information.

Using the example in Figure 2, an ideal dialog goes like this. At the beginning, $Q$ is given some task to solve (color, shape) and asks a question indicating the task (“X”, requesting the color). $A$ observes the object and answers the question (“1”, meaning purple). This dialog continues for a couple more rounds; $Q$ says “Y” and $A$ responds “2”, so $Q$ also knows the shape is square. Finally, $Q$ reports its prediction (purple, square) and both agents receive a reward if the prediction is correct.

Each conversation has a fixed number of rounds $T$. At round $t$, $Q$ starts by observing $A$’s previous response $m^A_{t-1}$, its view of the world $x_Q$ (e.g., the task), and its previous memory $h^Q_{t-1}$, then outputs memory $h^Q_t$ summarizing its current view of the dialog and utters message $m^Q_t$. Functionally, $(h^Q_t, m^Q_t) = Q(m^A_{t-1}, x_Q, h^Q_{t-1})$. Similarly, $A$ responds by computing $(h^A_t, m^A_t) = A(m^Q_t, x_A, h^A_{t-1})$, where $x_A$ is the object it observes. After the conversation, $Q$ tries to solve the given task by predicting $\hat{u}$ (e.g., corresponding to red square) as a function of its observation and final memory: $\hat{u} = U(x_Q, h^Q_T)$. Both agents are rewarded if both attributes are correct. As in Kottur et al. (2017), we implement $Q$, $A$, and $U$ as neural networks.
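The round structure described above can be sketched with hand-written stand-ins for the two agents. This is only an illustration of the message flow: the real agents are learned neural networks, and every name here (`questioner`, `answerer`, `predict`, `run_dialog`), the toy vocabularies, and the choice of two rounds are assumptions made for this example.

```python
# Hypothetical toy world: attribute values and question tokens are illustrative only.
ATTRIBUTES = {"color": ["red", "purple"], "shape": ["circle", "triangle", "square"]}
QUESTION_FOR = {"color": "X", "shape": "Y"}
ATTRIBUTE_FOR = {token: name for name, token in QUESTION_FOR.items()}


def questioner(prev_answer, task, memory):
    """One questioner step: store the previous answer, then ask about the next attribute."""
    if prev_answer is not None:
        memory = memory + [prev_answer]
    next_attribute = task[len(memory) % len(task)]
    return memory, QUESTION_FOR[next_attribute]


def answerer(question, obj, memory):
    """One answerer step: look up the requested attribute and reply with its index as a token."""
    attribute = ATTRIBUTE_FOR[question]
    token = str(ATTRIBUTES[attribute].index(obj[attribute]))
    return memory + [question], token


def predict(task, memory):
    """Decode the stored answer tokens back into attribute values."""
    return tuple(ATTRIBUTES[name][int(token)] for name, token in zip(task, memory))


def run_dialog(task, obj, num_rounds=2):
    """Run the dialog and return (transcript, prediction, reward)."""
    q_memory, a_memory, answer = [], [], None
    transcript = []
    for _ in range(num_rounds):
        q_memory, question = questioner(answer, task, q_memory)
        a_memory, answer = answerer(question, obj, a_memory)
        transcript.append((question, answer))
    # Simplification: fold the last answer into the questioner's memory before predicting.
    q_memory = q_memory + [answer]
    prediction = predict(task, q_memory)
    # Both agents are rewarded only if every requested attribute is correct.
    reward = 1.0 if prediction == tuple(obj[name] for name in task) else 0.0
    return transcript, prediction, reward
```

Calling `run_dialog(("color", "shape"), {"color": "purple", "shape": "square"})` reproduces the ideal exchange from the example dialog ("X" answered by "1", then "Y" answered by "2"). In the learned setting nothing fixes these meanings in advance; the agents have to converge on some such convention through reward alone.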

Figure 2: An example dialog based on Kottur et al. (2017) and described in Section 2.1.

Our model is trained to maximize the reward using policy gradients (Williams, 1992). Unlike an approach supervised by human dialogues, nothing orients the agents toward specific meanings for specific words, so they must create their own appropriately grounded language to solve the task. (Footnote 1: This lack of alignment also means $Q$ and $A$ messages aren’t necessarily interpretable as questions and answers, respectively.) This approach, summarized in lines 4-9 of Algorithm 1, is our starting point. In Kottur et al. (2017) it was used to generate a somewhat compositional language given appropriate agent and vocabulary configurations.

2.2 Language Emergence with Cultural Transmission

Here we add cultural transmission to neural dialog models by considering an implicit model of cultural transmission. Implicit cultural transmission does not use word-level supervision, as opposed to explicit cultural transmission in which students are told which words refer to which objects. In implicit cultural transmission, shared language emerges from shared goals. We develop an implicit model of cultural transmission that periodically replaces agents. (Footnote 2: One of the first things we tried was to implement an explicit teaching phase where some agents would generate language used as training data for other agents during an auxiliary training phase. This never turned out to be helpful, often hurting compositional generalization. We also tried simulating the experiments from Kirby et al. (2008) with neural agents. Despite two major architecture variations our initial results were negative. In future work we plan to study why this happens and improve the result.) Consequently, older agents remember the old language while new agents learn it, favoring more easily compressible compositional language.

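 1  for epoch $e = 1, \dots, N_{\text{epochs}}$ do
 2      Sample $Q_i$ from the population of $Q$s
 3      Sample $A_j$ from the population of $A$s
 4      for $(x_Q, x_A, u)$ in each batch do
 5          for conversation rounds $t = 1, \dots, T$ do
 6              $(h^Q_t, m^Q_t) = Q_i(m^A_{t-1}, x_Q, h^Q_{t-1})$
 7              $(h^A_t, m^A_t) = A_j(m^Q_t, x_A, h^A_{t-1})$
 8          $\hat{u} = U_i(x_Q, h^Q_T)$
 9      Policy gradient update both $Q_i$ and $A_j$ parameters
10      if $e \bmod E = 0$ then
11          Sample replacement set $B$ using $\pi$
12          Re-initialize parameters and optimizer for all agents in $B$
13  return all $Q$s and $A$s.
Algorithm 1: Training with Replacement and Multiple Agents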
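The control flow of training with replacement can be sketched as follows. This is a bookkeeping skeleton under stated assumptions, not the paper's implementation: `StubAgent` stands in for a neural agent, `update` stands in for a policy gradient step, and the names `train`, `period`, and `replace_one_of_each` are hypothetical.

```python
import random


class StubAgent:
    """Placeholder for a neural agent; tracks bookkeeping only, not real parameters."""

    def __init__(self, rng):
        self.reinitialize(rng)

    def reinitialize(self, rng):
        self.params = rng.random()  # stands in for fresh network weights and optimizer state
        self.age = 0                # number of updates since (re-)initialization

    def update(self):
        self.age += 1               # stands in for one policy gradient step


def train(num_epochs, period, num_q, num_a, choose_replacements, seed=0):
    """Sample one questioner/answerer pair per epoch and periodically replace agents."""
    rng = random.Random(seed)
    q_bots = [StubAgent(rng) for _ in range(num_q)]
    a_bots = [StubAgent(rng) for _ in range(num_a)]
    for epoch in range(1, num_epochs + 1):
        # Pick the pair that interacts (and is updated) this epoch.
        q_bot = rng.choice(q_bots)
        a_bot = rng.choice(a_bots)
        q_bot.update()
        a_bot.update()
        # Every `period` epochs, re-initialize whatever the replacement policy selects.
        if epoch % period == 0:
            for agent in choose_replacements(q_bots, a_bots, rng):
                agent.reinitialize(rng)
    return q_bots, a_bots


def replace_one_of_each(q_bots, a_bots, rng):
    """Hypothetical policy: one uniformly chosen questioner and one answerer."""
    return [rng.choice(q_bots), rng.choice(a_bots)]
```

For example, `train(100, 10, 3, 3, replace_one_of_each)` ends with at least one questioner whose `age` is 0, because a replacement fires on the final epoch. Setting `period` larger than `num_epochs` recovers training without replacement.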


One open choice in this approach is how to select which agents to re-initialize; we explore different options in this section. Every $E$ epochs, the replacement policy $\pi$ is called and returns a list of agents to be re-initialized, as seen at lines 10-12 of Algorithm 1. (Footnote 3: Note that our world is rather small so there is only one batch per epoch.) This process creates generations of agents such that each generation learns languages that are slightly different from, but eventually improve upon, those of previous generations.

Single Agent.

In Kottur et al. (2017) there is only one pair of agents ($N_Q = N_A = 1$), so we cannot replace both agents at the same time because all existing language would be lost. Instead, we consider two strategies that only replace one bot at a time:

  1. Random. Sample $Q$ or $A$ uniformly at random.

  2. Alternate. Alternate between $Q$ and $A$.
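The two single-pair strategies can be written as small iterators that yield which role to re-initialize at each replacement. The function names and the `"Q"`/`"A"` labels for questioner and answerer are assumptions made for this sketch.

```python
import itertools
import random


def random_policy(rng):
    """Random: pick the questioner or the answerer uniformly at each replacement."""
    while True:
        yield rng.choice(["Q", "A"])


def alternate_policy(first="Q"):
    """Alternate: switch roles at every replacement, starting from `first`."""
    order = ["Q", "A"] if first == "Q" else ["A", "Q"]
    return itertools.cycle(order)
```

Either iterator can be advanced with `next(...)` each time the replacement period elapses. Both guarantee that one agent always survives, so the existing language is never lost entirely.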

Multi Agent.

$Q$ and $A$ have different roles due to the asymmetry in information, so they use different parts of the language. Replacing $Q$ (alt. $A$) means $A$ (alt. $Q$) has to remember what $Q$ says or else knowledge about that part of the language is lost. If $Q$ was replaced but other $Q$s speaking the same language were present then $A$ would have incentive to remember the original language because of the other bots, preventing language loss. Furthermore, a Multi Agent environment may add compressibility pressure (Raviv et al., 2018) as bots have more to remember if there is any difference between the languages of their conversational partners. Finally, having multiple agents per type could introduce more variations in language, providing an opportunity to favor even better languages. Thus we introduce multiple $Q$s and $A$s.

More concretely, we consider a population of $Q$s $\{Q_1, \dots, Q_{N_Q}\}$ and a population of $A$s $\{A_1, \dots, A_{N_A}\}$. Each member of the populations has a different set of parameters, but any $Q$-$A$ pair might be sampled to interact with each other in the same batch. During learning we sample random pairs to interact and receive gradient updates. These changes correspond to lines 2-3 of Algorithm 1. In this Multi Agent scenario we replace one $Q$ and one $A$ every generation. We investigate three ways to sample agents to replace:

  1. Uniform Random. Sample one $Q$ and one $A$ at random, placing equal probability on each option.

  2. Epsilon Greedy. Replace agents as in Uniform Random with probability $\epsilon$. Replace the agent with lowest validation accuracy with probability $1 - \epsilon$.

  3. Oldest. Consider only the $Q$s and $A$s which have been around for the most epochs and sample (uniformly) one of each to replace.
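The three sampling rules above can be sketched as functions over one population (they would be applied once to the questioners and once to the answerers). `AgentInfo`, its fields, and the function names are hypothetical, and the value of `epsilon` is left to the caller.

```python
import random
from dataclasses import dataclass


@dataclass
class AgentInfo:
    name: str
    age: int             # epochs since (re-)initialization
    val_accuracy: float  # validation accuracy of this agent


def uniform_random(agents, rng):
    """Uniform Random: every agent is equally likely to be replaced."""
    return rng.choice(agents)


def epsilon_greedy(agents, rng, epsilon):
    """Epsilon Greedy: explore uniformly with probability epsilon, else replace the worst agent."""
    if rng.random() < epsilon:
        return uniform_random(agents, rng)
    return min(agents, key=lambda agent: agent.val_accuracy)


def oldest(agents, rng):
    """Oldest: sample uniformly among the agents that have existed the longest."""
    max_age = max(agent.age for agent in agents)
    return rng.choice([agent for agent in agents if agent.age == max_age])
```

With `epsilon=0.0` the second rule always removes the agent with the lowest validation accuracy, and with `epsilon=1.0` it reduces to Uniform Random.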

To see how this could cause cultural transmission, consider an $A_{\text{old}}$ that was just replaced with a new bot $A_{\text{new}}$. Most of the $Q$s already know how to translate one set of symbols (the ones from $A_{\text{old}}$) into correct predictions. If $A_{\text{new}}$ is going to help these $Q$s solve their task then it will be more efficient for it to use the language already known to the receivers, that of $A_{\text{old}}$. Thus $A_{\text{new}}$ will learn the existing language because it is more efficient than alternatives. This pressure for cultural transmission comes from the imbalance in knowledge between young and old agents created by re-initializing old bots. Section 4 supports this argument by showing that languages evolved via our approach are more similar to each other than otherwise.

3 Experiments

In this section we investigate how our language evolution mechanism affects the structure of emergent languages. We show that our replacement approach causes compositionality and that cultural transmission does occur despite the implicit nature of our implementation.

3.1 Neural Question Answering Details

Task Description.

As in Kottur et al. (2017), our world contains objects with 3 attributes (shape, size, color) such that each attribute has 4 possible values. Objects are represented ‘symbolically’ as 3-hot vectors and not rendered as RGB images.
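A 3-hot encoding concatenates one one-hot block per attribute, so exactly three entries are set. The sketch below assumes illustrative attribute values (the concrete values are not listed in the text) and a hypothetical helper name `three_hot`.

```python
# Hypothetical attribute values: 3 attributes with 4 possible values each.
ATTRIBUTE_VALUES = {
    "shape": ["square", "triangle", "circle", "star"],
    "size": ["small", "medium", "large", "huge"],
    "color": ["red", "blue", "green", "purple"],
}


def three_hot(obj, attribute_values=ATTRIBUTE_VALUES):
    """Concatenate a one-hot block for each attribute of `obj`."""
    vector = []
    for name, values in attribute_values.items():
        block = [0] * len(values)
        block[values.index(obj[name])] = 1
        vector.extend(block)
    return vector
```

With 3 attributes of 4 values each, every object maps to a length-12 binary vector with three ones, and the world contains 4 × 4 × 4 = 64 distinct objects.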

Evaluation with Compositional Dataset.

The explicit annotation of independent properties like shape and color allows compositionality to be tested, a necessarily domain specific evaluation. Certain combinations of attributes (e.g., purple square) are held out of the training set while ensuring that at least one instance of each attribute is present (e.g., at least one purple thing and one square thing). If the language created by interaction between agents can identify the held out instances (i.e., it has unique words for purple square which both agents understand) then it is compositional. This is simply measured by accuracy on the test set. Previous work also measures generalization to held out compositions of attributes to measure compositionality (Kottur et al., 2017; Kirby et al., 2015).

Unlike Kottur et al. (2017), we use a slightly harder version of their dataset which aligns better with the goal of compositional language. For a few selected pairs of attributes (Footnote 4: In our case, red triangles, filled stars, and blue dotted objects.) our version ensures those combinations are never seen outside the test set. This removes opportunities for non-compositional generalization. Without this constraint, agents could generalize perfectly using words for attribute pairs like “red triangle” and “filled star” instead of words for “red,” “triangle,” “filled,” and “star.” The drop in accuracy between test and validation (which does not hold out attribute pairs) is roughly 20 points. (Footnote 5: Throughout the paper accuracy in this setting refers to the “Both” and not the “One” setting from Kottur et al. (2017). That means a prediction is correct only if both predicted attributes match the ground truth.)
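The held-out-pair constraint can be illustrated with a small split routine: any object containing one of the selected attribute pairs goes to the test set, and we check that every individual attribute value still appears in training. This is a simplified sketch, not the paper's dataset code. The attribute values, the helper names, and the choice to send only the held-out pairs to the test set are assumptions.

```python
from itertools import product

# Hypothetical attribute values; the held-out pairs mirror the ones named in the text.
ATTRIBUTES = {
    "shape": ["square", "triangle", "circle", "star"],
    "color": ["red", "blue", "green", "purple"],
    "style": ["filled", "dotted", "dashed", "striped"],
}
HELD_OUT_PAIRS = [
    {"color": "red", "shape": "triangle"},
    {"style": "filled", "shape": "star"},
    {"color": "blue", "style": "dotted"},
]


def all_objects(attributes):
    """Enumerate every combination of attribute values as a dict."""
    names = list(attributes)
    return [dict(zip(names, values)) for values in product(*attributes.values())]


def contains_pair(obj, pair):
    """True if the object has every attribute value listed in `pair`."""
    return all(obj[name] == value for name, value in pair.items())


def compositional_split(attributes, held_out_pairs):
    """Objects containing a held-out pair go to test; everything else goes to train."""
    train, test = [], []
    for obj in all_objects(attributes):
        held_out = any(contains_pair(obj, pair) for pair in held_out_pairs)
        (test if held_out else train).append(obj)
    return train, test


def covers_every_value(objects, attributes):
    """Check that each single attribute value appears in at least one object."""
    return all(
        any(obj[name] == value for obj in objects)
        for name, values in attributes.items()
        for value in values
    )


train_set, test_set = compositional_split(ATTRIBUTES, HELD_OUT_PAIRS)
```

Under these assumptions the 64 objects split into 52 training and 12 test objects, and `covers_every_value(train_set, ATTRIBUTES)` holds, so "red" and "triangle" are each seen in training even though no red triangle is.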

Architecture and Training.

Our $Q$s and $A$s have the same architecture and hyperparameter variations as in Kottur et al. (2017), but with our cultural transmission training procedure and some other differences identified below. Like Kottur et al. (2017), our hyperparameter variations consider the number of vocab words $Q$ ($V_Q$) and $A$ ($V_A$) may utter and whether or not $A$ has memory between dialog rounds. The memoryless version of $A$ simply resets its memory ($h^A_t = 0$) between each round of dialog. This means $A$ cannot represent which attributes it has already communicated. When there are too many vocab words available there is less pressure to develop a compositional language because for every new object there is always an unused sequence of words which isn’t too similar to existing words, an effect also noticed elsewhere (Mordatch & Abbeel, 2018; Nowak et al., 2000). We add one setting where $A$ has no memory yet the number of vocab words is still overcomplete to help understand and disentangle these two factors. Specifically, we consider the following settings:

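  1. Small Vocab (Footnote 6: This is slightly different from Small Vocab in Kottur et al. (2017).): small vocabularies $V_Q$ and $V_A$

  2. Memoryless + Small Vocab: small vocabularies $V_Q$ and $V_A$, and $A$ has no memory between rounds

  3. Overcomplete: overcomplete vocabularies $V_Q$ and $V_A$

  4. Memoryless + Overcomplete: overcomplete vocabularies $V_Q$ and $V_A$, and $A$ has no memory between rounds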

All agents are trained for a fixed number of epochs with a batch size of 1000 (so 1 batch per epoch) using the Adam optimizer (Kingma & Ba, 2015) with learning rate 0.01. In the Multi Agent setting the population sizes $N_Q$ and $N_A$ are fixed. To decide when to stop we measure validation set accuracy averaged over all $Q$-$A$ pairs and choose the first population whose validation accuracy did not improve for 200k epochs. (Footnote 7: There are few objects in the environment, so each batch contains all objects and is an entire epoch.) This differs from Kottur et al. (2017), which stopped once train accuracy reached 100%. Furthermore, we do not mine negatives for each training batch.
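The stopping rule (pick the first population whose validation accuracy is not improved upon within a patience window) can be sketched as below. The function name and the list-of-accuracies interface are assumptions. In the text the patience is 200k epochs, while here it is counted in evaluations.

```python
def select_checkpoint(val_accuracies, patience):
    """Return the index of the first checkpoint not beaten within `patience` later evaluations.

    If no checkpoint goes unbeaten for that long, the best checkpoint seen is returned.
    """
    best_index = 0
    for index, accuracy in enumerate(val_accuracies):
        if accuracy > val_accuracies[best_index]:
            best_index = index
        elif index - best_index >= patience:
            break
    return best_index
```

For instance, `select_checkpoint([0.1, 0.3, 0.2, 0.25, 0.9], patience=2)` returns 1: accuracy 0.3 is not improved upon in the next two evaluations, so training is considered finished before the later value is ever reached.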


Two baselines help verify our approach:

  1. No Replacement. Never replace any agent (i.e., Algorithm 1 without lines 10-12).

  2. Replace All. Always replace every agent (i.e., the replacement set at line 11 of Algorithm 1 contains all agents).

Comparing to the No Replacement baseline establishes the main result by measuring the difference replacement makes. However, each time lines 11 and 12 of Algorithm 1 are executed there is one more chance of getting a lucky random initialization. Since the No Replacement baseline never does this, it has a smaller chance of running into one such lucky agent. Thus we compare to the Replace All baseline, which has the greatest chance of seeing a lucky initialization and thereby ensures that gains over the No Replacement baseline are not simply due to luck. In the Multi Agent setting we increased the replacement period $E$ from 5000 to 25000 because agents were slower to converge.

Figure 3: Test set accuracies (with standard deviations) are reported against our new harder dataset using models similar to those in Kottur et al. (2017). Our variations on cultural transmission outperform the baselines (lighter two green and lighter two blue bars) where language does not change over generations.

3.2 Impact of Cultural Transmission

Agent performance has a lot of variance, so we split the train data into 4 separate folds and perform cross-validation, averaging across folds as well as 4 different random seeds within each fold for a total of 16 runs per experiment. Results with standard deviations are reported in Figure 3 and p-values for all t-tests are reported in the supplement.

Cultural transmission induces compositionality.

First we consider variations in replacement strategy given each model type. Our main result is that cultural transmission approaches always outperform baseline approaches without cultural transmission. This can be seen by noting that the darker bars (cultural transmission) in Figure 3 are larger than the lighter bars (baselines). After running a dependent paired t-test against all pairs of baselines and cultural transmission approaches we find a significant difference in almost all cases (see supplement for all p-values). This is strong support for our claim that our version of cultural transmission encourages compositional language because it causes better generalization to novel compositions of attributes.
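For reference, the statistic behind a dependent (paired) t-test can be computed directly from per-run accuracy differences. This is a generic sketch using only the standard library; the function name is hypothetical, and turning the statistic into a p-value (for example with a statistics package) is left out.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples of equal length."""
    differences = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(differences)
    standard_error = stdev(differences) / math.sqrt(n)  # uses the sample standard deviation
    return mean(differences) / standard_error, n - 1
```

With 16 paired runs per experiment (4 folds times 4 seeds) the test has 15 degrees of freedom. Pairing matters here because runs that share a fold and seed are correlated, so the differences are much less noisy than the raw accuracies.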

Population dynamics without replacement do not consistently lead to compositionality.

The Multi Agent No Replacement policy is only significantly better than the Single Agent No Replacement policy for the Memoryless models, and not otherwise. It is somewhat surprising that this difference is not stronger, since having multiple agents in the environment is one of the factors found to lead to compositionality without using cultural (more specifically, generational) transmission in Raviv et al. (2018).

Variations in replacement strategy do not appear to significantly affect outcomes.

The Single Agent Random/Alternate replacement strategies are usually not significantly different from each other. The same is true for the Multi Agent Uniform Random/Epsilon Greedy/Oldest strategies. Significant differences only occur in the Overcomplete setting, where Single Agent Alternate outperforms Multi Agent Uniform Random, and in a few Small Vocab settings. This suggests that while some agent replacement needs to occur, it does not much matter whether agents with worse language are replaced or whether there is a pool of similarly typed agents to remember knowledge lost from older generations. The main factor is that new agents learn in the presence of others who already know a language.

Cultural transmission is complementary with other factors that encourage compositionality.

The models considered in Kottur et al. (2017) were ordered, from best to worst, as: Memoryless + Small Vocab > Small Vocab > Overcomplete. Our trends tend to agree with that conclusion though the differences are smaller, mainly when comparing the Memoryless + Small Vocab model to others in cultural transmission settings. Only in the Oldest setting are the differences all significant enough to completely establish the above ordering. This agrees with factors noted elsewhere (Kottur et al., 2017; Mordatch & Abbeel, 2018; Nowak et al., 2000).

Removing memory sometimes hurts.

Removing memory always makes a significant difference to Small Vocab models and only sometimes makes a difference for Overcomplete models. When it is significant, it helps Small Vocab models and hurts Overcomplete models. As the Memoryless + Overcomplete setting has not been reported before, these results suggest that the relationship between inter-round memory and compositionality is not clear.

Overall, these results show that adding cultural transmission to neural dialog agents improves the compositional generalization of the languages learned by those agents in a way complementary to other priors. It thereby shows how to transfer the cultural transmission principle from evolutionary linguistics to deep learning.

4 Cultural Transmission Analysis

Figure 4: How far apart are languages spoken by pairs of $A$s in a population? On the y axis ($D$, eq. (2)) lower values indicate similar language. Populations evolved with our method speak similar languages, but independently evolved agents do not. Thus our implicit procedure induces cultural transmission.
Figure 5: Learning curves measuring $D$. Lower values indicate more similar languages. Our populations converge to similar languages while baselines do not.

Unlike iterated learning (Kirby et al., 2014), cultural transmission is implicit in our model. It is indirect, so we would like to measure whether or not it is actually occurring, that is, whether languages are actually being transferred. $Q$ doesn’t know anything about the domain of objects, so instead of it directly teaching $A$ how to refer to an object (explicit cultural transmission), $Q$ can only learn how $A$ refers to an object from what $A$ says about that object. Either $A$ identified the purple filled square in a way that made $Q$ correctly identify it as a purple square or $A$ should refer to the purple square differently. By encouraging $A$ to say things that allow $Q$ to answer correctly, training aligns the meanings the two bots understand. This is how our bots transmit linguistic knowledge.

Because it is implicit, cultural transmission may not actually be occurring; improvements may be from other sources. How can we measure cultural transmission? We take a simple approach. We assume that if two bots ‘speak the same language’ then that language was culturally transmitted. There is a combinatorial explosion of possible languages that could refer to all the objects of interest, so if the words that refer to the same object for two agents are the same then they were very likely transmitted from the other agents, rather than suspiciously similar languages emerging from scratch by chance. This leads to a simple approach: consider pairs of bots and see if they say similar things in the same context. If they do, then their shared language was probably transmitted.

More formally, consider the distribution of tokens $A_i$ might use to describe its object $x_A$ when talking to $Q_k$: $p_{k,i}(m^A \mid x_A, m^Q)$, or $p_{k,i}$ for short. We want to know how similar $A_i$’s language is to that of another $A_j$. We start by comparing those two distributions by computing the KL divergence between them and then taking an average over context ($Q$s, questions, and objects) to get our pairwise agent language similarity metric $D_{ij}$:

$$D_{ij} = \mathbb{E}_{k,\, m^Q,\, x_A}\left[ D_{\mathrm{KL}}\left( p_{k,i} \,\|\, p_{k,j} \right) \right] \qquad (1)$$

Taking another average, this time over all pairs of bots, gives our final measure of language similarity reported in Figure 4:

$$D = \frac{1}{N_A (N_A - 1)} \sum_{i \neq j} D_{ij} \qquad (2)$$

This number should be smaller the more similar language is between bots. Note that even though $D_{ij}$ is not symmetric (because KL divergence is not), $D$ is symmetric because it averages over both directions of pairs.
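The similarity measure described above (a KL divergence averaged over contexts, then averaged over all ordered pairs of bots) can be sketched as follows. The representation of a language as a dict from context to token distribution, the function names, and the small `eps` smoothing term are assumptions of this sketch.

```python
import math
from itertools import permutations


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def pairwise_distance(language_i, language_j):
    """Average KL divergence over shared contexts (context -> token distribution)."""
    contexts = list(language_i)
    total = sum(kl_divergence(language_i[c], language_j[c]) for c in contexts)
    return total / len(contexts)


def population_distance(languages):
    """Average the pairwise distance over all ordered pairs of distinct agents."""
    pairs = list(permutations(range(len(languages)), 2))
    total = sum(pairwise_distance(languages[i], languages[j]) for i, j in pairs)
    return total / len(pairs)
```

Two agents with identical token distributions get distance 0, which is also why freshly initialized agents with near-uniform distributions look "similar" despite having no useful language. Averaging over ordered pairs makes the population-level number independent of agent order even though each pairwise term is asymmetric.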

We compute $D$ by sampling an empirical distribution over all messages and observations, taking 10 sample dialogues in each possible test state of the world using the final populations of agents as in Figure 3. Note that this metric applies to a group of agents, so we measure it for only the Multi Agent settings, including two new baselines colored red in Figure 4. The Single Agents Combined baseline trains 4 Single Agent No Replacement models independently then puts them together and computes $D$ for that group. These agents only speak similar languages by chance, so $D$ should be high. The Random Initialization baseline evaluates language similarity using newly initialized models. All agents have about a uniform distribution over words at every utterance, so their languages are both very similar and useless. These baselines act like a sort of practical (not strict) upper and lower bound on $D$, respectively.

As reported in Figure 4, languages are more similar for our models than for the No Replacement baseline. This provides evidence that cultural transmission is indeed occurring in our model.

Furthermore, in Figure 5 we plot $D$ as it changes during language evolution. Similar to the post-training results in Figure 4, we see that agents in the No Replacement and Single Agent strategies don’t learn the same language, but languages that evolve via implicit cultural transmission do converge to some high degree of similarity. Even multiple agents in the same environment without replacement tend towards more similar languages than they would have created otherwise (No Replacement vs Single Agents Combined).

A striking feature of Figure 5 is the oscillations, which are only apparent for Multi Agent Replace All. Each time an agent is replaced it causes the average language similarity to decrease because the newly initialized agent doesn’t know how to communicate with its companions. These oscillations also occur in the other replacement models, but in those cases only one agent was replaced and the period is shorter.

5 Related work

There are two main veins of related work. In the first, evolutionary linguistics explains compositionality with cultural evolution. In the second, neural language models create languages to communicate and achieve their goals.

Language Evolution Causes Structure.

Researchers have spent decades studying how unique properties of human language like compositionality could have emerged. There is general agreement that people acquire language using a combination of innate cognitive capacity and learning from other language speakers (cultural transmission), with the degree of each being widely disputed (Perfors, 2002; Pinker & Bloom, 1990). Most agree the answer is something in between. Both innate cognitive capacity and specific modern human languages like English coevolved (Briscoe, 2000) via biological (Pinker & Bloom, 1990) and cultural (Tomasello, 1999; Smith, 2006) evolution, respectively. Thus an explanation of these evolutionary processes is essential to explaining human language.

In particular, explanations of how the cultural evolution of languages themselves could cause those languages to develop structure like compositionality are in abundance (Nowak & Krakauer, 1999; Nowak et al., 2000; Smith et al., 2003; Brighton, 2002; Vogt, 2005; Kirby et al., 2014; Spike et al., 2017). An important piece of the explanation of linguistic structure is the iterated learning model (Kirby et al., 2014; Kirby, 2001; Kirby et al., 2008) described in Section 1. This model focuses on the cultural transmission and bottlenecks that restrict how languages can be learned, showing compositional language emerges in computational (Kirby, 2001, 2002; Christiansen & Kirby, 2003b; Smith et al., 2003) and human (Kirby et al., 2008; Cornish et al., 2009; Scott-Phillips & Kirby, 2010) experiments. Even though cultural transmission may aid the emergence of compositionality, recent results in evolutionary linguistics (Raviv et al., 2018) and deep learning (Kottur et al., 2017; Mordatch & Abbeel, 2018) show cultural transmission may not be necessary for compositionality to emerge.

While existing work in deep learning has focused on biases that encourage compositionality, it has not considered settings where language is permitted to evolve as it is passed down over generations of agents. We consider such a setting because of its potential to complement our existing understanding.

Language Emergence in Deep Learning.

Recent work in deep learning has increasingly focused on multi-agent environments where deep agents learn to accomplish goals (possibly cooperative or competitive) by interacting appropriately with the environment and each other. Some of this work has shown that deep agents will develop their own language where none exists initially if driven by a task which requires communication (Foerster et al., 2016; Sukhbaatar et al., 2016; Lazaridou et al., 2017). Most relevant is similar work which focuses on conditions under which compositional language emerges as deep agents learn to cooperate (Mordatch & Abbeel, 2018; Kottur et al., 2017). Both Mordatch & Abbeel (2018) and Kottur et al. (2017) find that limiting the vocabulary size, so that there are not too many more words than there are objects to refer to, encourages compositionality, which follows earlier results in evolutionary linguistics (Nowak et al., 2000). Follow-up work has continued to investigate the emergence of compositional language among neural agents, mainly focusing on perceptual as opposed to symbolic input and how the structure of the input relates to the tendency for compositional language to emerge (Choi et al., 2018; Havrylov & Titov, 2017; Lazaridou et al., 2018). Other work investigating emergent translation has shown that multi-agent interaction leads to better translation (Lee et al., 2018), but it does not measure compositionality.

Cultural Evolution and Neural Nets.

Some work has considered the evolution of ideas by cultural transmission using neural agents. Most recently, Bengio (2012) considers what benefit cultural evolution might bestow on neural agents. The main idea is that culturally transmitted ideas may provide an important mechanism for escaping local minima in large complex models. Experiments in Gülçehre & Bengio (2016) followed up on that idea and supported part of the hypothesis by showing that supervision of intermediate representations allows a more complex toy task to be learned. Unlike our work, these experiments use direct language supervision provided by the designed environment rather than indirect and implicit supervision provided by other agents.

6 Conclusion

In this work we investigated cultural transmission in deep conversational agents, applying it to language emergence. The evolutionary linguistics community has long used cultural transmission to explain how compositional languages could have emerged. The deep learning community, having recently become interested in language emergence, has not investigated that link until now. Instead of the explicit models of cultural transmission familiar in evolutionary linguistics, we favor an implicit model where language is transmitted from generation to generation only because it helps agents achieve their goals. We show that this does indeed cause cultural transmission and, more importantly, that it also causes the emerged languages to be more compositional.

Future work.

There is room for finding better ways to encourage compositional language. One very relevant approach would be to engineer more explicit models of cultural transmission and add other factors that encourage compositionality (as studied elsewhere in deep learning (Vedantam et al., 2018)). On the other hand, we would like even deep non-language representations to be compositional. Cultural transmission may provide an appropriate prior for those cases as well.


We would like to thank Satwik Kottur for code and comments as well as Karan Desai for additional code. We would also like to thank Douwe Kiela, Diane Bouchacourt, Sainbayar Sukhbaatar, and Marco Baroni for comments on earlier versions of this paper.

This work was supported in part by NSF, AFRL, DARPA, Siemens, Samsung, Google, Amazon, ONR YIPs and ONR Grants N00014-16-1-{2713,2793}. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government, or any sponsor.


  • Agrawal et al. (2018) Agrawal, Aishwarya, Batra, Dhruv, Parikh, Devi, and Kembhavi, Aniruddha. Don’t just assume; look and answer: Overcoming priors for visual question answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • Andreas et al. (2017) Andreas, Jacob, Dragan, Anca D., and Klein, Dan. Translating neuralese. In ACL, 2017.
  • Antol et al. (2015) Antol, Stanislaw, Agrawal, Aishwarya, Lu, Jiasen, Mitchell, Margaret, Batra, Dhruv, Lawrence Zitnick, C, and Parikh, Devi. VQA: Visual Question Answering. In ICCV, 2015.
  • Bahdanau et al. (2014) Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
  • Bengio (2012) Bengio, Yoshua. Evolving culture vs local minima. CoRR, abs/1203.2990, 2012.
  • Brighton (2002) Brighton, Henry. Compositional syntax from cultural transmission. Artificial Life, 8:25–54, 2002.
  • Briscoe (2000) Briscoe, Ted. Grammatical acquisition: Inductive bias and coevolution of language and the language acquisition device. In Language, volume 76. Linguistic Society of America, 2000.
  • Choi et al. (2018) Choi, Edward, Lazaridou, Angeliki, and de Freitas, Nando. Compositional obverter communication learning from raw visual input. In International Conference on Learning Representations (ICLR), 2018.
  • Christiansen & Kirby (2003a) Christiansen, Morten H and Kirby, Simon. Language evolution. OUP Oxford, 2003a.
  • Christiansen & Kirby (2003b) Christiansen, Morten H. and Kirby, Simon. Language evolution: consensus and controversies. Trends in cognitive sciences, 7(7):300–307, 2003b.
  • Cornish et al. (2009) Cornish, Hannah, Tamariz, Monica, and Kirby, Simon. Complex adaptive systems and the origins of adaptive structure: What experiments can tell us. Language Learning, 59(s1):187–205, 2009. doi: 10.1111/j.1467-9922.2009.00540.x. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9922.2009.00540.x.
  • Das et al. (2017a) Das, Abhishek, Kottur, Satwik, Gupta, Khushi, Singh, Avi, Yadav, Deshraj, Moura, José M. F., Parikh, Devi, and Batra, Dhruv. Visual dialog. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1080–1089, 2017a.
  • Das et al. (2017b) Das, Abhishek, Kottur, Satwik, Moura, José M.F., Lee, Stefan, and Batra, Dhruv. Learning cooperative visual dialog agents with deep reinforcement learning. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017b.
  • Foerster et al. (2016) Foerster, Jakob, Assael, Ioannis Alexandros, de Freitas, Nando, and Whiteson, Shimon. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2137–2145, 2016.
  • Gauthier & Mordatch (2016) Gauthier, Jon and Mordatch, Igor. A paradigm for situated and goal-driven language learning. CoRR, abs/1610.03585, 2016.
  • Gülçehre & Bengio (2016) Gülçehre, Çağlar and Bengio, Yoshua. Knowledge matters: Importance of prior information for optimization. Journal of Machine Learning Research, 17:8:1–8:32, 2016.
  • Havrylov & Titov (2017) Havrylov, Serhii and Titov, Ivan. Emergence of language with multi-agent games: Learning to communicate with sequences of symbols. In NIPS, 2017.
  • Kiela et al. (2016) Kiela, Douwe, Bulat, Luana, Vero, Anita L, and Clark, Stephen. Virtual embodiment: A scalable long-term strategy for artificial intelligence research. arXiv preprint arXiv:1610.07432, 2016.
  • Kingma & Ba (2015) Kingma, Diederik P. and Ba, Jimmy. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
  • Kirby (2001) Kirby, Simon. Spontaneous evolution of linguistic structure-an iterated learning model of the emergence of regularity and irregularity. IEEE Trans. Evolutionary Computation, 5:102–110, 2001.
  • Kirby (2002) Kirby, Simon. Natural language from artificial life. In Artificial Life, 2002.
  • Kirby et al. (2008) Kirby, Simon, Cornish, Hannah, and Smith, Kenny. Cumulative cultural evolution in the laboratory: An experimental approach to the origins of structure in human language. Proceedings of the National Academy of Sciences, 2008. ISSN 0027-8424. doi: 10.1073/pnas.0707835105. URL https://www.pnas.org/content/early/2008/07/29/0707835105.
  • Kirby et al. (2014) Kirby, Simon, Griffiths, Tom, and Smith, Kenny. Iterated learning and the evolution of language. Current Opinion in Neurobiology, 28:108–114, 2014.
  • Kirby et al. (2015) Kirby, Simon, Tamariz, Monica, Cornish, Hannah, and Smith, Kenny. Compression and communication in the cultural evolution of linguistic structure. Cognition, 141:87–102, 2015.
  • Kottur et al. (2017) Kottur, Satwik, Moura, José M. F., Lee, Stefan, and Batra, Dhruv. Natural language does not emerge ’naturally’ in multi-agent dialog. In EMNLP, 2017.
  • Lake & Baroni (2018) Lake, Brenden M. and Baroni, Marco. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In ICML, 2018.
  • Lazaridou et al. (2017) Lazaridou, Angeliki, Peysakhovich, Alexander, and Baroni, Marco. Multi-agent cooperation and the emergence of (natural) language. In International Conference on Learning Representations (ICLR), 2017.
  • Lazaridou et al. (2018) Lazaridou, Angeliki, Hermann, Karl Moritz, Tuyls, Karl, and Clark, Stephen. Emergence of linguistic communication from referential games with symbolic and pixel input. In International Conference on Learning Representations (ICLR), 2018.
  • Lee et al. (2018) Lee, Jason D., Cho, Kyunghyun, Weston, Jason, and Kiela, Douwe. Emergent translation in multi-agent communication. CoRR, abs/1710.06922, 2018.
  • Liu et al. (2016) Liu, Chia-Wei, Lowe, Ryan, Serban, Iulian, Noseworthy, Michael, Charlin, Laurent, and Pineau, Joelle. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In EMNLP, 2016.
  • Mikolov et al. (2016) Mikolov, Tomas, Joulin, Armand, and Baroni, Marco. A roadmap towards machine intelligence. In CICLing, 2016.
  • Mordatch & Abbeel (2018) Mordatch, Igor and Abbeel, Pieter. Emergence of grounded compositional language in multi-agent populations. In AAAI, 2018.
  • Nowak & Krakauer (1999) Nowak, Martin A. and Krakauer, David C. The evolution of language. Proceedings of the National Academy of Sciences of the United States of America, 96(14):8028–8033, 1999.
  • Nowak et al. (2000) Nowak, Martin A., Plotkin, Joshua B., and Jansen, Vincent A A. The evolution of syntactic communication. Nature, 404:495–498, 2000.
  • Perfors (2002) Perfors, Amy. Simulated evolution of language: a review of the field. J. Artificial Societies and Social Simulation, 5, 2002.
  • Pinker & Bloom (1990) Pinker, Steven and Bloom, Paul. Natural language and natural selection. Behavioral and brain sciences, 13(4):707–727, 1990.
  • Raviv et al. (2018) Raviv, Limor, Meyer, Antje, and Lev-Ari, Shiri. Compositional structure can emerge without generational transmission. Cognition, 182:151–164, 2018.
  • Real et al. (2017) Real, Esteban, Moore, Sherry, Selle, Andrew, Saxena, Saurabh, Suematsu, Yutaka Leon, Le, Quoc V., and Kurakin, Alex. Large-scale evolution of image classifiers. In ICML, 2017.
  • Scott-Phillips & Kirby (2010) Scott-Phillips, Thomas C. and Kirby, Simon. Language evolution in the laboratory. Trends in cognitive sciences, 14(9):411–417, 2010.
  • Smith (2006) Smith, Kenny. Cultural evolution of language. Encyclopedia of Language and Linguistics 2 Edition, 2:315–322, 2006.
  • Smith et al. (2003) Smith, Kenny, Kirby, Simon, and Brighton, Henry. Iterated learning: A framework for the emergence of language. Artificial Life, 9:371–386, 2003.
  • Spike et al. (2017) Spike, Matthew, Stadler, Kevin, Kirby, Simon, and Smith, Kenny. Minimal requirements for the emergence of learned signaling. In Cognitive Science, 2017.
  • Stanley & Miikkulainen (2002) Stanley, Kenneth O. and Miikkulainen, Risto. Evolving neural networks through augmenting topologies. Evolutionary Computation, 10:99–127, 2002.
  • Stanley et al. (2009) Stanley, Kenneth O., D’Ambrosio, David B., and Gauci, Jason. A hypercube-based encoding for evolving large-scale neural networks. Artificial Life, 15:185–212, 2009.
  • Sukhbaatar et al. (2016) Sukhbaatar, Sainbayar, Szlam, Arthur, and Fergus, Rob. Learning multiagent communication with backpropagation. In NIPS, 2016.
  • Tomasello (1999) Tomasello, Michael. The cultural origins of human cognition. Harvard university press, 1999.
  • Vedantam et al. (2018) Vedantam, Ramakrishna, Fischer, Ian, Huang, Jonathan, and Murphy, Kevin. Generative models of visually grounded imagination. In International Conference on Learning Representations (ICLR), 2018.
  • Vogt (2005) Vogt, Paul. The emergence of compositional structures in perceptually grounded language games. Artificial intelligence, 167(1-2):206–242, 2005.
  • Williams (1992) Williams, Ronald J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
  • Xu et al. (2015) Xu, Kelvin, Ba, Jimmy, Kiros, Ryan, Cho, Kyunghyun, Courville, Aaron C., Salakhutdinov, Ruslan, Zemel, Richard S., and Bengio, Yoshua. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.

Appendix A Replacement Strategies

Our approach to cultural transmission periodically replaces agents by re-initializing them. The approach section outlines various replacement strategies (the replacement policy), but does not detail their implementation. We do so here.

These strategies depend on a number of possible inputs:

  • the current epoch

  • the period of agent replacement

  • the validation accuracy of each agent, for Q-bots/A-bots. For Q-bots this is averaged over all potential A-bot partners, and vice versa for A-bots.

  • the age in epochs of each agent, for Q-bots/A-bots

Single Agent strategies are given in Algorithms 2 and 3. Multi Agent strategies are given in Algorithms 4, 5, and 6. Note that Single Agent strategies always replace one agent while Multi Agent strategies always replace one Q-bot and one A-bot. An additional Replace All baseline strategy is given in Algorithm 7 and generalizes to both Single and Multi Agent cases.

Algorithm 2 Single Agent - Random Replacement

Algorithm 3 Single Agent - Alternate Replacement

Algorithm 4 Multi Agent - Uniform Random Replacement

Algorithm 5 Multi Agent - Epsilon Greedy Replacement (takes an additional input, usually 0.2; the greedily selected agents were unique in our experiments)

Algorithm 6 Multi Agent - Oldest Replacement

Algorithm 7 Single/Multi Agent - Replace All
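The replacement strategies named in Algorithms 2 through 7 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation. The `Agent` class, the `due` helper, and all function names are hypothetical. The sketch also assumes that replacement fires whenever the epoch is a positive multiple of the period, that the epsilon greedy strategy replaces the least accurate agent of each role with probability 1 − ε and samples uniformly otherwise, and that Replace All follows the same schedule. Each function only selects the agents to re-initialize; the re-initialization itself is left out.

```python
import random
from dataclasses import dataclass


@dataclass
class Agent:
    """Hypothetical stand-in for a Q-bot or an A-bot."""
    name: str
    val_acc: float = 0.0  # validation accuracy, averaged over potential partners
    age: int = 0          # epochs since the agent was last (re-)initialized


def due(epoch, period):
    """Assumed schedule: replace on every positive multiple of `period`."""
    return epoch > 0 and epoch % period == 0


def single_random(epoch, period, qbot, abot, rng=random):
    """Single Agent - Random: pick one of the two agents uniformly."""
    return [rng.choice([qbot, abot])] if due(epoch, period) else []


def single_alternate(epoch, period, qbot, abot):
    """Single Agent - Alternate: switch roles on successive replacements."""
    if not due(epoch, period):
        return []
    return [qbot] if (epoch // period) % 2 == 1 else [abot]


def multi_uniform(epoch, period, qbots, abots, rng=random):
    """Multi Agent - Uniform Random: one Q-bot and one A-bot, sampled uniformly."""
    if not due(epoch, period):
        return []
    return [rng.choice(qbots), rng.choice(abots)]


def multi_epsilon_greedy(epoch, period, qbots, abots, eps=0.2, rng=random):
    """Multi Agent - Epsilon Greedy (assumed form): with probability 1 - eps
    pick the least accurate agent of each role, otherwise sample uniformly."""
    if not due(epoch, period):
        return []
    if rng.random() < eps:
        return [rng.choice(qbots), rng.choice(abots)]
    return [min(qbots, key=lambda a: a.val_acc),
            min(abots, key=lambda a: a.val_acc)]


def multi_oldest(epoch, period, qbots, abots):
    """Multi Agent - Oldest: the oldest Q-bot and the oldest A-bot."""
    if not due(epoch, period):
        return []
    return [max(qbots, key=lambda a: a.age), max(abots, key=lambda a: a.age)]


def replace_all(epoch, period, qbots, abots):
    """Replace All baseline: every agent of both roles."""
    return list(qbots) + list(abots) if due(epoch, period) else []


# Usage with made-up agents: at epoch 10 with period 5, replace the oldest pair.
qbots = [Agent("Q0", val_acc=0.9, age=5), Agent("Q1", val_acc=0.4, age=12)]
abots = [Agent("A0", val_acc=0.7, age=3), Agent("A1", val_acc=0.8, age=9)]
print([a.name for a in multi_oldest(10, 5, qbots, abots)])  # ['Q1', 'A1']
```

Returning the selected agents rather than mutating them keeps the selection rule separate from whatever re-initialization the training loop performs, so the same loop can swap strategies by swapping one function.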

Appendix B Detailed Results

In our experiments we compare models and we compare replacement strategies. We ran dependent paired t-tests across random seeds, cross-val folds, and replacement strategies to compare models. We ran dependent paired t-tests across random seeds, cross-val folds, and models to compare replacement strategies. The p-values for all of these t-tests are reported here.
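As an illustration of the dependent paired t-test used for these comparisons, the sketch below computes the t statistic and a two-sided p-value from two matched lists of scores using only the Python standard library. It is a hedged sketch, not the analysis code behind the reported values: the function names are hypothetical, the p-value is obtained by numerically integrating the Student's t density with Simpson's rule, and the accuracy values in the usage lines are made up for illustration. In practice a library routine such as `scipy.stats.ttest_rel` would typically be used instead.

```python
import math
import statistics


def t_pdf(x, df):
    """Density of Student's t distribution with `df` degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 minus the mass in [-|t|, |t|].

    The mass in [0, |t|] is integrated with Simpson's rule (`steps` must be even)
    and doubled by symmetry.
    """
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2 * (h / 3) * total
    return max(0.0, 1.0 - central_mass)


def paired_t_test(scores_a, scores_b):
    """Return (t, p) for matched samples, e.g. the same seed and fold under two settings."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance")
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, two_sided_p(t, n - 1)


# Made-up accuracies for two settings, paired by (seed, fold); for illustration only.
with_replacement = [0.71, 0.68, 0.74, 0.70]
without_replacement = [0.65, 0.66, 0.69, 0.64]
t_stat, p_value = paired_t_test(with_replacement, without_replacement)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

Pairing matters here because each seed and fold contributes one difference; the test asks whether the mean of those differences is distinguishable from zero, which removes variance shared by both settings within a pair.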

Replacement strategy comparisons are in Figure 6. Model comparisons are in Figure 7 (Single Agent) and Figure 8 (Multi Agent).

Figure 6: Replacement strategy comparison p-values.

Figure 7: Single Agent model comparison p-values.

Figure 8: Multi Agent model comparison p-values.