Emergent Linguistic Phenomena in Multi-Agent Communication Games

01/25/2019 ∙ Laura Graesser et al. ∙ Facebook AI Research and New York University

In this work, we propose a computational framework in which agents equipped with communication capabilities simultaneously play a series of referential games, where agents are trained using deep reinforcement learning. We demonstrate that the framework mirrors linguistic phenomena observed in natural language: i) the outcome of contact between communities is a function of inter- and intra-group connectivity; ii) linguistic contact either converges to the majority protocol, or in balanced cases leads to novel creole languages of lower complexity; and iii) a linguistic continuum emerges where neighboring languages are more mutually intelligible than farther removed languages. We conclude that intricate properties of language evolution need not depend on complex evolved linguistic capabilities, but can emerge from simple social exchanges between perceptually-enabled agents playing communication games.

1 Introduction

Contact linguistics, the field that studies what happens when two or more languages or language varieties interact, poses several pertinent open questions that are difficult to answer: how does symmetric (“mutually intelligible”) communication emerge; how do languages behave under contact; how does one language come to dominate another; how and why does extensive language contact tend to lead to simplification (e.g., in creoles); and how does a linguistic continuum come about, in which neighboring languages are more intelligible than farther removed ones? In this work, we show that such linguistic phenomena emerge naturally given a few general assumptions about the organizational structure of networks of artificial agents equipped with a minimalistic form of learned communication.

Studying language change in vivo is challenging, since it requires simultaneous observation of speaker and community interactions Brooks and Ragir (2008); Trudgill (1974); Joseph (2017); Christiansen and Kirby (2003), while carefully controlling for purposes and goals Winograd (1971); Flores and Winograd (1987); Nowak and Krakauer (1999). Studies of language emergence and evolution must furthermore be conducted over long periods of time, spanning decades or even centuries. Even Nicaraguan Sign Language, which emerged remarkably rapidly, took several decades to develop fully Senghas et al. (2005). Language itself, moreover, never ceases to evolve Fishman (1964).

Advances in computer science have provided us with opportunities for instead investigating the emergence and evolution of languages in vitro using computational and mathematical models R. Hurford (1989); Briscoe (2002); Kirby (2002); Christiansen and Kirby (2003); Lewis (2008); Skyrms (2010). In computational approaches, communities of agents, equipped with the ability to communicate, are deployed in a simulated environment. Their communication protocol either evolves or is learned, in order to maximize some reward provided by the environment. The agents’ behaviors, together with their communication, are observed and compared against known linguistic phenomena or hypothesized linguistic theories. Using computational methods, one can precisely control linguistic, environmental and algorithmic variables, enabling a thorough examination of the cultural and communicative aspects of language.

We propose a new multi-agent framework for studying the emergence and evolution of language, where agents are neural networks endowed with the ability to exchange messages about their perceptual input. Computational multi-agent models are characterized by the complexity of the agents, the choice of learning algorithm, and the design of the environment and reward structure. The complexity of an artificial agent ranges from a set of simple difference equations Grouchy et al. (2016), to a CPU-like architecture with an instruction set and registers Knoester et al. (2007), to a co-occurrence matrix between objects and symbols Nowak and Krakauer (1999), to a simple single-layer neural network Trianni and Dorigo (2006), to a deep neural network Lazaridou et al. (2016); Foerster et al. (2016); Jorge et al. (2016).

The learning algorithm is either a variant of evolutionary algorithms Nowak and Krakauer (1999); Kirby (2002); Grouchy et al. (2016), often used in the framework of Artificial Life Bedau (2003), or a gradient-based optimization algorithm, as often used for training deep neural networks with a supervised or reinforcement learning objective function. The former sees generations developing complex behavior over time, while the latter enables more sophisticated agents thanks to recent advances in deep learning LeCun et al. (2015). Recent years have seen intriguing new results in emergent communication, starting with Lazaridou et al. (2016) and Foerster et al. (2016), using such neural agents Lewis et al. (2017); Havrylov and Titov (2017); Jorge et al. (2016); Evtimova et al. (2018); Das et al. (2018); Cao et al. (2018). Often, these approaches can be framed as special or generalized cases of Lewis’s signalling game (Lewis, 2008, see Ch. 4–5 thereof). Here, deep learning agents play games within communities of similar agents, where the subject of the reference game is the agents’ perceptual input.

In what follows, we first introduce the multi-agent communication framework. We subsequently describe several linguistic phenomena that emerge within the framework. First, we investigate linguistic behavior on the agent-level, and examine when symmetric communication emerges within a linguistic community, as well as how the topological organization of agents within communities impacts their convergence and learning. We then switch to the community-level, where we examine the behavior of communities when they come into contact, as well as how community-level topologies impact convergence, success rate and mutual intelligibility.

We demonstrate that the following linguistic behaviors emerge, which correspond to known linguistic phenomena in human languages: 1) the outcome of contact is a function of inter- and intra-group connectivity, i.e., languages become mutually intelligible through contact, even for agents that have not themselves been exposed to the other language, provided there is sufficient connectivity between communities; 2) linguistic contact over time either converges to the dominant majority protocol, leading to the other language becoming extinct, or, if the communities are balanced, gives rise to an original “creole” protocol of lower complexity than the original languages; 3) a linguistic continuum emerges, where neighboring languages are more mutually intelligible than farther removed languages, and the topology of the continuum governs its behavior. To our knowledge, this work constitutes the first attempt at studying contact linguistic phenomena using communities of latest-generation deep neural agents capable of dealing with rich perceptual input; and it is the first to show that such intricate properties of language evolution need not depend on intrinsic properties of highly complex evolved linguistic capabilities, but instead can emerge purely from social exchanges between perceptually-enabled agents with simple communicative capabilities.

Figure 1: (a) Example training data. Only a random half of each image (dark background) is presented to one agent, necessitating communication in order to solve the game. (b) A graphical illustration of the proposed game. Each of two agents observes a partition of an input image and decides which of ten textual captions best describes the entire image, before and after exchanging messages with the other agent. (c) The internal structure of an agent. The structure is modular in that each sub-network can be replaced by an alternative without requiring any change in other parts of the proposed framework. (d) The chance of both agents correctly guessing the answer depends strongly on the choice of reward function: (left) it is important to reward collective rather than individual success in order for the two agents to collaborate; (right) the success rate is further improved when we explicitly reward the accuracy improvement obtained from communication.

2 Methods

In order to study emergent linguistic phenomena in a simplified but realistic setting, the communication game needs to have several properties. First, it should be symmetric, in that all agents should be able to act as “speaker” and “listener”. Second, the agents should communicate about something external to themselves, i.e., about the sensory experience of something in their environment. Third, the world should be partially observable, implying that communication is required for solving the game successfully. Fig. 1 shows example training data, the game setup and agent design. Please refer to the supplementary material for details.

Let $\mathcal{G}$ be a multi-agent communication game, consisting of communities of agents $a \in \mathcal{A}$, given environmental observations $o \in \mathcal{O}$, communicating across the bidirectional message channel $\mathcal{M}$, where the community membership of agents is defined as a graph $\mathcal{C} = (\mathcal{A}, \mathcal{E})$ with the agents as vertices and weighted edges $e_{ij} \in \mathcal{E}$ that determine whether two agents are connected, i.e., $\mathcal{E}$ specifies the topology of the network. Each agent is a deep neural network that can handle rich sensory inputs $o$, such as raw images, as well as natural language text. Pairs of connected agents learn to play a game with a reward structure $\mathcal{R}$, designed specifically to require communication-based collaboration. The exchange of information through the communication channel $\mathcal{M}$ can take any form. By learning to play the game, the agents develop a shared communication protocol which is triadic in nature (i.e., about the observations). This framework allows us to control for proximity constraints, population size and degree of interaction through the underlying weighted graph structure. Moreover, we can explicitly specify the common purpose of the agents through the reward structure, which gives control over the information that must be exchanged between a pair of agents to successfully solve a problem.
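To make the graph formulation concrete, here is a minimal sketch in Python (not the authors' released code) of a community as a weighted interaction graph; the names (`Community`, `sample_pair`) are illustrative.

```python
import random

class Community:
    def __init__(self, num_agents):
        self.agents = list(range(num_agents))
        self.edges = {}  # (i, j) -> interaction weight e_ij

    def fully_connect(self, weight=1.0):
        # Intra-community topology: every pair of agents can interact.
        for i in self.agents:
            for j in self.agents:
                if i < j:
                    self.edges[(i, j)] = weight

    def sample_pair(self):
        # Pairs are drawn with probability proportional to their edge
        # weight, so the weighted graph fully specifies who talks to
        # whom, and how often.
        pairs = list(self.edges.keys())
        weights = [self.edges[p] for p in pairs]
        return random.choices(pairs, weights=weights, k=1)[0]

community = Community(num_agents=10)
community.fully_connect()
i, j = community.sample_pair()  # these two agents play one reference game
```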

During training, a pair of agents corresponding to adjacent vertices, $a_i$ and $a_j$, is selected at random according to the interaction weights $e_{ij}$. The agents then play the reference game and are updated following the specified reward structure. The community structure can change during training. For instance, we are able to merge separately trained linguistic communities into a single community and finetune the agents from both communities (i.e., bring the communities into contact) in order to investigate linguistic contact. Once training of the agents in a linguistic community is done, we can ignore the graph structure underlying the community and test pairs of distant agents to understand the distribution of emergent communication protocols.
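Bringing two pretrained communities into contact can then be sketched as adding inter-community edges at random, reusing the `Community` class from the sketch above; `p_inter`, `w_inter` and `w_intra` are hypothetical names for the connectivity chance and the edge weights.

```python
import random

def bring_into_contact(comm_a, comm_b, p_inter, w_inter, w_intra):
    # Merge two pretrained communities into one interaction graph. Each
    # cross-community pair becomes an edge with probability p_inter, so a
    # subset of agents ends up acting as "bridge" agents.
    merged = Community(0)
    offset = len(comm_a.agents)
    merged.agents = comm_a.agents + [a + offset for a in comm_b.agents]
    for (i, j) in comm_a.edges:
        merged.edges[(i, j)] = w_intra
    for (i, j) in comm_b.edges:
        merged.edges[(i + offset, j + offset)] = w_intra
    for i in comm_a.agents:
        for j in comm_b.agents:
            if random.random() < p_inter:
                merged.edges[(i, j + offset)] = w_inter
    return merged
```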

3 Results

3.1 Emergence of Symmetric Linguistic Protocols

We first examine whether having just two autonomous agents is sufficient for developing a common communication protocol. That is, we ask whether a symmetric language emerges when there are only two agents in a linguistic community, or whether they learn to speak distinct idiolects to each other. In order to answer this question, we formally define “mutual intelligibility” of the communication protocol as the ability of each agent to play the game against itself. If a shared communication protocol has emerged, an agent would not have any trouble playing a game with itself (she “understands” what she “says” and “says” what she “understands”). We run five simulations with random initialization and examine the success rate between two agents averaged over all test examples, where the agents succeed on an example only if both of them correctly guess the answer after communication.
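Concretely, this evaluation can be sketched as follows, assuming a `play_game(agent_a, agent_b, example)` helper that returns whether both agents guessed correctly after communication (a hypothetical interface, not the released API):

```python
from statistics import mean

def success_rate(agent_a, agent_b, examples):
    # Fraction of test examples on which BOTH agents guess the correct
    # caption after the message exchange.
    return mean(play_game(agent_a, agent_b, ex) for ex in examples)

# Self-play: an agent plays both roles against itself; a symmetric protocol
# should make this no harder than cross-play between distinct agents.
self_play = mean(success_rate(a, a, test_examples) for a in agents)
cross_play = mean(success_rate(a, b, test_examples)
                  for a in agents for b in agents if a is not b)
```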

As shown in Fig. 2 (a), two agents can solve the problem with a high success rate when they play with each other (cross-play). However, the success rate drops to random chance (10%) when each agent plays with itself (self-play). This clearly suggests that the emergent communication protocol is not symmetric: one agent develops its own protocol and the other adapts to it, leading to two incompatible idiolects. We then run additional experiments with three, four, five, eight and ten agents, where every pair of agents interacts with equal interaction intensity (the success rate is averaged over all possible pairs of agents, unless stated otherwise). We notice that the success rates under self-play and cross-play are indistinguishable from each other, strongly implying that a common, shared language emerges as a social convention if and only if there are more than two language users. Importantly, this finding demonstrates that, at least within this framework, it is not necessary to equip an agent with an innate mechanism that ties listening and speaking, such as the obverter technique Oliphant and Batali (1997); Choi et al. (2018), nor with any explicit community-wide coordination. All that is needed for a common language to emerge is a minimum number of agents.

Figure 2: (a) Six communities of two, three, four, five, eight and ten agents each were trained. Once training is done, we compute the success rates under self-play and cross-play. The two are indistinguishable for communities of three or more agents, implying that a common, shared language emerges as a social convention once there are more than two language users. In other words, at least three agents are necessary for the emergent protocol to be symmetric without any specialized mechanism enforcing symmetry. (b) The average number of plays per agent until a pair of agents in a linguistic community first reaches a 70% success rate stays approximately constant with respect to the community size. (c) The number of plays required for the success rate, averaged across all agent pairs in a community, to reach 75% also stays approximately constant with respect to the community size. (d) Each agent learns at approximately the same rate regardless of the community size. These observations suggest that the shared protocol emerges incrementally, in a distributed rather than centralized way. Each success rate was averaged over five runs in all cases.

We observe no detrimental effect from increasing the number of agents per linguistic community, even though more agents have to agree on a single protocol. As shown in Fig. 2 (b–c), it takes approximately 60,000–65,000 plays per agent for us to observe the first instance of a pair of agents reaching a success rate of 70%, regardless of the community size. Similarly, it takes approximately 150,000–200,000 plays per agent for the success rate averaged over all pairs of agents in a community to reach 75%, again regardless of the size of the community. Surprisingly, the speed at which each agent learns stays constant with respect to the community size, as shown in Fig. 2 (d). That is, we were not able to observe any correlation between the community size and the linguistic convergence of the entire community. This is in contrast with observations in the naming game Baronchelli et al. (2008), in which the convergence time increases with the number of agents, suggesting that the behavior of linguistic communities depends heavily on individual agents’ capabilities and learning settings, and that further investigation, both theoretical and empirical, is warranted in this setup with sophisticated agents.

Figure 3: (a) Two linguistic communities come into contact. The solid lines correspond to intra-community interactions, and the dashed lines to inter-community ones. The bridge agents are marked in bold. (b) Two communities of ten agents each came into contact, with a fixed ratio of learning frequencies and inter-group connectivity, after being separately trained in isolation. The bridge agents, who interact with the agents from the other community, learn the new, shared emergent protocol faster and better. All the other agents, however, also rapidly learn to communicate with the agents from the other community, although they never interact directly with them. The learning curves are averaged over five runs. (c) This contour plot visualizes the success rate after 200,000 plays following contact between two linguistic communities, while varying the ratio of learning frequencies and the inter-group connectivity (linearly interpolated from 15 experiments). We observe that the success rate, which measures the level of convergence of the two protocols, requires a certain minimum level of inter-group connectivity. Even when the inter-group connectivity is high enough, the bridge agents must interact sufficiently often with the agents from the other community for the converged protocol to be well understood by the agents from both communities.

3.2 Convergence of Linguistic Protocols via Contact

We next examine what happens when we expose different linguistic communities to each other. Specifically, we consider two linguistic communities of population sizes $N_1$ and $N_2$, which are trained independently from each other as fully-connected communities and have developed separate communication protocols. We bring these two communities into contact with each other by adding a new set of inter-community edges, each included with a fixed probability, to form a new linguistic community. We assign one weight to all the inter-group edges and another weight to all the intra-group edges. We can then examine how “interaction intensity” relates to language shift. See Fig. 3 (a) for an illustration of two linguistic communities making contact.

We first investigate two communities of identical population sizes ($N_1 = N_2 = 10$), with a fixed ratio between the learning frequencies of the intra-group pairs and the inter-group pairs, determined by the numbers of intra- and inter-community connections, and a fixed inter-group connectivity chance. We notice in Fig. 3 (b) that the bridge agents learn to communicate better (evident from the higher success rate among themselves), but that the other agents quickly catch up (according to the success rate among themselves, excluding the bridge agents), although these other agents never directly interact with agents from the other community.

This finding demonstrates a rapid shift toward a common protocol in both groups: all agents learn to speak a shared language, regardless of whether they actually interact with agents from the other group.

Having established that linguistic contact leads to convergence of the communication protocols, we delve deeper into the impact of two major parameters: the ratio of inter- to intra-group interaction frequency and the inter-group connectivity probability. We first varied the ratio of the learning frequencies of the intra-group and inter-group pairs while fixing the inter-group connectivity. After 200,000 plays, the setting with more frequent inter-group interaction converged to a more tightly coupled linguistic community, achieving a 65.6% success rate between agents that never interacted with each other. On the other hand, when the inter-group interaction occurred only half as frequently as the intra-group interaction, the agents from the two groups could play together with a much lower 52.4% success rate. We observed similar patterns over many different combinations of the ratio and the inter-group connectivity. For example, we varied the inter-group connectivity over five increasing values while the interaction ratio was fixed. After 200,000 plays, the success rates, averaged over all possible inter-group pairs, reached 42.1%, 51.1%, 65.55%, 66.65% and 66.3%, respectively. This implies that there is a critical level of inter-group connectivity after which language propagation saturates.

In Fig. 3 (c), we plot the interplay between the ratio and the inter-group connectivity after interpolating from the fifteen experiments varying these parameters. This demonstrates that both parameters are important in determining the level of linguistic convergence.

3.3 Birth of a New Language: Emergence of a Contact Language

We investigate the effect of the population size ratio between two linguistic communities when they come into contact; that is, we study how population size is a factor in one language coming to “dominate” another upon contact. We vary the ratio $N_2/N_1$ by fixing $N_1$ and varying $N_2$. Each community is pretrained in isolation to develop its own protocol before coming into contact with the other. The interaction ratio and the inter-group connectivity chance are fixed.

We refer to the original protocols of the two communities right after pretraining as $L_1$ and $L_2$. Each of these is then evolved further after the two communities come into contact, resulting in $L_1'$ and $L_2'$. The previous experiment on linguistic contact suggests that $L_1' \approx L_2'$, based on the fact that agents from both communities can successfully play the game after coming into contact, so we refer to the final protocol as $L'$. We examine how similar $L'$ is to either of the original protocols, $L_1$ or $L_2$. This similarity is measured by letting an agent using $L_1$ or $L_2$ play against one using $L'$, which is naturally facilitated by the proposed framework. This historical self-play accuracy reflects the similarity of the original and final protocols.
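Using the same success-rate machinery sketched earlier, the historical self-play measurement amounts to playing frozen pretrained agents against agents finetuned after contact (variable names are illustrative):

```python
from statistics import mean

# pretrained_1 / pretrained_2: frozen snapshots of agents speaking L1 / L2;
# contact_agents: the same agents after finetuning in the merged community.
sim_to_L1 = mean(success_rate(old, new, test_examples)
                 for old in pretrained_1 for new in contact_agents)
sim_to_L2 = mean(success_rate(old, new, test_examples)
                 for old in pretrained_2 for new in contact_agents)
```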

Figure 4: (a) By plotting the divergence of the common emergent protocol after linguistic contact between two communities, we observe that it either converges to the majority protocol or settles in between the two original protocols. (b) Varying the population ratio makes clear that a near-perfect balance between the two communities is necessary for a novel contact protocol to emerge rather than domination by the majority protocol. Each data point was averaged over five runs.

When the population ratio deviates from one, we observe that the final protocol $L'$ rapidly converges to the majority protocol, evident from the near-perfect historical self-play accuracy against the majority protocol and the near-chance accuracy against the minority protocol in Fig. 4. This is a consequence of the fact that members from both communities are rewarded for cooperating and playing the game well (via the bridge agents). In other words, the agents prefer to integrate or assimilate rather than segregate, similar to how minority groups have been found to shift “toward the use of dominant language” Fase et al. (1992). On the other hand, when the population ratio is close to or exactly one, both historical self-play accuracies are significantly above chance. It is then impossible to identify either $L_1$ or $L_2$ as the ancestor of $L'$; rather, $L'$ is a combination of the two original protocols, which is “a key feature defining a contact language” (see Chapter 10 of Matras (2009)). Both of these observations suggest the potential of the proposed framework for simulating and understanding the birth and death of languages via linguistic contact.

Figure 5: (a) The success rate evolves similarly in terms of the number of plays per agent when two communities of varying populations come into contact. (b) We nevertheless observe significantly different levels of complexity depending on the population ratio. The complexity is generally lower when the sizes of the two communities are close to balanced (10-10 and 10-9), while it does not decrease as much when there is a significant size imbalance between the two communities (10-4 and 10-3). In all cases, the agents were finetuned until the average success rate reached a fixed threshold.

We further investigate the complexity of the contact language arising from two linguistic communities coming into contact. We define complexity based on the uncertainty of an agent when generating a message, measured by the entropy of the message distribution. Higher entropy indicates that agents can express states in many different ways: in other words, the more complex a language, the higher its degree of freedom. We then average these message entropies over the agents in each linguistic community (consisting of two clusters) in order to characterize the complexity of the learned communication protocol.
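Assuming, as in Appendix C, a factorized distribution over $D$-dimensional binary messages, this complexity measure can be sketched in PyTorch as the sum of per-bit binary entropies, averaged over observations:

```python
import torch

def message_entropy(probs, eps=1e-8):
    # probs: (batch, D) per-bit Bernoulli probabilities from the sender.
    # For a factorized binary message, the entropy is the sum of the
    # per-bit binary entropies; we average over observations.
    p = probs.clamp(eps, 1 - eps)
    per_bit = -(p * p.log() + (1 - p) * (1 - p).log())
    return per_bit.sum(dim=-1).mean()
```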

We take four settings from the previous experiments (population pairs 10-10, 10-9, 10-4 and 10-3) to investigate the evolution of linguistic complexity. In all cases, we observe in Fig. 5 (b) that the complexity decreases when two communities come into contact with each other. This observation is in agreement with the similar phenomenon of structural simplification in creole languages, which are understood to arise from the contact of two or more languages Parkvall (2008); Bakker et al. (2011). We also observe that the complexity plateaus earlier when there is a larger imbalance between the two communities’ populations (10-3 and 10-4), while it drops further with more balanced communities (10-9 and 10-10). This implies that new-born contact languages arising from the contact of two similarly-sized communities tend to be substantially simpler.

Figure 6: Protocol similarities among five communities: (a) chain 5-5-5-5-5; (b) chain 5-5-10-5-5; (c) the difference (b)-(a); (d) dense 5-5-5-5-5; (e) dense 5-5-10-5-5. Although neighbouring communities exhibit high protocol similarity, agents from distant communities are not mutually intelligible. When there is a larger community in the middle of the chain, we observe higher levels of protocol similarity among the communities in the chain, evident from the higher similarity between $C_2$ and $C_4$ in (b) than in (a); to a lesser degree, this is observable between $C_2$ and $C_5$ and between $C_1$ and $C_4$. The difference plotted in (c) indicates that the protocols near the center become more similar to each other when the center community is larger, at the expense of intelligibility between the furthest-removed communities. Each observation was averaged over three runs in which the group of agents used to initialize each cluster was varied. In contrast, we do not observe such a linguistic continuum when the communities are densely connected, as in (d) and (e).

3.4 A Chain of Communities: Linguistic Continuum

We generalize the previous setting to $K$ linguistic communities in a connected chain. We start by pretraining $K$ linguistic communities of populations $N_1, \ldots, N_K$, each evolving a distinct communication protocol. We then chain them such that each consecutive pair, $C_k$ and $C_{k+1}$, comes into contact with a pre-specified inter-group connectivity chance and interaction ratio, and we train all of the communities jointly. This setup allows us to study the emergence of a linguistic continuum, which is highly relevant to the dialect continua found in natural languages, e.g., in the North Germanic dialects of Scandinavia, forming a chain from the Swedish dialects of Finland through Swedish in Sweden, Danish and Norwegian to Icelandic Crystal (1987). Often, speakers from neighboring linguistic communities on such a continuum can understand one another, while those from communities separated by many intermediate ones cannot communicate.
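A sketch of the chain construction, reusing the `Community` class and edge-weight conventions from Section 2 (names are illustrative):

```python
import random

def build_chain(communities, p_inter, w_inter, w_intra):
    # Chain pretrained communities so that ONLY consecutive communities
    # C_k and C_{k+1} share inter-group edges.
    chain = Community(0)
    offsets = []
    for comm in communities:
        offsets.append(len(chain.agents))
        chain.agents += [a + offsets[-1] for a in comm.agents]
        for (i, j) in comm.edges:
            chain.edges[(i + offsets[-1], j + offsets[-1])] = w_intra
    for k in range(len(communities) - 1):
        for i in communities[k].agents:
            for j in communities[k + 1].agents:
                if random.random() < p_inter:
                    chain.edges[(i + offsets[k], j + offsets[k + 1])] = w_inter
    return chain
```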

We start by considering a chain of five communities of equal population ($N_k = 5$). As plotted in Fig. 6 (a), we clearly observe the emergence of a linguistic continuum. The agents from a pair of adjacent communities can communicate with each other almost as well as those within a single community, while communicability rapidly degrades as the distance between a pair of communities grows (off-diagonal). The agents from $C_1$ and $C_5$ cannot understand each other at all, achieving a near-chance success rate. A similar continuum is observed when we increase the population of the center community two-fold (5 to 10). This continuum, however, exhibits properties different from the original chain of equal-sized communities: the center communities $C_2$, $C_3$ and $C_4$ become more tightly coupled, as evident from the higher success rates among them in Fig. 6 (b–c). This, however, happens at the cost of communicability between the agents from the furthest-removed communities.

In order to verify that the emergence of such a continuum is due to the topological properties of the communities, we show the protocol similarities among five densely connected communities in Fig. 6 (d–e). Unlike chaining, the densely connected topology ensures that every pair of communities comes into contact. As expected, the protocol similarity between any pair of communities is uniformly high, confirming that the linguistic continuum arises from the topology.

4 Discussion

We have proposed a new framework for the large-scale investigation of complex linguistic phenomena via multi-agent communication games, which enables the analysis of communication protocols learned by linguistic communities of trainable agents. Our framework contrasts with previous work in the complexity of the artificial agents, trained with latest-generation deep reinforcement learning, as well as in their ability to handle rich sensory signals. We observed that a symmetric communication protocol emerges without any innate, explicit mechanism built into an agent, whenever there are three or more agents in a linguistic community.

We then demonstrated the emergence of linguistic phenomena in this framework. First, the result of linguistic contact between communities is determined by inter- and intra-group connectivity: given sufficient inter-group connectivity, languages become mutually intelligible through contact, even for agents that have not themselves been exposed to the other language. Second, linguistic contact over time either converges to the dominant majority protocol, leading to the extinction of the other language, or, if the communities are balanced, gives rise to an original “creole” protocol of lower complexity than the original languages. Third, a linguistic continuum emerges, where neighboring languages are more mutually intelligible than farther removed languages. The topology of the continuum governs its behavior: a dominant central language causes its neighbors to lose mutual intelligibility with communities that are not directly exposed to that central language.

We conclude that intricate properties of language evolution need not depend on complex evolved linguistic capabilities, but can emerge from simple social exchanges between perceptually-enabled agents playing communication games.

We thank Alex Peysakhovich and Marco Baroni for their valuable comments.

References

  • Bakker et al. (2011) Peter Bakker, Aymeric Daval-Markussen, Mikael Parkvall, and Ingo Plag. Creoles are typologically distinct from non-creoles. Journal of Pidgin and Creole Languages, 26:5–42, 01 2011.
  • Baronchelli et al. (2008) Andrea Baronchelli, Vittorio Loreto, and Luc Steels. In-depth analysis of the naming game dynamics: the homogeneous mixing case. International Journal of Modern Physics C, 19(05):785–812, 2008.
  • Bedau (2003) Mark A Bedau. Artificial life: organization, adaptation and complexity from the bottom up. Trends in Cognitive Sciences, 7(11):505–512, 2003.
  • Bridle (1990) John S Bridle. Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters. In Advances in Neural Information Processing Systems, pages 211–217, 1990.
  • Briscoe (2002) Ted Briscoe. Linguistic evolution through language acquisition. Cambridge University Press, 2002.
  • Brooks and Ragir (2008) Patricia J Brooks and Sonia Ragir. Prolonged plasticity: Necessary and sufficient for language-ready brains. Behavioral and Brain Sciences, 31(5):514–515, 2008.
  • Cao et al. (2018) Kris Cao, Angeliki Lazaridou, Marc Lanctot, Joel Z Leibo, Karl Tuyls, and Stephen Clark. Emergent communication through negotiation. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Hk6WhagRW.
  • Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar, October 2014.
  • Choi et al. (2018) Edward Choi, Angeliki Lazaridou, and Nando de Freitas. Compositional obverter communication learning from raw visual input. International Conference on Learning Representations (ICLR), 2018.
  • Christiansen and Kirby (2003) Morten H Christiansen and Simon Kirby. Language evolution: Consensus and controversies. Trends in Cognitive Sciences, 7(7):300–307, 2003.
  • Crystal (1987) David Crystal. The Cambridge Encyclopedia of Language. Cambridge University Press, 1987.
  • Das et al. (2018) Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. Embodied question answering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • Evtimova et al. (2018) Katrina Evtimova, Andrew Drozdov, Douwe Kiela, and Kyunghyun Cho. Emergent communication in a multi-modal, multi-step referential game. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJGZq6g0-.
  • Fase et al. (1992) Willem Fase, Koen Jaspaert, and Sjaak Kroon. Maintenance and loss of minority languages. John Benjamins Publishing, 1992.
  • Fishman (1964) Joshua A Fishman. Language maintenance and language shift as a field of inquiry. A definition of the field and suggestions for its further development. Linguistics, 2(9):32–70, 1964.
  • Flores and Winograd (1987) F. Flores and T. Winograd. Understanding Computers and Cognition: A New Foundation for Design. Addison-Wesley Longman, 1987. ISBN 0201112973.
  • Foerster et al. (2016) Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pages 2137–2145, 2016.
  • Glorot et al. (2011) Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of AISTATS, pages 315–323, 2011.
  • Grouchy et al. (2016) Paul Grouchy, Gabriele MT D’Eleuterio, Morten H Christiansen, and Hod Lipson. On the evolutionary origin of symbolic communication. Scientific Reports, 6:34615, 2016.
  • Havrylov and Titov (2017) Serhii Havrylov and Ivan Titov. Emergence of language with multi-agent games: learning to communicate with sequences of symbols. In Advances in Neural Information Processing Systems, pages 2149–2159, 2017.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Jorge et al. (2016) Emilio Jorge, Mikael Kågebäck, Fredrik D Johansson, and Emil Gustavsson. Learning to play guess who? and inventing a grounded language as a consequence. Deep Reinforcement Learning Workshop at NIPS, 2016.
  • Joseph (2017) Brian D. Joseph. Historical linguistics. In Mark Aronoff, editor, The handbook of linguistics, chapter 15. John Wiley & Sons, 2017.
  • Kirby (2002) Simon Kirby. Natural language from artificial life. Artificial Life, 8(2):185–215, 2002.
  • Kirby and Hurford (2002) Simon Kirby and James R. Hurford. The emergence of linguistic structure: An overview of the iterated learning model. In Angelo Cangelosi and Domenico Parisi, editors, Simulating the Evolution of Language, pages 121–147. Springer London, London, 2002. ISBN 978-1-4471-0663-0. doi: 10.1007/978-1-4471-0663-0_6. URL https://doi.org/10.1007/978-1-4471-0663-0_6.
  • Knoester et al. (2007) David B Knoester, Philip K McKinley, Benjamin Beckmann, and Charles Ofria. Directed evolution of communication and cooperation in digital organisms. In European Conference on Artificial Life, pages 384–394. Springer, 2007.
  • Kuhnle and Copestake (2017) Alexander Kuhnle and Ann Copestake. ShapeWorld: A new test methodology for multimodal language understanding. arXiv preprint arXiv:1704.04517, 2017.
  • Lazaridou et al. (2016) Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. Multi-agent cooperation and the emergence of (natural) language. International Conference on Learning Representations (ICLR), 2016.
  • Lazaridou et al. (2018) Angeliki Lazaridou, Karl Moritz Hermann, Karl Tuyls, and Stephen Clark. Emergence of linguistic communication from referential games with symbolic and pixel input. International Conference on Learning Representations (ICLR), 2018.
  • LeCun et al. (2015) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
  • Lee et al. (2017) Jason Lee, Kyunghyun Cho, Jason Weston, and Douwe Kiela. Emergent translation in multi-agent communication. International Conference on Learning Representations (ICLR), 2017.
  • Lewis (2008) David Lewis. Convention: A philosophical study. John Wiley & Sons, 2008.
  • Lewis et al. (2017) Mike Lewis, Denis Yarats, Yann Dauphin, Devi Parikh, and Dhruv Batra. Deal or no deal? end-to-end learning of negotiation dialogues. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2443–2453, 2017.
  • Matras (2009) Yaron Matras. Language contact. Cambridge University Press, 2009.
  • Nair and Hinton (2010) Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML), pages 807–814, 2010.
  • Nowak and Krakauer (1999) Martin A. Nowak and David C. Krakauer. The evolution of language. Proceedings of the National Academy of Sciences, 96(14):8028–8033, 1999. ISSN 0027-8424. doi: 10.1073/pnas.96.14.8028. URL http://www.pnas.org/content/96/14/8028.
  • Nowak et al. (1999) Martin A Nowak, Joshua B Plotkin, and David C Krakauer. The evolutionary language game. Journal of Theoretical Biology, 200(2):147–162, 1999.
  • Oliphant and Batali (1997) Michael Oliphant and John Batali. Learning and the emergence of coordinated communication. Center for Research on Language Newsletter, 11(1):1–46, 1997.
  • Parkvall (2008) Mikael Parkvall. The simplicity of creoles in a cross-linguistic perspective. Language Complexity: Typology, contact, change, pages 265–285, 2008.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014. URL http://www.aclweb.org/anthology/D14-1162.
  • Puglisi et al. (2008) Andrea Puglisi, Andrea Baronchelli, and Vittorio Loreto. Cultural route to the emergence of linguistic categories. Proceedings of the National Academy of Sciences, 105(23):7936–7940, 2008.
  • R. Hurford (1989) James R. Hurford. Biological evolution of the Saussurean sign as a component of the language acquisition device. Lingua, 77:187–222, 02 1989.
  • Rumelhart et al. (1986) David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533, 1986.
  • Senghas et al. (2005) Richard J Senghas, Ann Senghas, and Jennie E Pyers. The emergence of Nicaraguan Sign Language: Questions of development, acquisition, and evolution. Biology and Knowledge Revisited: From Neurogenesis to Psychogenesis, pages 287–306, 2005.
  • Skyrms (2010) Brian Skyrms. Signals: Evolution, learning, and information. Oxford University Press, 2010.
  • Steels (1995) Luc Steels. A self-organizing spatial vocabulary. Artificial Life, 2(3):319–332, 1995.
  • Steels (2015) Luc Steels. The Talking Heads experiment: Origins of words and meanings, volume 1. Language Science Press, 2015.
  • Trianni and Dorigo (2006) Vito Trianni and Marco Dorigo. Self-organisation and communication in groups of simulated and physical robots. Biological Cybernetics, 95(3):213–231, 2006.
  • Trudgill (1974) Peter Trudgill. Linguistic change and diffusion: Description and explanation in sociolinguistic dialect geography. Language in Society, 3(2):215–246, 1974.
  • Williams (1992) Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pages 5–32. Springer, 1992.
  • Winograd (1971) Terry Winograd. Procedures as a representation for data in a computer program for understanding natural language. Technical report, MIT, 1971.

Appendix A Game, Reward and Agents

Game

We design a symmetric, multi-modal referential game in such a way that agents must cooperate in order to solve it successfully. Each agent observes a random half of an image that contains an object of a specific shape and color (see Fig. 1 (a) for examples) and is given a set of textual captions, of which only one correctly describes the object in the original image.

The dataset is created based on “ShapeWorld” Kuhnle and Copestake (2017). The goal of the agents is to identify the correct textual caption that describes the image. Since each agent only has partial information about the original image, the pair of agents must cooperate with each other via communication to be effective at solving the problem together. See Fig. 1 (b) for the graphical illustration of the proposed game.

At the beginning of the game, each agent makes an initial guess at the correct answer, followed by rounds of communication in which the agents take turns transmitting a binary message to the other agent. Binary message vectors have been used before for studying the emergence and evolution of language Kirby and Hurford (2002). While we selected this type of communication for efficiency reasons, it is straightforward to replace it with sequences of discrete symbols Jorge et al. (2016); Havrylov and Titov (2017); Lee et al. (2017); Cao et al. (2018); Lazaridou et al. (2018). Once the communication rounds are over, each agent makes its final guess. The game is considered successful if both agents correctly guessed the answer.
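A sketch of one play of the game, assuming a hypothetical agent interface (`observe`, `guess`, `send`, `receive`) and a `random_partition` helper like the one sketched in Appendix B:

```python
import random

def play_game(agent_a, agent_b, image, captions, answer):
    # One reference game: initial guesses, one exchange of binary messages
    # in a random order, then final guesses; success requires BOTH agents
    # to be correct after communication.
    half_a, half_b = random_partition(image)
    agent_a.observe(half_a, captions)
    agent_b.observe(half_b, captions)
    _ = agent_a.guess(), agent_b.guess()               # pre-communication guesses
    first, second = random.sample([agent_a, agent_b], 2)
    second.receive(first.send())
    first.receive(second.send())
    return agent_a.guess() == answer and agent_b.guess() == answer
```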

The game is similar to other games in the language evolution literature Nowak and Krakauer (1999); Nowak et al. (1999), such as the naming game Steels (1995); Baronchelli et al. (2008), the guessing game Steels (2015) and the category game Puglisi et al. (2008). Unlike the guessing and category games, the proposed game is symmetric between the participating agents and is partially observed. Unlike the naming game, the agents in the proposed game can handle sensory input and learn to capture sophisticated relationships between objects with arbitrary messages by means of supervised and reinforcement learning.

Reward

The reward structure for each agent is designed as follows. First, we reward the agent when it correctly guesses the answer after communication rather than before, in order to encourage it to incorporate information received from the other agent: $R_{\text{self}} = \mathbb{1}[\hat{y} = y]$, where $\mathbb{1}[\cdot]$ is an indicator function, $\hat{y}$ is the agent’s guess after communication, and $y$ is the correct answer. We empirically validate the importance of rewarding after-communication behavior, as shown in the first two bars in the left plot of Fig. 1 (d). Second, we reward cooperation by giving each agent a shared reward composed of both its own and the other agent’s rewards, i.e., $R_{\text{shared}} = \frac{1}{2}(R_{\text{self}} + R_{\text{other}})$, which significantly boosts the success rate, as shown by the latter two bars in the left plot of Fig. 1 (d). Lastly, we explicitly encourage the agents to rely on communication by rewarding them for the relative improvement from communication, rather than the success after communication: $R = R_{\text{shared}}^{\text{after}} - R_{\text{shared}}^{\text{before}}$, where the two terms are the shared rewards computed after and before the message exchange. This final reward, which encourages both cooperation and explicit reliance on communication, reaches the highest success rate, as shown in the right plot of Fig. 1 (d).
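Since the exact reward symbols did not survive extraction, the following sketch reconstructs the three reward variants from the description above (boolean arguments indicate whether each agent guessed correctly):

```python
def rewards(correct_before, correct_after, other_before, other_after):
    # Sketch of the three reward structures compared in Fig. 1 (d),
    # for one agent and its partner ("other").
    r_self = float(correct_after)                         # own success only
    r_shared = 0.5 * (float(correct_after) + float(other_after))
    # Relative improvement: reward the gain attributable to communication.
    shared_before = 0.5 * (float(correct_before) + float(other_before))
    r_improve = r_shared - shared_before
    return r_self, r_shared, r_improve
```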

Agent

A reference agent is implemented as a deep neural network consisting of multiple component sub-networks, based on recent advances in deep learning LeCun et al. (2015), as illustrated in Fig. 1 (c). Each agent is equipped with visual perception and the ability to communicate, both of which are implemented jointly in a single deep neural network and trained end-to-end using reinforcement learning to play the proposed communication game. The sensory sub-network is implemented as a ResNet-34, the state-of-the-art deep convolutional network from He et al. (2016), with fixed weights (i.e., using the weights obtained from training on ImageNet classification). We take the output of the final pre-classification layer in order to extract a 512-dimensional feature vector from the partially-visible input image, which is further transformed with a trainable dense layer and ReLU activation function Nair and Hinton (2010); Glorot et al. (2011) into a 100-dimensional feature vector $c$. The receiver sub-network is a recurrent neural network based on gated recurrent units Cho et al. (2014) and is able to process multiple turns of message exchange. It encodes the history of received binary-vector messages into a 100-dimensional feature vector $h$. These two vectors are then combined by the fusion sub-network into a single 100-dimensional vector $s$ which represents the agent’s internal state. Based on this internal state $s$, the agent computes three quantities. First, the predictor sub-network computes the predictive distribution over all the captions by comparing the internal state against the feature vector of each caption output by the text sub-network; the most likely caption under this distribution is the agent’s answer. Second, the sender sub-network computes the distribution of the message to be sent, using the output of the predictor sub-network to incorporate the agent’s current view of which caption is correct. During training, the agent stochastically samples a binary-vector message from this distribution, and at test time it uses the most likely message. The text sub-network encodes the candidate captions (e.g., “there is a red circle”) as representations of natural language. Lastly, the reward is predicted by the value sub-network, which is only used during training to stabilize learning.

This modular agent design allows us to easily swap various sub-networks with other architectures within the same framework. For instance, by replacing the sender sub-network with a recurrent neural network, the agent can generate a sequence of symbols rather than a binary vector. One could also modify the game to include other sensory modalities by modifying the sensory sub-network.
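A compact PyTorch sketch of this modular design; the hidden dimensions follow the text, while the message width and the exact layer shapes are illustrative assumptions (the real sensory sub-network is a frozen ResNet-34 producing the 512-dimensional feature):

```python
import torch
import torch.nn as nn

class Agent(nn.Module):
    def __init__(self, feat_dim=512, hid=100, msg_dim=32):
        super().__init__()
        self.vision = nn.Sequential(nn.Linear(feat_dim, hid), nn.ReLU())
        self.receiver = nn.GRUCell(msg_dim, hid)        # message history -> h
        self.fusion = nn.Sequential(nn.Linear(2 * hid, hid), nn.ReLU())
        self.sender = nn.Sequential(nn.Linear(2 * hid, hid), nn.ReLU(),
                                    nn.Linear(hid, msg_dim), nn.Sigmoid())
        self.value = nn.Sequential(nn.Linear(hid, hid), nn.ReLU(),
                                   nn.Linear(hid, 1))

    def receive(self, message, h):
        # Receiver sub-network: fold a new binary message into the history.
        return self.receiver(message, h)

    def forward(self, img_feat, h, caption_vecs):
        c = self.vision(img_feat)                        # sensory vector c
        s = self.fusion(torch.cat([c, h], dim=-1))       # internal state s
        scores = torch.bmm(caption_vecs, s.unsqueeze(-1)).squeeze(-1)
        probs = scores.softmax(dim=-1)                   # predictor output
        t_bar = (probs.unsqueeze(-1) * caption_vecs).sum(dim=1)
        msg_probs = self.sender(torch.cat([t_bar, s], dim=-1))
        return probs, msg_probs, self.value(s)           # guess, message, baseline
```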

Each agent is trained using a hybrid of supervised and reinforcement learning. Which of the two agents starts the game is decided at random. The agent computes two predictive distributions, one before and one after the message exchange. Since we know the correct caption during training, we apply supervised learning to both of these predictive distributions, using backpropagation and stochastic gradient descent Rumelhart et al. (1986). Because messages are discrete, we cannot use backpropagation to learn the message-generating process. Instead, we use reinforcement learning, in particular REINFORCE Williams (1992), to maximize the expected reward. We regularize learning by encouraging the entropy of the message distribution to be large, allowing the agent to explore various communication strategies during learning.

Appendix B Data Generation

We modify ShapeWorld Kuhnle and Copestake (2017) to generate a set of training, validation and test examples. Each example is a 128×128 RGB image containing an object of a simple shape and color. There are eight shapes (‘circle’, ‘cross’, ‘ellipse’, ‘pentagon’, ‘rectangle’, ‘semicircle’, ‘square’ and ‘triangle’) and seven colors (‘blue’, ‘cyan’, ‘gray’, ‘green’, ‘magenta’, ‘red’ and ‘yellow’). The size and position of the object in an image are randomly decided, while ensuring that the object is relatively small compared to the image. Each image is associated with a textual caption, i.e., a sentence which describes the shape, color, or shape and color of the object. Examples include “there is a blue square”, “there is a yellow shape” and “there is a square”.

We randomly select nine captions from the other images in order to create 10 candidate captions, from which the correct one must be selected by both agents. 13–16% of the examples are ambiguous, because objects can be described by their shape, color, or shape and color.

Each image is partitioned into two parts, each of which is shown to only one of the two players. Due to the small size of the object in each image and the random partitioning, the object is visible to only one of the players in approximately 82–84% of the images. When the object is split across the two partitions, both agents may be able to correctly solve the problem without consulting the other (although a split rectangle may, for instance, be wrongly perceived as a triangle). In all other cases, it is necessary for the agents to communicate. Fig. 1 (a) shows example image partitions. Partitioning is performed at random both during training and at evaluation time, without any fixed partition per image.
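One plausible implementation of the partitioning (an assumption; the exact scheme is not pinned down here) masks complementary regions of the image along a random straight cut:

```python
import numpy as np

def random_partition(image, rng=None):
    # Split a (128, 128, 3) image into two complementary masked copies
    # along a random horizontal or vertical cut.
    rng = rng or np.random.default_rng()
    h, w, _ = image.shape
    mask = np.zeros((h, w, 1), dtype=image.dtype)
    if rng.random() < 0.5:
        mask[:, : rng.integers(1, w)] = 1   # vertical cut
    else:
        mask[: rng.integers(1, h), :] = 1   # horizontal cut
    return image * mask, image * (1 - mask)
```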

We create 5,000 training examples while excluding the following combinations: ‘red square’, ‘green triangle’, ‘blue circle’, ‘yellow rectangle’, ‘magenta cross’ and ‘cyan ellipse’. These were excluded so that the generalization of trained agents to unseen combinations of color and shape can be tested. (This is done to facilitate future research; we do not test this generalization property in this paper.) We similarly construct 1,000 in-domain evaluation examples with only combinations included in the training set, and 5,000 out-of-domain evaluation examples which contain all possible combinations. Both of these sets are held out during training and are used for evaluation, with the in-domain results reported throughout this paper. Out of these 6,000 examples, 271 contain combinations of shape and color that do not appear in the training set.

Appendix C Agent Specification

The agent extracts a sensory feature vector using a pretrained ResNet-34, excluding the final classification layer, which is available from the torchvision package (https://pytorch.org/docs/stable/torchvision/index.html) of PyTorch (https://pytorch.org/). This is followed by a trainable transformation

$c = \max(0, W_c \, \text{ResNet-34}(x) + b_c),$

where $\max(0, \cdot)$ is a rectified linear unit, and $W_c$ and $b_c$ are trainable parameters.

A message is processed by a gated recurrent unit (GRU, Cho et al. (2014)) each time a new message is received:

$h_t = \text{GRU}(m_t, h_{t-1}),$

where the GRU’s hidden state $h_t$ is initialized to zero ($h_0 = 0$) at the beginning of each game. In this manuscript’s setup, there is only one message received per agent per game. The game begins with each agent receiving a blank message (all zeros) and making a prediction about the correct caption. Next, the agent selected to communicate first sends a message to the other agent, who, after receiving it, sends a message back to the first agent. (The order of message exchange is randomized each play.) Finally, each agent again tries to predict the correct caption.

The image and message vectors, $c$ and $h$, are concatenated and combined into a single vector by the fusion sub-network:

$s = \max(0, W_s [c; h] + b_s),$

where $W_s$ and $b_s$ are trainable parameters. This fused vector $s$ represents the agent’s internal state.

Each candidate caption is turned into a vector by the text sub-network:

$t_i = f\!\left(e(w_1^{(i)}), \ldots, e(w_{L_i}^{(i)})\right),$

where $w_j^{(i)}$ is the $j$-th word of the candidate caption $i$, $e$ is a trainable word embedding function over the vocabulary $V$, and $L_i$ is the length of the caption. We build the embedding function using a set of pretrained 100-dimensional GloVe Pennington et al. (2014) vectors. The predictor sub-network then compares each candidate caption against the fused vector to compute its score,

$\alpha_i = t_i^\top s,$

and these scores are normalized into a probability with a softmax Bridle (1990):

$p(y = i \mid s) = \frac{\exp(\alpha_i)}{\sum_j \exp(\alpha_j)}.$

Given the fused vector, the agent computes the message to be sent to its partner. This is done by first computing the message distribution, using the normalized probabilities from the predictor sub-network to incorporate the agent’s current belief about the correct caption. Assuming a $D$-dimensional binary message, as in this paper, the distribution is computed by first calculating a belief-weighted sum of the caption vectors,

$\bar{t} = \sum_i p(y = i \mid s) \, t_i,$

which is combined with the internal state to generate the per-bit message distribution,

$p(m_d = 1 \mid s) = \sigma\!\left(W_2 \max(0, W_1 [\bar{t}; s] + b_1) + b_2\right)_d,$

where $\sigma$ is the element-wise sigmoid, and $W_1$, $b_1$, $W_2$ and $b_2$ are trainable parameters. During training, we sample each bit from this distribution, while at test time we simply round the per-bit probabilities.
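A two-line sketch of this sampling/rounding behavior (the function name is illustrative):

```python
import torch

def emit_message(msg_probs, training):
    # Training: sample each bit to allow exploration (used by REINFORCE).
    # Test: deterministically round to the most likely message.
    return torch.bernoulli(msg_probs) if training else (msg_probs > 0.5).float()
```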

In order to reduce the variance of the policy gradient, we use a learned value estimate as a baseline. The agent estimates the expected reward given its observation (image and messages) using the value sub-network, which takes the fused vector as input and outputs a single scalar:

$V(s) = w_2^\top \max(0, W_1 s + b_1) + b_2,$

where $W_1$, $b_1$, $w_2$ and $b_2$ are trainable parameters.

Appendix D Learning

Loss Functions

There are four loss functions involved in each game. The first is the prediction loss. Given the index $y^*$ of the correct caption, the prediction loss is

$\mathcal{L}_{\text{pred}} = -\log p(y = y^* \mid s).$

This loss is applied twice, to the predictions made before and after the message exchange: $\mathcal{L}_{\text{pred}}^{\text{before}}$ and $\mathcal{L}_{\text{pred}}^{\text{after}}$.

The second loss is the value loss. After playing a game, the agent receives a reward $R$. The value sub-network learns to predict this reward:

$\mathcal{L}_{\text{value}} = (V(s) - R)^2.$

The third loss is the message loss. During training, we sample one message $m$ from $p(m \mid s)$. If this message led to a success, we increase its probability; otherwise, we decrease it, where success is measured relative to the predicted value. The loss is

$\mathcal{L}_{\text{msg}} = -\left(R - \tilde{V}(s)\right) \log p(m \mid s),$

where $\tilde{V}(s)$ denotes using the predicted value without updating the value sub-network through this loss function. The gradient of this message loss with respect to the message distribution corresponds to REINFORCE Williams (1992).

Lastly, we include an entropy penalty. Following Evtimova et al. (2018), we encourage the entropy of the message distribution to be high in order to facilitate exploration:

$\mathcal{L}_{\text{ent}} = -\mathcal{H}\!\left(p(m \mid s)\right).$

The overall loss function is then a weighted sum of the four loss functions:

$\mathcal{L} = \lambda_{\text{pred}} \left(\mathcal{L}_{\text{pred}}^{\text{before}} + \mathcal{L}_{\text{pred}}^{\text{after}}\right) + \lambda_{\text{value}} \mathcal{L}_{\text{value}} + \lambda_{\text{msg}} \mathcal{L}_{\text{msg}} + \lambda_{\text{ent}} \mathcal{L}_{\text{ent}},$

where the $\lambda$ coefficients are fixed scalar weights.
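A sketch of this four-term objective in PyTorch; the $\lambda$ weights are placeholders (the paper's specific values did not survive extraction), and the tensor shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def total_loss(p_before, p_after, target, value, reward, msg_probs, message,
               lam=(1.0, 1.0, 1.0, 0.01)):
    # p_before, p_after: (B, K) caption probabilities; target: (B,) indices;
    # value: (B, 1) predicted reward; reward: (B,); msg_probs, message: (B, D).
    l_pred = (F.nll_loss(p_before.clamp_min(1e-8).log(), target)
              + F.nll_loss(p_after.clamp_min(1e-8).log(), target))
    l_value = F.mse_loss(value.squeeze(-1), reward)
    # REINFORCE with the (gradient-stopped) value baseline.
    advantage = (reward - value.squeeze(-1)).detach()
    p = msg_probs.clamp(1e-8, 1 - 1e-8)
    log_p_msg = (message * p.log() + (1 - message) * (1 - p).log()).sum(-1)
    l_msg = -(advantage * log_p_msg).mean()
    # Negative entropy of the factorized message distribution: minimizing
    # this term pushes the entropy up, encouraging exploration.
    l_ent = (p * p.log() + (1 - p) * (1 - p).log()).sum(-1).mean()
    return lam[0] * l_pred + lam[1] * l_value + lam[2] * l_msg + lam[3] * l_ent
```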

Optimization

We use stochastic gradient descent with a minibatch size of 32 and use RMSProp to automatically adapt per-parameter learning rates.
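In PyTorch this corresponds to something like the following; the learning rate is an illustrative placeholder:

```python
import torch

optimizer = torch.optim.RMSprop(agent.parameters(), lr=1e-4)  # lr is illustrative
for batch in data_loader:          # minibatches of 32 games
    loss = total_loss(*batch)      # the four-term objective sketched above
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```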

Appendix E Code and Data

The code used for implementing the proposed framework as well as the experiments in this manuscript is publicly available at https://github.com/lgraesser/MultimodalGame. The generated data used in the experiments can be downloaded from https://goo.gl/HgHV1H.