Implementing this paper : https://arxiv.org/pdf/1901.08706.pdf by Laura Graesser et al.
In this work, we propose a computational framework in which agents equipped with communication capabilities simultaneously play a series of referential games, where agents are trained using deep reinforcement learning. We demonstrate that the framework mirrors linguistic phenomena observed in natural language: i) the outcome of contact between communities is a function of inter- and intra-group connectivity; ii) linguistic contact either converges to the majority protocol, or in balanced cases leads to novel creole languages of lower complexity; and iii) a linguistic continuum emerges where neighboring languages are more mutually intelligible than farther removed languages. We conclude that intricate properties of language evolution need not depend on complex evolved linguistic capabilities, but can emerge from simple social exchanges between perceptually-enabled agents playing communication games.
Contact linguistics, the field that studies what happens when two or more languages or language varieties interact, poses several pertinent open questions that are difficult to answer: how does symmetric (“mutually intelligible”) communication emerge; how do languages behave under contact; how does a language continuum develop; how does one language come to dominate another; how and why does extensive language contact tend to lead to simplification (e.g. in creoles); and how does a linguistic continuum come about, where neighboring languages are more intelligible than farther removed ones? In this work, we show that such linguistic phenomena emerge naturally given a few general assumptions about the organizational structure of networks of artificial agents equipped with a minimalistic form of learned communication.
Studying language change in vivo is challenging, since it requires simultaneous observation of speaker and community interactions Brooks and Ragir (2008); Trudgill (1974); Joseph (2017); Christiansen and Kirby (2003), while carefully controlling for purposes and goals Winograd (1971); Flores and Winograd (1987); Nowak and Krakauer (1999). Studies of language emergence and evolution must furthermore be conducted over long periods of time, spanning decades or even centuries. Even the Nicaraguan sign language, which emerged remarkably rapidly, took several decades to develop fully Senghas et al. (2005). Language itself also never ceases to evolve Fishman (1964).
Advances in computer science have provided us with opportunities for instead investigating the emergence and evolution of languages in vitro using computational and mathematical models R. Hurford (1989); Briscoe (2002); Kirby (2002); Christiansen and Kirby (2003); Lewis (2008); Skyrms (2010). In computational approaches, communities of agents, equipped with the ability to communicate, are deployed in a simulated environment. Their communication protocol either evolves or is learned, in order to maximize some reward provided by the environment. The agents’ behaviors, together with their communication, are observed and compared against known linguistic phenomena or hypothesized linguistic theories. Using computational methods, one can precisely control linguistic, environmental and algorithmic variables, enabling a thorough examination of the cultural and communicative aspects of language.
We propose a new multi-agent framework for studying the emergence and evolution of language, where agents are neural networks endowed with the ability to exchange messages about their perceptual input. Computational multi-agent models are characterized by the complexity of the agents, the choice of learning algorithm, and the design of the environment and reward structure. The complexity of an artificial agent ranges from a set of simple difference equations Grouchy et al. (2016), to a CPU-like architecture with an instruction set and registers Knoester et al. (2007), to a co-occurrence matrix between objects and symbols Nowak and Krakauer (1999), to a simple single-layer neural network Trianni and Dorigo (2006), to a deep neural network Lazaridou et al. (2016); Foerster et al. (2016); Jorge et al. (2016). The learning algorithm is either a variant of evolutionary algorithms Nowak and Krakauer (1999); Kirby (2002); Grouchy et al. (2016), often used in the framework of Artificial Life Bedau (2003), or a gradient-based optimization algorithm, as often used for training deep neural networks with a supervised or reinforcement learning objective function. The former sees generations developing complex behavior over time, while the latter enables more sophisticated agents thanks to the recent advances in deep learning LeCun et al. (2015). Recent years have seen intriguing new results in emergent communication, starting with Lazaridou et al. (2016) and Foerster et al. (2016), using such neural agents Lewis et al. (2017); Havrylov and Titov (2017); Jorge et al. (2016); Evtimova et al. (2018); Das et al. (2018); Cao et al. (2018). Often, these approaches could be framed as special or generalized cases of Lewis’s signalling game (Lewis, 2008, see Ch. 4–5 thereof). Here, deep learning agents play games within communities of similar agents, where the subject of the reference game is the agents’ perceptual input.
In what follows, we first introduce the multi-agent communication framework. We subsequently describe several linguistic phenomena that emerge within the framework. First, we investigate linguistic behavior on the agent-level, and examine when symmetric communication emerges within a linguistic community, as well as how the topological organization of agents within communities impacts their convergence and learning. We then switch to the community-level, where we examine the behavior of communities when they come into contact, as well as how community-level topologies impact convergence, success rate and mutual intelligibility.
We demonstrate that the following linguistic behaviors emerge, which correspond to known linguistic phenomena in human languages: 1) the outcome of contact is a function of inter- and intra-group connectivity, i.e. that languages become mutually intelligible through contact, even for agents that have not themselves been exposed to the other language, provided there is sufficient connectivity between communities; 2) linguistic contact over time either converges to the dominant majority protocol, leading to the other language becoming extinct, or if the communities are balanced gives rise to an original “creole” protocol that has lower complexity than the original languages; 3) a linguistic continuum emerges, where neighboring languages are more mutually intelligible than farther removed languages and the topology of the continuum governs its behavior. To our knowledge, this work constitutes the first attempt at studying contact linguistic phenomena using communities of latest-generation deep neural agents capable of dealing with rich perceptual input; and is the first to show that such intricate properties of language evolution need not depend on intrinsic properties of highly complex evolved linguistic capabilities, but instead can emerge purely from social exchanges between perceptually-enabled agents with simple communicative capabilities.
In order to study emergent linguistic phenomena in a simplified but realistic setting, the communication game needs to have several properties. First, it should be symmetric, in that all agents should be able to act as “speaker” and “listener”. Second, the agents should communicate about something external to themselves, i.e., about the sensory experience of something in their environment. Third, the world should be partially observable, implying that communication is required for solving the game successfully. Fig. 1 shows example training data, the game setup and agent design. Please refer to the supplementary material for details.
Consider a multi-agent communication game consisting of communities of agents that receive environmental observations and communicate across a bidirectional message channel. The community membership of agents is defined as a graph with the agents as vertices and weighted edges that determine whether two agents are connected, i.e., the graph specifies the topology of the network. Each agent is a deep neural network that can handle rich sensory inputs, such as raw images, as well as natural language text. Pairs of connected agents learn to play a game with a reward structure designed specifically to require communication-based collaboration. The exchange of information through the communication channel can take any form. By learning to play the game, the agents develop a shared communication protocol which is triadic in nature (i.e., about the observations). This framework allows us to control for proximity constraints, population size and degree of interaction, through the underlying weighted graph structure. Moreover, we can explicitly specify the common purpose of agents through the reward structure, which gives control over the information that must be exchanged between a pair of agents for successfully solving a problem.
During training, a pair of agents corresponding to adjacent vertices is selected at random according to the interaction weights. The agents then play the reference game and are updated according to the specified reward structure. The community structure can change during training. For instance, we are able to merge separately trained linguistic communities into a single community and finetune the agents from both communities (i.e. bring the communities into contact) to investigate linguistic contact. Once training of the agents in the linguistic community is done, we can ignore the graph structure underlying the community and test pairs of distant agents to understand the distribution of emergent communication protocols.
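The pair-selection step described above can be sketched as weighted sampling over the edges of the community graph. This is a minimal illustration rather than the authors' implementation: the function name `sample_pair`, the dictionary representation of the graph and the example weights are all assumptions made here.

```python
import random


def sample_pair(edges, rng):
    """Pick one connected pair of agents with probability proportional to its weight.

    `edges` maps an (agent_i, agent_j) tuple to a non-negative interaction weight.
    (Hypothetical representation; the paper only specifies a weighted graph.)
    """
    pairs = list(edges)
    weights = [edges[p] for p in pairs]
    return rng.choices(pairs, weights=weights, k=1)[0]


# Hypothetical three-agent community in which every pair interacts equally often.
edges = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0}
rng = random.Random(0)
pair = sample_pair(edges, rng)
# The selected pair would then play one reference game and both agents would be updated.
```

Changing the community structure during training then amounts to replacing the `edges` dictionary between sampling steps.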
We first examine whether having just two autonomous agents is sufficient for developing a common communication protocol. That is, we ask whether a symmetric language emerges when there are only two agents in a linguistic community, or whether they learn to speak distinct idiolects to each other. In order to answer this question, we formally define “mutual intelligibility” in the communication protocol as the ability for each agent to play the game against itself. If a shared communication protocol has emerged, the agent would not have any trouble playing a game with itself (she “understands” what she “says” and “says” what she “understands”). We run five simulations with random initialization and examine the success rate between two agents averaged over all the test examples, where the agents succeed on each example if both of them correctly guess the answer after communication.
As shown in Fig. 2 (a), two agents can solve the problem with a high success rate when they play with each other (cross-play). However, the success rate drops to random chance (10%) when each agent plays with itself (self-play). This clearly suggests that the emergent communication protocol is not symmetric: each agent develops its own protocol and the other adapts to it, leading to two incompatible idiolects. We then run additional experiments with three, four, five, eight and ten agents, where every pair of agents interacts with an equal interaction intensity (the success rate is averaged over all possible pairs of agents, unless stated otherwise). We notice that the success rates of self-play and cross-play are indistinguishable from each other, strongly implying that a common, shared language emerges as a social convention if and only if we have more than two language users. Importantly, this finding demonstrates that, at least within this framework, it is not necessary to specifically equip an agent with an innate mechanism that ties listening and speaking, such as the obverter technique Oliphant and Batali (1997); Choi et al. (2018), nor any explicit community-wide coordination. All that is needed in order for a common language to emerge is a minimum number of agents.
We observe no detrimental effect to increasing the number of agents per linguistic community, even though more agents have to come to agree on a single protocol. As shown in Fig. 2 (b–c), it takes approximately 60–65,000 plays per agent for us to observe the first instance of a pair of agents reaching the success rate of 70% regardless of the community size. Similarly, it takes approximately 150-200,000 plays per agent for the success rate averaged over all pairs of agents in a community to reach 75%, again, regardless of the size of the community. Surprisingly, the speed at which each agent learns stays constant with respect to the community size, as in Fig. 2 (d). That is, we were not able to observe any correlation between the community size and the linguistic convergence of the entire community. This is in contrast with observations in the naming game mentioned earlier, in which the convergence time increases with respect to the number of agents Baronchelli et al. (2008), suggesting that the behaviors of linguistic communities heavily depend on individual agents’ capabilities and learning settings and that further investigation, both theoretical and empirical, is warranted in this setup with sophisticated agents.
We next examine what happens when we expose different linguistic communities to each other. Specifically, we consider two linguistic communities, each with its own population size, which are trained independently from each other as fully-connected communities and have developed separate communication protocols. We bring these two communities into contact with each other by selecting a new set of inter-community edges with a given probability (the inter-group connectivity chance) to form a new linguistic community. We assign one weight to all the inter-group edges and another weight to all the intra-group edges. We can then examine how “interaction intensity” relates to language shift. See Fig. 3 (a) for an illustration of two linguistic communities making contact.
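One way to picture the contact setup is as a merged weighted graph. The sketch below assumes that each possible inter-community edge is included independently with the connectivity probability; the function name `contact_graph`, the parameter names and the numeric values are illustrative choices, not settings taken from the paper.

```python
import itertools
import random


def contact_graph(community_a, community_b, p_connect, w_intra, w_inter, rng):
    """Merge two fully connected communities into one weighted graph.

    Every intra-community pair keeps an edge of weight `w_intra`. Each
    inter-community pair receives an edge of weight `w_inter` with
    probability `p_connect` (assumed here to be sampled independently per edge).
    """
    edges = {}
    for community in (community_a, community_b):
        for i, j in itertools.combinations(community, 2):
            edges[(i, j)] = w_intra
    for i in community_a:
        for j in community_b:
            if rng.random() < p_connect:
                edges[(i, j)] = w_inter
    return edges


rng = random.Random(0)
a = list(range(0, 5))   # hypothetical community of five agents
b = list(range(5, 10))  # second hypothetical community of five agents
edges = contact_graph(a, b, p_connect=0.5, w_intra=1.0, w_inter=0.5, rng=rng)

# "Bridge agents" are the agents that sit on at least one inter-community edge.
a_set = set(a)
inter_edges = [(i, j) for (i, j) in edges if (i in a_set) != (j in a_set)]
bridge_agents = {agent for edge in inter_edges for agent in edge}
```

Agents outside `bridge_agents` never play against the other community directly, which is the group whose success rate is tracked separately in the contact experiments.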
We first investigate two communities of identical population sizes () with the ratio of the learning frequencies of the intra-group pairs and inter-group pairs set to , where is the number of inter-community connections and is the number of intra-community connections, and the inter-group connectivity chance set to . We notice in Fig. 3 (b) that the bridge agents learn to communicate better (evident from the higher success rate among themselves), but the other agents quickly catch up (according to the success rate among themselves excluding the bridge agents), although these other agents never directly interact with agents from the other community.
This finding demonstrates the rapid shift toward a common protocol in both groups where all agents learn to speak a shared language, regardless of whether they actually interact with agents from the other group.
Having established that linguistic contact leads to convergence of the communication protocol, we delve deeper into the impact of two major parameters: the ratio of inter and intra-group connectivity and the connectivity probability . We vary the ratio of the learning frequencies of the intra-group pairs and inter-group pairs between , and , while fixing the inter-group connectivity to . After 200,000 plays, the former () converges to a more tightly coupled linguistic community, achieving 65.6% success rate between agents that never interacted with each other. On the other hand, when the inter-group interaction occurred only half as frequently as the intra-group interaction, the agents from the two groups can play together with a much lower 52.4% success rate. We observed similar patterns over many different combinations of the ratio and inter-group connectivity. For example, we varied the inter-group connectivity between , , , and while the interaction ratio was fixed to . After 200,000 plays, we observed the success rates, averaged over all possible inter-group pairs, reach 42.1%, 51.1%, 65.55%, 66.65% and 66.3%, respectively. This implies that there is a critical level of inter-group connectivity (around in this specific case) after which language propagation saturates.
In Fig. 3 (c), we plot the interplay between the ratio and the inter-group connectivity after interpolating from the fifteen experiments varying these parameters. This demonstrates that both parameters are important in determining the level of linguistic convergence.
We investigate the effect of the population size ratio between two linguistic communities when they come in contact. We study how population size is a factor in one language coming to “dominate” another language upon contact. We vary the ratio by fixing and varying . Each community is pretrained in isolation to develop its own protocol before coming in contact with the other. We set the interaction ratio to and the inter-group connectivity chance to .
We refer to the original protocols of the communities right after pretraining by and . Each of these is then evolved further after these two communities come in contact, resulting in and . The previous experiment on linguistic contact suggests that based on the fact that the agents from both communities can successfully play the game after coming in contact, so we refer to the final protocol as . We examine how similar is to either of the original protocols, or . This similarity is measured by letting the agent using or play against the one using , which is naturally facilitated by the proposed framework. This historical self-play accuracy reflects the similarity of the original and final protocols.
When the population ratio deviates from , we observe that the final protocol rapidly converges to the majority protocol (), evident from the near-perfect and the near-chance in Fig. 4. This is the consequence of the fact that members from both communities are rewarded for cooperating and playing the game well (via the bridge agents). In other words, the agents prefer to integrate or assimilate rather than segregate, similar to how it has been found that minority groups shift “toward the use of dominant language” Fase et al. (1992). On the other hand, we observe and that both of these historical self-play accuracies are significantly above chance, when the population ratio is closer to or exactly . It is impossible to identify either or as an ancestor of , but is rather a combination of these two original protocols, which is “a key feature defining a contact language” (see Chapter 10 of Matras (2009)). Both of these observations suggest the potential of the proposed framework for simulating and understanding the birth and death of new languages via linguistic contact.
We further investigate the complexity of the contact language arising from two linguistic communities coming into contact. We define complexity based on the uncertainty of an agent when generating a message, measured by the entropy of the message distribution. Higher entropy indicates that agents can express states in many different ways: in other words, the more complex a language, the higher the degree of freedom. For each linguistic community (consisting of two clusters), we then compute the average of these message distributions in order to characterize the complexity of a learned communication protocol.
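As a rough illustration of this complexity measure, the sketch below computes the entropy of a binary-vector message distribution. It assumes that each message bit is an independent Bernoulli variable and that community-level complexity is the mean of per-message entropies; both are assumptions made for this example, and the function names are hypothetical.

```python
import math


def bernoulli_entropy(p):
    """Entropy (in nats) of a single bit that is 1 with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def message_entropy(bit_probs):
    """Entropy of a binary message, assuming independent bits."""
    return sum(bernoulli_entropy(p) for p in bit_probs)


def community_complexity(message_distributions):
    """Average message entropy over all sampled message distributions."""
    entropies = [message_entropy(d) for d in message_distributions]
    return sum(entropies) / len(entropies)


# A maximally uncertain 8-bit message versus a fully deterministic one.
uncertain = [0.5] * 8
deterministic = [0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0]
complexity = community_complexity([uncertain, deterministic])
```

Under these assumptions a protocol that always emits the same message for a given state has zero entropy, while one that spreads probability over many messages scores higher.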
We take four settings from the previous experiments (10-3, 10-4, 10-9 and 10-10) to investigate the evolution of linguistic complexity. In all cases, we observe in Fig. 5 (b) that the complexity decreases when two communities come in contact with each other. This observation is in agreement with a similar phenomenon of structural simplification in creole languages, which are understood to arise from the contact of two or more languages Parkvall (2008); Bakker et al. (2011). We also observe that the complexity plateaus earlier when there is a larger imbalance between the two communities’ populations (10-3 and 10-4), while it drops further with more balanced communities (10-9 and 10-10). This implies that the new-born contact languages arising from the contact of two similarly-sized communities tend to be substantially simpler.
We generalize the previous setting to having linguistic communities in a connected chain of communities. We start by pretraining linguistic communities of populations respectively, evolving distinct communication protocols. We then chain them such that each consecutive pair, and , comes in contact with a pre-specified inter-group connectivity chance and interaction ratio , and begin training all of the communities jointly. This setup allows us to study the emergence of a linguistic continuum, which is highly relevant to dialect continuums existing in natural languages, as found e.g. in the Nordic Germanic dialects of Scandinavia, as a chain of Swedish dialect in Finland, Swedish in Sweden, Danish, Norwegian to Icelandic Chrystal (1987). Often, speakers on the border between two consecutive linguistic communities are mutually intelligible, while those from communities geographically separated by many intermediate ones cannot communicate.
We start by considering a chain of five communities of equal population (N=5). As plotted in Fig. 6 (a), we clearly observe the emergence of a linguistic continuum. The agents from a pair of adjacent communities can communicate with each other almost as well as those within a single community, while communicability rapidly degrades as the distance between a pair of communities grows (off-diagonal). The agents from and cannot understand each other at all, achieving a near-chance success rate. A similar continuum is observed when we increase the population of the center community two-fold (from 5 to 10). This continuum, however, exhibits properties different from the original chain of equal-sized communities: the center communities , and become more tightly coupled, as evident from the higher success rate among those in Fig. 6 (b-c). This, however, happens at the cost of communicability between the agents from the furthest-removed communities.
In order to verify that the emergence of such a continuum is due to topological properties of communities, we show the protocol similarities among the densely connected five communities in Fig. 6 (d-e). Unlike chaining, we ensure that every pair of communities comes in contact with each other in a densely connected topology. As expected, the protocol similarity between any pair of communities is uniformly high, confirming that the linguistic continuum arises from the topology.
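The two community-level topologies compared here differ only in which pairs of communities are brought into contact. A small sketch, using hypothetical helper names and zero-based community indices chosen for this example:

```python
import itertools


def chain_contacts(num_communities):
    """Community pairs in contact when communities form a chain."""
    return [(k, k + 1) for k in range(num_communities - 1)]


def dense_contacts(num_communities):
    """Community pairs in contact when every pair of communities is connected."""
    return list(itertools.combinations(range(num_communities), 2))


def chain_distance(a, b):
    """Number of hops between two communities along the chain."""
    return abs(a - b)


chain = chain_contacts(5)
dense = dense_contacts(5)
farthest = max(dense, key=lambda pair: chain_distance(*pair))
```

In the chain, the two end communities are four hops apart and share no direct contact, whereas in the dense topology every pair is one hop apart, which matches the uniformly high similarity reported for that setting.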
We have proposed a new framework for the large-scale investigation of complex linguistic phenomena via multi-agent communication games, which enables the analysis of communication protocols learned by linguistic communities of trainable agents. Our framework differs from previous work in the complexity of the artificial agents, which use latest-generation deep reinforcement learning, as well as in their ability to handle rich sensory signals. We observed that a symmetric communication protocol emerges without any innate, explicit mechanism built into an agent, when there were three or more of them in a linguistic community.
We then demonstrated the emergence of linguistic phenomena in this framework. First, the result of linguistic contact between communities is determined by inter- and intra-group connectivity. Given sufficient inter-group connectivity, languages become mutually intelligible through contact, even for agents that have not themselves been exposed to the other language. Second, linguistic contact over time either converges to the dominant majority protocol, leading to the extinction of the other language, or gives rise to an original “creole” protocol that has lower complexity than the original languages, if the communities are balanced. Third, a linguistic continuum emerges, where neighboring languages are more mutually intelligible than farther removed languages. The topology of the continuum governs its behavior, and a very dominant central language causes its neighbors to lose mutual intelligibility with communities that are not directly exposed to that central language.
We conclude that intricate properties of language evolution need not depend on complex evolved linguistic capabilities, but can emerge from simple social exchanges between perceptually-enabled agents playing communication games.
We thank Alex Peysakhovich and Marco Baroni for their valuable comments.
Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters. In Advances in Neural Information Processing Systems, pages 211–217, 1990.
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar, October 2014.
Proceedings of the 27th international conference on machine learning (ICML), pages 807–814, 2010.
Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014. URL http://www.aclweb.org/anthology/D14-1162.
We design a symmetric, multi-modal referential game in such a way that it is necessary for agents to cooperate if they are to successfully solve the game. Each agent observes a random half of an image that contains an object of specific shape and color (see Fig. 1 (a) for examples) and is given a set of textual captions of which only one correctly describes the object in the original image.
The dataset is created based on “ShapeWorld” Kuhnle and Copestake (2017). The goal of the agents is to identify the correct textual caption that describes the image. Since each agent only has partial information about the original image, the pair of agents must cooperate with each other via communication to be effective at solving the problem together. See Fig. 1 (b) for the graphical illustration of the proposed game.
At the beginning of the game, each agent makes an initial guess of the correct answer, followed by rounds of communication in which the agents take turns transmitting a binary message to the other agent. Binary message vectors have been used before for studying the emergence and evolution of language Kirby and Hurford (2002). While we selected this type of communication for efficiency reasons, it is straightforward to replace it with sequences of discrete symbols Jorge et al. (2016); Havrylov and Titov (2017); Lee et al. (2017); Cao et al. (2018); Lazaridou et al. (2018). Once the communication rounds are over, each agent makes its final guess. The game is considered successful if both agents correctly guess the answer.
The game is similar to other games in the language evolution literature Nowak and Krakauer (1999); Nowak et al. (1999), such as the naming game Steels (1995); Baronchelli et al. (2008), the guessing game Steels (2015) and the category game Puglisi et al. (2008). Unlike the guessing and category games, the proposed game is symmetric between the participating agents and is partially observed. Unlike the naming game, the agents in the proposed game can handle sensory input and learn to capture sophisticated relationships between objects with arbitrary messages by means of supervised and reinforcement learning.
The reward structure for each agent is designed as follows. First, we reward the agent when it correctly guesses the answer after communication rather than before, in order to encourage it to incorporate information received from the other agent; this reward is an indicator function of a correct final guess. We empirically validate the importance of rewarding after-communication behavior, shown in the first two bars in the left plot of Fig. 1 (d). Second, we reward cooperation by giving each agent a shared reward composed of both its own and the other agent’s rewards, which significantly boosts the success rate as shown by the latter two bars in the left plot of Fig. 1 (d). Lastly, we explicitly encourage the agents to rely on communication by rewarding them for the relative improvement from communication, rather than the success after communication. This final reward, which encourages both cooperation and explicit reliance on communication, reaches the highest success rate, as shown in the right plot of Fig. 1 (d).
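The three reward variants described above can be illustrated with a small sketch. The exact formulas are not reproduced in this text, so the choices below are assumptions made for illustration only: the shared reward is taken to be the sum of the two agents' rewards, and the improvement reward is taken to be the difference between the after- and before-communication indicators. All function names are hypothetical.

```python
def indicator(condition):
    """Return 1.0 when the condition holds and 0.0 otherwise."""
    return 1.0 if condition else 0.0


def own_reward(final_guess, answer):
    """Reward for guessing correctly after communication."""
    return indicator(final_guess == answer)


def improvement_reward(initial_guess, final_guess, answer):
    """Reward for improving on the pre-communication guess (assumed difference form)."""
    return indicator(final_guess == answer) - indicator(initial_guess == answer)


def shared_reward(reward_self, reward_other):
    """Cooperative reward combining both agents' rewards (assumed to be a sum)."""
    return reward_self + reward_other


# Agent A was wrong before communication and right after; agent B was right throughout.
answer = 3
reward_a = improvement_reward(initial_guess=7, final_guess=3, answer=answer)
reward_b = improvement_reward(initial_guess=3, final_guess=3, answer=answer)
team_reward = shared_reward(reward_a, reward_b)
```

With the difference form, an agent that was already correct before communication gains nothing from simply staying correct, which is one way to push agents toward relying on the exchanged messages.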
A reference agent is implemented as a deep neural network consisting of multiple component sub-networks, based on recent advances in deep learning LeCun et al. (2015), as illustrated in Fig. 1 (c). Each agent is equipped with visual perception and the ability to communicate, both of which are implemented jointly in a single deep neural network and trained end-to-end using reinforcement learning to play the proposed communication game. The sensory sub-network is implemented as a ResNet-34, the state-of-the-art deep convolutional network from He et al. (2016), with fixed weights (i.e., using the weights obtained from training on ImageNet classification). We transfer the final pre-classification layer in order to extract a 512-dimensional feature vector from the partially-visible input image, which is further transformed with a trainable dense layer and ReLU Nair and Hinton (2010); Glorot et al. (2011) activation function to a 100-dimensional feature vector. The sub-network that handles incoming messages is able to process multiple turns of message exchanges: it encodes the history of received binary-vector messages into a 100-dimensional feature vector. These two vectors are then combined using the fusion sub-network into a single 100-dimensional vector which represents the agent’s internal state. Based on this internal state, the agent computes three quantities. First, the predictor sub-network computes the predictive distribution over all the captions by comparing the internal state against the feature vector of each caption outputted by the text sub-network. The most likely caption under this distribution is the agent’s answer. Second, the sender sub-network computes the distribution of a message to be sent, using the output of the predictor sub-network to incorporate the agent’s current view of which caption is correct. During training, the agent stochastically samples a binary-vector message from this distribution, and during test, it uses the most likely message. The text sub-network encodes predictions (e.g. “there is a red circle”) as representations of natural language. Lastly, the reward is predicted by the value sub-network, which is only used during training to stabilize learning.
This modular agent design allows us to easily swap various sub-networks with other architectures within the same framework. For instance, by replacing the sender sub-network with a recurrent neural network, the agent can generate a sequence of symbols rather than a binary vector. One could also modify the game to include other sensory modalities by modifying the sensory sub-network.
Each agent is trained using a hybrid of supervised and reinforcement learning. Which of the two agents starts the game is decided at random. The agent computes two predictive distributions, one before and one after message exchange. Since we know the correct caption during training, we use supervised learning with these two predictive distributions. Because messages are discrete, we cannot use backpropagation for learning the message generating process. Instead, we use reinforcement learning, in particular REINFORCE Williams (1992), to maximize the reward. We regularize learning by encouraging the entropy of the message distribution to be large to allow the agent to explore various communication strategies during learning.
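To make the message-learning step concrete, the sketch below applies a REINFORCE update with an entropy bonus to the logits of a binary message. It is a toy illustration under stated assumptions, not the authors' training code: message bits are treated as independent Bernoulli variables parameterized directly by logits, the baseline stands in for the prediction of the value sub-network, and the hyperparameter values are arbitrary.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sample_message(logits, rng):
    """Sample a binary message, one independent Bernoulli draw per bit."""
    return [1 if rng.random() < sigmoid(z) else 0 for z in logits]


def reinforce_step(logits, message, reward, baseline=0.0, entropy_coef=0.01, lr=0.1):
    """One gradient-ascent step on (reward - baseline) * log p(message) + entropy bonus."""
    new_logits = []
    for z, m in zip(logits, message):
        p = sigmoid(z)
        grad_logp = m - p                  # d/dz of log Bernoulli(m; sigmoid(z))
        grad_entropy = -z * p * (1.0 - p)  # d/dz of the Bernoulli entropy
        step = (reward - baseline) * grad_logp + entropy_coef * grad_entropy
        new_logits.append(z + lr * step)
    return new_logits


rng = random.Random(0)
logits = [0.0] * 8  # hypothetical 8-bit message
message = sample_message(logits, rng)
logits = reinforce_step(logits, message, reward=1.0)
```

A positive reward moves each logit toward the bit that was actually sent, while the entropy term pulls logits back toward zero, which keeps the message distribution from collapsing too early.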
We modify ShapeWorld Kuhnle and Copestake (2017) to generate a set of training, validation and test examples. Each example is a 128×128 RGB image containing an object with a simple shape and color. There are eight shapes (‘circle’, ‘cross’, ‘ellipse’, ‘pentagon’, ‘rectangle’, ‘semicircle’, ‘square’ and ‘triangle’) and seven colors (‘blue’, ‘cyan’, ‘gray’, ‘green’, ‘magenta’, ‘red’ and ‘yellow’). The size and position of the object in an image are randomly decided, while ensuring that the object size is relatively small compared to the image size. Each image is associated with a textual caption, i.e., a sentence which describes the shape, color, or shape and color of the object. Some examples include “there is a blue square”, “there is a yellow shape” and “there is a square”.
We randomly select nine captions from the other images in order to create 10 candidate captions from which the correct one must be selected by both agents. 13-16% of examples are ambiguous due to the fact that objects can be described by their shape, color, or shape and color.
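The construction of the candidate set can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `build_candidates` is hypothetical, and removing exact duplicates of the correct caption from the distractor pool is an assumption that the text does not state.

```python
import random


def build_candidates(correct_caption, other_captions, num_candidates=10, rng=random):
    """Return (candidates, index_of_correct_caption).

    Distractors are drawn from captions of other images. Dropping exact
    duplicates of the correct caption is an assumption of this sketch.
    """
    pool = sorted({c for c in other_captions if c != correct_caption})
    distractors = rng.sample(pool, num_candidates - 1)  # nine distractors
    candidates = distractors + [correct_caption]
    rng.shuffle(candidates)  # hide the position of the correct caption
    return candidates, candidates.index(correct_caption)
```

Note that a distractor such as “there is a blue shape” can still be a valid description of a blue square, which is the source of the ambiguous examples mentioned above.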
Each image is partitioned into two parts, each of which is shown to only one of two players. Due to the small size of the object in each image and random partitioning, the object is only visible to one of the players in approximately 82-84% of images. When the object is split into two partitions, both agents may be able to correctly solve the problem without consulting the other (yet a split rectangle may be wrongly perceived as a triangle, for instance). It is necessary for the agents to communicate in all the other cases. In Fig. 1 (a), we show example image partitions. Random partitioning happens during training and at evaluation time without any fixed partition per image.
We create 5,000 training examples while excluding the following combinations: ‘red square’, ‘green triangle’, ‘blue circle’, ‘yellow rectangle’, ‘magenta cross’ and ‘cyan ellipse’. These were excluded in order for us to test the generalization of trained agents to unseen combinations of color and shape. (This is done to facilitate future research, and we do not test this generalization property in this paper.) We similarly construct 1,000 in-domain evaluation examples with only combinations included in the training set, and 5,000 out-of-domain evaluation examples which contain all possible combinations. Both of these are held out during training and are used for evaluation, with the in-domain results reported throughout this paper. Out of these 6,000 examples, 271 combinations of shape and color do not appear in the training set.
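The split between training and held-out attribute combinations can be enumerated directly from the shape and color lists given earlier. The variable names below are illustrative only.

```python
SHAPES = ["circle", "cross", "ellipse", "pentagon",
          "rectangle", "semicircle", "square", "triangle"]
COLORS = ["blue", "cyan", "gray", "green", "magenta", "red", "yellow"]

# (color, shape) pairs excluded from the training set, as listed in the text.
HELD_OUT = {
    ("red", "square"), ("green", "triangle"), ("blue", "circle"),
    ("yellow", "rectangle"), ("magenta", "cross"), ("cyan", "ellipse"),
}

all_combos = [(color, shape) for color in COLORS for shape in SHAPES]
train_combos = [combo for combo in all_combos if combo not in HELD_OUT]
```

With eight shapes and seven colors there are 56 combinations in total, of which 50 remain available for training.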
The agent extracts a sensory feature vector using a pretrained ResNet-34 without its final classification layer, which is available from the torchvision package (https://pytorch.org/docs/stable/torchvision/index.html) of PyTorch (https://pytorch.org/). It is followed by a dense layer with a rectified linear unit activation, whose weights and bias are trainable parameters.
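As a minimal sketch of this projection, the dense layer with a rectified linear unit can be written in plain Python. The 512 and 100 dimensions come from the text; the names `dense_relu`, `W`, `b` and `h_img`, the random initialization, and the random stand-in for the ResNet-34 features are all assumptions made for illustration.

```python
import random


def dense_relu(x, weight, bias):
    """Compute ReLU(W x + b) for a vector x, a row-major matrix W and a bias b."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b_i)
            for row, b_i in zip(weight, bias)]


rng = random.Random(0)
IN_DIM, OUT_DIM = 512, 100  # dimensions stated in the text

# Hypothetical initialization; the text does not specify one.
W = [[rng.gauss(0.0, 0.01) for _ in range(IN_DIM)] for _ in range(OUT_DIM)]
b = [0.0] * OUT_DIM

# Stand-in for the frozen ResNet-34 feature vector of one image.
resnet_features = [rng.random() for _ in range(IN_DIM)]
h_img = dense_relu(resnet_features, W, b)
```

Only `W` and `b` would be updated during training, since the ResNet-34 weights are fixed.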
A message is processed by a gated recurrent unit (GRU, Cho et al. (2014)) each time a new message is received. The GRU’s hidden state is initialized to zero at the beginning of each game. In this manuscript’s setup, there is only one message received per agent per game. The game begins with each agent receiving a blank message (all zeros) and making a prediction about the correct caption. Next, the agent selected to communicate first sends a message to the other agent, who after receiving it sends a message back to the first agent. (The order of message exchange is randomized each play.) Finally, each agent again tries to predict the correct caption.
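The order of events in a single game can be sketched as below. This illustrates only the protocol described in the text; `StubAgent` and `play_game` are hypothetical names, and the stub's `predict` and `send` methods are placeholders for the real sub-networks.

```python
import random


class StubAgent:
    """Hypothetical placeholder; a real agent would run its sub-networks."""

    def __init__(self, own_message):
        self.own_message = list(own_message)
        self.inbox = []  # messages this agent had received when it spoke

    def predict(self, received):
        return sum(received)  # placeholder for a caption prediction

    def send(self, received):
        self.inbox.append(list(received))
        return list(self.own_message)


def play_game(agent_a, agent_b, msg_dim, rng=random):
    agents = [agent_a, agent_b]
    blank = [0] * msg_dim
    # 1. Both agents predict after receiving a blank (all-zero) message.
    first_preds = [agent.predict(blank) for agent in agents]
    # 2. The first speaker is chosen at random; the other agent replies.
    first = rng.randrange(2)
    second = 1 - first
    msg_from_first = agents[first].send(blank)
    msg_from_second = agents[second].send(msg_from_first)
    # 3. Each agent predicts again from the single message it received.
    received = {first: msg_from_second, second: msg_from_first}
    final_preds = [agents[i].predict(received[i]) for i in range(2)]
    return first_preds, final_preds, first
```

The first speaker composes its message having seen only the blank message, whereas the second speaker has already received the first speaker's message when it replies.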
The image and message vectors are concatenated and combined into a single vector by the fusion sub-network, whose parameters are trainable. This fused vector is used to represent the agent’s internal state.
Each candidate caption is turned into a vector by the text sub-network, which applies a trainable word embedding function over the vocabulary to each word of the caption and combines the resulting word vectors over the length of the caption. We build the embedding function using a set of pretrained 100-dimensional GloVe Pennington et al. (2014) vectors. The predictor sub-network then compares each candidate caption against the fused vector to compute its score. These scores are normalized with a softmax to become a probability distribution over the candidate captions Bridle (1990).
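A small sketch of the scoring and normalization steps follows. The softmax normalization is standard; using a dot product between the fused vector and each caption vector as the score is an assumption of this sketch, since the exact scoring function is not recoverable from the text, and the function names are illustrative.

```python
import math


def softmax(scores):
    """Normalize real-valued scores into a probability distribution."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def caption_probabilities(state, caption_vectors):
    """Score each caption against the fused state, then normalize.

    The dot-product score is an assumption made for illustration.
    """
    scores = [sum(s * c for s, c in zip(state, caption))
              for caption in caption_vectors]
    return softmax(scores)
```

The agent's answer is then the index of the largest probability.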
Given the fused vector, the agent computes the message to be sent to the partner. This is done by first computing the message distribution, using the normalized probabilities from the predictor sub-network to incorporate the agent’s current belief about the correct caption. Assuming a fixed-dimensional binary message as done in this paper, the distribution is computed by first calculating a sum of the caption vectors weighted by these probabilities. This is combined with the hidden state to generate the message distribution, using trainable parameters. During training, we sample from this distribution, while at test time we simply round the probability for each bit.
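The two ends of the sender computation, the belief-weighted caption summary and the conversion of per-bit probabilities into a binary message, can be sketched as follows. The trainable layers that map the summary and the hidden state to per-bit probabilities are omitted because their exact form is not recoverable from the text. The function names are hypothetical, and mapping a probability of exactly 0.5 to 1 at test time is an assumption.

```python
import random


def weighted_caption(probs, caption_vectors):
    """Sum of the caption vectors weighted by the predictor's probabilities."""
    dim = len(caption_vectors[0])
    return [sum(p * caption[j] for p, caption in zip(probs, caption_vectors))
            for j in range(dim)]


def emit_message(bit_probs, training, rng=random):
    """Turn per-bit probabilities into a binary message."""
    if training:
        # Stochastic sampling: each bit is an independent Bernoulli draw.
        return [1 if rng.random() < p else 0 for p in bit_probs]
    # Test time: round each probability (ties at 0.5 go to 1 in this sketch).
    return [1 if p >= 0.5 else 0 for p in bit_probs]
```

Sampling during training supplies the exploration that the policy-gradient update relies on, whereas rounding makes the test-time protocol deterministic.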
In order to reduce the variance of policy gradients, we use a learned value estimate as a baseline. The agent estimates the expected reward/return given the observation (image and message) using the value sub-network. It takes as input the fused vector and outputs a single scalar, using trainable parameters.
There are four loss functions involved in each game. The first one is a prediction loss function. Given the index of the correct caption, the prediction loss function is the negative log-probability assigned to that caption by the predictor sub-network. This loss is used twice, based on the predictions before and after the message exchange.
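As a brief illustration, the prediction loss can be computed as the negative log-probability of the correct caption. This is the usual supervised objective for a predictive distribution and is assumed here; the function name and the example probabilities are made up.

```python
import math


def prediction_loss(probs, correct_index):
    """Negative log-probability assigned to the correct caption."""
    return -math.log(probs[correct_index])


# The loss is applied to the predictions before and after the exchange.
probs_before = [0.1] * 10           # uninformed prediction over 10 captions
probs_after = [0.91] + [0.01] * 9   # confident prediction after the message
loss_before = prediction_loss(probs_before, 0)
loss_after = prediction_loss(probs_after, 0)
```

A helpful message should lower the second loss relative to the first.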
The second loss is a value loss function. After playing a game, the agent receives a reward. The value sub-network then needs to learn to predict this reward.
The third loss is a message loss function. During training, we sample one message from the message distribution. If this message led to a success, we increase the probability of the sampled message. Otherwise, we decrease it. The success is measured relative to the predicted value. The cost function therefore weights the log-probability of the sampled message by the difference between the received reward and the predicted value, where the predicted value is used without updating the value sub-network according to this loss function. The gradient of this message loss function with respect to the parameters of the message distribution corresponds to REINFORCE Williams (1992).
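The message loss can be sketched for a message whose bits are independent Bernoulli variables, which is an assumption consistent with the per-bit rounding used at test time. The function names are hypothetical. The baseline (the predicted value) is treated as a constant, mirroring the statement that the value sub-network is not updated through this loss.

```python
import math


def message_log_prob(bit_probs, message):
    """Log-probability of a binary message under independent Bernoulli bits."""
    return sum(math.log(p) if m == 1 else math.log(1.0 - p)
               for p, m in zip(bit_probs, message))


def message_loss(bit_probs, message, reward, predicted_value):
    """REINFORCE loss with a baseline; the baseline is held constant."""
    advantage = reward - predicted_value
    return -advantage * message_log_prob(bit_probs, message)


def message_loss_grad_logits(bit_probs, message, reward, predicted_value):
    """Gradient of the loss w.r.t. the pre-sigmoid logit of each bit.

    For p = sigmoid(z), d log p(m) / dz = m - p.
    """
    advantage = reward - predicted_value
    return [-advantage * (m - p) for p, m in zip(bit_probs, message)]
```

When the reward exceeds the predicted value, a gradient descent step moves each logit toward the sampled bit, which raises the probability of that message; when the reward falls short, the step moves in the opposite direction.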
Lastly, we include an entropy penalty. Following Evtimova et al. (2018), we encourage the entropy of the message distribution to be higher to facilitate exploration.
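For a message with independent Bernoulli bits (an assumption of this sketch), the entropy is the sum of the per-bit entropies, and the corresponding loss term is its negative, so that minimizing the loss increases the entropy. The function names are illustrative.

```python
import math


def bernoulli_entropy(p):
    """Entropy in nats of a single Bernoulli(p) bit."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a deterministic bit carries no entropy
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def message_entropy(bit_probs):
    """Entropy of a message whose bits are independent."""
    return sum(bernoulli_entropy(p) for p in bit_probs)


def entropy_loss(bit_probs):
    """Negative entropy: lower loss means a more exploratory sender."""
    return -message_entropy(bit_probs)
```

The term is smallest when every bit probability equals 0.5, which is the most exploratory message distribution.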
The overall loss function is then the weighted sum of the four loss functions, with three weighting coefficients set to fixed values.
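The combination of the loss terms can be sketched as a weighted sum. The coefficient values are not recoverable from the text, so they are left as required arguments, and the choice of which terms carry a coefficient (here the value, message and entropy terms) is an assumption of this sketch rather than the authors' specification.

```python
def total_loss(pred_before, pred_after, value, message, entropy,
               coef_value, coef_message, coef_entropy):
    """Weighted sum of the loss terms.

    The prediction loss appears twice (before and after the exchange).
    Which terms are weighted is an assumption of this sketch.
    """
    return (pred_before + pred_after
            + coef_value * value
            + coef_message * message
            + coef_entropy * entropy)
```

For example, `total_loss(1.0, 2.0, 4.0, -1.0, -0.5, coef_value=0.5, coef_message=1.0, coef_entropy=0.1)` combines illustrative term values using made-up coefficients.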
We use stochastic gradient descent with a minibatch size of 32 and use RMSProp to automatically adapt per-parameter learning rates.
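To illustrate how RMSProp adapts the step size per parameter, a single update can be written as below. The hyperparameter values (`lr`, `alpha`, `eps`) are placeholders chosen for the sketch, as the text specifies only the minibatch size.

```python
import math


def rmsprop_step(params, grads, sq_avg, lr=0.01, alpha=0.99, eps=1e-8):
    """Apply one in-place RMSProp update.

    sq_avg holds a running average of squared gradients per parameter.
    All hyperparameter defaults are assumptions of this sketch.
    """
    for i, g in enumerate(grads):
        sq_avg[i] = alpha * sq_avg[i] + (1.0 - alpha) * g * g
        params[i] -= lr * g / (math.sqrt(sq_avg[i]) + eps)


params, sq_avg = [1.0], [0.0]
rmsprop_step(params, [2.0], sq_avg)  # gradient averaged over a minibatch
```

Because each gradient is divided by the root of its own running average, parameters with consistently large gradients take proportionally smaller steps.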