In recent years, compositional language learning in the context of multi-agent emergent communication has been extensively studied (Sukhbaatar et al., 2016; Foerster et al., 2016; Lazaridou et al., 2016; Baroni, 2019). These works have collectively found that while most emergent languages do not tend to be compositional, they can be guided toward this attribute through artificial task-specific constraints (Kottur et al., 2017; Lee et al., 2017).
In this paper, we focus on how a neural network, specifically a generative one, can learn a compositional language, and moreover how this can occur without task-specific constraints. To accomplish this, we first define what a language is and what we mean by compositionality. In tandem, we introduce precision and recall, two metrics that measure how well a generative model at large has learned a grammar from a finite set of training instances. We then use a variational autoencoder with a discrete sequence bottleneck to investigate how well the model learns a compositional language and what biases that learning. This allows us to derive residual entropy, a third metric that reliably measures compositionality in our particular environment. We use this metric to cross-validate precision and recall.
Our environment lets us experiment with a syntactic, compositional language while varying the channel width and the number of parameters, our surrogate for the capacity of a model. Our experiments reveal that our smallest models are only able to solve the task when the channel is wide enough to allow for a surface-level compositional representation. In contrast, large models learn a language whenever the channel is wide enough. However, large models also have the ability to memorize the training set. We hypothesize that this memorization would lead to non-compositional representations and overfitting, although this does not yet manifest empirically. This setup allows us to test our hypothesis that there is a network capacity above which models tend to produce languages with non-compositional structure.
2 Related Work
In recent years, there has been renewed interest in studies of emergent language (Foerster et al., 2016; Lazaridou et al., 2016; Havrylov and Titov, 2017) that originated with works such as Steels and Kaplan (2000); Steels (1997). Some of these approaches use referential games (Evtimova et al., 2017; Lazaridou et al., 2018) to produce an emergent language that ideally has properties of human languages, with compositionality being a commonly sought after property (Barrett et al., 2018; Baroni, 2019; Chaabouni et al., 2019a).
Our paper is most similar to Kottur et al. (2017), which showed that compositional language arose only when certain constraints on the agents are satisfied. While the constraints they examined were either making their models memoryless or giving the language a minimal vocabulary, we hypothesized about the importance for agents to have small capacity relative to the number of disentangled representations (concepts) to which they are exposed. This is more general because both of the scenarios they described fall under the umbrella of reducing model capacity. To ask this, we built a much bigger dataset to illuminate how capacity and channel width affect the resulting compositionality in the language. Another difference is that they had a back-and-forth exchange between agents; we had a single forward exchange. While a limitation, this helped us focus the scope.
Liska et al. (2018) suggests that the average training run for recurrent neural networks does not converge to a compositional solution, but that a large random search will produce compositional solutions. This implies that the optimization approach biases learning, which our experiments also confirm. However, we further analyze other biases. Spike et al. (2017) describes three properties that bias models towards successful learned signaling: the creation and transmission of referential information, a bias against ambiguity, and information loss. This lies on a similar spectrum to our work, but pursues a different intent in that they study biases that lead to optimal signaling; we seek compositionality. Verhoef et al. (2016); Kirby et al. (2015); Zaslavsky et al. (2018) all examine the trade-off between expression and compression in both emergent and natural languages, in addition to how that trade-off affects the learners. We differ in that we target a specific aspect of the agent (capacity) and ask how that aspect biases the learning. Chen et al. (2018) describes how the probability distribution over the set of all strings produced by a recurrent model can be interpreted as a weighted language; this is relevant to our formulation of the language.
Most other works studying compositionality in emergent languages (Andreas, 2019; Mordatch and Abbeel, 2017; Chaabouni et al., 2019b; Lee et al., 2019) have focused on learning interpretable representations. See Hupkes et al. (2019) for a broad survey of the different approaches. By and large, these are orthogonal to our work because none pose the question we ask: how does an agent's capacity affect the resulting language's compositionality?
3 Compositional Language and Learning
We start by defining a language and what it means for a language to be compositional. We then discuss what it means for a network to learn a compositional language, based on which we derive evaluation metrics.
3.1 Compositional Language
A language $L$ is a subset of $\Sigma^*$, where $\Sigma$ denotes a set of symbols (an alphabet) and $s \in \Sigma^*$ denotes a string.
In this paper, we constrain a language to contain only finite-length strings, i.e., $|s| < \infty$ for all $s \in L$, implying that $L$ is a finite language. We use $\max_{s \in L} |s|$ to denote the maximum length of any $s \in L$.
We define a generator $G$ from which we can sample one valid string at a time, $s \sim G$. It never generates an invalid string and generates all the valid strings in $L$ in a finite number of trials, such that $L = \{\, s \mid s \sim G \,\}$.
We define the length of the description of $G$ as the sum of the number of non-terminal symbols $|V|$ and the number of production rules $|R|$, i.e., $|G| = |V| + |R|$, where $V$ is the set of non-terminal (intermediate) symbols and $R$ is the set of production rules. Each production rule $r \in R$ takes as input an intermediate string and outputs another string.¹
¹ We recover generative grammar (Chomsky, 2002) under the usual restrictions on the form of the production rules.
Languages and compositionality
When the number of such production rules plus the number of intermediate symbols is smaller than the size of the language that $G$ generates, we call $L$ a compositional language. In other words, $L$ is compositional if and only if $|V| + |R| < |L|$.
One such example is when we have sixty symbols, $|\Sigma| = 60$, and six intermediate symbols, $V = \{v_1, \ldots, v_6\}$, with the production rules $R$:
For each $v_i$, $v_i \to a_{i,j}$, where $j \in \{1, \ldots, 10\}$ and the subsets $\{a_{i,1}, \ldots, a_{i,10}\} \subset \Sigma$ are disjoint across the $v_i$.
From these production rules and intermediate symbols, we obtain a language of size $10^6$. We thus consider this language to be compositional and will use it in our experiments.
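To make the example concrete, the grammar can be sketched programmatically. This is a toy sketch, assuming (per the example above) six concept slots that each expand to one of ten dedicated terminal symbols; all names are ours:

```python
# Toy sketch of the example compositional grammar: six intermediate
# symbols, each rewriting to one of ten dedicated terminal symbols.
N_CONCEPTS = 6   # intermediate symbols v_1 .. v_6
N_VALUES = 10    # terminals per intermediate symbol

# Disjoint terminal sets, one per concept slot (60 symbols in total).
terminals = [[f"c{i}v{j}" for j in range(N_VALUES)] for i in range(N_CONCEPTS)]

def generate(concept_values):
    """Expand a tuple of six concept values into its surface string."""
    return tuple(terminals[i][v] for i, v in enumerate(concept_values))

n_rules = N_CONCEPTS * N_VALUES         # 60 production rules ...
language_size = N_VALUES ** N_CONCEPTS  # ... generate 10**6 strings
```

Since 66 description units (6 intermediate symbols plus 60 production rules) generate $10^6$ strings, this language satisfies the compositionality condition above.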
3.2 Learning a language
We consider the problem of learning an underlying language $L$ from a finite set of training strings randomly drawn from it, $D = \{s_1, \ldots, s_N\}$ with $s_n \sim G^*$, where $G^*$ is the minimal-length generator associated with $L$. We assume $N < |L|$, and our goal is to use $D$ to learn a language $L'$ that approximates $L$ as well as possible. We know that there exists an equivalent generator $G'$ for $L'$, and so our problem becomes estimating a generator from this finite set rather than reconstructing an entire set of strings belonging to the original language.
We cast the problem of estimating a generator as density modeling, in which case the goal is to estimate a distribution $p(s)$. Sampling from $p(s)$ is then equivalent to generating a string from the generator $G'$. Language learning becomes
$$\max_{p \in \mathcal{P}} \sum_{n=1}^{N} \log p(s_n) - \lambda R(p),$$
where $R$ is a regularization term, $\lambda$ is its strength, and $\mathcal{P}$ is a model space.
When the language is learned perfectly, any string sampled from the learned distribution $p(s)$ must belong to $L$. Also, any string in $L$ must be assigned a non-zero probability under $p(s)$. Otherwise, the set of strings generated from this generator, implicitly defined via $p(s)$, is not identical to the original language $L$. This observation leads to two metrics for evaluating the quality of the estimated language, precision and recall:
where $\mathbb{1}[\cdot]$ is the indicator function. These metrics are designed to fit any compositional structure rather than being one-off evaluation approaches. Because they are often intractable to compute exactly, we approximate them using Monte Carlo, drawing samples from $p(s)$ to estimate precision and uniform samples from $L$ to estimate recall:
where $\hat{s}_n \sim p(s)$ and $\tilde{s}_n$ is a uniform sample from $L$.
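The Monte-Carlo estimators can be written down directly. A minimal sketch follows, where the function names and the log-probability threshold (standing in for "non-zero probability") are our assumptions:

```python
import random

def precision_mc(sample_from_model, in_language, n=1000):
    """Monte-Carlo precision (Eq. 5): the fraction of strings
    sampled from the model that belong to the language L."""
    return sum(in_language(sample_from_model()) for _ in range(n)) / n

def recall_mc(model_logprob, language, n=1000, threshold=-1e9):
    """Monte-Carlo recall (Eq. 6): the fraction of strings drawn
    uniformly from L to which the model assigns non-negligible
    (above-threshold) log-probability."""
    drawn = random.sample(language, min(n, len(language)))
    return sum(model_logprob(s) > threshold for s in drawn) / len(drawn)
```

In practice `sample_from_model` would decode from the trained generator and `model_logprob` would score a string under it; here they are abstract callables.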
3.3 Compositionality, learning, and capacity
When learning an underlying compositional language , there are three possible outcomes:
Overfitting: $p(s)$ could memorize all the strings that were presented when solving Eq. (2), assigning non-zero probabilities to those strings and zero probabilities to all others. This would maximize precision, but recall would be low as the estimated generator does not cover $L$.
Systematic generalization: $p(s)$ could capture the underlying compositional structure of $L$ characterized by the production rules $R$ and intermediate symbols $V$. In this case, $p(s)$ will assign non-zero probabilities to all the strings that are reachable via these production rules (and zero probability to all others) and generalize to strings from $L$ that were unseen during training, leading to high precision and recall. This behavior was characterized in Lake and Baroni (2017).
Failure: $p(s)$ may neither memorize the entire training set nor capture the underlying production rules and intermediate symbols, resulting in both low precision and low recall.
We hypothesize that a major factor determining the compositionality of the resulting language is the capacity of the most complicated distribution within the model space $\mathcal{P}$.² As the definition of a model's capacity depends heavily on the specific construction, we do not define it concretely here but do so later when we introduce the specific family of models with which we run experiments. The hypothesis is that when the model capacity is too high, the first case of total memorization is likely. When it is too low, the third case of catastrophic failure will happen. Only when the model capacity is just right will language learning correctly capture the compositional structure underlying the original language and exhibit systematic generalization (Bahdanau et al., 2019). We are interested in empirically investigating this hypothesis using a neural network as a language learner. We do so by using a variational autoencoder with a discrete sequence bottleneck to model $p(s)$. This is useful because it admits the interpretation of a two-player ReferIt game (Lewis, 1969) in addition to being a density estimator, attributes that together allow us to utilize precision and recall to describe a language that has resulted from an emergent communication process.
4 Variational autoencoders and their capacity
A variational autoencoder (Kingma and Welling, 2013) consists of neural networks often referred to as an encoder $q(z|s)$, a decoder $p(s|z)$, and a prior $p(z)$. These networks are jointly updated to maximize the variational lower bound on the marginal log-probability of training instances $s$:
We can use this lower bound as a proxy to the true $\log p(s)$ captured by the model. Once trained, we can efficiently sample a latent $z$ from the prior and then sample a string $s$ from $p(s|z)$.
The usual formulation of variational autoencoders uses a continuous latent variable $z \in \mathbb{R}^d$, which conveniently admits a reparametrization that reduces the variance of the gradient estimate. However, the infinitely large latent space of a continuous variable makes it difficult to understand the resulting capacity of the model. We thus constrain the latent variable to be a binary string of a fixed length $m$, i.e., $z \in \{0, 1\}^m$. Assuming deterministic decoding, i.e., $s = \arg\max_{s'} p(s'|z)$, this puts a strict upper bound of $2^m$ on the size of the language captured by the variational autoencoder.
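The $2^m$ bound translates directly into a minimum channel width; the following one-function sketch makes it concrete (for a language of, say, $10^6$ strings it yields 20 bits):

```python
import math

def min_bits(language_size):
    """Smallest latent length m such that 2**m >= |L|: any narrower
    binary bottleneck cannot represent every string of the language."""
    return math.ceil(math.log2(language_size))
```

Any latent length below this value makes perfect recall impossible regardless of how large the encoder and decoder are.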
4.1 Variational autoencoder as a communication channel
As described above, using variational autoencoders with a discrete sequence bottleneck allows us to analyze the capacity of the model in terms of computation and bandwidth. We can now interpret this variational autoencoder as a communication channel in which a novel protocol must emerge as a by-product of learning. If each string in the original language is a description of underlying concepts, then the goal of the encoder is to encode those concepts in a binary string following an emergent communication protocol. The decoder receives this string and must interpret which set of concepts were originally described by the speaker.
We simplify and assume that all the symbols in the string indicate the underlying concepts. While the inputs are ordered according to the sequential concepts, our model encodes them using a bag-of-words (BoW) representation.
The encoder is parameterized using a recurrent policy that receives the sequence of concatenated one-hot input tokens of $s$ and converts each of them to an embedding. It then runs an LSTM (Hochreiter and Schmidhuber, 1997) non-autoregressively for $m$ timesteps, taking the flattened representation of the input embeddings as its input and linearly projecting each result to a probability distribution over $\{0, 1\}$. This results in a sequence of Bernoulli distributions over the $m$ latent variables, defined as:
From this distribution, we can sample a latent string $z$ of length $m$.
The decoder receives $z$ and uses a bag-of-words (BoW) representation to encode it into its own embedding space. Taking the flattened representation of these embeddings as input, we run an LSTM, at each time step outputting a probability distribution over the full alphabet $\Sigma$:
To train the whole system end-to-end (Sukhbaatar et al., 2016; Mordatch and Abbeel, 2017) via backpropagation, we apply a continuous approximation to $q(z|s)$ that depends on a learned temperature parameter $\tau$. We use Straight-Through Gumbel-Softmax (Jang et al., 2016; Maddison et al., 2017) to convert the continuous distribution to a discrete distribution for each latent variable $z_i$. This corresponds to the original discrete distribution in the zero-temperature limit. The final sequence of one-hot vectors encoding $z$ is our message, which is passed to the decoder. If $g_0, g_1 \sim \mathrm{Gumbel}(0, 1)$ and the Bernoulli random variable corresponding to $z_i$ has class probabilities $\pi_0$ and $\pi_1$, then $z_i = \frac{\exp((\log \pi_1 + g_1)/\tau)}{\exp((\log \pi_0 + g_0)/\tau) + \exp((\log \pi_1 + g_1)/\tau)}$.
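For concreteness, the per-bit sampling step can be sketched outside an autodiff framework. This is a sketch under our own naming and numerical guards; in a real implementation the soft value carries the gradient while the hard value is used in the forward pass:

```python
import math
import random

def sample_gumbel(rng=random):
    """Draw g ~ Gumbel(0, 1); small constants guard the logs at u = 0."""
    u = rng.random()
    return -math.log(-math.log(u + 1e-12) + 1e-12)

def st_gumbel_bernoulli(log_pi1, log_pi0, temperature):
    """One binary latent via Straight-Through Gumbel-Softmax.
    Returns (hard 0/1 sample, soft relaxation). An autodiff framework
    would backpropagate through the soft value only."""
    a = (log_pi1 + sample_gumbel()) / temperature
    b = (log_pi0 + sample_gumbel()) / temperature
    m = max(a, b)  # subtract the max for numerical stability
    y_soft = math.exp(a - m) / (math.exp(a - m) + math.exp(b - m))
    y_hard = 1.0 if y_soft > 0.5 else 0.0
    return y_hard, y_soft
```

As the temperature approaches zero, the soft relaxation concentrates on the hard sample, recovering the discrete Bernoulli distribution.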
The prior encodes the message using a bag-of-words (BoW) representation. It gives the probability of $z_i$ according to the prior (binary) distribution for each $i \in \{1, \ldots, m\}$ and is defined as:
This can be used both to compute the prior probability of a latent string $z$ and to efficiently sample a string from $p(z)$ using ancestral sampling. Penalizing the KL divergence between the speaker's distribution and the prior distribution in Eq. (7) encourages the emergent protocol to use latent strings that are as diverse as possible.
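With factorized Bernoulli posteriors and priors, this KL term decomposes into a sum of per-bit Bernoulli divergences. A minimal sketch (names ours; probabilities assumed strictly inside (0, 1)):

```python
import math

def kl_bernoulli(q, p):
    """KL(Bern(q) || Bern(p)) in nats, for 0 < q, p < 1."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_latent(qs, ps):
    """KL term for a factorized binary latent: the sum of per-bit
    Bernoulli divergences between posterior qs and prior ps."""
    return sum(kl_bernoulli(q, p) for q, p in zip(qs, ps))
```

The penalty is zero exactly when the speaker's per-bit probabilities match the prior, and grows as the protocol concentrates on a few latent strings.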
4.2 Capacity of a variational autoencoder with discrete sequence bottleneck
This view of a variational autoencoder with a discrete sequence bottleneck presents an opportunity to separate the model's capacity into two parts. The first part is the capacity of the communication channel, imposed by the size of the latent variable. As described earlier, the size of the original language that can be perfectly captured by this model is strictly upper bounded by $2^m$, where $m$ is the preset length of the latent string $z$. If $2^m < |L|$, the model cannot learn the language completely, although it may still memorize all the training strings if $2^m \geq N$. A resulting question is whether $2^m \geq |L|$ is a sufficient condition for the model to learn $L$ from a finite set of training strings.
The second part involves the capacity of the encoder and decoder to map between the latent variable $z$ and a string in the original language $L$. Taking the parameter count as a proxy to the number of patterns that could be memorized by a neural network,³ we can argue that the problem can be solved if the encoder and decoder each have on the order of $N$ parameters, in which case they can implement a hashmap between the strings of the original language and those of the learned latent language.
³ This is true in certain scenarios such as radial-basis function networks.
However, when the underlying language is compositional as defined in §3.1, we can have a much more compact representation of the entire language than a hashmap. Given the current understanding of neural networks, it is unfortunately impossible to even approximately correlate the parameter count with the language specification (production rules and intermediate symbols) and the complexity of using that language. It is, however, reasonable to assume a monotonic relationship between the number of parameters $|\theta|$ (or $|\phi|$) and the capacity of the network to encode the compositional structures underlying the original language (Collins et al., 2016). Thus, we use the number of parameters as a proxy for capacity.
In summary, there are two axes determining the capacity of the proposed variational autoencoder with a discrete sequence bottleneck: the length $m$ of the latent sequence and the number of parameters $|\theta|$ ($|\phi|$) in the encoder (decoder).⁴ We vary these two quantities in the experiments and investigate how they affect compositional language learning by the proposed variational autoencoder.
⁴ We design the variational autoencoder to be symmetric so that the numbers of parameters in the encoder and decoder are roughly the same.
4.3 Implications and hypotheses on compositionality
Under this framework for language learning, we can make the following observations:
If the length of the latent sequence satisfies $2^m < |L|$, it is impossible for the model to avoid the failure case because there will be strings in $L$ that cannot be generated from the trained model. Consequently, recall cannot be maximized. However, this may be difficult to check using the sample-based estimate, as the chance of sampling an unreachable string decreases proportionally to its share of $L$. This is especially true when the gap between $2^m$ and $|L|$ is narrow.
When $2^m \geq |L|$, there are three cases. The first is when there are not enough parameters to learn the underlying compositional grammar given by $\Sigma$, $V$, and $R$, in which case $L$ cannot be learned. The second case is when the number of parameters is greater than that required to store all the training strings. Here, it is highly likely for the model to overfit, as it can map each training string to a unique latent string without having to learn any of $L$'s compositional structure. Lastly, when the number of parameters lies between these two poles, we hypothesize that the model will capture the underlying compositional structure and exhibit systematic generalization.
In short, we hypothesize that the effectiveness of compositional language learning is maximized when both the length of the latent sequence is large enough, i.e., $2^m \geq |L|$, and the number of parameters lies between the amount needed to encode the grammar and the amount needed to memorize the training set. We test this experimentally by varying the length of the latent sequence and the number of parameters while checking the sample-based estimates of precision and recall (Eq. (5)–(6)).
5 Experiments
As described in §3.1, we run experiments where the size of the language ($10^6$ strings) is much larger than the number of production rules. The task is to communicate six concepts, each of which has ten possible values, for a total dataset size of $10^6$. We build three datasets (train, validation, and test) containing a finite number of strings each:
where sample($S$, $k$) uniformly selects $k$ random items from $S$ without replacement. The randomly selected concept values ensure that concept combinations are unique to each set. The symbol $*$ refers to any number of concepts, as in regular expressions.
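A sketch of such a split, assuming the strings are enumerated as tuples of concept values and that the held-out set sizes are free parameters (both are our assumptions, not the paper's exact procedure):

```python
import itertools
import random

def make_splits(n_concepts, n_values, n_val, n_test, seed=0):
    """Enumerate every concept combination, then hold out disjoint
    random subsets for validation and test, so no combination is
    shared between splits."""
    rng = random.Random(seed)
    all_strings = list(itertools.product(range(n_values), repeat=n_concepts))
    rng.shuffle(all_strings)
    test = all_strings[:n_test]
    val = all_strings[n_test:n_test + n_val]
    train = all_strings[n_test + n_val:]
    return train, val, test
```

Because the splits partition one shuffled enumeration, disjointness of concept combinations across the three sets holds by construction.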
Models and Learning
We train the proposed variational autoencoder (described in §4.1) on the training set using the Adam optimizer (Kingma and Ba, 2014) with weight decay. The Gumbel-Softmax temperature parameter $\tau$ is initialized to 10. Since systematic generalization may only happen in some training runs (Weber et al., 2018), each model is trained with each of 10 seeds for 200k steps.
Table 1: The base model hyperparameters used in our experiments (see §4.1 for architecture details), along with the total number of parameters for the two base models.
We train two sets of models. Each set is built from a base model, the architectures of which are described in Table 1. We gradually decrease the number of LSTM units from the base model by a constant factor. This is how we control the number of parameters ($|\theta|$ and $|\phi|$), a factor we hypothesize to influence the resulting compositionality. We obtain seven models from each base by varying the length $m$ of the latent sequence. These lengths were chosen both because we wanted to show a range of bits and because we need at least $\lceil \log_2 10^6 \rceil = 20$ bits to cover the $10^6$ strings in $L$.
5.1 Evaluation: residual entropy
Our setup allows us to design a metric by which we can check the compositionality of the learned language by examining how the underlying concepts are described by a string. Each string describes an assignment of a value to each of the six concepts, and we know that the value of one concept is independent of the other concepts. Our generative setup with a discrete latent sequence therefore allows us to inspect a learned language by considering how much of each concept's variability is explained by a subsequence of the latent string.
Let $b = (b_1, \ldots, b_6)$ be a sequence of partitions of the latent positions $\{1, \ldots, m\}$. We define the degree of compositionality as the ratio between the variability of each concept and the variability explained by the latent subsequence indexed by the associated partition element $b_i$. More formally, the degree of compositionality given the partition sequence $b$ is defined as a residual entropy
where there are six concepts by the definition of our language. When each term inside the summation is close to zero, it implies that a subsequence explains most of the variability of the specific concept, and we consider this situation compositional. The residual entropy of a trained model is then the smallest value over all possible sequences of partitions, and it spans from 0 (compositional) to 1 (non-compositional).
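An empirical version of this metric, for one candidate partition, can be sketched as follows (function names are ours; the full metric additionally minimizes over all partition sequences):

```python
import math
from collections import Counter

def entropy(xs):
    """Empirical Shannon entropy (bits) of a list of labels."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def conditional_entropy(concepts, codes):
    """Empirical H(concept | code): entropy within each group of
    samples sharing the same latent subsequence, weighted by size."""
    groups = {}
    for c, z in zip(concepts, codes):
        groups.setdefault(z, []).append(c)
    n = len(concepts)
    return sum((len(g) / n) * entropy(g) for g in groups.values())

def residual_entropy(concept_rows, latent_rows, partition):
    """Average normalized H(s_i | z_{b_i}) over concepts i for one
    partition b of the latent positions; 0 means each concept is
    fully explained by its assigned latent subsequence."""
    total = 0.0
    for i, idx in enumerate(partition):
        concepts = [row[i] for row in concept_rows]
        codes = [tuple(z[j] for j in idx) for z in latent_rows]
        total += conditional_entropy(concepts, codes) / max(entropy(concepts), 1e-12)
    return total / len(partition)
```

On a perfectly compositional code (each latent bit encodes one concept), the aligned partition yields residual entropy 0, while a misaligned partition yields values approaching 1.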
Fig. 1 shows the main findings of our research. In plot (a), we see the minimum parameter count below which the model cannot solve the task but above which it is solvable. Further, note the curve delineated by the lower left corner of the shift from unsuccessful to successful models. This inverse relationship between bits and parameters shows that the more parameters in the model, the fewer bits it needs to solve the task. Note however that it could only solve the task with fewer bits if it was forming a non-compositional code, suggesting that higher parameter models are able to do so while lower parameter ones cannot.
Observe further that all of our models above the minimum threshold (72,400) have the capacity to learn a compositional code. This is shown by the perfect training accuracy achieved by all of those models in plot (a) for 24 bits and by the perfect compositionality (zero entropy) in plot (b) for 24 bits. Together with the above, this validates that learning compositional codes requires less capacity than learning non-compositional codes.
Plot (c) confirms our hypothesis that large models can memorize the entire dataset. Observe that the 24 bit model with 971,400 parameters achieves a train accuracy of 1.0 and a validation accuracy of 0.0. Cross-validating this with plots (d) and (g), we find that a member of the same parameter class is non-compositional and that there is one that achieves unusually low recall. We verified that these are all the same seed, which shows that the agents in this model are memorizing the dataset.
Plots (b) and (e) show that our compositionality metrics pass two sanity checks: high recall and perfect entropy can only be achieved with a channel that is sufficiently large (i.e., 24 bits) to allow for a compositional latent representation.
Plot (f) shows that while the capacity does not affect the ability to learn a compositional language across the model range, it does dramatically change the learnability. Here we find that smaller models can fail to solve the task for any bandwidth, which coincides with literature suggesting a link between overparameterisation and learnability (Li and Liang, 2018; Du et al., 2018).
Finally, as expected, we find that no model learns to solve the task with fewer than 20 bits, validating that the minimum required number of bits for learning a language of size $10^6$ is $\lceil \log_2 10^6 \rceil = 20$. We also see that no model learns to solve it at exactly 20 bits, which is likely due to optimization difficulties.
In Fig. 2, we present histograms showing precision, recall and residual entropy measured for each bit and parameter combination over the test set. The histograms show the distributions of these metrics, upon which we make a number of observations.
We first confirm the effectiveness of training by observing that almost all the models achieve perfect precision (Fig. 2 (a)), implying that $L' \subseteq L$, where $L'$ is the language learned by the model. This occurs even though our learning objective in Eq. (7) encourages the model to capture all training strings rather than to focus on only a few.
A natural follow-up question is how large $L'$ is. We measure this with recall (Fig. 2 (b)), which shows a clear phase transition according to the model capacity when $2^m \geq |L|$. This agrees with what we saw in Fig. 1, with the transition occurring at a parameter count close to our predicted boundary. We attribute the remaining gap to the difficulty of learning a perfectly parameterized neural network.
Even when $2^m \geq |L|$, we observe training runs that fail to achieve optimal recall when the number of parameters is small. Due to our insufficient understanding of the relationship between the number of parameters and the capacity of a deep neural network, we cannot draw a rigorous conclusion. We conjecture, however, that this threshold is the upper bound on the minimal model capacity necessary to capture the tested compositional language. Above this threshold, the recall is almost always perfect, implying that the model has likely captured the compositional structure underlying $L$ from a finite set of training strings. We run further experiments with still larger parameter counts, but do not observe the expected overfitting. We also run experiments with a reduced number of categories (see Appendix for the associated histograms) and similarly do not find the upper bound. It is left for future studies to determine whether this is due to the lack of model capacity to memorize the hashmap between all the strings in $L$ and the latent strings, or due to an inclination towards compositionality in our variational autoencoder.
These results clearly confirm the first part of our hypothesis: the latent sequence length must be at least $\log_2 |L|$. They further confirm that there is a lower bound on the number of parameters above which this variational autoencoder can learn the underlying language well. We have not been able to verify the upper bound in our experiments, which may require either a more (computationally) extensive set of experiments with even more parameters or a better theoretical understanding of the inherent biases in learning variational autoencoders with a discrete sequence bottleneck, such as recent work on overparameterized models (Belkin et al., 2019).
In this paper, we hypothesize a previously overlooked connection between learnability, capacity, bandwidth, and compositionality in language learning. We empirically verify that learning the underlying compositional structure requires less capacity than memorizing a dataset. We also introduce a set of metrics to analyze the compositional properties of a learned language. These metrics are not only well motivated by theoretical insights but are also cross-validated by our task-specific metric.
This paper opens the door to a vast amount of follow-up research. All our models were sufficiently large to represent the compositional structure of the language when given sufficient bandwidth; however, there should be an upper bound on the capacity needed to represent this compositional structure, which we did not reach. We consider finding it to be the foremost open question.
Furthermore, while large models did overfit, this was an exception rather than the rule. We hypothesize that this is due to the large number of examples in our language, which almost forces the model to generalize, but note that there are likely additional biases at play that warrant further investigation.
- Andreas (2019) Andreas, J. (2019). Measuring compositionality in representation learning. In International Conference on Learning Representations.
- Bahdanau et al. (2019) Bahdanau, D., Murty, S., Noukhovitch, M., Nguyen, T. H., de Vries, H., and Courville, A. (2019). Systematic generalization: What is required and can it be learned? In International Conference on Learning Representations.
- Baroni (2019) Baroni, M. (2019). Linguistic generalization and compositionality in modern artificial neural networks. CoRR, abs/1904.00157.
- Barrett et al. (2018) Barrett, J. A., Skyrms, B., and Cochran, C. (2018). Hierarchical Models for the Evolution of Compositional Language. page 20.
- Belkin et al. (2019) Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854.
- Chaabouni et al. (2019a) Chaabouni, R., Kharitonov, E., Dupoux, E., and Baroni, M. (2019a). Anti-efficient encoding in emergent communication. arXiv:1905.12561 [cs]. arXiv: 1905.12561.
- Chaabouni et al. (2019b) Chaabouni, R., Kharitonov, E., Lazaric, A., Dupoux, E., and Baroni, M. (2019b). Word-order biases in deep-agent emergent communication. arXiv:1905.12330 [cs]. arXiv: 1905.12330.
- Chen et al. (2018) Chen, Y., Gilroy, S., Maletti, A., May, J., and Knight, K. (2018). Recurrent neural networks as weighted language recognizers. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2261–2271. Association for Computational Linguistics.
- Chomsky (2002) Chomsky, N. (2002). Syntactic structures. Walter de Gruyter.
- Collins et al. (2016) Collins, J., Sohl-Dickstein, J., and Sussillo, D. (2016). Capacity and trainability in recurrent neural networks. CoRR, abs/1611.09913.
- Du et al. (2018) Du, S. S., Zhai, X., Poczos, B., and Singh, A. (2018). Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054.
- Evtimova et al. (2017) Evtimova, K., Drozdov, A., Kiela, D., and Cho, K. (2017). Emergent language in a multi-modal, multi-step referential game. CoRR, abs/1705.10369.
- Foerster et al. (2016) Foerster, J., Assael, I. A., de Freitas, N., and Whiteson, S. (2016). Learning to Communicate with Deep Multi-Agent Reinforcement Learning. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R., editors, Advances in Neural Information Processing Systems 29, pages 2137–2145. Curran Associates, Inc.
- Havrylov and Titov (2017) Havrylov, S. and Titov, I. (2017). Emergence of Language with Multi-agent Games: Learning to Communicate with Sequences of Symbols. arXiv:1705.11192 [cs]. arXiv: 1705.11192.
- Hochreiter and Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Comput., 9(8):1735–1780.
- Hupkes et al. (2019) Hupkes, D., Dankers, V., Mul, M., and Bruni, E. (2019). The compositionality of neural networks: integrating symbolism and connectionism. arXiv:1908.08351 [cs, stat]. arXiv: 1908.08351.
- Jang et al. (2016) Jang, E., Gu, S., and Poole, B. (2016). Categorical Reparameterization with Gumbel-Softmax. ArXiv e-prints.
- Kingma and Ba (2014) Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
- Kingma and Welling (2013) Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
- Kirby et al. (2015) Kirby, S., Tamariz, M., Cornish, H., and Smith, K. (2015). Compression and communication in the cultural evolution of linguistic structure. Cognition, 141:87–102.
- Kottur et al. (2017) Kottur, S., Moura, J., Lee, S., and Batra, D. (2017). Natural language does not emerge 'naturally' in multi-agent dialog. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2962–2967. Association for Computational Linguistics.
- Lake and Baroni (2017) Lake, B. M. and Baroni, M. (2017). Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. arXiv preprint arXiv:1711.00350.
- Lazaridou et al. (2018) Lazaridou, A., Hermann, K. M., Tuyls, K., and Clark, S. (2018). Emergence Of Linguistic Communication From Referential Games With Symbolic And Pixel. page 13.
- Lazaridou et al. (2016) Lazaridou, A., Peysakhovich, A., and Baroni, M. (2016). Multi-agent cooperation and the emergence of (natural) language. CoRR, abs/1612.07182.
- Lee et al. (2019) Lee, J., Cho, K., and Kiela, D. (2019). Countering Language Drift via Visual Grounding. arXiv:1909.04499 [cs]. arXiv: 1909.04499.
- Lee et al. (2017) Lee, J., Cho, K., Weston, J., and Kiela, D. (2017). Emergent Translation in Multi-Agent Communication. arXiv:1710.06922 [cs]. arXiv: 1710.06922.
- Lewis (1969) Lewis, D. (1969). Convention: A Philosophical Study. Wiley-Blackwell.
- Li and Liang (2018) Li, Y. and Liang, Y. (2018). Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8157–8166.
- Liska et al. (2018) Liska, A., Kruszewski, G., and Baroni, M. (2018). Memorize or generalize? searching for a compositional RNN in a haystack. CoRR, abs/1802.06467.
- Maddison et al. (2017) Maddison, C. J., Mnih, A., and Teh, Y. W. (2017). The concrete distribution: A continuous relaxation of discrete random variables. CoRR, abs/1611.00712.
- Mordatch and Abbeel (2017) Mordatch, I. and Abbeel, P. (2017). Emergence of Grounded Compositional Language in Multi-Agent Populations. arXiv:1703.04908 [cs]. arXiv: 1703.04908.
- Spike et al. (2017) Spike, M., Stadler, K., Kirby, S., and Smith, K. (2017). Minimal requirements for the emergence of learned signaling. In Cognitive Science.
- Steels (1997) Steels, L. (1997). The synthetic modeling of language origins. Evolution of Communication, 1(1):1–34.
- Steels and Kaplan (2000) Steels, L. and Kaplan, F. (2000). Aibo’s first words: The social learning of language and meaning. Evolution of Communication, 4.
- Sukhbaatar et al. (2016) Sukhbaatar, S., Szlam, A., and Fergus, R. (2016). Learning Multiagent Communication with Backpropagation. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R., editors, Advances in Neural Information Processing Systems 29, pages 2244–2252. Curran Associates, Inc.
- Verhoef et al. (2016) Verhoef, T., Kirby, S., and de Boer, B. (2016). Iconicity and the emergence of combinatorial structure in language. Cognitive Science, 40(8):1969–1994.
- Weber et al. (2018) Weber, N., Shekhar, L., and Balasubramanian, N. (2018). The fine line between linguistic generalization and failure in seq2seq-attention models. In Proceedings of the Workshop on Generalization in the Age of Deep Learning, pages 24–27. Association for Computational Linguistics.
- Zaslavsky et al. (2018) Zaslavsky, N., Kemp, C., Regier, T., and Tishby, N. (2018). Efficient compression in color naming and its evolution. Proceedings of the National Academy of Sciences, 115(31):7937–7942.