
Compositionality and Generalization in Emergent Languages

by   Rahma Chaabouni, et al.

Natural language allows us to refer to novel composite concepts by combining expressions denoting their parts according to systematic rules, a property known as compositionality. In this paper, we study whether the language emerging in deep multi-agent simulations possesses a similar ability to refer to novel primitive combinations, and whether it accomplishes this feat by strategies akin to human-language compositionality. Equipped with new ways to measure compositionality in emergent languages inspired by disentanglement in representation learning, we establish three main results. First, given sufficiently large input spaces, the emergent language will naturally develop the ability to refer to novel composite concepts. Second, there is no correlation between the degree of compositionality of an emergent language and its ability to generalize. Third, while compositionality is not necessary for generalization, it provides an advantage in terms of language transmission: The more compositional a language is, the more easily it will be picked up by new learners, even when the latter differ in architecture from the original agents. We conclude that compositionality does not arise from simple generalization pressure, but if an emergent language does chance upon it, it will be more likely to survive and thrive.



1 Introduction

Most concepts we need to express are composite in some way. Language gives us the prodigious ability to assemble messages referring to novel composite concepts by systematically combining expressions denoting their parts. As interest rises in developing deep neural agents evolving a communication code to better accomplish cooperative tasks, the question arises of how the emergent code can be endowed with the same desirable compositionality property Kottur:etal:2017; Lazaridou:etal:2018; mordatch2018emergence; cogswell2019; Li2019. This in turn requires measures of how compositional an emergent language is Andreas:2019. Compositionality is a core notion in linguistics Partee:2004, but linguists’ definitions assume full knowledge of primitive expressions and their combination rules, which we lack when analyzing emergent languages Nefdt:2020. Also, these definitions are categorical, whereas to compare emergent languages we need to quantify degrees of compositionality.

Some researchers equate compositionality with the ability to correctly refer to unseen composite inputs (e.g., Kottur:etal:2017; cogswell2019). This approach measures the generalization ability of a language, but it does not provide any insight into how this ability comes about. Indeed, one of our main results below is that emergent languages can attain perfect generalization without abiding by intuitive notions of compositionality.

Topographic similarity has become the standard way to quantify the compositionality of emergent languages (e.g., Brighton2006; Lazaridou:etal:2018; Li2019). This metric measures whether the distance between two meanings correlates with the distance between the messages expressing them. While more informative than generalization, topographic similarity is still rather agnostic about the nature of composition. For example, when using, as is standard practice, Levenshtein distance to measure message distance, an emergent language transparently concatenating symbols in a fixed order and one mixing deletion and insertion operations on free-ordered symbols can have the same topographic similarity.

We introduce here two more ‘‘opinionated’’ measures of compositionality that capture some intuitive properties of what we would expect to happen in a compositional emergent language. One possibility we consider is that order-independent juxtapositions of primitive forms could denote the corresponding union of meanings, as in English noun conjunctions: cats and dogs, dogs and cats. The second still relies on juxtaposition, but exploits order to denote different classes of meanings, as in English adjective-noun phrases: red triangle, blue square. Both strategies result in disentangled messages, where each primitive symbol (or symbol+position pair) univocally refers to a distinct primitive meaning independently of context. We consequently take inspiration from work on disentanglement in representation learning Suter2019 to craft measures that quantify whether an emergent language follows one of the proposed composition strategies.

Equipped with these metrics, we proceed to ask the following questions. First, are neural agents able to generalize to unseen input combinations in a simple communication game? We find that generalizing languages reliably emerge when the input domain is sufficiently large. This somewhat expected result is important nevertheless, as failure-to-generalize claims in the recent literature are often based on very small input spaces. Second, we unveil a complex interplay between compositionality and generalization. On the one hand, there is no correlation between our compositionality metrics and the ability to generalize, as emergent languages successfully refer to novel composite concepts in inscrutably entangled ways. (Order-dependent) compositionality, however, if not necessary, turns out to be a sufficient condition for generalization. Finally, more compositional languages are easier to learn for new agents, including agents that are architecturally different from the ones that evolved the language. This suggests that, while composition might not be a ‘‘natural’’ outcome of the need to generalize, it is a highly desirable one, as compositional languages will more easily be adopted by a large community of different agents. We return to the implications of our findings in the discussion.

2 Setup

2.1 The game

We designed a variant of Lewis’ signaling game Lewis:1969. The game proceeds as follows:

  1. Sender network receives one input $i$ and chooses a sequence of symbols from its vocabulary of size $c_{voc}$ to construct a message $m$ of fixed length $c_{len}$.

  2. Receiver network consumes $m$ and outputs $\hat{i}$.

  3. Agents are successful if $\hat{i} = i$, that is, Receiver reconstructs Sender’s input.

Each input $i$ of the reconstruction game is comprised of $i_{att}$ attributes, each with $n_{val}$ possible values. Both $i_{att}$ and $n_{val}$ are varied across settings. We represent each attribute as an $n_{val}$-dimensional one-hot vector. An input $i$ is given by the concatenation of its attributes. For a given ($i_{att}$, $n_{val}$), the number of input samples is $|I| = n_{val}^{i_{att}}$.
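The input representation just described can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `all_inputs` and `one_hot_concat` are hypothetical, attribute values are encoded as integers starting at 0, and the small sizes in the usage lines are chosen for readability rather than taken from the experiments.

```python
from itertools import product

def all_inputs(i_att, n_val):
    """Enumerate every attribute-value combination as a tuple of ints."""
    return list(product(range(n_val), repeat=i_att))

def one_hot_concat(values, n_val):
    """Concatenate one one-hot vector of length n_val per attribute."""
    vec = [0] * (len(values) * n_val)
    for k, v in enumerate(values):
        vec[k * n_val + v] = 1
    return vec

# Illustrative sizes only (assumed, not the paper's settings).
inputs = all_inputs(i_att=2, n_val=4)
print(len(inputs))                       # 16, i.e. 4 ** 2
print(one_hot_concat((1, 3), n_val=4))   # [0, 1, 0, 0, 0, 0, 0, 1]
```

The number of distinct inputs grows as the number of values raised to the number of attributes, which is why both quantities control the size of the input space.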

This environment, which can be seen as an extension of that of Kottur:etal:2017, is one of the simplest possible settings to study the emergence of reference to composite concepts (here, combinations of multiple attributes). Attributes can be seen as describing object properties such as color and shape, with their values specifying those properties for particular objects (red, round). Alternatively, they could be seen as slots in an abstract semantic tree (e.g., agent and action), with the values specifying their fillers (e.g., dog, barking). In the name of maximally simplifying the setup and easing interpretability, unlike Kottur:etal:2017, we consider a single-step game. We moreover focus on input reconstruction instead of discrimination of a target input among distractors as the latter option adds further complications: for example, languages in that setup have been shown to be sensitive to the number and distribution of the distractors Lazaridou:etal:2018.

For a fixed input space size $|I|$, we endow Sender with large enough channel capacity $|C| = c_{voc}^{c_{len}}$ (vocabulary size $c_{voc}$ and message length $c_{len}$) to express the whole input space (i.e., $|C| \geq |I|$). Unless explicitly mentioned, we run different initializations per setting. See Appendix 8.1 for details about the range of tested settings. The game is implemented in EGG Kharitonov:etal:2019. (Footnote 1: Code can be found at)

2.2 Agent architecture

Both agents are implemented as single-layer GRU cells (Cho2014) with hidden states of size 500. (Footnote 2: Experiments with GRUs of different capacity are reported in the Appendix. We also informally replicated our main results with LSTMs, which were slower to converge. We were unable to adapt Transformers to successfully play our game.) Sender encodes the input $i$ in a message $m$ of fixed length $c_{len}$ as follows. First, a linear layer maps the input vector into the initial hidden state of Sender. Next, the message is generated symbol-by-symbol by sampling from a Categorical distribution over the vocabulary, parameterized by a linear mapping from Sender’s hidden state. The generated symbols are fed back to the cell. At test time, instead of sampling, symbols are selected greedily.

Receiver consumes the entire message $m$. Further, we pass its hidden state through a linear layer and consider the resulting vector as a concatenation of $i_{att}$ probability vectors over $n_{val}$ values each. As a loss, we use the average cross-entropy between these distributions and Sender’s input.
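The Receiver loss can be illustrated with a minimal sketch. It assumes, beyond what the text states, that the linear layer outputs unnormalized scores (logits) that are turned into probabilities by a per-attribute softmax; the names `softmax` and `receiver_loss` are hypothetical helpers, and targets are given as integer value indices rather than one-hot vectors.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def receiver_loss(output, target_values, n_val):
    """Average cross-entropy over attributes.

    output: flat list of len(target_values) * n_val scores (assumed logits).
    target_values: the true value index of each attribute of Sender's input.
    """
    losses = []
    for k, v in enumerate(target_values):
        probs = softmax(output[k * n_val:(k + 1) * n_val])
        losses.append(-math.log(probs[v]))
    return sum(losses) / len(losses)

# With uniform scores, each attribute contributes log(n_val) to the loss.
print(receiver_loss([0.0] * 8, (1, 3), n_val=4))
```

Averaging over attributes keeps the loss scale comparable across settings with different numbers of attributes.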

2.3 Optimization

Popular approaches for training with discrete communication include Gumbel-Softmax Maddison2016; Jang2016, REINFORCE Williams1992, and a hybrid in which the Receiver gradients are calculated via back-propagation and those of Sender via REINFORCE (Schulman2015). We use the latter, as recent work (e.g., Chaabouni:etal:2019) found it to converge more robustly. We apply standard tricks to improve convergence: (a) running mean baseline to reduce the variance of the gradient estimates Williams1992, and (b) a term in the loss that favors higher entropy of Sender’s output, thus promoting exploration. The obtained gradients are passed to the Adam optimizer (Kingma2014) with learning rate .
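The two convergence tricks can be made concrete with a scalar sketch of the quantity whose gradient, with respect to Sender's parameters, gives the REINFORCE estimate. This is a hedged illustration, not the EGG implementation: `RunningMeanBaseline`, `sender_objective`, and the coefficient `entropy_coef` are hypothetical names, the numbers in the loop are made up, and the Receiver loss is treated as a constant reward signal for Sender.

```python
class RunningMeanBaseline:
    """Running mean of the losses seen so far, used as a REINFORCE baseline."""

    def __init__(self):
        self.mean = 0.0
        self.count = 0

    def update(self, loss):
        self.count += 1
        self.mean += (loss - self.mean) / self.count

def sender_objective(loss, log_prob, entropy, baseline, entropy_coef):
    """Surrogate objective for Sender.

    loss: Receiver's loss for the sampled message (no gradient through it).
    log_prob: log-probability of the sampled message under Sender.
    entropy: entropy of Sender's output distribution.
    """
    policy_term = (loss - baseline) * log_prob   # (a) baseline reduces variance
    return policy_term - entropy_coef * entropy  # (b) bonus for higher entropy

baseline = RunningMeanBaseline()
# Made-up (loss, log_prob, entropy) triples, for illustration only.
for loss, log_prob, entropy in [(2.0, -3.0, 1.2), (1.0, -2.5, 1.0)]:
    objective = sender_objective(loss, log_prob, entropy,
                                 baseline.mean, entropy_coef=0.1)
    baseline.update(loss)
    print(round(objective, 3))
```

In an actual training loop the same expression would be built from differentiable tensors, with Receiver's parameters updated by ordinary back-propagation on the loss itself.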

3 Measurements

3.1 Compositionality

Topographic similarity (topsim) Brighton2006 is commonly used in language emergence studies as a quantitative proxy for compositionality (e.g., Lazaridou:etal:2018; Li2019). Given a distance function in the input space (in our case, attribute value overlap, as attributes are unordered, and values categorical) and a distance function in message space (in our case, following standard practice, minimum edit distance between messages), topsim is the (Spearman) correlation between pairwise input distances and the corresponding message distances. The measure can detect a tendency for messages with similar meanings to be similar in form, but it is relatively agnostic about the type of similarity (as long as it is captured by minimum edit distance).
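A minimal, dependency-free sketch of topsim follows. It assumes that the input distance is the number of attributes on which two inputs differ (one way to turn attribute-value overlap into a distance) and implements Spearman correlation as the Pearson correlation of average ranks; all function names are hypothetical.

```python
from itertools import combinations, product

def edit_distance(a, b):
    """Levenshtein (minimum edit) distance between two symbol sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def attribute_distance(u, v):
    """Number of attributes on which two inputs differ (assumed input distance)."""
    return sum(x != y for x, y in zip(u, v))

def ranks(xs):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(xs)), key=lambda k: xs[k])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def topsim(inputs, messages):
    """Spearman correlation between pairwise input and message distances."""
    pairs = list(combinations(range(len(inputs)), 2))
    d_in = [attribute_distance(inputs[a], inputs[b]) for a, b in pairs]
    d_msg = [edit_distance(messages[a], messages[b]) for a, b in pairs]
    return pearson(ranks(d_in), ranks(d_msg))

# Toy language that copies each attribute value into its own position.
toy_inputs = list(product(range(2), repeat=2))
print(topsim(toy_inputs, toy_inputs))
```

The sketch is quadratic in the number of inputs, so for large input spaces one would compute it on a sample of pairs.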

We complement topsim with two measures that probe for more specific types of compositionality, which we believe capture what deep-agent emergent-language researchers seek when interested in compositional languages. In most scenarios currently considered in this line of research, the composite inputs agents must refer to are sets or sequences of primitive elements: for example, the values of a set of attributes, as in our experiment. In this restricted setup, a compositional language is a language where symbols independently referring to primitive input elements can be juxtaposed to jointly refer to the input ensembles. Consider a language with a symbol r referring to input element color:red and another symbol l referring to weight:light, where r and l can be juxtaposed (possibly, in accordance with the syntactic rules of the language) to refer to the input set color:red, weight:light. This language is intuitively compositional. On the other hand, a language where both r and l refer to these two input elements, but only when used together, whereas other symbol combinations would refer to color:red and weight:light in other contexts, is intuitively not compositional. Natural languages support forms of compositionality beyond the simple juxtaposition of context-independent symbols to denote ensembles of input elements we are considering here (e.g., constructions that denote the application of functions to arguments). However, we believe that the proposed intuition is adequate for the current state of affairs in language emergence research.

The view of compositionality we just sketched is closely related to the idea of disentanglement in representation learning. Disentangled representations are expected to enable a consequent model to generalize on new domains and tasks (Bengio2013). Even if this claim has been challenged (Bozkurt2019; Locatello2019), several interesting metrics have been proposed to quantify disentanglement, as reviewed in Suter2019. We build in particular upon the Information Gap disentanglement measure of Chen2018, evaluating how well representations capture independence in the input sets.

Our positional disentanglement (posdis) metric measures whether symbols in specific positions tend to univocally refer to the values of a specific attribute. This order-dependent strategy is commonly encountered in natural language structures (and it is a pre-condition for sophisticated syntactic structures to emerge). Consider English adjective-noun phrases with a fully intersective interpretation, such as yellow triangle. Here, the words in the first slot will refer to adjectival meanings, those in the second to nominal meanings. In our simple environment, it might be the case that the first symbol is used to discriminate among values of an attribute, and the second to discriminate among values of another attribute. Let’s denote by $s_j$ the $j$-th symbol of a message and by $a_1^j$ the attribute that has the highest mutual information with $s_j$: $a_1^j = \arg\max_a I(s_j; a)$. In turn, $a_2^j$ is the second highest informative attribute, $a_2^j = \arg\max_{a \neq a_1^j} I(s_j; a)$. Denoting by $H(s_j)$ the entropy of the $j$-th position (used as a normalizing term), we define posdis as:

$$\mathit{posdis} = \frac{1}{c_{len}} \sum_{j=1}^{c_{len}} \frac{I(s_j; a_1^j) - I(s_j; a_2^j)}{H(s_j)} \qquad (1)$$

where $c_{len}$ is the message length. We ignore positions with zero entropy. Eq. 1 captures the intuition that, for a language to be compositional given our inputs, each position of the message should only be informative about a single attribute. However, unlike the related measure proposed by Resnick2019, it does not require knowing which set of positions encodes a particular attribute, which makes it computationally simpler (only linear in $c_{len}$).

Posdis assumes that a language uses positional information to disambiguate symbols. However, we can easily imagine a language where symbols univocally refer to distinct input elements independently of where they occur, making order irrelevant. (Footnote 3: This is not unlike what happens in order-insensitive constructions such as English conjunctions: dogs and cats, cats and dogs.) Hence, we also introduce bag-of-symbols disentanglement (bosdis). The latter maintains the requirement for symbols to univocally refer to distinct meanings, but captures the intuition of a permutation-invariant language, where only symbol counts are informative. Denoting by $n_j$ a counter of the $j$-th symbol in a message, bosdis is given by:

$$\mathit{bosdis} = \frac{1}{c_{voc}} \sum_{j=1}^{c_{voc}} \frac{I(n_j; a_1^j) - I(n_j; a_2^j)}{H(n_j)} \qquad (2)$$

where $c_{voc}$ is the vocabulary size, and $a_1^j$ and $a_2^j$ are the attributes with the highest and second-highest mutual information with $n_j$.
In all experiments, the proposed measures topsim, posdis and bosdis are calculated on the train set.
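Both disentanglement measures share the same information-gap computation and differ only in the features they inspect (symbols per position versus counts per vocabulary symbol). The sketch below is a plain-Python illustration under assumptions: probabilities are estimated by counting over the given data, zero-entropy features are skipped and the average is taken over the remaining ones (one reading of "we ignore positions with zero entropy"), and the function names and toy languages are hypothetical.

```python
import math
from collections import Counter
from itertools import product

def entropy(xs):
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def information_gap(features, attributes):
    """Mean normalized gap between the two most informative attributes."""
    gaps = []
    for f in features:
        h = entropy(f)
        if h <= 1e-12:  # constant feature: ignored
            continue
        mi = sorted((mutual_information(f, a) for a in attributes), reverse=True)
        gaps.append((mi[0] - mi[1]) / h)
    return sum(gaps) / len(gaps) if gaps else 0.0

def posdis(inputs, messages):
    positions = [list(col) for col in zip(*messages)]
    attributes = [list(col) for col in zip(*inputs)]
    return information_gap(positions, attributes)

def bosdis(inputs, messages, vocab):
    counts = [[m.count(s) for m in messages] for s in vocab]
    attributes = [list(col) for col in zip(*inputs)]
    return information_gap(counts, attributes)

inputs = list(product(range(3), repeat=2))
# Positional toy language: position k copies attribute k, symbols are shared.
positional = [(a, b) for a, b in inputs]
# Bag toy language: disjoint symbols per attribute, order depends on parity.
bag = [(a, 3 + b) if (a + b) % 2 == 0 else (3 + b, a) for a, b in inputs]

print(posdis(inputs, positional), bosdis(inputs, positional, range(3)))
print(posdis(inputs, bag), bosdis(inputs, bag, range(6)))
```

On these toy languages the positional code should score high on posdis and low on bosdis, and the bag code the other way round, which is the contrast the two measures are designed to capture.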

In Appendix 8.2, we illustrate how the three metrics behave differently on three miniature languages. Across the languages of all converging runs in our simulations, their Spearman correlations are: topsim/posdis: ; topsim/bosdis: ; posdis/bosdis: . These correlations, while not extremely high, are statistically significant (), which is reassuring as all metrics attempt to capture compositionality. It is also in line with reasonable expectations that the most ‘‘opinionated’’ posdis measure is the one that behaves most differently from topsim.

3.2 Generalization

In our setup, generalization can be straightforwardly measured by splitting all possible distinct inputs so that the test set only contains inputs with attribute combinations that were not observed at training. Generalization is then simply quantified by test accuracy. In intuitive terms, at training time the agents are exposed to blue triangles and red circles, but blue circles only appear at test time. This requires Sender to generate new messages, and Receiver to correctly infer their meaning. If a blue circle is accurately reconstructed, then agents do generalize.

For all the considered settings, we split the possible distinct inputs into train and test items. This implies that the absolute training/test set sizes increase with input dimension (this issue is further discussed in Appendix 8.4).
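Because every distinct input is itself a unique attribute combination, a random partition of the distinct inputs already guarantees that test items were never observed during training. The sketch below illustrates this under assumptions: `split_inputs`, `test_fraction`, and `seed` are hypothetical names, and the fraction used in the usage line is illustrative rather than the proportion used in the experiments.

```python
import random
from itertools import product

def split_inputs(i_att, n_val, test_fraction, seed=0):
    """Randomly partition all distinct inputs into disjoint train and test sets."""
    inputs = list(product(range(n_val), repeat=i_att))
    random.Random(seed).shuffle(inputs)
    n_test = int(len(inputs) * test_fraction)
    return inputs[n_test:], inputs[:n_test]

# Illustrative sizes and fraction (assumptions, not the paper's values).
train, test = split_inputs(i_att=2, n_val=10, test_fraction=0.2)
print(len(train), len(test))
print(set(train).isdisjoint(test))
```

A stricter variant could additionally check that every individual attribute value still occurs somewhere in the training set, so that test items are novel only as combinations.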

Finally, we only evaluate generalization for runs that successfully converged, where convergence is operationalized as training-set accuracy.

4 Generalization emerges ‘‘naturally’’ if the input space is large

Figure 1: Average accuracy on unseen combinations as a function of input size of successful runs. The x-axis is ordered by increasing input size $|I|$. Brackets denote ($i_{att}$, $n_{val}$), the number of attributes and the number of values per attribute. Vertical bars represent the standard error of the mean (SEM). Horizontal brackets group settings with same $|I|$ but different ($i_{att}$, $n_{val}$).

Fig. 1 shows that emergent languages are able to almost perfectly generalize to unseen combinations as long as input size is sufficiently large (input size/test accuracy Spearman , ). The figure also shows that the way in which a large input space is obtained (manipulating the number of attributes $i_{att}$ or the number of values $n_{val}$) does not matter (no significant accuracy difference between the bracketed runs, according to a set of t-tests with ). Moreover, the correlation is robust to varying agents’ capacity (Appendix 8.3; see Resnick2019 for a thorough study of how agent capacity impacts generalization and compositionality). Importantly, the effect is not simply a product of larger input sizes coming with larger training corpora, as we replicate it in Appendix 8.4 while keeping the number of distinct training examples fixed, but varying input combinatorial variety. What matters is that, in the training data, specific attribute values tend to occur with a large range of values from other attributes, providing a cue about the composite nature of the input.

That languages capable of generalizing will only emerge when the input is varied enough might seem obvious, and it has been shown before in mathematical simulations of language emergence nowak2000evolution, as well as in studies of deep network inductive biases zhao2018. However, our result suggests an important caveat when interpreting experiments based on small input environments that report failures in the generalization abilities of deep networks (e.g., Kottur:etal:2017; Lake:Baroni:2017). Before assuming that special architectures or training methods are needed for generalization to emerge, such experiments should be repeated with much larger/varied input spaces, where it is harder for agents to develop ad-hoc strategies overfitting the training data and failing to generalize.

We also considered the relation between channel capacity and language emergence. Note that is a prerequisite for successful communication, and a perfectly compositional language could already generalize at the lower bound. Indeed, limiting channel capacity has been proposed as an important constraint for the emergence of compositionality Nowak:Krakauer:1999. However, we find that, when is sufficiently large to support generalization, our deep agents need in order to even converge at training time. The minimum ratio across all converging runs for each configuration with (the settings where we witness generalizing languages) is on average 5.9 (s.d.: 4.4).

Concretely, this implies that none of our successful languages is as compact as a minimal fully-compositional solution would afford. Appendix 8.5 reports experiments focusing, more specifically, on the relation between channel capacity and generalization, showing that it is essential for to be above a large threshold to reach near-perfect accuracy, and further increasing beyond that does not hamper generalization.

5 Generalization does not require compositionality

Having established that emergent languages can generalize to new composite concepts, we test whether languages that generalize better are also more compositional. Since bosdis and topsim correlate with (Appendix 8.6), we compute Spearman correlations between test accuracy and compositionality metrics across all converging runs of each (, , , ) configuration separately. Surprisingly, in just 4 out of distinct settings the correlation is significant () for at least 1 measure. (Footnote 4: , and (different) significant settings for topsim, posdis and bosdis, respectively.)

We further analyze the (=, =, =, =) setting, as it has a large number of generalizing runs, and it is representative of the general absence of correlation we also observe elsewhere. Fig. 2 confirms that even non-compositional languages (w.r.t. any definition of compositionality) can generalize well. Indeed, for very high test accuracy (), we witness a large spread of posdis (between and ), bosdis (between and ) and topsim (between and ). In other words, deep agents are able to communicate about new attribute combinations while using non-compositional languages. We note moreover that even the most compositional languages according to any metric are far from the theoretical maximum ( for all metrics).

We observe however that the top-left quadrants of Fig. 2 panels are empty. In other words, it never happens that a highly compositional language has low accuracy. To verify this more thoroughly, for each compositionality measure , we select those languages, among all converging runs in all configurations, that have , and compute the proportion of them that reaches high test accuracy (). We find that this ratio equals , , and for posdis, bosdis, and topsim respectively. That is, while compositionality is not a necessary condition for generalization, it appears that the strongest form of compositionality, namely posdis, is at least sufficient for generalization. This provides some evidence that compositionality is still a desirable feature, as further discussed in Section 6.

Figure 2: Compositionality as a function of generalization, with panels (a) posdis, (b) bosdis and (c) topsim. Each point represents a successful run in the (=, =, =, =) setting. Red and black points correspond respectively to the medium- and low-disentanglement languages analyzed in Section 5 and Appendix 8.7.

We gain further insights on what it means to generalize without full compositionality by taking a deeper look at the language shown in red in Fig. 2, that has near-perfect generalization accuracy (99%), and whose posdis score (0.70), while near the relative best, is still far from the theoretical maximum (we focus on posdis since it is the easiest compositional strategy to qualitatively characterize). As its behavior is partially interpretable, this ‘‘medium-posdis’’ language offered us clearer insights than more strongly entangled cases. We partially analyze one of the latter in Appendix 8.7.

Note that, with (=, =), a (=, =) channel should suffice for a perfectly positionally disentangled strategy. Why does the analyzed language use (=) instead? Looking at its mutual information profile (Appendix Table 5), we observe that positions 2 and 3 (pos2 and pos3) are respectively denoting attributes 2 and 1 (att2 and att1): pos3 has high mutual information with att1 and low mutual information with att2; the opposite holds for pos2. The remaining position, pos1, could then be simply redundant with respect to the others, or encode noise ignored by Receiver. However, this is not quite the case, as the language settled instead for a form of ‘‘leaky disentanglement’’. The two disentangled positions do most of the job, but the third, more entangled one, is still necessary for perfect communication.

To see this, consider the ablations in Table 1. Look first at the top block, where the trained Receiver of the relevant run is fed messages with the symbol in one original position preserved, the others shuffled. Confirming that communication is largely happening by disentangled means, preserving pos2 alone suffices for Receiver to guess a large majority of att2 values, and keeping pos3 unchanged is enough to guess almost 90% of att1 values correctly. Conversely, preserving pos1 alone causes a complete drop in accuracy for both attributes. However, neither pos2 nor pos3 is sufficient on its own to perfectly predict the corresponding attribute. Indeed, the results in the bottom block of the table (one symbol shuffled while the others stay in their original position) confirm that pos1 carries useful complementary information: when fixing the latter and either one of the other positions, we achieve 100% accuracy for the relevant attribute (att2 for pos1+pos2 and att1 for pos1+pos3).

In sum, pos2 and pos3 largely specialized as predictors of att2 and att1, respectively. However, they both have a margin of ambiguity (in pos2 and pos3 there are 96 and 98 symbols effectively used, respectively, whereas a perfect 1-to-1 strategy would require 100). When the symbols in these positions do not suffice, pos1, which can refer to both attributes, serves a disambiguating role. We quantified this complementary function as follows. We define the cue validity of $s_p$ (symbol in position $p$) w.r.t. an attribute $a$ as $CV(s_p, a) = \max_{\bar{a}} P(\bar{a} \mid s_p)$, where $\bar{a}$ iterates over all possible values of $a$. $CV(s_{pos1}, att2)$ is significantly higher in those (train/test) messages where $CV(s_{pos2}, att2)$ is below average. Similarly, $CV(s_{pos1}, att1)$ is significantly higher in messages where $CV(s_{pos3}, att1)$ is below average ( in both cases). We might add that, while there is a huge difference between our simple emergent codes and natural languages, the latter are not perfectly disentangled either, as they feature extensive lexical ambiguity, typically resolved in a phrasal context Piantadosi:etal:2012.
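A cue-validity computation of this kind can be sketched as follows, under the assumption that the cue validity of a symbol with respect to an attribute is the largest conditional probability of any value of that attribute given the symbol, estimated from co-occurrence counts. The function name `cue_validity` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def cue_validity(symbols, attribute_values):
    """Map each symbol s to max over values v of P(v | s), estimated by counts.

    symbols: the symbol found in one fixed position of each message.
    attribute_values: the value of one fixed attribute for the same inputs.
    """
    by_symbol = defaultdict(Counter)
    for s, v in zip(symbols, attribute_values):
        by_symbol[s][v] += 1
    return {s: max(c.values()) / sum(c.values()) for s, c in by_symbol.items()}

# Toy data: symbol "x" always signals value 0; symbol "y" is ambiguous.
symbols = ["x", "x", "y", "y", "y", "y"]
values = [0, 0, 1, 1, 1, 2]
print(cue_validity(symbols, values))  # {'x': 1.0, 'y': 0.75}
```

A symbol with cue validity 1 pins down the attribute value on its own, while lower values indicate the kind of residual ambiguity that another position would need to resolve.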

                            att1    att2    both atts
fixing 1 position
    pos1                       1       3       0
    pos2                       1      68       0
    pos3                      89       1       1
shuffling 1 position
    pos1                      89      69      61
    pos2                     100       3       3
    pos3                       1     100       1

Table 1: Feeding shuffled messages from the analyzed language to the corresponding trained Receiver. Average percentage accuracy across 10 random shufflings (s.d. always ) when: top: symbols in all positions but one are shuffled across the data-set; bottom: symbols in a single position are shuffled across the data-set. The data-set includes all training and test messages produced by the trained Sender and correctly decoded in their original form by Receiver (99% of total messages).
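The shuffling manipulation behind Table 1 can be sketched as a permutation of the symbols found in selected positions across the whole set of messages. The helper below is a hypothetical illustration (`shuffle_positions` and `seed` are assumed names, positions are 0-indexed); the trained Receiver that would consume the shuffled messages is not modeled here.

```python
import random

def shuffle_positions(messages, positions, seed=0):
    """Permute, across messages, the symbols found in each listed position.

    Positions not listed keep their original symbols, so passing all
    positions but one reproduces the "fixing 1 position" condition, and
    passing a single position reproduces "shuffling 1 position".
    """
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*messages)]
    for p in positions:
        rng.shuffle(columns[p])
    return [tuple(row) for row in zip(*columns)]

messages = [(1, 10, 20), (2, 11, 21), (3, 12, 22), (4, 13, 23)]
# Keep position 0 intact, shuffle the other two.
print(shuffle_positions(messages, positions=[1, 2]))
```

Each column is permuted independently, so the per-position symbol frequencies are preserved while the alignment between positions is destroyed.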

6 Compositionality and ease of transmission

The need to generalize to new composite inputs does not appear to constitute a sufficient pressure to develop a compositional language. Given that compositionality is ubiquitous in natural language, we conjecture that it has other beneficial properties, making it advantageous once agents chanced upon it. Compositional codes are certainly easier to read out by humans (as shown by our own difficulty in qualitatively analyzing highly entangled languages), and we might hypothesize that this ease-of-decoding is shared by computational agents. A long tradition of subject studies and computational simulations has shown that the need to transmit a language across multiple generations or to populations of new learners results in the language being more compositional (e.g., kirby2001spontaneous; Kirby:etal:2015; Verhoef:etal:2015; Cornish:etal:2017; cogswell2019; Guo:etal:2019; Li2019). Our next experiments are closely related to this earlier work, but we adopt the opposite perspective. Instead of asking whether the pressure to transmit a language will make it more compositional, we test whether languages that have already emerged as compositional, being easier to decode, are more readily transmitted to new learners. (Footnote 5: Li2019 established this for hand-crafted languages; we extend the result to spontaneously emerging ones.)

Specifically, we run games in the largest input setting (=, =), varying the channel parameters. We select the pairs of agents that achieved a high level of generalization accuracy (0.80). Next, following the paradigm of Li2019, we freeze Sender, and train a new Receiver from scratch. We repeat this process times per game, initializing new Receivers with different random seeds. Once the newly formed pair of agents is successful on the training set, we measure its test accuracy. We also report speed of learning, measured by area under the epochs vs. training accuracy curve. We experiment with three Receiver architectures. The first two, GRU (500) and GRU (50), are GRUs with hidden layer sizes of 500 (identical to the original Receiver) and 50, respectively. The third is a two-layer Feed-Forward Network (FFN) with a ReLU non-linearity and hidden size 500. The latter Receiver takes the flattened one-hot representation of the message as its input. This setup allows probing ease of language transmission across models of different complexity. We leave the study of language propagation across multiple generations of speakers to future work.
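Two ingredients of this protocol lend themselves to a short sketch: the learning-speed measure and the message encoding fed to the FFN Receiver. The code below is illustrative only. It assumes unit spacing between epochs and a trapezoidal rule for the area, and it encodes symbols as integers starting at 0; `learning_speed` and `flatten_message` are hypothetical names.

```python
def learning_speed(accuracies):
    """Trapezoidal area under the training-accuracy curve, one point per epoch."""
    return sum((a + b) / 2 for a, b in zip(accuracies, accuracies[1:]))

def flatten_message(message, c_voc):
    """Flattened one-hot encoding: one block of c_voc entries per position."""
    vec = [0] * (len(message) * c_voc)
    for p, s in enumerate(message):
        vec[p * c_voc + s] = 1
    return vec

# A learner that reaches high accuracy earlier accumulates a larger area.
fast = [0.0, 0.8, 1.0, 1.0]
slow = [0.0, 0.2, 0.5, 1.0]
print(learning_speed(fast) > learning_speed(slow))  # True
print(flatten_message((2, 0), c_voc=3))             # [0, 0, 1, 1, 0, 0]
```

Comparing areas is only meaningful when the curves cover the same number of epochs, so in practice the curves would be truncated or padded to a common length.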

                  posdis                       bosdis                       topsim
                  GRU(500)  GRU(50)  FFN       GRU(500)  GRU(50)  FFN       GRU(500)  GRU(50)  FFN
Learning Speed    0.87      0.71     0.35      0.85      0.68     0.33      0.87      0.71     0.35
Generalization    0.80      0.55     0.50      0.81      0.55     0.51      0.79      0.54     0.48
Table 2: Spearman correlation between compositionality metrics and ease-of-transmission measures for (=, =, =, =). All values are statistically significant ().

Results in the same setting studied in Section 5 are presented in Table 2 (experiments with other setups are in Appendix 8.8). Both learning speed and generalization accuracy of new Receivers are strongly positively correlated with degree of compositionality. The observed correlations reach values almost as high as for learning speed and for generalization, supporting our hypothesis that, when emergent languages are compositional, they are simpler to understand for new agents, including smaller ones (GRU (50)), and those with a different architecture (FFN).

7 Discussion

The natural emergence of generalization

There has been much discussion on the generalization capabilities of neural networks, particularly in linguistic tasks where humans rely on compositionality (e.g., Fodor:Lepore:2002; Marcus:2003; vanderVelde:etal:2004; Brakel:Frank:2009; Kottur:etal:2017; Lake:Baroni:2017; Andreas:2019; Hupkes:etal:2019; Resnick2019). In our setting, the emergence of generalization is very strongly correlated with variety of the input environment. While this result should be replicated in different conditions, it suggests that it is dangerous to study the generalization abilities of neural networks in ‘‘thought experiment’’ setups where they are only exposed to a small pool of carefully-crafted examples. Before concluding that garden-variety neural networks do not generalize, the simple strategy of exposing them to a richer input should always be tried. Indeed, even studies of the origin of human language conjecture that the latter did not develop sophisticated generalization mechanisms until pressures from an increasingly complex environment forced it to evolve in that direction Bickerton:2014; Hurford:2014.

Generalization without compositionality

Our most important result is that there is virtually no correlation between whether emergent languages are able to generalize to novel composite inputs and the presence of compositionality in their messages (Andreas:2019 noted in passing the emergence of non-compositional generalizing languages, but did not explore this phenomenon systematically). Supporting generalization to new composite inputs is seen as one of the core purposes of compositionality in natural language (e.g., Pagin:Westerstahl:2010). While there is no doubt that compositional languages do support generalization, we also found other systems spontaneously arising that generalize without being compositional, at least according to our intuitive measures of compositionality. This has implications for the ongoing debate on the origins of compositionality in natural language, (e.g., Townsend:etal:2018, and references there), as it suggests that the need to generalize alone might not constitute a sufficient pressure to develop a fully compositional language. Our result might also speak to those linguists who are exploring the non-fully-compositional corners of natural language (e.g., Goldberg:2019). A thorough investigation of neural network codes that can generalize while being partially entangled might shed light on similar phenomena in human languages. Finally, and perhaps most importantly, recent interest in compositionality among AI researchers stems from the assumption that compositionality is crucial to achieve good generalization through language (e.g., Lake:Baroni:2017; Lazaridou:etal:2018; Baan:etal:2019). Our results suggest that the pursuit of generalization might be separated from that of compositionality, a point also recently made by Kharitonov:Baroni:2020 through hand-crafted simulations.

What is compositionality good for?

We observed that positional disentanglement, while not necessary, is sufficient for generalization. If agents develop a compositional language, they are then very likely to be able to use it correctly to refer to novel inputs. This supports the intuition that compositional languages are easier to fully understand. Indeed, when training new agents on emerged languages that generalize, it is much more likely that the new agents will learn them fast and thoroughly (i.e., they will be able to understand expressions referring to novel inputs) if the languages are already compositional according to our measures. That language transmission increases pressure for structured representations is an established fact (e.g., Kirby:etal:2015; Cornish:etal:2017). Here, we reversed the arrow of causality and showed that, if compositionality emerges (due to chance during initial language development), it will make a language easier to transmit to new agents. Compositionality might act like a "dominant" genetic feature: it might arise by a random mutation but, once present, it will survive and thrive, as it guarantees that languages possessing it will generalize and will be easier to learn. From an AI perspective, this suggests that trying to enforce compositionality during language emergence will increase the odds of developing languages that are quickly usable by wide communities of artificial agents, which might be endowed with different architectures. From the linguistic perspective, our results suggest an alternative view of the relation between compositionality and language transmission--one in which the former might arise by chance or due to other factors, but then makes the resulting language much easier to spread.

Compositionality and disentanglement

Language is a way to represent meaning through discrete symbols. It is thus worth exploring the link between the area of language emergence and that of representation learning (Bengio2013). We took this route, borrowing ideas from research on disentangled representations to craft our compositionality measures. We focused in particular on the intuition that, if emergent languages must denote ensembles of primitive input elements, they are compositional when they use symbols to univocally denote input elements independently of each other.

While the new measures we proposed are not highly correlated with topographic similarity, in most of our experiments they did not behave significantly differently from the latter. On the one hand, given that topographic similarity is an established way to quantify compositionality, this serves as a sanity check on the new measures. On the other, we are disappointed that we did not find more significant differences between the three measures.

Interestingly, one of the ways in which they did differ is that, when a language is positionally disentangled (and, to a lesser extent, bag-of-symbols disentangled), it is very likely that the language will be able to generalize--a guarantee we don’t have from the less informative topographic similarity.

The representation learning literature is not only proposing disentanglement measures, but also ways to encourage emergence of disentanglement in learned representations. As we argued that compositionality has, after all, desirable properties, future work could adapt methods for learning disentangled representations (e.g., Higgins2017; Kim2018) to let (more) compositional languages emerge.


We thank the reviewers for feedback that helped us to make the paper clearer.


8 Supplementary material

8.1 Grid search over (, , , )

We report in Table 3 the different (, , , ) combinations we explored. They were picked according to the following criteria:

  • so that agents are endowed with enough different messages to refer to all inputs;

  • discard some so that we have approximately the same number of settings per (, ) (between and different (, ));

  • include some (, ) that are large enough that they can be tested with all the considered (, ).

Unless it is mentioned explicitly, we run different initializations per setting.

Table 3 shows that, for large , GRU-agents need strictly larger than . This suggests that, for large , the emergence of a perfectly non-ambiguous compositional language, where each message symbol denotes only one attribute value and each attribute value is denoted by only one message symbol, is impossible.

(, )
5 10 50 100
, , , ,
(,) X X X X X X X X X
(,) X X X X X X X X X X X X
(,) X X X X X X X X X X
(,) X X X X X X X X X
(,) - X X X X X X X X X X
(,) X - X X X X X X X X
(,) {-, X} - X X X X - X X X
(,) - X X X X X X X X
(,) - X - X X X X X X
(,) X - X - X X X X X X X
(,) {-, X} - X X X X - X X X
Table 3: Grid search. ‘X’ indicates tested settings with at least one successful run. ‘-’ indicates tested settings without any successful run. Finally, blank cells correspond to settings that were not explored for the reasons indicated in the text.

8.2 Behavior of the compositionality measures on hand-crafted miniature languages

We construct simple miniature languages to illustrate the different behaviors of topsim, posdis and bosdis: Lang1, Lang2 and Lang3. We fix , , and . (Footnote 6: Only Lang3 uses the whole available .) Table 4 shows the input-message mappings of each language and reports their degree of compositionality. Note that all languages respect a bijective mapping between inputs and messages.

Lang1 is perfectly posdis-compositional (posdis=1). However, topsim , as symbols encode one attribute (we need the first two symbols to recover the value of the first attribute). Lang1 is penalized by topsim because it does not have a one-to-one attribute-position mapping, a feature that is arguably orthogonal to compositionality.

Lang2 and Lang3 are equally topsim-compositional. Nonetheless, they differ fundamentally in the type of compositionality they feature: while Lang2 is more posdis-compositional, Lang3 is perfectly bosdis-compositional.

Input Lang1 Lang2 Lang3
0,0 0,0,0 0,0,0 0,0,4
0,1 0,0,1 0,0,1 0,0,5
0,2 0,0,2 0,0,2 0,0,6
0,3 0,0,3 0,0,3 0,0,7
1,0 0,1,0 1,2,0 1,4,1
1,1 0,1,1 1,2,1 1,5,1
1,2 0,1,2 1,2,2 1,6,1
1,3 0,1,3 1,2,3 1,7,1
2,0 2,0,0 2,3,0 2,4,2
2,1 2,0,1 2,3,1 2,5,2
2,2 2,0,2 2,3,2 2,6,2
2,3 2,0,3 2,3,3 2,7,2
3,0 2,1,0 3,1,0 3,4,3
3,1 2,1,1 3,1,1 3,3,5
3,2 2,1,2 3,2,1 3,3,6
3,3 2,1,3 3,3,1 3,3,7
Table 4: Input-message mappings and compositionality measures for the miniature languages.
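To make the posdis=1 claim for Lang1 concrete, the following is a minimal Python sketch of a positional-disentanglement score applied to the Lang1 column of Table 4. The formula is an assumption on our part, since this section does not restate it: for each message position we take the gap between the two largest mutual-information values with the input attributes, normalize by the entropy of that position, and average over non-constant positions. The function names (`entropy`, `mutual_information`, `posdis`) are illustrative and not taken from any released code.

```python
import math
from collections import Counter


def entropy(xs):
    """Shannon entropy (in bits) of the empirical distribution of xs."""
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in Counter(xs).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def posdis(inputs, messages):
    """Assumed positional-disentanglement score (see lead-in).

    For each position: (largest MI - second largest MI) / H(position),
    averaged over positions whose symbol is not constant.
    Requires at least two input attributes.
    """
    n_att = len(inputs[0])
    attributes = [[inp[a] for inp in inputs] for a in range(n_att)]
    gaps = []
    for j in range(len(messages[0])):
        symbols = [m[j] for m in messages]
        h = entropy(symbols)
        if h == 0:  # constant position: carries no information, skip it
            continue
        mis = sorted(
            (mutual_information(symbols, att) for att in attributes),
            reverse=True,
        )
        gaps.append((mis[0] - mis[1]) / h)
    return sum(gaps) / len(gaps) if gaps else 0.0


# Lang1 from Table 4: the first two symbols jointly encode attribute 1,
# the third symbol copies attribute 2.
prefix = {0: (0, 0), 1: (0, 1), 2: (2, 0), 3: (2, 1)}
inputs = [(a, b) for a in range(4) for b in range(4)]
lang1 = [prefix[a] + (b,) for a, b in inputs]

print(posdis(inputs, lang1))  # 1.0, matching the value stated for Lang1
```

Under this definition each Lang1 position is informative about exactly one attribute, so every per-position gap equals 1 even though two positions share the first attribute.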

8.3 Generalization for different agents’ capacity

We demonstrated in the main paper that agents’ generalization correlates with input size. In fact, agents can successfully reconstruct new attribute combinations if trained on large input spaces. The failure on small input spaces could be due to agents overfitting when presented with few training samples. To test this hypothesis, we repeat the training/evaluation experiments with GRU agents of different capacities in the following settings: (=, =), a small input space where agents do not generalize; and (=, =), a large input space where agents generalize. (Footnote 7: We only report experiments with GRUs, but the same results were replicated with differently-sized LSTMs.) Fig. 3 shows that, even for small-capacity agents (one-layer GRU with hidden state of size ), test accuracy is 0 for (=, =). Moreover, agents do not overfit when trained on (=, =) even with two-layer GRUs with hidden state of size .

Figure 3: Average accuracy on unseen combinations as a function of agents’ capacity (hidden size, number of layers) for input sizes (, ) and (, ). Vertical bars represent SEM.

8.4 Input space density

We showed in the main paper that generalization positively correlates with . We further investigate here whether it is simply the increasing absolute number of distinct training samples that is at the root of this phenomenon, or whether the variety of seen inputs also plays a role, independently of absolute input size.

To verify this, we design an experiment where we keep the absolute number of distinct input samples constant, but we change their density, defined as the proportion of sampled items over the size of the space they are sampled from. When sampling points from a small space, on average each value of an attribute will occur with a larger range of values from other attributes, compared to a larger space, which might provide more evidence about the combinatorial nature of the underlying space.

In practice, we fix (=, =, =) and sample points from spaces with = (density=), = (density=) and = (density=), respectively. As usual, we use 90% of the data for training, 10% for testing. In all cases, we make sure that all values are seen at least once during training (as visually illustrated in Fig. 4).

We obtain test accuracies of , and for densities , and respectively. That is, the high generalization observed in the main paper is (also) a consequence of density, and hence combinatorial variety, of the inputs the agents are trained on, and not (only) of the number of training examples.

(a) density=
(b) density=
(c) density=
Figure 4: Sampling the same number of input instances () with different densities. The axes of the shown matrices represent the values of two attributes, with the dark-red cells standing for inputs that were sampled. We ensure that each value of each attribute is picked at least once by always sampling the full diagonal.
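The sampling scheme of Fig. 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: we assume two attributes (as in the plotted matrices), the sample count and space sizes below are hypothetical placeholders rather than the values used in the experiment, and the function names are ours. The diagonal is always included and kept in the training split, so that every value of every attribute is seen during training.

```python
import random


def sample_inputs(n_values, n_samples, seed=0):
    """Sample n_samples distinct (att1, att2) pairs from an
    n_values x n_values space, always including the full diagonal.

    Returns the samples and their density (samples / size of the space).
    """
    if not n_values <= n_samples <= n_values ** 2:
        raise ValueError("need n_values <= n_samples <= n_values ** 2")
    rng = random.Random(seed)
    diagonal = [(v, v) for v in range(n_values)]
    off_diagonal = [
        (a, b) for a in range(n_values) for b in range(n_values) if a != b
    ]
    samples = diagonal + rng.sample(off_diagonal, n_samples - n_values)
    return samples, n_samples / n_values ** 2


def train_test_split(samples, test_fraction=0.1, seed=0):
    """Hold out test_fraction of the samples, never taking diagonal items,
    so that all attribute values remain in the training set."""
    rng = random.Random(seed)
    diagonal = [s for s in samples if s[0] == s[1]]
    rest = [s for s in samples if s[0] != s[1]]
    rng.shuffle(rest)
    n_test = min(len(rest), round(test_fraction * len(samples)))
    return diagonal + rest[n_test:], rest[:n_test]


# Hypothetical sizes: the same 100 samples drawn from ever larger spaces.
for n_values in (10, 20, 40):
    samples, density = sample_inputs(n_values, n_samples=100)
    train, test = train_test_split(samples)
    print(n_values, density, len(train), len(test))
```

Keeping the number of samples fixed while growing `n_values` lowers the density, which is the only quantity varied across the three panels.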

8.5 Impact of channel capacity on generalization

We showed in the main paper that generalization is very sensitive to input size. In this section, we focus on the relation between channel capacity and generalization.

Figure 5: Average accuracy on unseen combinations as a function of channel capacity of the successful runs. The x-axis is ordered by increasing channel capacity. In the brackets we note (, ). Vertical bars represent SEM.

First, when we aggregate across input sizes, Fig. 5 shows that has just a small effect on generalization, with a low Spearman correlation . Next, if we study this relation for specific large (where we observe generalization), we notice in Fig. 6 that agents need to be endowed with a above a certain threshold, with , in order to achieve almost perfect generalization. Moreover, contradicting previous claims (e.g., Kottur:etal:2017), having does not harm generalization.

(a) (, )
(b) (, )
Figure 6: Average accuracy on unseen combinations as a function of channel capacity of the successful runs for two different (, ). The x-axis is ordered by increasing channel capacity. In the brackets we note (, ). Vertical bars represent SEM.

8.6 Impact of channel capacity on the compositionality measures

A good compositionality measure should describe the structure of the language independently of the used channel, so the corresponding score should not be greatly affected by . However, Fig. 7 shows clear negative correlations of both topsim and bosdis with .

Figure 7: Average of different compositionality measures as a function of channel capacity (, ). Vertical bars represent SEM.

8.7 Analysis of example medium- and low-posdis languages

We present more data about the medium-posdis language analyzed in the main article, and we provide comparable evidence for a language with similarly excellent generalization (99%) but very low posdis (0.05), that we will call here low-posdis. The latter language is depicted in black in Fig. 2 of the main text. Both languages come from the training configuration with 2 100-valued input attributes and 3 100-symbol positions.

Mutual information profiles.

Table 5 reports mutual information for the two languages. Note how the highly entangled low-posdis is almost uniform across the table cells.

medium-posdis low-posdis
att1 att2 att1 att2
pos1 1.10 2.01 1.72 1.95
pos2 0.19 4.16 1.74 1.71
pos3 4.44 0.13 2.16 1.77
Table 5: Mutual information of each position with each attribute for the studied languages.

Vocabulary usage.

Considering all messages produced after training for the full training and test set inputs, effective vocabulary usage for pos1, pos2 and pos3 is as follows (recall that 100 symbols are maximally available):

  • medium-posdis: 91, 96, 98

  • low-posdis: 99, 99, 100

Although vocabulary usage is high in both cases, medium-posdis is slightly more parsimonious than low-posdis.

Ablation studies.

Table 6 reports ablation experiments with both languages. The results for medium-posdis are discussed in the main text. We observe here how virtually any ablation strongly impacts accuracy in denoting either attribute by the highly entangled low-posdis language. This points to another possible advantage of (partially) disentangled languages such as medium-posdis: since pos2 and pos3 are referring to att2 and att1 independently, in ablations in which they are untouched, Receiver can still retrieve partial information, by often successfully guessing the attribute they each refer to. We also report in the table the effect of shuffling across the positions of each message. This is very damaging not only for medium-posdis, but for low-posdis as well, showing that even the latter is exploiting positional information, albeit in an inscrutable, highly entangled way. Note in Fig. 2 of the main article that neither language has high bosdis.

                                medium-posdis         low-posdis
                                att1  att2  both      att1  att2  both
fixing 1 position       pos1       1     3     0         4     5     0
                        pos2       1    68     0         4     4     0
                        pos3      89     1     1         8     5     0
shuffling 1 position    pos1      89    69    61        31    18     6
                        pos2     100     3     3        30    25     8
                        pos3       1   100     1        15    20     3
shuffling msg                      1     2     0         2     4     0
Table 6: Feeding shuffled messages from the medium-posdis and low-posdis languages to the corresponding trained Receivers. Mean percentage accuracy across 10 random shufflings (standard deviation is always ) when: top: symbols in all positions but one are shuffled across the data-set; middle: symbols in a single position are shuffled across the data-set; bottom: shuffling the symbols within each message (ensuring all symbols move). The data-set includes all training and test messages produced by the trained Sender and correctly decoded in their original form by Receiver (99% of total messages).
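The three ablations described in the caption of Table 6 can be sketched as follows. This is a hedged illustration, not the code used for the reported numbers: messages are represented as tuples of symbols, each shuffled position is permuted independently (our assumption), and "ensuring all symbols move" is implemented as a random derangement of positions. All function names are hypothetical.

```python
import random


def shuffle_position(messages, pos, rng):
    """Shuffle the symbols found at position `pos` across the data-set,
    leaving all other positions untouched."""
    column = [m[pos] for m in messages]
    rng.shuffle(column)
    return [m[:pos] + (s,) + m[pos + 1:] for m, s in zip(messages, column)]


def shuffle_all_but(messages, keep, rng):
    """Shuffle every position except `keep` across the data-set
    (each position is permuted independently)."""
    out = list(messages)
    for pos in range(len(messages[0])):
        if pos != keep:
            out = shuffle_position(out, pos, rng)
    return out


def shuffle_within_message(message, rng):
    """Permute the symbols of one message so that no symbol stays in its
    original position (a random derangement, by rejection sampling)."""
    if len(message) < 2:
        raise ValueError("a message needs at least two positions")
    identity = list(range(len(message)))
    while True:
        perm = identity[:]
        rng.shuffle(perm)
        if all(i != p for i, p in zip(identity, perm)):
            return tuple(message[p] for p in perm)


# Toy data-set of three-symbol messages (hypothetical symbols).
rng = random.Random(0)
messages = [(i, i + 10, i + 20) for i in range(5)]
print(shuffle_all_but(messages, keep=2, rng=rng))   # "fixing pos3"
print(shuffle_position(messages, pos=0, rng=rng))   # "shuffling pos1"
print([shuffle_within_message(m, rng) for m in messages])  # "shuffling msg"
```

The ablated messages would then be fed to the trained Receiver, and accuracy measured separately for each attribute and for both attributes jointly.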

8.8 Effect of channel capacity on ease of transmission

Table 7 replicates the ease-of-transmission analysis presented in the main text across various channel capacities. We observe in most cases a significantly positive correlation, that is even higher (1) for larger Receivers and (2) for emergent languages with shorter messages (smaller ).

posdis bosdis topsim
(, ) measure GRU (500) GRU (50) FFN (500) GRU (500) GRU (50) FFN (500) GRU (500) GRU (50) FFN (500)
(,) Learning Speed 0.82 0.78 0.74 0.71 0.67 0.62 0.72 0.74 0.66
Generalization 0.77 0.77 0.75 0.61 0.62 0.66 0.75 0.76 0.74
(,) Learning Speed 0.79 0.44 0.48 0.76 0.51 0.47 0.89 0.59 0.61
Generalization 0.73 - 0.50 0.77 0.27 0.54 0.84 0.41 0.61
(,) Learning Speed 0.82 0.77 0.79 0.79 0.76 0.77 0.89 0.85 0.87
Generalization 0.78 0.56 0.69 0.76 0.55 0.67 0.85 0.65 0.77
(,) Learning Speed 0.75 0.56 0.78 0.80 0.68 0.78 0.75 0.55 0.71
Generalization 0.67 0.27 0.68 0.78 0.41 0.70 0.53 - 0.54
(,) Learning Speed 0.51 0.29 0.60 0.42 0.31 0.48 0.72 0.49 0.73
Generalization 0.39 - 0.44 0.47 - 0.36 0.41 0.27 0.57
(,) Learning Speed - - - 0.33 - - 0.49 - 0.35
Generalization - -0.28 - - - - - - -
(,) Learning Speed 0.87 0.71 0.35 0.85 0.68 0.33 0.87 0.71 0.35
Generalization 0.80 0.55 0.50 0.81 0.55 0.51 0.79 0.54 0.48
(,) Learning Speed 0.84 0.54 0.43 0.82 0.54 0.46 0.86 0.57 0.49
Generalization 0.82 0.38 0.47 0.80 0.39 0.47 0.82 0.41 0.48
(,) Learning Speed 0.88 0.83 0.80 0.89 0.78 0.78 0.94 0.87 0.83
Generalization 0.87 0.68 0.68 0.90 0.69 0.67 0.90 0.70 0.68
(,) Learning Speed 0.85 0.58 0.62 0.82 0.59 0.64 0.72 0.74 0.66
Generalization 0.86 0.39 0.47 0.81 0.50 0.37 0.72 0.35 0.46
(,) Learning Speed 0.73 0.58 0.65 0.79 0.59 0.65 0.70 0.57 0.66
Generalization 0.69 0.39 0.37 0.67 0.37 0.37 0.49 0.34 0.46
(,) Learning Speed 0.39 - 0.27 0.69 - 0.40 0.67 - 0.51
Generalization 0.38 - 0.34 0.52 - 0.38 0.36 - -
Average Learning Speed 0.75 0.62 0.61 0.72 0.63 0.59 0.79 0.67 0.62
Generalization 0.71 0.42 0.54 0.72 0.49 0.51 0.70 0.51 0.63
Table 7: Statistically significant () Spearman correlations between retraining performance (measured by new Receiver Learning Speed and Generalization) and compositionality measures (posdis, bosdis and topsim) for (, ) and different channel capacity. ‘-’ indicates no significant correlations.
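For readers who want to reproduce the kind of statistic reported in Table 7, the following standard-library Python sketch computes a Spearman correlation as the Pearson correlation of average ranks. The scores and learning speeds below are made-up placeholders, not data from the experiments, and the sketch does not compute the significance test; a library routine such as `scipy.stats.spearmanr`, which also returns a p-value, would be the natural choice for that.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with ties receiving the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with values[order[i]]
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical per-language values: a compositionality score and the
# learning speed of a new Receiver trained on that language.
posdis_scores = [0.05, 0.20, 0.35, 0.60, 0.90]
learning_speed = [0.10, 0.15, 0.40, 0.38, 0.80]
print(spearman(posdis_scores, learning_speed))
```

Because only ranks matter, the statistic is insensitive to the scale on which Learning Speed or Generalization is measured.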