Measuring non-trivial compositionality in emergent communication

by Tomasz Korbak, et al.
Uniwersytet Warszawski

Compositionality is an important explanatory target in emergent communication and language evolution. The vast majority of computational models of communication account for the emergence of only a very basic form of compositionality: trivial compositionality. A compositional protocol is trivially compositional if the meaning of a complex signal (e.g. blue circle) boils down to the intersection of meanings of its constituents (e.g. the intersection of the set of blue objects and the set of circles). A protocol is non-trivially compositional (NTC) if the meaning of a complex signal (e.g. biggest apple) is a more complex function of the meanings of its constituents. In this paper, we review several metrics of compositionality used in emergent communication and experimentally show that most of them fail to detect NTC, i.e. they treat non-trivial compositionality as a failure of compositionality. The one exception is tree reconstruction error, a metric motivated by formal accounts of compositionality. These results emphasise important limitations of emergent communication research that could hamper progress on modelling the emergence of NTC.




1 Introduction

Compositionality is an important explanatory target in emergent communication and language evolution as well as a goal in representation learning and natural language processing (Brighton, 2002; Lake et al., 2016; Chaabouni et al., 2020). The vast majority of computational models of communication account for the emergence of only a very basic form of compositionality: trivial (Steinert-Threlkeld, 2020) or naïve compositionality (Kharitonov and Baroni, 2020). Natural languages are non-trivially compositional (NTC) as they include phenomena like quantifiers, negation, word order and context-dependence. A communication protocol is NTC if for a certain complex signal (e.g. biggest apple) its meaning is not just the intersection of the meanings of its constituents but some more complex function of the constituents. Despite being a necessary milestone towards accounting for the evolution of language, NTC has received little attention from the machine learning and evolutionary linguistics communities. We conjecture that this state of affairs is partly due to an inability to quantitatively measure progress toward NTC in computational settings.

In this study, we review seven metrics of compositionality and experimentally show that most fail to detect NTC — i.e. they treat non-trivial compositionality as a failure of compositionality. We do observe, however, that properly parametrised tree reconstruction error (TRE) (Andreas, 2019) — a metric directly motivated by a formal account of compositionality (Montague, 1970) — detects NTC to a significant degree.

To summarise, the contributions of this paper are (i) providing a common framework for comparing different approaches to measuring compositionality in emergent communication, (ii) experimentally showing that most metrics used in machine learning and evolutionary linguistics fail to detect NTC, and (iii) demonstrating how to parametrise TRE to be able to detect NTC. The anonymised code for all experiments and reusable implementations of all metrics are publicly available from

2 Background

Figure 1: Most research on emergent communication assumes the setup displayed on the orange plate: a sender observes an RGB image, sends a message to the receiver and the receiver acts based on the message. To measure compositionality, we can simplify this setup (blue plate) and consider only a derivation describing the compositional structure of a situation and a function mapping each derivation to a message. A derivation mirrors the RGB image, while the representation is identical to the message sent by the sender upon observing that image.

Most research on emergent communication has focused on a Lewis signalling game of the following form: a sender sends a message to a receiver upon observing a situation and the receiver acts based upon the message (Lewis, 1969; Skyrms, 2010). If the incentives of the sender and the receiver are aligned, they will agree on a communication protocol. To simplify the problem of measuring compositionality, let us assume that the situations observed by the sender are governed by an underlying compositional structure (known to us but hidden from the sender) and let us focus on the communication protocol itself, understood as a mapping from those hidden compositional structures (henceforth called derivations) to messages (henceforth called representations) as shown in Figure 1.


A derivation can be thought of as a tree representing a situation; we write $D$ for the set of derivations. Derivations are defined recursively such that if $d_1$ and $d_2$ are derivations, then $d_1 \circ d_2$ is a derivation, where $\circ$ is a derivation composition function. A primitive derivation is called a concept. For instance, a blue circle corresponds to a derivation built out of two concepts: blue and circle. In an emergent communication setting, that derivation can be the structure underlying an RGB image observed by the sender.


Let us have a set of representations $\Theta$. A representation can be thought of as a description of a derivation. Representations can be composed together, e.g. $\theta_1 \ast \theta_2$, where $\theta_1, \theta_2 \in \Theta$ and $\ast$ is a representation composition function. A primitive representation is called a symbol. Finally, a communication protocol is a function $f: D \to \Theta$ mapping derivations (elements of the set $D$) to representations.

In this paper, we will consider three perspectives of what kind of object a representation is:

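  1. From a communication perspective, the set of representations $\Theta$ is the set of messages understood as strings over an alphabet $\Sigma$, i.e. $\Theta = \Sigma^*$. Then, the representation composition function $\ast$ corresponds to string concatenation. For instance, $a \ast b = ab$.

  2. From a semantic perspective, $\Theta$ is the set of meanings associated with derivations. We will assume meanings to be sets of objects such as circles or boxes. Then, $\ast$ corresponds to a function over sets (e.g. set intersection). For instance, the meaning of blue circle is the set of blue objects intersected with the set of circles.

  3. From a geometric perspective, $\Theta$ is a vector space. Then, $\ast$ corresponds to vector addition. For instance, $\theta_1 \ast \theta_2 = \theta_1 + \theta_2$, where $\theta_1, \theta_2 \in \Theta$.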

While we ultimately care about the communication perspective, defining the distinction between trivial compositionality and NTC requires the semantic perspective and measuring compositionality in terms of TRE requires approximating the semantic perspective in terms of the geometric perspective.


Intuitively, a communication protocol embodied by a function $f$ from derivations to representations is compositional if $f$ is a homomorphism, i.e. if it preserves compositional structure. More formally, $f$ is compositional if the following holds for all derivations $d_1$ and $d_2$:

$$f(d_1 \circ d_2) = f(d_1) \ast f(d_2) \quad (1)$$

In other words, the composition function over representations $\ast$ mirrors the composition function over derivations $\circ$: for each derivation obtained by applying operator $\circ$, its image can be obtained by applying a corresponding operator $\ast$.
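The condition above can be checked mechanically on a toy protocol. The following Python sketch is illustrative only: the concept names, the symbol table, the lookup table and the helper names (`tc_protocol`, `holistic_protocol`, `is_compositional`) are assumptions introduced here, not taken from the paper's code. It takes the communication perspective, where composing representations is string concatenation.

```python
from itertools import product

COLOURS = ["blue", "red"]
SHAPES = ["circle", "box"]

# Hypothetical one-to-one mapping between concepts and symbols.
SYMBOL = {"blue": "a", "red": "b", "circle": "c", "box": "d"}

# Hypothetical holistic lookup table: whole messages are assigned arbitrarily.
HOLISTIC_TABLE = {
    ("blue", "circle"): "ac",
    ("blue", "box"): "bd",
    ("red", "circle"): "bc",
    ("red", "box"): "ad",
}


def tc_protocol(d):
    """Concepts map to symbols; composed derivations map to concatenations."""
    if isinstance(d, str):
        return SYMBOL[d]
    left, right = d
    return tc_protocol(left) + tc_protocol(right)


def holistic_protocol(d):
    """Concepts map to symbols, but composed derivations use the lookup table."""
    return SYMBOL[d] if isinstance(d, str) else HOLISTIC_TABLE[d]


def is_compositional(f):
    """Check that f of a composed derivation equals the composed images."""
    return all(f((c, s)) == f(c) + f(s) for c, s in product(COLOURS, SHAPES))


print(is_compositional(tc_protocol))
print(is_compositional(holistic_protocol))
```

The holistic table fails the check because, for example, the message for a blue box is not the concatenation of the symbols for blue and box.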

This mathematical model of compositionality, originally constructed by Montague (1970) using universal algebra, is the dominant approach in formal semantics (see (Janssen, 2010) for a review). In the context of emergent communication, this model was recently explicitly assumed by Andreas (2019) and Steinert-Threlkeld (2020).

Trivial and non-trivial compositionality

Let us take the semantic perspective and assume the set of representations $\Theta$ to be a set of sets of objects. Then, a communication protocol $f$ is trivially compositional (TC) if the representation composition function $\ast$ is set intersection. Alternatively, $f$ is NTC if $\ast$ is a more complex function over sets of objects (Steinert-Threlkeld, 2020).

Most signalling games studied in machine learning and evolutionary linguistics are confined to TC communication protocols. For instance, a communication protocol defined over objects with shapes and colours would probably be TC, with the meaning of a message blue circle being the intersection of the set of circles with the set of blue objects (Mordatch and Abbeel, 2017; Kottur et al., 2017; Korbak et al., 2019). On the other hand, a great deal of natural language semantics is NTC. For instance, the meaning of the phrase good cook is not the intersection of the set of cooks with the set of good people. Rather, the adjective good is highly contextual and complements the meaning of the noun cook differently than it complements the meaning of, for example, the noun climber.
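The contrast between the two kinds of composition can be made concrete with sets of objects. The sketch below is a toy illustration under assumptions made here: the world of five objects, the `Obj` record and the functions `meaning`, `intersect` and `biggest` are hypothetical and only mirror the blue circle and biggest apple examples used in this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    name: str
    colour: str
    kind: str
    size: int


# A hypothetical world of objects.
WORLD = frozenset({
    Obj("o1", "blue", "circle", 1),
    Obj("o2", "red", "circle", 2),
    Obj("o3", "blue", "box", 3),
    Obj("o4", "red", "apple", 2),
    Obj("o5", "green", "apple", 5),
})


def meaning(word):
    """Meaning of a primitive word: the set of objects it applies to."""
    return frozenset(o for o in WORLD if word in (o.colour, o.kind))


def intersect(m1, m2):
    """Trivial composition: set intersection."""
    return m1 & m2


def biggest(m):
    """Non-trivial composition: the result depends on the whole noun set."""
    top = max(o.size for o in m)
    return frozenset(o for o in m if o.size == top)


blue_circle = intersect(meaning("blue"), meaning("circle"))
biggest_apple = biggest(meaning("apple"))
print(sorted(o.name for o in blue_circle))
print(sorted(o.name for o in biggest_apple))
```

Here `biggest` is a function applied to the noun's meaning rather than a fixed set intersected with it: which object counts as biggest changes with the set being modified.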

Tree reconstruction error

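TRE is a metric of compositionality proposed by Andreas (2019) and directly motivated by Montague’s account of compositionality embodied in (1). First, assume there is a distance function $\delta$ over representations $\Theta$. Then we can define a compositional approximation $\hat{f}_\eta$ of the protocol $f$ with parameters $\eta$ as follows:

$$\hat{f}_\eta(d) = \eta_d \ \text{if } d \text{ is a concept}, \qquad \hat{f}_\eta(d_1 \circ d_2) = \hat{f}_\eta(d_1) \ast \hat{f}_\eta(d_2) \quad (2)$$

In other words, $\hat{f}_\eta$ assigns each concept an embedding vector and composes these vectors using the representation composition function $\ast$ for complex derivations. $\ast$ can be a non-parametric vector operation, e.g. addition, or a parametric transformation, e.g. a linear transformation. The parameters $\eta$ (embedding vectors for concepts and possible parameters of $\ast$) are optimised so we have

$$\eta^* = \arg\min_\eta \sum_{d \in D} \delta\big(f(d), \hat{f}_\eta(d)\big) \quad (3)$$

The irreducible distance given the optimal parameters $\eta^*$ is the TRE.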

Unlike in a signalling game, while optimising the parameters we do have explicit access to the underlying derivation. Therefore what TRE measures is how well a given communication protocol can be reconstructed while respecting the compositional structure of the derivations. A compositional protocol satisfying (1) will by definition respect that structure, and hence can be reconstructed perfectly.
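The optimisation described above can be illustrated with a deliberately small, dependency-free Python sketch. It is not the authors' implementation: as assumptions made here for brevity, composition is plain vector addition, the distance is a squared error against one-hot encoded messages (the appendix describes a neural network trained with cross-entropy), and the two toy protocols `tc` and `holistic` are made up.

```python
def one_hot_message(message, alphabet):
    """Concatenate one one-hot vector per symbol of the message."""
    vec = []
    for symbol in message:
        vec.extend(1.0 if symbol == a else 0.0 for a in alphabet)
    return vec


def tre(protocol, n_steps=500, lr=0.1):
    """Squared reconstruction error left after fitting additive embeddings."""
    alphabet = sorted({s for m in protocol.values() for s in m})
    targets = {d: one_hot_message(m, alphabet) for d, m in protocol.items()}
    dim = len(next(iter(targets.values())))
    concepts = sorted({c for d in protocol for c in d})
    emb = {c: [0.0] * dim for c in concepts}  # one embedding per concept

    def residual(d):
        # Reconstruction = sum of concept embeddings (assumed composition).
        return [sum(emb[c][k] for c in d) - targets[d][k] for k in range(dim)]

    for _ in range(n_steps):  # plain gradient descent on the squared error
        grad = {c: [0.0] * dim for c in concepts}
        for d in protocol:
            r = residual(d)
            for c in d:
                for k in range(dim):
                    grad[c][k] += 2.0 * r[k]
        for c in concepts:
            for k in range(dim):
                emb[c][k] -= lr * grad[c][k]
    return sum(x * x for d in protocol for x in residual(d))


tc = {("circle", "blue"): "ac", ("circle", "red"): "ad",
      ("box", "blue"): "bc", ("box", "red"): "bd"}
holistic = {("circle", "blue"): "ac", ("circle", "red"): "bd",
            ("box", "blue"): "bc", ("box", "red"): "ad"}

tre_tc = tre(tc)
tre_holistic = tre(holistic)
print(tre_tc, tre_holistic)
```

In this toy setting the trivially compositional table is reconstructed almost exactly, while the holistic table leaves a residual error that no choice of additive embeddings can remove.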

3 Experiments

In our experiments, we consider a well-studied signalling game in which the sender observes objects endowed with two discernible features: shape and colour. The corresponding derivations are ordered tuples of two kinds of concepts: shapes and colours. The set of primitive derivations consists of 25 colours and 25 shapes. We take the set of representations to be a set of strings of bounded length over a finite alphabet.

We consider nine pre-defined communication protocols suitable for solving the signalling game defined above: one TC, six NTC (entangled, diagonal, negation, rotated, order-sensitive, context-sensitive) and two non-compositional baselines (random and holistic). We designed these protocols as minimal models of NTC phenomena found in natural languages and formal languages: negation (e.g. not circle), conversational context (e.g. requiring only shape to be communicated), word order (ab is different from ba) and entanglement in the representation learning sense (Kharitonov and Baroni, 2020). These probing protocols are thus aligned with linguistic intuitions (and with existing literature, whenever possible) about what constitutes (trivial or non-trivial) compositionality. For a detailed description of all communication protocols considered in this experiment, see appendix A.

We then consider seven metrics of compositionality used by the machine learning community and report how they score the protocols. The metrics considered are TRE, conflict count, topographic similarity, BOW disentanglement, generalisation, positional disentanglement and context independence. For TRE, we implemented $\ast$, the composition function for $n$-dimensional vector representations of derivations, as a linear transformation, i.e. $\theta_1 \ast \theta_2 = A\theta_1 + B\theta_2$, where $\theta_1, \theta_2$ are vector-encoded symbols and $A, B$ are learnable parameters. We describe experiments with other implementations of $\ast$ in appendix C. For detailed descriptions of the metrics used, see appendix B.

The results of the evaluation are displayed in Figure 2. We can observe that while (almost) all metrics assign high compositionality scores to the TC protocol and low compositionality scores to non-compositional (holistic and random) protocols, most also assign low scores to NTC protocols. Generalisation, somewhat in line with recent results (Chaabouni et al., 2020), is low for some NTC protocols (negation, context-sensitive, entangled, diagonal, rotated). Assuming that the receiver’s generalisation requires both productivity of the communication protocol used by the sender and capacity of the receiver, low generalisation for NTC protocols may be explained by insufficient capacity of the receiver. This, in turn, suggests that generalising to NTC requires higher capacity and/or stronger inductive biases than generalising to TC. Context independence, topographical similarity, positional disentanglement, BOW disentanglement and conflict count can pick up only the simplest forms of NTC such as negation and order-sensitivity. TRE is the only metric that assigns high scores to all NTC protocols.

Figure 2: Scores assigned by various compositionality metrics to various protocols. Each subplot corresponds to a different metric, with protocols on the Y-axis and scores on the X-axis. Note that the X-axes do not share a common scale and we report negative TRE and negative conflict count. Thus, for each metric, a higher value means greater compositionality. Conflict count is undefined for negation and context-sensitive protocols. We observed negligible variance across five random seeds.

4 Discussion

We conjecture that most metrics fail to capture NTC because they were designed under the assumption of TC as the canonical form of compositionality. This assumption, however, seems to be guided by a simplified signalling game setup rather than formal accounts of compositionality (Janssen, 2010) or corpus data. That design choice may, in turn, stem from a more general problem of designing meaningful metrics for emergent communication (Lowe et al., 2019) and translating theoretical accounts in linguistics into quantitative measures. To illustrate one such translation problem, if we do not restrict the composition function to be of a particular class, any language may be considered compositional (see appendix C).

Despite these difficulties, NTC is ubiquitous in natural languages. Phenomena such as function words and dependency relations (Rizzi and Cinque, 2016) demonstrate that primitive concepts cannot be treated as completely orthogonal (Murphy, 1988) and natural languages use more than one form of symbol composition (Gärdenfors, 1995). Our aim in this paper was to introduce NTC as an explanatory target for emergent communication and to demonstrate how to measure it in terms of TRE. We hope these contributions will guide future work accounting for the emergence of NTC and closing the gap between emergent and human communication.

Broader impact

The field of emergent communication constitutes basic research and is focused on a theoretical problem: the emergence of language. However, the problem of learning compositional representations and understanding compositional language has broader implications for natural language processing and representation learning. Concrete problems that can be posed as emergent communication include image captioning (Kottur et al., 2017) and unsupervised machine translation (Lee et al., 2017) (both of which can be considered visually grounded communication) as well as explainability (Andreas et al., 2017). Research on learning compositional representations also informs the development of natural language processing technologies such as semantic parsing and machine reasoning (Hudson and Manning, 2018). These systems are vulnerable to bias, permit malicious use and can give rise to unintended adverse effects. On the other hand, the conjectured interpretability and robustness of compositional representations could improve the transparency and fairness of machine learning systems that utilise such representations, as well as advance progress on conversational systems that empower disadvantaged individuals.


Tomasz Korbak, Julian Zubek and Joanna Rączaszek-Leonardi were funded by a National Science Centre (Poland) grant OPUS 2018/29/B/HS1/00884. The authors are grateful to Krzysztof Główka, Łukasz Kuciński and Paweł Kołodziej for their helpful feedback.


  • J. Andreas, A. Dragan, and D. Klein (2017) Translating Neuralese.
  • J. Andreas (2019) Measuring Compositionality in Representation Learning. International Conference on Learning Representations. arXiv:1902.07181.
  • J. Barrett, B. Skyrms, and C. Cochran (2018) Hierarchical Models for the Evolution of Compositional Language. Institute for Mathematical Behavioral Sciences Technical Report MBS 18-03.
  • B. Bogin, M. Geva, and J. Berant (2018) Emergence of Communication in an Interactive World with Consistent Speakers. arXiv:1809.00549 [cs].
  • H. Brighton and S. Kirby (2006) Understanding Linguistic Evolution by Visualizing the Emergence of Topographic Mappings. Artificial Life 12 (2), pp. 229–242.
  • H. Brighton (2002) Compositional Syntax From Cultural Transmission. Artificial Life 8 (1), pp. 25–54.
  • R. Chaabouni, E. Kharitonov, D. Bouchacourt, E. Dupoux, and M. Baroni (2020) Compositionality and Generalization In Emergent Languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 4427–4442.
  • R. T. Q. Chen, X. Li, R. Grosse, and D. Duvenaud (2019) Isolating Sources of Disentanglement in Variational Autoencoders. arXiv:1802.04942 [cs, stat].
  • N. Chomsky (1957) Syntactic structures. ISBN 978-1-61427-804-7.
  • P. Gärdenfors (1995) Language and the Evolution of Cognition.
  • S. Hochreiter and J. Schmidhuber (1997) Long Short-Term Memory. Neural Computation 9 (8), pp. 1735–1780.
  • D. A. Hudson and C. D. Manning (2018) Compositional Attention Networks for Machine Reasoning. arXiv:1803.03067.
  • T. M. V. Janssen (2010) Compositionality. In Handbook of Logic and Linguistics.
  • E. Kharitonov and M. Baroni (2020) Emergent Language Generalization and Acquisition Speed are not tied to Compositionality. arXiv:2004.03420 [cs].
  • E. Kharitonov, R. Chaabouni, D. Bouchacourt, and M. Baroni (2019) EGG: a toolkit for research on Emergence of lanGuage in Games. arXiv:1907.00852 [cs].
  • D. P. Kingma and J. Ba (2014) Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs].
  • T. Korbak, J. Zubek, Ł. Kuciński, P. Miłoś, and J. Rączaszek-Leonardi (2019) Developmentally motivated emergence of compositional communication via template transfer. NeurIPS 2019 workshop Emergent Communication: Towards Natural Language.
  • S. Kottur, J. M. F. Moura, S. Lee, and D. Batra (2017) Natural Language Does Not Emerge ’Naturally’ in Multi-Agent Dialog. arXiv:1706.08502 [cs].
  • N. Kriegeskorte (2008) Representational similarity analysis – connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience.
  • L. Kuciński, P. Kołodziej, and P. Miłoś (2020) Emergence of compositional language in communication through noisy channel.
  • B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman (2016) Building Machines That Learn and Think Like People. arXiv:1604.00289 [cs, stat].
  • A. Lazaridou, K. M. Hermann, K. Tuyls, and S. Clark (2018) Emergence of Linguistic Communication from Referential Games with Symbolic and Pixel Input. arXiv:1804.03984 [cs].
  • J. Lee, K. Cho, J. Weston, and D. Kiela (2017) Emergent Translation in Multi-Agent Communication. arXiv:1710.06922.
  • V. I. Levenshtein (1966) Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady 10, pp. 707.
  • D. K. Lewis (1969) Convention: a philosophical study. Nachdr. edition, Blackwell, Oxford. ISBN 978-0-631-23257-5, 978-0-631-23256-8.
  • R. Lowe, J. Foerster, Y. Boureau, J. Pineau, and Y. Dauphin (2019) On the Pitfalls of Measuring Emergent Communication. arXiv:1903.05168 [cs, stat].
  • R. Montague (1970) Universal grammar. Theoria 36 (3), pp. 373–398.
  • I. Mordatch and P. Abbeel (2017) Emergence of Grounded Compositional Language in Multi-Agent Populations. In AAAI.
  • G. L. Murphy (1988) Comprehending Complex Concepts. Cognitive Science 12 (4), pp. 529–562.
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch.
  • C. Resnick, A. Gupta, J. Foerster, A. M. Dai, and K. Cho (2020) Capacity, Bandwidth, and Compositionality in Emergent Language Learning. arXiv:1910.11424 [cs, stat].
  • L. Rizzi and G. Cinque (2016) Functional Categories and Syntactic Theory. Annual Review of Linguistics 2 (1), pp. 139–163.
  • B. Skyrms (2010) Signals: evolution, learning, & information. Oxford University Press, Oxford ; New York. ISBN 978-0-19-958082-8, 978-0-19-958294-5.
  • S. Steinert-Threlkeld (2020) Towards the Emergence of Non-trivial Compositionality. Philosophy of Science.

Appendix A Communication protocols used in the experiments

Recall that given an observation based on a derivation, the sender sends a message composed of symbols. Unless specified otherwise, a derivation is a pair of primitive concepts for shape and colour, and each message consists of two symbols.

The communication protocols are defined as follows.

Trivially compositional (TC) protocol

The TC protocol was constructed by assuming a fixed one-to-one mapping between concepts and symbols, e.g. blue with a and circle with b, and generating a message for each shape–colour pair by concatenating the associated symbols, e.g. the symbols a and b for a blue circle.
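A minimal sketch of this construction in Python is given below. The function name `build_tc_protocol`, the concept lists and the convention of writing the colour symbol first are assumptions made for illustration.

```python
from itertools import product


def build_tc_protocol(colours, shapes, alphabet):
    """Give every concept its own symbol and concatenate the two symbols."""
    concepts = list(colours) + list(shapes)
    if len(alphabet) < len(concepts):
        raise ValueError("need one distinct symbol per concept")
    symbol_of = dict(zip(concepts, alphabet))
    # Assumed convention: colour symbol first, shape symbol second.
    return {(c, s): symbol_of[c] + symbol_of[s]
            for c, s in product(colours, shapes)}


tc = build_tc_protocol(["blue", "red"], ["circle", "box"], "abcd")
print(tc)
```

Because the mapping between concepts and symbols is one-to-one, every derivation receives a distinct message.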

Holistic protocol

In animal communication research, a holistic communication system is one in which messages only have meaning as a whole and their parts are meaningless without context. Hence, we construct a holistic communication protocol by uniformly sampling a pair of symbols (without replacement) for each derivation.

Random protocol

We construct the random protocol by sampling each symbol separately, with replacement, i.e. each symbol of each message is drawn independently.
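The two baseline constructions can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, messages are taken to be two symbols long, and "without replacement" is read here as "no two derivations share a message".

```python
import random
from itertools import product


def build_holistic_protocol(derivations, alphabet, rng):
    """Assign each derivation a distinct, arbitrarily chosen symbol pair."""
    pairs = ["".join(p) for p in product(alphabet, repeat=2)]
    return dict(zip(derivations, rng.sample(pairs, len(derivations))))


def build_random_protocol(derivations, alphabet, rng):
    """Draw every symbol of every message independently, with replacement."""
    return {d: rng.choice(alphabet) + rng.choice(alphabet)
            for d in derivations}


derivations = list(product(range(5), range(5)))  # 5 shapes x 5 colours
rng = random.Random(0)
holistic = build_holistic_protocol(derivations, "abcdefghij", rng)
random_protocol = build_random_protocol(derivations, "abcdefghij", rng)
```

Under this reading the holistic protocol is always unambiguous, whereas the random protocol may assign the same message to several derivations.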

Entangled NTC protocol

We use an example provided by Kharitonov and Baroni (2020), a compositional protocol that refers to complex properties of the objects constructed as combinations of basic concepts.

Let us also assume that both concepts and symbols are represented by non-negative integers from finite fields closed under modular addition and subtraction. Each of the two symbols in a message is then computed from both the shape and the colour.

The non-triviality of this protocol stems from the fact that it entangles shapes and colours, so that both symbols depend on both the shape and the colour. Kharitonov and Baroni (2020) consider this protocol to be a counter-example to naïve compositionality, which is essentially what we mean by NTC.
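One way to realise such an entangled mapping is sketched below. The specific formulas (modular sum for the first symbol, modular difference for the second) are an assumption chosen to match the description of modular addition and subtraction; they are not claimed to be the exact definition used in the experiments, and `build_entangled_protocol` is a hypothetical name.

```python
from itertools import product


def build_entangled_protocol(n):
    """Assumed instance: symbols are the modular sum and difference."""
    return {
        (shape, colour): ((shape + colour) % n, (shape - colour) % n)
        for shape, colour in product(range(n), repeat=2)
    }


entangled = build_entangled_protocol(5)
# With the shape fixed, the first symbol still varies with the colour.
first_symbols_for_shape_0 = {entangled[(0, c)][0] for c in range(5)}
print(sorted(first_symbols_for_shape_0))
```

For an odd number of values the mapping remains unambiguous, yet neither symbol alone identifies the shape or the colour.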

Rotated NTC protocol

The rotated protocol is similar to the entangled protocol. It is obtained by encoding concepts numerically, rotating the coordinate system by 45 degrees and then mapping the obtained values to symbols in the order they appear on the newly obtained axes. It relies on the same assumptions as the entangled protocol above.


Order-sensitive NTC protocol

This protocol is constructed analogously to the TC protocol with one difference: each symbol is used to communicate both a colour and a shape. For instance, d means blue when in the first position in the message and square when in the second position. This protocol is NTC because it constrains the representation composition function to be non-commutative.
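The reuse of symbols across positions can be sketched as follows. The function name and the concrete concept lists are assumptions; only the example of d meaning blue in the first position and square in the second is taken from the text.

```python
from itertools import product


def build_order_sensitive_protocol(colours, shapes, alphabet):
    """Reuse one alphabet for colours (position 1) and shapes (position 2)."""
    colour_symbol = dict(zip(colours, alphabet))
    shape_symbol = dict(zip(shapes, alphabet))
    return {(c, s): colour_symbol[c] + shape_symbol[s]
            for c, s in product(colours, shapes)}


order_sensitive = build_order_sensitive_protocol(
    ["blue", "red"], ["square", "circle"], "de"
)
print(order_sensitive)
```

The messages de and ed use the same symbols but denote different objects, so swapping the two positions changes the meaning.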

Context-sensitive NTC protocol

This protocol is based on the game described by Barrett et al. (2018). We modify our setup so the derivations are nested tuples that contain a context in addition to the shape and the colour. The context can be colour, shape or both. This corresponds to a signalling game in which the sender is also told which concept (shape and/or colour) must be communicated to the receiver and is disincentivised from communicating too much. This protocol is NTC because message length is a function of context.

Negation NTC protocol

The negation protocol is based on the intuition that negation constitutes a minimal instance of NTC in natural language semantics. To illustrate this, we modify our setup by assuming that there are just two shapes, circle and box, and the agent only has a word x to refer to the box. However, it also uses a symbol ! as a negation and can refer to circles as ‘not boxes’, i.e. !x. The part of the message communicating the shape is then trivially composed with the second part, which communicates colour. The non-triviality of this protocol lies in the fact that the meaning of !x is a non-trivial function of the meanings of ! and x. The semantics of this protocol cannot be formulated in terms of set intersection, because there is no set that could constitute the meaning of the negation token !.

Diagonal NTC protocol

This protocol reflects a language with two-word utterances, where one word represents intensity and the second a certain property or axis of variability (examples from natural language: ‘low brightness’, ‘high contrast’, ‘medium volume’). In this example we refer to a complex property that is a combination of basic concepts.

Let us assume both concepts and symbols are represented by non-negative integers. An object embodying two concepts is then represented with a pair of symbols computed from the corresponding integers.


Concrete examples of communication protocols described above are provided in Tables 1 and 2.

holistic  TC  random  entangled  order-sensitive  diagonal  rotated
ia        hf  fj      ii         dd               dd        di
hj        hi  eg      jd         da               jj        ef
ea        hd  gd      af         dc               hh        ch
bd        ha  ic      eg         de               bb        hc
fc        hc  if      cb         db               aa        fe
gc        jf  bh      dd         cd               jd        bf
hg        ji  hf      if         ca               hj        dh
ga        jd  bf      jg         cc               bh        ec
gi        ja  hi      ab         ce               ab        ce
ba        jc  dh      eh         cb               gb        hd
fa        ef  fb      ff         ed               hd        jh
ef        ei  ff      dg         ea               bj        bc
gf        ed  aj      ib         ec               ah        de
fg        ea  ae      jh         ee               gh        ed
jf        ec  ji      ac         eb               ch        cb
gg        gf  fh      gg         ad               bd        ac
ha        gi  aa      fb         aa               aj        je
fb        gd  ib      dh         ac               gj        bd
ig        ga  ia      ic         ae               cj        db
ii        gc  gg      je         ab               ij        ej
db        bf  bc      bb         bd               ad        ge
bf        bi  jd      gh         ba               gd        ad
hh        bd  ch      fc         bc               cd        jb
ce        ba  fa      de         be               id        bj
jc        bc  fb      ia         bb               ed        da
Table 1: Examples of holistic, TC, random, entangled, order-sensitive, diagonal and rotated protocols for derivations with five colours and five shapes. Each row corresponds to one derivation.
Table 2: Examples of (a) context-sensitive and (b) negation protocols (columns: derivation, message).

Appendix B Compositionality metrics


Generalisation

Compositionality is widely considered to be the feature of language and thought that explains the generalisation capabilities of humans (Chomsky, 1957; Lake et al., 2016). While recent research in emergent communication shows that the relationship between compositionality and generalisation is nuanced (Chaabouni et al., 2020; Kharitonov and Baroni, 2020) and in some signalling games compositionality is not necessary for generalisation, generalisation to novel situations remains an intuitive hallmark of compositionality.

Here we measure the test set accuracy of a receiver trained to predict the ground-truth derivations based on messages sent by a fixed sender. More concretely, we implement the receiver as a neural network that first embeds each symbol of a message into a 50-dimensional embedding vector, feeds each of these embeddings to a single-layer LSTM (Hochreiter and Schmidhuber, 1997) and then feeds the last hidden state vector of the LSTM into a two-layer feed-forward neural network. The output of the network is a tuple of categorical distributions over all concepts in the derivation. The loss function is a sum of cross-entropy errors for all concepts. The neural network is implemented in PyTorch (Paszke et al., 2017) using EGG (Kharitonov et al., 2019). We train it using Adam (Kingma and Ba, 2014) with batch size 1. We use regularisation and initialise the embedding vectors by random sampling.

For the purpose of our experiments, we split the set of derivations into a training set (80% of the derivations) and a test set (20% of the derivations). We train the receiver on the training set for 100 epochs or until it achieves training set accuracy 1. We then measure the accuracy of the receiver on the test set. The reported accuracies are averaged across five random seeds.

Positional disentanglement

Chaabouni et al. (2020) introduced positional disentanglement as an adaptation of similar metrics developed in the representation learning community (Chen et al., 2019). It is also related to context independence and residual entropy introduced by Resnick et al. (2020). Let $s_j$ denote the $j$-th symbol of a message $m$, $c_1^j$ the concept with the highest mutual information with $s_j$, and $c_2^j$ the concept with the second highest mutual information:

$$c_1^j = \arg\max_{c} I(s_j; c), \qquad c_2^j = \arg\max_{c \neq c_1^j} I(s_j; c),$$

where $I(\cdot\,;\cdot)$ is mutual information. Then, positional disentanglement posdis is defined as

$$\text{posdis} = \frac{1}{L} \sum_{j=1}^{L} \frac{I(s_j; c_1^j) - I(s_j; c_2^j)}{H(s_j)},$$

where $L$ is the maximum message length and $H(s_j)$ is the entropy of the distribution of symbols at the $j$-th place in messages across all derivations. We ignore positions with zero entropy.
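The computation of positional disentanglement can be sketched in plain Python. The code below is an approximation under assumptions made here: mutual information is estimated from counts over a uniform distribution of derivations, it is computed between each message position and each slot of the derivation tuple, and the average runs over the positions that are not ignored. The names `posdis`, `tc` and `mixed` are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def posdis(protocol):
    derivations = list(protocol)
    messages = [protocol[d] for d in derivations]
    n_slots = len(derivations[0])
    length = max(len(m) for m in messages)
    scores = []
    for j in range(length):
        symbols = [m[j] if j < len(m) else None for m in messages]
        h = entropy(symbols)
        if h == 0:
            continue  # positions with zero entropy are ignored
        infos = sorted(
            (mutual_information(symbols, [d[a] for d in derivations])
             for a in range(n_slots)),
            reverse=True,
        )
        # Gap between the most and second most informative slot.
        scores.append((infos[0] - infos[1]) / h)
    return sum(scores) / len(scores) if scores else 0.0


tc = {("blue", "circle"): "ac", ("blue", "box"): "ad",
      ("red", "circle"): "bc", ("red", "box"): "bd"}
mixed = {("blue", "circle"): "ac", ("blue", "box"): "bd",
         ("red", "circle"): "bc", ("red", "box"): "ad"}
print(posdis(tc), posdis(mixed))
```

In the second table only the second position tracks a single slot of the derivation, so its score falls below that of the trivially compositional table.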

Bag-of-words disentanglement

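Note that positional disentanglement assumes that compositionality involves fixed order (e.g. the meaning of symbol a at first place is different from the meaning of symbol a at second place in the message). Bag-of-words disentanglement relaxes this assumption by only considering symbol counts: $n_j$ is the number of occurrences of the $j$-th symbol in a message. Then, bag-of-words disentanglement, bosdis, is defined as

$$\text{bosdis} = \frac{1}{|\Sigma|} \sum_{j=1}^{|\Sigma|} \frac{I(n_j; c_1^j) - I(n_j; c_2^j)}{H(n_j)},$$

where $|\Sigma|$ is the number of symbols available in the protocol, and $c_1^j$ and $c_2^j$ are the concepts with the highest and second highest mutual information with $n_j$.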

Tree reconstruction error

We define the reconstruction function to be a neural network so that its parameters can be optimised via gradient descent. More concretely, to generate a reconstruction of a derivation $d$, we follow (2) and first embed each concept forming $d$ into an embedding vector whose dimensionality depends on the maximum message length $L$ (fixed in advance). Then, we encode the entire $d$ into an embedding vector of the same dimensionality by recursively applying the composition function in a bottom-up manner. The ground truth message corresponding to derivation $d$ is encoded as one-hot vectors. We then define the reconstruction error to be the sum of cross-entropy errors between the $i$-th segment of the reconstruction and the $i$-th one-hot-encoded symbol in the ground truth message. The neural network was implemented in PyTorch (Paszke et al., 2017). We train it for 1000 epochs using Adam (Kingma and Ba, 2014) with learning rate . We use regularisation with coefficient and initialise the embedding vectors by sampling from .
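The bottom-up encoding and the segment-wise error described above can be sketched in plain Python. This is only an illustration of the scoring step under stated assumptions, not the authors' PyTorch implementation: it uses additive composition, represents a derivation as a nested pair of concept names, assumes an embedding of size (message length) × (number of symbols), applies a softmax to each segment before taking cross-entropy, and omits training entirely. All names are hypothetical.

```python
from math import exp, log


def compose_add(x, y):
    """Additive composition of two embedding vectors (assumed composition function)."""
    return [a + b for a, b in zip(x, y)]


def encode(derivation, embeddings, compose=compose_add):
    """Encode a derivation bottom-up.

    A derivation is either a concept name (leaf) or a pair (left, right).
    """
    if isinstance(derivation, tuple):
        left, right = derivation
        return compose(encode(left, embeddings, compose), encode(right, embeddings, compose))
    return embeddings[derivation]


def reconstruction_error(vector, message, n_symbols):
    """Sum of cross-entropies between softmax(segment i) and the i-th symbol index."""
    total = 0.0
    for i, symbol in enumerate(message):
        segment = vector[i * n_symbols:(i + 1) * n_symbols]
        peak = max(segment)  # subtract the maximum for numerical stability
        log_norm = peak + log(sum(exp(v - peak) for v in segment))
        total += log_norm - segment[symbol]
    return total


# Hand-picked embeddings for message length 2 and 2 symbols (vector size 4).
embeddings = {"blue": [10.0, 0.0, 0.0, 0.0], "circle": [0.0, 0.0, 0.0, 10.0]}
encoded = encode(("blue", "circle"), embeddings)
error_for_matching_message = reconstruction_error(encoded, [0, 1], n_symbols=2)
error_for_other_message = reconstruction_error(encoded, [1, 0], n_symbols=2)
```

With these hand-picked embeddings the matching message yields an error close to zero and the swapped message a large one. In the actual metric the embeddings would be learned by minimising this error over the whole protocol.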

Context independence

Context independence (Bogin et al., 2018) measures the alignment between symbols forming a message and concepts forming a derivation. Let us denote the set of concepts by $\mathcal{C}$ and the set of symbols by $\mathcal{S}$. By $p(s \mid c)$, we mean the probability that a derivation containing concept $c$ is mapped to a message containing symbol $s$. We define the inverse probability $p(c \mid s)$ similarly. Finally, we define $s^c = \arg\max_{s} p(c \mid s)$; $s^c$ is the symbol most often sent in presence of a concept $c$. Then, the context independence metric is

$$\mathrm{CI} = \mathbb{E}_{c}\left[\, p(s^c \mid c) \cdot p(c \mid s^c) \,\right];$$

the expectation is taken with respect to the joint uniform distribution.

For instance, when the derivation consists of a shape and a colour, as in our experiments, context independence measures the consistency of associating symbols with shapes irrespective of colour and vice versa. Note that context independence effectively punishes the agents for using synonyms, i.e. associating multiple symbols with a single concept (Lowe et al., 2019).
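A minimal sketch of this computation, assuming that both conditional probabilities are estimated by normalising concept-symbol co-occurrence counts and that concepts in different slots have distinct names; the function name is hypothetical and the estimator is one possible choice rather than the authors' implementation.

```python
from collections import defaultdict


def context_independence(derivations, messages):
    """Context independence from co-occurrence counts (sketch)."""
    # counts[c][s]: number of (derivation, message) pairs with c in the
    # derivation and s in the message
    counts = defaultdict(lambda: defaultdict(int))
    for derivation, message in zip(derivations, messages):
        for c in set(derivation):
            for s in set(message):
                counts[c][s] += 1

    # Column sums, used to normalise the estimate of p(c | s)
    symbol_totals = defaultdict(int)
    for row in counts.values():
        for s, k in row.items():
            symbol_totals[s] += k

    scores = []
    for c, row in counts.items():
        concept_total = sum(row.values())  # row sum, normalises p(s | c)
        # s_c: the symbol maximising the estimated p(c | s)
        s_c = max(row, key=lambda s: row[s] / symbol_totals[s])
        p_s_given_c = row[s_c] / concept_total
        p_c_given_s = row[s_c] / symbol_totals[s_c]
        scores.append(p_s_given_c * p_c_given_s)
    return sum(scores) / len(scores)  # uniform average over concepts


derivations = [("blue", "circle"), ("blue", "square"), ("red", "circle"), ("red", "square")]
messages = ["ac", "ad", "bc", "bd"]
score = context_independence(derivations, messages)
```

Note that under this count-based estimator even a perfectly aligned two-concept protocol does not reach 1, because every symbol also co-occurs with the concepts of the other slot.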

Topographical similarity

Topographical similarity (Brighton and Kirby, 2006; Lazaridou et al., 2018) is a measure of structural similarity between messages and derivations. Let us define $\Delta_D$ to be a distance over derivations and $\Delta_M$ to be a distance over messages. Topographical similarity is the Spearman correlation of $\Delta_D$ and $\Delta_M$ measured over a joint uniform distribution. Topographical similarity mirrors the approach known as representation similarity analysis in systems neuroscience (Kriegeskorte, 2008), where it is used to quantify structural similarity between a stimulus and the neural activity evoked by the stimulus.

We choose $\Delta_M$ to be the Levenshtein (1966) distance and treat derivations as ordered pairs of concepts so we can choose $\Delta_D$ to be the Hamming distance.
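The computation can be sketched with the standard library alone: Hamming distance over derivations, Levenshtein distance over messages, and a Spearman correlation computed as the Pearson correlation of average ranks. The function names are hypothetical, and enumerating all unordered pairs of distinct items is an assumption about how the distribution over pairs is realised.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def average_ranks(values):
    """Ranks starting at 1, with ties receiving the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation; undefined (division by zero) if either input is constant."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def topographic_similarity(derivations, messages):
    """Spearman correlation between pairwise derivation and message distances."""
    pairs = list(combinations(range(len(messages)), 2))
    d_dist = [hamming(derivations[i], derivations[j]) for i, j in pairs]
    m_dist = [levenshtein(messages[i], messages[j]) for i, j in pairs]
    return spearman(d_dist, m_dist)


derivations = [("blue", "circle"), ("blue", "square"), ("red", "circle"), ("red", "square")]
messages = ["ac", "ad", "bc", "bd"]
score = topographic_similarity(derivations, messages)
```

For this toy protocol the two distance lists coincide pair by pair, so the correlation is maximal.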

Conflict count

Conflict count was introduced by Kuciński et al. (2020). It assumes that the number of concepts in a derivation is equal to the message length and that there is a one-to-one mapping between concepts and symbols. It then counts how often this mapping is violated.

Let us denote each permutation mapping the position of a symbol to the position of concepts as , where . Then, let us denote the principal meaning of a symbol at position as , where


Here denotes the indicator function and the -th concept in a derivation . Then, conflict count is


where .

Because conflict count assumes the number of concepts in a derivation to be equal to the message length, it is undefined for the two protocols that violate this assumption: negation and context-sensitive.

Appendix C Effect of composition function in TRE

In this additional experiment, we analyse the effect of various implementations of (the composition function for -dimensional vector representations of derivations ) on TRE scores across protocols. We consider three implementations of :

  1. Additive composition, where is vector addition:

  2. Linear composition, where is a linear transformation:


    where are learnable parameters.

  3. Non-linear composition, where is a two-layer feedforward neural network:


    Here , , , and denotes the size of the hidden layer. We choose .

The results of the experiments are presented in Figure 3. While additive and linear composition perform similarly, the capacity of non-linear composition is probably too high for the task, resulting in severe overfitting (e.g. low TRE even for random and holistic protocols) and a false negative for the context-sensitive protocol. The presented results were stable across hyperparameters of TRE (e.g. learning rate, weight decay coefficient, number of epochs).

Figure 3: Scores assigned by various compositionality metrics to various protocols. Each subplot corresponds to a different metric, with protocols on the Y-axis and scores on the X-axis. Again, the X-axes do not share a common scale and we report negative TRE so a higher value means greater compositionality. The width of the box for each score is its confidence interval describing the standard deviation across five random seeds.