Learning to Improve Representations by Communicating About Perspectives

09/20/2021 ∙ by Julius Taylor, et al. ∙ Inria

Effective latent representations need to capture abstract features of the external world. We hypothesise that the necessity for a group of agents to reconcile their subjective interpretations of a shared environment state is an essential factor influencing this property. To test this hypothesis, we propose an architecture where individual agents in a population receive different observations of the same underlying state and learn latent representations that they communicate to each other. We highlight a fundamental link between emergent communication and representation learning: the role of language as a cognitive tool and the opportunities conferred by subjectivity, an inherent property of most multi-agent systems. We present a minimal architecture comprised of a population of autoencoders, where we define loss functions capturing different aspects of effective communication and examine their effect on the learned representations. We show that our proposed architecture allows the emergence of aligned representations. The subjectivity introduced by presenting agents with distinct perspectives of the environment state contributes to learning abstract representations that outperform those learned by both a single autoencoder and a population of autoencoders presented with identical perspectives. Altogether, our results demonstrate how communication from subjective perspectives can lead to the acquisition of more abstract representations in multi-agent systems, opening promising perspectives for future research at the intersection of representation learning and emergent communication.




1 Introduction

Learning effective internal representations of the world lies at the core of cognition, being associated with vital cognitive functions such as planning and imagination in both human and artificial agents (de_vega_symbols_2008; harnadSymbolGroundingProblem1990; santoroSymbolicBehaviourArtificial2021a). Self-supervised representation learning techniques offer a successful and promising paradigm which has recently received much attention from the machine learning community (kingma2013auto; higgins2016beta; pathakCuriositydrivenExplorationSelfsupervised2017). Nevertheless, contributions employing these techniques have so far been mainly examined from a single-agent perspective, limiting the study of both opportunities and challenges arising through the concurrent learning of representations. In this paper we highlight a link that has dominated discussions in the study of human cognition: the relationship between representation learning and communication. In particular, we focus on the effect that social interaction in shared situations has on the emergence of effective representations (mirolli_language_2009; cangelosiGroundingSharingSymbols2006).

In this quest, we draw inspiration from studies of human cognition and development, arguing that language is not only a communication system but also a cognitive tool that improves all aspects of human cognition (vygotsky2012thought; clark2012magic; waxman1995words). In particular, scientists in the developmental robotics community have long recognised that language has a double function (mirolli_language_2009; cangelosi2010integration). On one hand, it is a communicative means that facilitates interaction in groups of agents. This aspect has motivated most works in emergent communication, from earlier works introducing referential games (steels1997synthetic) to more recent contributions in multi-agent reinforcement learning (jaquesSocialInfluenceIntrinsic2019a; mordatchEmergenceGroundedCompositional2018a). On the other hand, language is a cognitive tool capable of transforming high-level internal functions related to cognition and categorization (mirolli_language_2009; vygotsky2012thought). Here again, earlier work in developmental robotics has, for instance, shown how perceptual categories are affected by emergent communication (bleys2009grounded), while recent work in machine learning has shown how language can be used as a cognitive tool enhancing exploration through compositional imagination (colasLanguageCognitiveTool2020c).

We propose a multi-agent interpretation of representation learning using convolutional autoencoders (masci2011stacked; guo2017deep). We allow agents to communicate about their perceptual inputs using messages, which we then use to compute multi-agent objectives beyond the classical autoencoding loss. These losses are designed to give agents an incentive to learn an aligned latent mapping of the data which can be understood by all participants within a population of agents. We hypothesise that, given the need to find aligned representations, agents are forced to abstract away unimportant details of the data and thus find better representations overall. We further assume that all agents always observe the same underlying state of the environment, but do so from different "points of view", i.e. each agent receives a private observation representing the current state. We hypothesise that this effect of subjectivity is an important factor for learning good representations, because small disparities among the different agents incentivise an alignment of internal representations towards a more "objective" representation that abstracts away the particularities of the subjective inputs. Our proposed setting is grounded in hypotheses related to human social cognition, particularly the hypothesis that the uniquely human ability of understanding and reconciling perspectives emerged in settings of joint attention (tomaselloTwoKeySteps2012a).

Our work aims at examining which conditions favour the alignment of representations in a population of learning agents and what effect it has on the learned representations. We study the role of different factors including training objectives, subjectivity, communication noise, and the ability to differentiate through the communication channel. We investigate the quality of learned representations in terms of linear separability, natural clustering, and alignment within the agent population. We find that under specific conditions the autoencoders can converge on aligned and efficient representations, even in a fully decentralised setting where gradients cannot pass the communication channel barrier. Moreover, when allowing differentiable communication, this shared latent code also exhibits better separation of the input data than a code which is learned by a population of baseline autoencoders.


We summarise our contributions as follows:

  1. We highlight an interesting intersection between emergent communication and self-supervised representation learning.

  2. We propose a canonical multi-agent, self-supervised representation learning architecture, together with a set of training objectives that can be optimised in a fully decentralised setting.

  3. We provide a detailed analysis of the conditions ensuring both the learning of efficient individual representations and the alignment of those representations across the agent population.

2 Related work

The overarching benefit of effective representations is that of capturing task-agnostic priors of the world. Capturing such abstract properties has been equated with learning representations invariant to local changes in the input (bengio_represent). NIPS2009_428fca9b, for example, find that deep autoencoders exhibit this ability in image recognition tasks. However, representations used by an encoder-decoder pair result from the co-adaptation of the two networks and are, therefore, not abstract in any objective sense. In contrast, tieleman_shaping_2019 introduce community-based autoencoders, which adopt a population-based approach that randomly pairs all encoders and decoders in a set, forcing them to learn more abstract representations. Their work is closely related to ours, as it is motivated by a link between representation learning and communication. However, they train all autoencoders on the same data and do not encourage an alignment of internal representations; thus, they do not examine the effect of reconciling perspectives. Moreover, they only study their architecture in a centralised setting, which limits its applications in decentralised multi-agent systems. In contrast, our work assumes that agents have access neither to other agents' inputs nor to their reconstructions.

Our proposed architecture has close links to contrastive learning (chen2020simple; khosla2020supervised), because it incentivises moving positive examples (or in our case different perspectives) closer together in a learned latent space. While contrastive learning requires researchers to agree on definitions of positive and negative pairs, as well as on data augmentation functions, our work uses the inherent properties of multi-agent systems to naturally infer positive samples.

One of the essential features of human intelligence is the ability to cooperate through language. How language evolves as a byproduct of joint optimisation tasks is studied in the emergent communication literature. There is much recent work studying the emergence of communication in multi-agent settings (foersterLearningCommunicateDeep2016a; ecclesBiasesEmergentCommunication2019a; jaquesSocialInfluenceIntrinsic2019a). lazaridouMultiAgentCooperationEmergence2017a, for example, study the emergence of communication in referential games, where communication is the very purpose of the game. Matrix games and riddles have been explored by lowePitfallsMeasuringEmergent2019a and foersterLearningCommunicateSolve2016, and 2D physics simulations by mordatchEmergenceGroundedCompositional2018a. See lazaridou_emergent_2020 for a recent review of emergent communication with deep learning. While in most of those works task scores are examined as a prior for successful communication, there is no work, to the best of our knowledge, which specifically focuses on the effect of the emergence of communication on the underlying learned representations.

3 Methods

Figure 1: Proposed architecture. Two agents $i$ and $j$ (top and bottom dashed boxes) are presented with different observations ($o_i$ and $o_j$) of the same underlying environment state ($s$). Each agent (say $i$) implements a standard autoencoder architecture with an encoder mapping the input observations to latent representations and a decoder mapping latent representations to reconstructed observations. Each agent communicates the latent representation (also called a message) of its own observation to the other agents. This way, each agent is able to reconstruct the observation from its own latent representation ($\hat{o}_{ii}$: reconstruction by agent $i$ from its own message $m_i$) as well as from the latent representation of the other ($\hat{o}_{ij}$: reconstruction by agent $i$ from the other's message $m_j$). The architecture can be trivially extended to a larger population, where each agent communicates its latent representations to every other agent.

3.1 Problem definition

We consider a population of $N$ agents and environment states $s \in \mathcal{S}$, hidden to the agents. Each agent $i$ receives a private observation $o_i \in \mathcal{O}$ of the state $s$, where $\mathcal{O}$ is an observation space. Agents are essentially convolutional autoencoders, though other self-supervised learning techniques could be used (for example variational autoencoders (kingma2013auto)). We define encoder and decoder functions $E_i : \mathcal{O} \to \mathcal{M}$ and $D_i : \mathcal{M} \to \mathcal{O}$, respectively, where $\mathcal{M}$ is a latent representation space (also called a message space, see below). Given an input observation, an agent encodes it into a latent representation $m_i = E_i(o_i)$ and attempts to reconstruct the observation through $\hat{o}_{ii} = D_i(m_i)$, dropping the dependence on $s$ for brevity. Agents will use these latent vectors to communicate to other agents about their perceptual inputs (hence the term message space for $\mathcal{M}$). When agent $i$ receives a message $m_j$ from agent $j$, they decode the message using their own decoder, i.e. $\hat{o}_{ij} = D_i(m_j)$. The diagram in Fig. 1 depicts a possible setup with 2 agents.

Importantly, while the observations provided to the agents at each time step are sampled from the same environment state, we systematically ensure that agents never access this state, nor each other's input observations and reconstructions. This makes our approach applicable to a wide range of decentralised multi-agent settings.

Given this architecture and a dataset mapping each state $s$ to a set of $n_s$ observations in $\mathcal{O}$, where $n_s$ is the number of observations available in the dataset for state $s$, we are interested in the following research questions:

  • Under which conditions can the agents converge towards aligned representations? (see below for the definition of alignment measures)

  • Does multi-agent representation learning improve the efficiency of the learned representations compared to a single agent baseline? If so, what are the main factors influencing it?

3.2 Losses for communication

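In order to incentivise communication in our system, we define three loss functions which encourage agents to converge on a common protocol in their latent spaces. We define the message-to-message loss $\mathcal{L}^{MTM}$ as a distance between the messages $m_i$ and $m_j$ that agents $i$ and $j$ produce for the same state.

This loss incentivises the two messages to be similar. Since messages are always received in a shared context, this loss encourages agents to find a common representation for it. Next, we propose the decoding-to-input loss $\mathcal{L}^{DTI}$, a distance between $D_i(m_j)$, agent $i$'s decoding of agent $j$'s message, and agent $i$'s own input observation $o_i$.

This loss brings the decoding of agent $i$ from agent $j$'s message closer to agent $i$'s input observation. Finally, we have the decoding-to-decoding loss $\mathcal{L}^{DTD}$, which is similar to $\mathcal{L}^{DTI}$ but uses as the target agent $i$'s reconstruction from its own message, $D_i(m_i)$, rather than the observation. Note that the standard autoencoding loss $\mathcal{L}^{AE}$ is the distance between $D_i(m_i)$ and $o_i$.

Especially when optimising for $\mathcal{L}^{MTM}$, there is a potential pitfall: the trivial solution of mapping all data points to a single point in latent space, yielding $\mathcal{L}^{MTM} = 0$. Thus, $\mathcal{L}^{MTM}$ will always need to be optimised in conjunction with losses which prevent this degenerate solution, such as $\mathcal{L}^{AE}$. We further add noise to the communication channel to a) simulate a more realistic communication environment and b) enforce more diverse representations in the latent space, preventing it from collapsing when optimising $\mathcal{L}^{MTM}$. The message of agent $i$ is thus defined as $m_i = E_i(o_i) + \epsilon$, where $E_i$ is the encoder of agent $i$ and $\epsilon$ is channel noise whose level $\sigma$ is a hyperparameter in our system. The total loss we optimise is thus a weighted sum of $\mathcal{L}^{AE}$, $\mathcal{L}^{MTM}$, $\mathcal{L}^{DTI}$ and $\mathcal{L}^{DTD}$, with one weighting hyperparameter per loss term.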
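The four loss terms and their weighted sum can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes mean squared error as the distance for every term, Gaussian channel noise, and toy list-based vectors. The names `mse`, `noisy_message`, `total_loss` and the weight keys are hypothetical.

```python
import random

def mse(a, b):
    """Mean squared error between two equal-length vectors (assumed distance)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def noisy_message(latent, sigma, rng):
    """Add channel noise to a latent vector; Gaussian noise is an assumption."""
    return [z + rng.gauss(0.0, sigma) for z in latent]

def total_loss(o_i, m_i, m_j, decode_i, weights):
    """Weighted sum of the four losses from agent i's point of view.

    o_i: agent i's observation, m_i / m_j: own and received messages,
    decode_i: agent i's decoder, weights: dict with keys AE, MTM, DTI, DTD.
    """
    own = decode_i(m_i)      # reconstruction from agent i's own message
    other = decode_i(m_j)    # reconstruction from the other agent's message
    losses = {
        "AE": mse(own, o_i),     # standard autoencoding loss
        "MTM": mse(m_i, m_j),    # message-to-message
        "DTI": mse(other, o_i),  # decoding-to-input
        "DTD": mse(other, own),  # decoding-to-decoding
    }
    return sum(weights[k] * losses[k] for k in losses), losses

rng = random.Random(0)
identity = lambda m: list(m)   # toy decoder, for illustration only
o_i = [1.0, 0.0]
m_i = noisy_message([1.0, 0.0], sigma=0.0, rng=rng)   # sigma=0: noiseless channel
m_j = [0.0, 0.0]
total, parts = total_loss(o_i, m_i, m_j, identity,
                          {"AE": 1.0, "MTM": 1.0, "DTI": 0.0, "DTD": 0.0})
```

In this sketch, an encoder that maps every input to the same latent vector would drive the MTM term to zero regardless of the data, which is why that term is paired with a reconstruction loss.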

3.3 Environment

We use MNIST (GPL license) as the dataset; we determine the state $s_t$ at each time step $t$ by drawing from a discrete uniform distribution over the ten digit classes, i.e. $s_t \sim \mathcal{U}\{0, \dots, 9\}$. We then derive the individual observations of the agents by sampling random images from the subset of the dataset representing $s_t$, denoted as $\mathcal{D}_{s_t}$, without replacement. This way, agents get as their inputs at time $t$ different representations of the same underlying true digit $s_t$.
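The sampling scheme can be illustrated with a small sketch. It uses a toy dictionary of image identifiers grouped by digit instead of the real MNIST data, and the function name `sample_round` is hypothetical.

```python
import random

def sample_round(images_by_digit, n_agents, rng):
    """Draw a hidden state and one private observation per agent."""
    state = rng.randrange(10)                  # uniform over the digits 0..9
    pool = images_by_digit[state]              # all images showing that digit
    observations = rng.sample(pool, n_agents)  # sampled without replacement
    return state, observations

# Toy stand-in for MNIST: image identifiers grouped by digit class.
images_by_digit = {d: [f"img_{d}_{k}" for k in range(5)] for d in range(10)}
state, obs = sample_round(images_by_digit, n_agents=2, rng=random.Random(0))
```

Each agent receives a different image, but all images in one round depict the same digit, which is the hidden state.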

3.4 Training procedure

To train our system, we sample two agents without replacement from the population of agents and then minimise the losses defined in section 3.2. For each round of multi-agent training, we also train a pair of regular autoencoders with no access to multi-agent losses and use these agents as a baseline for comparison. The full training loop is described in Algorithm 1.

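Initialise a population of $N$ agents
Initialise a population of baseline agents
while not converged do
       Sample two agents $i$, $j$ without replacement
       Sample random digit $s$
       Sample inputs $o_i$, $o_j$ from the images of digit $s$
       Agents $i$ and $j$ compute and minimise the multi-agent losses
       Sample a pair of baseline agents
       Baseline agents minimise the autoencoding loss $\mathcal{L}^{AE}$
end while
Algorithm 1 Multi-agent autoencoding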
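The control flow of Algorithm 1 can be sketched as a plain loop with pluggable update functions. This is only a skeleton under assumptions: a fixed number of rounds replaces the convergence check, the baseline pair is assumed to see the same inputs as the multi-agent pair, and all names (`train`, `multi_agent_step`, `ae_step`, `sample_inputs`) are hypothetical.

```python
import random

def train(agents, baselines, sample_inputs, multi_agent_step, ae_step,
          n_rounds, rng):
    """Skeleton of the loop in Algorithm 1 with pluggable update functions."""
    for _ in range(n_rounds):
        i, j = rng.sample(range(len(agents)), 2)   # two distinct agents
        o_i, o_j = sample_inputs(rng)              # two views of one state
        multi_agent_step(agents[i], agents[j], o_i, o_j)
        b_i, b_j = rng.sample(range(len(baselines)), 2)
        ae_step(baselines[b_i], o_i)               # baselines: AE loss only
        ae_step(baselines[b_j], o_j)

# Dummy agents that only count how often they are updated.
agents = [{"updates": 0} for _ in range(3)]
baselines = [{"updates": 0} for _ in range(3)]

def count_pair(a, b, o_a, o_b):
    a["updates"] += 1
    b["updates"] += 1

def count_single(a, o):
    a["updates"] += 1

train(agents, baselines, lambda rng: ("view_a", "view_b"),
      count_pair, count_single, n_rounds=10, rng=random.Random(0))
```

Because the pair is drawn anew each round, every agent ends up communicating with every other agent over the course of training.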

3.5 Evaluation

We want to measure whether the latent representations learned by agents are both aligned and represent the underlying data well. Thus, we introduce measures for quantifying these properties. First, in order to check whether the representations capture important properties of the data, for each agent $i$ we train a linear classifier $c_i$ to predict class identity from the latent space, similar to tieleman_shaping_2019. If agents find good representations, this classification task will be easier; thus we consider classification accuracy as a proxy for the quality of the representation. When training linear classifiers, we always freeze the agents' weights such that only the linear part is learned.

Next, in order to measure the alignment of representations, each agent $i$ uses its linear classifier from the previous step to classify latent representations of the dataset computed by other agents, thus computing $c_i(m_j)$ with $i \neq j$. We record this zero-shot classification performance (without any training), which we call swap accuracy. Further, we define an additional proxy for alignment, called the agreement, which can be described as the fraction of time the classifiers of agents $i$ and $j$ agree on the same class label when given the same input.
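Both alignment measures reduce to simple counting once the frozen classifiers are available. The sketch below uses toy one-dimensional latent codes and threshold functions in place of trained linear classifiers; it assumes that agreement is evaluated by feeding the same messages to both classifiers, and the function names are hypothetical.

```python
def accuracy(classifier, messages, labels):
    """Fraction of messages for which the classifier predicts the true label."""
    hits = sum(classifier(m) == y for m, y in zip(messages, labels))
    return hits / len(labels)

def agreement(classifier_i, classifier_j, messages):
    """Fraction of messages on which two classifiers predict the same label."""
    same = sum(classifier_i(m) == classifier_j(m) for m in messages)
    return same / len(messages)

# Toy 1-D latent codes produced by agent j, with their true labels.
messages_j = [0.1, 0.4, 0.6, 0.9]
labels = [0, 0, 1, 1]
classifier_i = lambda m: int(m > 0.5)   # stands in for agent i's frozen probe
classifier_j = lambda m: int(m > 0.3)   # stands in for agent j's frozen probe

swap_acc = accuracy(classifier_i, messages_j, labels)   # i's probe on j's codes
agree = agreement(classifier_i, classifier_j, messages_j)
```

Swap accuracy needs the true labels, whereas agreement only compares the two predictions, so the two measures can diverge when both classifiers make the same mistakes.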

3.6 Hyperparameter search

To determine the best combination of hyperparameters, we run a search over the space of possible combinations. In order to balance exhaustiveness, flexibility, and computation time, we opt to use a mixture of grid search and random sampling. For each hyperparameter controlling a loss term, i.e. the weight of each loss in the total loss, we uniformly sample a value between 0 and 1. We then evaluate this combination of losses at four different noise levels. Additionally, to inspect each loss in isolation, we run a special set of runs where one specific loss weight is set to 1 and the others to 0. Those runs are also evaluated at the aforementioned noise levels.
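The mixed random and grid search can be sketched as follows. The loss names used as dictionary keys, the function name `build_configs`, and in particular the four noise values are placeholders chosen for illustration, not the values used in the experiments.

```python
import random

LOSSES = ("AE", "MTM", "DTI", "DTD")

def build_configs(n_random, noise_levels, rng):
    """Random loss weights plus one-loss-only runs, each at every noise level."""
    weight_sets = [{k: rng.uniform(0.0, 1.0) for k in LOSSES}
                   for _ in range(n_random)]
    # Isolation runs: one weight set to 1, the others to 0.
    weight_sets += [{k: float(k == solo) for k in LOSSES} for solo in LOSSES]
    return [{"weights": w, "sigma": s} for w in weight_sets for s in noise_levels]

# The four noise values below are placeholders, not the values from the paper.
configs = build_configs(n_random=3, noise_levels=(0.0, 0.1, 0.5, 1.0),
                        rng=random.Random(0))
```

With three random weight sets and four isolation runs, crossed with four noise levels, this produces 28 configurations.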

We analyse our methods in two distinct conditions. In the first one, called differentiable communication (DC), we allow gradients to pass the communication channel barrier during optimisation (see e.g. foersterLearningCommunicateDeep2016a for a similar setting in emergent communication). This means that gradients are backpropagated from the decoder of an agent (e.g. $D_i$ in Fig. 1) to the message sent by another agent (e.g. $m_j$ in Fig. 1). In the second one, called non-differentiable communication (NDC), we relax this assumption, therefore turning to a fully decentralised setting where gradient backpropagation is strictly contained within each agent architecture (i.e. within each of the dashed boxes in Fig. 1).
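One way to make the difference between the two conditions concrete is through the chain rule. The notation here is an assumption introduced for illustration: $\theta_j$ denotes the encoder parameters of the sending agent $j$, $m_j$ its message, and $\mathcal{L}_i$ any loss that the receiving agent $i$ computes from $m_j$.

```latex
% DC: the receiver's loss contributes to the sender's encoder gradient
\frac{\partial \mathcal{L}_i}{\partial \theta_j}
  = \frac{\partial \mathcal{L}_i}{\partial m_j}\,
    \frac{\partial m_j}{\partial \theta_j}

% NDC: the receiver treats the received message as a constant
\frac{\partial \mathcal{L}_i}{\partial \theta_j} = 0
```

In an automatic differentiation framework, the NDC condition would typically correspond to applying a stop-gradient to the received message before decoding it (for instance `detach()` in PyTorch), although the exact mechanism used in the experiments is not specified here.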

We implement all code and models using the Python programming language (PSF license) (van1995python) and the PyTorch library (modified BSD license) (pytorch2019), and generate all plots with matplotlib (PSF license) (mpl2007) and seaborn (BSD license) (michael_waskom_2017_883859). We open-source our code in a GitHub repository.


4 Experiments and results

If not stated otherwise, all experiments are run with ten unique random seeds, and error bars indicate the 95% confidence interval (bar plots) or the standard deviation (line plots). Significance is determined using an unequal variances t-test (Welch’s t-test). We annotate significance levels in bar plots using the symbols defined in Table 1. We use the statannot package (https://github.com/webermarcolivier/statannot, MIT license) for annotation. Before measurements, we train populations of 3 agents (if not stated otherwise) for 50000 total rounds according to the procedure described in Algorithm 1, using a batch size of 1028. We use an internal distributed architecture which allows training on up to 128 Nvidia V100 GPUs concurrently. In total, we consumed 50000 GPU hours for all experiments.
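For reference, the statistic behind Welch's t-test can be computed with the standard library alone. The sketch below returns the t statistic and the Welch-Satterthwaite degrees of freedom; the function name `welch_t` is hypothetical and the sample values are made up for illustration, not results from the experiments.

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)   # squared standard errors
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Made-up accuracies for two conditions, not results from the paper.
t, df = welch_t([0.90, 0.92, 0.94], [0.80, 0.85, 0.90])
```

The p-value then follows from Student's t distribution with `df` degrees of freedom; in practice a library routine such as `scipy.stats.ttest_ind` with `equal_var=False` performs the whole test.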

We provide results in both the DC (Section 4.1) and NDC (Section 4.2) conditions, as defined in Section 3.6. We analyse 400 parameter settings in each condition and rank each individual experiment according to the resulting swap accuracy, defined in section 3.

4.1 Differentiable communication

Alignment of representations

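Figure 2: Swap accuracy (left) and agreement (right) as a function of hyperparameters in settings with differentiable communication. DTI describes a configuration where only the decoding-to-input loss $\mathcal{L}^{DTI}$ is optimised, whereas AE+MTM uses a combination of the autoencoding loss $\mathcal{L}^{AE}$ and the message-to-message loss $\mathcal{L}^{MTM}$. AE agents use only $\mathcal{L}^{AE}$ and serve as a baseline.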

We first want to investigate which combination of loss functions defined in section 3 leads to the learning of aligned representations in an agent population. In order to measure alignment, we use both swap accuracy and agreement, defined in section 3. We rank each individual experiment from the hyperparameter search according to the resulting swap accuracy. We plot swap accuracy and agreement for the hyperparameters leading to the highest swap accuracy (DTI), a baseline where only the autoencoding loss $\mathcal{L}^{AE}$ is optimised (AE), and the hyperparameters which have the highest performance in the decentralised setting (AE+MTM). We present the results in Fig. 2.

We find that solely optimising the decoding-to-input loss $\mathcal{L}^{DTI}$ yields the best alignment of representations, significantly outperforming all other settings we tested. Furthermore, the results show that a group of baseline agents (AE) does not find aligned representations.

Quality of learned representations

Figure 3: Classification accuracy when predicting class identity from the learned latent space, evaluated for different variables, using differentiable communication. Left: Accuracy for different runs and with or without perspective. Right: Accuracy of best run (DTI) as a function of the number of agents per population.

We now turn to the investigation of the properties of the learned latent spaces and try to answer the question of how learning an aligned protocol impacts the underlying representations. To that end, we first train populations of agents and then predict class identity from the learned latent space, as described in section 3. Similar to the previous section, we use the hyperparameter search to rank outcomes according to the classification accuracy and report the best outcome (DTI) in addition to AE and AE+MTM. Further, we investigate the dependence on the number of participating agents by re-running the DTI configuration with 4, 5, and 8 agents. We also conduct an ablation study where we supply all agents from the population with the same input images, instead of using different images from the same underlying class. We display the results in Fig. 3.

We find that DTI yields the highest classification performance. In fact, DTI finds representations which result in significantly better classification performance than both the baseline (AE) and AE+MTM. We would like to highlight that the experiment achieving the best classification accuracy uses the same parameters as the experiment which yielded the strongest alignment of representations in the previous paragraph (DTI).

When using the same input image for all agents during training (Fig. 3 left, right column), we find that the previously observed benefits no longer hold. In fact, all of the displayed configurations yield the same classification accuracy. Fig. 3 further reveals that increasing the number of agents per population seems to improve classification performance for DTI.

4.2 Non-differentiable communication

We now turn to the setting with non-differentiable communication (NDC). Overall, we find that with NDC, different losses are useful for incentivising communication and that the learned representations are not as favourable as in the setting with DC.

Alignment of representations

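Figure 4: Swap accuracy (left) and agreement (right) as a function of hyperparameters in settings with non-differentiable communication. DTI describes a configuration where only the decoding-to-input loss $\mathcal{L}^{DTI}$ is optimised, whereas AE+MTM uses a combination of the autoencoding loss $\mathcal{L}^{AE}$ and the message-to-message loss $\mathcal{L}^{MTM}$. AE agents use only $\mathcal{L}^{AE}$ and serve as a baseline.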

We repeat all experiments from section 4.1 with NDC. Again, we first rank the results of the hyperparameter search according to swap accuracy. We show the best result (AE+MTM) together with the baseline (AE) and the best result from the differentiable setting (DTI). We display the results in Fig. 4.

Contrary to the previous section, the outcomes reveal that DTI does not yield aligned representations. We find that a combination of the autoencoding loss $\mathcal{L}^{AE}$ and the message-to-message loss $\mathcal{L}^{MTM}$ yields the highest classification accuracy. While it seems necessary to introduce $\mathcal{L}^{MTM}$ to achieve good performance, agents only optimising $\mathcal{L}^{MTM}$, while finding aligned representations, converge on degenerate solutions in which they map all inputs to the same point in latent space.

Quality of the learned representations

Figure 5: Classification accuracy when predicting class identity from the learned latent space, evaluated for different variables, using non-differentiable communication. Left: Accuracy for different runs and with or without perspective. Right: Accuracy of best run (AE+MTM) as a function of the number of agents per population.

To assess representation capacity in the NDC setting, we rank the results of the hyperparameter search according to classification accuracy. Again, we display the best result from the previous analysis (AE+MTM) along with the baseline (AE) and the best result from the DC setting (DTI). We display the results in Fig. 5. We run the same ablations concerning perspective and analyse the dependence of the best performing run on the number of agents in the population.

Fig. 5 (left) demonstrates that DTI does not find good latent representations of the data. It is at chance level and thus significantly worse than both AE and AE+MTM. While AE+MTM finds suitable representations, we find no significant difference from the baseline. When evaluating the effect of subjectivity, we also find no significant difference in classification accuracy among agents. Similar results hold for the evaluation of experiments with bigger agent populations: we observe no significant improvement in classification performance when evaluating AE+MTM at 3, 4, 5, or 8 agents.

5 Discussion

In this work, we show that better data representations can be obtained by using a population of agents optimising their respective latent space for communication while exploiting the property of subjectivity, which is inherent to multi-agent systems. We show that, using our proposed multi-agent losses, we can achieve an aligned protocol shared among all agents. Most interestingly, we find that the conditions which give rise to the most aligned protocols also give rise to the best representations. Thus, we conjecture that in the process of aligning their latent spaces, agents are forced to escape the local minima induced by the potential co-adaptation of their encoder-decoder pairs. In addition, our results show that exploiting the inherent subjectivity of the systems seems to be crucial for better representations to arise. Additionally, we want to highlight that under the assumption of no subjectivity and when only optimising DTI, our approach reduces to the approach introduced by tieleman_shaping_2019 (see Related Work). Our results actually seem to contradict their finding that more efficient representations can be obtained without subjectivity.

Because our work is limited in scope and complexity, we can afford to investigate properties such as the interaction of our proposed losses deeply and with statistical rigor. Nevertheless, we believe that the insights gained here should be taken to a more complex domain. One natural extension to our work would encompass the integration into a full multi-agent reinforcement learning loop. This is particularly interesting, because a shared context arises from multi-agent interaction in a natural way. When agents are in close vicinity, they most likely observe cohesive parts of their environments. In this setup, the improved representations can be used for policy learning or the building of effective world models. hafnerDreamControlLearning2020a, for example, show that good representations are crucial for the success of model-based reinforcement learning (wangBenchmarkingModelBasedReinforcement2019; moerlandModelbasedReinforcementLearning2020b).

Finally, when analysing our approach in settings without differentiable communication, we are able to find aligned representations, but these do not seem to feature the same favourable properties as in settings where gradients may pass agent barriers. We leave it for future work to explore ways of approximating differentiable communication in a truly decentralised setting.

6 Broader impact

This work provides a step in the direction of building agents that can collectively learn representations of their environment and communicate about it. While we chose to use a minimal architecture in order to study the properties of the system in a well-controlled setting, future extensions can also lead to real-world applications. One that comes to mind is multi-robot systems that can jointly learn world representations and communicate about them in order to solve cooperative tasks. If successfully implemented, this technology can raise issues concerning the automation of certain tasks, resulting in loss of jobs. The question of the controllability and the interpretability of such multi-robot systems can also raise ethical issues.


  1. For all authors…

    1. Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? See section 4 and section 5

    2. Did you describe the limitations of your work? See section 5

    3. Did you discuss any potential negative societal impacts of your work? See section 6.

    4. Have you read the ethics review guidelines and ensured that your paper conforms to them?

  2. If you are including theoretical results…

    1. Did you state the full set of assumptions of all theoretical results?

    2. Did you include complete proofs of all theoretical results?

  3. If you ran experiments…

    1. Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? See section 3.

    2. Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? See section 4.

    3. Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? See section 3.

    4. Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? See section 3.

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

    1. If your work uses existing assets, did you cite the creators? See section 3

    2. Did you mention the license of the assets?

    3. Did you include any new assets either in the supplemental material or as a URL?

    4. Did you discuss whether and how consent was obtained from people whose data you’re using/curating?

    5. Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?

  5. If you used crowdsourcing or conducted research with human subjects…

    1. Did you include the full text of instructions given to participants and screenshots, if applicable?

    2. Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?

    3. Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?

Appendix A Appendix

symbol   p-value range
ns       $0.05 < p \le 1$
*        $0.01 < p \le 0.05$
**       $0.001 < p \le 0.01$
***      $0.0001 < p \le 0.001$
****     $p \le 0.0001$
Table 1: p-value annotation legend.
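The legend can be expressed as a small lookup function. This is an illustrative sketch: the function name `significance_symbol` is hypothetical, and treating each upper bound as inclusive is an assumption based on the usual star-notation convention.

```python
def significance_symbol(p):
    """Map a p-value to the annotation symbol used in the legend."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p-value must lie in [0, 1]")
    # Thresholds are checked from the strictest to the most lenient.
    for threshold, symbol in ((1e-4, "****"), (1e-3, "***"),
                              (1e-2, "**"), (5e-2, "*")):
        if p <= threshold:
            return symbol
    return "ns"
```

For example, a p-value of 0.03 falls in the single-star band, while 0.2 is reported as not significant.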