
Two Body Problem: Collaborative Visual Task Completion

by   Unnat Jain, et al.

Collaboration is a necessary skill to perform tasks that are beyond one agent's capabilities. Addressed extensively in both conventional and modern AI, multi-agent collaboration has often been studied in the context of simple grid worlds. We argue that there are inherently visual aspects to collaboration which should be studied in visually rich environments. A key element in collaboration is communication that can be either explicit, through messages, or implicit, through perception of the other agents and the visual world. Learning to collaborate in a visual environment entails learning (1) to perform the task, (2) when and what to communicate, and (3) how to act based on these communications and the perception of the visual world. In this paper we study the problem of learning to collaborate directly from pixels in AI2-THOR and demonstrate the benefits of explicit and implicit modes of communication to perform visual tasks. Refer to our project page for more details:



1 Introduction

Developing collaborative skills is known to be more cognitively demanding than learning to perform tasks independently. In AI, multi-agent collaboration has been studied in more conventional [32, 43, 9, 58] and modern settings [53, 28, 79, 35, 56, 61]. These studies have mainly been performed on grid-worlds and have factored out the role of perception in collaboration.

In this paper we argue that there are aspects of collaboration that are inherently visual. Studying collaboration in simplistic environments does not permit observing the interplay between perception and communication, which is necessary for effective collaboration. Imagine moving a piece of furniture with a friend. Part of the collaboration is rooted in explicit communication through exchanging messages, and part of it relies on implicit communication through interpreting perceivable cues about the other agent’s behavior. If you see your friend going around the furniture to grab it, you would naturally stay on the opposite side to avoid toppling it over. Additionally, communication and collaboration should be considered jointly with the task itself. The way you communicate, either explicitly or implicitly, in a soccer game is very different from when you move furniture. This suggests that factoring out perception and studying collaboration in isolation (grid-world) might not result in an ideal outcome.

In short, learning to perform tasks collaboratively in a visual environment entails joint learning of (1) how to perform tasks in that environment, (2) when and what to communicate, and (3) how to act based on implicit and explicit communication. In this work, we develop one of the first frameworks that enables the study of explicitly and implicitly communicating agents collaborating together in a photo-realistic environment.

To this end we consider the problem of finding and lifting bulky items, ones which cannot be lifted by a single agent. While conceptually simple, attaining proficiency in this task requires multiple stages of communication. The agents must search for the object of interest in the environment (possibly communicating their findings to each other), position themselves appropriately (for instance, opposing each other), and then lift the object simultaneously. If the agents position themselves incorrectly, lifting the object will cause it to topple over. Similarly, if the agents pick up the object at different time steps, they will not succeed.

To study this task, we use the AI2-THOR virtual environment [48], a photo-realistic, physics-enabled environment of indoor scenes used in past work to study single agent behavior. We extend AI2-THOR to enable multiple agents to communicate and interact.

We explore collaboration along several modes: (1) The benefits of communication for spatially constrained tasks (e.g., requiring agents to stand across one another while lifting an object) vs. unconstrained tasks. (2) The ability of agents to implicitly and explicitly communicate to solve these tasks. (3) The effect of the expressivity of the communication channel on the success of these tasks. (4) The efficacy of these developed communication protocols on known environments and their generalizability to new ones. (5) The challenges of egocentric visual environments vs. grid-world settings.

We propose a Two Body Network, or TBONE, for modeling the policies of agents in our environments. TBONE operates on a visual egocentric observation of the 3D world, a history of past observations and actions of the agent, as well as messages received from other agents in the scene. At each time step, agents go through two rounds of communication, akin to sending a message each and then replying to messages that are received in the first round. TBONE is trained with a warm start using a variant of DAgger [70], followed by a minimization of a sum of an A3C loss and a cross entropy loss between the agents’ actions and the actions of an expert policy.

We perform a detailed experimental analysis of the impact of communication using metrics including accuracy, number of failed pickup actions, and episode lengths. Following our above research questions, our findings show that: (1) Communication clearly benefits both constrained and unconstrained tasks but is more advantageous for constrained tasks. (2) Both explicit and implicit communication are exploited by our agents and both are beneficial, individually and jointly. (3) For our tasks, large vocabulary sizes are beneficial. (4) Our agents generalize well to unseen environments. (5) Abstracting our environments towards a grid-world setting improves accuracy, confirming our notion that photo-realistic visual environments are more challenging than grid-world like settings. This is consistent with findings by past works for single agent scenarios.

Finally we interpret the explicit mode of communication between agents by fitting logistic regression models to the messages to predict values such as oracle distance to target, next action, etc., and find strong evidence matching our intuitions about the usage of messages between agents.

2 Related Work

We now review related work in the directions of visual navigation, navigation and language, visual multi-agent reinforcement learning (RL), and virtual learning environments employed in past works to evaluate algorithms.

Figure 2: A schematic depicting the inputs to the policy network. An agent’s policy operates on a partial observation of the scene’s state and a history of previous observations, actions, and messages received.
Figure 3: Overview of our TBONE architecture for collaboration.

Visual Navigation: A large body of work focuses on visual navigation, i.e., locating a target using only visual input. Prominent early map-based navigation methods [47, 6, 7, 64] use a global map to make decisions. More recent approaches [76, 87, 23, 85, 46, 71] reconstruct the map on the fly. Simultaneous localization and mapping [84, 74, 24, 12, 67, 77] consider mapping in isolation. Upon having obtained a map of the environment, planning methods [13, 44, 52] yield a sequence of actions to achieve the goal. Combinations of joint mapping and planning have also been discussed [27, 50, 49, 31, 3]. Map-less methods [38, 54, 69, 72, 66, 92, 36] often formulate the task as obstacle avoidance given an input image or reconstruct a map implicitly. Conceptually, for visual navigation, we must learn a mapping from visual observations to actions which influence the environment. Consequently the task is well suited for an RL formulation, a perspective which has become popular recently [62, 1, 16, 17, 33, 42, 86, 59, 5, 8, 90, 25, 36, 91, 37]. Some of these approaches compute actions from observations directly while others attempt to explicitly/implicitly reconstruct a map.

Following recent techniques, our proposed approach also uses RL for visual navigation. While our proposed approach could be augmented with explicit or implicit maps, our focus is upon multi-agent communication. In the spirit of factorizing out orthogonal extensions from the model, we defer such extensions to future work.

Navigation and Language: Another line of work has focused on communication between humans and virtual agents. These methods more accurately reflect real-world scenarios since humans are more likely to interact with an agent using language rather than abstract specifications. Recently Das et al. [19, 21] and Gordon et al. [34] proposed to combine question answering with robotic navigation. Chaplot et al. [15], Anderson et al. [2] and Hill et al. [39] propose to guide a virtual agent via language commands.

While language directed navigation is an important task, we consider an orthogonal direction where multiple agents need to collaboratively solve a specified task. Since visual multi-agent RL is itself challenging, we refrain from introducing natural language complexities. Instead, in this paper, we are interested in developing a systematic understanding of the utility and character of communication strategies developed by multiple agents through RL.

Visual Multi-Agent Reinforcement Learning: Multi-agent systems result in non-stationary environments posing significant challenges. Multiple approaches have been proposed over the years to address such concerns [82, 83, 81, 30]. Similarly, a variety of settings from multiple cooperative agents to multiple competitive ones have been investigated [51, 65, 57, 11, 63, 35, 56, 29, 61].

Among the plethora of work on multi-agent RL, we want to particularly highlight work by Giles and Jim [32], Kasai et al. [43], Bratman et al. [9], Melo et al. [58], Lazaridou et al. [53], Foerster et al. [28], Sukhbaatar et al. [79] and Mordatch and Abbeel [61], all of which investigate the discovery of communication and language in the multi-agent setting using maze-based tasks, tabular setups, or Markov games. For instance, Lazaridou et al. [53] perform experiments using a referential game of image guessing, Foerster et al. [28] focus on switch-riddle games, Sukhbaatar et al. [79] discuss multi-turn games on the MazeBase environment [80], and Mordatch and Abbeel [61] evaluate on a rectangular environment with multiple target locations and tasks. Most recently, Das et al. [20] demonstrate, especially in grid-world settings, the efficacy of targeted communication where agents must learn to whom they should send messages.

Our work differs from the above body of work in that we consider communication for visual tasks, i.e., our agents operate in rich visual environments rather than a grid-like maze, a tabular setup or a Markov game. We are particularly interested in investigating how communication and perception support each other.

Reinforcement Learning Environments: As just discussed, our approach is evaluated on a rich visual environment. Suitable environment simulators are AI2-THOR [48], House3D [88], HoME [10], MINOS [73] for Matterport3D [14] and SUNCG [78]. Common to these environments is the goal of modeling real world living environments with substantial visual diversity. This is in contrast to other RL environments such as the arcade environment [4], Vizdoom [45], block towers [55], Malmo [41], TORCS [89], or MazeBase [80]. Of these environments, we chose AI2-THOR as it was easy to extend, provides high fidelity images, and has interactive physics enabled scenes, opening up interesting multi-agent research directions beyond this current work.

3 Collaborative Task Completion

Figure 4: Communication and belief refinement module for the talk stage of explicit communication.

We are interested in understanding how two agents can learn, from pixels, to communicate so as to effectively and collaboratively solve a given task. To this end, we develop a task for two agents which consists of two components, each tailored to a desirable skill for indoor agents. The components are: (1) visual navigation, which the agents may solve independently, but which may also benefit from some collaboration; and (2) jointly synchronized interaction with the environment, which typically requires collaboration to succeed. The choice of these components stems from the fact that navigating to a desired position in an environment or to locate a desired object is a quintessential skill for an indoor agent, and synchronized interaction is fundamental to understanding any collaborative multi-agent setting.

We first discuss the collaborative task more formally, then detail the components of our network, TBONE, used to complete the task.

3.1 Task: Find and Lift Furniture

We task two agents to lift a heavy target object in an environment, a task that cannot be completed by a single agent owing to the weight of the object. The two agents as well as the target object are placed at random locations in a randomly chosen AI2-THOR living room scene. Both agents must locate the target, approach it, position themselves appropriately, and then simultaneously lift it.

To successfully complete the task, both agents perform actions over time according to the same learned policy (Fig. 2). Since our agents are homogeneous, we share the policy parameters for both agents. Previous works [35, 61] have found that this trains agents more efficiently. For an agent, the policy operates on (1) an ego-centric observation of the environment and (2) a history of previous (a) observations, (b) actions taken by the agent, and (c) messages sent by the other agent. At each time step, the two agents process their current observations and then perform two rounds of explicit communication. Each round of communication involves each of the agents sending a single message to the other. The agents also have the ability to watch the other agent (when in view) and possibly even recognize their actions over time, thereby using implicit communication as a means of gathering information.

More formally, an agent perceives the scene at time $t$ in the form of an image $o_t$ and chooses its action $a_t \in \mathcal{A}$ by computing a policy, i.e., a probability distribution $\pi_\theta(a_t \mid o_t, h_{t-1})$, over all actions $a_t \in \mathcal{A}$. In our case, the images $o_t$ are first-person views obtained from AI2-THOR. Following classical recurrent models, our policy leverages information computed in the previous time-step via the representation $h_{t-1}$. The set of available actions $\mathcal{A}$ consists of the five options MoveAhead, RotateLeft, RotateRight, Pass, and Pickup. The actions MoveAhead, RotateLeft, and RotateRight allow the agent to navigate. To simplify the complexities of continuous time movement we let a single MoveAhead action correspond to a step of size 0.25 meters, a single RotateRight action correspond to a 90 degree rotation clockwise, and a single RotateLeft action correspond to a 90 degree rotation anti-clockwise. The Pass action indicates that the agent should stand still and Pickup is the agent’s attempt to pick up the target object. Critically, the Pickup action has the desired effect only if three preconditions are met, namely both agents must (1) be within 1.5 meters of the object and be looking directly at it, (2) be a minimum distance away from one another, and (3) carry out the Pickup action simultaneously. Note that asking agents to be at a minimum distance from one another amounts to adding specific constraints on their relative spatial layouts with regard to the object and hence requires the agents to reason about such relationships. This is akin to requiring the agents to stand across from each other when they pick up the object. The motivation to model spatial constraints with a minimum distance constraint is to allow us to easily manipulate the complexity of the task. For instance, setting this minimum distance to 0 loosens the constraints and only requires agents to meet two of the above preconditions.
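As a rough illustration of the three Pickup preconditions, the following sketch checks them for a pair of agents. The `AgentState` container, the function name, and the use of straight-line distance are assumptions made for this example; the source does not specify how the simulator measures distance or visibility.

```python
import math
from dataclasses import dataclass


@dataclass
class AgentState:
    x: float            # position in meters (hypothetical representation)
    z: float
    sees_target: bool   # True when the agent is looking directly at the object
    action: str         # action chosen at this time step


def pickup_succeeds(a, b, target_xz, min_agent_dist, max_target_dist=1.5):
    """Return True only if all three preconditions described in the text hold."""
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    # (1) both agents are close to the object and looking at it
    near_and_looking = all(
        s.sees_target and dist((s.x, s.z), target_xz) <= max_target_dist
        for s in (a, b)
    )
    # (2) the agents are at least the minimum distance apart
    far_enough_apart = dist((a.x, a.z), (b.x, b.z)) >= min_agent_dist
    # (3) both agents choose Pickup at the same time step
    simultaneous = a.action == "Pickup" and b.action == "Pickup"
    return near_and_looking and far_enough_apart and simultaneous
```

Setting `min_agent_dist=0` makes the second check trivially true, which corresponds to the loosened variant of the task mentioned above.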

In our experiments, we train agents to navigate within and interact with 30 indoor environments. Specifically, an episode is considered successful if both agents navigate to a known object and, jointly, lift it within a fixed number of time steps. As our focus is the study of collaboration and not primarily object recognition, we keep the sought object, a television, constant. Importantly, environments as well as the agents’ start locations and the target object location are randomly assigned at the start of each episode. Consequently, the agents must learn to (1) search for the target object in different environments, (2) navigate towards it, (3) stay within the object’s vicinity until the second agent arrives, (4) coordinate that both agents are apart from each other by at least the specified distance, and (5) finally and jointly perform the pickup action.

Intuitively, we expect the agents to perform better on this task if they can communicate with each other. We conjecture that explicit communication will allow them to both signal when they have found the object and, after navigation, help coordinate when to attempt a Pickup, whereas implicit communication will help to reason about their relative locations with regards to each other and the object. To measure the impact of explicit and implicit means of communication in the given task, we train models with and without message passing as well as by making agents (in)visible to one another. Explicit communication would seem to be especially important in the case where implicit communication isn’t possible. Without any communication, there seems to be no better strategy than for both agents to independently navigate to the object and then repeatedly try Pickup actions in the hope that they will be, at some point, in sync. The expectation that such a policy may be forthcoming gives rise to one of our metrics, namely the count of failed pickup events among both agents in an episode. We discuss metrics and results in Section 4.

Data            Accuracy      Reward
Visual          59.0 ± 4.0    -2.7 ± 0.3    0.3 ± 0.09    2.9 ± 0.8
Visual+depth    65.7 ± 3.9    -2.0 ± 0.3    0.4 ± 0.1     3.2 ± 0.9
Grid-world      78.2 ± 3.4    -0.6 ± 0.2    0.1 ± 0.05    0.7 ± 0.1
Table 1: Effect of adding oracle depth as well as moving to a grid-world setting on unseen scenes, Constrained task.

3.2 Network Architecture

In the following we describe the learned policy (actor) and value (critic) functions for each agent in greater detail. See Fig. 3 for a high level visualization of our network structure. Let $\theta$ represent a catch-all parameter encompassing all the learnable weights in TBONE. At the $t$-th timestep in an episode we obtain as an agent’s observation, from AI2-THOR, an RGB image $o_t$ which is then processed by a four layer CNN $c_\theta$ into the 1024-dimensional vector $c_\theta(o_t)$. Onto $c_\theta(o_t)$ we append an 8-dimensional learnable embedding $e$ which, unlike all other weights in the model, is not shared between the two agents. This agent embedding gives the agents the capacity to develop distinct complementary strategies. The concatenation of $c_\theta(o_t)$ and $e$ is fed, along with historical embeddings from time $t-1$, into a long short-term memory (LSTM) [40] cell resulting in a 512-dimensional output vector $\tilde{h}_t$ capturing the beliefs of the agent given its prior history and most recent observation. Intuitively, we now would like the two agents to refine their beliefs via communication before deciding on a course of action. We consider this process in several stages (Fig. 4).

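Communication: We model communication by allowing the agents to send one another a $d$-dimensional vector derived by performing soft-attention over a vocabulary of a fixed size $K$. More formally, let $W_{\text{send}} \in \mathbb{R}^{K \times 512}$, $b_{\text{send}} \in \mathbb{R}^{K}$, and $V_{\text{send}} \in \mathbb{R}^{d \times K}$ be (learnable) weight matrices with the columns of $V_{\text{send}}$ representing our vocabulary. Then, given the 512-dimensional belief representation $\tilde{h}_t$ described above, the agent computes soft-attention over the vocabulary producing the message $m_{\text{send}} = V_{\text{send}}\,\text{softmax}(W_{\text{send}} \tilde{h}_t + b_{\text{send}}) \in \mathbb{R}^{d}$ which is relayed to the other agent.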
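The message computation can be sketched in a few lines of dependency-free Python. This is a minimal illustration, assuming the attention scores come from an affine map of the belief vector followed by a softmax; the function and argument names are hypothetical and the tiny dimensions are for readability only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def compose_message(belief, W, b, V):
    """belief: length n; W: K x n; b: length K; V: d x K (columns = vocabulary)."""
    scores = [s + bias for s, bias in zip(matvec(W, belief), b)]
    weights = softmax(scores)       # attention over the K vocabulary entries
    message = matvec(V, weights)    # convex combination of vocabulary columns
    return message, weights
```

Because the message is a convex combination of vocabulary columns, even a vocabulary of two entries lets an agent send any point on the segment between them, which is one way to read the remark that agents may use the "full continuous spectrum".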

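Belief Refinement: Given the agents’ current beliefs $\tilde{h}_t \in \mathbb{R}^{512}$ and the message $m_{\text{received}}$ from the other agent, we model the process of refining one’s beliefs given new information using a two layer fully connected neural network with a residual connection. In particular, $\tilde{h}_t$ and $m_{\text{received}}$ are concatenated, and new beliefs $\hat{h}_t$ are formed by computing $\hat{h}_t = \tilde{h}_t + \text{ReLU}(W_2\,\text{ReLU}(W_1 [\tilde{h}_t; m_{\text{received}}] + b_1) + b_2)$ where $b_1, b_2 \in \mathbb{R}^{512}$, $W_1 \in \mathbb{R}^{512 \times (512 + d)}$, and $W_2 \in \mathbb{R}^{512 \times 512}$ are learnable weight matrices. We set the value of $d$ to 8.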
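The refinement step, a two layer fully connected network with a residual connection applied to the concatenated belief and message, might look like the sketch below. The ReLU nonlinearity and all names are assumptions made for illustration; plain Python lists stand in for tensors.

```python
def relu(vector):
    return [max(0.0, x) for x in vector]


def affine(W, b, vector):
    """Compute W @ vector + b for a matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) + bias
            for row, bias in zip(W, b)]


def refine_belief(belief, message, W1, b1, W2, b2):
    """Residual update of the belief given a received message."""
    x = belief + message                    # list concatenation [belief; message]
    hidden = relu(affine(W1, b1, x))        # first fully connected layer
    update = relu(affine(W2, b2, hidden))   # second fully connected layer
    return [h + u for h, u in zip(belief, update)]  # residual connection
```

With all weights at zero the function returns the belief unchanged, which shows what the residual connection buys: the agent can ignore an uninformative message at no cost.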

Reply and Additional Refinement: The above step is followed by one more round of communication and belief refinement by which the refined representation $\hat{h}_t$ is transformed into $h_t$. These additional stages have new sets of learnable parameters including a new vocabulary matrix. Note that, unlike in the standard LSTM framework where the LSTM output $\tilde{h}_t$ would be fed into the cell at time $t+1$, we instead give the LSTM cell the refined vector $h_t$.

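Linear Actor and Critic: Finally the policy and value functions are computed from the final refined belief vector $h_t$ as $\pi_\theta(\cdot \mid o_t, h_{t-1}) = \text{softmax}(W_{\text{actor}} h_t + b_{\text{actor}})$, and $v_\theta(o_t, h_{t-1}) = W_{\text{critic}} h_t + b_{\text{critic}}$ where $W_{\text{actor}} \in \mathbb{R}^{5 \times 512}$, $b_{\text{actor}} \in \mathbb{R}^{5}$, $W_{\text{critic}} \in \mathbb{R}^{1 \times 512}$, and $b_{\text{critic}} \in \mathbb{R}$ are learned.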

Figure 5: Unseen scenes metrics (Constrained task): (a) Failed pickups (b) Missed pickups (c) Relative ep. len (d) Accuracy.

3.3 Learning

Similar to others [19, 36, 18, 22], we found training of our agents from scratch to be infeasible when using a pure reinforcement learning (RL) approach, e.g., with asynchronous advantage actor-critic (A3C) [60], even in simplified settings, without extensive reward shaping. Indeed, often the agents must make upwards of 60 actions to navigate to the object and will only successfully complete the episode and receive a reward if they jointly pick up the object. This setting of extremely sparse rewards is a well known failure mode of standard RL techniques. Following the above prior work, we use a “warm-start” by training with a variant of DAgger [70]. We train our models online using imitation learning for 10,000 episodes with actions for episode $i$ sampled from the mixture $(1 - \alpha_i)\,\pi_{\theta_{i-1}} + \alpha_i\,\pi^{*}$ where $\theta_{i-1}$ are the parameters learned by the model up to episode $i$, $\pi^{*}$ is an expert policy (described below), and $\alpha_i$ decays linearly as $i$ increases. This initial warm-start allows the agents to learn a policy for which rewards are far less sparse, allowing traditional RL approaches to be applicable. Note that our expert supervision only applies to the actions; there is no supervision for how agents should communicate. Instead the agents must learn to communicate in such a way that would increase the probability of expert actions.
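Sampling from a mixture of the learner and the expert amounts to following the expert with some probability and the learner otherwise. The sketch below illustrates this with a linearly decaying mixing weight. The start and end values of the schedule (1.0 and 0.0 here) and the function names are assumptions chosen for the example, not values taken from the text.

```python
import random


def expert_probability(episode, warm_start_episodes=10_000, start=1.0, end=0.0):
    """Linearly decay the chance of following the expert (start/end are assumed)."""
    frac = min(max(episode / warm_start_episodes, 0.0), 1.0)
    return start + (end - start) * frac


def sample_action(episode, learner_action, expert_action, rng=random):
    """Pick the executed action from the learner/expert mixture for this episode."""
    if rng.random() < expert_probability(episode):
        return expert_action
    return learner_action
```

Early in the warm start the executed trajectory mostly follows the expert, so the agents visit states where rewards are reachable; as the weight decays, the learner's own choices increasingly determine which states it is trained on.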

After the warm-start period, trajectories are sampled purely from the agent’s current policy and we train our agents by minimizing the sum of an A3C loss, and a cross entropy loss between the agents’ actions and the actions of an expert policy. The A3C and cross entropy losses here are complementary, each helping correct for a deficiency in the other. Namely, the gradients from an A3C loss tend to be noisy and can, at times, derail or slow training; the gradients from the cross entropy loss are noise free and thereby stabilize training. A pure cross entropy loss however fails to sufficiently penalize certain undesirable actions. For instance, diverging from the expert policy by taking a MoveAhead action when directly in front of a wall should be more strongly penalized than when the area in front of the agent is free as the former case may result in damage to the agent; both these cases are penalized equally by a cross entropy loss. The A3C loss, on the other hand, accounts for such differences easily so long as they are reflected by the rewards the agent receives.

We now describe the expert policy. If both agents can see the TV, are within 1.5 meters of it, and are at least a given minimum distance apart from one another then the expert action is to Pickup for both agents. Otherwise given a fixed scene and TV position we obtain, from AI2-THOR, the set $S = \{s_1, \ldots, s_N\}$ of all positions (on a grid with square size 0.25 meters) and rotations within 1.5 meters of the TV from which the TV is visible. Letting $\ell_i^j$ be the length of the shortest path from the current position of agent $i$ to $s_j$ we then assign each pair $(s_j, s_k) \in S \times S$ the score $\ell_0^j + \ell_1^k$. We then compute the lowest scoring tuple $(s, s') \in S \times S$ for which $s$ and $s'$ are at least a given minimum distance apart and assign agent 0 the expert action corresponding to the first navigational step along the shortest path from agent 0 to $s$ (and similarly for agent 1 whose expert goal is $s'$).
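The goal-selection part of the expert can be sketched as a breadth-first search followed by a search over candidate pairs. This is a simplified illustration under stated assumptions: rotations are ignored, the pair score is taken to be the sum of the two path lengths, separation is measured as Manhattan distance in grid cells, and all names are hypothetical.

```python
from collections import deque
from itertools import product


def bfs_lengths(start, free_cells):
    """Shortest path length (in grid steps) from start to each reachable free cell."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        x, z = queue.popleft()
        for nxt in ((x + 1, z), (x - 1, z), (x, z + 1), (x, z - 1)):
            if nxt in free_cells and nxt not in dist:
                dist[nxt] = dist[(x, z)] + 1
                queue.append(nxt)
    return dist


def pick_expert_goals(agent0, agent1, candidates, free_cells, min_sep):
    """Return the lowest-scoring (goal0, goal1) pair that respects the separation."""
    d0 = bfs_lengths(agent0, free_cells)
    d1 = bfs_lengths(agent1, free_cells)
    best, best_score = None, None
    for g0, g1 in product(candidates, repeat=2):
        if g0 not in d0 or g1 not in d1:
            continue  # unreachable for one of the agents
        separation = abs(g0[0] - g1[0]) + abs(g0[1] - g1[1])
        if separation < min_sep:
            continue  # violates the minimum-distance constraint
        score = d0[g0] + d1[g1]
        if best_score is None or score < best_score:
            best, best_score = (g0, g1), score
    return best, best_score
```

The expert action for each agent would then be the first step along its shortest path to its assigned goal, which can be recovered by also storing parent pointers during the search.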

Note that our training strategy and communication scheme can be extended to more than two agents. We defer such an analysis to future work, a careful analysis of the two-agent setting being an appropriate first step.

Implementation Details. Each model was trained for 100,000 episodes. Each episode is initialized in a random train (seen) scene of AI2-THOR. Rewards provided to the agents are: 1 to both agents for a successful pickup action, constant -0.01 step penalty to discourage long trajectories, -0.02 for any failed action (e.g., running into a wall) and -0.1 for a failed pickup action. Episodes run for a maximum of 500 steps (250 steps for each agent) after which the episode is considered failed.
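The reward scheme can be written as a small function. This sketch uses the constants stated above; whether the failure penalties are added on top of the step penalty, as done here, is an assumption, and the function signature is hypothetical.

```python
def step_reward(action_failed, is_pickup, joint_pickup_succeeded):
    """Per-agent reward for one step, using the constants given in the text."""
    reward = -0.01                      # constant step penalty
    if joint_pickup_succeeded:
        reward += 1.0                   # both agents lifted the object together
    elif action_failed:
        # a failed Pickup is penalized more heavily than other failed actions
        reward += -0.1 if is_pickup else -0.02
    return reward
```

Under this reading, an agent that repeatedly attempts Pickup while waiting for its partner loses 0.11 per step, which makes blind retrying costly relative to waiting with Pass.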

4 Experiments

In this section, we present our evaluation of the effect of communication towards collaborative visual task completion. We first briefly describe the multi-agent extensions made to AI2-THOR, the environments used for our analysis, the two tasks used as a test bed and metrics considered. This is followed by a detailed empirical analysis of the tasks. We then provide a statistical analysis of the explicit communication messages used by the agents to solve the tasks, which sheds light on their content. Finally we present qualitative results.

Framework and Data. We extend the AI2-THOR environment to support multiple agents that can each be independently controlled. In particular, we extend the existing initialization action to accept an agentCount parameter allowing an arbitrarily large number of agents to be specified. When additional agents are spawned, each is visually depicted as a capsule of a distinct color. This allows agents to observe each other’s presence and impact on the environment, a form of implicit communication. We also provide a parameter to render agents invisible to one another, which allows us to study the benefits of implicit communication. Newly spawned agents have the full capabilities of a single agent, being able to interact with the environment by, for example, picking up and opening objects. These changes are publicly available with AI2-THOR v1.0. We consider the 30 AI2-THOR living room scenes for our analysis, since they are the largest in terms of floor area and also contain a large amount of furniture. We train on 20 and test on the 20 seen scenes as well as the remaining 10 unseen ones.

Figure 6: Reward vs. training episodes on the Constrained task. (left) Seen scenes (right) Unseen scenes.

Tasks. We consider two tasks, both requiring the two agents to simultaneously pick up the TV in the environment: (1) Unconstrained: No constraints are imposed here with regards to the locations of the agents with respect to each other. (2) Constrained: The agents must be at least 8 steps from each other (akin to requiring them to stand across each other when they pick up the object). Intuitively, we expect the Constrained setting to be more difficult than the Unconstrained, since it requires the agents to spatially reason about themselves and objects in the scene. For each of the above tasks, we train 4 variants of TBONE, resulting from switching explicit and implicit communication on and off. Switching off implicit communication amounts to rendering the other agent invisible.

Metrics. We consider the following metrics: (1) Reward, (2) Accuracy: % successful episodes, (3) Number of Failed pickups, (4) Number of Missed pickups: where both agents could have picked up the object but did not, (5) Relative episode length: relative to an oracle. These metrics are aggregated over 400 random initializations (Unseen scenes: 10 scenes × 40 inits, Seen scenes: 20 scenes × 20 inits). Note that accuracy alone isn’t revealing enough. Naïve agents that wander around and randomly pick up objects will eventually succeed. Also, agents that correctly locate the TV and then keep attempting a pickup in the hope of synchronizing with the other agent will also succeed. Both these cases will however do poorly on the other metrics.

Quantitative analysis.

All plots and metrics referenced in this section contain 90% confidence intervals.
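The text does not say how these intervals are computed. As one plausible reading, the sketch below uses the normal approximation to a binomial proportion; the function name and the choice of z = 1.645 for a two-sided 90% interval are assumptions for illustration.

```python
import math


def accuracy_with_ci(successes, episodes, z=1.645):
    """Accuracy in percent and the half-width of a normal-approximation interval."""
    p = successes / episodes
    half_width = z * math.sqrt(p * (1.0 - p) / episodes)
    return 100.0 * p, 100.0 * half_width
```

For 236 successes out of 400 episodes this gives 59.0% with a half-width of roughly 4.0 points, which agrees with the first row of Table 1.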

Fig. 5 compares the four metrics: Accuracy, Failed pickups, Missed pickups, and Relative episode length for unseen scenes and the Constrained task. With regards to accuracy, explicit+implicit communication fares only moderately better than implicit communication, but the need for explicit communication is dramatic in the absence of an implicit one. But when one considers all metrics, the benefits of having both explicit and implicit communication are clearly visible. The number of failed and missed pickups is lower, while episode lengths are a little better than just using implicit communication. The differences between just explicit vs. just implicit also shrink when looking at all metrics together. However, across the board, it is clear that communicating is advantageous over not communicating.

Fig. 6 shows the rewards obtained by the 4 variants of our model on seen and unseen environments for the Constrained task. While rewards on seen scenes are unsurprisingly higher, the models with communication do generalize well to unseen environments. Adding the two means of communication is more beneficial than either and far better than not having any means of communication. Interestingly just implicit communication fares better than just explicit, on accuracy.

Figure 7: Constrained vs. unconstrained task (on unseen scenes): (left) Accuracy, (right) Relative episode length.
 (a) Constrained setting agent trajectories (b) Communication between agents
Figure 8: Single episode trajectory with associated agent communication.

Fig. 7 presents the accuracy and relative episode lengths metrics for the unseen scenes and Unconstrained task in contrast to the Constrained task. In these plots, for brevity we only consider the extreme cases of having full communication vs. no communication. As expected, the Unconstrained setting is easier for the agents with higher accuracy and lower episode lengths. Communication is also advantageous in the Unconstrained setting, but its benefits are lesser compared to the Constrained setting.

Table 1 shows a large jump in accuracy when we provide a perfect depth map as an additional input on the Constrained task, indicating that improved perception is beneficial to task completion. We also obtained significant jumps in accuracy (from 31.8 ± 3.8 to 37.2 ± 4.0) when we increase the size of our vocabulary from 2 to 8. This analysis was performed in the explicit-only communication and Constrained environment setup. However, note that even with a vocabulary of 2, agents may be using the full continuous spectrum to encode more nuanced events.

Grid-world abstraction. In order to assess the impact of learning to communicate from pixels rather than, as in most prior work, from grid-world environments, we perform a direct translation of our task into a grid-world and compare its performance to our best model. We transform the 1.25m × 2.75m area in front of our agent into a grid where each square is assigned a 16-dimensional embedding based on whether it is free space, occupied by another agent, occupied by the target object, otherwise unreachable, or unknown (in the case the grid square leaves the environment). The agents then move in AI2-THOR but perceive this partially observable grid-world. Agents in this setting acquire a large bump in accuracy on the Constrained task (Table 1), confirming our claim that photo-realistic visual environments are more challenging than grid-world like settings.
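The five cell categories can be expressed as a small labeling function whose output would index into a learned embedding table. The names and the priority order among categories are assumptions made for this sketch; the text only lists the categories.

```python
from enum import IntEnum


class Cell(IntEnum):
    FREE = 0
    OTHER_AGENT = 1
    TARGET = 2
    UNREACHABLE = 3
    UNKNOWN = 4


def label_cell(cell, free_cells, other_agent, target_cells, in_bounds):
    """Map one grid square to a category id (to be looked up in an embedding table)."""
    if not in_bounds(cell):
        return Cell.UNKNOWN          # the square leaves the environment
    if cell == other_agent:
        return Cell.OTHER_AGENT
    if cell in target_cells:
        return Cell.TARGET
    if cell in free_cells:
        return Cell.FREE
    return Cell.UNREACHABLE          # inside the scene but not traversable
```

Replacing pixels with these labels removes the need to recognize the object, the other agent, and obstacles from images, which is the perceptual burden that the comparison in Table 1 isolates.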

Interpreting Communication.

Est. 0.35 1.23 -0.35 0.88 0.59 -1.1
SE 0.013 0.019 0.013 0.013 0.015 0.013

Est. 1.06 -0.01 -0.04 0 -0.03 -1.09
SE 0.012 0.007 0.006 0.007 0.006 0.021
Table 2: Estimates, and corresponding robust bootstrap standard errors, of the parameters from Section 4.

While we have seen, in Section 6, that communication can substantially benefit our task, we now investigate what these agents have learned to communicate. We focus on the communication strategies learned by agents with a vocabulary of two in the Constrained setting. Fig. 8 displays one episode trajectory of the two agents with the corresponding communication. From Fig. 8(b) we generate hypotheses regarding communication strategies. Suppressing the dependence on episode and step, for let be the weight assigned by agent to the element of the vocabulary in the round of communication, and similarly let be as but for the round of communication. When the agent with the red trajectory (henceforth called agent 0 or ) begins to see the TV the weight increases and remains high until the end of the episode. This suggests that the round of communication may be used to signify closeness to or visibility of the TV. On the other hand, the pickup actions taken by the two agents are associated with the agents making and simultaneously small.

To add evidence to these hypotheses we fit logistic regression models to predict, from (functions of) and , two oracle values (e.g., whether the TV is visible) and whether or not the agents will attempt a pickup action. As the agents are largely symmetric we take the perspective of and define the models , and where

is the logit function. Details of how these models are fit can be found in the appendix.

From Table 2, which displays the estimates of the above parameters along with their standard errors, we find strong evidence for the above intuitions. Note that, for all of the estimates discussed above, the standard errors are very small, suggesting highly statistically significant results. The large positive coefficients associated with and suggest that, conditional on being held constant, an increase in the weight is associated with a higher probability of being near, and seeing, the TV. Note also that the estimated value of is fairly large in magnitude and negative. This is very much in line with our prior hypothesis that is made small when agent 0 wishes to signal a readiness to pick up the object. Finally, essentially all estimates of coefficients in the final model are close to 0 except for which is large and negative. Hence, conditional on other values being fixed, being small is associated with a higher probability of a subsequent pickup action. This again lends evidence to the hypothesis that the agents coordinate pickup actions by setting to small values.

5 Conclusion

We study the problem of learning to collaborate in visual environments and demonstrate the benefits of learned explicit and implicit communication to aid task completion. We compare performance of collaborative tasks in photo-realistic visual environments to an analogous grid-world environment, to establish that the former are more challenging. We also provide a statistical interpretation of the communication strategy learned by the agents.

Future research directions include extensions to more than two agents, more intricate real-world tasks, and scaling to more environments. It would also be exciting to enable natural language communication between the agents, which would naturally extend to settings with a human in the loop.


This material is based upon work supported in part by the National Science Foundation under Grants No. 1563727, 1718221, 1637479, 165205, 1703166, Samsung, 3M, Sloan Fellowship, NVIDIA Artificial Intelligence Lab, Allen Institute for AI, Amazon, AWS Research Awards and Thomas & Stacey Siebel Foundation.


Appendix A Appendix

Figure 9: Two stages of the communication and belief refinement module: talk and reply. The refined belief from the talk stage is further refined by another round of communication between agents at the reply stage. In this illustration the size of the vocabulary is 2.

This appendix presents the following content:

  1. Visualizations of the grid-world abstraction of our task,

  2. Our learning algorithm,

  3. Interplay between talk and reply stages of the communication and belief refinement module,

  4. Implementation details of the model,

  5. A detailed explanation of metrics used in our paper,

  6. Quantitative evaluation of our models but now evaluated on seen scenes,

  7. Statistical analysis of agent communication strategies but now demonstrated on unseen scenes,

  8. Qualitative results of agents with different communication abilities deployed on unseen scenes. This includes clip summaries with agent communication signals for the video.

a.1 AI2-THOR to Grid-world

In order to assess the impact of learning to communicate directly from pixels rather than, as in most prior work, from grid-world environments, we perform a direct translation of our task into a grid-world and compare its performance to our best model. For this purpose we transform AI2-THOR into a grid-world environment. Figure 10 visualizes, for a single AI2-THOR scene, this transformation. To make our comparison fair, as our pixel-based agents only obtain partial information about their environment at any given timestep, we impose the same restriction on our grid-world agents by only providing them with an egocentric view of their environment (see Figure 11).
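The egocentric restriction described above can be illustrated with a minimal sketch. It is not the authors' implementation: the category codes, the `egocentric_view` helper, the heading convention, and the 11 × 5 window (which assumes 0.25m grid squares and that the longer 2.75m side points forward) are all assumptions made for illustration, and occlusion is ignored.

```python
# Hypothetical category codes for grid squares (names are illustrative only).
FREE, AGENT, TARGET, BLOCKED, UNKNOWN = range(5)

# For each heading: (forward direction, rightward direction) as (d_row, d_col).
HEADINGS = {
    "N": ((-1, 0), (0, 1)),
    "E": ((0, 1), (1, 0)),
    "S": ((1, 0), (0, -1)),
    "W": ((0, -1), (-1, 0)),
}


def egocentric_view(grid, row, col, heading, depth=11, width=5):
    """Return the depth x width window in front of the agent.

    Rows are ordered from farthest to nearest; squares that fall outside
    the map are marked UNKNOWN, mirroring the 'unknown' category.
    """
    (f_row, f_col), (r_row, r_col) = HEADINGS[heading]
    half = width // 2
    view = []
    for ahead in range(depth, 0, -1):
        line = []
        for side in range(-half, half + 1):
            r = row + ahead * f_row + side * r_row
            c = col + ahead * f_col + side * r_col
            inside = 0 <= r < len(grid) and 0 <= c < len(grid[0])
            line.append(grid[r][c] if inside else UNKNOWN)
        view.append(line)
    return view


# Tiny 3 x 3 map with the target directly ahead of an agent facing north.
demo_grid = [
    [FREE, TARGET, FREE],
    [FREE, FREE, FREE],
    [FREE, FREE, FREE],
]
demo_view = egocentric_view(demo_grid, row=2, col=1, heading="N", depth=3, width=3)
```

In this toy call the farthest row of the window lies outside the map and is therefore reported as unknown, which is the same mechanism that keeps the grid-world agents partially informed.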

1: Randomly initialize shared model weights
2: Set global episode counter
3: while maxEpisodes in parallel do
4:
5:
6:     Randomly choose environment
7:     Randomize agents’ positions and TV location
8:     Set
9:     Set
10:    Roll out trajectory of length from both agents using .
11:    A3C loss for trajectory
12:    cross entropy loss of trajectory w.r.t.
13:    if no expert actions sampled in trajectory then
14:
15:    else
16:
17:    Perform one gradient update of using ADAM with gradients and statistics shared across processes
18: end
Algorithm 1 Learning Algorithm

a.2 Learning algorithm

Algorithm 1 succinctly summarizes our learning procedure as otherwise described in Section 3.3 of the main paper.

a.3 Talk and Reply stages

Explicit communication happens via two stages: talk and reply. As illustrated in Figure 9, each stage has its own weights (, , , ). These are clearly marked using superscripts of and for the talk and reply stage, respectively.

a.4 Implementation Details

We use the same hyperparameters and embedding dimensionality in all of our experiments. In our A3C loss we discount rewards with a factor of and weight the entropy maximization term with a factor of . We use the Adam optimizer with a learning rate of , momentum values of 0.9 and 0.999 (for the first and second moments respectively), and share optimizer statistics across processes. Gradient steps are made in the hogwild approach, that is, without explicit synchronization or locks between processes.


Each model was trained for 100,000 episodes. Each episode is initialized in a random train (seen) scene of AI2-THOR. Rewards provided to the agents are: 1 to both agents for a successful pickup action, a constant -0.01 step penalty to discourage long trajectories, -0.02 for any failed action (e.g., running into a wall), and -0.1 for a failed pickup action. Episodes run for a maximum of 500 total steps (250 steps for each agent), after which the episode is considered failed. The minimum aggregate achievable reward in an episode, obtained when both agents repeatedly attempt failed pickup actions, is -65, while the maximum reward is 1.98, achieved when both agents pick up the object as their first action and each receives only a single step penalty.
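The two reward bounds quoted above can be checked with a few lines of arithmetic. The sketch below assumes that a failed pickup incurs the step penalty, the generic failed-action penalty, and the failed-pickup penalty together; this stacking is an assumption, chosen because it reproduces the stated minimum of -65. The constant names are illustrative.

```python
# Reward constants as stated in the text (names are illustrative).
SUCCESS_REWARD = 1.0
STEP_PENALTY = -0.01
FAILED_ACTION_PENALTY = -0.02
FAILED_PICKUP_PENALTY = -0.1
MAX_TOTAL_STEPS = 500  # 250 steps for each of the two agents

# Assumption: a failed pickup is charged all three penalties at once.
worst_per_step = STEP_PENALTY + FAILED_ACTION_PENALTY + FAILED_PICKUP_PENALTY
min_reward = MAX_TOTAL_STEPS * worst_per_step

# Best case: both agents succeed on their first action, one step penalty each.
max_reward = 2 * (SUCCESS_REWARD + STEP_PENALTY)

print(round(min_reward, 2), round(max_reward, 2))
```

Under the alternative reading in which a failed pickup is charged only the step and failed-pickup penalties, the minimum would instead be 500 × (-0.11) = -55, so the stacked reading is the one consistent with the reported value.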

a.5 Metrics

We now present a more detailed explanation of the metrics we use to evaluate our models.

  1. Per agent reward structure:

    • +1 for performing a successful joint pickup,

    • -0.1 for a failed pickup action,

    • -0.02 for any other failed action (trying to move into walls, furniture, etc.), and

    • -0.01 for each step to encourage short trajectories.

  2. Accuracy: the percentage of episodes which led to the successful pickup action by both agents.

  3. Number of unsuccessful pickups: total number of pickup actions attempted by both agents which didn’t lead to the target being picked up. The three preconditions necessary for a successful joint pickup action are as follows.

    1. Both agents perform the pickup action simultaneously,

    2. Both agents are closer than 1.5m to the target and the target is visible, and

    3. Both agents are a minimum distance apart from each other (0 for the Unconstrained setting and 8 steps = 2 meters in Manhattan distance for the Constrained setting).

  4. Number of missed pickups: total number of episode steps where both agents could have picked up the object but did not. This is the number of opportunities where the second and third preconditions of item 3 were met, but the agents didn’t perform simultaneous pickup actions.

  5. Relative episode length: the length of the agents’ episode relative to the length of the episode produced by our expert policy.

    As it has access to information not available to the agents, our expert policy is also referred to as the oracle policy. As mentioned in the paper, the oracle plans a shortest path from each agent location to the target. This is achieved by leveraging the full map of the scene (i.e., free space, occupied areas, location of other agent, and the target location).
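The pickup preconditions and the two pickup-related metrics above can be sketched as follows. This is a minimal illustration rather than the evaluation code used in the paper: the `AgentState` fields, the `"Pickup"` action name, treating the separation threshold as "at least" the minimum, and evaluating both agents jointly at a single step are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class AgentState:
    # Hypothetical per-step record for one agent.
    x_steps: int             # position on the navigation grid, in steps
    z_steps: int
    dist_to_target_m: float  # distance to the target in meters
    target_visible: bool
    action: str


def manhattan_steps(a, b):
    """Manhattan distance between two agents, measured in steps."""
    return abs(a.x_steps - b.x_steps) + abs(a.z_steps - b.z_steps)


def pickup_feasible(a, b, min_separation_steps):
    """Second and third preconditions: near and visible target, agents far enough apart."""
    near = a.dist_to_target_m < 1.5 and b.dist_to_target_m < 1.5
    visible = a.target_visible and b.target_visible
    apart = manhattan_steps(a, b) >= min_separation_steps
    return near and visible and apart


def classify_step(a, b, min_separation_steps):
    """Return (success, unsuccessful_pickups, missed) for one joint step."""
    attempts = int(a.action == "Pickup") + int(b.action == "Pickup")
    feasible = pickup_feasible(a, b, min_separation_steps)
    success = feasible and attempts == 2       # first precondition: simultaneous
    unsuccessful = 0 if success else attempts  # each failed attempt counts once
    missed = feasible and not success
    return success, unsuccessful, missed


# Constrained setting: agents must be at least 8 steps apart.
a0 = AgentState(0, 0, 1.2, True, "Pickup")
a1 = AgentState(8, 0, 1.0, True, "MoveAhead")
result = classify_step(a0, a1, min_separation_steps=8)
```

Here the pickup was feasible but only one agent attempted it, so the step contributes one unsuccessful pickup and one missed pickup. Summing these per-step values over an episode gives the counts reported in the metrics.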

(a) Top view of AI2-THOR scene (b) Corresponding grid-world
Figure 10: An AI2-THOR scene from a top-down view alongside the corresponding grid-world. Note that each agent (teal triangles) only observes a small portion of the grid-world at any given time-step, see Figure 11 for details. Here each color corresponds to a different category: free space (green), impassable terrain (red), target object (orange), and unknown (purple).
(a) First-person AI2-THOR agent view (b) Agent partially observed grid-world overlaid on map view (c) Grid-world corresponding to agent view
Figure 11: First person viewpoints of agents in AI2-THOR and the corresponding grid-world observations. Note that white squares are unobserved and blue squares correspond to another agent, see Figure 10 for a description of the other colors.

a.6 Quantitative evaluation

In this section we provide quantitative evaluation results of variants of TBONE. We provide results on seen (train) and unseen (test) scenes. Many of the unseen scenes results are already included in the main paper, but we reproduce the full suite of graphs here, for ease of comparison.

For the Constrained task, Figure 12 and Figure 13 show the above metrics on seen and unseen scenes, respectively. For the Unconstrained task, Figure 14 and Figure 15 show the above metrics on seen and unseen scenes, respectively.

On the Constrained task in seen scenes (Figure 12), having both modes of communication clearly produces better rewards, and having either or both modes of communication easily outperforms agents with no means of communication. While the accuracy metric is similar to that obtained with only implicit communication, the number of unsuccessful pickups, the number of missed pickups, and the relative episode length show the benefit of having both modes of communication over any one of them. A similar trend is seen in unseen scenes for the same task (Figure 13).

On the Unconstrained task, the benefits of communication are, as expected, less dramatic (Figure 14 and Figure 15). Since the task is simpler and can potentially be solved without communication, agents with no means of communication are able to obtain high accuracies. In the absence of communication, however, agents end up with a large number of unsuccessful pickups. This is expected: with no means of communication, agents simply go close to the TV and start attempting pickups. Only with communication can they lower this metric by coordinating with each other.

Figure 12: Constrained task, seen scenes. Panels: (1) Reward, (2) Accuracy, (3) # Unsuccessful pickups, (4) # Missed pickups, (5) Relative episode length.
Figure 13: Constrained task, unseen scenes. Panels as in Figure 12.
Figure 14: Unconstrained task, seen scenes. Panels as in Figure 12.
Figure 15: Unconstrained task, unseen scenes. Panels as in Figure 12.

a.7 Interpreting Communication

To fit the logistic models described in Section 4 of the main paper we randomly initialize 2,687 episodes on the 20 training scenes from which we obtain a corresponding number of agent trajectories. Treating each step in these trajectories as a single observation, this results in a dataset containing 143,401 samples. We fit these logistic models using the statsmodels package [75] in Python. As observations within a single episode are highly correlated, we use the bootstrap [26] to obtain robust standard errors for our estimates.
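One common way to obtain standard errors that respect within-episode correlation is to resample whole episodes rather than individual steps. The sketch below illustrates that idea on synthetic data. It is not the procedure used in the paper: the pure-Python Newton solver stands in for the statsmodels fit, the single-covariate model and the data generator are invented for illustration, and whether the original analysis resampled at the episode level is an assumption.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function (the inverse of the logit)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, iters=20):
    """Fit logit P(y=1) = b0 + b1 * x by Newton's method; returns (b0, b1)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det < 1e-12:  # degenerate resample: keep the current estimate
            break
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < 1e-8:
            break
    return b0, b1


def episode_bootstrap_se(episodes, n_boot=100, seed=0):
    """Standard error of the slope, resampling episodes with replacement."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_boot):
        resampled = [rng.choice(episodes) for _ in episodes]
        xs = [x for ep in resampled for x, _ in ep]
        ys = [y for ep in resampled for _, y in ep]
        slopes.append(fit_logistic(xs, ys)[1])
    mean = sum(slopes) / n_boot
    return math.sqrt(sum((s - mean) ** 2 for s in slopes) / (n_boot - 1))


# Synthetic stand-in data: x plays the role of a communication weight and
# y of a binary oracle value; the true slope is 2 by construction.
data_rng = random.Random(1)
episodes = []
for _ in range(60):
    steps = []
    for _ in range(20):
        x = data_rng.random()
        y = 1 if data_rng.random() < sigmoid(-1.0 + 2.0 * x) else 0
        steps.append((x, y))
    episodes.append(steps)

all_x = [x for ep in episodes for x, _ in ep]
all_y = [y for ep in episodes for _, y in ep]
b0_hat, b1_hat = fit_logistic(all_x, all_y)
se_b1 = episode_bootstrap_se(episodes)
print(f"slope estimate {b1_hat:.2f}, bootstrap SE {se_b1:.2f}")
```

Because each bootstrap replicate keeps every episode intact, correlation between steps of the same episode is preserved in the resampled datasets, which is what makes the resulting standard errors robust to it.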

As the analysis above is done on the seen scenes, it raises the question of whether the same trends occur when agents communicate in unseen environments. To address this, we sample 1,333 agent episodes on the 10 test scenes, resulting in a dataset of 201,738 samples. We fit the same logistic regression models as in the main paper to this dataset and report the resulting estimates and standard errors in Table 3. While several estimates differ, in a statistically significant way, from those on the seen scenes, all trends remain the same, suggesting that agents communicate in largely the same way in unseen environments as they do in previously seen environments.

Est. 0.07 1.29 -0.14 0.65 0.57 -0.88
SE 0.033 0.027 0.031 0.041 0.027 0.042

Est. 1.15 -0.0 -0.04 -0.01 -0.04 -1.17
SE 0.037 0.009 0.009 0.009 0.011 0.041
Table 3: Estimates, and corresponding robust bootstrap standard errors, of the parameters from the main paper’s Section 4 when using trajectories sampled from the unseen scenes as described in Section A.7.
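As a rough check on the statement that several estimates differ significantly between seen and unseen scenes, the first rows of Tables 2 and 3 can be compared with a two-sample z statistic. This is a sketch under stated assumptions, not the authors' test: it assumes the columns of the two tables are in the same order, that the two sets of estimates are independent, and that a normal approximation with a 1.96 threshold is adequate.

```python
import math

# First "Est." and "SE" rows of Table 2 (seen scenes) and Table 3 (unseen scenes).
seen_est = [0.35, 1.23, -0.35, 0.88, 0.59, -1.1]
seen_se = [0.013, 0.019, 0.013, 0.013, 0.015, 0.013]
unseen_est = [0.07, 1.29, -0.14, 0.65, 0.57, -0.88]
unseen_se = [0.033, 0.027, 0.031, 0.041, 0.027, 0.042]


def z_statistic(est_a, se_a, est_b, se_b):
    """z statistic for the difference of two independent estimates."""
    return (est_a - est_b) / math.sqrt(se_a ** 2 + se_b ** 2)


z_values = [
    z_statistic(a, sa, b, sb)
    for a, sa, b, sb in zip(seen_est, seen_se, unseen_est, unseen_se)
]
significant = [abs(z) > 1.96 for z in z_values]  # two-sided 5% level

# The sign of every estimate is the same in both tables ("all trends remain").
same_sign = [(a > 0) == (b > 0) for a, b in zip(seen_est, unseen_est)]

for z, sig in zip(z_values, significant):
    print(f"z = {z:6.2f}  significant: {sig}")
```

Under these assumptions, some columns differ significantly while every estimate keeps its sign, which matches the qualitative reading given in the text.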

a.8 Qualitative results

a.8.1 Effect of communication

We present qualitative results of agents with three communication abilities: implicit + explicit vs. implicit only vs. no communication. We compare the effect by deploying these agents for a particular initialization of an episode, i.e., the same scene, agents’ start locations and target object location. We find that both explicit and implicit communication help achieve the task faster, as seen in Figure 16, Figure 17 and Figure 18, which have episode lengths of 86, 165 and 250 respectively. Another such initialization is compared in Figure 19, Figure 20 and Figure 21, which have episode lengths of 17, 72 and 217 respectively.

a.8.2 Video

The associated video includes episode visualizations for the Constrained task on unseen scenes. For these episodes we ran inference on the model with both explicit and implicit communication. The six clips in the video are summarized in Figure 22, Figure 23, Figure 24, Figure 25, Figure 26 and Figure 27. The first four culminated in a successful pickup of the target object. The last two clips highlight typical error modes.

Figure 16: Initialization 1: With explicit and implicit communication, episode length is 86 per agent. Associated agent communication in plot below, see Figure 8 in the main paper for a legend.
Figure 17: Initialization 1: With only implicit communication, episode length is 165 per agent.
Figure 18: Initialization 1: With no communication, episode length is 250 per agent (unsuccessful).
Figure 19: Initialization 2: With explicit and implicit communication, episode length is 17 per agent. Associated agent communication in plot below, see Figure 8 in the main paper for a legend.
Figure 20: Initialization 2: With only implicit communication, episode length is 72 per agent.
Figure 21: Initialization 2: With no communication, episode length is 217 per agent.
Figure 22: Clip 1 summary, see Figure 8 in the main paper for a legend.
Figure 23: Clip 2 summary, see Figure 8 in the main paper for a legend.
Figure 24: Clip 3 summary, see Figure 8 in the main paper for a legend.
Figure 25: Clip 4 summary, see Figure 8 in the main paper for a legend.
Figure 26: Clip 5 summary, see Figure 8 in the main paper for a legend.
Figure 27: Clip 6 summary, see Figure 8 in the main paper for a legend.