Collaboration is the defining principle of our society. Humans have refined strategies to collaborate efficiently, developing verbal, deictic, and kinesthetic means. In contrast, progress towards enabling artificial embodied agents to learn collaborative strategies is still in its infancy. Prior work mostly studies collaborative agents in grid-world-like environments. Visual, multi-agent, collaborative tasks have not been studied until very recently [das2018tarmac, jain2019CVPRTBONE]. While existing tasks are well designed to study some aspects of collaboration, they often do not require agents to closely collaborate throughout the task. Instead, such tasks either require initial coordination (e.g., distributing tasks) followed by almost independent execution, or collaboration only at a task's end (e.g., verifying completion). Few tasks require frequent coordination, and we are aware of none within a visual setting.
To study our algorithmic ability to address tasks which require close and frequent collaboration, we introduce the furniture moving (FurnMove) task (see Fig. 1), set in the AI2-THOR environment. Given only their egocentric visual observations, agents jointly hold a lifted piece of furniture in a living room scene and must collaborate to move it to a visually distinct goal location. As a piece of furniture cannot be moved without both agents agreeing on the direction, agents must explicitly coordinate at every timestep. Beyond coordinating actions, high performance in our task requires agents to visually anticipate possible collisions, handle occlusion due to obstacles and other agents, and estimate free space. Akin to the challenges faced by a group of roommates relocating a widescreen television, this task necessitates extensive and ongoing coordination amongst all agents at every timestep.
In prior work, collaboration between multiple agents has been enabled primarily by (i) sharing observations or (ii) learning low-bandwidth communication. Option (i) is often implemented using a centralized agent, i.e., a single agent with access to all observations from all agents [boutilier1999sequential, peng2017multiagent, usunier2016episodic]. While effective, it is also unrealistic: the real world poses restrictions on communication bandwidth, latency, and modality. We are interested in the more realistic decentralized setting enabled via option (ii). This is often implemented by one or more rounds of message passing between agents before they choose their actions [FoersterNIPS2016, LoweNIPS2017, jain2019CVPRTBONE]. Training decentralized agents when faced with FurnMove's requirement of coordination at each timestep leads to two technical challenges. Challenge 1: as each agent independently samples an action from its policy at every timestep, the joint probability tensor of all agents' actions at any given time is rank-one. This severely limits which multi-agent policies are representable. Challenge 2: the number of possible mis-steps or failed actions increases dramatically when requiring that agents closely coordinate with each other, complicating training.
Addressing challenge 1, we introduce SYNC (Synchronize Your actioNs Coherently) policies, which permit expressive (i.e., beyond rank-one) joint policies for decentralized agents while using interpretable communication. To ameliorate challenge 2, we introduce the Coordination Loss (CORDIAL), which replaces the standard entropy loss in actor-critic algorithms and guides agents away from actions that are mutually incompatible. A 2-agent system using SYNC and CORDIAL obtains a 58% success rate on test scenes in FurnMove, an impressive absolute gain of 25 percentage points over the baseline from [jain2019CVPRTBONE] (76% relative gain). In a 3-agent setting, this difference is even more extreme.
In summary, our contributions are: (i) FurnMove, a new multi-agent embodied task that demands ongoing coordination, (ii) SYNC, a collaborative mechanism that permits expressive joint action policies for decentralized agents, (iii) CORDIAL, a training loss for multi-agent setups which, when combined with SYNC, leads to large gains, and (iv) improvements to the open-source AI2-THOR environment, including a faster gridworld equivalent enabling fast prototyping.
2 Related work
We start by reviewing single agent embodied AI tasks followed by non-visual Multi-Agent RL (MARL) and end with visual MARL.
Single-agent embodied systems: Single-agent embodied systems have been considered extensively in the literature. For instance, the literature on visual navigation, i.e., locating an object of interest given only visual input, spans geometric and learning-based methods. Geometric approaches have been proposed separately for the mapping and planning phases of navigation. Methods entailing structure-from-motion and SLAM [tomasi1992shape, frahm2016structurefrommotion, thorpe2000structure, cadena2016past, smith1986on, smith1986estimating] were used to build maps. Planning algorithms on existing maps [CannyMIT1988, KavrakiRA1996, Lavalle2000] and combined mapping & planning [elfes1989using, kuipers1991byun, konolige2010viewbased, fraundorfer2012visionbased, aydemir2013active] are other related research directions.
While these works propose geometric approaches, the task of navigation can also be cast as a reinforcement learning (RL) problem, mapping pixels to policies in an end-to-end manner. RL approaches [oh2016control, abel2016exploratory, daftry2016learning, giusti2016he, kahn2017plato, toussaint2003learning, mirowski2017learning, tamar2016value] have been proposed to address navigation in synthetic layouts like mazes, arcade games, and other visual environments [wymann2013torcs, bellemare2013the, kempka2016vizdoom, lerer2016learning, johnson2016the, SukhbaatarARXIV2015]. Navigation within photo-realistic environments [BrodeurARXIV2017, SavvaARXIV2017Minos, Chang3DV2017Matterport, ai2thor, xia2018gibson, stanford2d3d, GuptaCVPR2018, xia2019interactive, habitat19iccv] led to the development of embodied AI agents. The early work of [ZhuARXIV2016] addressed object navigation (finding an object given an image) in AI2-THOR. Soon after, [GuptaCVPR2018]
showed how imitation learning permits agents to learn to build a map from which they navigate. Methods also investigate the utility of topological and latent memory maps [GuptaCVPR2018, savinov2018semiparametric, henriques2018mapnet, wu2019bayesian], graph-based learning [wu2019bayesian, yang2018visual], meta-learning [wortsman2019learning], unimodal baselines [thomason2019shifting], 3D point clouds [Wijmans2019EQAPhoto], and effective exploration [wang2019reinforced, savinov2018semiparametric, Chaplot2020Explore, ramakrishnan2020exploration] to improve embodied navigational agents. Embodied navigation also helps AI agents develop behaviors such as instruction following [HillARXIV2017, anderson2018vision, Suhr2019CerealBar, wang2019reinforced, anderson2019NeuripsChasing], city navigation [chen2019touchdown, mirowski2018learningcity, mirowski2019streetlearn, de2018talkthewalk], question answering [DasCVPR2018, DasECCV2018, GordonCVPR2018, Wijmans2019EQAPhoto, das2020probing], and active visual recognition [yang2018visualsemantic, yang2019embodied]. Recently, with visual and acoustic rendering, agents have been trained for audio-visual embodied navigation [chen2019audio, gao2020visualechoes].
In contrast to the above single-agent embodied tasks and approaches, we focus on collaboration between multiple embodied agents. Porting the above single-agent architectural novelties (or a combination of them) to multi-agent systems such as ours is an interesting direction for future work.
Non-visual MARL: Multi-agent reinforcement learning (MARL) is challenging due to the non-stationarity that arises when multiple agents learn simultaneously. Multiple methods have been proposed to address the resulting issues [TanICML1993, TesauroNIPS2004, TampuuPLOS2017, FoersterARXIV2017]. For instance, permutation-invariant critics have been developed recently [LiuCORL2019]. In addition, cooperation and competition between agents has been studied [LauerICML2000, Panait2005, MatignonIROS2007, Busoniu2008, OmidshafieiARXIV2017, GuptaAAMAS2017, LoweNIPS2017, FoersterAAAI2018, LiuCORL2019]. Similarly, communication and language in the multi-agent setting has been investigated [GilesICABS2002, KasaiSCIA2008, BratmanCogMod2010, MeloMAS2011, LazaridouARXIV2016, FoersterNIPS2016, SukhbaatarNIPS2016, MordatchAAAI2018, Baker2019EmergentTU] in maze-based setups, tabular tasks, or Markov games. These algorithms mostly operate on low-dimensional observations such as kinematic measurements (position, velocity, etc.) and top-down occupancy grids. For a survey of centralized and decentralized MARL methods, we refer the reader to [zhang2019multi]. Our work differs from the aforementioned MARL works in that we consider complex visual environments. Our contribution of SYNC-policies is largely orthogonal to the RL loss function or method. For a fair comparison to [jain2019CVPRTBONE], we use the same RL algorithm (A3C), but it is straightforward to integrate SYNC into other MARL methods [rashid2018qmix, FoersterAAAI2018, LoweNIPS2017] (for details, see sec:extra-training-details of the supplement).
Visual MARL: Recently, Jain et al. [jain2019CVPRTBONE] introduced a collaborative task for two embodied visual agents, which we refer to as FurnLift. In this task, two agents are randomly initialized in an AI2-THOR living room scene, must visually navigate to a TV, and, in a single coordinated PickUp action, work to lift that TV up. Note that FurnLift does not demand that agents coordinate their actions at each timestep. Instead, such coordination only occurs at the last timestep of an episode. Moreover, as the success of an action executed by one agent is independent of the other agent's action (with the exception of the PickUp action), a high-performance joint policy need not be complex, i.e., it may be well approximated by a low-rank policy. More details on this analysis and the complexity of our proposed FurnMove task are provided in sec:task.
Similarly, a recent preprint [chen2019visual] proposes a visual hide-and-seek task where agents can move independently. Das et al. [das2018tarmac] enable agents to learn who to communicate with, on predominantly 2D tasks; in visual environments they study a task where multiple agents navigate in parallel to the same object. Jaderberg et al. [jaderberg2019human] recently studied the game of Quake III, and Weihs et al. [weihs2019artificial] develop agents to play an adversarial hiding game in AI2-THOR. Collaborative perception for semantic segmentation and recognition has also been investigated recently [Liu_2020_CVPR, liu2020who2com].
To the best of our knowledge, all prior visual and non-visual MARL approaches in the decentralized setting operate with a single marginal probability distribution per agent, i.e., a rank-one joint distribution. Moreover, FurnMove is the first multi-agent collaborative task in a visually rich domain requiring close coordination between agents at every timestep.
3 The furniture moving task (FurnMove)
We describe our new multi-agent task FurnMove, grounded in the real-world experience of moving furniture. We begin by introducing notation.
RL background and notation. Consider $n$ collaborative agents $A^1, \dots, A^n$. At every timestep $t$ the agents, and environment, are in some state $s_t$ and each agent $A^i$ obtains an observation $o^i_t$ recording some partial information about $s_t$. For instance, $o^i_t$ might be the egocentric visual view of an agent embedded in some simulated environment. From the observation $o^i_t$ and history $h^i_{t-1}$, which records prior observations and decisions made by the agent, each agent forms a policy $\pi^i_t$, where $\pi^i_t(a)$ is the probability that agent $A^i$ chooses to take action $a$ from a finite set of options $\mathcal{A}$ at time $t$. After the agents execute their respective actions $(a^1_t, \dots, a^n_t)$, which we call a multi-action, they enter a new state $s_{t+1}$ and receive individual rewards $r^i_t$. For more on RL see [SuttonMIT1998, MnihNature2015, MnihEtAlPMLR2016].
Task definition. FurnMove is set in the near-photorealistic and physics-enabled simulated environment AI2-THOR [ai2thor]. In FurnMove, agents collaborate to move a lifted object through an indoor environment with the goal of placing this object above a visually distinct target, as illustrated in fig:teaser. Akin to humans moving large items, agents must navigate around other furniture and frequently walk in-between obstacles on the floor.
In FurnMove, each agent at every timestep receives an egocentric observation (an RGB image) from AI2-THOR. In addition, agents are allowed to communicate with other agents at each timestep via a low-bandwidth communication channel. Based on its local observation and communication, each agent must take an action from a set $\mathcal{A}$ of 13 actions. The space of actions available to an agent is comprised of: the four single-agent navigational actions MoveAhead, RotateLeft, RotateRight, and Pass, used to move the agent independently; four actions MoveWithObjectX, X ∈ {Ahead, Right, Left, Back}, used to move the lifted object and the agents simultaneously in the same direction; four actions MoveObjectX, X ∈ {Ahead, Right, Left, Back}, used to move the lifted object while the agents stay in place; and a single action used to rotate the lifted object clockwise. We assume that all movement actions for agents and the lifted object result in a displacement of 0.25 meters (similar to [jain2019CVPRTBONE, habitat19iccv]) and all rotation actions result in a rotation of 90 degrees (counter-)clockwise when viewing the agents from above.
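As a concrete reference, the action space above can be enumerated as follows. This is a hypothetical encoding for illustration; the identifiers in the released FurnMove code may differ.

```python
# Hypothetical enumeration of the 13-action space described above;
# the identifiers in the released FurnMove code may differ.
DIRECTIONS = ["Ahead", "Right", "Left", "Back"]

SINGLE_AGENT = ["MoveAhead", "RotateLeft", "RotateRight", "Pass"]
MOVE_WITH_OBJECT = ["MoveWithObject" + d for d in DIRECTIONS]
MOVE_OBJECT = ["MoveObject" + d for d in DIRECTIONS]
ROTATE_OBJECT = ["RotateObjectClockwise"]  # name is an assumption

ACTIONS = SINGLE_AGENT + MOVE_WITH_OBJECT + MOVE_OBJECT + ROTATE_OBJECT
print(len(ACTIONS))  # 13
```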
[Figure 2: coordination matrices for (a) FurnMove and (b) FurnLift.]
Close and on-going collaboration is required in FurnMove due to restrictions on the set of actions which can be successfully completed jointly by all the agents. These restrictions reflect physical constraints: for instance, if two people attempt to move in opposite directions while carrying a heavy object they will either fail to move or drop the object. For two agents, we summarize these restrictions using the coordination matrix shown in Fig. 2. For comparison, we include a similar matrix in fig:coordination_matrix_furnlift corresponding to the FurnLift task from [jain2019CVPRTBONE]. We defer a more detailed discussion of these restrictions to sec:action-restrictions of the supplement. Generalizing the coordination matrix shown in Fig. 2 to $n$ agents, at every timestep $t$ we let $M_t$ be the $\{0,1\}$-valued $n$-dimensional tensor where $(M_t)_{a^1, \dots, a^n} = 1$ if and only if the agents are configured such that the multi-action $(a^1, \dots, a^n)$ satisfies the restrictions detailed in sec:action-restrictions. If $(M_t)_{a^1, \dots, a^n} = 1$ we say the actions are coordinated.
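To make the coordination matrix concrete, the toy construction below builds such a matrix for two agents under a simplified stand-in rule (object-manipulating actions succeed only when both agents issue the same action). This is not the exact FurnMove restriction set, which also depends on the agents' relative orientation, and the action names are assumptions.

```python
import numpy as np

# Toy 2-agent coordination matrix; the rule below is a simplified stand-in,
# NOT the exact FurnMove restrictions (which also depend on the agents'
# relative orientation, see the supplement). Action names are hypothetical.
ACTIONS = (["MoveAhead", "RotateLeft", "RotateRight", "Pass"]
           + ["MoveWithObject" + d for d in ["Ahead", "Right", "Left", "Back"]]
           + ["MoveObject" + d for d in ["Ahead", "Right", "Left", "Back"]]
           + ["RotateObjectClockwise"])

def involves_object(a):
    return a.startswith(("MoveWithObject", "MoveObject", "RotateObject"))

n = len(ACTIONS)
M = np.zeros((n, n), dtype=np.int64)
for i, a in enumerate(ACTIONS):
    for j, b in enumerate(ACTIONS):
        if involves_object(a) or involves_object(b):
            # object-manipulating actions must match exactly (simplified rule)
            M[i, j] = int(a == b)
        else:
            M[i, j] = 1  # independent navigation counts as coordinated here

print("proportion of coordinated pairs:", M.sum() / M.size)  # 25/169 ~ 0.148
```

Even in this simplified setting only 25 of the 169 action pairs are coordinated, which illustrates why random exploration rarely stumbles upon valid multi-actions.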
3.1 Technical challenges
As we show in our experiments in sec:experiments, standard communication-based models similar to the ones proposed in [jain2019CVPRTBONE] perform rather poorly when trained to complete the FurnMove task. In the following we identify two key challenges that contribute to this poor performance.
Challenge 1: rank-one joint policies. In classical multi-agent settings [Busoniu2008, Panait2005, LoweNIPS2017], each agent samples its action independently of all other agents. Due to this independent sampling, at time $t$, the probability of the agents taking multi-action $(a^1_t, \dots, a^n_t)$ equals $\prod_{i=1}^{n} \pi^i_t(a^i_t)$. This means that the joint probability tensor of all actions at time $t$ can be written as the rank-one tensor $\pi^1_t \otimes \cdots \otimes \pi^n_t$. This rank-one constraint limits the joint policies that can be executed by the agents, which has real impact. sec:rank-one-challenge-example considers two agents playing rock-paper-scissors with an adversary: the rank-one constraint reduces the expected reward achieved by an optimal policy from 0 to -0.657 (the minimal possible reward being -1). Intuitively, a high-rank joint policy is not well approximated by a rank-one probability tensor obtained via independent sampling.
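The limitation can be seen numerically in a minimal sketch. Below we form the joint policy "both agents take the same action, uniformly at random" and compare it to the outer product of its own marginals, one natural rank-one candidate (the true best rank-one approximation may differ, but cannot be better than what independent sampling can express in general).

```python
import numpy as np

A = 13                      # number of actions per agent
target = np.eye(A) / A      # joint policy: both agents pick the same action

# Independent sampling can only realize outer products of marginals.
p = target.sum(axis=1)      # marginal of agent 1 (uniform)
q = target.sum(axis=0)      # marginal of agent 2 (uniform)
rank_one = np.outer(p, q)   # rank-one joint induced by independent sampling

tvd = 0.5 * np.abs(target - rank_one).sum()
print(f"TVD to product of marginals: {tvd:.3f}")  # 156/169 ~ 0.923
```

The large total variation distance shows that this highly coordinated joint policy is far from anything independent per-agent sampling can realize.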
Challenge 2: exponentially many failed actions. The number of possible multi-actions increases exponentially as the number of agents grows. While this is not problematic if agents act relatively independently, it is a significant obstacle when the agents are tightly coupled, i.e., when the success of one agent's action is highly dependent on the actions of the other agents. Consider a randomly initialized policy (the starting point of almost all RL problems): agents stumble upon positive rewards with extremely low probability, which leads to slow learning. We focus on a small number of agents $n$; nonetheless, the proportion of coordinated action tuples is small even for $n = 2$ and shrinks further for $n = 3$.
4 A cordial sync
To address the aforementioned two challenges we develop: (a) a novel action sampling procedure named Synchronize Your actioNs Coherently (SYNC), and (b) an intuitive and effective multi-agent training loss named the Coordination Loss (CORDIAL).
Addressing challenge 1: SYNC-policies. For readability, we consider $n = 2$ agents and illustrate an overview in fig:model. The joint probability tensor is hence a matrix of size $|\mathcal{A}| \times |\mathcal{A}|$. Recall our goal: using little communication, multiple agents should sample their actions from a high-rank joint policy. This is difficult as (i) little communication means that, except in degenerate cases, no agent can form the full joint policy, and (ii) even if all agents had access to the joint policy, it is not obvious how to ensure that the decentralized agents will sample a valid coordinated multi-action.
To achieve this, note that for any rank-$r$ matrix $P$ there are vectors $u_1, \dots, u_r$ and $v_1, \dots, v_r$ such that $P = \sum_{j=1}^{r} u_j \otimes v_j$. Here, $\otimes$ denotes the outer product. Also, the non-negative rank of a matrix equals the smallest integer $r^+$ such that the matrix can be written as the sum of $r^+$ non-negative rank-one matrices. Furthermore, a non-negative $m \times m$ matrix has non-negative rank bounded above by $m$. Since the joint policy $\Pi_t$ is a joint probability matrix, i.e., it is non-negative and its entries sum to one, it has non-negative rank $r^+ \le |\mathcal{A}|$, i.e., there exist non-negative vectors $\alpha \in \mathbb{R}^{r^+}$ and $p_1, q_1, \dots, p_{r^+}, q_{r^+} \in \mathbb{R}^{|\mathcal{A}|}$, each of whose entries sum to one, such that $\Pi_t = \sum_{j=1}^{r^+} \alpha_j \, p_j \otimes q_j$.
We call a sum of the form $\sum_{j=1}^{r^+} \alpha_j \, p_j \otimes q_j$ a mixture-of-marginals. With this decomposition at hand, randomly sampling action pairs from $\Pi_t$ can be interpreted as a two-step process: first sample an index $j$ with probability $\alpha_j$ and then sample $a^1 \sim p_j$ and $a^2 \sim q_j$ independently.
This stage-wise procedure suggests a strategy for sampling actions in a multi-agent setting, which we refer to as SYNC-policies. Generalizing to an $n$-agent setup, suppose that the agents have access to a shared stream of random numbers. This can be accomplished if all agents share a random seed, or if all agents initially communicate their individual random seeds and sum them to obtain a shared seed. Furthermore, suppose that all agents locally store a shared function $f_\theta : \mathbb{R}^d \to \Delta_{K-1}$, where $\theta$ are learnable parameters, $d$ is the dimensionality of all communication between the agents in a timestep, and $\Delta_{K-1}$ is the standard $(K-1)$-probability simplex. Finally, at time $t$ suppose that each agent $A^i$ produces not a single policy but instead a collection of $K$ policies $\pi^{i,1}_t, \dots, \pi^{i,K}_t$. Let $C_t \in \mathbb{R}^d$ be all communication sent between agents at time $t$. Each agent then samples its action as follows: (i) compute the shared mixture probabilities $\alpha_t = f_\theta(C_t)$, (ii) sample an index $j \sim \alpha_t$ using the shared random number stream, and (iii) sample, independently, an action $a^i_t$ from the policy $\pi^{i,j}_t$. Since both $f_\theta$ and the random number stream are shared, the quantities in (i) and (ii) are equal across all agents despite being computed individually. This sampling procedure is equivalent to sampling from the tensor $\sum_{j=1}^{K} (\alpha_t)_j \, \pi^{1,j}_t \otimes \cdots \otimes \pi^{n,j}_t$, which, as discussed above, may have rank up to $K$. Intuitively, SYNC enables decentralized agents to have a more expressive joint policy by allowing them to agree upon a strategy by sampling from $\alpha_t$.
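The sampling procedure above can be sketched as follows. This is a minimal illustration under our own naming (K mixture components, one-hot toy policies); the paper's implementation operates on network outputs rather than fixed arrays.

```python
import numpy as np

def sync_sample(agent_policies, alpha, shared_seed, private_seeds):
    """Sketch of decentralized SYNC sampling (names are ours, not the paper's
    API). agent_policies[i] is a (K, |A|) array holding agent i's K marginal
    policies; alpha lies in the K-simplex and is computed identically by every
    agent from the shared communication.
    """
    actions = []
    for policies_i, seed_i in zip(agent_policies, private_seeds):
        shared = np.random.default_rng(shared_seed)   # shared random stream
        j = shared.choice(len(alpha), p=alpha)        # same index j for all agents
        private = np.random.default_rng(seed_i)
        a = private.choice(policies_i.shape[1], p=policies_i[j])  # independent draw
        actions.append(a)
    return actions

# Two deterministic per-component policies: component 0 -> action 0,
# component 1 -> action 2, identical for both agents.
p = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
alpha = np.array([0.5, 0.5])
a1, a2 = sync_sample([p, p], alpha, shared_seed=7, private_seeds=[1, 2])
assert a1 == a2  # agents always agree under this rank-2 joint policy
```

Note that each agent reconstructs the shared stream locally, so no extra communication is needed at sampling time beyond the messages that produce alpha.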
Addressing challenge 2: CORDIAL. We encourage agents to rapidly learn to choose coordinated actions via a new loss. In particular, letting $\Pi_t$ be the joint policy of our agents, we propose the coordination loss (CORDIAL) $-\beta \, \langle M_t, \log \Pi_t \rangle$, where $\log$ is applied element-wise, $\langle \cdot, \cdot \rangle$ is the usual Frobenius inner product, $M_t$ is defined in sec:task, and $\beta > 0$ is a weighting hyperparameter. Notice that CORDIAL encourages agents to have a near uniform policy over the actions which are coordinated. We use this loss to replace the standard entropy-encouraging loss in policy gradient algorithms (e.g., the A3C algorithm [MnihEtAlPMLR2016]). Similarly to the weight of the entropy loss in A3C, $\beta$ is chosen to be a small positive constant so as to not overly discourage learning.
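A minimal sketch of this loss follows; the normalization by the number of coordinated entries and the numerical epsilon are our assumptions, added for stability of the illustration, and may differ from the paper's exact formulation.

```python
import numpy as np

def cordial_loss(joint, M, beta=0.01, eps=1e-8):
    """Coordination loss sketch: -beta * <M, log(joint)> (Frobenius inner
    product). Normalizing by the number of coordinated entries and the eps
    term are our assumptions, not necessarily the paper's exact choice.
    """
    return -beta * np.sum(M * np.log(joint + eps)) / M.sum()

M = np.eye(3)                      # coordinated pairs: both agents act alike
good = M / M.sum()                 # all mass on coordinated pairs
bad = np.full((3, 3), 1.0 / 9.0)   # mass spread over invalid pairs too

assert cordial_loss(good, M) < cordial_loss(bad, M)
```

As intended, a joint policy concentrated (near-uniformly) on coordinated entries achieves a lower loss than one that wastes mass on invalid pairs.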
Note that the coordination loss is less meaningful when $\Pi_t = \pi^1_t \otimes \pi^2_t$, i.e., when $\Pi_t$ is rank-one. For instance, suppose that $M_t$ has ones along the diagonal and zeros elsewhere, so that we wish to encourage the agents to all take the same action. In this case it is straightforward to show that $\langle M_t, \log(\pi^1_t \otimes \pi^2_t) \rangle = \sum_{a} \log \pi^1_t(a) + \sum_{a} \log \pi^2_t(a)$, so that the loss simply encourages each agent to have a uniform distribution over its actions and thus actually encourages the agents to place a large amount of probability mass on uncoordinated actions. Indeed, tab:cl_study shows that using CORDIAL without SYNC leads to poor results.
We study four distinct policy types: central, marginal, marginal w/o comm, and SYNC. Central samples actions from a joint policy generated by a central agent with access to observations from all agents. While often unrealistic in practice due to communication bottlenecks, central serves as an informative baseline. Marginal follows prior work, e.g., [jain2019CVPRTBONE]: each agent independently samples its actions from its individual policy after communication. Marginal w/o comm is identical to marginal but does not permit agents to communicate explicitly (agents may still see each other). Finally, SYNC is our newly proposed policy described in sec:method. For a fair comparison, all decentralized agents (i.e., SYNC, marginal, and marginal w/o comm) use the same TBONE backbone architecture from [jain2019CVPRTBONE]; see fig:model. We have ensured that parameters are fairly balanced, so that our proposed SYNC has close to (and never more than) the number of parameters of the marginal and marginal w/o comm nets. Note, we train central and SYNC with CORDIAL, and the marginal and marginal w/o comm without it. This choice is mathematically explained in sec:method and empirically validated in sec:quantitative.
Architecture details: For readability we describe the policy and value networks for the 2-agent setup, noting that they extend trivially to any number of agents. As noted above, decentralized agents use the TBONE backbone from [jain2019CVPRTBONE]. Our primary architectural novelty extends TBONE to SYNC-policies. An overview of the TBONE backbone and the differences between sampling with marginal and SYNC policies is shown in fig:model.
As a brief summary of TBONE, agent $A^i$ at time $t$ observes input $o^i_t$, i.e., an RGB image returned from AI2-THOR which represents the $i$-th agent's egocentric view. For each agent, the observation is encoded by a four-layer CNN and combined with an agent-specific learned embedding (that encodes the ID of that agent) along with the history embedding. The resulting vector is fed into a long short-term memory (LSTM) [HochreiterNC1997] unit to produce a hidden embedding corresponding to the agent.
The agents then undergo two rounds of communication, resulting in two final hidden states $\tilde{h}^1_t, \tilde{h}^2_t$ and messages $c^{i,j}_t$, with message $c^{i,j}_t$ being produced by agent $A^i$ in round $j$ and then sent to the other agent in that round. In [jain2019CVPRTBONE], the value of the agents' state as well as the logits corresponding to the policies of the agents are formed by applying linear functions to $\tilde{h}^1_t$ and $\tilde{h}^2_t$.
We now show how SYNC can be integrated into TBONE to allow our agents to represent high-rank joint distributions over multi-actions (see Fig. 3). First, each agent computes the logits corresponding to $\alpha_t$. This is done using a multi-layer perceptron applied to the messages $c^{1,2}_t, c^{2,2}_t$ sent between the agents in the second round of communication. In particular, $\alpha_t = \mathrm{softmax}(W_2 \, \sigma(W_1 [c^{1,2}_t; c^{2,2}_t] + b_1) + b_2)$, where $W_1$, $b_1$, $W_2$, and $b_2$ are a learnable collection of weight matrices and biases and $\sigma$ is a nonlinearity. After computing $\alpha_t$, we compute a collection of policies $\pi^{i,j}_t$ for $1 \le j \le K$. Each of these policies is computed following the TBONE architecture, but using $K$ additional, learnable, linear layers per agent.
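The mixture-weight head can be sketched as below. Layer sizes, the ReLU nonlinearity, and all variable names are illustrative assumptions, not the exact released architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def f_theta(messages, W1, b1, W2, b2):
    """Sketch of the shared mixture-weight head: a small MLP applied to the
    concatenated second-round messages. Sizes and ReLU are assumptions."""
    x = np.concatenate(messages)        # [c1; c2], second-round messages
    h = np.maximum(0.0, W1 @ x + b1)    # hidden layer (ReLU)
    return softmax(W2 @ h + b2)         # alpha in the K-simplex

d, hidden, K = 16, 32, 13               # message dim / hidden dim / components
W1, b1 = rng.normal(size=(hidden, 2 * d)), np.zeros(hidden)
W2, b2 = rng.normal(size=(K, hidden)), np.zeros(K)
alpha = f_theta([rng.normal(size=d), rng.normal(size=d)], W1, b1, W2, b2)
assert np.isclose(alpha.sum(), 1.0) and (alpha >= 0).all()
```

Because both agents apply the same weights to the same exchanged messages, they compute the same alpha without any extra synchronization step.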
6.1 Experimental setup
Simulator. We evaluate our models using the AI2-THOR environment [ai2thor] with several novel upgrades. First, we introduce new methods which permit us to (a) randomly initialize the lifted object and agent locations close to each other and looking towards the lifted object, and (b) simultaneously move agents and the lifted object in a given direction with collision detection. Secondly, for fast prototyping we build a top-down gridworld version of AI2-THOR that is faster than the environment used in [jain2019CVPRTBONE]. For details about framework upgrades, see sec:extra-training-details of the supplement.
Tasks. We compare against baselines on FurnMove, Gridworld-FurnMove, and FurnLift [jain2019CVPRTBONE]. FurnMove is the novel task introduced in this work (sec:task): agents observe egocentric visual views with a field-of-view of 90 degrees. In Gridworld-FurnMove the agents are provided a top-down egocentric 3D tensor as observation. The third dimension of the tensor contains semantic information, such as whether a location is navigable by an agent or by the lifted object, or whether the location is occupied by another agent, the lifted object, or the goal object. Hence, Gridworld-FurnMove agents do not need visual understanding, but face the other challenges of the FurnMove task – coordinating actions and planning trajectories. We consider only the harder variant of FurnLift, where communication was shown to be most important (the ‘constrained’ setting with no implicit communication in [jain2019CVPRTBONE]). In FurnLift, agents observe egocentric visual views.
Data. As in [jain2019CVPRTBONE], we train and evaluate on a split of the living room scenes. As FurnMove is already quite challenging, we only consider a single piece of lifted furniture (a television) and a single goal object (a TV stand). Twenty rooms are used for training, and the remaining rooms are held out for validation and testing. The test scenes have very different lighting conditions, furniture, and layouts. For evaluation, our test set includes episodes equally distributed over the five test scenes.
Training. For training we augment the A3C algorithm [MnihEtAlPMLR2016] with CORDIAL. For our studies in the visual domain, we use 45 workers and 8 GPUs. Models take around two days to train. For more details about training, including hyperparameter values and the reward structure, see sec:extra-training-details of the supplement.
For completeness, we consider a variety of metrics. We adapt SPL, i.e., Success weighted by (normalized inverse) Path Length [anderson2018evaluation], so that it does not require shortest paths but still provides similar semantic information (for FurnMove, each location of the lifted furniture corresponds to a large number of joint agent states, making shortest path computation intractable; more details in sec:quant-eval-extra-details of the supplement). We define a Manhattan-Distance-based SPL as $\mathrm{MD\text{-}SPL} = \frac{1}{N}\sum_{i=1}^{N} S_i \frac{m_i}{\max(p_i \cdot d, \, m_i)}$, where $i$ denotes an index over episodes, $N$ equals the number of test episodes, and $S_i$ is a binary indicator for the success of episode $i$. Also, $p_i$ is the number of actions taken per agent, $m_i$ is the Manhattan distance from the lifted object's start location to the goal, and $d$ is the distance between adjacent grid points, for us $d = 0.25$ m. We also report other metrics capturing complementary information. These include the mean number of actions in an episode per agent (Ep len), success rate (Success), and the distance to the target at the end of the episode (Final dist).
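A short sketch of the metric, under our reading of the definition above (the function and argument names are ours):

```python
def md_spl(successes, path_actions, manhattan_dists, d=0.25):
    """Manhattan-Distance SPL sketch (our reading of the metric): episode i
    contributes S_i * m_i / max(p_i * d, m_i), where p_i is the number of
    actions per agent and m_i the start-to-goal Manhattan distance in meters.
    """
    total = 0.0
    for S, p, m in zip(successes, path_actions, manhattan_dists):
        if m > 0:
            total += S * m / max(p * d, m)
    return total / len(successes)

# An episode that succeeds using exactly the minimum number of moves scores 1.0
assert md_spl([1], [8], [2.0]) == 1.0  # 8 actions * 0.25 m = 2.0 m
```

As with standard SPL, failed episodes contribute zero and inefficient successes are discounted by the ratio of the minimal to the realized path length.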
We also introduce two metrics unique to coordinating actions: TVD, the mean total variation distance between the joint policy $\Pi_t$ and its best rank-one approximation, and Invalid prob, the average probability mass allotted to uncoordinated actions, i.e., the dot product between $\Pi_t$ and $(1 - M_t)$. By definition, TVD is zero for the marginal model, and higher values indicate divergence from independent marginal sampling. Note that, without measuring TVD, we would have no way of knowing if our SYNC model was actually using the extra expressivity we have afforded it. Lower Invalid prob values imply an improved ability to avoid uncoordinated actions, as detailed in sec:task and fig:coordination_matrix.
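Both quantities are straightforward to compute for a two-agent joint policy. In the sketch below we use the outer product of the joint's marginals as a simple rank-one surrogate; the paper's "best" rank-one approximation may be computed differently, so treat `tvd_to_rank_one` as an illustrative lower-level stand-in.

```python
import numpy as np

def invalid_prob(joint, M):
    """Probability mass the joint policy puts on uncoordinated multi-actions:
    the dot product of the joint with (1 - M)."""
    return float(np.sum(joint * (1 - M)))

def tvd_to_rank_one(joint):
    """TVD between a 2-agent joint policy and a rank-one approximation; here
    the outer product of its marginals serves as a simple surrogate."""
    p, q = joint.sum(axis=1), joint.sum(axis=0)
    return 0.5 * float(np.abs(joint - np.outer(p, q)).sum())

M = np.eye(3)          # coordinated pairs: both agents take the same action
joint = np.eye(3) / 3  # a joint policy fully concentrated on those pairs

assert invalid_prob(joint, M) == 0.0
assert tvd_to_rank_one(joint) > 0.5  # far from any product of marginals
```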
6.3 Quantitative evaluation
We conduct four studies: (a) performance of different methods and relative difficulty of the three tasks, (b) effect of number of components on SYNC performance, (c) effect of CORDIAL (ablation), and (d) effect of number of agents.
[Table (tab:quant): test-set metrics for all methods on FurnMove, Gridworld-FurnMove, and FurnLift (‘constrained’ setting with no implicit communication), including Marginal w/o comm [jain2019CVPRTBONE] rows for each task; values are highlighted if their 95% confidence interval has no overlap with the confidence intervals of other values.]
[Figure (fig:tb-graphs): metric progression over training for (a) FurnMove, (b) Gridworld-FurnMove, and (c) FurnLift.]
Comparing methods and tasks. We compare the models detailed in sec:baselines on tasks of varying difficulty, report metrics in tab:quant, and show the progress of metrics over training episodes in fig:tb-graphs. In our FurnMove experiments, SYNC performs better than the best performing method of [jain2019CVPRTBONE] (i.e., marginal) on all metrics. Success rate increases by large absolute margins on both FurnMove and Gridworld-FurnMove. Importantly, SYNC is significantly better at allowing agents to coordinate their actions: for FurnMove, the joint policy of SYNC assigns, on average, far less probability mass to invalid action pairs than the marginal and marginal w/o comm models. Additionally, SYNC goes beyond rank-one marginal methods by capturing a more expressive joint policy using the mixture of marginals. This is evidenced by SYNC's high TVD (recall that TVD is identically zero for marginal). In Gridworld-FurnMove, oracle perception of a 2D gridworld helps raise the performance of all methods, though the trends are similar. tab:quant-furnlift shows similar trends for FurnLift but, perhaps surprisingly, the Success of SYNC is somewhat lower than that of the marginal model (2.6% lower, within statistical error). As is emphasized in [jain2019CVPRTBONE], however, Success alone is a poor measure of model performance: equally important are the failed pickups and missed pickups metrics (for details, see sec:quant-eval-extra-details of the supplement). For these metrics, SYNC outperforms the marginal model. That SYNC does not completely outperform marginal in FurnLift is intuitive: as FurnLift does not require continuous close coordination, the benefits of SYNC are less pronounced.
While the difficulty of a task is hard to quantify, we consider the relative test-set metrics of agents on various tasks as an informative proxy. Replacing the complex egocentric vision in FurnMove with the semantic 2D gridworld in Gridworld-FurnMove, we see that all agents show large gains in Success and MD-SPL, suggesting that Gridworld-FurnMove is a dramatic simplification of FurnMove. Comparing FurnMove to FurnLift is particularly interesting. The MD-SPL and Success metrics for the central agent do not provide a clear indication of task difficulty between the two. However, notice the much higher TVD of the central agent for FurnMove and the superior MD-SPL and Success of the marginal agents for FurnLift. These numbers indicate that FurnMove requires more coordination, and additional expressivity of the joint distribution, than FurnLift.
[Table (tab:mixture): effect of the number of mixture components in SYNC on MD-SPL, Success, and Ep len.]
Effect of the number of mixture components in SYNC. Recall (sec:method) that the number of mixture components in SYNC is a hyperparameter controlling the maximal rank of the joint policy; SYNC with a single component is equivalent to marginal. In tab:mixture we see TVD increase as the number of components grows from 2 to 13. This suggests that SYNC learns to use the additional expressivity. Moreover, we see that this increased expressivity results in better performance. The jump in success rate obtained from even the smallest increase in the number of components demonstrates that substantial benefits follow from small increases in expressivity, and further components yield further improvements. Notice, however, that there are diminishing returns: the second-largest model performs nearly as well as the largest. This suggests a trade-off between the benefits of expressivity and the increasing complexity of optimization.
Effect of CORDIAL. In tab:cl_study we quantify the effect of CORDIAL. Note the improvement in success rate when adding CORDIAL to SYNC. This is accompanied by a drop in Invalid prob, which signifies better coordination of actions. Similar improvements are seen for the central model. In ‘Challenge 2’ (sec:method) we mathematically laid out why marginal models gain little from CORDIAL. We substantiate this empirically with a 22.9% drop in success rate when training the marginal model with CORDIAL.
Effect of more agents. The final three rows of tab:quant show the test-set performance of SYNC, marginal, and central models trained to accomplish a 3-agent variant of our Gridworld-FurnMove task. In this task the marginal model fails to train at all, achieving a 0% success rate. SYNC, on the other hand, successfully completes the task 57.8% of the time. Notice that SYNC’s success rate drops by nearly 20 percentage points when moving from the 2- to the 3-agent variant of the task: clearly, increasing the number of agents substantially increases the task’s difficulty. Surprisingly, the central model performs worse than SYNC in this setting. A discussion of this phenomenon is deferred to sec:quant-eval-extra-details of the supplement.
6.4 Qualitative evaluation
We present three qualitative results on FurnMove: joint policy summaries, analysis of learnt communication, and visualizations of agent trajectories.
Joint policy summaries. In fig:matrices we show summaries of the joint policy captured by the central, SYNC, and marginal models. These matrices average over action steps in the test-set episodes for FurnMove. Other tasks show similar trends; see sec:qualitative-extra-details of the supplement. In fig:matrices_central, the sub-matrices corresponding to the object-movement actions are diagonal-dominant, indicating that agents are looking in the same direction (cf. the relative orientations in fig:coordination_matrix). Also note the high probability associated with (Pass, RotateX) and (RotateX, Pass) within the single-agent navigation block. Together, this means that the central method learns to coordinate single-agent navigational actions to rotate one of the agents (while the other holds the TV by executing Pass) until both face the same direction. They then execute the same action to move the lifted object. Comparing fig:matrices_sync and fig:matrices_marginal shows the effect of CORDIAL. Recall that the marginal model doesn’t support CORDIAL and thus suffers by assigning probability to invalid action pairs (color outside the block-diagonal submatrices). Also note the banded nature of fig:matrices_marginal, resulting from its construction as an outer product of marginals.
Communication analysis. A qualitative discussion of communication follows; agents are colored red and green. We defer a quantitative treatment to sec:qualitative-extra-details of the supplement. As we apply SYNC on the TBONE backbone introduced by Jain et al. [jain2019CVPRTBONE], we use similar tools to understand the communication emerging with the SYNC policy heads. In line with [jain2019CVPRTBONE], we plot the weight assigned by each agent to the first communication symbol in the reply stage. fig:matrices_communication_analysis strongly suggests that the reply stage is directly used by the agents to coordinate the modality of actions they intend to take. In particular, note that a large weight assigned to the first reply symbol is consistently associated with the other agent taking a Pass action. Similarly, we see that small reply weights coincide with agents taking a MoveWithObject action. The talk weights’ interpretation is intertwined with the reply weights and is deferred to sec:qualitative-extra-details of the supplement.
Agent trajectories. Our supplementary video includes examples of policy roll-outs. These clips include both agents’ egocentric views and a top-down trajectory visualization, enabling direct comparisons of marginal and SYNC on the same test episode. We also make patterns in agents’ communication audible: we convert the scalar weights associated with reply symbols to audio.
We introduce FurnMove, a collaborative, visual, multi-agent task requiring close coordination between agents, and develop novel methods that allow moving beyond existing marginal action sampling procedures. These methods lead to large gains across a diverse suite of metrics.
This material is based upon work supported in part by the National Science Foundation under Grants No. 1563727, 1718221, 1637479, 165205, 1703166, Samsung, 3M, Sloan Fellowship, NVIDIA Artificial Intelligence Lab, Allen Institute for AI, Amazon, and AWS Research Awards. UJ is thankful to Thomas & Stacey Siebel Foundation for Siebel Scholars Award. We thank Mitchell Wortsman and Kuo-Hao Zeng for their insightful suggestions on how to clarify and structure this work.
Appendix 0.A Supplementary Material
This supplementary material provides:
The conditions for a collection of actions to be considered coordinated.
An example showing that standard independent multi-agent action sampling makes it impossible to, even in principle, obtain an optimal joint policy.
Training details including hyperparameter choices, hardware configurations, and reward structure. We also discuss our upgrades to AI2-THOR.
Additional discussion, tables, and plots regarding our quantitative results.
Additional discussion, tables, and plots of our qualitative results including a description of our supplementary video as well as an in-depth quantitative evaluation of communication learned by our agents.
0.a.1 Action restrictions
We now comprehensively describe the restrictions defining when actions taken by agents are globally consistent with one another. In the following we will, for readability, focus on the two-agent setting. All conditions defined here easily generalize to any number of agents. Recall the action sets defined in sec:task. We call these sets the modalities of action. Two actions are said to be of the same modality if they both are an element of the same modality of action. Let the two agents each choose an action. Below we describe the conditions under which these actions are considered coordinated. If the agents’ actions are uncoordinated, both actions fail and no action is taken at that timestep. These conditions are summarized in fig:coordination_matrix_furnmove.
Same action modality. A first necessary, but not sufficient, condition for successful coordination is that the agents agree on the modality of action to perform. Namely, both actions must be of the same action modality. Notice the block-diagonal structure in fig:coordination_matrix_furnmove.
No independent movement. Our second condition models the intuitive expectation that if one agent wishes to reposition itself by performing a single-agent navigational action, the other agent must remain stationary. Thus, if both actions are single-agent navigational actions, they are coordinated if and only if one of them is a Pass action. The corresponding entries of the matrix in fig:coordination_matrix_furnmove show coordinated pairs of single-agent navigational actions.
Orientation synchronized object movement. Suppose that both agents wish to move (with) the object, i.e., both actions are object-movement actions. Note that, as actions are taken from an egocentric perspective, it is possible, for example, that moving ahead from one agent’s perspective is the same as moving left from the other’s. This condition requires that the direction specified by both of the agents is consistent globally. Hence the actions are coordinated if and only if the direction specified by both actions is the same in a global reference frame. For example, if both agents are facing the same direction this condition requires that they choose the same egocentric direction, while if the second agent is rotated 90 degrees clockwise from the first agent then their egocentric directions must be offset correspondingly. See the multicolored 4×4 blocks in fig:coordination_matrix_furnmove.
Simultaneous object rotation. For the lifted object to be rotated, both agents must rotate it in the same direction in a global reference frame. As we only allow the agents to rotate the object in a single direction (clockwise), this means that coordination requires both agents to choose the rotation action. See the (9, 9) entry of the matrix in fig:coordination_matrix_furnmove.
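The four conditions above can be sketched as a single predicate. The action names and the modality grouping below are illustrative stand-ins for those defined in sec:task, and the direction mapping assumes 90-degree agent rotations:

```python
# Illustrative action modalities (stand-ins for the sets defined in the paper).
SINGLE = {"MoveAhead", "RotateLeft", "RotateRight", "Pass"}
MOVE_WITH = {"MoveWithObjectAhead", "MoveWithObjectLeft",
             "MoveWithObjectRight", "MoveWithObjectBack"}
ROTATE_OBJ = {"RotateObjectClockwise"}

def to_global(direction, agent_rotation_deg):
    """Map an egocentric direction to a global one given the agent's rotation."""
    dirs = ["Ahead", "Right", "Back", "Left"]
    return dirs[(dirs.index(direction) + agent_rotation_deg // 90) % 4]

def coordinated(a1, a2, rot1_deg, rot2_deg):
    # Condition 1: both actions must share a modality.
    if not any(a1 in m and a2 in m for m in (SINGLE, MOVE_WITH, ROTATE_OBJ)):
        return False
    # Condition 2: single-agent navigation requires the other agent to Pass.
    if a1 in SINGLE:
        return a1 == "Pass" or a2 == "Pass"
    # Condition 3: object movement must agree in the global reference frame.
    if a1 in MOVE_WITH:
        d1 = to_global(a1.removeprefix("MoveWithObject"), rot1_deg)
        d2 = to_global(a2.removeprefix("MoveWithObject"), rot2_deg)
        return d1 == d2
    # Condition 4: both agents chose the (single) clockwise rotation action.
    return True
```

For instance, with the second agent rotated 90 degrees clockwise from the first, `MoveWithObjectAhead` and `MoveWithObjectLeft` specify the same global direction and are therefore coordinated.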
While a pair of uncoordinated actions is always unsuccessful, it need not be true that a pair of coordinated actions is successful. A pair of coordinated actions will be unsuccessful in two cases: performing the action pair would result in (a) an agent, or the lifted object, colliding with one another or with another object in the scene; or (b) an agent moving to a position more than 0.76m from the lifted object. Here (a) enforces the physical constraints of the environment while (b) makes the intuitive requirement that an agent has a finite reach and cannot hold an object while far away from it.
0.a.2 Challenge 1 (rank-one joint policies) example
We now illustrate how requiring two agents to independently sample actions from marginal policies can result in failing to capture an optimal, high-rank, joint policy.
Consider two agents who must work together to play rock-paper-scissors (RPS) against some adversary. In particular, our game takes place in a single timestep where each agent, after perhaps communicating with the other agent, must choose some action from {Rock, Paper, Scissors}. During this time the adversary also chooses an action. Now, in our game, the pair of agents lose if they choose different actions, tie with the adversary if all players choose the same action, and otherwise win or lose depending on whether their jointly chosen action beats or loses to the adversary’s choice under the normal rules of RPS.
Moreover, we consider the challenging setting where the agents communicate in the open, so that the adversary can view their joint policy before choosing the action it wishes to take. Notice that we’ve dropped the timestep subscript as there is only a single timestep. Finally, we treat this game as zero-sum, so that our agents obtain a reward of 1 for a victory, 0 for a tie, and -1 for a loss. If the agents operate in a decentralized manner using their own (single) marginal policies, their effective joint policy is the rank-one outer product of those marginals.
Optimal joint policy: It is well known, and easy to show, that the optimal joint policy equals $\frac{1}{3} I_3$, where $I_3$ is the $3 \times 3$ identity matrix. Hence, the agents take multi-action (Rock, Rock), (Paper, Paper), or (Scissors, Scissors) with equal probability, obtaining an expected reward of zero.
Optimal rank-one joint policy: The optimal joint policy is of rank three and thus cannot be captured by an outer product of marginals. Instead, brute-force symbolic minimization, using Mathematica [Mathematica], yields an optimal rank-one strategy for the two agents.
The expected reward from this strategy is strictly negative, far less than the optimal expected reward of zero.
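The gap between the rank-three optimum and independent sampling can be verified numerically. The sketch below evaluates a joint policy against a best-responding adversary by direct enumeration (not the paper’s Mathematica search); as a simple comparison point it uses the outer product of uniform marginals rather than the optimal rank-one policy:

```python
import itertools
import numpy as np

ACTIONS = ["Rock", "Paper", "Scissors"]
BEATS = {"Rock": "Scissors", "Paper": "Rock", "Scissors": "Paper"}

def team_value(joint):
    """Team's expected reward when the adversary sees `joint` (a 3x3
    distribution over action pairs) and best-responds against it."""
    def reward(a1, a2, adv):
        if a1 != a2:
            return -1            # agents disagree: automatic loss
        if a1 == adv:
            return 0             # all three match: tie
        return 1 if BEATS[a1] == adv else -1

    values = []
    for adv in ACTIONS:
        v = sum(joint[i, j] * reward(ACTIONS[i], ACTIONS[j], adv)
                for i, j in itertools.product(range(3), repeat=2))
        values.append(v)
    return min(values)           # adversary picks the worst case for the team

optimal = np.eye(3) / 3          # the rank-three optimal joint policy
rank_one = np.full((3, 3), 1/9)  # outer product of uniform marginals

print(team_value(optimal))       # 0.0
```

The diagonal policy secures an expected reward of zero against any adversary, while the uniform outer product loses two thirds of the time outright (expected reward about -2/3).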
0.a.3 Training details
0.a.3.1 Centralized agent.
fig:central_model provides an overview of the architecture of the centralized agent. The final joint policy is constructed using a single linear layer applied to a hidden state. As this architecture varies slightly when changing the number of agents and the environment (i.e., AI2-THOR or our gridworld variant of AI2-THOR), we direct anyone interested in exact replication to our codebase.
0.a.3.2 AI2-THOR upgrades.
As we described in sec:experiments we have made several upgrades to AI2-THOR in order to make it possible to run our FurnMove task. These upgrades are described below.
Implementing FurnMove methods in AI2-THOR’s Unity codebase. The AI2-THOR simulator has been built using C# in Unity. While multi-agent support exists in AI2-THOR, our FurnMove task required implementing a collection of new methods to support randomly initializing our task and moving agents in tandem with the lifted object. Initialization is accomplished by a randomized search procedure that first finds locations in which the lifted television can be placed and then determines whether the agents can be situated around the lifted object so that they are sufficiently close to it and looking at it. Implementing the joint movement actions required checking that all agents and objects can be moved along straight-line paths without encountering collisions.
Top-down Gridworld Mirroring AI2-THOR. To enable fast prototyping and comparisons between differing input modalities, we built an efficient gridworld mirroring AI2-THOR. See fig:grid_vision_thor_comparison for a side-by-side comparison of AI2-THOR and our gridworld. This gridworld was implemented primarily in Python with careful caching of data returned from AI2-THOR.
0.a.3.3 Reward structure.
Rewards are provided to each agent individually at every step. These rewards include: (a) a positive reward whenever the lifted object is moved closer, in Euclidean distance, to the goal object than it had been previously in the episode, (b) a constant step penalty to encourage short trajectories, and (c) a penalty whenever the agent’s action fails. The minimum total reward achievable for a single agent corresponds to making only failed actions, while the maximum total reward is obtained by moving the lifted furniture directly to the goal while avoiding all obstructions. Our models are trained to maximize the expected discounted cumulative return.
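A minimal sketch of this per-step reward follows. The constants are hypothetical placeholders, not the values used in the paper; note that progress is only rewarded when the object is closer to the goal than ever before in the episode, so oscillating back and forth yields no net bonus:

```python
# Hypothetical placeholder constants (the paper's actual values differ).
STEP_PENALTY = -0.01
FAILED_ACTION_PENALTY = -0.02
PROGRESS_BONUS = 1.0   # per meter of *new* progress toward the goal

def step_reward(dist_to_goal, best_dist_so_far, action_failed):
    """Reward for one agent at one timestep, plus the updated best distance."""
    reward = STEP_PENALTY
    if action_failed:
        reward += FAILED_ACTION_PENALTY
    # Only reward improvements over the best distance achieved so far.
    if dist_to_goal < best_dist_so_far:
        reward += PROGRESS_BONUS * (best_dist_so_far - dist_to_goal)
        best_dist_so_far = dist_to_goal
    return reward, best_dist_so_far
```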
0.a.3.4 Optimization and learning hyperparameters.
For all tasks, we train our agents using reinforcement learning, particularly the popular A3C algorithm [MnihEtAlPMLR2016]. For FurnLift, we follow [jain2019CVPRTBONE] and additionally use a warm start via imitation learning (DAgger [RossAISTATS2011]). When we deploy the coordination loss (CORDIAL), we modify the A3C algorithm by replacing the entropy loss with the coordination loss CORDIAL defined in eq:loss.
In our experiments we anneal the parameter from a starting value of to a final value of over the first episodes of training. We use an ADAM optimizer with a learning rate of , momentum parameters of and , with optimizer statistics shared across processes. Gradient updates are performed in an unsynchronized fashion using a HogWild! style approach [RechtNIPS2011]. Each episode has a maximum length of total steps per agent. Task-wise details follow:
FurnMove: Visual agents for FurnMove are trained across several TITAN V or TITAN X GPUs with multiple workers and take approximately three days to train.
Gridworld-FurnMove: Agents for Gridworld-FurnMove are trained for 1,000,000 episodes using multiple workers. Apart from parsing and caching the scene once, gridworld agents do not need to render images. Hence, we train the agents with only a single G4 instance, specifically the g4dn.16xlarge virtual machine on AWS. Agents (i.e., two) for Gridworld-FurnMove take approximately 1 day to train.
Gridworld-FurnMove-3Agents: Same implementation as above, except that agents (i.e., three) for Gridworld-FurnMove-3Agents take approximately 3 days to train. This is due to an increase in the number of forward and backward passes and a CPU bottleneck. Because the joint action space blows up to $13^3 = 2197$ multi-actions (compared to $13^2 = 169$ for two agents), positive rewards become increasingly sparse. This leads to severe inefficiency in training, with no learning for 500k episodes. To overcome this, we double the positive rewards in the RL formulation for all methods within the three-agent setup.
FurnLift: We adhere to the exact training procedure laid out by Jain et al. [jain2019CVPRTBONE]. Visual agents for FurnLift are trained for 100,000 episodes, with the first 10,000 warm-started via DAgger-style imitation learning. Reinforcement learning (A3C) takes over after the warm-start period.
0.a.3.5 Integration with other MARL methods.
As mentioned in sec:related_work, our contributions are orthogonal to the RL method deployed. Here we give some pointers for integration with a deep Q-Learning and a policy gradient method.
QMIX. While we focus on policy gradients and QMIX [rashid2018qmix] uses Q-learning, we can formulate a SYNC variant for Q-learning (and QMIX). Analogous to an actor with multiple policies, consider a value head where each agent’s single Q-function is replaced by a collection of Q-functions, one per strategy. Action sampling is done stage-wise: agents jointly pick a strategy via communication, and then individually choose actions under that strategy. These Q-functions can in turn be incorporated into the QMIX mixing network.
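This stage-wise sampling can be sketched as follows. The shapes and the strategy-selection rule are illustrative assumptions; in practice each Q-vector would be the output of a network conditioned on that agent’s observation:

```python
import numpy as np

rng = np.random.default_rng(1)
n_agents, K, A = 2, 4, 13            # agents, strategies, actions per agent

# q[i][j] is agent i's Q-vector under shared strategy j (a stand-in for a
# network's output given agent i's observation).
q = rng.normal(size=(n_agents, K, A))

# Stage 1: agents agree (via communication) on one shared strategy index;
# here, illustratively, the one with the highest total greedy value.
strategy = int(np.argmax(q.max(axis=2).sum(axis=0)))

# Stage 2: each agent independently acts greedily under that strategy.
actions = [int(np.argmax(q[i, strategy])) for i in range(n_agents)]
```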
COMA/MADDPG. Both these policy gradient algorithms utilize a centralized critic. Since our contributions focus on the actor head, we can directly replace their per-agent policy with our SYNC policies and thus benefit directly from the counterfactual baseline in COMA [FoersterAAAI2018] or the centralized critic in MADDPG [LoweNIPS2017].
0.a.4 Quantitative evaluation details
0.a.4.1 Confidence intervals for metrics reported.
In the main paper, we mentioned that we mark the best performing decentralized method in bold and highlight it in green if it has non-overlapping 95% confidence intervals. In this supplement, particularly in tab:quant_95conf, tab:quant-furnlift-95conf, tab:mixture_95conf, and tab:cl_study_95conf, we include the 95% confidence intervals for the metrics reported in tab:quant, tab:quant-furnlift, tab:mixture, and tab:cl_study.
0.a.4.2 Hypotheses on 3-agent central method performance.
In tab:quant and sec:quantitative of the main paper, we mention that the central method performs worse than SYNC for the Gridworld-FurnMove-3Agent task. We hypothesize that this is because the central method in the 3-agent setup is significantly slower to train, as its actor head has dramatically more parameters. In numbers: the central’s actor head alone has $13^3 \cdot d$ parameters, where $d$ is the dimensionality of the final representation fed into the actor (please see fig:central_model for central’s architecture). For our architecture, $d = 512$, so the central’s actor head has 1,124,864 parameters. Contrast this to SYNC’s $13 \cdot d$ actor parameters per agent for each mixture component; even for the largest number of components in the mixture study (tab:mixture), namely 13, this amounts to 86,528 parameters per agent. Such a large number of parameters makes learning with the central agent slow even after 1M episodes (already more training episodes than used in [jain2019CVPRTBONE]).
0.a.4.3 Why MD-SPL instead of SPL?
SPL was introduced in [anderson2018evaluation] for evaluating single-agent navigational agents, and is defined as
$$\text{SPL} = \frac{1}{N} \sum_{i=1}^{N} S_i \frac{\ell_i}{\max(p_i, \ell_i)},$$
where $i$ denotes an index over episodes, $N$ equals the number of test episodes, and $S_i$ is a binary indicator for success of episode $i$. Also, $p_i$ is the length of the agent’s path and $\ell_i$ is the shortest-path distance from the agent’s start location to the goal. Directly adopting SPL isn’t pragmatic for two reasons:
Coordinating actions at every timestep is critical to this multi-agent task. Therefore, the number of actions taken by agents instead of distance (say in meters) should be incorporated in the metric.
Shortest-path distance has been calculated for two-agent systems for FurnLift [jain2019CVPRTBONE] by finding the shortest path for each agent in a state graph. This can be done effectively for fairly independent agents. However, while each position of an agent corresponds to 4 states (if 4 rotations are possible), each position of the furniture object corresponds to many more, as the positions and rotations of both agents and the object must be tracked jointly. This leads to 404,480 states for an agent-object-agent assembly. We found the shortest-path algorithm to be intractable in a state graph of this magnitude. Hence we resort to the closest approximation: the Manhattan distance from the object’s start position to the goal position. This is the shortest path if there were no obstacles to navigation.
Minimal edits to resolve the above two problems lead us to using the number of actions instead of distance, and the Manhattan distance instead of the shortest-path distance. This leads us to define, as described in sec:metrics of the main paper, the Manhattan-distance-based SPL (MD-SPL).
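The resulting metric can be sketched as follows. The normalization below (comparing the number of actions taken against an obstacle-free action count derived from the Manhattan distance) is an illustrative reading of the two modifications, and the default step size is an assumed constant:

```python
# A sketch of MD-SPL; the exact normalization and step size used in the
# paper may differ.
def md_spl(episodes, step_size=0.25):
    """episodes: list of (success, n_actions, manhattan_dist) tuples.
    `step_size` is the distance (in meters) covered by one movement action;
    0.25 m is an assumption here, not a value taken from the paper."""
    total = 0.0
    for success, n_actions, manhattan_dist in episodes:
        if success:
            # Actions an ideal agent would need, ignoring all obstacles.
            ideal_actions = manhattan_dist / step_size
            total += ideal_actions / max(n_actions, ideal_actions)
    return total / len(episodes)
```

As with SPL, failed episodes contribute zero and an episode that matches the obstacle-free action count contributes one.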
0.a.4.4 Defining additional metrics used for FurnLift.
Jain et al. [jain2019CVPRTBONE] use two metrics which they refer to as failed pickups (picked up, but not ‘pickupable’) and missed pickups (‘pickupable’ but not picked up). ‘Pickupable’ means that the object and agent configurations are valid for a PickUp action.
0.a.4.5 Plots for additional metrics.
See Fig. 8, 9, and 10 for plots of additional metrics recorded during training for the FurnMove, Gridworld-FurnMove, and FurnLift tasks. fig:all-train-plots-vision-furnlift in particular shows how the failed pickups and missed pickups metrics described above are substantially improved when using our SYNC models.
0.a.4.6 Additional 3-agent experiments.
In the main paper we present results when training SYNC, marginal, and central models to complete the 3-agent Gridworld-FurnMove task. We have also trained the same methods to complete the (visual) 3-agent FurnMove task. Rendering and simulating 3-agent interactions in AI2-THOR is computationally taxing. For this reason we trained our SYNC and central models for 300k episodes instead of the 500k episodes we used when training 2-agent models. As it showed no training progress, we also stopped the marginal model’s training after 100k episodes. Training until 300k episodes took approximately four days using eight 12GB GPUs ( GPU hours per model).
After training, the SYNC, marginal, and central models obtained test-set success rates that mirror those of the 3-agent Gridworld-FurnMove task from the main paper. In particular, both the SYNC and central models train to reasonable success rates, but the central model actually performs worse than the SYNC model. A discussion of our hypothesis for why this is the case can be found earlier in this section. Our other illustrative metrics (MD-SPL and Invalid prob) show the same relative ordering of the methods.
|Marginal w/o comm [jain2019CVPRTBONE]||0.032||0.164||224.1||2.143||0.815||0|
|Marginal w/o comm [jain2019CVPRTBONE]||0.111||0.484||172.6||1.525||0.73||0|
|FurnLift [jain2019CVPRTBONE] (‘constrained’ setting with no implicit communication)|
| components in SYNC | MD-SPL | Success | Ep len |
0.a.5 Qualitative evaluation details and a statistical analysis of learned communication
0.a.5.1 Discussion of our qualitative video.
We include a video of policy roll-outs in the supplementary material. This includes four clips, each corresponding to the rollout on a test scene of one of our models trained to complete the FurnMove task.
Clip A. Marginal agents attempt to move the TV to the goal but get stuck in a narrow corridor as they struggle to successfully coordinate their actions. The episode is considered a failure as the agents do not reach the goal in the allotted 250 timesteps. A top-down summary of this trajectory is included in fig:clipa.
Clip B. Unlike the marginal agents from Clip A, in this clip two SYNC agents successfully coordinate their actions and move the TV to the goal location. A top-down summary of this trajectory is included in fig:clipb.
Clip C. Here we show SYNC agents completing the Gridworld-FurnMove task in a test scene (the same scene and initial starting positions as in Clip A and Clip B). The agents complete the task in 148 timesteps, even after an initial search in the incorrect direction.
Clip D (contains audio). This clip is an attempt to convey what the agents ‘hear.’ The video for this clip is the same as Clip B, showing the SYNC method. The audio is a rendering of the communication between agents in the reply stage. In particular, we discretize the value associated with the first reply weight of each agent into 128 evenly spaced bins corresponding to the 128 notes on a MIDI keyboard (0 corresponding to a frequency of 8.18 Hz and 127 to 12500 Hz). Next, we post-process the audio so that the communication from the agents is played on different channels (stereo) and has the Tech Bass tonal quality. As a result, the reader can experience what agent 1 hears (i.e., agent 2’s reply weight) via the left earphone/speaker and what agent 2 hears (i.e., agent 1’s reply weight) via the right speaker. In addition to the studies in sec:qualitative and sec:supp-comm, we notice a higher pitch/frequency for the agent which is passing. We also notice lower pitches for MoveWithObject and MoveObject actions.
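The weight-to-note mapping can be sketched as follows. It assumes standard MIDI tuning (note 69 = 440 Hz); the binning rule is an illustrative assumption, though with 128 bins it reproduces the endpoint frequencies quoted above:

```python
# Map a reply weight in [0, 1] to one of 128 MIDI notes, then to a
# frequency using standard MIDI tuning (note 69 = A440).
def weight_to_frequency(weight):
    note = min(int(weight * 128), 127)      # discretize into 128 bins
    return 440.0 * 2 ** ((note - 69) / 12)  # equal-tempered frequency in Hz

print(round(weight_to_frequency(0.0), 2))   # 8.18 (MIDI note 0)
print(round(weight_to_frequency(1.0), 1))   # 12543.9 (MIDI note 127)
```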
0.a.5.2 Joint policy summaries.
These provide a way to visualize the effective joint distribution that each method captures. For each episode in the test set, we log each multi-action attempted by a method and average over the steps in the episode to obtain a matrix (which sums to one). Afterwards, we average these matrices (one for each episode) to create a joint policy summary of the method for the entire test set. This two-stage averaging prevents the snapshot from being skewed towards actions enacted in longer (failed or challenging) episodes. In the main paper, we included snapshots for FurnMove in fig:matrices. In fig:matrices-more we include additional visualizations for all methods, including the Marginal w/o comm model, for FurnMove and Gridworld-FurnMove.
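The two-stage averaging can be sketched as follows (the action count is illustrative); normalizing within each episode first ensures every episode contributes equally, regardless of its length:

```python
import numpy as np

def joint_policy_summary(episodes, n_actions=13):
    """episodes: list of episodes, each a list of (a1, a2) multi-actions."""
    per_episode = []
    for multi_actions in episodes:
        counts = np.zeros((n_actions, n_actions))
        for a1, a2 in multi_actions:
            counts[a1, a2] += 1
        per_episode.append(counts / counts.sum())  # stage 1: within episode
    return np.mean(per_episode, axis=0)            # stage 2: across episodes
```

A long episode that repeats one multi-action many times is weighted no more heavily than a short successful one.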
0.a.5.3 Communication analysis.
As discussed in sec:qualitative, there is very strong qualitative evidence suggesting that our agents use their talk and reply communication channels to explicitly relay their intentions and coordinate their actions. We now provide a statistical, quantitative evaluation of this phenomenon by fitting multiple logistic regression models in which we attempt to predict, from the agents’ communications, certain aspects of their environment as well as their future actions. In particular, we run 1000 episodes on our test set using our mixture model in the visual testbed. This produces a dataset of 159,380 observations where each observation records, for a single timestep of both agents:
The two weights where is the weight agent assigns to the first symbol in the “talk” vocabulary.
The two weights where is the weight agent assigns to the first symbol in the “reply” vocabulary.
The two values where equals 1 if and only if agent sees the TV at timestep (before taking its action).
The two values where equals 1 if and only if agent ends up choosing to take the Pass action at time (, after finishing communication).
The two values where equals 1 if and only if agent ends up choosing to take some MoveWithObject action at time .
In the following we drop the timestep subscript and consider the above quantities as random samples drawn from the distribution of possible steps taken by our agents in randomly initialized trajectories. As the two agents share almost all of their parameters, they are essentially interchangeable. Because of this, our analysis takes solely the perspective of agent 1; similar results hold for agent 2. We consider fitting the three models:
where is the usual logistic function. Here eq:tv_vis attempts to determine the relationship between what communicates and whether or not is currently seeing the TV, eq:willpass probes whether or not any communication symbol is associated with choosing to take a Pass action, and finally eq:willmwo considers whether or not will choose to take a MoveWithObject action. We fit each of the above models using the glm function in the R programming language [RCoreTeam2019]. Moreover, we compute confidence intervals for our coefficient values using a robust bootstrap procedure. Fitted parameter values can be found in table:glm-fits.
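The fitting procedure for one such model (the will-Pass model) can be sketched on synthetic data; the paper fits the analogous models with R’s glm, while here we use plain gradient ascent, and the data-generating coefficients are made up for illustration:

```python
import numpy as np

# Synthetic stand-in for the logged communication weights; the true
# coefficients below are illustrative, not taken from the paper.
rng = np.random.default_rng(0)
n = 4000
X = rng.uniform(size=(n, 4))                  # columns: t1, t2, r1, r2
true_beta = np.array([0.0, 0.0, 4.0, 0.0])    # Pass driven by one reply weight
p = 1 / (1 + np.exp(-(X @ true_beta - 2.0)))
y = (rng.uniform(size=n) < p).astype(float)

# Fit logistic regression by gradient ascent on the mean log-likelihood.
beta, intercept = np.zeros(4), 0.0
for _ in range(10000):
    resid = y - 1 / (1 + np.exp(-(X @ beta + intercept)))
    beta += 1.0 * X.T @ resid / n
    intercept += 1.0 * resid.mean()
# beta[2] now recovers a large positive association with the Pass action.
```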
From table:glm-fits we draw several conclusions. First, in our dataset, there is a somewhat complex association between an agent seeing the TV and the communication symbols it sends: for a fixed reply weight, a larger talk weight is associated with higher odds of the TV being visible to the agent, but past a point, larger values are associated with smaller odds of the TV being visible. When considering whether or not agent 1 will pass, the table shows that this decision is strongly associated with agent 2’s reply weight: given fixed values for the other talk and reply weights, a reply weight larger by a unit of 0.1 is associated with larger odds of agent 1 taking the Pass action. This suggests interpreting a large reply weight as an agent communicating that it wishes the other agent to pass so that it may itself perform a single-agent navigational action to reposition. Finally, when considering the fitted values corresponding to Eq. (10), we see that while the talk symbols communicated by the agents are weakly related to whether or not an agent takes a MoveWithObject action, the reply symbols are associated with coefficients an order of magnitude larger. In particular, assuming all other communication values are fixed, a smaller value of either agent’s reply weight is associated with substantially larger odds of choosing a MoveWithObject action. This suggests interpreting an especially small reply weight as an agent indicating its readiness to move the object.
Estimates, and corresponding robust bootstrap standard errors, for the parameters of communication analysis (sec:supp-comm).