Collaborative Visual Navigation

by Haiyang Wang et al.
Peking University

As a fundamental problem in Artificial Intelligence, the multi-agent system (MAS) is making rapid progress, mainly driven by multi-agent reinforcement learning (MARL) techniques. However, previous MARL methods largely focused on grid-world-like or game environments; MAS in visually rich environments has remained less explored. To narrow this gap and emphasize the crucial role of perception in MAS, we propose a large-scale 3D dataset, CollaVN, for multi-agent visual navigation (MAVN). In CollaVN, multiple agents are required to cooperatively navigate across photo-realistic environments to reach target locations. Diverse MAVN variants are explored to make our problem more general. Moreover, a memory-augmented communication framework is proposed. Each agent is equipped with a private, external memory to persistently store communication information. This allows agents to make better use of their past communication information, enabling more efficient collaboration and robust long-term planning. In our experiments, several baselines and evaluation metrics are designed. We also empirically verify the efficacy of our proposed MARL approach across different MAVN task settings.





1 Introduction

Human intelligence can collectively solve problems that are otherwise unthinkable for a single person. For instance, Condorcet's Jury Theorem [landemore2008democratic] indicates that, under certain assumptions, adding more voters to a group increases the probability that the majority chooses the right answer. Thus it is believed that a next grand challenge of AI is to answer how multiple intelligent agents can learn human-level collaboration [goertzel2007artificial, peng2017multiagent]. With the recent rapid advances of machine learning and robotic technologies, the collaboration of AI agents has gathered significantly increasing attention from researchers, and finds great potential in a wide range of real-world applications, such as search-and-rescue [liu2016multirobot], area coverage [stergiopoulos2015distributed, kantaros2015distributed], harbour protection [miah2014nonuniform], and so on. In short, such multi-agent systems (MAS) [stone2000multiagent] are of central importance to AI and still in their infancy.

Figure 1:  Three collaborative visual navigation task settings in CollaVN: (a) CommonGoal, (b) SpecificGoal, and (c) Ad-hoCoop.

Multi-agent reinforcement learning (MARL) [busoniu2008comprehensive] and embodied systems [Peng_2018] are the main driving power behind MAS. The MARL field is rapidly expanding, and some approaches even show super-human-level performance in various games [jaderberg2019human], such as MAgent [zheng2017magent], StarCraft [samvelyan2019starcraft], and Dota2 [openai2019dota]. However, most existing algorithms [lowe2017multi, rashid2018qmix, foerster2016learning, lazaridou2017multiagent] operate on low-dimensional observations, i.e., grid-world-like or game environments. As for embodied systems, the availability of large-scale, visually-rich 3D datasets [armeni20163d, chang2017matterport3d, wu2018building, song2017semantic] and community interest in active perception led to a recent surge of simulation platforms [shen2020igibson, chang2017matterport3d, kolve2017ai2, savva2019habitat, xia2018gibson]. Based on these high-performance 3D simulators, various active vision tasks have been investigated, such as embodied navigation [chaplot2020neural, deitke2020robothor], instruction following [anderson2018vision], embodied question answering [das2018embodied], and perception [liu2020when2com], etc. However, most of them either do not support the multi-agent setting [chaplot2020neural, deitke2020robothor, anderson2018vision, das2018embodied], or barely consider planning [das2018embodied, liu2020when2com].

Thus there is still a huge gap between MAS and realistic visual environments. To close this gap, we introduce a collaborative visual navigation (MAVN) task for image-goal based, multi-agent navigation in visually realistic environments, and develop a corresponding dataset, called CollaVN. Autonomous navigation forms a core building block in embodied systems, and has long received research interest across many fields, such as computer vision [chaplot2020neural], robotics [kim1999symbolic], linguistics [striegnitz2011report], and cognition [kuipers1978modeling]. In MAVN, agents cooperate to reach a target position (determined by a corresponding image), using their individual perception. Three task settings are explored to cover more challenges in MAS/MARL: i) CommonGoal (Fig. 1(a)), where agents are assigned the same goal picture; ii) SpecificGoal (Fig. 1(b)), where agents are assigned different goal pictures; and iii) Ad-hoCoop (Fig. 1(c)), where the agent numbers in training and testing are different. CommonGoal serves as a basic setting where agents learn how to cooperate to reach a goal faster. SpecificGoal captures more complex scenarios where agents need to collaborate while having individual targets. Ad-hoCoop is built upon CommonGoal but agents are required to adapt to different team sizes during deployment, simulating a more open MAS situation [barrett2011empirical, canaan2019diverse].

Our CollaVN dataset is built upon iGibsonV1 simulator [xia2020interactive]. Besides inheriting the advantages of iGibsonV1 in high-quality rendering, rich diversity and good accessibility, CollaVN has the following unique characteristics:


  • Multi-agent operation: CollaVN supports MAVN with arbitrary group sizes (2-4 agents used in current setting). Agents can be safely initialized with collision detection.

  • Multi-task setup: All three task variants are included and accompanied by standard evaluation tools.

  • Flexible target generation: CollaVN allows generating different target pictures flexibly, well supporting the three MAVN settings and future dataset extension.

  • Panoramic observation: A panoramic camera is equipped to render each agent's egocentric panorama in real time.

We further propose a novel memory-augmented communication approach for MAVN. Communication is viewed as one of the most desired and efficient ways of multi-agent interaction/coordination [kim2019learning, lazaridou2017multiagent]. Communication is especially crucial when a set of cooperative agents is placed in a partially observable environment, such as our MAVN setting – the agents need to exchange information (e.g., past experience, current perception, future plans, etc.), coordinate their actions, behave as a group, and achieve the target goal better [fox2000probabilistic, Hernandez_Leal_2019]. Thus learning communication has become a central topic in the MARL field. From earlier natural-language-based [lazaridou2017multiagent] to recent hidden-state-based [peng2017multiagent] communication forms, and from pre-defined communication protocols [tan1993multi] to learnable strategies [foerster2016learning, peng2017multiagent, sukhbaatar2016learning], communication-based MARL has made great advances. Some of the most recent works are concerned with practical communication-bandwidth-limited situations [singh2018learning, das2019tarmac, liu2020when2com, liu2020who2com] and conduct studies mainly around the theme of learning what, when, and who to communicate. However, in previous methods, agents simply discard all the information after each communication round. This is problematic, as some communication information, even if not useful at present, may be valuable in the future. This raises the risk of losing essential information and easily leads to suboptimal behaviors driven by short-termism. To address these limitations, in our memory-augmented communication approach, each agent is equipped with a private, compartmentalized memory for safely storing and accurately recalling its past communication information. During each communication round, each agent can send a more reasonable request to other agents by considering its current state and stored information. Each of the other agents then checks its current state and past communication information to give a better response. The agents can make full use of the rich information stored in the memory during navigation. Overall, our approach helps agents conduct more efficient communication and enhances their long-term planning ability, eventually leading to better collaboration and navigation performance.

In a nutshell, our contributions are three-fold:


  • To narrow the gap between MAS and visual perception, we develop a large-scale and publicly-accessible dataset, CollaVN, for collaborative visual navigation (MAVN), in photo-realistic, multi-agent environments.

  • Diverse MAVN task settings are explored to cover many core challenges in MAS/MARL, with packaged evaluation tools and baselines.

  • A novel memory-augmented communication approach is proposed to achieve more efficient multi-agent collaboration and robust long-term planning.

Several baselines and evaluation tools are developed for comprehensive benchmarking. Experimental results show that our memory-augmented communication framework achieves promising performance over the three MAVN settings on the CollaVN dataset. This work is expected to provide insights into the critical role of MAS in embodied perception tasks, and to foster research on the open issues raised.

2 Related Work

Vision-based Navigation. As a crucial step towards building intelligent robots, autonomous navigation has long been studied in the robotics community. Recently, vision-based navigation [chaplot2020neural] has attained growing attention in the computer vision community and has been explored in many forms, such as point-goal based or object-oriented (i.e., reaching a specific position or finding an object) [zhu2016targetdriven, chaplot2020neural, chaplot2020object, deitke2020robothor], natural-language-guided [anderson2018vision, wang2020active], and audio-assisted navigation [chen2020soundspaces], and from indoor environments [anderson2018vision] to street scenes [chen2019touchdown]. Prominent early methods rely on a pre-given/-computed global map [borenstein1991vector, kim1999symbolic] for path planning, while later ones refer to simultaneous localization and mapping (SLAM) techniques [thrun2002probabilistic, hartley2003multiple] that reconstruct the map on the fly. Some more recent methods learn planning and mapping jointly [gupta2017cognitive, parisotto2017neural, zhang2017neural, chaplot2020learning, lee2018gated, Wang_2021_CVPR].

Although great advances have been achieved, visual navigation in the multi-agent setting has remained largely unexamined. Most existing studies [anderson2018vision, chen2020soundspaces, chen2019touchdown] focused on single-agent navigation, except for [jain2020cordial, jain2019two], which consider two-agent collaboration in finding and lifting bulky items. In this work, we develop a large-scale dataset for indoor, collaborative, multi-agent visual navigation. The agents are required to work together to perform the same or different tasks, even without any assumption on the number of involved agents. Thus our task setting is more novel and general.

Embodied AI Environments. The recent surge of research in embodied perception is greatly driven by new 3D environments [xia2018gibson, armeni20163d, chang2017matterport3d, brodeur2017home, song2017semantic] and simulation platforms, such as iGibsonV1 [xia2020interactive], AI2-THOR [kolve2017ai2], and Habitat [savva2019habitat]. Compared with grid-like or game environments [zheng2017magent, samvelyan2019starcraft, openai2019dota, jaderberg2019human], these openly accessible, photo-realistic platforms bring perception and planning into a closed loop and make the training and testing of embodied agents reproducible [chen2020soundspaces]. Based on these platforms, numerous embodied vision datasets have been proposed [deitke2020robothor, anderson2018vision, chen2020soundspaces], while most of them are designed for single-agent tasks. We build our dataset upon iGibsonV1 [xia2020interactive], which features large scale, rich visual diversity, and strong extensibility. Although iGibsonV2 [shen2020igibson] also involves multiple agents, it addresses obstacle avoidance in social scenes; it does not investigate collaboration among agents, which is the core nature and research focus of MAS/MARL. Moreover, our dataset supports diverse essential MARL task settings, making it unique in the field of visual navigation.

Multi-Agent Reinforcement Learning (MARL). MARL tackles the sequential decision-making problem of multiple agents in a common environment, each of which aims to achieve its own long-term goal by interacting with the environment and the other agents [busoniu2008comprehensive, Hernandez_Leal_2019]. Basically, MARL algorithms fall into four categories [Hernandez_Leal_2019]. i) Analysis of emergent behaviors. Some early studies [raghu2018can, bansal2018emergent] analyze single-agent RL algorithms in multi-agent environments. ii) Learning communication. Methods in this category [mordatch2018emergence, lazaridou2017multiagent, foerster2016learning] address collaboration through explicit communication, which has attracted increasing attention recently. iii) Learning cooperation. Many other efforts [panait2005cooperative, matignon2012independent, palmer2017lenient] indirectly arrive at cooperation via, for example, policy parameter sharing [gupta2017cooperative] or experience replay buffers [foerster2017stabilising]. iv) Agents modeling agents. These methods [raileanu2018modeling, hong2017deep] build models to reason about the behaviors of other agents for better decision-making.

These algorithms mostly focus on grid-world-like scenes, where perception plays only a limited role. Our CollaVN dataset provides a visually-rich platform to foster studies of MARL in computer vision. Since MAVN is cooperative in nature and its agents are situated in a partially observable environment, we are concerned with learning efficient communication for long-term, collaborative planning.

Learning to Communicate. Communication is a fundamental aspect of intelligence, enabling agents to behave as a group rather than a collection of individuals. Along the direction of communication-based collaboration, MARL researchers first used predefined communication protocols [tan1993multi], and then learnable strategies [foerster2016learning, peng2017multiagent, jiang2018learning, sukhbaatar2016learning] for information exchange. However, sharing information among all agents is problematic, as communication bandwidth is limited in the real world. To reduce bandwidth waste, some recent methods adopt pre-defined communication groups [jiang2018learning, Jiang2020Graph] or set "what, when and/or who to communicate" as a part of policy learning [foerster2016learning, singh2018learning, das2019tarmac, liu2020when2com, liu2020who2com].

As previous methods only have an "immediate memory" during communication, they suffer from a bias towards short-sighted behaviors. We instead equip each agent with a private, external memory that persistently stores the information from all past communication rounds and enables the reuse of this information during future navigation. This enhances the long-term planning ability of our agents. This idea is also distinctly different from [peng2017multiagent, pesce2020improving], where all the agents need to transmit messages to a central memory, incurring a major communication bottleneck.

Ad-Hoc Cooperation. The problem of ad-hoc team play in multi-agent cooperative games was raised in the early 2000s [bowling2005coordination] and is mostly studied in the robotic soccer domain [hausknecht2016half]. In many closed MARL environments, agents learn a fixed and team-dependent policy [zhang2020multi], while in the ad-hoc setting agents must assess and adapt to the capabilities of others to behave optimally in an open environment; see [albrecht2018autonomous] for a survey. Existing work in ad-hoc MARL mainly focuses on game settings, such as Hanabi (a card game) [canaan2019diverse]. Our work makes a further step towards exploring the ad-hoc setting in visually-rich MARL; the navigation agents are required to adapt to different team sizes during deployment.

3 Multi-Agent Visual Navigation Task

3.1 Our CollaVN Dataset

CollaVN is a visual MARL dataset, created for the development of intelligent, cooperative navigation agents.

iGibsonV1 [xia2020interactive]. CollaVN is built upon iGibsonV1, a recently released open-source 3D simulator. iGibsonV1 supports fast visual rendering at 400 fps and allows researchers to easily train and evaluate embodied navigation agents.

New Modules. To support the multi-agent navigation task, three modules are developed: i) An agent initialization module is responsible for "safely" placing multiple agents at the beginning of each navigation episode, i.e., randomly placing agents while respecting scene and physics constraints, such as avoiding collisions. ii) As the panoramic visual observation space is widely adopted in current embodied vision tasks [chaplot2020neural, deng2020evolving], a panoramic camera is equipped to render agents' 360° egocentric views in real time. iii) A target generation module is built for generating a target picture as the goal of navigation. This module adopts a front-view camera, taking pictures at around the agent height (i.e., 0.9 m) under the constraints of scene and physics.

Scenes. All 572 full buildings in iGibsonV1 are involved in CollaVN, covering a total area of 211K m². The area of a physical navigation space considered in CollaVN ranges from 10 m × 10 m to 30 m × 30 m.

Agent. In CollaVN, we use LoCoBot, a widely used simulation agent [pyrobot2019]. Its body width is 0.36 m, and it is controlled by two wheels. The maximum translational velocity is 70 cm/s and the maximum rotational velocity is 180 deg/s.

Data Generation. Following the Gibson-tiny split [xia2020interactive], we use 25/5/5 scenes for creating train/val/test data. We build three sub-datasets for the different MAVN tasks, i.e., CommonGoal, SpecificGoal, and Ad-hoCoop. A total of 1M/60K/120K episodes/samples are generated for train/val/test. According to the initial agent-target distance at the beginning of each episode, fine-grained annotations for val/test data are given: easy (1.5-3 m), medium (3-5 m), hard (5-10 m). We provide more details in §3.2.

Comparison to Previous Datasets. As shown in Table 1, Habitat [savva2019habitat] only supports the single-agent setting. AI2THOR [kolve2017ai2] added multi-agent support only recently. However, its main focus is task planning of sequential actions to change object states, and its environment space is small (only 9 m × 4 m), making it unsuitable for studying challenging, long-range navigation. Note that [jain2020cordial, jain2019two] recently extended AI2THOR to involve a collaborative task, i.e., finding and lifting furniture. They are limited to highly-correlated actions between two agents and cannot explore complex multi-agent collaborative behaviors in more general situations (e.g., larger team sizes, different goals). iGibsonV2 [shen2020igibson], also built upon iGibsonV1, mainly focuses on obstacle avoidance in social scenes, i.e., only one agent is required to execute point-goal navigation while other agents move around (acting as pedestrians). Thus iGibsonV2 does not address collaboration among multiple navigators.


Dataset Pub. Year MA CM LS CA PA
AI2THOR [kolve2017ai2] arXiv 2017
Habitat [savva2019habitat] ICCV 2019
iGibsonV1 [xia2020interactive] RAL 2020
iGibsonV2 [shen2020igibson] arXiv 2020
CollaVN - 2021
Table 1: Point-to-point comparison to closely relevant 3D platforms/datasets. MA: Multi-agent. CM: Multi-agent communication. LS: Large scale. CA: Continuous action. PA: Panorama.

3.2 Task Setup

Basic Setting. MAVN assumes there are N autonomous agents situated in a CollaVN environment. Each agent is required to navigate the environment to reach a target location (indicated by a goal image), under a cooperative setting. Formally, at the beginning of each navigation episode, each agent i is assigned a target goal image g^i, i.e., an RGB image. Each agent does not know the other agents' positions. At each time step t, each agent i receives its visual observation o_t^i, i.e., a first-person panoramic-view RGB image. As in many communication-based MARL settings [jiang2018learning, foerster2016learning, kim2019learning], agents operate in a partially observable environment and perceive the environment from their own view. They can exchange information via a low-bandwidth communication channel. Sensing and control errors and communication delays are not considered.

Action Space. Each agent controls its wheels, and the maximum allowed translational and rotational velocities are 70 cm/s and 180 deg/s, respectively. At each time step t, each agent i takes a continuous action a_t^i ∈ [-1, 1]^2, corresponding to the normalized translational and rotational velocities. For example, a_t^i = (1.0, 0.5) means the agent moves at the maximum translational velocity (i.e., 70 cm/s) and half of the maximum rotational velocity (i.e., 90 deg/s) during [t, t+1]. The interval between t and t+1 is 1 s.
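The action-to-velocity mapping above can be sketched as follows; this is a minimal illustration assuming the normalized action is simply scaled by the two velocity limits stated in the text (the function name is hypothetical):

```python
V_MAX = 0.70   # maximum translational velocity (m/s), from the task spec
W_MAX = 180.0  # maximum rotational velocity (deg/s), from the task spec

def denormalize_action(a_trans: float, a_rot: float):
    """Map a normalized action in [-1, 1]^2 to physical wheel commands."""
    # Clamp first so out-of-range policy outputs stay physically valid.
    a_trans = max(-1.0, min(1.0, a_trans))
    a_rot = max(-1.0, min(1.0, a_rot))
    return a_trans * V_MAX, a_rot * W_MAX
```

For the example in the text, the action (1.0, 0.5) maps to 0.70 m/s and 90 deg/s.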

MAVN Task Settings. To better cover challenges in MARL and MAS, three different MAVN task settings are designed:


  • CommonGoal: All N robot agents are assigned a common goal, i.e., g^1 = g^2 = ... = g^N. Agents need to collaboratively navigate to the target goal within a maximum time step length T. This is the most basic MAVN task setting, with the purpose of investigating how several agents can learn, from the visual environment, to communicate so as to effectively and collaboratively solve the same given task.

  • SpecificGoal: The difference from CommonGoal is that SpecificGoal assigns agents different goal images, i.e., g^i ≠ g^j for i ≠ j. In this setting, we are particularly interested in how agents learn communication and collaboration with different long-term targets.

  • Ad-hoCoop: In CommonGoal and SpecificGoal, the team size remains fixed during both training and deployment phases. The learned agents may not generalize to new team configurations. Ad-hoCoop is a more open-world setting, in which agents must adapt to different team sizes. We test ad-hoc team play by adding or removing agents at test time. Note that in Ad-hoCoop agents are assigned the same goal in each episode.

Task-Specific Sub-Datasets. Our CollaVN has three sub-datasets, corresponding to the three MAVN tasks:


  • CommonGoal-CollaVN: Three subsets are created for different agent numbers (i.e., N = 2, 3, 4), and the agents in each episode are assigned the same target image. For each subset, we generate 5K episodes per training scene. For each val/test scene, we generate 0.5K/1K episodes per configuration, i.e., easy (1.5-3 m), medium (3-5 m), hard (5-10 m). Finally, each subset has 125K/7.5K/15K samples for train/val/test.

  • SpecificGoal-CollaVN: Three subsets are created for different agent numbers (i.e., N = 2, 3, 4), and the agents in each episode are assigned different target images. Similarly, each subset has 125K/7.5K/15K samples for train/val/test.

  • Ad-Hoc-CollaVN: Two subsets are created for different ad-hoc team-size changes (i.e., 2→3 and 3→2), and each has 125K/7.5K/15K samples for train/val/test. Agents in each episode are assigned the same target image. For 2→3, its train set is the one in CommonGoal-CollaVN with N = 2, but its val and test sets are newly generated: each training sample contains 2 agents while each val/test sample has 3 agents. The 3→2 subset is built in a similar way.

3.3 Evaluation Measures

For performance evaluation, we use four metrics. The former three are multi-agent augmented versions of SR (Success Rate), DTS (Distance to Success), and SPL (Success weighted by Path Length) [anderson2018evaluation], and the last one, named SSR (Success weighted by Step Ratio), is newly proposed.

SR. Following [xia2020interactive], we consider an episode successful for an agent if it reaches the target location within a 1 m radius. Let S_e^i be a binary indicator of success in episode e for agent i. Over E episodes and N agents, SR is given as: SR = 1/(NE) Σ_{e,i} S_e^i.

DTS. Let d_e^i denote the geodesic distance between agent i and its goal location at the end of episode e, and d_s the success threshold (i.e., 1 m). DTS = 1/(NE) Σ_{e,i} max(d_e^i − d_s, 0).

SPL. Let l_e^i be the geodesic distance from the starting position to the goal of agent i in episode e, and p_e^i the length of the path actually taken by agent i in episode e. We have: SPL = 1/(NE) Σ_{e,i} S_e^i · l_e^i / max(p_e^i, l_e^i).

SSR. Let T denote the allowed maximum time step and t_e^i the number of navigation steps used by agent i in episode e. SSR weights each success by the fraction of the step budget left unused: SSR = 1/(NE) Σ_{e,i} S_e^i · (T − t_e^i)/T. As MAVN is a very difficult task, in our experiments (§5.3) we found agents usually need to take many steps (long paths) to reach the targets, making SPL very small and less discriminative. So we design SSR as a complement.
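The four metrics can be sketched as one evaluation routine. SR, DTS, and SPL follow their standard single-agent definitions averaged over all (episode, agent) pairs; the exact SSR formula is not given in this extraction, so the form below (success weighted by the fraction of the unused step budget) is an assumption, as are the function and argument names:

```python
import numpy as np

def navigation_metrics(success, dist_end, geo_dist, path_len, steps,
                       d_thresh=1.0, t_max=500):
    """Multi-agent navigation metrics averaged over all (episode, agent) pairs.

    success:  binary success indicators S_e^i
    dist_end: geodesic distance to the goal at episode end
    geo_dist: shortest geodesic distance from start to goal
    path_len: length of the path actually taken
    steps:    navigation steps used
    """
    S = np.asarray(success, dtype=float)
    sr = S.mean()                                                 # Success Rate
    dts = np.maximum(np.asarray(dist_end) - d_thresh, 0).mean()   # Distance to Success
    spl = (S * np.asarray(geo_dist) /
           np.maximum(np.asarray(path_len), np.asarray(geo_dist))).mean()
    ssr = (S * (t_max - np.asarray(steps)) / t_max).mean()        # assumed SSR form
    return sr, dts, spl, ssr
```

All inputs are flat arrays over the N·E (episode, agent) pairs, so the means directly implement the 1/(NE) Σ averages.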

4 Methodology


Figure 2: (a) Overview of our method. (b) Detailed illustration of our memory-augmented communication strategy.

4.1 Problem Setup

MARL Background. We consider a system of N agents operating in a common environment as an N-agent extension of partially observable Markov decision processes [littman1994markov]. It assumes a set S [pesce2020improving] containing all the states characterising the environment; action spaces A^1, ..., A^N, where A^i is the set of possible actions for the i-th agent; and observation spaces Ω^1, ..., Ω^N, where Ω^i contains the observations available to the i-th agent. Each agent i receives a private observation o^i ∈ Ω^i (i.e., a partial characterisation of the current state s ∈ S), and chooses an action a^i ∈ A^i according to a stochastic policy π_{θ^i}(a^i | o^i), parametrized by θ^i. The next state is produced according to a transition function. Each agent obtains a reward r^i as a function of the state and its action, and aims to maximize its expected return R^i = Σ_{t=0}^{T} γ^t r_t^i, where γ is a discount factor and T is the time horizon.
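The discounted return that each agent maximizes can be made concrete with a short sketch (function name hypothetical; γ = 0.99 is an arbitrary illustrative choice):

```python
def discounted_return(rewards, gamma=0.99):
    """Return R = sum_t gamma^t * r_t for one agent's reward sequence."""
    ret = 0.0
    for t, r in enumerate(rewards):
        ret += (gamma ** t) * r
    return ret
```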

Task Statement. With our three MAVN tasks, we explore how agents learn from visually-rich environments to achieve collaborative navigation. In CommonGoal (SpecificGoal), agents learn to cooperatively navigate to the same (different) locations, determined by the target images g^i. In Ad-hoCoop, agents are further expected to learn more robust policies that are adaptive to the team size N.

4.2 Memory-Augmented Communication for MAVN

Overview. Our agents are connected by a low-bandwidth communication network without any central controller, through which they can coordinate their actions and behave as a group. As shown in Fig. 2(a), each agent mainly has two components: i) a Map Building Module (§4.2.1) and ii) a Memory-Augmented Communication Module (§4.2.2). For i), map building is a crucial step in navigation; an environment representation m_t^i is estimated for path planning. For ii), each agent learns to generate useful information for communication and to gather messages from other agents simultaneously. It is equipped with a private memory for accurately storing and recalling past communication information. For each agent i, based on its private observation, estimated local map, communication information, and target goal, its policy becomes π_{θ^i}(a_t^i | o_t^i, m_t^i, c_t^i, g^i). See §3.2 for the detailed definitions of o_t^i, a_t^i, and g^i. Our whole system is trainable end-to-end; neural networks are used as approximations for the policies and modules. Corresponding descriptions are presented below.

4.2.1 Map Building Module

Learning environment layouts is crucial for autonomous navigation [parisotto2017neural, zhang2017neural]. Inspired by [chaplot2020object, chaplot2020learning], we equip each agent with a map building module, which online estimates environment structures from the agent's past percepts. For each agent i, we denote its pose as (x, y, φ), where (x, y) indicates its xy coordinate and φ represents its orientation in radians [chaplot2020object]. We set the initial pose to the origin and let each agent build the map in its own egocentric coordinate system.

For each agent i at time step t, the map building module takes its local observation o_t^i, its current and last poses, and the previously predicted map m_{t-1}^i as input, and outputs an updated map m_t^i.

m_t^i is a 260-channel matrix of spatial size M × M, where M denotes the map size. The first 256 channels store visual features over all the explored locations. To save the map size, each grid cell corresponds to a fixed area in the physical world. The last four channels include a probability map of obstacles and three binary maps storing the explored area, the agent's past trajectory, and its current location, with the same grid size. At the beginning of an episode, m_0^i is initialized with all zeros and the agent is placed at the center of the map.
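The map layout above can be sketched as a tensor initialization; the spatial size M = 240 is an assumption for illustration (the exact value is not given here), while the channel layout (256 feature channels plus four auxiliary channels) follows the text:

```python
import numpy as np

M = 240  # map size in grid cells (assumed; not specified in the text)

def init_map():
    """Egocentric spatial map of shape (256+4, M, M), zero-initialized.

    Channels 0-255:   visual features over explored locations.
    Channel 256:      obstacle probability map.
    Channels 257-259: binary maps for explored area, past trajectory,
                      and current agent location.
    """
    m = np.zeros((260, M, M), dtype=np.float32)
    # The agent starts at the center of its own egocentric map.
    m[259, M // 2, M // 2] = 1.0
    return m
```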

4.2.2 Memory-Augmented Communication Module

In current communication-based MARL algorithms [jiang2018learning, foerster2016learning, kim2019learning], all the information generated in the communication round at step t is directly discarded at the beginning of step t+1, and new messages are generated for the next round. However, some discarded information may be useful in the future. In addition, the communication is mainly about agents' current observations, failing to explicitly involve their past experience; it is difficult for an agent to learn another agent's status at a previous time step. We instead propose a memory-augmented communication strategy that allows agents to accurately store and recall their past communication information. It is instantiated as a handshake-communication-based MARL framework [das2019tarmac, liu2020who2com].

Handshake Strategy. Three stages are involved in vanilla handshake communication, i.e., request, match, and select, and each agent acts as both requester and supporter:


  • request. At the beginning of each time/communication step t, each agent i first compresses its local observation o_t^i into a query q_t^i, a key k_t^i, and a value v_t^i, where q_t^i and k_t^i are compact vectors while v_t^i is a high-dimensional vector that fully preserves the information of o_t^i. Then the agent broadcasts q_t^i to the other agents. As q_t^i is small, this causes only little bandwidth consumption.

  • match. Each of the other agents j derives a matching score between the received query q_t^i and its own key k_t^j via a learnable similarity function. The score can be intuitively viewed as the importance of the information provided by the supporter j for the requester i.

  • select. Connections with small matching scores can be removed. Then the requester i collects the value vectors returned from its temporally connected supporters at time step t, weighted by the matching scores, and the integrated message is used to help the requester take an action decision a_t^i.

Memory-Augmented Communication. As shown in Fig. 2(b), each agent i maintains a private, external memory, which stores all its past generated communication information (keys and values). Then our memory-augmented handshake communication has the following four stages:


  • request. At the beginning of each time step t, each agent i generates query, key, and value vectors, where average pooling is applied and both the private observation o_t^i and the map m_t^i are considered. In addition, the past generated communication information stored in the memory is also involved, to generate a more reasonable query q_t^i.

  • match. Then the agent broadcasts q_t^i to the other agents. Each of the other agents j matches the received query against its current key k_t^j as well as all the past self-generated keys stored in its memory, and derives the final matching score and returned message from both. The requester i likewise computes the correlations between its query and all of its own current and past keys stored in its memory, obtaining a self-retrieved message in a similar manner.

  • select. To filter out less relevant connections, we apply an activation function $\sigma$ to each score $m_{ij}^t$:

    $$\hat{m}_{ij}^t = \sigma(m_{ij}^t).$$

    As in [liu2020when2com], only the supporters with non-zero scores are allowed to send messages to the requester $i$ (i.e., learning who to communicate with). Moreover, when all $\hat{m}_{ij}^t = 0$, the requester decides that there is no need to establish any communication at this time step $t$ (i.e., learning when to communicate). We set $\sigma$ following [liu2020when2com]. If some scores are non-zero, the requester $i$ collects information from itself and the other connected supporters (those with $\hat{m}_{ij}^t > 0$):

    $$c_i^t = \hat{m}_{ii}^t\, \bar{v}_i^t + \sum_{j:\,\hat{m}_{ij}^t > 0} \hat{m}_{ij}^t\, \bar{v}_j^t.$$

  • store. Next, the agent stores its value and key information in the memory, yielding $\mathcal{M}_i^{t+1}$ for the next round of communication. In our current implementation, we simply assume the private memory of each agent is large enough to store all the communication information generated over the whole navigation episode. There are several ways to improve efficiency: one can perform the store operation at a fixed time interval, build the memory as a stack with fixed capacity, or even learn a specific write operation that decides whether the current communication information is worth storing by comparing its uniqueness with existing memory slots. We leave this as future work.
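A minimal sketch of the store/recall mechanics on the supporter side, assuming a ReLU-style threshold gates whether a supporter answers at all; the class and function names are hypothetical:

```python
import numpy as np

D_QK, D_V = 4, 8  # illustrative dimensions

class AgentMemory:
    """Private external memory holding all past (key, value) pairs."""
    def __init__(self):
        self.keys, self.values = [], []

    def store(self, key, value):          # the store stage after each round
        self.keys.append(key)
        self.values.append(value)

def respond(query, k_now, v_now, memory):
    """Supporter side: match the query against the current key and every
    past key in memory; return a score-weighted message, or None when all
    scores are gated to zero (learning when to communicate)."""
    keys = np.array([k_now] + memory.keys)
    values = np.array([v_now] + memory.values)
    scores = np.maximum(keys @ query, 0.0)    # ReLU-style gating
    if scores.sum() == 0.0:
        return None
    return (scores / scores.sum()) @ values

mem = AgentMemory()
mem.store(np.ones(D_QK), np.ones(D_V))        # one past round in memory
msg = respond(np.ones(D_QK), 2 * np.ones(D_QK), 3 * np.ones(D_V), mem)
print(None if msg is None else msg.shape)
```

Because the memory only ever appends, recall cost grows linearly with episode length, which is exactly why the text mentions fixed-capacity stacks and learned write operations as future refinements.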

After the communication rounds, each agent $i$ takes a navigation action through its policy:

$$a_i^t \sim \pi\big(a \mid o_i^t, u_i^t, c_i^t\big).$$

4.3 Fully Decentralized Learning

At each time step $t$, each agent $i$ executes an individual action $a_i^t$, with the joint goal of maximizing the average return of all the agents. We adopt an actor-critic model [mnih2016asynchronous] for training and consider it under a fully decentralized MARL setting [zhang2019multi, suttle2019multi, zhang2018fully], in which agents are connected via a time-varying communication network. Standard actor-critic methods consist of two models: an actor that proposes an action and a critic that provides a feedback/evaluation accordingly. In PPO [schulman2017proximal], neural networks are used to approximate both the actor, represented by a policy function $\pi_\theta$, and its corresponding critic, represented by a value function $V_\phi$. In order to maximize the expected return, we have the following policy loss (without the clip operation):

$$\mathcal{L}_\pi(\theta) = -\,\mathbb{E}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}\, \hat{A}_t\right],$$
where the advantage $\hat{A}_t$ is estimated from the value function $V_\phi$ and the return $R_t$ [schulman2015high], and $R_t$ indicates the expected discounted return [williams1992simple].
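For concreteness, the surrogate loss can be written as a small function; `clip_eps=None` gives the unclipped form discussed above, and the numbers below are only a toy check, not results from the paper:

```python
import numpy as np

def ppo_policy_loss(logp_new, logp_old, advantages, clip_eps=None):
    """PPO policy surrogate. With clip_eps=None this is the plain
    importance-ratio-times-advantage objective; with a float it applies
    the usual PPO clipping."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    adv = np.asarray(advantages)
    obj = ratio * adv
    if clip_eps is not None:
        obj = np.minimum(obj, np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
    return -obj.mean()  # minimized by gradient descent

# Identical old/new policies: the loss reduces to minus the mean advantage.
print(ppo_policy_loss([-1.0, -2.0], [-1.0, -2.0], [1.0, -0.5]))  # -0.25
```

The clipped variant only matters once the new policy drifts from the behavior policy; at the first gradient step after a rollout the two forms coincide.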

In this formulation, as there is no interaction between agents, the policies are learned independently. Previous MARL works [lowe2017multi, foerster2018counterfactual, yu2021surprising, iqbal2019actor, pesce2020improving] mainly address this by training multiple interacting agents in a Centralized Training and Decentralized Execution manner: they learn decentralized policies with a centralized critic that collects all the observations to estimate the joint state of all the agents. Our communication-based MARL method, however, operates under a fully decentralized setting [zhang2019multi, suttle2019multi, zhang2018fully], in which agents are connected via a time-varying communication network. As agents can exchange information through communication, we learn an individual value function for each agent, dependent on both its local observation and the received message:

$$V_i^t = V_\phi\big(o_i^t, c_i^t\big).$$

N Method | Easy (1.5-3 m) | Medium (3-5 m) | Hard (5-10 m) | Overall (each block reports SR, DTS, SSR, SPL; cf. §3.3)
1 Random 8.31 1.26 0.10 0.02 0.90 3.01 0.01 0.00 0.00 6.49 0.00 0.00 3.07 3.59 0.04 0.01
IL 37.10 1.22 2.64 0.13 10.10 2.75 0.25 0.05 1.80 5.81 0.07 0.01 16.33 3.26 0.98 0.06
2 IL 37.30 1.22 2.66 0.13 10.00 2.76 0.24 0.05 1.81 5.80 0.07 0.01 16.37 3.26 0.99 0.06
MARL w/o com 39.10 1.19 2.98 0.19 12.10 2.61 0.53 0.06 1.89 5.76 0.07 0.01 17.69 3.18 1.11 0.08
MARL w/o mem 40.21 1.15 3.13 0.19 12.71 2.56 0.32 0.06 1.98 5.71 0.07 0.01 18.29 3.14 1.17 0.09
Ours 45.71 1.05 3.53 0.22 16.55 2.42 0.41 0.08 2.31 5.54 0.09 0.01 21.52 3.00 1.34 0.10
3 MARL w/o com 39.00 1.19 2.97 0.19 12.31 2.58 0.31 0.06 1.87 5.75 0.09 0.01 17.72 3.17 1.11 0.08
MARL w/o mem 40.81 1.14 3.21 0.20 12.92 2.54 0.34 0.06 2.01 5.71 0.07 0.01 18.57 3.13 1.20 0.08
Ours 47.03 1.03 3.69 0.23 17.50 2.36 0.45 0.08 2.65 5.45 0.10 0.01 22.39 2.94 1.41 0.11
4 MARL w/o com 39.40 1.19 2.96 0.19 11.91 2.60 0.32 0.06 1.87 5.73 0.07 0.01 17.72 3.17 1.12 0.09
MARL w/o mem 41.02 1.14 3.23 0.20 13.10 2.52 0.34 0.06 2.01 5.71 0.07 0.01 18.70 3.12 1.21 0.09
Ours 48.01 1.01 3.75 0.24 18.40 2.31 0.48 0.09 2.91 5.40 0.11 0.01 23.10 2.90 1.45 0.11
Table 2: Performance on the CommonGoal task with different team sizes (N = 1, 2, 3, 4) over our CommonGoal-CollaVN test set.


N Method | Easy (1.5-3 m) | Medium (3-5 m) | Hard (5-10 m) | Overall (each block reports SR, DTS, SSR, SPL; cf. §3.3)
2 MARL w/o com 39.02 1.19 2.99 0.19 12.00 2.59 0.30 0.06 1.88 5.75 0.07 0.01 17.63 3.18 1.12 0.09
MARL w/o mem 39.86 1.17 3.05 0.19 12.40 2.58 0.31 0.06 1.94 5.72 0.07 0.01 18.06 3.16 1.14 0.09
Ours 44.50 1.07 3.49 0.21 15.95 2.45 0.40 0.08 2.26 5.58 0.08 0.01 20.90 3.03 1.32 0.10
3 MARL w/o com 39.08 1.19 2.98 0.19 12.12 2.58 0.30 0.06 1.87 5.74 0.07 0.01 17.69 3.17 1.12 0.09
MARL w/o mem 40.20 1.16 3.08 0.19 12.65 2.56 0.32 0.06 1.96 5.71 0.07 0.01 18.27 3.14 1.16 0.09
Ours 45.83 1.05 3.61 0.22 17.00 2.38 0.43 0.08 2.57 5.49 0.09 0.01 21.80 2.97 1.38 0.10
4 MARL w/o com 39.14 1.19 2.95 0.19 12.01 2.59 0.31 0.06 1.88 5.73 0.07 0.01 17.68 3.17 1.11 0.09
MARL w/o mem 40.60 1.15 3.13 0.20 12.85 2.54 0.33 0.06 1.98 5.71 0.07 0.01 18.47 3.13 1.18 0.09
Ours 46.79 1.03 3.71 0.23 17.82 2.33 0.46 0.09 2.80 5.42 0.10 0.01 22.47 2.93 1.42 0.11
Table 3: Performance on the SpecificGoal task with different team sizes (N = 2, 3, 4) over our SpecificGoal-CollaVN test set.


Setting Method | Easy (1.5-3 m) | Medium (3-5 m) | Hard (5-10 m) | Overall (each block reports SR, DTS, SSR, SPL; cf. §3.3)
2→3 MARL w/o com 39.10 1.19 2.97 0.19 12.30 2.58 0.32 0.06 1.88 5.75 0.07 0.01 17.76 3.17 1.12 0.09
MARL w/o mem 40.50 1.15 3.17 0.20 12.87 2.55 0.34 0.06 1.99 5.72 0.07 0.01 18.45 3.14 1.19 0.09
Ours 46.66 1.03 3.68 0.22 17.30 2.37 0.45 0.08 2.60 5.47 0.10 0.01 22.17 2.96 1.41 0.11
3→2 MARL w/o com 39.00 1.19 2.98 0.19 12.20 2.60 0.30 0.06 1.89 5.76 0.07 0.01 17.70 3.18 1.12 0.09
MARL w/o mem 39.90 1.17 3.10 0.19 12.40 2.58 0.31 0.06 1.96 5.71 0.07 0.01 18.08 3.15 1.16 0.09
Ours 45.20 1.06 3.51 0.22 16.10 2.43 0.41 0.08 2.27 5.58 0.08 0.01 21.19 3.02 1.33 0.10
Table 4: Performance on the Ad-hoCoop task with two ad-hoc team-play settings (2→3 and 3→2) over our Ad-Hoc-CollaVN test set.

5 Experiments

We conduct experiments on the three MAVN tasks, i.e., CommonGoal, SpecificGoal, and Ad-hoCoop, on the corresponding sub-datasets of CollaVN, i.e., CommonGoal-CollaVN, SpecificGoal-CollaVN, and Ad-Hoc-CollaVN.

5.1 Experimental Setup

Network Details. The embedding of the local visual observation is obtained from a pretrained ResNet18 [he2016deep] and compressed into a 1024-d vector. The implementation of the map building module (§4.2.1) mainly follows [chaplot2020object, chaplot2020learning], and the map corresponds to a fixed-size region of the physical world. Query $q$, key $k$, and value $v$ in §4.2.2 are 256-d, 256-d, and 2048-d vectors, respectively. All the learnable functions in §4.2.2 are one-layer MLPs. The policy and value functions are GRUs [chung2014empirical].

Evaluation. For the different MAVN tasks, we adopt the test sets of the corresponding sub-datasets in CollaVN for benchmarking, measured by SR, DTS, SPL, and SSR (§3.3). The maximum episode length is capped at a fixed number of steps.

Training. Our model is trained using a dense reward equal to the reduction in geodesic distance to the goal location, plus a constant −0.05 penalty per collision.


Our model is implemented in PyTorch and trained on four NVIDIA Tesla V100 GPUs with 32 GB of memory per card. To reveal the full details of our algorithm, our implementation is released.

5.2 Baselines

The following baselines are used in our experiments (their sub-modules are the same as ours unless otherwise specified):


  • Random walk is the simplest heuristic for navigation: the agent randomly selects an action at each step.

  • IL learns a navigation policy via imitation learning (IL). One agent is trained, and several copies of it are used for testing.

  • MARL w/o com learns a target-driven MARL policy without explicit communication, using IPPO [schroeder2020independent].

  • MARL w/o mem is a variant of our model without the memory.

5.3 Experimental Results

Performance on the CommonGoal Task. For the settings with different agent numbers on the CommonGoal-CollaVN sub-dataset, all the benchmarked models are trained on the corresponding train sets and tested on the corresponding test sets. The results are summarized in Table 2. As seen, our model outperforms all the other baselines across all the metrics. Interestingly, the IL-based baseline even shows performance comparable to a MARL baseline, i.e., MARL w/o com, revealing the difficulty of this task.

Performance on the SpecificGoal Task. Likewise, for different agent numbers on the SpecificGoal-CollaVN sub-dataset, we train and evaluate the models on the corresponding train and test sets. From Table 3 we find that our method still achieves better performance. However, compared with the results in Table 2, all the methods suffer a performance drop, which indicates that it is harder to learn communication when agents perform different tasks.

Performance on the Ad-hoCoop Task. For the 2→3 setting in the Ad-Hoc-CollaVN dataset, each train scene contains two agents while each test scene has three agents. Similarly, for the 3→2 setting, each train scene contains three agents while each test scene has two agents. For both settings, all the methods are trained and tested on the corresponding train and test sets. The results in Table 4 again show that our model performs better in the two ad-hoc team-play settings, mainly due to our fully decentralized learning strategy.

Effect of Memory-Based Communication. Compared with MARL w/o mem, our model shows significantly better performance across all three MAVN settings, while MARL w/o mem itself yields only limited improvements over MARL w/o com. These observations suggest the importance of long-term memory in MAVN.

6 Conclusion

A new dataset was introduced for multi-agent collaborative navigation in complex visual environments. A memory-based communication model was presented to enable the reuse of past communication information and facilitate cooperation. Our experiments showed that there is still large room for further improvement. We expect this work to inspire more future efforts in this promising direction.


Appendix A Details of CollaVN

Our CollaVN dataset has three sub-datasets, corresponding to the three MAVN tasks: CommonGoal, SpecificGoal, and Ad-HoCoop.

CommonGoal-CollaVN: It contains three subsets, created for different agent numbers (N = 2, 3, 4); the agents in each episode are assigned the same target image. For each subset, we randomly sample 5K episodes per training scene. For each scene in the val/test split, we generate 0.5K/1K episodes per difficulty configuration, i.e., easy (1.5-3 m), medium (3-5 m), and hard (5-10 m). The distance range means that the initial start and target positions of all agents are within this limit. Finally, each subset has 125K/7.5K/15K samples for train/val/test. Detailed statistics of the CommonGoal-CollaVN sub-dataset are shown in Table 5.

SpecificGoal-CollaVN: Three subsets are created for different agent numbers (N = 2, 3, 4), but the agents in each episode are assigned different target images. Similarly, each subset has 125K/7.5K/15K samples for train/val/test. Detailed statistics of the SpecificGoal-CollaVN sub-dataset are shown in Table 6.

Ad-Hoc-CollaVN: Two subsets are created for the two ad-hoc team-size-change situations (i.e., 2→3 and 3→2), and each has 125K/7.5K/15K samples for train/val/test. The agents in each episode are assigned the same target image. For 2→3, the train set is the one in CommonGoal-CollaVN with N = 2, but the val and test sets are newly generated: each training sample contains 2 agents while each val/test sample has 3 agents. The 3→2 subset is built in a similar way. Detailed statistics of the Ad-Hoc-CollaVN sub-dataset are shown in Table 7.


Difficulty Train Val Test
2 Easy (1.5-3m) 125K 2.5K 5K
Medium (3-5m) 125K 2.5K 5K
Hard (5-10m) 125K 2.5K 5K
3 Easy (1.5-3m) 125K 2.5K 5K
Medium (3-5m) 125K 2.5K 5K
Hard (5-10m) 125K 2.5K 5K
4 Easy (1.5-3m) 125K 2.5K 5K
Medium (3-5m) 125K 2.5K 5K
Hard (5-10m) 125K 2.5K 5K
Table 5: Dataset Statistics on CommonGoal-CollaVN.


Difficulty Train Val Test
2 Easy (1.5-3m) 125K 2.5K 5K
Medium (3-5m) 125K 2.5K 5K
Hard (5-10m) 125K 2.5K 5K
3 Easy (1.5-3m) 125K 2.5K 5K
Medium (3-5m) 125K 2.5K 5K
Hard (5-10m) 125K 2.5K 5K
4 Easy (1.5-3m) 125K 2.5K 5K
Medium (3-5m) 125K 2.5K 5K
Hard (5-10m) 125K 2.5K 5K
Table 6: Dataset Statistics on SpecificGoal-CollaVN.


Difficulty Train Val Test
Add (2→3) Easy (1.5-3m) 125K 2.5K 5K
Medium(3-5m) 125K 2.5K 5K
Hard(5-10m) 125K 2.5K 5K
Remove (3→2) Easy (1.5-3m) 125K 2.5K 5K
Medium(3-5m) 125K 2.5K 5K
Hard(5-10m) 125K 2.5K 5K
Table 7: Dataset Statistics on Ad-Hoc-CollaVN.

Appendix B Implementation Details

b.1 Reward Structure

During training, rewards are provided to all agents individually at every step. The reward includes two terms: i) a goal-driven term $\alpha \cdot \Delta d$, where $d$ is the geodesic distance between the agent's current location and the target location, $\Delta d$ is the distance change after taking a navigation action at this step, and $\alpha$ is the weight of the goal-driven reward; ii) a penalty of −0.05 whenever the agent collides with the environment or other agents. The maximum total reward achievable by a single agent is $\alpha \cdot d_0$, where $d_0$ is the initial distance between the agent and the goal, i.e., reaching the goal through the shortest path without any collision. Our models are trained to maximize the expected discounted cumulative reward with a discount factor $\gamma$.
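The two reward terms can be sketched as follows; `alpha = 1.0` is an illustrative weight, not necessarily the paper's value:

```python
def step_reward(prev_dist, cur_dist, collided, alpha=1.0, penalty=0.05):
    """Dense per-step reward: weighted reduction in geodesic distance to
    the goal, minus a constant penalty on collision."""
    reward = alpha * (prev_dist - cur_dist)
    if collided:
        reward -= penalty
    return reward

# A collision-free run along the shortest path accumulates alpha * d0 in total,
# since the per-step distance reductions telescope to the initial distance.
dists = [5.0, 4.0, 2.5, 1.0, 0.0]  # geodesic distance to goal after each step
total = sum(step_reward(a, b, collided=False) for a, b in zip(dists, dists[1:]))
print(total)  # 5.0 == alpha * initial distance
```

Note that a detour cancels itself out in this telescoping sum (distance lost is later regained), so only collisions and the discount factor penalize inefficient paths.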

Figure 3: Trajectories of two or three agents on the CommonGoal task (Top: two agents; Bottom: three agents).


Indicator N=2 (3-5m) N=3 (3-5m) N=4 (3-5m)
SR 16.55 17.50 18.40
SPL 0.08 0.08 0.09
SSR 0.41 0.45 0.48
Table 8: SPL vs. SSR. Performance of our model on the CommonGoal task with medium difficulty and different team sizes (N = 2, 3, 4).
Figure 4: Trajectory of a single agent (same initialization as the two-agent CommonGoal task).
Figure 5: Trajectories of two or three agents on the SpecificGoal task (Top: two agents; Bottom: three agents).
Figure 6: Trajectories for the add and remove settings of the Ad-HoCoop task (Top: remove (3→2); Bottom: add (2→3)).

b.2 Hyperparameter Details

We use PyTorch for implementing and training our model. For map building, we follow [chaplot2020learning] and maintain a FIFO memory of size 5000 for training. After one step in each thread, we perform 10 updates to the map building module with a batch size of 64.

During training, we train our agents with 25 parallel threads, each thread using one scene from the training set. We use Proximal Policy Optimization (PPO) [schulman2017proximal] with 5 mini-batches and 8 epochs per PPO update. Our PPO implementation is based on [pytorchrl]. We use the Adam optimizer with a learning rate of 0.00001, a discount factor of γ = 0.99, an entropy coefficient of 0.001, and a value loss coefficient of 0.5 for training.
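Collected in one place, the reported training hyperparameters look like the following configuration sketch (the key names are our own, not those of the released code):

```python
# Hyperparameters as reported in the text above.
ppo_config = {
    "parallel_threads": 25,       # one training scene per thread
    "ppo_mini_batches": 5,
    "ppo_epochs": 8,
    "optimizer": "Adam",
    "learning_rate": 1e-5,
    "gamma": 0.99,                # discount factor
    "entropy_coef": 0.001,
    "value_loss_coef": 0.5,
    "map_fifo_size": 5000,        # FIFO memory for the map building module
    "map_updates_per_step": 10,
    "map_batch_size": 64,
}
print(ppo_config["learning_rate"])
```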

Appendix C Evaluation Details

Why SSR instead of SPL? SPL was introduced in [anderson2018evaluation]:

$$\mathrm{SPL} = \frac{1}{NE}\sum_{n=1}^{N}\sum_{e=1}^{E} S_{n,e}\,\frac{l_{n,e}}{\max(p_{n,e},\, l_{n,e})},$$

where $N$ is the number of agents, $E$ the number of test episodes, $S_{n,e}$ a binary indicator of success in episode $e$ for agent $n$, $l_{n,e}$ the geodesic distance from the starting position to the goal, and $p_{n,e}$ the length of the path actually taken by the agent.

The value of $l_{n,e}/\max(p_{n,e}, l_{n,e})$ is always at most one. In a long-horizon or difficult navigation task, the success rate is usually low and the navigation paths are very long; in this case SPL becomes very small and loses discriminability. This phenomenon can be observed in Table 8, which shows the performance of our model on the CommonGoal task with medium difficulty and different team sizes: SPL barely changes while the success rate changes.

To resolve this problem, we introduce SSR (Success weighted by Step Ratio):

$$\mathrm{SSR} = \frac{1}{NE}\sum_{n=1}^{N}\sum_{e=1}^{E} S_{n,e}\,\frac{T}{t_{n,e}},$$

where $T$ denotes the allowed maximum time step and $t_{n,e}$ the number of navigation steps used by agent $n$ in episode $e$. As shown in Table 8, SSR is more discriminative than SPL.
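Both metrics are straightforward to compute. The SSR weight `max_steps / steps_used` below is our reading of "step ratio" (consistent with the reported SSR values exceeding the SPL values), so treat it as an assumption rather than the official definition:

```python
import numpy as np

def spl(success, shortest, taken):
    """Success weighted by Path Length, averaged over agent-episodes."""
    success, shortest, taken = map(np.asarray, (success, shortest, taken))
    return np.mean(success * shortest / np.maximum(shortest, taken))

def ssr(success, steps_used, max_steps):
    """Success weighted by Step Ratio (assumed weight: max_steps / steps)."""
    success, steps_used = map(np.asarray, (success, steps_used))
    return np.mean(success * max_steps / steps_used)

# One success that took twice the shortest path, and one failure:
print(spl([1, 0], [4.0, 4.0], [8.0, 8.0]))    # 0.25
print(ssr([1, 0], [250.0, 500.0], 500.0))     # 1.0
```

Under this reading, SPL is bounded by the success rate while SSR can exceed it when successful episodes finish quickly, which explains why SSR retains discriminability on hard, long-horizon episodes.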

Appendix D Qualitative Results

In this section, we provide more qualitative results of our model.

d.1 Trajectory Result

Visualizations of navigation trajectories for the three task settings with different team sizes are shown in Fig. 3, Fig. 5, and Fig. 6. The start and target locations are represented by blue and orange circles, respectively. From these figures, we observe that agents trained with our method learn a robust policy to collaboratively explore the environment and successfully reach the target locations. From Fig. 6, we find that our method adapts well to different team sizes at test time.

d.2 Effect of Communication

To show the effectiveness of communication in the MAVN task, we show the trajectory of a single-agent system in Fig. 4. We deploy this agent with a particular initialization of an episode, i.e., the scene, start location, and target location are the same as in the two-agent CommonGoal task (top of Fig. 3). As shown in Fig. 4, the single agent fails to reach the target position within the maximum number of steps (it spends too much time exploring a wrong room). In the CommonGoal task, by contrast, implicit communication allows agents to largely avoid such wrong exploration based on the experience of other agents, and to complete the task faster.