Navigating in a previously unseen environment requires several abilities, among which is mapping, i.e. the capacity to build a representation of the environment and its affordances. The agent can then reason on this map and act efficiently towards its goal. How biological species map their environment is still an open area of research [peer2020structuring, warren2017wormholes]. In robotics, spatial representations have taken diverse forms, for instance metric maps [elfes1989using, bresson2017simultaneous, durrant2006simultaneous] or topological maps [shatkay1997learning, thrun1998learning], allocentric or egocentric. Most of these forms have lately been given neural counterparts, namely metric neural maps [DBLP:conf/iclr/ParisottoS18, DBLP:conf/pkdd/BeechingD0020, Henriques_2018_CVPR, gupta2017cognitive] or neural topological maps [beeching2020learning, savinov2018semiparametric, Chaplot_2020_CVPR], learned from RL or with supervision.
In this work, we explore whether the emergence of mapping and spatial reasoning capabilities can be favored by the use of spatial auxiliary tasks that are related to a downstream objective. We target the problem of Multi-Object Navigation, where an agent must reach a sequence of specified objects in a particular order within a previously unknown environment. Such a task is interesting because it requires an agent to recall the position of previously encountered objects it will have to reach later in the sequence.
We take inspiration from the methodology in behavioral studies of human spatial navigation [ekstrom2018human]. Experiments with human subjects aim at evaluating the spatial knowledge they acquire when navigating a given environment. In [ekstrom2018human], two important measures are referred to as the sense of direction and the judgement of relative distance. Regarding knowledge of direction, a well-known task is scene- and orientation-dependent pointing (SOP), where participants must point to a specified location that is not currently within their field of view. Being able to assess one's position relative to other objects in the world is critical to navigate properly, and disorientation is considered a main issue. In addition to direction, evaluating the distance to landmarks is also of high importance.
We conjecture that an agent able to estimate the location of target objects relative to its current pose will implicitly extract more useful representations of the environment and navigate more efficiently. Classical methods based on RL rely on the capacity of the learning algorithm to extract mapping strategies from reward alone. While this has been shown to be possible in principle [DBLP:conf/pkdd/BeechingD0020], we will show that the emergence of a spatial mapping strategy is significantly boosted through auxiliary tasks, which require the agent to continuously reason on the presence of targets w.r.t. its viewpoint (see Figure 1).
To this end, we propose two auxiliary tasks, namely estimating the relative direction and the Euclidean distance to the current target object, conditioned on whether it has already been observed since the beginning of the episode. If the target object is visible in the current observation, these tasks help train the agent to recognize it (discover its affordance) and estimate its relative position; if the target was seen in the past, they encourage the agent to build up spatial memory.
We propose the following contributions: (i) we show that our proposed auxiliary tasks improve the performance of previous baselines by a large margin, even reaching the performance of (incomparable) agents using ground-truth oracle maps as input; (ii) we show the consistency of the gains over different agents with multiple inductive biases, ranging from simple recurrent models to agents structured with projective geometry. This raises the question whether spatial inductive biases are required or whether spatial organization can be learned; (iii) the proposed method reaches SOTA performance on the Multi-ON task, and corresponds to the winning entry of the CVPR 2021 Multi-ON challenge.
2 Related Work
Visual navigation — has been extensively studied in robotics [bonin2008visual, thrun2005probabilistic]. An agent is placed in an unknown environment and must solve a specified task based on visual input, where [bonin2008visual] distinguish map-based and map-less navigation. Recently, many navigation problems have been posed as goal-reaching tasks [DBLP:journals/corr/abs-1807-06757]. The nature of the goal, its regularities in the environment and how it is communicated to the agent have a significant impact on the required reasoning capacities of the agent [DBLP:conf/icpr/BeechingD0020]. In PointGoal [DBLP:journals/corr/abs-1807-06757], an agent must reach a location specified as relative coordinates, while ObjectGoal [DBLP:journals/corr/abs-1807-06757] requires the agent to find an object of a particular semantic category. Recent literature [DBLP:conf/icpr/BeechingD0020, DBLP:conf/nips/WaniPJCS20] introduced new navigation tasks with two important characteristics: (i) their sequential nature, i.e. an episode is composed of a sequence of goals to reach, and (ii) the use of external objects as target objectives. Multi-Object Navigation (Multi-ON) [DBLP:conf/nips/WaniPJCS20] is a task requiring the agent to sequentially retrieve objects, but unlike the Ordered K-item task [DBLP:conf/icpr/BeechingD0020], the order is not fixed between episodes. A sequential task is interesting as it requires the agent to remember and to map potential objects it might have seen while exploring the environment, as reasoning on them might be required at a later stage. Moreover, using external objects as goals prevents the agent from leveraging knowledge about the environment layouts, thus focusing solely on memory. Exploration is another targeted capacity as objects are placed randomly within environments. For all these reasons, our work focuses on the new and challenging Multi-ON task [DBLP:conf/nips/WaniPJCS20].
Learning-free navigation — A recurrent pattern in methods tackling visual navigation [bonin2008visual, thrun2005probabilistic] is modularity, with different computational entities each solving a particular sub-part of the problem. One module might map the environment, another one localize the agent within this map, and a third one perform planning. Low-level control is also often addressed by a specialized sub-module. Known examples are based on Simultaneous Localization and Mapping (SLAM) [bresson2017simultaneous, durrant2006simultaneous].
Learning-based navigation — The task of navigation can be framed as a learning problem, leveraging the abilities of deep networks to extract regularities from a large amount of training data. Formalisms range from Deep Reinforcement Learning (DRL) [DBLP:conf/iclr/MirowskiPVSBBDG17, DBLP:conf/iclr/JaderbergMCSLSK17, DBLP:conf/icra/ZhuMKLGFF17] to (supervised) Imitation Learning [DBLP:conf/nips/DingFAP19, watkins2019learning, wu2020towards]. Such agents can be reactive [DBLP:conf/iclr/DosovitskiyK17, DBLP:conf/icra/ZhuMKLGFF17], but recent work tends to augment agents with memory, which is a key component, in particular in partially-observable environments [SDMIA15-Hausknecht, pmlr-v48-oh16]. It can take the form of recurrent units [DBLP:journals/neco/HochreiterS97, cho-etal-2014-learning], or become a dedicated part of the system as in [DBLP:journals/corr/GravesWD14]. In the context of navigation, memory can fulfill multiple roles: holding a latent map-like representation of the spatial properties of the environment, as well as general high-level information related to the task (“did I already see this object?”). Common representations are metric [DBLP:conf/iclr/ParisottoS18, DBLP:conf/pkdd/BeechingD0020, Henriques_2018_CVPR, gupta2017cognitive], or topological [beeching2020learning, savinov2018semiparametric, Chaplot_2020_CVPR]. Other work reduces inductive biases by using Transformers [vaswani2017attention] as a memory mechanism on episodic data [ritter2021rapid, Fang_2019_CVPR].
In contrast to end-to-end training, engineering stack approaches decompose the learning pipeline into sub-modules [Chaplot2020Learning, Chaplot_2020_CVPR] trained simultaneously with supervised learning [Chaplot_2020_CVPR] or a combination of supervised, reinforcement and imitation learning [Chaplot2020Learning]. Somewhat related to our work, in [Chaplot_2020_CVPR], a dedicated semantic score prediction module is proposed, which estimates the direction towards a goal and is explicitly used to decide which previously unexplored ghost node to visit next inside a topological memory. In contrast, in our work we propose to predict spatial metrics such as relative direction as an auxiliary objective to shape the learnt representations, instead of explicitly using those predictions at inference time.
Learning vs. learning-free — The differences in navigation performance between SLAM-based and learning-based agents have been studied before [DBLP:journals/corr/abs-1901-10915, DBLP:journals/corr/abs-1907-11770, Savva_2019_ICCV]. Even though trained agents begin to perform better than classical methods in recent studies, arguments regarding the efficiency of SLAM-based methods still hold [DBLP:journals/corr/abs-1901-10915, DBLP:journals/corr/abs-1907-11770]. Hybrid methods are frequently suggested [Chaplot2020Learning, Chaplot_2020_CVPR]. In contrast, we explore whether mapping strategies can emerge naturally in end-to-end training through additional pretext tasks.
Auxiliary tasks — can be combined with any downstream objective to guide a learning model to extract more useful representations, as proposed in [DBLP:conf/iclr/MirowskiPVSBBDG17, DBLP:conf/iclr/JaderbergMCSLSK17] to improve both data efficiency and overall performance. [DBLP:conf/iclr/MirowskiPVSBBDG17] predict loop closure and reconstruct depth observations; Lample et al. [lample2017playing] also augment the DRQN model [SDMIA15-Hausknecht] with predictions of game features in first-person shooter games. A potential drawback is the need for privileged information, which, however, is readily available in simulated environments [kempka2016vizdoom, beattie2016deepmind]. This is also the case in our work, where during training we access information on explored areas and on the positions of objects and of the agent, which, of course, is also used for reward generation in classical RL methods.
In [DBLP:conf/iclr/JaderbergMCSLSK17], unsupervised objectives are introduced, such as pixel or action features and reward prediction. [aux_pointgoal_custom] introduce self-supervised auxiliary tasks to speed up training on PointGoal. They augment the base agent from [wijmans2019dd] with an inverse dynamics estimator as in [pathak2017curiosity], a temporal distance predictor, and an action-conditional contrastive module, which must differentiate between positives, i.e. real observations that occur after the given sequence, and negatives, i.e. observations sampled from other timesteps. [ye2021auxiliary] introduce auxiliary tasks for ObjectGoal: building on top of [aux_pointgoal_custom], they add action distribution prediction, generalized inverse dynamics and coverage prediction tasks.
Our work belongs to the group of supervised auxiliary tasks, with an application to complex and photo-realistic 3D environments, which was not the case for concurrent methods. We also specifically target the learning of mapping and spatial reasoning through additional supervision, which has not been the focus of previous approaches.
3 Learning to map
We target the Multi-ON task [DBLP:conf/nips/WaniPJCS20], where an agent is required to reach a sequence of target objects in a certain order, and which was used for a recent challenge organized in the context of the CVPR 2021 Embodied AI Workshop. Compared to much easier tasks like PointGoal or (Single) Object Navigation, Multi-ON requires more difficult reasoning capacities, in particular mapping the position of an object once it has been seen. The following capacities are necessary to ensure optimal performance: (i) mapping the object, i.e. storing it in a suitable latent memory representation; (ii) retrieving this location on request and using it for navigation and planning. This also requires deciding when to retrieve this information, i.e. solving a correspondence problem between sub-goals and memory representation.
The agent deals with sequences of objects that are randomly placed within the environment. At each time step, it only knows about the next object to find, which is updated when reached. The episode lasts until either the agent has found all objects in the correct order or the time limit is reached.
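The episode protocol described above can be illustrated with a small state tracker. This is only a sketch under stated assumptions: the class name `EpisodeState`, its fields and the `reached_current_goal` flag are hypothetical and do not come from the Multi-ON code base.

```python
from dataclasses import dataclass


@dataclass
class EpisodeState:
    """Tracks progress through an ordered goal sequence (illustrative only)."""
    goals: list          # ordered target object classes for this episode
    max_steps: int       # time limit (hypothetical parameter name)
    goal_index: int = 0  # index of the only target the agent knows about
    steps: int = 0

    @property
    def current_goal(self):
        # The agent only ever observes the next object to find.
        if self.goal_index < len(self.goals):
            return self.goals[self.goal_index]
        return None

    @property
    def done(self):
        # The episode ends when all objects were found in order
        # or when the time limit is reached.
        all_found = self.goal_index >= len(self.goals)
        return all_found or self.steps >= self.max_steps

    def step(self, reached_current_goal):
        """Advance one time step and update the target once it is reached."""
        if self.done:
            raise RuntimeError("episode already finished")
        self.steps += 1
        if reached_current_goal:
            self.goal_index += 1
```

The tracker exposes a single `current_goal` at a time, which mirrors the fact that the agent is never told about later objects in the sequence.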
Inductive agent biases — Our contribution is independent of the actual inductive biases used for agents. We therefore explored different baseline agents with different architectures, as selected in [DBLP:conf/nips/WaniPJCS20]. The considered agents share a common base shown in Fig. 2, which extracts information from the current RGB-D observation. Variants also keep a global map that is first transformed into an egocentric representation centered around the agent’s position, and possibly embeddings of the target object class and the previous actions. The vector representations are concatenated and fed to a GRU [cho-etal-2014-learning] unit that integrates temporal information, and whose output serves as input to actor and critic heads. These two modules respectively predict a distribution over actions $\pi(a_t \mid s_t)$ conditioned on the current state and the state-value function $V^{\pi}(s_t)$, i.e. the expected cumulative reward starting in the current state $s_t$ and following policy $\pi$. The Actor-Critic algorithm is a baseline RL approach [sutton2018reinforcement]. We consider different variants which have been explored in [DBLP:conf/nips/WaniPJCS20], but which have been introduced in prior work (numbers ➀➁➂➃ correspond to choices in Figure 2):
NoMap ➀ — is a recurrent GRU baseline without any spatial inductive bias.
ProjNeuralMap ➀➁ [Henriques_2018_CVPR, DBLP:conf/pkdd/BeechingD0020] — is a neural network structured with spatial information and projective geometry, in particular inverse 3D projection of the observed image features using a calibrated camera and depth information. Note that the notion of a “map” in this model refers to a network structure only, i.e. the map puts constraints on how input pixels are mapped to feature cells.
OracleMap ➀➂ — has access to a ground-truth grid map of the environment with channels dedicated to occupancy information and others to the presence of objects and their classes. As shown in Fig. 2, the map is cropped and centered around the agent to produce an egocentric map as input to the model.
OracleEgoMap ➀➂➃ — gets the same egocentric map as OracleMap, but with object channels only, and revealed only in regions that have already been within the agent’s field of view since the beginning of the episode. This variant corresponds to an agent capable of perfect mapping: no information gets lost, but only observed information is used.
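The two oracle inputs can be illustrated on a toy grid. The following is a minimal sketch, assuming the map is a plain list of lists, ignoring the rotation to the agent's heading, and using hypothetical function names (`egocentric_crop`, `reveal_observed`); it is not the implementation used for the agents above.

```python
def egocentric_crop(grid, agent_rc, half_size, fill=0):
    """Crop a (2 * half_size + 1)-wide square window centred on the agent.

    Cells that fall outside the global grid are padded with `fill`.
    """
    rows, cols = len(grid), len(grid[0])
    r0, c0 = agent_rc
    window = []
    for r in range(r0 - half_size, r0 + half_size + 1):
        row = []
        for c in range(c0 - half_size, c0 + half_size + 1):
            inside = 0 <= r < rows and 0 <= c < cols
            row.append(grid[r][c] if inside else fill)
        window.append(row)
    return window


def reveal_observed(object_map, seen_mask, unknown=0):
    """Keep object labels only in cells the agent has already observed."""
    return [
        [cell if seen else unknown for cell, seen in zip(obj_row, seen_row)]
        for obj_row, seen_row in zip(object_map, seen_mask)
    ]
```

In this sketch, an OracleMap-style input corresponds to `egocentric_crop` applied to the full map, while an OracleEgoMap-style input would first pass the object channel through `reveal_observed` with a mask of previously seen cells.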
3.1 Learning to map objects with auxiliary tasks
We introduce auxiliary tasks, in addition to the classical RL objectives and formulated as classification problems, which require the agent to predict information on object affordances that were in its observation history in the current episode. To this end, the base model is augmented with two classification heads (Fig. 2) taking as input the contextual representation produced by the GRU unit:
Direction — the agent predicts the relative direction of the current target object, only if it has already been within the agent’s field of view in the observation history of the current episode (Figure 1 left). The ground-truth direction towards the goal is first computed as follows,
$$\phi_t = \operatorname{atan2}\left(o_{y,t} - e_y,\; o_{x,t} - e_x\right),$$
where $e = [e_x, e_y]$ (“ego”) are the coordinates of the agent on the grid and $o_t = [o_{x,t}, o_{y,t}]$ are the coordinates of the center of the target object at time $t$. As the ground-truth grid is egocentric, the position of the agent is fixed, i.e. at the center of the grid, while the target object gets different coordinates with time. The angles are kept in the interval $[0^\circ, 360^\circ)$ and then discretized into $K$ bins, giving the angle class. The ground-truth one-hot vector is denoted $\phi_t^*$. At time instant $t$, the probability distribution over classes $\hat{\phi}_t$ is predicted from the GRU hidden state $h_t$ through an MLP as $\hat{\phi}_t = f_\phi(h_t; \theta_\phi)$ with parameters $\theta_\phi$.
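The construction of the direction label can be sketched as follows. This is an illustration under assumptions: the axis convention of `atan2`, the function names and the `num_bins` parameter are hypothetical choices, and the number of bins is left to the caller.

```python
import math


def direction_class(agent_xy, target_xy, num_bins):
    """Index of the angular bin containing the agent-to-target direction."""
    dx = target_xy[0] - agent_xy[0]
    dy = target_xy[1] - agent_xy[1]
    # atan2 returns an angle in (-180, 180] degrees; wrap it into [0, 360).
    angle = math.degrees(math.atan2(dy, dx)) % 360.0
    bin_width = 360.0 / num_bins
    # The min() guards against floating-point values landing on 360.
    return min(int(angle // bin_width), num_bins - 1)


def one_hot(index, num_classes):
    """Ground-truth vector with a single 1 at the class index."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]
```

For example, with four bins a target located up and to the left of the agent in this coordinate convention falls into bin 1, and `one_hot(1, 4)` gives the corresponding supervision target.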
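Distance — The second task requires the prediction of the Euclidean distance in the egocentric map between the center box, i.e. position of the agent, and the mean of the grid boxes containing the target object (Figure 1 right),
$$d_t = \lVert o_t - e \rVert_2,$$
where $e$ denotes the position of the agent on the grid and $o_t$ the mean position of the grid boxes containing the target object at time $t$. Again, distances are discretized into $L$ bins, with $d_t^*$ as ground-truth one-hot vector, and at time instant $t$, the probability distribution over classes $\hat{d}_t$ is predicted from the hidden state $h_t$ through an MLP as $\hat{d}_t = f_d(h_t; \theta_d)$ with parameters $\theta_d$.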
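The distance label can be sketched in the same spirit. The function name, the `bin_width` and `num_bins` parameters, and the clamping of far targets into the last bin are assumptions made for this illustration.

```python
import math


def distance_class(agent_xy, target_cells, bin_width, num_bins):
    """Bin the Euclidean distance from the agent to the mean of the target cells."""
    # Mean position of all grid cells occupied by the target object.
    cx = sum(x for x, _ in target_cells) / len(target_cells)
    cy = sum(y for _, y in target_cells) / len(target_cells)
    dist = math.hypot(cx - agent_xy[0], cy - agent_xy[1])
    # Assumption: distances beyond the last bin are clamped into it.
    return min(int(dist // bin_width), num_bins - 1)
```

With `bin_width` set to one grid cell, the class index is simply the integer part of the distance in cells.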
Training — Following [DBLP:conf/nips/WaniPJCS20], all agents are trained with PPO [schulman2017proximal] and a reward composed of three terms,
$$r_t = \mathbb{1}_t^{\text{reached}} \cdot r^{\text{goal}} + r_t^{\text{closer}} + r^{\text{time-penalty}},$$
where $\mathbb{1}_t^{\text{reached}}$ is the indicator function whose value is $1$ if the found action was called while being close enough to the target, and $0$ otherwise. $r_t^{\text{closer}}$ is a reward shaping term equal to the decrease in geodesic distance to the next goal compared to the previous timestep. Finally, $r^{\text{time-penalty}}$ is a negative slack reward to force the agent to take paths as short as possible.
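The three reward terms can be combined as in the sketch below. The function name and all parameters (`success_radius`, `goal_reward`, `slack_reward`) are hypothetical; their values are not taken from the text and must be supplied by the caller.

```python
def step_reward(found_called, dist_prev, dist_now,
                success_radius, goal_reward, slack_reward):
    """Reward for one time step, as the sum of three terms.

    dist_prev and dist_now are geodesic distances to the current goal
    at the previous and the current time step.
    """
    # Goal term: only paid when `found` is called close enough to the target.
    reached = found_called and dist_now <= success_radius
    goal_term = goal_reward if reached else 0.0
    # Shaping term: decrease in geodesic distance since the previous step.
    closer_term = dist_prev - dist_now
    # Slack term: a negative constant that penalizes long paths.
    return goal_term + closer_term + slack_reward
```

Moving away from the goal makes the shaping term negative, so wandering is penalized twice: once through shaping and once through the slack reward.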
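PPO alternates between sampling and optimization phases. At sampling time $k$, a set $\mathcal{D}_k$ of trajectories $\tau$ with length $T$ is collected using the latest policy $\pi_{\theta_k}$, where $T$ is smaller than the length of a full episode. The base PPO loss is then given as,
$$\mathcal{L}_{PPO} = -\frac{1}{|\mathcal{D}_k|\, T} \sum_{\tau \in \mathcal{D}_k} \sum_{t=0}^{T-1} \min\left( \rho_t(\theta)\, \hat{A}_t,\; \operatorname{clip}\left(\rho_t(\theta), 1-\epsilon, 1+\epsilon\right) \hat{A}_t \right),$$
where $\hat{A}_t$ is an estimate of the advantage function at time $t$, $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_k}(a_t \mid s_t)$ is the probability ratio between the updated and old versions of the policy, and $\epsilon$ is the clipping parameter. We did not make the dependency of states and actions on the trajectory explicit in the notation.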
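For illustration, the standard clipped PPO surrogate of [schulman2017proximal] can be evaluated on a batch of samples as below. This is a plain-Python sketch without automatic differentiation; the function name and the default `clip_eps=0.2` are assumptions, not values reported in this work.

```python
import math


def ppo_clip_loss(new_logps, old_logps, advantages, clip_eps=0.2):
    """Negative clipped surrogate, averaged over the sampled time steps."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        # Probability ratio between the updated and the old policy.
        ratio = math.exp(new_lp - old_lp)
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Pessimistic bound: take the smaller of the two surrogate terms.
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)
```

When the policy has not changed, every ratio equals one and the loss reduces to the negative mean advantage; the clipping only takes effect once the updated policy moves away from the one used for sampling.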
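Both direction and distance predictions are supervised with cross-entropy losses from ground truth values $\phi_t^*$ and $d_t^*$, respectively, as
$$\mathcal{L}_\phi = -\sum_{t} \mathbb{1}_t^{\text{seen}} \sum_{c=1}^{K} \phi_{t,c}^* \log \hat{\phi}_{t,c}, \qquad \mathcal{L}_d = -\sum_{t} \mathbb{1}_t^{\text{seen}} \sum_{c=1}^{L} d_{t,c}^* \log \hat{d}_{t,c},$$
where $\hat{\phi}_t$ and $\hat{d}_t$ are the predicted distributions over the $K$ direction classes and the $L$ distance classes, and $\mathbb{1}_t^{\text{seen}}$ is the binary indicator function specifying whether the current target object has already been seen in the current episode ($\mathbb{1}_t^{\text{seen}} = 1$), or not ($\mathbb{1}_t^{\text{seen}} = 0$).
The auxiliary losses $\mathcal{L}_\phi$ and $\mathcal{L}_d$ are added to the PPO loss $\mathcal{L}_{PPO}$ as follows,
$$\mathcal{L}_{tot} = \mathcal{L}_{PPO} + \lambda_\phi \mathcal{L}_\phi + \lambda_d \mathcal{L}_d,$$
where $\lambda_\phi$ and $\lambda_d$ weight the relative importance of both auxiliary losses.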
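The masking of the auxiliary losses and their combination with the RL objective can be sketched as follows. Function and argument names are hypothetical, and summing (rather than averaging) over time steps is an assumption of this sketch.

```python
import math


def masked_cross_entropy(probs, labels, seen):
    """Cross-entropy summed over the time steps where the target was seen.

    probs:  per-step predicted class distributions (lists of floats)
    labels: per-step ground-truth class indices
    seen:   per-step flags, True once the current target has been observed
    """
    loss = 0.0
    for p, y, s in zip(probs, labels, seen):
        if s:  # steps with an unseen target contribute nothing
            loss -= math.log(p[y])
    return loss


def total_loss(ppo_loss, dir_loss, dist_loss, w_dir, w_dist):
    """Weighted sum of the RL loss and the two auxiliary losses."""
    return ppo_loss + w_dir * dir_loss + w_dist * dist_loss
```

Because unseen targets are masked out, the agent is never asked to guess the location of an object it has had no opportunity to observe.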
|Success||Progress||SPL||PPL|
|51.3 ± 7.8||65.5 ± 5.8||35.9 ± 5.3||46.0 ± 4.0|
|36.0 ± 6.0||51.2 ± 5.0||24.2 ± 3.9||34.9 ± 3.4|
4 Experimental Results
We focus on the 3-ON version of the Multi-ON task, where the agent deals with sequences of 3 objects. The time limit is fixed to environment steps, and there are object classes. The agent receives an RGB-D observation and the one-in-K encoded class of the current target object within the sequence. The discrete action space is composed of four actions: move forward , turn left , turn right , and found, which signals that the agent considers the current target object to be reached. As the aim of the task is to focus on evaluating the importance of mapping, a perfect localization of the agent was assumed as in the protocol proposed in [DBLP:conf/nips/WaniPJCS20].
Dataset and metrics — we used the standard train/val/test split over scenes from the Matterport [chang2018matterport3d] dataset. episodes are sampled from the val and test splits for model validation and testing, respectively. We consider standard metrics of the field as given in [DBLP:conf/nips/WaniPJCS20]:
Success: percentage of successful episodes (the agent reaches all three objects in the right order within the time limit).
Progress: percentage of objects successfully found in an episode.
SPL: Success weighted by Path Length. This extends the original SPL metric from [DBLP:journals/corr/abs-1807-06757] to the sequential multi-object case.
PPL: Progress weighted by Path Length.
Note that for an object to be considered found, the agent must take the found action while being within m of the current goal. The episode ends immediately if the agent calls found in an incorrect location. For more details, we refer to [DBLP:conf/nips/WaniPJCS20].
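To make the four metrics concrete, the sketch below computes them for a single episode. It assumes the usual "score times shortest path over the larger of agent path and shortest path" form of SPL, and that PPL uses the shortest path through the goals actually found; the exact definitions are those of [DBLP:conf/nips/WaniPJCS20], and all names here are hypothetical.

```python
def path_weighted(score, shortest_path, agent_path):
    """Weight a score by the ratio of shortest path to the path actually taken."""
    if score == 0.0:
        return 0.0
    return score * shortest_path / max(agent_path, shortest_path)


def episode_metrics(num_found, num_goals, agent_path,
                    shortest_to_all, shortest_to_found):
    """Per-episode Success, Progress, SPL and PPL (fractions in [0, 1]).

    shortest_to_all:   shortest path visiting every goal in order
    shortest_to_found: shortest path visiting only the goals that were found
    """
    success = 1.0 if num_found == num_goals else 0.0
    progress = num_found / num_goals
    return {
        "success": success,
        "progress": progress,
        "spl": path_weighted(success, shortest_to_all, agent_path),
        "ppl": path_weighted(progress, shortest_to_found, agent_path),
    }
```

Averaging these per-episode values over all evaluation episodes, and multiplying by 100, would give percentages comparable in form to those reported in the tables.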
Implementation details — training and evaluation hyper-parameters, as well as architecture details have been taken from [DBLP:conf/nips/WaniPJCS20]. All reported quantitative results are obtained after training runs for each model, during steps (increased from in [DBLP:conf/nips/WaniPJCS20]). Ground-truth direction and distance measures are respectively split into and classes. Indeed, angle bins span , and distance bins span a unit distance on the egocentric map, that is (the maximum distance between center and a grid corner is thus ). Training weights and are both fixed to .
|Agent/Method||Test Challenge (Success, Progress, SPL, PPL)||Test Standard (Success, Progress, SPL, PPL)|
|Ours (Auxiliary losses)||55||67||35||44||57||70||36||45|
|ProjNeuralMap (Challenge baseline)||12||29||6||16|
|NoMap (Challenge baseline)||5||19||3||13|
Do the auxiliary tasks improve the downstream objective? — in Table 1, we study the impact of both auxiliary tasks on the 3-ON benchmark when added to the training objective of ProjNeuralMap, and their complementarity. Direction prediction significantly improves performance, and adding distance prediction further increases the downstream performance by a large margin. Both losses thus have a strong impact and are complementary, confirming the assumption that sense of direction and judgement of relative distance are two key skills for spatially navigating agents.
Table 2 presents results on the test set, confirming the significant impact on each of the considered metrics. ProjNeuralMap with auxiliary losses matches the performance of (incomparable!) OracleMap on Progress. OracleMap has higher PPL and SPL, but also has access to very strong privileged information.
Can an unstructured recurrent agent learn to map? — we explore whether an agent without spatial inductive bias can be trained to learn a mapping strategy, i.e. to encode spatial properties of the environment into its unstructured hidden representation. As shown in Table 2, NoMap indeed strongly benefits from the auxiliary supervision (Success for instance jumping from 7.4% to 22.4%). The improvement is significant, closing the gap with ProjNeuralMap trained with vanilla RL. The quality of extra supervision can thus help to guide the learnt representation, mitigating the need for incorporating inductive biases into neural networks. When both are trained with our auxiliary losses, ProjNeuralMap still outperforms NoMap, indicating that spatial inductive bias still provides an edge.
Comparison with the state-of-the-art — our method corresponds to the winning entry of the CVPR 2021 Multi-ON Challenge organized with the Embodied AI Workshop, shown in Table 3. Compared to the method described above, the challenge entry contained a third auxiliary loss, which required the agent to predict whether an object had been seen or not in the observation history. Post-challenge analysis showed, however, that this third loss did not have an impact. The official challenge ranking is done with PPL, which evaluates correct mapping (quicker and more direct finding of objects), while mapping does not necessarily have an impact on success rate, which can be obtained by pure exploration.
Visualization — Fig. 3 illustrates an example trajectory from the agent trained with the auxiliary supervision in the context of the CVPR 2021 Multi-ON Challenge. The agent starts the episode (Step ) seeing the white object, which is not the first target to reach. It thus starts exploring the environment (Step ), until seeing the pink target object (Step ). Its prediction of the goal distance immediately improves, showing it is able to recognize the object within the RGB-D input. The agent then reaches the target (Step ). The new target is now the white object (that was seen in Step ). While it is still not within its current field of view, the agent can localize it quite precisely (Step ), and go towards the goal (Step ) to call the found action (Step ). The agent must then explore again to find the last object (Step ). When the yellow cylinder is seen, the agent can estimate its relative position (Step ) before reaching it (Step ) and ending the episode.
5 Conclusion
In this work, we propose to guide the learning of mapping and spatial reasoning capabilities by augmenting vanilla RL training objectives with auxiliary tasks. We show that learning to predict the relative direction and distance of already seen target objects significantly improves performance on various metrics and that these gains are consistent over agents with or without spatial inductive bias. We reach SOTA performance on the Multi-ON benchmark. Future work will investigate additional structure, for instance predicting multiple objects.