Hierarchical Image-Goal Navigation in Real Crowded Scenarios

by   Qiaoyun Wu, et al.

This work studies the problem of image-goal navigation, which entails guiding robots with noisy sensors and controls through real crowded environments. Recent fruitful approaches rely on deep reinforcement learning and learn navigation policies in simulation environments that are much simpler in complexity than real environments. Directly transferring these trained policies to real environments can be extremely challenging or even dangerous. We tackle this problem with a hierarchical navigation method composed of four decoupled modules. The first module maintains an obstacle map during robot navigation. The second one predicts a long-term goal on the real-time map periodically. The third one plans collision-free command sets for navigating to long-term goals, while the final module stops the robot properly near the goal image. The four modules are developed separately to suit the image-goal navigation in real crowded scenarios. In addition, the hierarchical decomposition decouples the learning of navigation goal planning, collision avoidance and navigation ending prediction, which cuts down the search space during navigation training and helps improve the generalization to previously unseen real scenes. We evaluate the method in both a simulator and the real world with a mobile robot. The results show that our method outperforms several navigation baselines and can successfully achieve navigation tasks in these scenarios.



There are no comments yet.


page 1

page 3

page 4

page 8

page 9

page 10

page 11


Long-Range Indoor Navigation with PRM-RL

Long-range indoor navigation requires guiding robots with noisy sensors ...

Embodied Visual Navigation with Automatic Curriculum Learning in Real Environments

We present NavACL, a method of automatic curriculum learning tailored to...

A Data-Efficient Framework for Training and Sim-to-Real Transfer of Navigation Policies

Learning effective visuomotor policies for robots purely from data is ch...

Cooperation without Coordination: Hierarchical Predictive Planning for Decentralized Multiagent Navigation

Decentralized multiagent planning raises many challenges, such as adapti...

DIViS: Domain Invariant Visual Servoing for Collision-Free Goal Reaching

Robots should understand both semantics and physics to be functional in ...

HiTMap: A Hierarchical Topological Map Representation for Navigation in Unknown Environments

The ability to autonomously navigate in unknown environments is importan...

Building Intelligent Autonomous Navigation Agents

Breakthroughs in machine learning in the last decade have led to `digita...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Target-driven navigation in unstructured environments remains an open problem in the robotics community. This problem is challenging, especially in settings where it is necessary to proceed without accurate target position information and with only a goal image. Furthermore, if the image-goal navigation is long-range and occurring in crowded scenarios, the agent needs to learn an effective exploration strategy and a robust collision avoidance module in addition to the navigation policy. The ability of a robot to bypass obstacles safely and navigate to specified goals efficiently without a preset map, would have a great impact on robotic applications, including surveillance, inspection, delivery, and cleaning. However, although the navigation problem has been well studied in robotics and related areas for several decades [1, 2], the mobility of robots nowadays is still limited.

Most existing image-goal navigation approaches use end-to-end learning to tackle this problem  [3, 4]

, which combine deep convolutional neural networks (CNNs) with reinforcement learning (RL) to manage the relationship between vision and motion in a natural way. These map-less methods have presented encouraging results for image-goal navigation, and in the meantime, have shown a great tendency to overfit in the domain in which they are trained. Therefore, it is always necessary to train, or at least fine-tune, these methods for new navigation targets or environments. In addition, several recent works learn navigation policies in maze-like environments 

[5] or synthetic indoor scenes [6], which are both much simpler in size and complexity than real indoor environments. Directly transferring these trained policies to real environments can be extremely challenging and impractical.

This difficulty has motivated one important set of works [7, 8, 9], called hierarchical RL, which feature hierarchical planners combined with RL. These works insist on building spatial maps [7] or topological representations [10] of free space combined with a local RL-based policy executing navigation at run-time. Hierarchical RL leverages the regularities of real-world layouts, resulting in competitive performance over both geometry-based methods and recent learning-based methods. However, as we increase the complexity of the problems by requiring navigation in real scenarios with many obstacles, these RL-based methods become harder to train and do not consistently generalize well to new environments.

In this paper, we investigate alternative formulations of employing learning for image-goal navigation that does not suffer from the drawbacks of end-to-end learning and that transfers well to real crowded scenarios. Our key conceptual insights lie in leveraging structural regularities of indoor environments for long-range navigation and learning a local sensor-to-controls planner for reactive obstacle avoidance. This motivates the use of learning in a modular and hierarchical fashion. More specifically, our proposed image-goal navigation architecture comprises of an automatic mapping module, a long-term goal prediction module, a reactive motion planning module, and a robust navigation ending prediction module. The mapping module builds explicit obstacle maps to maintain episodic memory via a learning-based system (e.g., Active Neural SLAM [7]) or a geometric-based method (e.g., Gmapping [11]). The long-term goal prediction module consumes the maps with agent navigation tasks and employs learning to exploit structural regularities in layouts of real-world environments to produce distant goals for navigation. These distant goals are used to explore the environment efficiently and finally approach the positions of navigation targets over the maps. The motion planning module uses learning to transfer raw sensor inputs to a collision-free steering command for navigating to a long-term goal [12]. The navigation ending prediction module learns to distinguish between the current observation and the navigation target observation, which are temporally close to or distant from one another. When the two observations are close, the navigation will be ended. The use of mapping during image-goal navigation provides a feasible way to exploit regularities in layouts of real-world environments. Learned long-term goal prediction can support long-range indoor navigation, while learned motion planning policies can use sensor feedbacks to exhibit effective and safe navigation behaviors. The navigation ending judgement stops an agent properly near the image goal.

In summary, our contributions are as follows: (1) We present a hierarchical framework for image-goal navigation in real crowded scenarios. Hierarchical decomposition and modular learning significantly cuts down the search space during training, leading to better navigation performance as well as sample complexity. (2) We propose combining long-range planning with local motion control for image-goal navigation. To the best of our knowledge, our approach is the first to decouple collision avoidance from long-range planning on the task of image-goal navigation. This drives the robot towards the eventual goal image, while successfully avoiding some static or dynamic obstacles in the scenes. (3) We propose using continuous actions for image-goal navigation, embodied as linear and angular velocities, taking both robot dynamics and path feasibility into account and leading to better approximate optimal paths than discrete actions.

We demonstrate our proposed approach in a visually and physically realistic simulator (e.g., Habitat  [13]) for image-goal navigation tasks. Our experiments in the simulation environments show that the proposed approach outperforms some end-to-end navigation methods significantly. We also demonstrate that our approach can more easily transfer from simulation to the real world, including a increase in navigation success and a decrease in navigation collision overall compared with [14], while maintaining good performance despite increasing obstacles. We show that our mapping module built from laser-based SLAM (Simultaneous Localization and Mapping)  [11] can diminish the simulation to reality gap and achieve promising performance in the real world when transferred to a mobile robot platform. This reveals the robustness of our proposed approach towards the target types, the obstacles, and the scene layouts.

The remainder of this paper is organized as follows. We review the relevant background literature in Section II. Section III describes the image-goal navigation problem and presents the proposed hierarchical framework for solving the problem. Section IV provides an exhaustive experimental validation of our approach. We conclude in Section V with a summary and a discussion of future work.

Ii Related Works

Autonomous navigation is a core requirement in building intelligent robots, and target-driven navigation is a relatively new task in the field of intelligent robotics research. We survey related works below.

Ii-a End-to-End Navigation System

Recently, end-to-end systems have been specifically developed to address the target-driven navigation problems in unseen scenes, which can be divided into three main categories based on the nature of the navigation goal. Point-goal navigation agents can navigate to a user-defined location while avoiding collisions [15]. However, long-range navigation tasks seem especially difficult for these agents. Area-goal navigation agents can navigate to an area of a specified category, which relies on prior knowledge about the appearance and layout of different areas [16]. Object-goal navigation agents can navigate to an object of a specific category, which involves semantic and common sense reasoning in order to efficiently approach the object in previously unseen environments. The object can be an image [3] or a label [17], involving different challenges. In this work, we focus on image-goal navigation in novel indoor environments, where no target position is available, except for the target image.

Recent learning-based approaches use end-to-end reinforcement or imitation learning for training these navigation policies, which do not build a geometric map of the area. Instead, they learn the direct mapping from visual inputs to motion. The work of 

[6] is the first to address the image-goal navigation problem, which designs scene-specific layers to capture the layout characteristics of a scene. Yang et al. [18] extend this work by incorporating a graph convolutional network into a deep reinforcement learning framework for encoding semantic priors, which shows great improvements in generalization to unseen synthetic scenes. Wortsman et al. [17] propose a meta-reinforcement learning navigation approach, which allows an agent to automatically learn in testing environments while completing navigation tasks effectively. Mousavian et al. [19] use a deep network to learn the navigation policy based on semantic segmentation and detection masks of visual observations. Wu et al. [14] enhance the cross-target and cross-scene generalization of target-driven visual navigation by introducing an information-theoretic regularization term into an RL objective. The regularization maximizes the mutual information between navigation actions and visual observation transforms of an agent, thus promoting more informed decisions. In all of the above methods, the navigation models need to learn navigation planning, obstacle avoidance, and scene layouts implicitly and simultaneously, which is extremely challenging and can easily result in memorizing object locations and appearance in training environments. Consequently, they always suffer from poor generalization in the real world.

Fig. 1: Hierarchical image-goal navigation method overview. (a) In Habitat, at each time step , the Neural SLAM module [7] predicts a map and an agent pose based on the visual observation and the sensor reading . When reaches a time scale, the map , the observation and the goal image are used by a long-term goal prediction module to output a distant goal, which is then converted to a set of navigation actions by using the Fast Marching Method [20]. A navigation ending prediction module is trained to stop the agent properly. (b) In real scenarios, the Gmapping [11] is used to provide the online map and the CrowdMove [12] is used for the motion planning to long-term goals.

Ii-B Hierarchical and Modular Navigation System

Hierarchical and modular navigation is an active area of research, aimed at exploiting hierarchies to navigate an agent in a scene. Classical strategies break the navigation problem into two parts: mapping and path planning. Mapping is done via simultaneous localization and construction of semantic and metric or topological maps with different sensors [21, 22, 23]. The work of [1] provides an extensive review of SLAM methods. Given the maps and goal locations, the state-of-the-art path planning and control strategies can be found in [24]. However, SLAM methods are not specifically designed for image-goal visual navigation and thus additional efforts are required to combine SLAM methods.

There has been a recent interest in using techniques from deep learning in a modular and hierarchical fashion inside navigation systems. For example, Chaplot et al. 

[25] propose a fully-differentiable model for active global localization. Devo et al. [26]

design a novel architecture composed by two main deep neural networks. The first one has the objective of exploring the environment and approaching the target, while the second one aims at recognizing the specified target in the view of an agent. The work is exclusively trained in maze simulation, which is considerably simpler in complexity than real scenes. Other works design deep models to reason spatial representations, semantic representations or topological representations, based on which navigation tasks can be executed. The work of 

[7] presents a Neural SLAM module for free spatial space mapping and shows strong generalization while achieving point-goal navigation tasks.  [8] extends the proposed model to complex semantic tasks, such as object-goal navigation by designing a semantic Neural SLAM module that captures semantic properties of the objects in the environment. While our work builds upon the existing literature [7] on spatial metric maps, the resemblance is only at the map structures. Our work focuses on seeking a balance between scene exploration and image-goal navigation. Chaplot et al. [10] study the problem of image-goal navigation by building topological representations for space that leverage semantics and afford approximate geometric reasoning. Beeching et al. [9] design a hierarchical model for long-term image-goal navigation under uncertainty in novel environments, which combines high-level graph-based planning with a local point goal policy. Mezghani et al. [25] propose a novel three-stage algorithm for learning to navigate to goal images using only RGB vision. The first stage focuses on learning visual representations and the second one explores the environment to maintain a scene memory buffer module, based on which the third stage guides an agent along a shortest path by which the agent is likely to succeed.

However, all of the above methods learn about navigation planning and collision avoidance at once, which can be prohibitively expensive. Additionally, these methods do not consider the navigation in the real-world scenarios, which are characterized by heavy clutters and noises. In contrast, we consider navigation and collision avoidance separately in the real world, and our design is inspired by the work of [27], which proposes a hierarchical planning method for long-range point-goal navigation that combines sampling-based path planning with an RL agent as a local controller for solving obstacle avoidance. In contrast, we achieve this via designing a global planner for long-term goal prediction followed by a local controller for reactive obstacle avoidance. This strategy has never before been applied to the image-goal navigation.

Iii Hierarchical Image-goal Navigation

In the image-goal navigation, the objective is to learn a navigation controller, which can produce shortest sequences of actions for an agent to navigate to specified goal images in a fixed time budget. The agent is initialized at a random location in the environment and receives the goal image as input. At each time step , the agent receives visual observations and sensor pose readings and takes navigational actions from the controller. In our experimental setup, all images are panoramas, including agent observations and goal images.

We propose a hierarchical framework to tackle the image-goal navigation problem, which consists of four components including an automatic mapping module, a long-term goal prediction module, a reactive motion planning module and a robust navigation ending prediction module. The mapping module builds a geometric map over time and the goal prediction module produces a long-term goal based on the map to approach the given goal image efficiently. The motion planning module navigates an agent to the long-term goal while avoiding environmental obstacles. The navigation ending module stops the agent properly near the goal image. Figure 1 provides an overview of the proposed framework.

Automatic Mapping. To enable an agent to do goal planning based on an online map, we propose using some state-of-the-art SLAM methods (e.g., ORB-SLAM2 [28], Neural-SLAM [7], or the laser-based SLAM [11]) to produce the map. In the experiments, ORB-SLAM2 does not detect enough reliable keypoints in low-texture scenes at the cost of great likelihood of losing tracking, while Neural-SLAM can not close the simulation to reality gap. In this work, considering that we have to use simulators for the long-term goal prediction learning, we use the pre-trained Neural-SLAM to produce an online map at each time step during navigation. Alternately, in real crowded scenarios, we use Gmapping [11] to provide the online map and we show that the map built from the laser-based SLAM helps the agent perform almost as well as it does in simulation environments.

Long-term Goal Prediction. At each time scale , given the navigation goal image , the current map , and the visual observation , the long-term goal prediction module decides a long-term goal on the map to approach the given goal image. This process requires learning to exploit structural regularities in layouts of environments. We use reinforcement learning (RL) with carefully designed rewards to train the module. We predict the long-term goal at a coarse time-scale (e.g., every steps), which reduces the time-horizon for navigation in RL exponentially and consequently, thus reducing the sample complexity.

Reactive Motion Planning. While any sensor-to-controls obstacle-avoidance agent could be used to take collision-free steering commands to reach the long-term goal , we choose CrowdMove [12] for its simplicity of training. Additionally, thanks to its excellent generalization, the module trained in a simulator (e.g., Stage111http://rtv.github.io/Stage/) can be effectively deployed on different mobile platforms in the real world without tedious fine-tuning. On the other hand, due to the lack of a laser sensor in Habitat, a deterministic local motion planning based on analytical planners is used instead. It computes the shortest path from the current agent location to the long-term goal using the Fast Marching Method [20] based on the map .

Robust Navigation Ending Prediction. Given the navigation goal image and the current visual observation of an agent, we train the navigation ending prediction module to distinguish between the two observations. We use the spatial distance between observations as a surrogate similarity measure and train a siamese neural network to predict the navigation ending indicator.

Iii-a The Automatic Mapping Module

Our mapping module is strongly decoupled from the construction and training of the rest modules. We explore this with two different mapping technologies: Neural-SLAM [7] and Gmapping [11].

Iii-A1 Neural SLAM

The Neural SLAM model, designed for maximizing scene exploration coverage, internally maintains a spatial map and a pose of an agent at each time step . Please refer to [7] for the implementation details. The Neural SLAM model takes in the current visual observation , the current and last sensor reading of the agent pose

, the last agent pose and map estimates

, and outputs an updated map and the current agent pose estimate .

The spatial map is a matrix, where denotes the map size. Each element in the map corresponds to a cell of size

in the physical world. The number of channels on the map is two. The elements in the first channel represent the probabilities of obstacles at corresponding locations. The elements in the second channel describe the explored space in general. The map is initialized with all zeros at the beginning of navigation, namely

. The pose includes both the D coordinates and the orientation of the agent at time step . The agent always starts at the center of the map facing east at the beginning of each long-term goal prediction. By incorporating previous map estimates and ground-truth poses, we augment the map to eight channels for the long-term goal prediction module, which contains the obstacle area, the explored area, the current agent location and the past agent locations.

Fig. 2: Automatic mapping results. (a) The scene image. (b) The Gmapping result. (c) The converted obstacle map.

Iii-A2 Gmapping

We use Gmapping [11] to produce online maps for the navigation in real crowded scenarios. Gmapping is based on the Rao-Blackwellized particle filter (RBPF) with a single laser sensor, which incrementally processes the sensor observations and the odometry readings as they are available.

The algorithm can be summarized by four steps, including sampling, importance weighing, adaptive resampling and map estimation. Each particle carries a state with the measurements obtained by the laser scanner. The initial set of particles is drawn by using the odometry information, which is then improved by incorporating the most recent observations to evaluate how good the particles fit with the map obtained so far. An individual importance weight is assigned to each particle according to an importance sampling principle. Particles are drawn with replacement proportional to their importance weight. Subsequently, the map carried by each particle is updated by taking into account the history of poses and observations. Finally, we convert the map to be consistent with the results of Neural SLAM in form. Figure 2 presents both the Gmapping result and the converted obstacle map.

Iii-B The Long-term Goal Prediction Module

In this section, we focus on learning an RL-based policy for predicting a long-term goal over an online maintained map to seek a balance between exploring an environment and approaching a navigation goal. Our problem is: given a goal image , at each time scale , the agent receives as inputs the current map and the visual observation to predict a distant goal that will guide the robot to approach the viewpoint from which is taken.

Iii-B1 Learning Setup

Before introducing our policy, we first describe the key ingredients of the learning setup, including datasets, inputs and output, and reward design.

Datasets. We conduct our learning on the Habitat simulator [13] with the visually realistic Gibson [29] dataset. We split the set of scenes from [29] into sets of scenes for training/testing, respectively. Each environment corresponds to a different apartment or house, generally including a kitchen, a bedroom, a living room and a bathroom in the layout. In the experiments, we further transfer the learned policies from Habitat to some real-world crowded scenes based on a mobile robotic platform (e.g., TurtleBot2). These scenes have never been encountered before.

Inputs and output. The long-term goal prediction policy takes a goal image, a current map, and a visual observation as inputs. At each time scale , the goal image and the visual observation are both panoramas. Each panorama consists of four images, which are collected from , , and orientations of the agent, respectively. The image resolution is . The current map is drawn from our mapping module, which consists of eight channels containing the obstacle area, the explored area, the current agent location and the past agent locations. Our model learns to analyze the connection between the current observation and the goal image and to exploit structural regularities of the current map to produce the possible position (namely, the long-term goal ) over the map where the goal image could be.

Reward design. The objective of a reinforcement learning policy is to collect as many rewards as possible, and an informative reward function becomes a critical foundation on which a successful RL policy relies. The reward function evaluating the behaviors drawn from the RL policy, always needs to be provided beforehand. However, in practice, defining the reward function can be challenging, since an informative reward function may be very hard to specify and exhaustive to tune for large and complex problems [26].

Our purpose during the long-term prediction policy learning is to let an agent explore the environment effectively while moving towards the goal image and to avoid collisions during navigation. Therefore, when the predicted long-term goal is near , high values of reward are published. Colliding with obstacles is assigned a penalty. In addition, to realize efficient navigation in unseen scenes, we encourage the agent to explore the environment before reaching the goal and penalize new explorations once is located in the explored area of the agent. We use the explored area increments to design the reward function, which results in faster convergence. Formally, the total reward collected by the agent at time scale can be given as:


denotes the reward when the distance between the predicted long-term goal and the final goal is less than some threshold (e.g., ). represents the penalty when the predicted long-term goal is located in the unreachable area of the environment. We set and in our formulation. evaluates the agent exploration, which is given as:


where denotes the size of the explored area at time scale ; denotes the size of the explored area at the next time scale (namely, ); represents is located in the explored area at time scale .

Iii-B2 Model

We focus on learning the long-term goal prediction policy function via deep reinforcement learning, where the long-term goal at time scale can be drawn by . The main task of the RL-based agent is to maximize the expected sum of future discounted rewards:


where is the reactive reward provided by the environment at each time scale and is a discount factor. We extend the Proximal Policy Optimization (PPO) [30] to our parallel training framework. The policy is trained with experiences collected by all threads in parallel. The parallel training framework not only dramatically reduces the time cost of the sample collection but also makes the algorithm suitable for training in various environments. Training in multiple environments simultaneously enables robust performance when generalizing to unseen scenes.

Fig. 3: The architecture of the long-term goal prediction module.

Network architecture. The long-term goal prediction policy demands an understanding of the relative spatial positions between the current observation and the goal image, as well as a holistic sense of the scene layout. To reason about the spatial arrangement between the current observation and the goal image , we use a two-stream deep siamese network [31]

for discriminative embedding learning. The backbone of the siamese layer is the ImageNet-pretrained ResNet-

 [32] (truncated the last layer) that produces a -d feature on a RGB image. The fusion layer takes a -d concatenated embedding of the state and the target, finally generating a -d joint representation. The current map is fed into five convolutional layers with filters with

kernel and stride

, all of them with rectified linear units (ReLU) as activation and finally producing a

-d representation. The two representations are concatenated together and processed by two fully connected layers, and the second fully connected layer connects to three heads to output the value, the mean, and the variance of a Gaussian distribution, from which our

-d long-term goal is sampled. The architecture is shown in Figure 3.

Training details. We train this network with an Adam optimizer of learning rate , exponential decay rates , and stability parameter . The training time is about hours on a high performance computing cluster with two Intel Xeon Cascade Lake Gold CPUs, Hz, cores and eight NVIDIA Tesla PCIE V GPUs. During training, we estimate the discounted accumulative rewards and back-propagate through time for every unrolled time scales with navigation episodes executed at each time. In addition, the number of navigation steps between two adjacent time scales is . Each episode terminates when the RL-based agent navigates time steps (namely, time scales) in an environment.

Iii-C The Reactive Motion Planning Module

The motion planning module takes as inputs the current sensor observation and the long-term goal at each navigation time step and outputs a navigational action. This module is also decoupled from the other modules. We explore this with two different motion planning technologies: Fast Marching Method [20] and CrowdMove [12].

Iii-C1 Fast Marching Method

In Habitat, we use the deterministic local motion planning algorithm, Fast Marching Method [20], to plan a collision-free path to a long-term goal. A navigation agent simply takes deterministic actions along the path to reach the long-term goal , where . is a discrete set defined as: . Each move distance is about and each turn angle is about degrees. We use this motion planning method in Habitat to facilitate the learning of our long-term goal prediction module. Note that, although the goal prediction policy acts at a coarse time scale (e.g., ), the motion planning module acts at a fine time scale. At each time step , we replan the motion to the long-term goal over the updated map .

Iii-C2 CrowdMove

In real crowded scenarios, we use CrowdMove [12] to take collision-free steering commands to reach a long-term goal. CrowdMove is trained on the Stage simulator based on a multi-robot, multi-scenario, and multi-stage training strategy. At each time step , it takes as inputs the readings of the D laser sensor , the relative goal position and the last velocity from the odometry, and produces a navigation command in continuous space. Specifically, contains the measurements of the last three consecutive frames from a -degree laser sensor which provides distance values per scanning (i.e. ). The relative goal position is a

-d vector representing the long-term goal

in the polar coordinate (distance and angle) with respect to the current position of the navigation agent. The velocity and the navigation command both include the translational and rotational velocity. In our experiments, CrowdMove can always find collision-free paths for motion planning and can be generalized effectively to various crowded scenarios.

Iii-D The Robust Navigation Ending Prediction Module

In this section, we train the navigation ending prediction module to issue a stop action at a correct location. With the module, an agent can figure out if a goal image is reached during the navigation in real unseen scenes. The module takes as inputs the navigation goal image and the current visual observation , and outputs an ending indicator to determine whether the navigation is over.

This is essentially a binary classification problem and we design a siamese neural network like the one used in the long-term goal prediction module, including a pretrained ResNet- backbone and then a fusion layer (with the yellow background in Figure 3). The -d feature from the fusion layer is then fed into a fully connected layer to produce the -d ending indicator. The navigation ending prediction module is jointly trained with our long-term goal prediction module in Habitat, although no parameters are shared between the two modules. When the agent navigates in Habitat, we collect the navigation trajectories including the observation images and corresponding spatial positions. We construct the dataset for navigation ending prediction learning by selecting observation pairs randomly on these trajectories. The spatial distance between two observations in a pair is used as a surrogate similarity measure. We define a label for each observation pair and the label is equal to if and otherwise. We train the navigation ending prediction module (NEPM) to predict the ending indicator from the input pair with a binary cross-entropy loss as:


where is a delta function at . We update the training dataset and the network every two navigation time scales. The batch size is set to for each back-propagation and each batch has positives and negatives equally for training data balance. We also Adam optimizer of learning rate , exponential decay rates , and stability parameter .

Iv Experiments and Results

In this section, we measure our image-goal navigation performance on both simulator and real-world navigation tasks. The key characteristic of a good navigation method is that it explores unseen environments effectively and reaches some designated goal images without any collision. In particular, we present our navigation performances in the context of the different choices we made in our design, as well as comparing with some state-of-the-art methods. We demonstrate superior performances in both simulator and real-world environments with respect to these baselines.

Models Easy Medium Hard Overall
Ours-RP 0.65 0.32 0.51 0.28 0.39 0.21 0.51 0.27
Ours-SR 0.37 0.21 0.23 0.10 0.09 0.01 0.23 0.11
Ours-LP 0.61 0.37 0.49 0.26 0.31 0.19 0.47 0.27
Ours 0.73 0.49 0.56 0.33 0.43 0.24 0.57 0.35
Random Agent 0.36 0.15 0.31 0.09 0.11 0.03 0.26 0.09
TD-VNM [6] 0.56 0.22 0.17 0.06 0.06 0.02 0.26 0.10
G-LSTM-RL-BC [16] 0.53 0.31 0.19 0.07 0.04 0.02 0.25 0.13
TD-RL-ITR [14] 0.68 0.36 0.45 0.23 0.19 0.09 0.44 0.22
NTS [10] 0.80 0.60 0.47 0.31 0.37 0.22 0.55 0.38
TABLE I: Average navigation performance (SR and SPL) comparisons on unseen scenes from the Habitat simulator.

Iv-a Baselines and Metrics

We compare our method with the following baselines:

  • Random Agent (RA) randomly picks an action from a discrete set of actions at each time step, where .

  • TD-VNM represents the target-driven visual navigation model from [6] and is trained using standard reinforcement learning. We evaluate the navigation generalization to unknown scenes and thus we do not use the scene-specific design during the training.

  • G-LSTM-RL-BC is adapted from [16], which incorporates a Gated-LSTM structure with an RL-based framework (e.g., A3C) and is trained using behavioral cloning (BC) for improving the navigation performance.

  • TD-RL-ITR is an end-to-end image-goal visual navigation model [14], which designs an information-theoretic regularization to facilitate the RL policy leaning.

  • NTS is the abbreviation of ‘Neural Topological SLAM’ [10], which studies the image-goal navigation in a hierarchy of three modules. This work first maintains the topological map based on agent observations followed by selecting a node in the map as the long-term goal based on a global policy. A local policy finally navigates the agent to the long-term goal based on visual observations.

All the baselines are trained on Habitat. Please refer to Section III-B for the training data description. For evaluation, we sample navigation tasks in the testing scenes to create three levels of difficulty based on the distances between the image goal locations and the starting locations, as in [10]: Easy , Medium and Hard . An agent succeeds in a navigation task if it predicts the stop action before the time limit (e.g., time steps) and the distance between the agent’s current location and the image goal location is within a threshold (e.g., ). The agent fails if it takes stop action anywhere else or runs up to steps.

To measure the navigation performance, we use two metrics, success rate (SR) and success weighted by (normalized inverse) path length (SPL), as defined in [33]. For each of the navigation tasks, let be a binary indicator for successful navigation or unsuccessful navigation. and denote the length of the shortest path and the actual executed path of the -th task, respectively. Success rate is the fraction of tasks in which the agent reaches the target successfully within limited time steps: . SPL considers both the success indicator and the length of the executed path: . The higher this value, the faster, on average, the agent approaches the target.

Iv-B Navigation in the Habitat Simulator

In the Habitat simulator, our method uses pre-trained Neural-SLAM [7] to maintain an on-line map at each time step during navigation, followed by producing a long-term goal on the map at each time scale with the proposed goal prediction module. Next, our method uses the Fast Marching Method [20], to plan a collision-free path to the long-term goal and uses the navigation ending prediction module to stop the navigation. We denote the whole architecture as Ours. The architecture is similar to the variant of Active Neural SLAM model from [10] and the differences lie in the design of the long-term goal prediction module and the navigation ending prediction module. In addition, to avoid the errors in pose prediction, we directly use the poses from the sensor of a navigation agent leading to relatively noiseless maps. This is reasonable since our work is designed for the image-goal navigation in the real-world scenarios, and we always use Gmapping [11] to provide the noiseless map of the real world. The mapping module is important, but not our main innovation. Our goal in this subsection is to evaluate the effectiveness of the proposed long-term goal prediction module during navigation. We analyze the navigation performance of the proposed method and all the baselines in the Habitat simulator. The evaluation involves three difficulty levels and each difficulty level contains different navigation tasks randomly sampled from unknown scenes, as in [10].

Iv-B1 Ablation

We first provide ablation results to gain insight into how navigation performance is affected by changing the structure. As developed in Section III-B, our method includes a long-term goal prediction module, which allows an agent to explore the environment efficiently and finally determine the locations of image goals. We first use a random policy (RP) to replace the module, denoted as Ours-RP. The random policy predicts a location randomly on an online map at each time scale. In addition, we ablate the effect of our proposed continuous rewards in Section III-B by using the sparse reward (SR) design in [6], denoted as Ours-SR. The motion planning module and the navigation ending prediction module are both crucial for our navigation method. We exhibit the importance of the design over [10], which uses FMM [20] to convert a long-term goal to a short-term goal and trains an end-to-end local policy (LP) for both motion planning and navigation ending prediction. The local policy takes as inputs the current visual observation and the short-term goal and outputs a discrete navigational action from . We use this design for an ablation and denote the variant as Ours-LP.

As shown in Table I, our navigation pipeline shows improvement in average SR, and improvement in average SPL over Ours-RP, which indicates the proposed long-term goal prediction module has a significant impact on improving the navigation capability of an agent. Dealing with sparse rewards is especially challenging during navigation learning. Ours-SR exhibits no learning even after millions of training frames and it does not seem to follow any specific strategy during the evaluation. The comparison results (Ours vs. Ours-SR) suggest our continuous reward design manages to explore the scene properly and guide the navigation efficiently. Ours-LP considers motion planning and navigation ending prediction simultaneously, and the performance of this ablation deteriorate as the difficulty of navigation tasks increases. Our model decouples the two processes and is able to tackle the hard level relatively well.

Fig. 4: A navigation task from the Habitat testing scenes. (a) The front-view of the starting position. (b) The result from the Neural SLAM given the starting observation (a). (c) The front-view goal image for navigation.

Iv-B2 Comparison

Table I also summarizes the comparison results with some navigation baselines in the RGB setting. We show the generalization abilities of these navigation agents to transfer the learned navigation skills to previously unseen scenes. In detail, the performance is intensely unfavorable when an agent applies random walk or is trained only by reinforcement learning (e.g., TD-VNM [6]). TD-VNM [6] originally designs different scene-specific layers for different scenes and thus the model lacks the generalization ability to unseen scenes unless it is fine-tuned. We adapt this model to a single policy network to evaluate its navigation performance in unseen scenes without fine-tuning. Due to the huge searching space, the pure RL-based agent is hard to yield sensible navigation results. G-LSTM-RL-BC [16] and TD-RL-ITR [14] are both end-to-end RL-based baselines, which are trained with expert data in different ways. The two baselines encourage learning of useful features for both navigation tasks and static collision avoidance and thus demonstrate much weaker performance than the hierarchical navigation agents (e.g., NTS [10] and Ours). We believe that it is more challenging for end-to-end RL networks to tackle the higher-order complicated control tasks. Chaplot et al. [10] propose the Neural Topological SLAM (NTS) for the image-goal navigation problem, which uses hierarchical decomposition and modular training to sidestep the high sample complexities associated with training end-to-end policies. The topological map-based baseline presents impressive navigation results across all difficulty settings. In contrast to this work, our method uses a metric spatial map and can also achieve approximately equal navigation performance in the Habitat simulator. Figure 4 shows a sample navigation task from the Habitat testing scenes and Figure 5 visualizes the trajectories of the proposed method and some baselines for the task. Note that we also present the automatic mapping results for these compared baselines. As can be seen, the agent based on TD-VNM [6] seems to get stuck in the corner and thus the mapping area is small. The agents based on G-LSTM-RL-BC [16] and TD-RL-ITR [14] both present limited exploration capabilities and fail to approach the goal image. Our method explores the scene effectively and successfully navigates the agent to the goal.

Fig. 5: Navigation visualization. We show trajectories of the proposed method and some baselines in the navigation task from Figure 4. Top: Front-view observations seen by the agent. Bottom: Local maps and trajectories. The ground-truth maps and poses are shown in grey. Trajectories generated by these navigation methods are shown in red. Map predictions from the Neural SLAM module are overlaid in green. Long-term goals selected by our goal prediction module are shown with blue circles. The image goal positions are shown with blue spots. Our agent can successfully stop in the yellow circle, which means the distance between the agent and the goal image is within .
Models (b) (c) (d) (e) (f) Overall
TD-VNM [6] 0.14 0.46 0.06 0.60 0.24 0.36 0.10 0.50 0.12 0.48 0.13 0.48
G-LSTM-RL-BC [16] 0.18 0.42 0.08 0.52 0.20 0.34 0.06 0.46 0.08 0.42 0.12 0.43
TD-RL-ITR [14] 0.28 0.38 0.22 0.54 0.32 0.30 0.14 0.38 0.18 0.34 0.23 0.39
Ours 0.50 0.14 0.32 0.20 0.58 0.10 0.26 0.20 0.34 0.16 0.40 0.16
TABLE II: Average navigation performance (SR and SPL) comparisons on unseen scenes from the real world.

Iv-C Navigation in the Real World

To tackle the image-goal navigation in real settings, as mentioned in Section III, our method uses the Gmapping [11] to update a navigation map in real time and uses the trained long-term goal prediction module to produce a long-term goal on the map at each time scale. CrowdMove [12] is exploited for the motion planning to the long-term goal and the navigation ending prediction module stops the navigation agent at a proper location. We denote the whole architecture as Ours.

Fig. 6: (a) The robot system setup. (b-f) The physical indoor environments with different configurations.

Iv-C1 Robotic Setup

We demonstrate the proposed navigation system using the TurtleBot2 robot. The hardware configuration of TurtleBot2 is shown in Figure 6 (a), which consists of a Kobuki base, an onboard laptop, an RPLIDAR A1 laser, and four monocular cameras. The moving base executes the steering command output from the navigation system, which is deployed on the laptop. The laser sensor and the camera sensors provide the inputs to the proposed system. The output space of the system is a set of permissible velocities in continuous space, as described in Section III-C.

Iv-C2 Indoor Experiments

Experiments are conducted on two physical indoor environments (e.g., a laboratory and an office room, approx. m in total) with different configurations, as shown in Figure 6 (b-f). All of the tested navigation models are trained purely in simulation and the real-world environments are unknown to these navigation agents. As for the experiments in simulation, we test the image-goal navigation capabilities, including the average success rate and the collision rate (CR, the rate of collision cases to all navigation cases). The robot and the goal image are both placed randomly in these environments and the objective of the robot is to approach the goal image as fast as possible. We evaluate navigation tasks in each configuration. The quantitative analysis of the navigation performance is provided in Table II.

The pure RL-based model (e.g., TD-VNM [6]) tends to be more cautious, and it thrashes around in space without making progress, resulting in a lower success rate. G-LSTM-RL-BC [6] and TD-RL-ITR [6] are both end-to-end navigation models, which are both trained with supervision from the shortest paths of navigation tasks. The two models collide more often, as learning about navigation and collision avoidance simultaneously is especially difficult for the end-to-end training strategy. As expected, Ours generalizes well to the unseen real-world environments and shows the best performances in both success rate and collision rate. This can be explained by the fact that we decouple the learning of image-goal navigation and collision avoidance by designing two mutually independent modules and thus each module can demonstrate more concentrated learning capabilities. In addition, the three navigation baselines all predict discrete action commands, leading to significantly oscillatory and jerky motions during robot navigation. Using our proposed navigation system, the mobile robot is able to move continuously in most cases. This is an important navigation property in a physical environment. Furthermore, moving to the real-world scenarios evaluates the generalization capabilities of these navigation methods and also their robustness against sensor noise and actuation delays. In practice, our approach works well in complex, occluded scenarios and results in smooth trajectories. Figure 7 visualizes three trajectories of TurtleBot generated by our system. Please see the video in the supplementary material for additional results obtained from the experiment in the real world as well as the qualitative simulation results.

Fig. 7: Real-world Transfer. We show three successful trajectories of the proposed method transferred to the real world. Sample images seen by the robot are shown on the top and the local maps in green and the trajectories in red are shown below. The long-term goals selected by the goal prediction module are shown in blue.

V Conclusion

In this work, we proposed a novel navigation method to tackle the image-goal navigation task in real crowded scenarios. The core of the method is the hierarchical decomposition and modular learning, which decouples the learning of navigation planning, collision avoidance, and navigation ending prediction. This cuts down the search space during training and helps achieve state-of-the-art performance on the image-goal navigation. We evaluate the method in both the Habitat simulator and the real world with a differential mobile robot. Ablation studies show that the proposed method learns the structural regularities of indoor environments, which leads to more efficient goal-driven navigation. The comparisons show that our model outperforms some end-to-end baselines by a large margin across all difficulty settings and the relative improvement of our method over these baselines increases as the difficulty increases. Finally, we show that a direct sim-to-real transfer of the image-goal navigation capability is possible. The proposed hierarchical framework generalizes well in real crowded scenarios.

However, although the results are promising, there are still a number of open problems. Both the reliability and the robustness of the proposed decoupled modules including the navigation goal planning, the motion planning, and the ending prediction, should be improved. In future work, we will explore effective techniques to prevent particularly bad behaviors and improve overall performance. For example, semantic properties of environmental objects can be used during the navigation goal planning, as in [8]. Using an attention mechanism to encode useful contextual information [34] could facilitate the navigation ending prediction. Incorporating D environmental perception [35, 36] for better navigation generalization is also a great topic for future work.


  • [1] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid, and J. J. Leonard, “Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age,” IEEE Transactions on robotics, vol. 32, no. 6, pp. 1309–1332, 2016.
  • [2] F. Zeng, C. Wang, and S. S. Ge, “A survey on visual navigation for artificial agents with deep reinforcement learning,” IEEE Access, vol. 8, pp. 135 426–135 442, 2020.
  • [3] X. Ye, Z. Lin, H. Li, S. Zheng, and Y. Yang, “Active object perceiver: Recognition-guided policy learning for object searching on mobile robots,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2018, pp. 6857–6863.
  • [4] D. Pathak, P. Mahmoudieh, G. Luo, P. Agrawal, D. Chen, Y. Shentu, E. Shelhamer, J. Malik, A. A. Efros, and T. Darrell, “Zero-shot visual imitation,” in

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops

    , 2018, pp. 2050–2053.
  • [5] M. Jaderberg, V. Mnih, W. M. Czarnecki, T. Schaul, J. Z. Leibo, D. Silver, and K. Kavukcuoglu, “Reinforcement learning with unsupervised auxiliary tasks,” in 5th International Conference on Learning Representations, ICLR 2017, Conference Track Proceedings, 2017.
  • [6] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi, “Target-driven visual navigation in indoor scenes using deep reinforcement learning,” in Proc. ICRA, 2017, pp. 3357–3364.
  • [7] D. S. Chaplot, D. Gandhi, S. Gupta, A. Gupta, and R. Salakhutdinov, “Learning to explore using active neural slam,” arXiv preprint arXiv:2004.05155, 2020.
  • [8] D. S. Chaplot, D. Gandhi, A. Gupta, and R. Salakhutdinov, “Object goal navigation using goal-oriented semantic exploration,” arXiv preprint arXiv:2007.00643, 2020.
  • [9] E. Beeching, J. Dibangoye, O. Simonin, and C. Wolf, “Learning to plan with uncertain topological maps,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16.   Springer, 2020, pp. 473–490.
  • [10] D. S. Chaplot, R. Salakhutdinov, A. Gupta, and S. Gupta, “Neural topological slam for visual navigation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 12 875–12 884.
  • [11] G. Grisetti, C. Stachniss, and W. Burgard, “Improved techniques for grid mapping with rao-blackwellized particle filters,” IEEE transactions on Robotics, vol. 23, no. 1, pp. 34–46, 2007.
  • [12] T. Fan, X. Cheng, J. Pan, D. Manocha, and R. Yang, “Crowdmove: Autonomous mapless navigation in crowded scenarios,” arXiv preprint arXiv:1807.07870, 2018.
  • [13] M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik et al., “Habitat: A platform for embodied ai research,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9339–9347.
  • [14] Q. Wu, K. Xu, J. Wang, M. Xu, X. Gong, and D. Manocha, “Reinforcement learning-based visual navigation with information-theoretic regularization,” IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 731–738, 2021.
  • [15] A. J. Sathyamoorthy, J. Liang, U. Patel, T. Guan, R. Chandra, and D. Manocha, “Densecavoid: Real-time navigation in dense crowds using anticipatory behaviors,” in 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 11 345–11 352.
  • [16] Y. Wu, Y. Wu, G. Gkioxari, and Y. Tian, “Building generalizable agents with a realistic and rich 3d environment,” in 6th International Conference on Learning Representations, ICLR 2018.   OpenReview.net, 2018.
  • [17] M. Wortsman, K. Ehsani, M. Rastegari, A. Farhadi, and R. Mottaghi, “Learning to learn how to learn: Self-adaptive visual navigation using meta-learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6750–6759.
  • [18] W. Yang, X. Wang, A. Farhadi, A. Gupta, and R. Mottaghi, “Visual semantic navigation using scene priors,” in 7th International Conference on Learning Representations, ICLR 2019, 2019.
  • [19] A. Mousavian, A. Toshev, M. Fišer, J. Košecká, A. Wahid, and J. Davidson, “Visual representations for semantic target driven navigation,” in 2019 International Conference on Robotics and Automation (ICRA).   IEEE, 2019, pp. 8846–8852.
  • [20] J. A. Sethian, “A fast marching level set method for monotonically advancing fronts,” Proceedings of the National Academy of Sciences, vol. 93, no. 4, pp. 1591–1595, 1996.
  • [21] S. Thrun, “Probabilistic robotics,” Communications of the ACM, vol. 45, no. 3, pp. 52–57, 2002.
  • [22] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohi, J. Shotton, S. Hodges, and A. Fitzgibbon, “Kinectfusion: Real-time dense surface mapping and tracking,” in 2011 10th IEEE international symposium on mixed and augmented reality.   IEEE, 2011, pp. 127–136.
  • [23] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, “Orb-slam: a versatile and accurate monocular slam system,” IEEE transactions on robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
  • [24] A. Faust, K. Oslund, O. Ramirez, A. Francis, L. Tapia, M. Fiser, and J. Davidson, “Prm-rl: Long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning,” in 2018 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2018, pp. 5113–5120.
  • [25] L. Mezghani, S. Sukhbaatar, A. Szlam, A. Joulin, and P. Bojanowski, “Learning to visually navigate in photorealistic environments without any supervision,” arXiv preprint arXiv:2004.04954, 2020.
  • [26] A. Devo, G. Mezzetti, G. Costante, M. L. Fravolini, and P. Valigi, “Towards generalization in target-driven visual navigation by using deep reinforcement learning,” IEEE Transactions on Robotics, 2020.
  • [27] A. Francis, A. Faust, H.-T. L. Chiang, J. Hsu, J. C. Kew, M. Fiser, and T.-W. E. Lee, “Long-range indoor navigation with prm-rl,” IEEE Transactions on Robotics, vol. 36, no. 4, pp. 1115–1134, 2020.
  • [28]

    R. Mur-Artal and J. D. Tardós, “Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras,”

    IEEE transactions on robotics, vol. 33, no. 5, pp. 1255–1262, 2017.
  • [29] F. Xia, A. R. Zamir, Z. He, A. Sax, J. Malik, and S. Savarese, “Gibson env: Real-world perception for embodied agents,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9068–9079.
  • [30] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
  • [31] S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric discriminatively, with application to face verification,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 1.   IEEE, 2005, pp. 539–546.
  • [32] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [33] P. Anderson, A. X. Chang, D. S. Chaplot, A. Dosovitskiy, S. Gupta, V. Koltun, J. Kosecka, J. Malik, R. Mottaghi, M. Savva, and A. R. Zamir, “On evaluation of embodied navigation agents,” CoRR, vol. abs/1807.06757, 2018.
  • [34] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang, “Residual attention network for image classification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 3156–3164.
  • [35] C. R. Qi, O. Litany, K. He, and L. J. Guibas, “Deep hough voting for 3d object detection in point clouds,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9277–9286.
  • [36] X. Chen, X. Yan, F. Zheng, Y. Jiang, S.-T. Xia, Y. Zhao, and R. Ji, “One-shot adversarial attacks on visual tracking with dual attention,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10 176–10 185.