Log In Sign Up

Probabilistic Visual Navigation with Bidirectional Image Prediction

by   Noriaki Hirose, et al.

Humans can robustly follow a visual trajectory defined by a sequence of images (i.e. a video) regardless of substantial changes in the environment or the presence of obstacles. We aim at endowing similar visual navigation capabilities to mobile robots solely equipped with a RGB fisheye camera. We propose a novel probabilistic visual navigation system that learns to follow a sequence of images with bidirectional visual predictions conditioned on possible navigation velocities. By predicting bidirectionally (from start towards goal and vice versa) our method extends its predictive horizon enabling the robot to go around unseen large obstacles that are not visible in the video trajectory. Learning how to react to obstacles and potential risks in the visual field is achieved by imitating human teleoperators. Since the human teleoperation commands are diverse, we propose a probabilistic representation of trajectories that we can sample to find the safest path. Integrated into our navigation system, we present a novel localization approach that infers the current location of the robot based on the virtual predicted trajectories required to reach different images in the visual trajectory. We evaluate our navigation system quantitatively and qualitatively in multiple simulated and real environments and compare to state-of-the-art baselines.Our approach outperforms the most recent visual navigation methods with a large margin with regard to goal arrival rate, subgoal coverage rate, and success weighted by path length (SPL). Our method also generalizes to new robot embodiments never used during training.


page 1

page 3

page 9

page 14


Deep Visual MPC-Policy Learning for Navigation

Humans can routinely follow a trajectory defined by a list of images/lan...

VRChain: A Blockchain-Enabled Framework for Visual Homing and Navigation Robots

Visual homing is a lightweight approach to robot visual navigation. Base...

Qualitative vision-based navigation based on sloped funnel lane concept

A new visual navigation based on visual teach and repeat technique is de...

Predicting Topological Maps for Visual Navigation in Unexplored Environments

We propose a robotic learning system for autonomous exploration and navi...

Vision-Based Goal-Conditioned Policies for Underwater Navigation in the Presence of Obstacles

We present Nav2Goal, a data-efficient and end-to-end learning method for...

Scaling Local Control to Large-Scale Topological Navigation

Visual topological navigation has been revitalized recently thanks to th...

I Introduction

Imagine you are tasked with reaching a location described by an image in a novel environment. This task can range from difficult to impossible, depending on the size and aspect of the environment. Now imagine you are given the same task but also a sequence of images depicting what you would see if you would navigate correctly to the goal location. This is a task that most humans are able to perform based only on visual information. In this work we propose a mechanism to enable robots to reach a destination by following a sequence of images.

Previous approaches have studied the problem of robot navigation following a sequence of images [8, 9, 25]. However, most of these methods suffer if the environment changes from when the sequence of images was acquired, or if they have to deviate from the path to avoid a risk. Compared to robots, most humans can follow a visual trajectory even if the environment changed significantly due to changes in illumination, objects that moved or people or other dynamic agents that are placed differently in the scene [35]. They can also robustly deviate from the path to avoid obstacles and come back later. To do that, cognitive scientists argue that humans perform mental operations on predictive visual models [11, 4]. Taking inspiration from these insights we developed a navigation algorithm that generates virtual future images based on possible navigation strategies, and matches these images to the current view so as to find the best path to move safely. Concretely, our method will search for optimal navigation velocities to progress along a sequence of images towards a goal, even when there are significant changes in the environment, while keeping the robot safe from colliding with obstacles and other risks in the environment.

Fig. 1: Our method allows the robot to navigate towards a goal given by an image using bidirectional image prediction from both A) current image and B) the goal image . The bidirectional prediction avoids large obstacles. A probabilistic representation of the distribution of human teleoperation demonstrations (, red curves) is used to sample multiple collision avoidance trajectories, and a selection module chooses the optimal trajectory.

Previous approaches have proposed comparable visual predictive models to find optimal commands or navigation and manipulation [14, 13, 20, 16, 23]. These approaches have shown that predicting the far future is generally difficult and often inaccurate, due to the increasing uncertainty in future states introduce by (stochastic) forward models. In the case of visual prediction, the predicted images often grow blurry with each step [15] hindering the search of an appropriate robot command. Furthermore, when an obstacle occludes the view, it may be difficult to recognize the direction to move to reach the next image in the sequence. We propose to take inspiration from classical planning [27], and perform bidirectional search, not in Cartesian but in image space. The main idea of bidirectional search is to perform the search from both the initial and the goal locations until a connection is found [26]. Our bidirectional image prediction is inspired by this method, and aims to improve search efficiency in image space while alleviating the deterioration in the images due to long-horizon predictions.

To learn to avoid obstacles and come back to the visual trajectory, we propose to imitate human demonstrations. We asked human teleoperators to guide a mobile robot while we collected images. This allowed us to learn the correlation between obstacles in the image and navigation strategies and distill them into a model-predictive architecture. Since the navigation strategies from humans present large variability (e.g., between aggressive and conservative teleoperators, or when multiple ways to avoid an obstacle are possible) we encode the possible trajectories with a probabilistic representation. This representation can be used to sample multiple navigation strategies towards a visual goal and select the best one, based on a visual-based traversability method [18].

The main contributions of our work are summarized as follows:

  • [leftmargin=*]

  • We propose a novel bidirectional image prediction method to virtually simulate possible velocities both from current and desired visual states enabling avoiding large obstacles and improving quality of the visual predictions.

  • We developed a probabilistic representation of possible future trajectories learned from imitating human teleoperators.

  • We combined the bidirectional predictions and the probabilistic representation into a visual model predictive architecture that allows the robot to navigate towards a goal, following a sequence of images even under significant changes between the visual trajectory and the current environmental conditions.

We evaluate our method and compare it with various baselines in a simulator and the real world. In total, we performed 16000 trials for about 572 hours in the simulator and 80 trials in the real environment. We test the robot’s behavior during navigation in challenging environments containing various static obstacles, some that were not present when the visual trajectory was generated, as well as dynamic obstacles. Moreover, to validate the capability of our method to transfer between different robot embodiments, we tested on a mobile robot that was never used to collect training data. Even in these conditions, our method performs well as the robot can successfully reach the goal. The performance of the mobile robots during the experiments can be seen in the supplementary video.

Ii Related Works

In the domain of visual navigation, the methods can roughly be categorized into model-based visual navigation and learning based visual navigation, and we summarize works from both categories as below.

Model-Based Visual Navigation: Researchers have proposed solutions to visual navigation based on visual servoing and visual SLAM. Visual servoing [21, 8, 9] controls an agent to minimize the difference between current state and a goal state. As the difference is defined in image space, the performance suffers when the environment changes or large obstacles occlude a large parts of the environment. Differently, our method copes with large obstacles using bidirectional predictions.

Navigation methods based on visual SLAM [30, 24] first construct a map using camera images that the robot can use to localize and compute actions to achieve the goal. The success of visual SLAM-based methods relies on acquiring an accurate metric model and their performance decay due to failures in the mapping. Our method does not rely on any metric map and only requires an input trajectory defined by a sequence of images.

Learning-Based Visual Navigation: Recently, multiple learning-based approaches have been proposed for visual navigation. Topological visual navigation [33, 10, 28]

involves localization and planning from a topological representation of the environment that represents the connectivity between regions. Latest advances of reinforcement learning 

[36, 29, 38, 22]

and imitation learning 

[12, 32, 23] have also pushed the state of the art in visual navigation. We will focus the discussion on the two families of learning-based methods that are closest to ours, visual predictive control methods and probabilistic control methods.

Visual MPC is used for manipulation and navigation. Finn and Levine [16]

proposed a visual MPC approach using an image-based predictive neural network model.

Pathak et al. [31] proposed visual navigation with one-step image prediction from raw camera images. Hirose et al. [20] proposed to train control policies for navigation by minimizing of the visual MPC objective while penalizing untraversable areas. In these works, the predictive horizon is relatively short and thus, they cannot deviate far from the original path to avoid obstacles. Compare to these methods, our method is also computationally lighter and can be used to online navigation.

In the navigation domain, Drews et al. [13, 14] proposed a navigation system with image-based localization using a map and a control method based on non-visual MPC with sampling-based optimization for a small race car. Amini et al. [1]

attempted training a probabilistic model of driver behavior using the Gaussian mixture model and simultaneous localization via a trained control policy. Although

[13, 14] could achieve high speed driving, they did not consider collision avoidance or overtaking of other race cars as dynamic obstacles.

Similar to ours, Kumar et al. [25] proposed using a probabilistic approach for a robot to navigate along consecutive subgoal images between the start and final positions. However, their method shows oscillatory behaviors because their control commands are discrete (turn left, turn right, and go straight), while our method uses continuous control and generates smoother navigation trajectory. Bansal et al. [5] also suggested using the visual MPC approach for navigation and collision avoidance with spline trajectories. Unlike our method, they can not take arbitrary collision avoidance motions and require the ideal goal position in Cartesian space instead of the goal view, which requires some metric mapping.

Compared with the most similar method, Hirose et al. [20], our work applies a probabilistic approach to understand non-deterministic teleoperator’s behavior as the distribution of trajectory instead of single deterministic one. In addition, we propose bidirectional prediction to double the predictive horizon. For the navigation system we also incorporate visual localization to improve the navigation performance.

As a result, our method can follow the visual trajectory to arrive at goal while having large deviation from the visual trajectory for collision avoidance.

Fig. 2: Proposed System: Our method contains three main modules: a subgoal selection module (blue), a control policy module (orange), and a trajectory selection module (green). The subgoal selection module determines the next subgoals to reach in the trajectory and feeds them into the control policy and trajectory selection modules. Each unit in the control policy module generates sequences of velocities that would move the robot towards a subgoal location. The trajectory selection module chooses the optimal sequence of velocities by scoring the reachability and traversability for the subgoals. If none of the generated trajectories is safe enough, the trajectory selection module outputs a pivoting trajectory that helps the robot to relocalize and continue moving.

Iii Method

In this section we present a system to navigate using images so as to follow a pre-recorded visual trajectory. A visual trajectory is defined by a sequence of images (subgoal images, ) ending at a final goal image, . The location where a subgoal image was acquired is called a subgoal location. While the method we present is independent of the type of images and the source of the trajectory, in our implementation (Sec. IV and Sec. V) we will use images captured from a 360 RGB fisheye camera and trajectories collected by teleoperating a robot. Fig. 2 depicts an overview of our proposed probabilistic control system with bidirectional prediction. Our system is composed of three main modules, a subgoal selection, a control policy, and a trajectory selection module, that we explain in the following.

The subgoal selection module (Fig 2, blue, Sec. III-C) stores the visual trajectory and infers the subgoal image, , that was acquire from the subgoal location closest to the current robot location (indicated in red in Fig. 2). The consecutive images after in the visual trajectory, , are passed to both the control policy and trajectory selection modules. Using these consecutive subgoal images, our method is able to transition to later subgoals even when the closest subgoals are occluded by large obstacles not present in the environment when the visual trajectory was acquired. This allows our method to deviate largely from the visual trajectory, executing maneuvers that will reach a further step in the path towards the goal.

To process the consecutive subgoal images, we propose a control policy module composed by units with shared weights (Fig 2, orange, Sec. III-A). All units use as input the current sensor image, , and a different subgoal image . The output of the unit processing is command trajectories , with . A command trajectory is a sequence of consecutive velocity commands, , that executed at a constant time interval would move the robot from its current location towards the location where subgoal image was acquired. The trajectories correspond to samples in the distribution of possible trajectories towards the subgoal, , learned from a dataset of human teleoperation maneuvers.

Finally, the trajectory selection module (Fig. 2, green, Sec. III-B) chooses the best command trajectory to execute, i.e., the best subgoal image to reach and the best sample trajectory to move towards it. The selection is based on the reachability of the subgoal and the traversability (safety) of the path. If none of the generated trajectories is safe enough, the selection module outputs a pivoting trajectory that helps the robot to relocalize and continue the navigation. The trajectory is passed to a controller that executes the first velocity command, and the inference process starts again with a newly acquired sensor image . In the following, we will describe these three modules in detail.

Iii-a Control Policy Module

The control policy module generates control velocity commands to move the robot. Instead of just generating possible velocities to reach the next subgoal location, our module attempt to reach the next subgoal locations using the current image and the consecutive subgoal images. This allows our method to avoid large obstacles that could be placed on a subgoal location by transitioning to the next one.

The control policy module is composed of units, each one considering a different subgoal image . These units have the same structure, and share the same weights. The units are recurrent: this allows them to generate multiple steps of commands (a trajectory) conditioned on the previous steps. The -th unit in the control policy module receives the -th subgoal image and the current image and generates trajectories (in the form of temporal sequences of velocities) towards it. This design is equivalent to a flexible predictive planning horizon allowing to avoid obstacles in the visual path by reaching the next subgoals.

Instead of a single sequence of velocities towards a subgoal, each unit generates alternative sequences. These alternatives are sampled from a probabilistic predictive model that learns to map images to possible paths as demonstrated by human teleoperators in the datasets we use for training (see Sec. IV). In this way, our model proposes different strategies to reach each subgoal image, as learned from human demonstrations. The entire control policy module with units generates in total sequences of velocities.

The -th generated velocity sequence for the -th subgoal image is defined as , and the sequence is long. are the velocity commands, pairs of linear and angular velocity, for the wheeled robot. The sequence of velocities are planned to be executed at a constant time interval .

The velocity commands are generated iteratively by a LSTM. To generate

we feeding into the LSTM unit the visual features extracted from the subgoal image and the current camera image,

, concatenated with the previous output velocity command

. However, as explained before, in order to generate multiple possible trajectories, the output of the LSTM is not a single velocity command, but a probability distribution of possible commands. We assume the distribution to be Gaussian: the LSTM outputs mean and covariance of the distribution of possible next commands (linear and angular velocity) given previous command, current and subgoal images:


Our method samples the distribution to get the linear and angular velocities for the -th step.


We iterate Eqn. 1 and Eqn. 2 times to obtain a sample trajectory of velocities, , towards the -th subgoal image.

The question now is how to train the model to output feasible trajectories of velocities to reach each subgoal image. We propose to learn from human teleoperation commands the best strategies to navigate to subgoal based on images. However, given the high-dimensionality of the image inputs, the training process would benefit of additional supervision. We propose to learn to output velocity commands by optimizing a combination of image loss, smoothness loss and trajectory likelihood loss. While the trajectory likelihood loss is the common loss to learn to imitate demonstrations, the two other losses provide supervision on the image and velocity spaces.

Thus, during training, and are optimized by minimizing the following cost function:


where is a bidirectional image loss, is a trajectory likelihood loss, is a smoothness loss, and are weighting factors to balance the objectives. We explain the loss design in the following paragraphs.

Iii-A1 Bidirectional Image Loss

The basic strategy to reach the subgoal involves moving the robot to a location where it can obtain an image similar to that of the subgoal image. However, to avoid collisions, our method needs to be able to deviate from the original path. A longer predictive horizon may help to plan larger deviations for collision avoidance. However, a longer predictive model will deteriorate the predictive performance, consequently hurting the search for optimal velocity and thus navigation performance, as shown in the experimental section.

We propose to apply the following bidirectional image loss:


where denotes the total pixel count of the image and denotes predicted images conditioned on from the current camera image .


Here, is the pretrained VUNet-360 neural network [19, 20]. in (Eqn. 4) denotes the images predicted using the subgoal image , as follows:


where the subgoal image are assumed to be the image in (Eqn. III-A1). Note that the chronological order and signs of the velocities are inverted because the chronological order of the predicted images is also inverted in (Eqn. III-A1). Using the subgoal image as , we can double the predictive horizon(= ) without deterioration in image prediction because the subgoal image can accept geometric and appearance information, which is occluded from . In addition, is designed under the assumption . As it is difficult to give a hard constraint to the optimization process during training, we give a soft constraint by setting a relatively large .

Iii-A2 Trajectory Likelihood Loss

The teleoperator’s velocity commands recorded in the dataset contain rich information about navigating the robot while avoiding collisions, e.g. slowing down when going through narrow passages, slowing down when detouring around obstacles. Such rich, view-specific velocity commands can be leveraged in training navigation policies. Hence, deterministic velocity commands cloning using the mean squared error (MSE) objective cannot make meaningful use of the teleoperator’s velocity commands for navigation.

Unlike deterministic behavior cloning, our method applies the following trajectory likelihood loss, which is the negative log likelihood loss function defined over samples of teleoperator trajectory and our virtual trajectory generation model Eqn. 



where is the mean robot pose at the -th step, which is calculated using the following equation and the mean of the generated velocity .


The wheeled robot travels on a 2D plane. Hence, is defined by the robot position and yaw angle . In addition, and can be written as and , respectively.

is the virtual robot trajectory using the sampled velocity according to the distribution in (Eqn. 2). is the robot trajectory controlled by the teleoperator velocity . The covariance matrix of can be defined as follows:


where is the Jacobian of (Eqn. 8) for , and is the constant diagonal matrix. The relationship between positions , , and are illustrated in the supplemental material. By minimizing , the control policy can generate velocity commands that roughly follow the distribution of the teleoperator’s command.

Iii-A3 Smoothness Loss

Continuous velocity is required for the actual control of the robot so that the motion is not jerky. We use as the objective to suppress discontinuous velocity commands. is designed as the sum of the mean squared error between the linear and angular velocities of consecutive steps, .

Iii-B Trajectory Selection Module

This module selects the best policy from the sets of sequences of velocities generated by the control policy module to control the robot. In the trajectory selection module, units are used to evaluate the velocities generated by the control policy module.

The trajectory selection module measures the score for all generated velocities and selects the minimum value . Here, is the optimal to minimize . is a reachability score for reaching the subgoal, and is a traversability score, which penalizes travelling in risky areas. In our method, a smaller score indicates a better path.

Iii-B1 Reachability Score

To ensure the reachability to the position of the subgoal image, the generated velocity vectors need to satisfy the condition

, because we use as in bidirectional prediction. is calculated as the similarity of and :



is a neural network with seven convolutional layers using batch normalization and the leaky ReLU function to extract the feature.

Iii-B2 Traversability Score

The velocity sequences corresponding to the untraversable path should be excluded as candidates to ensure the collision free motion. In order to penalize the untraversable path, we define as follows:



is the estimated traversable probability by feeding the representative predicted image

(namely, the front side of the predicted image at the -th step from the current image view) into the neural network GONet [18]. must be substantially larger than the reachability score to prioritize avoiding collision.

Hence, can be used to effectively remove the untraversable trajectories. We only choose one predicted image for the traversability estimation to reduce the computational load for online implementation.

In addition to the above selection process, we override the selected velocity with a pivot turning motion if the estimated traversable probability at the current robot pose is less than 0.5 to continue moving. The detail explanation is in the supplemental material.

Iii-C Subgoal Selection Module

The subgoal selection module selects the closest subgoal image from all subgoal images of the visual trajectories to feed the subsequent subgoal images into the control policy and trajectory selection modules. One practical approach to decide is to measure the similarity between and all the subgoal images such as  [31]. However, we do not want to increase the computational load as the system needs to run at real-time. Comparing with each subgoal will increase the computational cost of localization as the number of subgoals increases.

Instead of measuring the image similarity, we use the virtual velocities generated by the control policy module to decide

. Under the assumption that , the robot pose at the 2-th step estimated by approximately equals the pose of the subgoal image on the robot coordinate. We simply provide the thresholds for the distance and yaw angle to decide , as shown on Algorithm 1. Here, is the previous , is the threshold for the distance, and is the threshold for the yaw angle.

Input: generated velocities , previous subgoal image index
Output: new subgoal image index

1:  for  do
2:     for  do
3:         roll out to get
4:         if  then
5:             /* choose farthest subgoal within */
6:         end if
7:     end for
8:  end for
9:   /* For next process */
10:  return  
Algorithm 1 Subgoal Selection Module

Iv Training

There are trainable elements (neural networks) in the control policy module and in the selection module. To train them, we use the GO Stanford dataset 4 (GS4) [17]. GS4 was collected by teleoperating a TurtleBot2 mobile robot equipped with a RGB 360 camera. The dataset contains synchronized video sequences from the 360 camera and teleoperator’s commanded velocities when navigating in twelve buildings at the Stanford University campus for 10.3 hours. We chose eight buildings for training, two buildings for validation, and two other buildings for testing. In addition to real world data, GS4 contains data from the Gibson Simulation Environment [37], a photo-realistic simulator used to train navigation agents using models from Stanford 2D-3D-S [3] and Matterport3D [7]. The simulator data in GS4 was collected by teleoperating a simulated robot with a virtual RGB 360 camera in 36 different reconstructed buildings for a total of 3.6 hours. We separate the simulator data to use data from 26, 5, and 5 buildings for training, validation, and testing, respectively. When training the neural network models, we randomly mix the data from the real environment and the simulator to achieve better generalization.

Iv-a Training the Control Policy Module

To train and in the control policy module, we randomly choose two images from the same trajectory of the dataset, and , separated by steps (), so that . We assume the first image to be the current image, and the second image to be each of the subsequent subgoals, for . We resize the image to 128 128 pixels and feed them into to generate a feature, . We randomly sample with

according to the Gaussian distribution

of (Eq. 2) during the training.

is provided by the teleoperator’s velocity at the same time as to calculate the trajectory likelihood loss . We set 0.0 for if . Moreover, the initial pose, , , and are defined as zeros to calculate the trajectory on the robot coordinate.

Iv-B Training the Trajectory Selection Module

is the neural network used to extract the feature, and in (Eqn. 10) should take a bigger value when the Euclidean distance between the corresponding predicted images is bigger. We train using triplet loss [34]. First, we calculate the trained control policy module in the same manner as that used for training the control policy module. We generate predicted images conditioned on the obtained virtual velocities using (Eqn. 5) and (Eqn. III-A1). Then, we randomly choose one anchor image and two images from the predicted images. One of the images, which is closer to the anchor image on the Euclidean coordinate of robot position, is set as the positive image, and another is set as the negative image to calculate the triplet loss. Note that the Euclidean distance of robot position between two predicted images is calculated by using the virtual velocities generated by the control policy module. By iteratively updating to minimize the triplet loss, we train to extract the feature for the selection module.

Iv-C Parameters

The horizon is set as 8, which is same as one baseline [20], for a fair comparison. Hence, the control and predictive horizons of our method are 5.328 s (). The maximum predictive range can be 2.666 m and 5.328 rad, because the maximum linear and angular velocities are designed to be 0.5 m/s and 1.0 rad/s, respectively. In , the weighting factor is set as 1.0 except when . is used to give the soft constraint for . The weighting factors to balance each objective in are designed as and after some trial and error.

For the trajectory selection module, for is designed as 100.0 () to penalize the untraversable velocity sequence. Threshold values and in the subgoal selection module are set as 0.8 m and 0.4 rad, respectively. Larger and can improve the robustness of reaching the position of the subgoal image and avoiding collisions, because we can check the reachability and traversability for more cases ( cases). However, the calculation cost will be heavier for larger and . Hence, both and take the design value of 3 at the inference time although is used for training. We select three velocity vectors in (Eqn. 2) by repeatedly selecting the mean point and the 2 sigma point for 2 steps. The details about how to efficiently choose the three velocity vectors for collision avoidance at inference time are explained in supplemental material.

V Experiments

We conduct three sets of experiments to evaluate our method in a simulated and real environments. First, we conduct ablation studies in simulation to investigate the strengths and limitations of the three main novel components of our method: probabilistic control, bidirectional prediction, and subgoal selection based on the generated virtual velocities. Then, we validate our method by comparing its navigation results in simulation against those of state-of-the-art baseline methods that learn to follow subgoal images in a visual navigation. Finally, we test our method and the baselines on a real physical mobile robot, including one different to the one used to collect the training dataset. This robot platform presents a different motion dynamics, inertia, size, weight, camera position, …With this test we evaluate the generalizability and robustness of our method among different embodiments. We also include a qualitative and visual analysis of the trajectories followed to provide insights of the behavior of our method. In all our experiments, we evaluate the capacity of the navigation methods to move along a visual trajectory acquired before via teleoperation (in simulation or real world), and whether they can deviate and come back to the trajectory when there is an obstacle in the path that was not present during the collection of the visual trajectory.

goal arrival rate subgoal coverage rate SPL [2]
+P 0.728 0.445 0.877 0.236 0.725 0.445
+B 0.729 0.444 0.881 0.237 0.726 0.444
+S 0.709 0.454 0.861 0.255 0.709 0.454
+P+B 0.787 0.409 0.889 0.224 0.784 0.409
+P+S 0.777 0.416 0.892 0.232 0.772 0.415
+B+S 0.731 0.443 0.870 0.253 0.729 0.443
+P+B+S (our method) 0.807 0.394 0.905 0.222 0.803 0.394
TABLE I: Ablation studies studying the contribution of different sets of components: (P) Probabilistic control, (B) bidirectional prediction, and (S): subgoal selection (mean standard deviation)

V-a Ablation Studies

First, we evaluate the contribution of each of the three novel components of our method –probabilistic control, bidirectional prediction, and subgoal selection– through ablation studies in novel simulated environments not seen during training. In order to do that, we evaluate different combinations of these components and our proposed method with the three of them.

method step    goal arrival rate   subgoal coverage rate        SPL [2] mean calculated rate [Hz]
open loop 0.016 0.125 0.371 0.213 0.016 0.123
Imitation learning (IL) 8 0.346 0.476 0.695 0.263 0.342 0.471 13.703
16 0.208 0.406 0.632 0.245 0.196 0.385 10.068
Zero-shot visual imitation (ZVI) [31] 8 0.370 0.483 0.695 0.271 0.364 0.475 13.726
16 0.127 0.333 0.562 0.250 0.102 0.273 10.059
Stochastic optimization (SO) [16] 8 0.611 0.488 0.793 0.294 0.610 0.487 0.585
NOT calculated in real time 16 0.467 0.499 0.734 0.290 0.466 0.498 0.341
Deep visual MPC (DVMPC) [20] 8 0.700 0.458 0.850 0.264 0.700 0.458 13.498
16 0.469 0.499 0.765 0.273 0.468 0.498 9.967
Our method 16 0.807 0.394 0.905 0.222 0.803 0.394 5.342
TABLE II: Comparison with baseline methods. Navigation results with our method and the baseline methods in 10 environments and 1000 times each (Table shows mean standard deviation, speed is benchmarked on Nvidia RTX 2080 Ti GPU)

Detailed parameters and conditions the probabilistic control, bidirectional prediction, and subgoal selection are included in the supplementary material. For each combination of components, we evaluate the performance in the same 100 trajectories in 10 different environments and measure the goal arrival rate, subgoal coverage rate, and success weighted by path length (SPL) [2]. We define the goal region as a circle of around the goal location.

We evaluate the navigation performance of our method and the ablated variants to avoid an obstacle placed in the visual path. We randomly choose meshes from the YCB dataset [6] and scale it to 1, 3, 5, or 7 times the original size. If the agent collides with the obstacle or with the environment, we stop the trial and consider it finished (failed). To place the obstacle we randomly select a subgoal location in the visual trajectory, excluding the final subgoal or subgoals in very narrow areas, e.g., corridors or doors. Obstacles in very narrow areas would make impossible to navigate and reach the goal. We place the obstacle randomly around the selected subgoal, at a distance randomly picked among , , , or . The robot does not start at the exact initial location of the visual trajectory, but a distance to it chosen randomly between , , , and . We simulate realistic random noise in the controller execution by adding slippage. Per trajectory, we select randomly a slippage ratio between 0.0, 0.1, 0.2 and 0.3, that indicate the ratio of motion that is lost due to slippage. For example, a ratio of 0.1 implies that if the agent commands , the environment will only execute . This emulates different types of floors.

Table I includes the results (mean and the standard deviation) of the evaluation of our method and the ablated variants. The results indicate that only the combination of the three novel components of our approach (probabilistic control, bidirectional prediction, and subgoal selection) achieve more than 80% of goal arrival rate and the highest subgoal coverage and SPL. Results show all components are useful in terms of improving the navigation performance. With all components enabled, robot with our method navigates to the goal region with the highest goal arrival rate of 0.807.

V-B Comparison to Baselines in Simulation

Next, we compare the navigation performance of our method to the performance of state-of-the-art baselines approaches for visual trajectory following. To remove the advantage in our method provided by using a panoramic 360 camera, we retrain all the baselines using also 360 images from the GS4 dataset [17]. We evaluate the baselines in the same trajectories we generated for the ablation studies of Sec. V-A.

In our method, the predictive horizon for each direction is 8, adding up to 16 predictive steps. A fair comparison would be to use a predictive horizon of 16 for the baselines. However, due to their unidirectional prediction, the quality of the predictions degenerates for the baselines and the performance with such a long horizon degrades. We evaluate also a shorter predictive horizon of 8 steps.

We compare our method to the following five baselines:

  1. [leftmargin=*]

  2. Open loop control: The robot moves by directly replaying the teleoperator’s velocities in the simulator in an open loop control fashion.

  3. Imitation Learning (IL):

    We train a convolutional neural network of with the same structure and size as the feature extraction network (

    ) of our method to map images to teleoperator’s velocities by minimizing MSE (behavioral cloning). At the inference time, we query the trained model for robot velocities.

  4. Zero-shot Visual Imitation (ZVI) [31]: We retrain the method by Pathak et al. [31] with 360 images. Similar to our method, ZVI trains the control policy in an offline fashion using the predictive model, and implements the trained model in the online manner.

  5. Stochastic Optimization (SO) [16]: We develop and train a cross-entropy method (CEM), which is a stochastic optimization algorithm similar to the one used by Finn and Levine [16]. In our implementation, we first sample 50 linear and angular velocity vectors and compute the objectives defined for DVMPC [20]. Then, we select the 5 velocities with lowest cost and resample a new set of 50 velocities from a multivariate Gaussian distribution. We iterate 5 times the above process and query the velocities with the lowest cost.

  6. Deep Visual Model Predictive Control (DVMPC) [20]: In DVMPC, Hirose et al. [20] trained offline a control policy to minimize a MPC objective, including a penalization for collision avoidance. This process is less computationally expensive than the online stochastic optimization process by Finn and Levine [16]. However, in DVMPC the predictive model is not bidirectional and the control policy is deterministic.

Table II shows the evaluation results for the baseline methods and the proposed method in simulation, including the computation rate. We observe that a longer predictive horizon (16) degrades the performance of the baselines. This drop results from the increased difficulty of predicting far future commands due to occlusions and unforeseeable effects. Differently, our method with the 16-step predictive horizon (8 on each direction) shows the best performance of all approaches, indicating that the bidirectional prediction approach degrades less than unidirectional predictions. While the computation time of our method is larger than others (except SO), it is fast enough to run online on a real mobile robot on a Nvidia GeForce GTX 1070 at 3 fps.

Fig. 3: Navigation performance wrt. size and location of obstacles: Goal arrival rate as a function of a) the scale of the obstacle and b) distance between the center of the obstacle and the visual trajectory for our method and the baselines (with 8 and 16 predictive horizon). Our method is more robust and can deviate to avoid larger obstacles and obstacles in the middle of the trajectory, still reaching the goal.

Fig. 3 depicts the performance (goal arrival rate) of the methods with respect to the size of the obstacles (left) and the distance to the visual trajectory (right). As expected, all the baseline methods decrease their success rate for larger obstacle scale and lower distance to the trajectory. However, compare to all the baselines, the performance of our method degrades less. In the most difficult cases, with maximum obstacle scale and shortest distance to the visual trajectory our method achieves around 15 to 20% better results than the second best baseline method.

V-C Real Robot Experiments

We evaluate our method in the real world with a physical robot. We implement our method on a laptop with Nvidia GeForce GTX 1070 and mount it on TurtleBot2 with a 360 camera (Ricoh THETA S). We compare our method to the baselines defined in the previous section: open loop control, imitation learning, zero-shot visual imitation, stochastic optimization, and deep visual MPC. In our evaluation we include dynamic obstacles: pedestrians. In each environment, we conduct 10 trials by varying the obstacle position, initial robot position, lighting condition, and pedestrian positions. Table III shows the mean goal arrival rate and subgoal coverage rate in three different environments. Some example trajectories of our method are included in our supplemental video. We observe the same behavior as in simulation: our method results in a better goal arrival rate than the strongest baseline approach, DVMPC with the 8-step horizon.

Case 8.1 m Case 2: 13.7 m Case 3: 9.5 m
DVMPC (8 steps) 0.500  /  0.775 0.500  /  0.688 0.600  /   0.827
Our method 0.800  /  0.892 0.900  /  0.944 0.800  /   0.918
TABLE III: Navigation performance in a real environment (goal arrival rate/subgoal coverage rate)

Finally, we also experiment on a different robot platform which is not used to collect data. The data collection and real-world experiments shown above are performed on a Turtlebot 2 robot. To validate the robustness of our method to different robot embodiments, we mount the same laptop PC on a different built in-house mobile robot (shown in supplementary video) and navigate through a real environment. Compared to Turtlebot 2, this robot has a different mechanical structure, inertia, weight, and sensor position. We conduct 10 trials in 2 environments using this robot. Our method arrives at the goal region with a success rate of 80%.

V-D Qualitative Analysis

Fig. 4 illustrates the navigation behavior of our method in a top-down view of the simulation when the robot is close to a subgoal with (bottom) and without (top) obstacle in the trajectory. The figure includes the three next subgoals, , , and , whose location is depicted on the top-down view, and the optimal trajectory selected by the trajectory selection module of our method.

When there are no obstacles in the path (Fig. 4.a), our method generates sample trajectories to the three next subgoals (red, green and blue dotted lines) from the current robot location. The distribution of the samples is narrow for the closest subgoal and broader for further away subgoals. In this case, our trajectory selection module selects one of the three red dotted trajectories to reach the position of the first subgoal.

In the presence of an obstacle (Fig. 4.b) occluding the second subgoal, our method generates successful motions towards the first and third subgoals (red and blue dotted trajectories). In this case, the trajectory selection module selects a trajectory to the third subgoal ranked with higher score than the trajectories to the first subgoal, even though the third subgoal is not visible from the robot’s location. This is possible thanks to the bidirectional prediction that searches also from the third subgoal backwards toward the robot.

Lastly, Fig. 5 depicts whole robot trajectories as executed by our proposed approach. In the presence of one large-sized obstacle close to the goal (left), our method follows the original visual trajectory in the first part and deviates in the final part from the visual trajectory to avoid the collision and arrive at the goal region. Even when two obstacles are placed on the path (right), our method allows the robot to avoid them and reach the goal destination.

a) without obstacle
b) with obstacle
Fig. 4: Navigation behavior of our method without obstacles (a) and with obstacles (b). (i) Top-down view of the robot (black circle) and sample trajectories to the first (red), second (green) and third (blue) subgoal images. The positions of the upcoming and past subgoals are depicted with circles. (ii) 360 camera images at the current robot position, and first, second and third subgoal images. Without obstacles (a) our method selects the trajectory to transition to the next subgoal, but in the presence of an obstacle (b) it selects the trajectory that allows to go around it.
a) with one obstacle b) with two obstacles
Fig. 5: Robot trajectories with several obstacles. The trajectory of our method (red line) deviates from the original tele-operated trajectory (gray line) to avoid the obstacles (dark gray rectangles) and returns successfully to the path to reach the goal.

Vi Conclusion and Future work

We presented a novel control system to allow a mobile robot to navigate using only one 360 fisheye RGB camera. Our work makes three main contributions: 1) A novel bidirectional image prediction method that extends the image predicting horizon conditioned on possible velocities, 2) A probabilistic representation of velocities that encodes and imitates non-deterministic human demonstrations, and 3) A robust system that combines the bidirectional predictions and the probabilistic representation that allows the robot to navigate towards a goal, following a sequence of images even under significant changes between the visual trajectory and the current environmental conditions. Our experiments indicated that our method follows image trajectories robustly, avoiding unseen large obstacles in both simulated and real environments. We also showed that our method applies to different mobile robot, even when they were not used to collect the human demonstrations of our training dataset. However, our method still suffers from some limitations. First, the robot collides occasionally because of GONet, the network that assesses traversability occasionally generate wrong predictions of traversability. We plan to extend our approach with a long-life-learning procedure to continuously correct those mistakes.


  • [1] A. Amini, G. Rosman, S. Karaman, and D. Rus (2019) Variational end-to-end navigation and localization. In 2019 International Conference on Robotics and Automation (ICRA), pp. 8958–8964. Cited by: §II.
  • [2] P. Anderson, A. Chang, D. S. Chaplot, A. Dosovitskiy, S. Gupta, V. Koltun, J. Kosecka, J. Malik, R. Mottaghi, M. Savva, et al. (2018) On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757. Cited by: §V-A, TABLE I, TABLE II.
  • [3] I. Armeni, S. Sax, A. R. Zamir, and S. Savarese (2017)

    Joint 2d-3d-semantic data for indoor scene understanding

    arXiv preprint arXiv:1702.01105. Cited by: §IV.
  • [4] A. E. Arnold, G. Iaria, and A. D. Ekstrom (2016) Mental simulation of routes during navigation involves adaptive temporal compression. Cognition 157, pp. 14–23. Cited by: §I.
  • [5] S. Bansal, V. Tolani, S. Gupta, J. Malik, and C. Tomlin (2019) Combining optimal control and learning for visual navigation in novel environments. Cited by: §II.
  • [6] B. Calli, A. Singh, J. Bruce, A. Walsman, K. Konolige, S. Srinivasa, P. Abbeel, and A. M. Dollar (2017) Yale-cmu-berkeley dataset for robotic manipulation research. The International Journal of Robotics Research 36 (3), pp. 261–268. Cited by: §V-A.
  • [7] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang (2017) Matterport3d: learning from rgb-d data in indoor environments. arXiv preprint arXiv:1709.06158. Cited by: §IV.
  • [8] F. Chaumette and S. Hutchinson (2006) Visual servo control. i. basic approaches. IEEE Robotics & Automation Magazine 13 (4), pp. 82–90. Cited by: §I, §II.
  • [9] F. Chaumette and S. Hutchinson (2007) Visual servo control. ii. advanced approaches [tutorial]. IEEE Robotics & Automation Magazine 14 (1), pp. 109–118. Cited by: §I, §II.
  • [10] K. Chen, J. P. de Vicente, G. Sepulveda, F. Xia, A. Soto, M. Vázquez, and S. Savarese (2019) A behavioral approach to visual navigation with graph localization networks. arXiv preprint arXiv:1903.00445. Cited by: §II.
  • [11] F. Chersi, F. Donnarumma, and G. Pezzulo (2013) Mental imagery in the navigation domain: a computational model of sensory-motor simulation mechanisms. Adaptive Behavior 21 (4), pp. 251–262. Cited by: §I.
  • [12] F. Codevilla, M. Miiller, A. López, V. Koltun, and A. Dosovitskiy (2018) End-to-end driving via conditional imitation learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–9. Cited by: §II.
  • [13] P. Drews, G. Williams, B. Goldfain, E. A. Theodorou, and J. M. Rehg (2019) Vision-based high-speed driving with a deep dynamic observer. IEEE Robotics and Automation Letters 4 (2), pp. 1564–1571. Cited by: §I, §II.
  • [14] P. Drews, G. Williams, B. Goldfain, E. A. Theodorou, and J. M. Rehg (2017-13–15 Nov) Aggressive deep driving: combining convolutional neural networks and model predictive control. In Proceedings of the 1st Annual Conference on Robot Learning, S. Levine, V. Vanhoucke, and K. Goldberg (Eds.),

    Proceedings of Machine Learning Research

    , Vol. 78, , pp. 133–142.
    External Links: Link Cited by: §I, §II.
  • [15] F. Ebert, C. Finn, S. Dasari, A. Xie, A. Lee, and S. Levine (2018) Visual foresight: model-based deep reinforcement learning for vision-based robotic control. arXiv preprint arXiv:1812.00568. Cited by: §I.
  • [16] C. Finn and S. Levine (2017) Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2786–2793. Cited by: §I, §II, item 4, item 5, TABLE II.
  • [17] GO stanford dataset 4. Note: 2020-01-31 Cited by: §IV, §V-B.
  • [18] N. Hirose, A. Sadeghian, M. Vázquez, P. Goebel, and S. Savarese (2018)

    Gonet: a semi-supervised deep learning approach for traversability estimation

    In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3044–3051. Cited by: §I, §III-B2, §VII-A, §VII-D1.
  • [19] N. Hirose, A. Sadeghian, F. Xia, R. Martín-Martín, and S. Savarese (2019) Vunet: dynamic scene view synthesis for traversability estimation using an rgb camera. IEEE Robotics and Automation Letters 4 (2), pp. 2062–2069. Cited by: §III-A1.
  • [20] N. Hirose, F. Xia, R. Martín-Martín, A. Sadeghian, and S. Savarese (2019) Deep visual mpc-policy learning for navigation. IEEE Robotics and Automation Letters 4 (4), pp. 3184–3191. Cited by: §I, §II, §II, §III-A1, §IV-C, item 4, item 5, TABLE II, §VII-D1, §VII-D3.
  • [21] S. Hutchinson, G. D. Hager, and P. I. Corke (1996) A tutorial on visual servo control. IEEE transactions on robotics and automation 12 (5), pp. 651–670. Cited by: §II.
  • [22] G. Kahn, A. Villaflor, B. Ding, P. Abbeel, and S. Levine (2018) Self-supervised deep reinforcement learning with generalized computation graphs for robot navigation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8. Cited by: §II.
  • [23] K. Kase, R. Nakajo, H. Mori, and T. Ogata (2019) Learning multiple sensorimotor units to complete compound tasks using an rnn with multiple attractors. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Cited by: §I, §II.
  • [24] A. Kim and R. M. Eustice (2013) Perception-driven navigation: active visual slam for robotic area coverage. In 2013 IEEE International Conference on Robotics and Automation, pp. 3196–3203. Cited by: §II.
  • [25] A. Kumar, S. Gupta, D. Fouhey, S. Levine, and J. Malik (2018) Visual memory for robust path following. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 765–774. External Links: Link Cited by: §I, §II.
  • [26] S. M. LaValle and J. J. Kuffner Jr (2001) Randomized kinodynamic planning. The international journal of robotics research 20 (5), pp. 378–400. Cited by: §I.
  • [27] S. M. LaValle (2006) Planning algorithms. Cambridge university press. Cited by: §I.
  • [28] X. Meng, N. Ratliff, Y. Xiang, and D. Fox (2019) Scaling local control to large-scale topological navigation. arXiv preprint arXiv:1909.12329. Cited by: §II.
  • [29] D. Mishkin, A. Dosovitskiy, and V. Koltun (2019) Benchmarking classic and learned navigation in complex 3d environments. arXiv preprint arXiv:1901.10915. Cited by: §II.
  • [30] R. Mur-Artal and J. D. Tardós (2017)

    Orb-slam2: an open-source slam system for monocular, stereo, and rgb-d cameras

    IEEE Transactions on Robotics 33 (5), pp. 1255–1262. Cited by: §II.
  • [31] D. Pathak, P. Mahmoudieh, G. Luo, P. Agrawal, D. Chen, Y. Shentu, E. Shelhamer, J. Malik, A. A. Efros, and T. Darrell (2018) Zero-shot visual imitation. In

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops

    pp. 2050–2053. Cited by: §II, §III-C, item 3, TABLE II.
  • [32] A. Pokle, R. Martín-Martín, P. Goebel, V. Chow, H. M. Ewald, J. Yang, Z. Wang, A. Sadeghian, D. Sadigh, S. Savarese, et al. (2019) Deep local trajectory replanning and control for robot navigation. arXiv preprint arXiv:1905.05279. Cited by: §II.
  • [33] N. Savinov, A. Dosovitskiy, and V. Koltun (2018) Semi-parametric topological memory for navigation. In International Conference on Learning Representations, External Links: Link Cited by: §II.
  • [34] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu (2014) Learning fine-grained image similarity with deep ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1386–1393. Cited by: §IV-B.
  • [35] R. F. Wang and E. S. Spelke (2002) Human spatial representation: insights from animals. Trends in cognitive sciences 6 (9), pp. 376–382. Cited by: §I.
  • [36] E. Wijmans, A. Kadian, A. Morcos, S. Lee, I. Essa, D. Parikh, M. Savva, and D. Batra (2020) DD-ppo: learning near-perfect pointgoal navigators from 2.5 billion frames. In International Conference on Learning Representations, External Links: Link Cited by: §II.
  • [37] F. Xia, A. R. Zamir, Z. He, A. Sax, J. Malik, and S. Savarese (2018) Gibson env: real-world perception for embodied agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9068–9079. Cited by: §IV, §VII-E.
  • [38] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi (2017) Target-driven visual navigation in indoor scenes using deep reinforcement learning. In 2017 IEEE international conference on robotics and automation (ICRA), pp. 3357–3364. Cited by: §II.

Vii Supplemental Material

Vii-a Overriding in Trajectory Selection Module

Our method can detect the obstacle and try to avoid it by considering the traversability score based on GONet in the trajectory selection module. However, the traversability score occasionally fails for dynamic obstacles, e.g., pedestrians. If a pedestrian suddenly appears in front of the robot, it is difficult for the robot to find the direction for the subgoals because there is insufficient space between the robot and the obstacle to take collision-free motion. Moreover, adjacent obstacles occlude the appearance of the subgoal image from the current camera view. The trajectory selection module in Fig. 2 overrides the selected velocity to avoid the suddenly appeared obstacle as follows.


where , and traversable probability at the current position can be estimated by GONet[18].


The controller above makes the robot turn in place for at least three steps, corresponding to 90 degrees of rotation, when the area in front of the robot is untraversable. The turning direction of the robot is decided by the sign of , which is the yaw angle of the robot at the 2-th future step on the robot coordinates. Via this override, the robot tries to find the traversable area and capture the same appearance in the subgoal image.

Further, this controller limits the maximum linear velocity for the safety experiment at inference time. is limited within 0.2 m/s and is set as to keep the original turning radius when .

Vii-B Illustration of trajectory likelihood

Figure 6 indicates the relationship between positions , , and and velocities , , and to illustrate the trajectory likelihood. To learn the distribution of the teleoperator’s velocity commands recorded in the dataset, our control policy generates the virtual mean robot pose and its covariance matrix to fit with samples of the teleopeartor trajectory .

Vii-C Velocity Vector Selection

Our method provides a distribution of velocity at each step. At inference time, we generate several candidate velocities to select optimal velocities in the selection module. However, the calculation cost to generate and evaluate a larger number of velocity vectors will be heavier and will impede the online calculation in the real robot. Hence, we select only three velocity vectors , with for each subgoal to perform our method online. To obtain a wide variety of trajectory by three velocity vectors to avoid collision, we sample the following velocity at step according to and in (Eqn. 1).


where is the square root of , which is calculated by a solution such as a Cholesky decomposition. Here , and are the mean, sigma-points, and sigma-points of the velocity distribution, respectively. Note that only the sigma-points corresponding to the angular velocity are used, because the sigma-points corresponding to the linear velocity mostly cannot provide meaningful candidates for collision avoidance. We consider that by taking a wide candidate in the angular velocity direction, the possibility of avoiding large obstacles would increase.

Fig. 6: Illustration of trajectory likelihood. Relationship between positions , , and and velocities , , and .

As shown in Fig. 7, we repeatedly calculate (Eqn. VII-C) for steps to obtain three velocity vectors , with . Despite the mean and distribution of velocity at the first step being the same for all velocity vectors (), the three velocity vectors can cover a wide range of space to avoid large obstacles by repeatedly selecting the mean and 2 sigma-points for 2 steps in (Eqn. VII-C).

Fig. 7: Velocity vector selection at inference time. Double circles, red dots, and blue dots are position by mean velocity , sigma-point velocity , and sigma-point velocity , respectively. Grey, red, and blue lines indicate trajectories generated by velocity vectors , , and , respectively.
(P) (B) (S)
virt. vel. pix. diff.
TABLE IV: Condition of cost value and visual localization approach in ablation study

Vii-D Alternative Method in Ablation Study

In an ablation study, we evaluate all different sets of (P) probabilistic control, (B) bidirectional prediction, and (S) subgoal selection based on virtual velocities. In the main paper, we omitted the detailed explanation of alternative methods where not all components are used.

Vii-D1 Without probabilistic control (P)

The method without (P) has a deterministic control policy, which does not generate multiple velocities by setting . Here is the number of subgoal images fed into the control policy, and is the number of generated virtual velocity vectors in the control policy. Hence, there is no trajectory selection module in the method without (P) because we have no choices from which to select the best one. As a result, we remove the trajectory likelihood loss from and add the following simple imitation loss .


The minimization of has the effect of generating continuous virtual velocity, because the teleoperator’s command is continuous. Hence, we remove when giving . By minimization of cost value with , we train the model of control policy. At inference time, we calculate the trained model to generate the deterministic virtual velocity with .

In contrast, the method without (P) cannot check the traversability, because there is no trajectory selection module. To overcome this unfair condition, we give additional traversability loss using GONet[18] when training the control policy like [20].


Here is the kernelized traversable probability as . The kernelization helps the optimization to generate a virtual velocity that avoids areas of traversability smaller than [20].

Vii-D2 Without Bidirectional Prediction (B)

To avoid using bidirectional prediction, we train the control policy with the following image loss instead of bi-directional image loss in for the method without (B).


Here, except and are predicted by VUNet-360 as shown in (Eqn. 5). calculates the pixel difference between the subgoal images and the predicted image . By minimization of with , the method without (B) can go to the position of subgoal . In addition, we choose for the method without (B), because the method of is better than that of from Table II and Fig. 3.

Vii-D3 Without Subgoal Selection Based on Virtual Velocity (S)

The method without (S) evaluates the pixel difference between the current image view and the subgoal image to decide whether the robot is close to the subgoal or not. The smaller pixel difference roughly means that the Euclidean distance between the robot position and subgoal position is shorter. The following equation is the condition to update the subgoal to the next one.


Here is the same threshold value as [20]. The alternative method of (S) used in the ablation study is computationally as light as our method using virtual velocity.

Table IV summarizes the cost value to train the control policy and selected method of (S) for all different sets of (P), (B), and (S) in the ablation study.

Vii-E Examples of simulation environment

Figure 8 shows three examples of simulation environment[37]. Left photo depicts top view of the environment and visual trajectory. Right photo depicts six subgoal image examples selected from all subgoals on the visual trajectory. Map in left photo is taken by slicing the simulation environment at a certain height. Hence, some furniture are not appeared on the map. From Fig. 8, We can understand that our method needs to pass through narrow doors and corridors to go to the goal position. It means that environment even without obstacle occludes the appearance of the subgoal image view from the camera image view during navigation. In that situation, bidirectional prediction can help to go through the narrow area. The effectiveness of our method in the environment without presence of many obstacles is also shown in Fig.3. Our method shows small advantage against strongest baseline at = 1.0 and 1.5, where the obstacle is far from the subgoal image position, and achieve 0.992 goal arrival rate at = 1.0 and 1.5.

case A : path length 8.6 m
case B : path length 7.3 m
case C : path length 11.9 m
Fig. 8: Examples of simulation environment and rendered images. Left photo depicts top view of the environment. Grey line is the visual trajectory, and grey dots on the grey line are the position of the subgoals. Right photos depict 6 subgoal image examples by 360 degree camera.

Vii-F Overview of different mobile robot

Figure 9 indicates the overview of the in-house prototype mobile robot, which was not used to collect the dataset. This robot has a different mechanical structure, inertia, weight, sensor position, and so on compared to TurtleBot 2. To validate the capability of our method, we test on this robot in the evaluation section. Our method works well on the new robot as well, because it arrives at the goal region with a success rate of 0.8.

Fig. 9: Overview of the second prototype robot. [a] shows the front view of the robot, [b] shows its back view. Unlike TurtleBot2, this robot has two front wheels and one rear omnidirectional wheel. We use the same laptop PC with the new robot and place the 360-degree camera in the front.