Reinforcement Learning based Visual Navigation with Information-Theoretic Regularization

We present a target-driven navigation approach for improving the cross-target and cross-scene generalization of visual navigation. Our approach incorporates an information-theoretic regularization into a deep reinforcement learning (RL) framework. First, we present a supervised generative model to constrain the intermediate process of the RL policy, which is used to generate a future observation from a current observation and a target. Next, we predict a navigation action by analyzing the difference between the generated future and the current observation. Our approach takes into account the connection between current observations and targets, as well as the interrelation between actions and visual transformations. This results in a compact and generalizable navigation model. We perform experiments on the AI2-THOR framework and the Active Vision Dataset (AVD) and show an improvement of at least 7.8% in SPL over the supervised baseline in unexplored environments.



1 Introduction

Visual navigation is one of the basic components of autonomous robotic systems that perform a large variety of tasks in complex environments. It can be characterized as the ability of an agent to understand its surroundings and navigate efficiently and safely to a designated target solely based on the input from on-board visual sensors [52, 46, 43]. This process includes two key components, which have received much research attention for a long time. First, the agent should be able to analyze and infer the most relevant parts from the current observation and the target, to guide the current decision. Second, the agent should understand the correlation and causality between navigation actions and the changes in the surroundings resulting from the action. Despite significant progress in visual navigation [14, 36, 44, 51], the generalization to novel targets and unseen scenes is still a fundamental challenge.

Most existing robotic navigation approaches tend to perform 3D metric and semantic mapping before path planning and control [10, 42, 2, 50, 37, 6]. Such approaches are sensitive to noisy sensory inputs and changes in the environment. Recently, there has been increased interest in predicting navigation actions directly from pixels using end-to-end learning-based approaches, including imitation learning [31, 34, 40] and reinforcement learning [17, 23, 46]. In these tasks, CNNs are typically trained to extract features from observation images, and fully connected layers map the features to a probability distribution over actions. Thus, these models can learn navigation behaviors by leveraging extensive navigation experience in similar environments. However, most of these approaches are based on carefully designed architectures and have only been demonstrated to perform well in simple synthetic scenes [25, 29, 11]. The latest work [43] proposes a self-adaptive visual navigation method (SAVN) and shows major improvements in adapting to new environments on AI2-THOR, though it does not take the adaptation to novel targets into account.

Main Results: In contrast, our goal is to design an approach that specifies navigation targets in the form of an image; hence, the approach is applicable to novel targets [52]. In addition, our approach is inspired by an information-theoretic regularization of navigation, which builds the interrelation between navigation actions and two adjacent visual observations. We integrate a reinforcement learning framework (Asynchronous Advantage Actor-Critic, A3C) [25] with the information-theoretic regularization by learning to first generate the next preferred observation, given the neighboring views and a target, and then to predict an action by analyzing the difference between the current and the next observations. Our generation process essentially builds the connection between the current observation and the target to infer the most relevant part, and the action prediction is based on the causality between navigation actions and visual transformations. Our scheme is in accordance with human navigation: for example, to reach an unfamiliar target in a new environment, humans tend to first exploit experience to judge which part of the surroundings is important and then move toward that part.

In summary, our contributions are as follows: (1) We incorporate an information-theoretic regularization into a reinforcement learning framework. This regularizes the intermediate process of navigation decision making and significantly improves navigation performance. (2) By building the connection between the current observation and the target, and the causality between actions and visual transformations, we improve generalization to unseen environments and novel target objects. To the best of our knowledge, we are the first to adapt an information-theoretic regularization to guide the learning of an RL-based navigation agent. We conduct evaluations on datasets of both synthetic and real-world scenes, namely AI2-THOR and AVD. We show that our model outperforms the supervised baseline [52] in terms of both success rate and SPL for cross-scene evaluation on AVD, and in terms of success rate and SPL for both cross-target and cross-scene evaluation on AI2-THOR. Our source code has been submitted and will be made publicly available.

2 Related Works

Autonomous navigation in an unknown environment is one of the core problems in mobile robotics, which has been extensively studied. In this section, we provide a brief overview of some relevant works below.

Imitation learning. Imitation learning (IL) is the problem of learning a behavior policy by mimicking a set of demonstrations, and it can be advantageous when these demonstrations are optimal. It has been used in many practical applications, including autonomous driving [4, 8, 38, 45, 30] and UAV navigation [3, 19, 27, 39]. Pomerleau et al. [31] first demonstrate road-following using a neural network trained on monocular imagery captured while a human drives a vehicle. LeCun et al. employ a convolutional neural network to learn to drive in off-road terrain. However, it is difficult for these base imitation models to deal with states that the agent has never experienced before. The DAGGER model [34] improves upon base imitation learning by continuously closing the gap between the trajectory distributions of the agent and the expert demonstration, and it has been widely used for many robotic control tasks. A number of improvements that decrease the dependency on optimal demonstrations have been published: LOLS [7] additionally guarantees local optimality, and the safeDAGGER model [48] reduces the number of queries to the expert and hence decreases the dependency. In addition, AggreVaTe [33] learns to choose actions that minimize the cost-to-go of the expert rather than mimicking its actions; AggreVaTeD [40], a differentiable version of AggreVaTe, extends interactive IL to sequential prediction and challenging continuous robot control tasks. These methods all demonstrate exponentially higher sample efficiency than many classic RL methods. However, in target-driven visual navigation, states close to the navigation target tend to be rarer than states far from the target, which results in a data imbalance. IL treats each stage of a navigation task equally during training and therefore rarely encounters the close cases. Hence, IL performs poorly when approaching the target during testing, especially in making the stop decision precisely and decisively.

Reinforcement learning. Reinforcement learning does not necessarily require supervision by an expert, as it searches for the policy that ultimately leads to the highest reward. Recently, a growing number of RL-based navigation methods have been reported. For example, Jaderberg et al. [17] take advantage of auxiliary control or reward prediction tasks to assist reinforcement learning. Directly predicting future measurements during learning also appears effective for sensorimotor control in immersive environments [11]. Zhu et al. [52] propose a feed-forward architecture for target-driven visual navigation by combining a Siamese network with an A3C algorithm; however, they do not consider generalization to previously unseen environments. The work in [23] provides several additional RL learning strategies and associated architectures. Furthermore, many recent works have extended deep RL methods to real-world robotics applications, either by using a low-dimensional estimated state [9] or by collecting an exhaustive real-world dataset under grid-world assumptions [49]. Gupta et al. [15] present an end-to-end architecture that jointly trains mapping and planning for navigation in novel environments. Mousavian et al. [26] propose using state-of-the-art detectors and segmentors to extract semantic information from observations and then learning a mapping from this information to navigation actions. Savinov et al. [35] propose topological graphs for the navigation task, though they require several minutes of footage before navigating in an unseen environment. Kahn et al. [18] explore the intersection of model-free and model-based algorithms in the context of learning navigation policies. Wei et al. [46] integrate semantic and functional priors to improve navigation performance and can generalize to unseen scenes and objects. In contrast, we incorporate imitation learning within an RL framework to further improve both the cross-target and the cross-scene generalization capability of a navigation agent.

Combined approaches. Recently, methods combining the advantages of IL and RL have become popular [22, 41, 32, 12, 53]. These works provide suitable expert demonstrations to mitigate the low sample efficiency of RL. Ho et al. [16] exploit a generative adversarial model to fit the distributions of states and actions defining expert behavior; unlike DAGGER, it never directly interacts with the expert. They learn a policy from supplied data and hence avoid the high cost of RL. The works [13, 21, 28] share the idea of learning from multiple teachers: Gimelfarb et al. [13] apply Bayesian model combination to learn the best combination of experts; Li et al. [21] discard bad maneuvers by using a reward-based online evaluation of the teachers during training; and Muller et al. [28] use a DNN to fuse multiple controllers and learn an optimized controller. Target-driven navigation in static environments differs from the problems above because the optimal expert (the shortest path) is easy to acquire, so there is no need to consider bad demonstrations. We propose using a supervised generative model to regularize the intermediate process of an RL policy, which yields a more effective and generalizable target-driven navigation model.

Figure 1: Model overview. Our model integrates an information-theoretic regularization into an RL framework to constrain the intermediate process of the navigation policy. The inputs of our model at each time step are the multi-view images from the current location and the target image. During training, our network is supervised by the environment reward and the shortest path of the current task, in the form of the ground truth action and the ground truth next observation. The parameters are updated by four loss terms: reconstruction, KL, classification, and value. The first three terms (in blue) are introduced by the information-theoretic regularization. At test time, the parameters are fixed and our network first takes the current observation and the target as inputs to generate the future state; it then predicts the action based on the future and current states. Layer parameters in the green squares are shared.

3 Target-Driven Visual Navigation

In this section, we begin by outlining the target-driven visual navigation task. We then present our network, which combines a supervised generative model with deep reinforcement learning for this task.

3.1 Navigation Task Setup

We focus on learning a policy for navigating an agent from its current location to a target in an unknown scene using only visual observations. Our navigation problem is defined as follows: given a target image g, at each time step t the agent receives as input the first-person view o_t of the environment and predicts the action a_t that will navigate the robot toward the viewpoint where the target image is taken.

We conduct our experiments on the AI2-THOR framework and AVD. AI2-THOR consists of synthetic scenes in four categories: kitchen, living room, bedroom, and bathroom. Each category includes multiple scenes, which are split into training, validation, and testing sets. AVD contains real-world scenes, all houses except for one office room, likewise split into training, validation, and testing sets. Each scene in the two datasets is discretized into a grid-world navigation graph. The agent acts on these graphs, and its action space is the discrete set determined by the connectivity structure of the graphs. This setup makes it easy to acquire the shortest action path for a target-driven navigation task. In this work, we show how to incorporate these shortest paths during training to improve navigation performance in unseen scenes.
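Because the agent acts on a discrete navigation graph, the ground-truth shortest action path can be obtained with a plain breadth-first search. The sketch below illustrates this; the `(action, next_node)` edge encoding and all names are our own illustration, not the datasets' actual API:

```python
from collections import deque

def shortest_action_path(graph, start, goal):
    """BFS over a discretized navigation graph.

    graph: dict mapping node -> list of (action, next_node) edges.
    Returns the action sequence of one shortest path, or None if
    the goal is unreachable.
    """
    queue = deque([start])
    parent = {start: None}  # node -> (previous_node, action_taken)
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk back to the start, collecting actions in reverse.
            actions = []
            while parent[node] is not None:
                node, action = parent[node]
                actions.append(action)
            return list(reversed(actions))
        for action, nxt in graph.get(node, []):
            if nxt not in parent:
                parent[nxt] = (node, action)
                queue.append(nxt)
    return None
```

Each training task then pairs the current observation with the first action (and resulting next observation) along this path as ground truth.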

3.2 Navigation Model

We first formulate target-driven visual navigation using a deep reinforcement learning framework (TD-A3C). At each time step t, the network takes the current observation o_t and the navigation target g as inputs and outputs an action distribution π(a_t | o_t, g) and a scalar value v_t. The action a_t is sampled from the policy π, and v_t estimates the value of the current state under the policy. This network can be updated by minimizing a traditional RL navigation loss as in [52], which uses different policy networks for different scenes. However, achieving strong results with one single policy network for all training scenes is difficult, since training is very sensitive to the RL reward function and requires extensive time. Driven by this observation, we propose to regularize the network so that it trains robustly and generalizes to novel navigation tasks. This is achieved by incorporating an information-theoretic regularization into a deep RL framework consisting of a policy network and a value network.

3.2.1 Policy Network

Information-theoretic regularization.

During navigation, let o_t denote the current observation, o_{t+1} the next observation, and a_t the relative action between the two observations. We observe that a navigation agent always abides by an information-theoretic regularization: there should be high mutual information between the action a_t and the adjacent observation pair (o_t, o_{t+1}). Namely, I(a_t; o_t, o_{t+1}) should be high:

I(a_t; o_t, o_{t+1}) = H(a_t) - H(a_t | o_t, o_{t+1}).     (1)

We assume a uniform prior over actions since the navigation action space is a discrete set, and hence H(a_t) is a constant; maximizing Equation 1 therefore amounts to minimizing the conditional entropy H(a_t | o_t, o_{t+1}).
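Because the conditional entropy involves an intractable posterior, the standard variational (Barber–Agakov style) lower bound makes the maximization tractable; a sketch in the notation above, where q is any auxiliary conditional distribution:

```latex
I(a_t;\, o_t, o_{t+1})
  = H(a_t) - H(a_t \mid o_t, o_{t+1})
  \;\geq\; H(a_t)
    + \mathbb{E}_{p(a_t,\, o_t,\, o_{t+1})}
      \big[ \log q(a_t \mid o_t, o_{t+1}) \big].
```

The bound is tight when q matches the true posterior p(a_t | o_t, o_{t+1}), and maximizing it over q is exactly a maximum-likelihood action-prediction objective, which is what motivates supervising an action-prediction module with ground-truth actions.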


We propose adapting the regularization above by incorporating some supervision information to help learn a strong target-driven visual navigation policy . The supervision is from the shortest paths of target-driven navigation tasks. Specifically, at each time step, given the current observation and the target, the optimal next observation and relative action are provided as ground truth.

Based on Equation 1, we first want the generative module to generate a next observation o_{t+1} that is most relevant to the navigation task. Hence, we use the ground truth action a_t* to guide the generation, modeling p(o_{t+1} | o_t, a_t*), and use the ground truth next observation to help update the generation module through a reconstruction loss. In addition, considering that a_t* is unknown a priori during real navigation and is inherently determined by the navigation target g, we design the distribution p(o_{t+1} | o_t, g) to approximate p(o_{t+1} | o_t, a_t*), which requires minimizing the Kullback-Leibler divergence between the two distributions. Subsequently, the action prediction module p(a_t | o_t, o_{t+1}) is supervised by the ground truth action and hence is updated by a classification loss. To this end, the generation, approximation, and action prediction modules constitute the navigation policy π, which is updated by minimizing

L_policy = λ1 L_rec + λ2 L_KL + λ3 L_cls,

where the hyper-parameters λ1, λ2, λ3 tune the relative importance of the three terms: reconstruction, KL, and classification.

In addition, we investigate two techniques to improve training. First, we find that when the previous action is provided as an input, the agent is less likely to move or rotate back and forth in a scene. This is reasonable, since the ground truth action never contradicts the previous action (e.g., move forward vs. move backward). Second, we apply a CNN module to derive a state representation from each image, yielding the current state, the ground truth next state, and the goal state. In our work, we do not directly generate the next observation o_{t+1} in pixel space; we generate its state representation and use this to compute the reconstruction loss and predict the navigation action. To avoid confusion, we will still speak of generating the next observation below. This simplification reduces the network parameters and hence the computational cost. As a result, our navigation policy is updated by the same three loss terms, now computed on state representations rather than raw observations.

To this end, our policy network is isolated from the rewards of navigation tasks and supervised by the shortest paths, which makes it easy to train. Unlike previous work [52], which learns to map raw images directly to navigation actions, we add an intermediate process (the generation of the future observation) and predict an action based on the difference between the current and future observations. This makes our network more compact and generalizable.
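As a minimal illustration of the three policy-loss terms, the NumPy sketch below computes a squared-error reconstruction term on generated state features, a closed-form KL divergence between two diagonal Gaussians (standing in for the target-conditioned and action-conditioned distributions), and a cross-entropy classification term. All function and variable names are hypothetical, not taken from the paper's implementation:

```python
import numpy as np

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def policy_losses(s_next_gen, s_next_gt, action_logits, action_gt,
                  mu_q, logvar_q, mu_p, logvar_p):
    """Hypothetical decomposition of the three policy-loss terms.

    s_next_gen / s_next_gt : generated vs. ground-truth next-state features
    action_logits          : unnormalized scores over the discrete actions
    action_gt              : index of the ground-truth action
    (mu, logvar) pairs     : the two distributions whose KL is regularized
    """
    rec = np.mean((s_next_gen - s_next_gt) ** 2)            # reconstruction
    kl = kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p)  # KL term
    log_probs = action_logits - np.log(np.sum(np.exp(action_logits)))
    cls = -log_probs[action_gt]                             # classification
    return rec, kl, cls
```

In a full model, the three terms would be combined with the weighting hyper-parameters and minimized jointly with the value loss.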

3.2.2 Value Network

The update of our policy is a pure imitation learning process, which treats every stage of a navigation task equally. In practice, situations in which the agent is close to the navigation target are sparse in the training data, and the agent is far more likely to be far from the target. Due to this imbalance, it is challenging for the agent to make the optimal decision when approaching the goal, especially the stop decision. On the other hand, we find that the different stages of a navigation task can be distinguished by their discounted accumulative reward in RL, and a stage close to the target, with a large accumulative reward, updates the policy more strongly, which eases the data imbalance problem. Hence, we incorporate this RL idea into our design.

We learn a value function from the last fully connected layer of our policy, denoted V(o_t, g) for current observation o_t and target g, which evaluates the current policy on the current navigation task. The value function is updated by minimizing the loss L_value = (R_t - V(o_t, g))^2, where R_t = \sum_{i \ge 0} \gamma^i r_{t+i} is the discounted accumulative reward and r_t is a reactive reward provided to the agent by the environment.
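A sketch of this update in plain Python, assuming a scalar reward sequence from one rollout (names are illustrative, not from the released code):

```python
def discounted_returns(rewards, gamma=0.99):
    """R_t = r_t + gamma * R_{t+1}, accumulated backwards over a rollout."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

def value_loss(values, returns):
    """Mean squared error between predicted values and discounted returns."""
    return sum((v - r) ** 2 for v, r in zip(values, returns)) / len(values)
```

States near the goal accumulate the terminal reward with little discounting, so they carry larger returns and contribute larger gradients, which is the rebalancing effect described above.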

Finally, the overall loss for our method is a weighted sum of the four terms (reconstruction, KL, classification, and value), where the hyper-parameter weights are set empirically and kept fixed throughout our experiments. The network overview is presented in Figure 1. Please refer to the supplemental material for details.

3.3 Learning Setup

So far we have integrated a supervised generative model in a reinforcement learning framework. Now, we will describe the key ingredients of the reinforcement learning setup: observations, targets, rewards, and success measure.

Observations. In contrast to [52], which stacks four history frames as the current input at each time step, we utilize four views (RGB images by default) with evenly distributed azimuth angles at each location as the current observation. Each view has a fixed resolution.

Targets. The navigation target is specified by an RGB image, which contains a goal object such as a dining table, a refrigerator, a sofa, a television, a chair, etc. Please refer to the supplemental material for the training and testing goal objects. For each object, AI2-THOR provides all related views, which are used as target images in our training. On AVD, we just use one view for each goal object. Our model learns to analyze the relationship between the current observation and the target image, so we can show generalization to novel targets and scenes, both of which the agent has not previously encountered.

Rewards. Our purpose during policy training is to minimize the trajectory length to the navigation target. Therefore, reaching the target is assigned a high reward value, and we penalize each step with a small negative reward. To discourage collisions, we add a penalty when the agent hits obstacles at run-time. In addition, we consider the geodesic distance to the goal at each time step, d_t, as in [14], and reformulate the reward so that the agent is additionally rewarded for the progress d_{t-1} - d_t it makes toward the goal.
Success measure. In our setting, the navigation agent runs up to a maximum number of steps unless a stop action is issued or the task succeeds. A navigation task is considered successful if the agent predicts a stop action, the goal object in the target image lies in the field of view of the current front view, and the distance between the current location and the target view location is within a distance threshold.
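As an illustration only, the reward shaping and success test described above can be sketched as follows. The numeric constants are placeholders, since the paper's exact values are not given in this excerpt, and all names are hypothetical:

```python
def step_reward(reached_goal, collided, d_prev, d_curr,
                r_goal=10.0, r_step=-0.01, r_collision=-0.1):
    """Shaped per-step reward: a large terminal bonus, a small time
    penalty, a collision penalty, and geodesic progress d_prev - d_curr."""
    if reached_goal:
        return r_goal
    reward = r_step + (d_prev - d_curr)
    if collided:
        reward += r_collision
    return reward

def is_success(stop_issued, goal_in_front_view, dist_to_target,
               max_dist=1.0):
    """Success: the agent stops, the goal object is visible in the front
    view, and the agent is within a distance threshold of the target."""
    return stop_issued and goal_in_front_view and dist_to_target <= max_dist
```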

4 Implementation and Performance

Our main objective is to improve the cross-target and cross-scene generalization of target-driven navigation. In this section, we evaluate our navigation model against baselines based on standard deep RL models and/or traditional imitation learning. We also provide ablation results to gain insight into how performance is affected by changing the structures in our model.

4.1 Baselines and Ablations

We compare with a random agent, an RL baseline and its supervised version, and variations of our model.

  • Random Walk randomly draws one option out of the navigation action set at each time step.

  • TD-A3C is the target-driven visual navigation model from [52]. It is trained using standard reinforcement learning, but it has the same action space and reward function as ours. The CNN module is an ImageNet-pretrained ResNet-50 and is fixed during training.

  • TD-A3C(BC) is a variation of the TD-A3C. It is trained using behavioral cloning (BC), where the policy is supervised by shortest paths and the CNN module is the same as ours and is not frozen during training.

  • Ours-FroView is a variation of our model that uses only the current front view, rather than the four views around the agent's location, to generate the future at each time step.

  • Ours-NoGen is a variation of our model, which predicts the future observation from the current observation and the target without a stochastic latent space.

  • Ours-VallinaGen is a variant of ours in which the latent space is constrained by a standard normal distribution prior instead of the learned conditional prior.

4.2 Implementation Details

We train our method and all alternatives across all scene types with an equal number of navigation tasks per scene, using asynchronous workers, and back-propagate through time over unrolled time steps, which determines the batch size of each back-propagation. We use SGD to update the network parameters. We train all models until they perform stably on the validation set and then evaluate them at two difficulty levels, with navigation tasks sampled from the testing set for each level. [36] proposes benchmarking navigation task difficulty by the ratio of the shortest path distance to the Euclidean distance between the start and goal positions. In each evaluation, we compute the percentage P of tasks whose ratio falls within a given range, and we evaluate performance both on all tasks and on tasks whose optimal path length is at least a minimum number of steps. We evaluate these models on two metrics: success rate (SR) and success weighted by (normalized inverse) path length (SPL), as defined in [46, 43].
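For reference, SPL over N evaluation episodes follows the standard definition cited above: the success indicator of each episode, weighted by the ratio of the shortest-path length to the longer of the shortest-path and actual path lengths. A minimal sketch:

```python
def spl(successes, shortest_lengths, path_lengths):
    """SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i), where S_i is the
    binary success of episode i, l_i the shortest-path length, and p_i
    the length of the path the agent actually took."""
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        if s:
            total += l / max(p, l)
    return total / len(successes)
```

An agent that succeeds along the exact shortest path scores 1 on that episode; failures score 0 regardless of path length.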

Random 1.2 0.7 0.6 0.3
Unseen TD-A3C 20.0 4.0 12.9 2.6
scenes, TD-A3C(BC) 23.0 7.9 13.4 3.7
Known Ours 45.7 25.8 41.9 24.8
targets Ours-FroView 32.3 10.3 29.8 9.4
P=17.7% Ours-NoGen 41.2 23.8 38.5 22.2
Ours-VallinaGen 37.5 17.7 34.0 15.9
Random 2.0 1.0 0.6 0.4
Unseen TD-A3C 10.1 1.9 6.3 1.1
scenes, TD-A3C(BC) 12.3 2.4 7.5 1.6
Novel Ours 37.7 20.5 35.4 19.7
targets Ours-FroView 24.6 7.8 23.0 6.9
P=16.0% Ours-NoGen 35.7 19.1 31.6 17.4
Ours-VallinaGen 31.4 13.9 29.4 12.7
Table 1: Average navigation performance (SR and SPL in %) on unseen scenes from AI2-THOR with stop action. In each row, the first SR/SPL pair is measured over all evaluation tasks and the second pair over tasks with longer optimal paths.
Category Kitchen Living room Bedroom Bathroom
P=15.2% P=15.6% P=20.0% P=20.0%
Random 0.0 / 0.0 1.6 / 1.0 2.0 / 1.1 1.2 / 0.7
TD-A3C 17.4 / 3.1 13.2 / 2.1 16.9 / 1.9 32.4 / 9.0
TD-A3C(BC) 21.3 / 7.8 18.2 / 5.2 22.4 / 8.1 30.1 / 10.4
Ours 42.6 / 23.6 36.7 / 19.6 40.6 / 21.8 62.7 / 38.1
Ours-FroView 34.8 / 11.2 17.6 / 5.0 28.0 / 9.2 48.8 / 16.0
Ours-NoGen 38.8 / 22.0 28.8 / 15.4 38.8/ 22.7 58.4 / 35.2
Ours-VallinaGen 47.2 / 24.0 15.6 / 6.1 34.8 / 13.5 52.4 / 27.3
Table 2: Comparing navigation performance (SR and SPL in %) on different scene categories on AI2-THOR with stop action.
TD-A3C 32.4 10.2 27.1 9.8
Unseen TD-A3C(BC) 35.9 12.4 31.3 10.1
scenes, Ours 47.3 31.5 42.6 27.7
Known Ours-FroView 46.2 17.7 40.1 14.2
targets Ours-NoGen 47.1 30.9 41.9 28.1
P=17.7% Ours-VallinaGen 45.5 28.9 40.7 27.3
TD-A3C 30.4 11.9 26.9 8.2
Unseen TD-A3C(BC) 31.7 10.3 26.9 8.7
scenes, Ours 41.6 27.2 38.8 25.1
Novel Ours-FroView 39.7 16.8 37.1 14.2
targets Ours-NoGen 40.8 22.0 37.2 19.2
P=16.0% Ours-VallinaGen 35.9 17.9 32.4 15.9
Table 3: Average navigation performance (SR and SPL in %) on unseen scenes from AI2-THOR without stop action. Columns follow the same layout as Table 1.
Table Exit Couch Refrigerator Sink Avg.
Random 4.0/ 2.7 4.6 / 3.1 3.4 / 2.1 3.2 / 2.1 3.6 / 2.7 3.8 / 2.7
With TD-A3C 5.3 / 2.4 6.7 / 4.2 4.3 / 2.5 5.6 / 2.7 7.1 / 3.6 5.8 / 3.1
Stop TD-A3C(BC) 12.4 / 1.5 23.0 / 2.8 15.0 / 1.7 7.2 / 1.1 13.4 / 1.7 14.2 / 1.8
Action Ours 23.4 / 8.5 10.1 / 5.3 31.9 / 13.1 18.9 / 3.5 25.7 / 7.0 22.0 / 7.5
Random 34.8 / 12.9 29.0 / 11.3 29.8 / 10.8 27.4 / 10.7 23.0 / 10.2 28.8 / 11.2
Without TD-A3C 39.7 / 13.1 29.3 / 10.9 30.1 / 9.9 28.4 / 10.1 22.9 / 9.7 30.1 / 10.7
Stop TD-A3C(BC) 58.2 / 4.2 41.9 / 5.5 32.0 / 2.4 31.9 / 5.2 24.0 / 1.7 37.6 / 3.8
Action Ours 61.9 / 35.0 56.7 / 28.1 65.7 / 27.5 40.1 / 16.9 50.6 / 20.1 55.0 / 25.5
Table 4: Average navigation performance (SR and SPL in %) comparisons on unseen scenes from AVD.

4.3 Results

4.3.1 Evaluations on the AI2-THOR

Generalization. We analyze the cross-target and cross-scene generalization ability of these models on AI2-THOR. Table 1 summarizes the results. First, comparing TD-A3C and TD-A3C(BC), we observe higher generalization performance for the supervised model. We believe it is more challenging for RL networks to discover the optimal outputs in higher-order control tasks. In addition, pretraining on ImageNet (TD-A3C) does not offer better generalization, since the features required for ImageNet differ from those needed for navigation. Next, considering the performance difference between TD-A3C(BC) and ours, we see that generating the future before acting, and acting based on the visual difference, works better than directly learning a mapping from raw images to navigation actions. Our method outperforms the two baselines by a large margin in terms of both success rate and SPL in each evaluation. This demonstrates that the adaptation of the information-theoretic regularization (described in Section 3.2.1) effectively brings better generalization to unseen scenes and novel objects.

Ablation. Furthermore, the ablation on different inputs (front view vs. multi-views) demonstrates that it is easier to generate the next observation when the current information is rich. We also evaluate an ablation that takes four history frames as the current input, which is difficult to train to convergence in the training scenes. We attribute this to the lack of a direct connection between the random history and the next observation, which depends most directly on the current observation and the target. Hence, it is more reasonable to generate the future from the current multi-view observation rather than from the history. Based on the ablation on the generation process, we conclude that learning a stochastic latent space is often more generalizable than learning a deterministic one (Ours vs. Ours-NoGen), which is in accordance with [5]. However, when the latent space is over-regularized by the standard normal distribution prior, performance degrades (Ours-VallinaGen vs. Ours).

Scene category. Table 2 presents the navigation performance on different scene categories, which is based on the evaluation tasks from Table 1. All methods consistently demonstrate impressive navigation performance in small scenes, e.g., kitchen and bathroom. However, navigation in large rooms, e.g., living room, is much more challenging.

Geodesic distance. We further analyze the navigation performance (SR and SPL) as a function of the geodesic distance between the start and the target locations in Figure 2. This evaluation uses the same navigation tasks as Table 2. As can be seen, the geodesic distance is highly correlated with the difficulty of navigation tasks and the performance of all methods degrades as the distance between the start and the target increases. Our model outperforms all alternatives in most cases. We also provide the statistics of geodesic distance to the target locations of all the compared models after navigation. As illustrated by Figure 3, all learning models experience a decrease in the average distance to the targets, demonstrating the effectiveness of learning compared to the initial geodesic distance distribution of these evaluated navigation tasks.

Termination criterion. It is difficult for a navigation agent to learn to issue a stop action at the correct location, since a navigation task contains only one state requiring the stop action but many states requiring other actions. Table 3 shows a simpler setting in which the stop signal is provided by the environment rather than predicted by the agent. As expected, all models achieve higher performance than in Table 1. Additionally, the gap between the two settings is much smaller for our model than for the two baselines, indicating that we handle the data imbalance better.

Collisions. We assume a collision detection module is available; when a collision is detected, the agent stays in place and makes a new decision. In this case, two outcomes can be expected: 1) a newly sampled action from the policy distribution helps the agent escape the dilemma; or 2) the same action is drawn again and the agent remains stuck until the maximum number of steps is reached, finally failing the task. We evaluate the trajectories from Table 2 by computing the ratio of collision actions as navigation proceeds. As shown in Figure 4, our model deals with collisions more properly than the baselines as navigation progresses.

Figure 2: We report SR and SPL performance as a function of starting geodesic distance from the target. Our method and three ablations outperform two baselines on all starting distances.
Figure 3: We fit the ending geodesic distance distribution of all learning models over navigation tasks. The Init represents the starting geodesic distance distribution of these tasks.
Figure 4: We report the collision action percentages of all learning models as the navigation proceeds.
Figure 5: Visualization of some typical success and failure cases of our method in eight navigation tasks from AVD. The blue dots represent reachable locations in the scene. Green triangles and red stars denote starting and goal points, respectively.

4.3.2 Evaluations on the AVD

Generalization. To evaluate generalization ability in the real world, we train and evaluate our model and the two baselines on the training and testing splits of AVD. All compared methods take depth input with a resolution of . We relax the criterion for successful navigation (the stop action must be issued, the distance from the agent to the target position must be less than meter, and the angle between the current and target view directions must be less than ). We present the results in Table 4, based on navigation tasks () randomly sampled from the test scenes of AVD. All three learning models show an average performance decrease compared to the results with stop action on AI2-THOR, but the relative tendency is consistent with Table 1. Additionally, there is a large gap between the performance of all compared models with and without the stop action: the AVD dataset is small and the navigation targets in our setting are limited, resulting in severe data imbalance, so all models have difficulty issuing a correct stop action. Most notably, our method exhibits significant robustness in the real world, with at least absolute improvement in success rate and in SPL over the supervised TD-A3C(BC) baseline.

Visualization. We visualize eight navigation trajectories from our model in Figure 5. These tasks are all characterized by unknown scenes and long distances between the start points and the targets. For the tasks in the first row, our agent navigates to the targets successfully; for the last four tasks, our model fails to finish within the maximum number of steps. The failure modes include thrashing around in space without making progress (the first and third trajectories in the second row), getting stuck in a corridor (the second trajectory in the second row), and navigating around tight spaces (e.g., the small bathroom where the fourth trajectory starts).

5 Conclusion

We propose integrating an information-theoretic regularization into a deep reinforcement learning framework for the target-driven task of visual navigation. This is achieved by first learning to generate a next observation from a current observation and a navigation target, and then planning an action toward the target based on the generated and current observations. Our experiments show that our model outperforms the supervised baseline by a large margin in both cross-scene and cross-target generalization.

The current navigation policy is still sensitive to data imbalance when the training task set is limited (see the large performance gap in Table 4). In the future, we plan to exploit high-level semantic and contextual features to facilitate the understanding of navigation tasks, and to learn navigation strategies that remain robust with limited training data so that they can be applied to real-world scenarios.

6 Appendix

Navigation Targets.

Our navigation targets are specified by images containing goal objects, such as dining tables, refrigerators, sofas, televisions, and chairs. AI2-THOR provides all visible RGB views for each goal object, which we use as target images in our experiments. These views are collected under three conditions. First, the goal object must be within the camera’s viewport. Second, the goal object must be within a threshold distance from the agent’s center ( by default). Third, a ray emitted from the camera must hit the object without first hitting another obstruction. In Table 5, we provide the split of object classes used in training and testing the learning models on AI2-THOR.
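The three visibility conditions can be illustrated with a simplified 2-D check; the thresholds, the circular obstacle model, and all names below are assumptions for illustration, not AI2-THOR's actual API.

```python
import math

def is_visible(agent_pos, obj_pos, obstacles,
               fov_deg=90.0, max_dist=1.5, heading=0.0):
    """2-D sketch of the three conditions: viewport, distance, clear ray.

    obstacles: list of (x, y, radius) circles that can block the ray.
    """
    dx, dy = obj_pos[0] - agent_pos[0], obj_pos[1] - agent_pos[1]
    dist = math.hypot(dx, dy)
    # Condition 2: object within the distance threshold.
    if dist > max_dist:
        return False
    # Condition 1: object within the camera's field of view.
    angle = math.degrees(math.atan2(dy, dx)) - heading
    angle = (angle + 180.0) % 360.0 - 180.0  # wrap to [-180, 180)
    if abs(angle) > fov_deg / 2.0:
        return False
    # Condition 3: the ray from camera to object hits no obstacle first.
    for ox, oy, radius in obstacles:
        # Project the obstacle center onto the camera-to-object ray.
        t = ((ox - agent_pos[0]) * dx + (oy - agent_pos[1]) * dy) / (dist * dist)
        if 0.0 < t < 1.0:
            px, py = agent_pos[0] + t * dx, agent_pos[1] + t * dy
            if math.hypot(ox - px, oy - py) < radius:
                return False  # obstruction hit before the object
    return True
```

AI2-THOR performs the equivalent checks in 3-D with mesh colliders; this sketch only conveys the logic of the three conditions.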

Room type   | Train objects                                                                                   | Test objects
Kitchen     | Toaster, Microwave, Fridge, CoffeeMaker, GarbageCan, Box, Bowl, Apple, Chair, DiningTable, Plate, Sink, SinkBasin | StoveBurner, Cabinet, HousePlant
Living room | Pillow, Laptop, Television, GarbageCan, Box, Book, FloorLamp, Sofa                              | Bowl, Statue, TableTop, HousePlant
Bedroom     | Lamp, Book, AlarmClock, Bed, Mirror, Pillow, GarbageCan, TissueBox, Dresser, LightSwitch       | Cabinet, Statue
Bathroom    | Sink, ToiletPaper, SoapBottle, LightSwitch, Candle, GarbageCan, SinkBasin, ScrubBrush          | Cabinet, Towel, TowelHolder
Table 5: Training and testing split of object classes for each scene category on AI2-THOR.

For AVD, we select depth target views from the training split ( scenes), including some common objects such as .

Network Architecture.

Our CNN module for deriving a state representation from an image is presented in Figure 6(a). By default, spectral normalization is applied to the first six layers, which prevents the escalation of parameter magnitudes and avoids unusual gradients [24, 47]. The activation function is LeakyReLU. At each time step , we take the four-view observation as well as the target as inputs and extract a -D state vector for each of them. We concatenate each view state with the target state to obtain a fused feature (see Figure 6(b)). In Figure 6(d), the four fused feature vectors are then used to infer a vector of latent variables of dimension with an MLP. Here, a KL divergence loss is minimized to impose the distribution of the latent variables to match a prior distribution from Figure 6(c), which is estimated from the current observation (front view only) and the ground-truth action . The latent vector is used to generate the state of the next observation, supervised by the ground-truth next observation . Subsequently, the generated next-observation state (-D), the front-view observation state (-D), and the feature (-D) extracted from the previous action (-D one-hot vector) are combined to predict the navigation action (-D) and estimate the value (-D). The ground-truth action and the environment reward are used to update this module.

Figure 6: Model architecture. The overview is given in the right panel with blowups of the CNN module (the orange portion), the fusion module (the blue portion) and the prior distribution (the green portion).
Training Details.

Our model is trained and tested on a PC with an Intel(R) Xeon(R) W- CPU at GHz and a GeForce GTX Ti GPU. For all compared models, training on AI2-THOR is carried out in four stages, starting with kitchens and adding scenes (namely, one scene category) at each subsequent stage. This ensures fast convergence in the training scenes. Training on the AVD scenes is continuous for all learning models. For evaluation, we select the model that performs stably on the validation set.
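The four-stage schedule above can be sketched as an incremental curriculum over scene categories; the category names mirror the AI2-THOR room types, and the function itself is illustrative.

```python
def curriculum_stages(categories=("kitchen", "living room",
                                  "bedroom", "bathroom")):
    """Start with kitchens, then add one scene category per stage,
    so stage i trains on the first i+1 categories."""
    return [list(categories[: i + 1]) for i in range(len(categories))]

for stage, cats in enumerate(curriculum_stages(), start=1):
    print(f"stage {stage}: train on {cats}")
```

Each stage continues training the same model on the enlarged scene pool, which is what allows fast convergence on the earlier categories before harder ones are introduced.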

Additional Results.

We conduct additional experiments with our method using semantically segmented images from AI2-THOR as inputs. The training and testing settings are the same as in the main paper. In Table 6, all navigation tasks are taken from the generalization evaluation on AI2-THOR in the main paper. Although semantically segmented images are lossy compared to RGB, they capture most of the information important for navigation, leading to a substantial navigation performance improvement, as expected.

Cross-scene
Category        Kitchen       Living room   Bedroom       Bathroom      Avg.
                P=15.2%       P=15.6%       P=20.0%       P=20.0%       P=17.7%
Random          0.0 / 0.0     1.6 / 1.0     2.0 / 1.1     1.2 / 0.7     1.2 / 0.7
Ours(RGB)       42.6 / 23.6   36.7 / 19.6   40.6 / 21.8   62.7 / 38.1   45.7 / 25.8
Ours(Semantic)  58.4 / 39.0   25.4 / 14.5   44.4 / 23.3   61.6 / 39.5   47.5 / 29.1

Cross-target
Category        Kitchen       Living room   Bedroom       Bathroom      Avg.
                P=20.0%       P=13.6%       P=15.6%       P=14.6%       P=16.0%
Random          2.8 / 1.4     0.4 / 0.1     1.6 / 0.1     3.2 / 1.5     2.0 / 1.0
Ours(RGB)       46.6 / 26.1   22.6 / 9.4    39.0 / 21.1   42.6 / 25.4   37.7 / 20.5
Ours(Semantic)  53.6 / 34.8   22.4 / 10.3   43.2 / 23.1   47.6 / 27.8   41.7 / 24.0

Table 6: Navigation performance (SR / SPL in %) on different input modalities from AI2-THOR with stop action.