Visual navigation is the problem of navigating an agent, e.g., a mobile robot, in an environment using camera input only. The agent is given a target image (an image it will see from the target position), and its goal is to move from its current position to the target by applying a sequence of actions, based on the camera observations only. We focus on the case when the environment is initially unknown, i.e., no explicit map is available. Such a visual navigation problem can be formalized as a reinforcement learning (RL) problem [17]. Two main challenges in the RL formulation are the high dimensionality of the agent’s observation space and the fact that the actual state is only partially observable from the images.
Observation space dimensionality can be reduced by using hand-crafted features, or by using features learned on either the training dataset or a completely different dataset, e.g., ResNet [5] features automatically extracted from the image [23, 2]. A different method, proposed in [21], uses image segmentation and a depth map as inputs to the agent. It was trained and evaluated on houses from the SUNCG dataset [14], and the trained agent was able to find multiple targets specified as a separate input to the agent.
Raw high-dimensional input images can also be used directly for navigation [7, 10]. These two papers extend the A3C algorithm with auxiliary tasks to stabilize the training and make it more efficient when the reward is sparse. They, however, use the DeepMind Lab [1] game simulator, which is much simpler than realistic simulators [8, 21, 14]. The only method that relies solely on visual input in a realistic indoor-scene environment is [23]. However, it was applied to AI2-THOR [8], which contains small single-room environments, and its action space discretized the environment into a simple grid world.
Our method builds on the batched A2C algorithm, extended with auxiliary tasks to help the agent learn useful features even in the absence of informative rewards. During the training of the deep neural network, we use depth-maps and image segmentations as training targets for the auxiliary tasks. In addition, we propose a method to pre-train the neural network before the reinforcement learning algorithm is applied. This is accomplished by transfer learning from one environment to another, gradually increasing the environment complexity. Finally, to address the partial observability problem, we propose a novel neural network architecture that is both efficient and compact. We evaluate our method in realistic indoor-scene environments.
II-A Formal Setting
The visual navigation problem is a partially observable Markov decision process (POMDP). For example, when the agent faces a wall, there are many states yielding the same or a very similar image. However, for ease of notation, we first introduce the problem as an instance of a standard MDP, using the state as if it were available to the agent. Later, we will replace the state by a sequence of past observations.
At the beginning of each learning episode, the agent starts from an initial state sampled uniformly from the set of all possible initial states (distributions other than the uniform one can also be used). At discrete time steps, the agent executes actions. As a result of each action, the agent moves to the next state and receives a reward. The experience the agent collects in a single episode is defined as the following sequence:
An episode terminates when the agent reaches the target or after a predefined maximum number of time steps. For training, the episode is split into equally long rollouts, where the last rollout can be shorter. The experience collected in a single rollout of a given length is defined as:
II-B Advantage Actor-Critic Algorithms (A2C)
Actor-critic algorithms are suitable for continuous state spaces. The critic is a parameterized approximator of the state-value function, while the actor is a parameterized approximator of the policy. We use a stochastic policy, which is a probability distribution over the discrete set of possible actions, conditioned on the state. Let the bootstrapped n-step return be defined as:
where the termination indicator is zero if the episode ended during the rollout and one otherwise. The actor is updated similarly to REINFORCE [20], and the gradient of its loss is given by:
The difference between the bootstrapped n-step return and the critic output is referred to as the advantage function. The critic is updated using n-step temporal difference learning: the MSE between the bootstrapped n-step return and the critic output is computed and a gradient descent update is applied. The gradient of the critic’s loss function is:
To ensure exploration, the negative entropy of the actor is added to the total loss. The negative entropy in a given state is defined as:
and its gradient on the rollout data is:
Note that the above setting differs from the formulation that uses the n-step forward view to compute the return. When DNNs are used to approximate the actor and the critic, it is beneficial to optimize on multiple time steps in a single batch. We, therefore, use the rollout data to optimize all time steps in the rollout at once. The estimated returns are thus a mixture of returns of different lengths for each state, which was proven to reduce the error in the discrete RL setting [19, 4].
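The rollout-wide return computation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function and argument names are hypothetical, the discount factor is passed in as `gamma`, and the recursion is the standard backward form of a bootstrapped return, in which the first state of the rollout receives the longest return and the last state a one-step return.

```python
from typing import List


def bootstrapped_returns(rewards: List[float],
                         not_done: List[float],
                         bootstrap_value: float,
                         gamma: float) -> List[float]:
    """Return one bootstrapped return per time step of a rollout.

    rewards[t]      reward received after the action at step t
    not_done[t]     0.0 if the episode ended at step t, 1.0 otherwise
    bootstrap_value critic estimate for the state following the rollout
    (all names are illustrative assumptions)
    """
    returns = [0.0] * len(rewards)
    running = bootstrap_value
    # Walk backwards so each step reuses the return of its successor.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * not_done[t] * running
        returns[t] = running
    return returns


def advantages(returns: List[float], values: List[float]) -> List[float]:
    """Advantage estimate: return minus the critic output for each step."""
    return [g - v for g, v in zip(returns, values)]
```

For example, with rewards `[0, 0, 1]`, an episode ending at the last step, and `gamma = 0.5`, the bootstrap value is ignored and the returns are `[0.25, 0.5, 1.0]`.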
As the critic and the actor can share knowledge about the environment, they can share some of their parameters, which leads to improved learning performance. For example, when using neural networks for visual tasks, the bottom-most convolutional layers in the actor and in the critic need to learn the same convolutional filters. The A2C algorithm has been adapted for use with DNNs by introducing the following two modifications:
Batched Advantage Actor-Critic (A2C). In batched A2C [22], there are several environments running in parallel. At each time step, actions are sampled by the actor, one for each environment. The rollouts collected from the environments are used to optimize the actor and the critic in a single batch. This process can be viewed as having separate instances of A2C, one per environment, each updating the same shared parameters. As shown in [11], the use of multiple environments has a stabilizing effect on the training, similarly to using an experience buffer [12].
Off-policy Critic Updates. Collecting observations can be costly, especially when the environment framework has to simulate physics and render 3D scenes. For an algorithm to be efficient, it needs to learn as much as possible from the experiences collected so far. To improve the data efficiency and the stability of the algorithm, a memory of past experiences called the experience buffer is used. It keeps the most recent experiences, i.e., observations, actions, rewards, and terminals (a terminal is the indicator of the episode ending at a particular time step). At each learning step, a sequence of experiences is sampled from the buffer and used to compute the bootstrapped n-step returns (3), which are then used to train the critic.
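A minimal sketch of such an experience buffer is given below. The class name, the method names, and the tuple layout are hypothetical and chosen for illustration only; the sketch assumes that a bounded double-ended queue is an acceptable way to keep only the most recent experiences and that sampled sequences are contiguous in time.

```python
import random
from collections import deque
from typing import Deque, List, Optional, Tuple

# One stored experience: (observation, action, reward, terminal).
Experience = Tuple[object, int, float, bool]


class ExperienceBuffer:
    """Fixed-capacity memory that keeps only the most recent experiences."""

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        self._items: Deque[Experience] = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, observation: object, action: int,
            reward: float, terminal: bool) -> None:
        # When the deque is full, the oldest experience is dropped.
        self._items.append((observation, action, reward, terminal))

    def __len__(self) -> int:
        return len(self._items)

    def sample_sequence(self, length: int) -> List[Experience]:
        """Sample `length` consecutive experiences from the buffer."""
        if length > len(self._items):
            raise ValueError("not enough experiences in the buffer")
        start = self._rng.randrange(len(self._items) - length + 1)
        return [self._items[i] for i in range(start, start + length)]
```

The sampled sequence can then be passed, together with the stored terminals, to whatever routine computes the bootstrapped returns for the off-policy critic update.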
II-C UNREAL Auxiliary Tasks
Deep RL algorithms are commonly enhanced with auxiliary tasks to improve their learning performance. For instance, in [7] the A3C algorithm was extended with two auxiliary tasks: reward prediction and pixel control. The former predicts the sign of the reward based on the past four observations, and the latter uses an additional pseudo-reward function to learn a policy that maximizes the absolute pixel change. The batched A2C can be enhanced in the same way; more details are given in Section III-C.
III Proposed Learning Architecture
Our method extends the batched A2C algorithm with UNREAL auxiliary tasks and additional auxiliary tasks for visual navigation. We call the method A2CAT-VN, which is an abbreviation of A2C with Auxiliary Tasks for Visual Navigation. We have made its implementation (https://github.com/jkulhanek/a2cat-vn-pytorch) as well as a framework implementing several deep RL algorithms (https://github.com/jkulhanek/deep-rl-pytorch) publicly available on GitHub.
III-A Neural Network
The deep neural network used in our method consists of several modules: convolutional base, LSTM, actor, critic, and auxiliary tasks, see Fig. 1. In the sequel, we explain the individual blocks one by one.
The convolutional base is depicted in Fig. 2. Its inputs are the observed image and the target image, each entering into a separate stream of two convolutional layers with shared weight parameters. The outputs of the second layer are concatenated and passed to two additional convolutional layers, followed by a single fully-connected linear layer.
No pooling layers are used. Instead, the images are down-sampled by using stride only, as suggested in [15]. The convolutional base features are merged with the previous action and the previous reward and are passed to the LSTM layer [6].
The previous action is encoded using one-hot encoding and the previous reward is clipped to a fixed interval. LSTM features are used as the input for both the actor and the critic, as well as for the pixel control auxiliary task. The LSTM features are computed from the convolutional features and are therefore a function of the input. Note that the input is composed of the image observation, the target image, the last action, and the last reward, as well as the previous LSTM state. The critic is an affine transformation of the LSTM features, and the actor is the result of the softmax function applied to an affine transformation of the LSTM features.
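The input encoding and the two output heads described above can be illustrated with a small sketch. It is not the published network: the helper names are hypothetical, the clipping bounds default to -1 and 1 purely as an assumption (the interval is not recoverable from the text), and plain Python lists stand in for tensors.

```python
import math
from typing import List


def one_hot(index: int, size: int) -> List[float]:
    """Encode a discrete action index as a one-hot vector."""
    if not 0 <= index < size:
        raise ValueError("action index out of range")
    return [1.0 if i == index else 0.0 for i in range(size)]


def clip(value: float, low: float, high: float) -> float:
    """Clip a scalar (here, the previous reward) to [low, high]."""
    return max(low, min(high, value))


def affine(weights: List[List[float]], bias: List[float],
           features: List[float]) -> List[float]:
    """Compute W x + b for a small dense layer."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over action logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lstm_input(conv_features: List[float], prev_action: int,
               num_actions: int, prev_reward: float,
               low: float = -1.0, high: float = 1.0) -> List[float]:
    """Concatenate conv features, one-hot action, and clipped reward.

    The default bounds are an assumption made for this sketch.
    """
    return (list(conv_features) + one_hot(prev_action, num_actions)
            + [clip(prev_reward, low, high)])
```

In this sketch the critic would be `affine(...)` with a single output row applied to the LSTM features, and the actor would be `softmax(affine(...))` with one row per action.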
III-B Resolving Partial Observability
The partial observability of the environment does not allow the agent to uniquely distinguish which state it occupies based on a sole observation. Using previous observations can, however, greatly improve its ability to navigate in the environment. For example, if the agent faces a wall, it can instead look at the previous observation and the action taken. The authors of [23] and [12] used the past four frames, fed into the network instead of a single image input. In [7], an LSTM memory [6] was used instead. We have used the latter in our approach, as we have experimentally found that it was superior to using the past four frames. The past four frames were not enough to capture the complex experience the agent collected when exploring the environment, and using more frames led to an unmanageable increase of the parameter space size and memory requirements.
III-C UNREAL Auxiliary Tasks
III-C1 Reward Prediction
The goal of the agent is to maximize the cumulative reward. It proves beneficial to train the network to predict whether a given state leads to a positive reward or not, since it helps the network to build useful features to recognize potentially fruitful states. The agent learns to predict the next reward based on the past three observations [7, 13] (an LSTM could also be employed here; however, we prefer to use the original method from the literature). First, a sequence of experiences is sampled from the experience replay buffer such that there is a fixed ratio between the sequences ending with zero reward and the sequences ending with non-zero reward. The output of the fourth convolutional layer computed from all three past observations is merged into a single vector. An additional linear layer and the softmax function are applied to output the probabilities of the reward being positive, negative, or zero. This new network is then trained using the cross-entropy loss.
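The ratio-controlled sampling and the three-way reward label can be sketched as follows. The function names, the class indices, and the `nonzero_fraction` parameter are assumptions made for illustration; the actual ratio used in the experiments is not stated in the text.

```python
import random
from typing import Sequence


def reward_class(reward: float) -> int:
    """Map a reward to a class index: 0 = zero, 1 = positive, 2 = negative."""
    if reward > 0:
        return 1
    if reward < 0:
        return 2
    return 0


def sample_sequence_end(rewards: Sequence[float], history: int,
                        nonzero_fraction: float,
                        rng: random.Random) -> int:
    """Pick the index of the reward to predict.

    Indices whose reward is non-zero are chosen with probability
    `nonzero_fraction`, provided both kinds of index are available.
    The `history` preceding observations serve as the predictor input.
    """
    candidates = range(history, len(rewards))
    nonzero = [i for i in candidates if rewards[i] != 0]
    zero = [i for i in candidates if rewards[i] == 0]
    if not nonzero and not zero:
        raise ValueError("buffer too short for the requested history")
    if nonzero and (not zero or rng.random() < nonzero_fraction):
        return rng.choice(nonzero)
    return rng.choice(zero)
```

With `history=3`, the three observations preceding the returned index would be fed to the reward-prediction head, and `reward_class` of the reward at that index would serve as the cross-entropy target.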
III-C2 Pixel Control
The pixel control task is defined via an additional pseudo-reward function in order to maximize the absolute pixel change. Using this reward, an additional policy is trained that shares most of its parameters with the A2C actor and critic. This policy must be trained using an off-policy RL algorithm since it uses the data sampled from the experience replay buffer generated by the actor. In [7] the n-step Q-learning loss is used to update the policy. The observation images are downsized, converted to gray scale, and the absolute differences between two consecutive observations are computed and used as pseudo-rewards for Q-learning.
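The pseudo-reward computation described above can be sketched on plain nested lists. The helper names are hypothetical, and two details are assumptions of this sketch rather than statements about the original method: gray-scale conversion is done by averaging the colour channels, and downsizing is done by average pooling over non-overlapping cells.

```python
from typing import List

Image = List[List[float]]  # 2D grayscale image, rows of pixel intensities


def to_gray(rgb: List[List[List[float]]]) -> Image:
    """Average the colour channels of an H x W x 3 image."""
    return [[sum(pixel) / len(pixel) for pixel in row] for row in rgb]


def downsize(image: Image, factor: int) -> Image:
    """Average-pool the image over non-overlapping factor x factor cells."""
    rows, cols = len(image) // factor, len(image[0]) // factor
    pooled = []
    for r in range(rows):
        pooled_row = []
        for c in range(cols):
            cell = [image[r * factor + i][c * factor + j]
                    for i in range(factor) for j in range(factor)]
            pooled_row.append(sum(cell) / len(cell))
        pooled.append(pooled_row)
    return pooled


def pixel_change(previous: Image, current: Image) -> Image:
    """Absolute difference between two consecutive processed frames."""
    return [[abs(b - a) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(previous, current)]
```

Each entry of the resulting map would act as the pseudo-reward for the corresponding cell of the pixel control head.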
A new head is attached to the output of the LSTM. This head consists of deconvolutional layers that upsample the low-dimensional features back to the size of the downsampled observations. For each action, there is a different output in the last layer to output the Q-function for each pixel. The dueling DQN technique [18] is used to improve the performance of the pixel control network. The pixel control network used in our method can be seen in Fig. 3.
III-D Additional Auxiliary Tasks for Visual Navigation
We introduced additional auxiliary tasks that are specific to visual navigation. They were designed to enhance the training process as well as to help the network generalize. We train the model to predict the depth-map, the image segmentation of the observation, and the image segmentation of the target. For the image segmentations, we map the object type to the RGB color space and maximize the distances between the colors in the HSB color space. The input is passed through a narrow part of the network in autoencoder fashion to improve the quality of features in the shared part of the network. This gives the actor and the critic good features in the bottom-most layers, with a compact representation of all information needed to reconstruct the depth-map and the image segmentations. These bottom-most layers would otherwise be difficult to train since the network is deep and the loss is noisy due to the imprecise target values computed using the RL algorithm. The image segmentation for the target ensures the network pays attention to what the target is. Otherwise, it would be difficult for the network to take the target input into account at the beginning of the training.
For each visual navigation auxiliary task, there is a network attached to the last convolutional layer consisting of deconvolutional layers. The network architecture for the observation image segmentation and the target image segmentation can be seen in Fig. 4. For the depth-map prediction the structure of the network is the same, but the intermediate deconvolutional layer has only 32 filters, and the last layer has a single channel. The true features (the image segmentations for observation and the target and the depth map) are downsampled to a smaller size. The MSE is computed between the outputs of the networks and the true features.
The additional auxiliary tasks for visual navigation also allow for the use of supervised learning to initialize the network with good features in the bottom-most part of the network since these are the least dependent on the policy. It is costly to render a 3D scene, but it is cheap to pre-compute a data set of observations taken from the scene and use it for supervised training.
III-E Environment Complexity
The training of the agent might be hard, especially when the environment is large and the initial state is far from the target. To make the task easier for the agent, we first sample the initial states closer to the target and gradually increase the distance between the initial state and the target. This is controlled by a parameter called the environment complexity. We define the maximal sampling distance of an environment as follows:
where the distance function measures the distance between any two states of the given environment. Any distance measure can be used, e.g., the Euclidean distance between the corresponding agent positions in the environment.
The initial state is sampled from a uniform probability distribution over the set of possible initial states that are closer to any target than the maximal sampling distance:
where the targets are taken from the set of target states. The environment complexity starts at a low value and gradually increases during the training.
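The sampling scheme of this section can be sketched as follows. Because the defining equation is not recoverable from the text, the form of the maximal sampling distance used here (the complexity multiplied by the largest distance from any initial state to its nearest target) is an assumption of this sketch, as are all function names and the use of planar positions as states.

```python
import math
import random
from typing import List, Sequence, Tuple

Position = Tuple[float, float]


def nearest_target_distance(state: Position,
                            targets: Sequence[Position]) -> float:
    """Euclidean distance from a state to its closest target."""
    return min(math.dist(state, target) for target in targets)


def max_sampling_distance(complexity: float,
                          states: Sequence[Position],
                          targets: Sequence[Position]) -> float:
    """Assumed form: complexity times the largest state-to-target distance."""
    farthest = max(nearest_target_distance(s, targets) for s in states)
    return complexity * farthest


def sample_initial_state(complexity: float,
                         states: Sequence[Position],
                         targets: Sequence[Position],
                         rng: random.Random) -> Position:
    """Sample uniformly among initial states close enough to some target."""
    limit = max_sampling_distance(complexity, states, targets)
    allowed: List[Position] = [
        s for s in states if nearest_target_distance(s, targets) <= limit
    ]
    if not allowed:
        raise ValueError("no initial state within the sampling distance")
    return rng.choice(allowed)
```

Under this assumed definition, a complexity of 1 makes every initial state eligible, while smaller values restrict sampling to states near a target.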
We have experimentally evaluated the performance of our method A2CAT-VN, using the average episode length and the average undiscounted episode return as performance metrics. The averages are Monte Carlo estimates computed from 100 rollouts. The randomness comes from the initial state, the non-deterministic behavior of the environment, and the stochasticity of the actor.
We have employed three different 3D environment simulators suitable for visual navigation tasks.
1) DeepMind Lab [1] is a 3D framework which allows an agent to move and collect objects in synthetic environments. It is fast and highly optimized for training AI agents, and the set of allowed actions is customizable. Fig. 5 shows examples of images from this environment. We used it to compare the proposed algorithm with alternatives from the literature and to pre-train the agent’s network for other environments, which sped up the training process.
2) AI2-THOR [8] is a photo-realistic interactive framework with high-quality indoor images (see Fig. 6). Most of the environments are a single room and are dynamic, i.e., at the beginning of the episode, various objects can be placed at random positions. The agent moves on a grid: an action moves the agent to a neighboring point on the grid or rotates the agent by a fixed angle. This does not allow for a good generalization since the agent can memorize the finite (and small) number of observations it can receive. Therefore, we have modified the implementation of the AI2-THOR 3D simulator to use continuous space. We have extended the set of possible actions by adding a rotation by an arbitrary angle and a movement by an arbitrary distance. We have also implemented the physics of collisions.
3) House3D with SUNCG [21] is a 3D framework that allows the use of the environments from the SUNCG dataset [14]. The SUNCG dataset consists of over 45 000 indoor environments, most of them being two-storey houses and studios. House3D is highly optimized for training AI agents and runs fast on GPUs. Apart from RGB output rendering, it also supports depth-map and image segmentation rendering. Illustrative images from this environment are shown in Fig. 7. The set of actions can be customized in a similar way as in the DeepMind Lab environment.
IV-B Action Space
In each of our experiments, we used actions from the following set: forward, backward, left, right, rotate-left, rotate-right, tilt-up, tilt-down. The forward and backward actions move the agent in the direction it is currently facing. The left and right actions move the agent in the directions perpendicular to the direction it is facing. The rotate-left and rotate-right actions rotate the agent by a fixed angle counter-clockwise and clockwise, respectively (one experiment uses a different angle), and the tilt-up and tilt-down actions tilt the agent’s camera up or down by a fixed angle.
In real-world environments, the actuators would rarely be able to move the agent precisely. To simulate such a setting, Gaussian noise is added to the position and rotation of the agent after taking an action. More specifically, the agent’s final position, horizontal rotation, and camera tilt are obtained by adding Gaussian noise terms to the values reached after taking the action.
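The noise model can be illustrated with a short sketch. The pose layout (planar position, horizontal rotation in degrees, camera tilt), the function name, and the standard-deviation parameters are all assumptions made for illustration; the noise magnitudes used in the experiments are not recoverable from the text, and zero-mean, independent noise terms are assumed here.

```python
import random
from typing import Tuple

Pose = Tuple[float, float, float, float]  # x, y, rotation, camera tilt


def add_actuation_noise(pose: Pose, position_std: float,
                        rotation_std: float, tilt_std: float,
                        rng: random.Random) -> Pose:
    """Perturb the pose reached after an action with Gaussian noise.

    Zero-mean, independent noise terms are an assumption of this sketch.
    """
    x, y, rotation, tilt = pose
    return (
        x + rng.gauss(0.0, position_std),
        y + rng.gauss(0.0, position_std),
        # Keep the horizontal rotation within [0, 360) degrees.
        (rotation + rng.gauss(0.0, rotation_std)) % 360.0,
        tilt + rng.gauss(0.0, tilt_std),
    )
```

Passing a seeded `random.Random` instance keeps the perturbations reproducible across evaluation runs, and setting all standard deviations to zero recovers the noise-free setting used in some of the experiments.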
The reward can be assigned to the agent using different schemes. In our work, we give the agent a reward of one if it reaches the target and zero otherwise. In the training phase, we compute the total gradient as the weighted sum of all the partial gradients: the actor, the critic, the entropy loss, the off-policy critic, and the auxiliary tasks. The gradient is clipped so that its norm does not exceed the maximum gradient norm given in Table I, and the RMSprop optimizer is used to optimize the weights. In all experiments, we used two Tesla K40 GPUs (10GB each): one GPU was dedicated to the environments and the other one to the agent. The parameters used in our method are given in Table I; some of them depend on the number of frames processed so far and on the maximum number of frames to be processed during training. Some parameters were chosen to be the same as in [7, 22], others were tuned experimentally.
Table I: Parameters used in our method (empty cells mark values lost in extraction).
|parameter|value|
|discount factor||
|maximum episode length|900|
|maximum rollout length|20|
|maximum number of frames||
|number of environment instances|16|
|replay buffer size||
|max. gradient norm|0.5|
|entropy gradient weight|0.001|
|off-policy critic weight|1.0|
|pixel control weight|0.05|
|reward prediction weight|1.0|
|depth-map prediction weight|0.1|
|observation image segmentation prediction weight|0.1|
|target segmentation prediction weight|0.1|
|pixel control discount factor|0.9|
|pixel control downsize factor|4|
|auxiliary VN downsize factor|4|
|pre-training total epochs||
|pre-training dataset size||
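The weighted gradient sum and the norm clipping described before Table I can be sketched as follows. This is an illustration only: the dictionary keys are hypothetical names for the loss terms, gradients are represented as flat lists, terms without a listed weight (the actor and the critic) are assumed to have weight 1, and the Euclidean norm is assumed because the norm type is not recoverable from the text.

```python
import math
from typing import Dict, List

Gradient = List[float]  # flattened gradient of one loss term


def total_gradient(partials: Dict[str, Gradient],
                   weights: Dict[str, float]) -> Gradient:
    """Weighted sum of the partial gradients; missing weights default to 1."""
    size = len(next(iter(partials.values())))
    total = [0.0] * size
    for name, gradient in partials.items():
        weight = weights.get(name, 1.0)
        for i, value in enumerate(gradient):
            total[i] += weight * value
    return total


def clip_by_norm(gradient: Gradient, max_norm: float) -> Gradient:
    """Rescale the gradient so its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(value * value for value in gradient))
    if norm <= max_norm:
        return list(gradient)
    scale = max_norm / norm
    return [value * scale for value in gradient]


# Weights copied from Table I; the dictionary keys are illustrative names.
TABLE_I_WEIGHTS = {
    "entropy": 0.001,
    "off_policy_critic": 1.0,
    "pixel_control": 0.05,
    "reward_prediction": 1.0,
    "depth_map": 0.1,
    "observation_segmentation": 0.1,
    "target_segmentation": 0.1,
}
```

A training step in this sketch would call `clip_by_norm(total_gradient(partials, TABLE_I_WEIGHTS), 0.5)` and hand the result to the optimizer.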
IV-D Partial Observability
We compared two different approaches to resolve the partial observability problem. One approach, used by [23, 12], concatenates the past four frames as the input to the agent. The other approach [7] uses the LSTM network [6]. We tested both methods on the DeepMind Lab environment because of its great speed and relative simplicity. The allowed actions were forward, backward, left, right, rotate-left, and rotate-right. We did not use any noise, and the forward, backward, left, and right actions moved the agent by a fixed distance. The input to the agent was a single RGB image. The network structure, based on [7], was similar in both cases, except in the frame concatenation version, where the LSTM was replaced by a linear layer. Both networks used the UNREAL auxiliary tasks [7]. The algorithms were trained on the DeepMind Lab SeekAvoid environment. The results can be seen in Fig. 8. The experiment clearly shows that the LSTM outperforms the frame concatenation method.
We have trained our algorithm on four environments from the AI2-THOR simulator with multiple targets. We have used the same set of actions as [23]: rotate-left, rotate-right, forward, and backward. The forward and backward actions moved the agent in the direction it was facing by one of two fixed distances. The rotate-left and rotate-right actions rotated the agent by a fixed angle. No noise was applied. This allowed us to compare our method to [23] and also to cache the observations, since it turned the problem into an instance of a grid world. We used 16 environments in parallel for our algorithm, each using a different scene or a different target. We did not use any pre-training, nor did we increase the environment complexity. The environments we chose for this experiment were bigger and more difficult to navigate than those used in [23], but came from the same AI2-THOR simulator. The training with our algorithm took roughly one day, while it took three days to train the network using the algorithm described in [23]. The results can be seen in Fig. 9. Our method A2CAT-VN found the optimal solution, whereas the method described in [23] was not able to converge even with a longer training run.
IV-F Continuous AI2-THOR
We trained our agent on the modified version of the AI2-THOR environment, using the full set of actions as described in Section IV-B: forward, backward, left, right, rotate-left, rotate-right, tilt-up, tilt-down. The forward, backward, left, and right actions moved the agent by fixed distances. Due to performance issues, however, the noise was only applied in the direction of the movement and no noise was applied in the case of the tilt-up and tilt-down actions. The agent was trained on a single bedroom scene with multiple targets specified by images. We used 16 environments in parallel, each having a different target. The target object was placed at random positions in the environments and the agent was trained to get within 1 meter of the target object. We did not use pre-training, nor did we increase the environment complexity. The results can be seen in Fig. 10. The training took 4 days. The AI2-THOR 3D environment simulator was too slow for further experiments. The results show the ability of the agent to navigate in non-stationary environments and to recognize different objects in the scene.
IV-G Auxiliary Tasks
We compared our method (A2CAT-VN) with the batched A2C extended with the original two UNREAL auxiliary tasks. A single agent was trained on 16 houses chosen randomly from a subset of the SUNCG dataset [14] using the House3D environment simulator. We used the same actions as those described in Section IV-F except for the tilt-up and tilt-down actions. Inspired by [21], the agent was trained to find a selected room in the house. The room was given to the agent in the form of an observation taken in a room of the same type. For example, if the target room is a bedroom, the agent is supposed to find any bedroom. We pre-trained our neural network using the data collected from a subset of all houses from the SUNCG dataset, and we trained our network for 30 epochs using the Adam optimizer. For the full training, we linearly increased the environment complexity from 0.3 at time step 5M to 1.0 at time step 10M. The training took roughly two days. The training curves for the average episode length can be seen in Fig. 11. Our algorithm A2CAT-VN converged much faster with the additional auxiliary tasks for visual navigation enabled, reaching the average episode length of 200 in fewer frames than the training without the additional tasks needed to reach the same level.
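The linear complexity schedule used in this experiment can be written as a short function. The function name is hypothetical, and holding the complexity constant before the start frame and after the end frame is an assumption of this sketch; the text only specifies the two endpoints.

```python
def linear_complexity(frame: int, start_frame: int = 5_000_000,
                      end_frame: int = 10_000_000,
                      start_value: float = 0.3,
                      end_value: float = 1.0) -> float:
    """Linearly interpolate the environment complexity over training."""
    if frame <= start_frame:
        # Assumption: the complexity is held at its initial value early on.
        return start_value
    if frame >= end_frame:
        return end_value
    progress = (frame - start_frame) / (end_frame - start_frame)
    return start_value + progress * (end_value - start_value)
```

Halfway between the two endpoints, at 7.5M frames, this schedule yields a complexity of about 0.65.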
V Conclusions & Future Work
We have proposed a novel learning architecture A2CAT-VN for visual navigation in indoor environments. It is based on a compact deep neural network capable of fast learning over multiple realistic environments, using the batched A2C algorithm extended with novel auxiliary tasks. By using the target image as an input, our method enables the agent to locate arbitrary goals, as long as their images have been used during the training phase.
The method was demonstrated on the AI2-THOR and House3D environments. First, we have shown that the basic batched A2C algorithm benefits from the addition of the UNREAL auxiliary tasks [7]. A further performance gain was achieved by employing additional auxiliary tasks designed specifically for visual navigation.
When applied to the AI2-THOR environment, our method was able to converge at least an order of magnitude faster than an alternative state-of-the-art method [23], which also allowed for the use of multiple targets and was demonstrated in indoor environments, similarly to our method. The auxiliary tasks introduced were shown to reduce the number of frames needed to train the agent by a factor of two, and they allowed supervised learning to be used to pre-train a part of the network.
Future research can focus on the potential effect of using supervised pre-training of additional auxiliary tasks for visual navigation on the training performance as well as the effects of individual additional auxiliary tasks through an ablation study. We would also like to explore the application of our method to more 3D environments (perhaps outdoor environments) and potentially apply it to mobile robots moving in real-world environments. Another line of research needs to be conducted on the ability of the method to generalize to unseen targets. In addition, we believe the ability of the agent to deal with unseen environments might outline an important area for future research.
[1] (2016) DeepMind lab.
[2] (2017) One-shot reinforcement learning for robot navigation with interactive replay.
[3] (2012) A survey of actor-critic reinforcement learning: standard and natural policy gradients. IEEE Transactions on Systems, Man, and Cybernetics. Part C: Applications and Reviews 42 (6), pp. 1291–1307.
[4] Incremental learning of evaluation functions for absorbing markov chains: new methods and theorems. preprint.
[5] (2016) Deep residual learning for image recognition. pp. 770–778.
[6] (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
[7] (2016) Reinforcement learning with unsupervised auxiliary tasks.
[8] (2017-12) AI2-THOR: An Interactive 3D Environment for Visual AI. arXiv.
[9] (2010-07) Deep auto-encoder neural networks in reinforcement learning. In The 2010 International Joint Conference on Neural Networks (IJCNN), pp. 1–8.
[10] (2016) Learning to navigate in complex environments.
[11] Asynchronous methods for deep reinforcement learning. International conference on machine learning, pp. 1928–1937.
[12] (2013) Playing atari with deep reinforcement learning.
[13] (2016-12) Learning state representation for deep actor-critic control.
[14] (2017) Semantic scene completion from a single depth image. Proceedings of 29th IEEE Conference on Computer Vision and Pattern Recognition.
[15] (2014) Striving for simplicity: the all convolutional net.
[16] (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063.
[17] (2017) Introduction to reinforcement learning. 2nd edition, MIT Press, Cambridge, MA, USA.
[18] (2015) Dueling network architectures for deep reinforcement learning.
[19] (1992) Q-learning. In Machine Learning, pp. 279–292.
[20] (1992-05-01) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3), pp. 229–256.
[21] (2018) Building generalizable agents with a realistic and rich 3d environment.
[22] (2017) Scalable trust-region method for deep reinforcement learning using kronecker-factored approximation. CoRR abs/1708.05144.
[23] (2017-05) Target-driven visual navigation in indoor scenes using deep reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 3357–3364.